Discovering Compound and Proper Nouns

1
Discovering Compound and Proper Nouns
Grzegorz Protaziuk (1), Marzena Kryszkiewicz (1),
Henryk Rybinski (1), Alexandre Delteil (2)
(1) Warsaw University of Technology, (2) France Telecom R&D
gprotazi,mkr,hrb_at_ii.pw.edu.pl,
alexandre.delteil_at_orange-ft.com
2
Plan of the presentation
  • Definitions
  • Grammatical patterns
  • Method
  • Experiments

3
Multiword expressions
  • Compound nouns: represent notions and have a rigid
    syntactic structure, e.g. information
    retrieval, Warsaw University of Technology.
  • Idioms: expressions whose meaning can almost
    never be derived from the meanings of the words
    constituting them (there is little
    possibility of modifying the syntax).
  • Collocations: this class consists of associated
    words, i.e. words that frequently co-occur in
    text (the syntactic structure is not rigid).

4
Definitions (1)
  • Compound noun or multiword term: a sequence of at
    least 2 words used as a name (indicator) of one
    notion or one thing.
  • Examples:
    • Warsaw University of Technology
    • object oriented programming
    • frequent pattern
    • association rule

5
Definitions (2)
  • Distance between the words in a sequence
  • The distance between two words w1 and w2
    included in a sequence ws, denoted as
    distance_ws(w1, w2), is defined as the number of
    words in the considered sequence between those
    words.
  • Example
  • Given the sentence "It is very interesting
    problem", the distance between the word "It" and
    the word "interesting" is equal to 2, and the
    distance between "is" and "very" is equal to 0.
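
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `distance` and the choice to locate each word by its first occurrence are assumptions, not part of the presentation.

```python
def distance(sequence, w1, w2):
    """Number of words strictly between w1 and w2 in `sequence`.

    Assumption: each word is located by its first occurrence.
    """
    i, j = sequence.index(w1), sequence.index(w2)
    if i == j:
        raise ValueError("the two words must occupy different positions")
    return abs(j - i) - 1


words = "It is very interesting problem".split()
print(distance(words, "It", "interesting"))  # 2
print(distance(words, "is", "very"))         # 0
```

Adjacent words therefore have distance 0, which is why a gap limit of 0 in a grammatical pattern means "consecutive words".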

6
Grammatical patterns (1)
  • Formally, a grammatical pattern is a pair
    (P, G), where P = <POS_1, POS_2, ..., POS_n>, where
    POS_i = {pos_1, ..., pos_n} denotes a non-empty set of
    valid parts of speech at the i-th position,
  • and G = (max_gap_{1,2}, max_gap_{2,3}, ..., max_gap_{n-1,n}),
    where max_gap_{i,i+1} is the maximal distance
    between the successive words in the sentence
    which are used to fulfill the grammatical
    pattern at the positions i and i+1.

7
Grammatical patterns (2)
  • When does a sentence meet the grammatical pattern?
  • A sentence snt = <w_1, w_2, ..., w_m> supports the
    grammatical pattern gp = (P, G), P = <POS_1,
    POS_2, ..., POS_n>, G = (max_gap_{1,2}, max_gap_{2,3}, ...,
    max_gap_{n-1,n}), if there exists a sequence of integers
    (i_1, i_2, ..., i_n), i_1 < i_2 < ... < i_n, such that
    pos(w_{i_1}) ∈ POS_1, pos(w_{i_2}) ∈ POS_2, ..., pos(w_{i_n}) ∈
    POS_n, where pos(w) denotes the part of speech to
    which the word w belongs in the context of the
    sentence snt,
  • and distance_snt(w_{i_j}, w_{i_{j+1}}) ≤ max_gap_{j,j+1} for 1 ≤
    j ≤ n-1.
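
The support condition can be checked with a small backtracking search. The sketch below assumes the sentence has already been POS-tagged into (word, tag) pairs. The class name `GrammaticalPattern`, the function `supports`, the tag names, and the sample sentences are all illustrative, not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrammaticalPattern:
    pos_sets: tuple  # P: one set of allowed POS tags per position
    max_gaps: tuple  # G: max_gaps[j] bounds the distance between positions j and j+1


def supports(tagged_sentence, pattern):
    """Return True if some increasing index sequence satisfies both P and G."""
    n = len(pattern.pos_sets)

    def extend(prev_index, position):
        if position == n:
            return True
        if position == 0:
            candidates = range(len(tagged_sentence))
        else:
            # distance = idx - prev_index - 1 must not exceed the allowed gap
            max_gap = pattern.max_gaps[position - 1]
            stop = min(len(tagged_sentence), prev_index + max_gap + 2)
            candidates = range(prev_index + 1, stop)
        for idx in candidates:
            _, pos = tagged_sentence[idx]
            if pos in pattern.pos_sets[position] and extend(idx, position + 1):
                return True
        return False

    return extend(-1, 0)


# Hypothetical tagged sentences (tags assigned by hand for illustration).
bank_of_japan = [("Bank", "noun"), ("of", "preposition"), ("Japan", "noun")]
rate_of_percent = [("rate", "noun"), ("of", "preposition"),
                   ("five", "numeral"), ("percent", "noun")]

P = ({"noun"}, {"preposition"}, {"noun"})
gp_no_gap = GrammaticalPattern(P, (0, 0))
gp_gap_one = GrammaticalPattern(P, (0, 1))

print(supports(bank_of_japan, gp_no_gap))     # True
print(supports(rate_of_percent, gp_no_gap))   # False
print(supports(rate_of_percent, gp_gap_one))  # True
```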

8
Grammatical patterns (3)
  • Example
  • A sentence: In Warsaw, in the buildings belonging to
    University there are auditoria with audio-visual
    equipment of advanced technology.
  • A pattern: <(noun, noun, preposition,
    noun), (0, 0, 0)>
  • Search for the following proper noun: Warsaw
    University of Technology.
  • We select the following words: Warsaw,
    University, of, Technology.
  • These words do not fulfill the pattern.
  • The distance constraints are not fulfilled, as
    e.g. distance(Warsaw, University) = 5 >
    max_gap_{1,2} = 0.

9
Assumptions (1)
  • The words that build multiword terms or multiword
    proper nouns generally occur one after another in texts.
  • In most cases multiword terms consist of
    nouns, prepositions, and adjectives.
  • Many proper nouns consist only of nouns and a
    preposition.

10
Assumptions (2)
  • Multiword term: a sequence of words belonging to
    the following parts of speech: adjective, noun,
    preposition, which occurs in a given corpus in at
    least minSup text units.
  • The minSup threshold is given explicitly by the
    user.
  • A text unit may be a document, a paragraph, or a
    sentence.

11
Problem statement
  • To discover frequent word sequences which meet
    the grammatical pattern, where frequent means
    that a sequence occurs in more text units than
    a given threshold.
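
The notion of "frequent" can be illustrated by counting, for every word sequence, the number of text units in which it occurs at least once. This is a minimal sketch: the function name `frequent_sequences` and the sample data are made up. It keeps sequences found in at least `min_sup` units, following the Assumptions (2) slide. Use `>` instead of `>=` if the threshold is meant to be exceeded.

```python
from collections import Counter


def frequent_sequences(text_units, min_sup):
    """Map each frequent word sequence to its support.

    `text_units` is an iterable in which every element lists the word
    sequences (tuples) already found in one text unit.
    """
    support = Counter()
    for sequences in text_units:
        support.update(set(sequences))  # count each sequence once per text unit
    return {seq: count for seq, count in support.items() if count >= min_sup}


# Hypothetical text units, each holding the sequences matched in it.
units = [
    [("United", "States"), ("interest", "rates"), ("United", "States")],
    [("United", "States")],
    [("interest", "rates"), ("Hong", "Kong")],
]
print(frequent_sequences(units, min_sup=2))
```

Repeated occurrences inside one text unit count once, so the choice of text unit (document, paragraph, or sentence) directly changes the support values.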

12
Input data
  • Documents written in English:
    • the Reuters documents (rather short)
      • granularity at paragraph level
      • granularity at document level
    • the data mining scientific papers (rather long)
      • granularity at paragraph level

13
Algorithm
  • T-GSP algorithm, developed from the GSP
    algorithm.
  • Main differences:
  • dealing with text
  • the application of grammatical patterns means that the
    monotonic property (if one subset of an itemset
    is not frequent, then the itemset itself cannot
    be frequent) does not hold w.r.t. word sequences.
  • Example
  • gp1 = ((noun, noun), (0)), gp2 =
    ((noun, preposition, noun), (0, 0))
  • sequence <University, Technology> is
    infrequent
  • sequence <University, of, Technology> is
    frequent
  • the prune step is done based on POS tags
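
The example on this slide can be reproduced on toy data. The sketch below is not the T-GSP algorithm. It only counts support for zero-gap patterns with a sliding window, and the helper names and the three-unit corpus are invented for illustration.

```python
def zero_gap_matches(tagged_unit, pos_sequence):
    """Word tuples formed by consecutive words whose tags equal pos_sequence."""
    n = len(pos_sequence)
    found = set()
    for start in range(len(tagged_unit) - n + 1):
        window = tagged_unit[start:start + n]
        if all(pos == wanted for (_, pos), wanted in zip(window, pos_sequence)):
            found.add(tuple(word for word, _ in window))
    return found


def support(tagged_units, pos_sequence, word_sequence):
    """Number of text units in which word_sequence matches the pattern."""
    return sum(word_sequence in zero_gap_matches(unit, pos_sequence)
               for unit in tagged_units)


# Hypothetical corpus: the same tagged phrase in three text units.
unit = [("University", "noun"), ("of", "preposition"), ("Technology", "noun")]
units = [unit] * 3

gp1 = ("noun", "noun")                 # all gaps are 0
gp2 = ("noun", "preposition", "noun")  # all gaps are 0

print(support(units, gp1, ("University", "Technology")))        # 0
print(support(units, gp2, ("University", "of", "Technology")))  # 3
```

The shorter word sequence has support 0 under gp1 because "of" always intervenes, while the longer one is found in every unit under gp2. A candidate therefore cannot be discarded just because one of its word subsequences is infrequent.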

14
Data patterns versus text patterns
15
Experiment 1
  • Finding multiword terms composed of nouns.
  • The patterns including only nouns:
  • gp1 = <noun, noun>, <0>
  • searching for two consecutive nouns, e.g.
    Newcastle United,
  • gp2 = <noun, noun, noun>, <0, 0>
  • searching for 3 consecutive nouns, e.g.
    information extraction system.

16
Experiment 1 - summary
  • E1.1: Input data: paragraphs from the Reuters
    documents, minSup = 7, number of all the
    discovered patterns: 563.
  • E1.2: Input data: documents from the Reuters
    repository, minSup = 5, number of all the
    discovered patterns: 744.
  • E1.3: Input data: paragraphs from the papers,
    minSup = 4, number of all the discovered
    patterns: 1406.

17
Experiment 1 - conclusion
  • The obtained results show that the method allows
    finding:
  • proper nouns, e.g. President Bill Clinton,
    Eastern Europe, and Columbia Pictures,
  • multiword terms, e.g. carbon dioxide, credit
    card, or information retrieval system.

18
Experiment 2
  • Finding multiword terms composed of two nouns and
    one preposition.
  • Grammatical patterns:
  • gp3 = <noun, preposition, noun>, <0, 0>,
  • gp4 = <noun, preposition, noun>, <0, 1>,
  • gp5 = <noun, preposition, noun>, <0, 2>.
  • Searching for phrases such as Institute of
    Research with three different specifications of
    the gap between the preposition and the noun.
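
To see how the three gap settings differ, the sketch below enumerates every word sequence in a text unit that matches a pattern. The function name `extract`, the tag names, and the sample phrase "rest of the paper" are assumptions made for illustration. The presentation does not show the underlying sentences.

```python
def extract(tagged_unit, pos_sequence, max_gaps):
    """Return the set of word tuples in `tagged_unit` that match the pattern."""
    results = set()

    def extend(chosen):
        position = len(chosen)
        if position == len(pos_sequence):
            results.add(tuple(tagged_unit[i][0] for i in chosen))
            return
        if position == 0:
            candidates = range(len(tagged_unit))
        else:
            last = chosen[-1]
            upper = last + max_gaps[position - 1] + 1  # largest index allowed
            candidates = range(last + 1, min(len(tagged_unit), upper + 1))
        for idx in candidates:
            if tagged_unit[idx][1] == pos_sequence[position]:
                extend(chosen + [idx])

    extend([])
    return results


# Hypothetical tagged text unit.
unit = [("rest", "noun"), ("of", "preposition"),
        ("the", "determiner"), ("paper", "noun")]
npn = ("noun", "preposition", "noun")

print(extract(unit, npn, (0, 0)))  # gp3: set()
print(extract(unit, npn, (0, 1)))  # gp4: {('rest', 'of', 'paper')}
print(extract(unit, npn, (0, 2)))  # gp5: {('rest', 'of', 'paper')}
```

Allowing a gap lets the pattern skip a word, so a larger gap can only add matches. The skipped word does not appear in the extracted sequence, which is one way a reported phrase can end up incomplete.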

19
Experiment 2 - summary
  • E2.1: Input data: paragraphs extracted from the
    Reuters documents, minSup = 7, number of all
    discovered patterns: 37 for gp3, 67 for
    gp4, 77 for gp5.
  • E2.2: Input data: the Reuters documents, minSup =
    5, number of all the discovered patterns: 61 for
    gp3, 99 for gp4, 137 for gp5.
  • E2.3: Input data: paragraphs extracted from the DM
    scientific papers, minSup = 4, number of all
    discovered patterns: 158 for gp3, 402 for
    gp4, 529 for gp5.

20
Experiment 2 - conclusion
  • The Reuters repository:
    • proper nouns, e.g. Bank of Japan, Union of
      Kurdistan, Republic of China,
    • common phrases, e.g. thousands of people,
      end of year, barrels of oil, and rate of
      percent.
  • The scientific DM papers:
    • proper nouns (or parts of them), e.g. Workshop
      on Logics, University of Maryland,
      Conference on Artificial,
    • multiword terms, e.g. structures of
      ontologies, specification of
      conceptualization.
  • Phrases discovered by applying the gp4 pattern
    but not by applying gp3 (such as
    rest of paper or Journal of Computer) may be
    incomplete.

21
Experiment 3
  • Finding multiword terms composed of three nouns
    and one preposition.
  • Grammatical patterns:
  • gp6 = <noun, noun, preposition, noun>, <1, 0, 1>,
  • gp7 = <noun, noun, preposition, noun>, <2, 0, 2>.
  • Searching for phrases such as Warsaw
    University of Research with two different
    specifications of the gaps between the words.

22
Experiment 3 - summary
  • E3.1: Input data: paragraphs extracted from the
    Reuters documents, minSup = 7, number of all the
    discovered patterns: 13 for gp6, 21 for
    gp7.
  • E3.2: Input data: the Reuters documents, minSup =
    5, number of all the discovered patterns: 20 for
    gp6, 33 for gp7.
  • E3.3: Input data: paragraphs extracted from DM
    papers, minSup = 5, number of all the discovered
    patterns: 57 for gp6, 101 for gp7.

23
Experiment 3 - conclusion
  • The Reuters documents:
    • mostly proper nouns, e.g. Daiwa Institute
      of Research, or Patriotic Union of Kurdistan,
    • multiword terms, e.g. box office during
      Friday.
  • The DM scientific papers:
    • mainly parts of names of:
      • conferences, e.g. International Conference on
        Learning,
      • publications, e.g. Proceedings Workshop on
        Ontology, Lecture Notes in Computer,
      • titles of the papers or some text units within
        the papers, e.g. Formal Ontology in Systems,
    • multiword terms, e.g. acquisition hyponyms
      from text, core system for german.

24
Further research
  • Methods of improving the process of selecting
    candidates for multiword terms and for proper
    nouns
  • Using the approach for discovering associations
    between words
  • Applying additional tags (e.g. categories) in
    the text analysis process

25
Experiment 3.1
  • Input data: paragraphs extracted from the Reuters
    documents, minimal support 7, number of all the
    discovered patterns: 13 for the gp6 pattern, 21
    for the gp7 pattern.
  • Selected results:
  • <Port, conditions, from, Shipping> (12, 12)
  • <Chicago, Board, of, Trade> (12, 12)
  • <Port, conditions, from, Lloyds> (12, 12)
  • <Pictures, unit, of, Seagram> (11, 11)
  • <Conseil, supérieur, de, audiovisuel> (8, 8)
  • <Pictures, unit, of, Viacom> (8, 8)
  • <Pictures, unit, of, Inc> (8, 8)

26
Experiment 3.2
  • Input data: the Reuters documents, minimal
    support 5, number of all the discovered
    patterns: 20 for the gp6 pattern, 33 for the gp7
    pattern.
  • Selected results:
  • <Port, conditions, from, Shipping> (12, 12)
  • <Port, conditions, from, Lloyds> (12, 12)
  • <Chicago, Board, of, Trade> (12, 12)
  • <Pictures, unit, of, Seagram> (11, 11)
  • <Conseil, Supérieur, de, Audiovisuel> (8, 8)
  • <Pictures, unit, of, Viacom> (8, 8)
  • <Pictures, unit, of, Inc> (8, 8)

27
Experiment 3.3
  • Input data: paragraphs extracted from DM papers,
    minimal support 5, number of all the discovered
    patterns: 57 for the gp6 pattern, 101 for the gp7
    pattern.
  • Selected results:
  • <Ontology, Infrastructure, for, Semantic> (68, 68)
  • <WonderWeb, Infrastructure, for, Semantic> (58, 58)
  • <Class, rdf, about, owl> (17, 17)
  • <owl, rdf, about, rdfs> (16, 16)
  • <National, Conference, on, Artificial> (16, 16)
  • <National, Conference, on, Intelligence> (15, 16)
  • <International, Journal, of, Computer> (14, 14)
  • <Institute, University, of, Karlsruhe> (12, 12)
  • <International, Journal, of, Human-> (11, 11)
28
Experiment 1.1
  • Input data: paragraphs from the Reuters
    documents, minimal support 7, number of all the
    discovered patterns: 563.
  • Selected results:
  • <United, States> 263
  • <Hong, Kong> 155
  • <Prime, Minister> 143
  • <New, York> 113
  • <interest, rates> 76
  • <South, Africa> 69
  • <Walt, Disney> 20
  • <President, Bill> 33
  • <President, Bill, Clinton> 33

29
Experiment 1.2
  • Input data: documents from the Reuters
    repository, minimal support 5, number of all the
    discovered patterns: 744.
  • Selected results:
  • <United, States> 151
  • <Prime, Minister> 114
  • <news, conference> 86
  • <New, York> 82
  • <Hong, Kong> 57
  • <interest, rates> 52
  • <President, Bill> 33
  • <President, Bill, Clinton> 33

30
Experiment 1.3
  • Input data: paragraphs from the papers, minimal
    support 4, number of all the discovered patterns:
    1406.
  • Selected results:
  • <Semantic, Web> 186
  • <http, www> 119
  • <Artificial, Intelligence> 89
  • <owl, Class> 75
  • <International, Conference> 73
  • <rdf, resource> 68
  • <Ontology, Infrastructure> 68
  • <knowledge, base> 68
  • <Project, WonderWeb> 67
  • <Machine, Learning> 50
  • <con-, cepts> 29

31
Experiment 2.1
  • Input data: paragraphs extracted from the Reuters
    documents, minimal support 7, number of all
    discovered patterns: 37 for the gp3 pattern, 67
    for the gp4 pattern, 77 for the gp5 pattern.
  • Selected results:
  • <press, on, Tuesday> (34, 34, 34)
  • <Board, of, Trade> (13, 13, 13)
  • <rate, of, percent> (12, 14, 17)
  • <millions, of, dollars> (12, 13, 13)
  • <unit, of, Seagram> (12, 12, 12)
  • <Union, of, Kurdistan> (7, 7, 7)
  • <percent, in, year> (7)

32
Experiment 2.2
  • Input data: the Reuters documents, minimal
    support 5, number of all the discovered
    patterns: 61 for the gp3 pattern, 99 for the gp4
    pattern, 137 for the gp5 pattern.
  • Selected results:
  • <press, on, Tuesday> (34, 34, 34)
  • <Board, of, Trade> (12, 12, 12)
  • <unit, of, Seagram> (12, 12, 12)
  • <millions, of, dollars> (11, 12, 12)
  • <conditions, from, Lloyds> (12, 12, 12)
  • <rate, of, percent> (11, 13, 16)
  • <newspapers, on, Tuesday> (11, 11, 11)
  • <Union, of, Kurdistan> (7, 7, 7)

33
Experiment 2.3
  • Input data: paragraphs extracted from the DM
    scientific papers, minimal support 4, number of
    all discovered patterns: 158 for the gp3 pattern,
    402 for the gp4 pattern, 529 for the gp5 pattern.
  • Selected results for the pattern gp3:
  • <Conference, on, Artificial> 35
  • <point, of, view> 23
  • <University, of, Karlsruhe> 19
  • <Department, of, Computer> 14
  • <Workshop, on, Ontologies> 14
  • <domain, of, interest> 14
  • <Conference, on, Knowledge> 13
  • <Applications, of, Ontologies> 7
  • <University, of, Manchester> 7
  • <instances, of, concepts> 6
  • <corpus, of, texts> 6
  • <Management, of, Data> 6
  • <level, of, abstraction> 6
  • <Conference, on, Machine> 6
