Title: Discovering Compound and Proper Nouns
1Discovering Compound and Proper Nouns
Grzegorz Protaziuk (1), Marzena Kryszkiewicz (1), Henryk Rybinski (1), Alexandre Delteil (2)
(1) Warsaw University of Technology, (2) France Telecom R&D
{gprotazi,mkr,hrb}_at_ii.pw.edu.pl, alexandre.delteil_at_orange-ft.com
2Plan of the presentation
- Definitions
- Grammatical patterns
- Method
- Experiments
3Multiword expressions
- Compound nouns: represent notions and have a rigid syntactic structure, e.g. information retrieval, Warsaw University of Technology.
- Idioms: expressions whose meaning can almost never be derived from the meanings of the words constituting them (little possibility of modifying the syntax).
- Collocations: this class consists of associated words, i.e. words that frequently co-occur in text (the syntactic structure is not rigid).
4Definitions (1)
- Compound noun or multiword term: a sequence of at least 2 words used as a name (indicator) of one notion or one thing.
- Examples
  - Warsaw University of Technology
  - object oriented programming
  - frequent pattern
  - association rule
5Definitions (2)
- Distance between the words in a sequence
  - The distance between two words $w_1$ and $w_2$ included in a sequence $ws$, denoted as $distance_{ws}(w_1, w_2)$, is defined as the number of words in the considered sequence between those words.
- Example
  - Given the sentence "It is very interesting problem", the distance between the word "It" and the word "interesting" is equal to 2, and the distance between "is" and "very" is equal to 0.
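The distance definition can be illustrated with a minimal sketch. The function name `distance` and the use of first occurrences are assumptions made for illustration; the slides do not prescribe an implementation.

```python
def distance(sequence, w1, w2):
    """Number of words strictly between the first occurrences of w1 and w2.

    Assumes w1 and w2 are two different words that both occur in `sequence`.
    """
    i, j = sequence.index(w1), sequence.index(w2)
    return abs(i - j) - 1


sentence = "It is very interesting problem".split()
print(distance(sentence, "It", "interesting"))  # 2
print(distance(sentence, "is", "very"))         # 0
```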
6Grammatical patterns (1)
- Formally, a grammatical pattern is a pair $\langle P, G \rangle$, where $P = \langle POS_1, POS_2, \ldots, POS_n \rangle$ and $POS_i = \{pos_1, \ldots, pos_n\}$ denotes a non-empty set of valid parts of speech at the $i$-th position,
- and $G = \langle max\_gap_{1,2}, max\_gap_{2,3}, \ldots, max\_gap_{n-1,n} \rangle$, where $max\_gap_{i,i+1}$ is the maximal distance between the successive words in the sentence which are used to fulfill the grammatical pattern at the positions $i$ and $i+1$.
7Grammatical patterns (2)
- When a sentence meets the grammatical pattern:
- A sentence $snt = \langle w_1, w_2, \ldots, w_m \rangle$ supports the grammatical pattern $gp = \langle P, G \rangle$, $P = \langle POS_1, POS_2, \ldots, POS_n \rangle$, $G = \langle max\_gap_{1,2}, max\_gap_{2,3}, \ldots, max\_gap_{n-1,n} \rangle$, if there exists a sequence of integers $(i_1, i_2, \ldots, i_n)$, $i_1 < i_2 < \ldots < i_n$, such that $pos(w_{i_1}) \in POS_1$, $pos(w_{i_2}) \in POS_2$, $\ldots$, $pos(w_{i_n}) \in POS_n$, where $pos(w)$ denotes the part of speech to which the word $w$ belongs in the context of the sentence $snt$,
- and $distance_{snt}(w_{i_j}, w_{i_{j+1}}) \le max\_gap_{j,j+1}$ for $1 \le j \le n-1$.
8Grammatical patterns (3)
- Example
- A sentence: "In Warsaw, in the buildings belonging to University there are auditoria with audio-visual equipment of advanced technology."
- A pattern: <(noun, noun, preposition, noun), (0, 0, 0)>
- Search for the following proper noun: Warsaw University of Technology
- We select the following words: Warsaw, University, of, technology.
- These words do not fulfill the pattern.
- The distance constraints are not fulfilled, as e.g. $distance(\text{Warsaw}, \text{University}) = 5 > max\_gap_{1,2} = 0$.
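The support check described on the previous slides can be sketched as a small backtracking search. This is only an illustration under stated assumptions: the function name `supports`, the data layout, and the hand-assigned POS tags (with punctuation dropped) are not taken from the presentation.

```python
def supports(tagged_sentence, pos_sets, max_gaps):
    """Return True if some positions i1 < ... < in fulfil the pattern.

    tagged_sentence: list of (word, pos) pairs
    pos_sets: one set of allowed POS tags per pattern position
    max_gaps: maximal distance between successive matched words
    """
    def extend(prev_index, k):
        if k == len(pos_sets):
            return True
        if k == 0:
            candidates = range(len(tagged_sentence))
        else:
            # distance = index difference - 1, so the next word may sit
            # at most max_gaps[k - 1] + 1 positions after the previous one
            stop = min(len(tagged_sentence), prev_index + max_gaps[k - 1] + 2)
            candidates = range(prev_index + 1, stop)
        return any(
            tagged_sentence[i][1] in pos_sets[k] and extend(i, k + 1)
            for i in candidates
        )

    return extend(-1, 0)


# Hand-tagged version of the example sentence (tags are assumptions).
sentence = [
    ("In", "preposition"), ("Warsaw", "noun"), ("in", "preposition"),
    ("the", "determiner"), ("buildings", "noun"), ("belonging", "verb"),
    ("to", "preposition"), ("University", "noun"), ("there", "adverb"),
    ("are", "verb"), ("auditoria", "noun"), ("with", "preposition"),
    ("audio-visual", "adjective"), ("equipment", "noun"),
    ("of", "preposition"), ("advanced", "adjective"), ("technology", "noun"),
]
pattern = [{"noun"}, {"noun"}, {"preposition"}, {"noun"}]

print(supports(sentence, pattern, [0, 0, 0]))  # False
name = [("Warsaw", "noun"), ("University", "noun"),
        ("of", "preposition"), ("Technology", "noun")]
print(supports(name, pattern, [0, 0, 0]))      # True
```

With all gaps equal to 0 the sentence is rejected because no two nouns are adjacent, while the consecutive phrase is accepted.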
9Assumptions (1)
- The words forming multiword terms or multiword proper nouns generally occur one directly after another in texts.
- In most cases, multiword terms consist of nouns, prepositions, and adjectives.
- Many proper nouns consist of only nouns and a preposition.
10Assumptions (2)
- Multiword term: a sequence of words belonging to the following parts of speech: adjective, noun, preposition, which occurs in at least minSup text units of a given corpus.
- The minSup threshold is given explicitly by a user.
- A text unit may be a document, a paragraph, or a sentence.
11Problem statement
- To discover frequent word sequences that meet the grammatical pattern, where frequent means that a sequence occurs in more text units than a given threshold.
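The problem statement can be made concrete with a brute-force sketch that enumerates every word sequence fulfilling a pattern and counts the text units in which it occurs. This is not the T-GSP algorithm from the presentation; the names `matches` and `frequent_sequences`, the toy corpus, and its POS tags are assumptions, and the threshold is applied as "at least `min_sup` units", following the Assumptions slide.

```python
from collections import Counter


def matches(tagged_unit, pos_sets, max_gaps):
    """Yield every word tuple in the unit that fulfils the pattern."""
    def extend(prefix, prev_index):
        k = len(prefix)
        if k == len(pos_sets):
            yield tuple(prefix)
            return
        if k == 0:
            candidates = range(len(tagged_unit))
        else:
            stop = min(len(tagged_unit), prev_index + max_gaps[k - 1] + 2)
            candidates = range(prev_index + 1, stop)
        for i in candidates:
            word, pos = tagged_unit[i]
            if pos in pos_sets[k]:
                yield from extend(prefix + [word], i)

    yield from extend([], -1)


def frequent_sequences(units, pos_sets, max_gaps, min_sup):
    """Map each matching sequence to its support, keeping the frequent ones."""
    support = Counter()
    for unit in units:
        # a set, so that each text unit is counted at most once per sequence
        support.update(set(matches(unit, pos_sets, max_gaps)))
    return {seq: n for seq, n in support.items() if n >= min_sup}


# Toy corpus of three text units (words and tags are illustrative only).
units = [
    [("Bank", "noun"), ("of", "preposition"), ("Japan", "noun"), ("said", "verb")],
    [("the", "determiner"), ("Bank", "noun"), ("of", "preposition"), ("Japan", "noun")],
    [("barrels", "noun"), ("of", "preposition"), ("oil", "noun")],
]
noun_prep_noun = [{"noun"}, {"preposition"}, {"noun"}]
result = frequent_sequences(units, noun_prep_noun, [0, 0], min_sup=2)
print(result)  # {('Bank', 'of', 'Japan'): 2}
```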
12Input data
- Documents written in English
  - the Reuters documents (rather short)
    - granularity at paragraph level
    - granularity at document level
  - the data mining scientific papers (rather long)
    - granularity at paragraph level
13Algorithm
- T-GSP algorithm, developed from the GSP algorithm.
- Main differences
  - dealing with text
  - the application of grammatical patterns means that the monotonic property (if one subset of an itemset is not frequent, then the itemset itself cannot be frequent) does not hold for word sequences.
- Example
  - gp1 = <(noun, noun), (0)>, gp2 = <(noun, preposition, noun), (0, 0)>
  - the sequence <University, Technology> is infrequent
  - the sequence <University, of, Technology> is frequent
- the prune step is done based on POS tags
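The failure of the monotonic property can be reproduced on a toy corpus. The sketch below is only an illustration: the helper `ngram_support`, the three text units, and their POS tags are assumptions. Because both patterns in the example use gaps of 0, matching reduces to looking for consecutive tokens.

```python
def ngram_support(units, words, tags):
    """Count units containing `words` as consecutive tokens with the given tags."""
    target = list(zip(words, tags))
    n = len(target)
    return sum(
        any(unit[i:i + n] == target for i in range(len(unit) - n + 1))
        for unit in units
    )


# Toy text units (illustrative data, not from the experiments).
units = [
    [("Warsaw", "noun"), ("University", "noun"),
     ("of", "preposition"), ("Technology", "noun")],
    [("University", "noun"), ("of", "preposition"),
     ("Technology", "noun"), ("students", "noun")],
    [("the", "determiner"), ("University", "noun"),
     ("of", "preposition"), ("Technology", "noun")],
]

# gp1: two consecutive nouns; gp2: noun, preposition, noun, all consecutive.
sub = ngram_support(units, ["University", "Technology"], ["noun", "noun"])
full = ngram_support(units, ["University", "of", "Technology"],
                     ["noun", "preposition", "noun"])
print(sub, full)  # 0 3
```

The subsequence <University, Technology> never occurs as two adjacent nouns, yet the longer sequence appears in every unit, so a candidate cannot be pruned just because one of its subsequences is infrequent.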
14Data patterns versus text patterns
15Experiment 1
- Finding multiword terms composed of nouns.
- The patterns including only nouns
  - gp1 = <noun, noun>, <0>
    - searching for two consecutive nouns, e.g. Newcastle United,
  - gp2 = <noun, noun, noun>, <0, 0>
    - searching for 3 consecutive nouns, e.g. information extraction system.
16Experiment 1 - summary
- E1.1 Input data: paragraphs from the Reuters documents, minSup = 7, number of all the discovered patterns: 563.
- E1.2 Input data: documents from the Reuters repository, minSup = 5, number of all the discovered patterns: 744.
- E1.3 Input data: paragraphs from the papers, minSup = 4, number of all the discovered patterns: 1406.
17Experiment 1 - conclusion
- The obtained results show that the method allows finding
  - proper nouns, e.g. President Bill Clinton, Eastern Europe, and Columbia Pictures,
  - multiword terms, e.g. carbon dioxide, credit card, or information retrieval system.
18Experiment 2
- Finding multiword terms composed of two nouns and one preposition
- Grammatical patterns
  - gp3 = <noun, preposition, noun>, <0, 0>,
  - gp4 = <noun, preposition, noun>, <0, 1>,
  - gp5 = <noun, preposition, noun>, <0, 2>.
- Searching for phrases such as Institute of Research, with three different specifications of the gap between the preposition and the noun.
19Experiment 2 - summary
- E2.1 Input data: paragraphs extracted from the Reuters documents, minSup = 7, number of all discovered patterns: 37 for gp3, 67 for gp4, 77 for gp5.
- E2.2 Input data: the Reuters documents, minSup = 5, number of all the discovered patterns: 61 for gp3, 99 for gp4, 137 for gp5.
- E2.3 Input data: paragraphs extracted from the DM scientific papers, minSup = 4, number of all discovered patterns: 158 for gp3, 402 for gp4, 529 for gp5.
20Experiment 2 - conclusion
- The Reuters repository
  - proper nouns, e.g. Bank of Japan, Union of Kurdistan, Republic of China,
  - common phrases, e.g. thousands of people, end of year, barrels of oil, and rate of percent.
- The scientific DM papers
  - proper nouns (or parts of them), e.g. Workshop on Logics, University of Maryland, Conference on Artificial,
  - multiword terms, e.g. structures of ontologies, specification of conceptualization.
- Phrases discovered by applying the gp4 pattern but not by applying gp3 (such as rest of paper or Journal of Computer) may be incomplete.
21Experiment 3
- Finding multiword terms composed of three nouns and one preposition
- Grammatical patterns
  - gp6 = <noun, noun, preposition, noun>, <1, 0, 1>,
  - gp7 = <noun, noun, preposition, noun>, <2, 0, 2>.
- Searching for phrases such as Warsaw University of Research, with two different specifications of the gaps between the words.
22Experiment 3 - summary
- E3.1 Input data: paragraphs extracted from the Reuters documents, minSup = 7, number of all the discovered patterns: 13 for gp6, 21 for gp7.
- E3.2 Input data: the Reuters documents, minSup = 5, number of all the discovered patterns: 20 for gp6, 33 for gp7.
- E3.3 Input data: paragraphs extracted from DM papers, minSup = 5, number of all the discovered patterns: 57 for gp6, 101 for gp7.
23Experiment 3 - conclusion
- The Reuters documents
  - mostly proper nouns, e.g. Daiwa Institute of Research or Patriotic Union of Kurdistan,
  - multiword terms, e.g. box office during Friday.
- The DM scientific papers
  - the results mainly consist of parts of names of
    - conferences, e.g. International Conference on Learning,
    - publications, e.g. Proceedings Workshop on Ontology, Lecture Notes in Computer,
    - titles of the papers or some text units within the papers, e.g. Formal Ontology in Systems,
  - multiword terms, e.g. acquisition hyponyms from text, core system for german.
24Further research
- Improving the process of selecting candidates for multiword terms and for proper nouns
- Using the approach for discovering associations between words
- Applying additional tags (e.g. categories) in the text analysis process
25Experiment 3.1
- Input data: paragraphs extracted from the Reuters documents, minimal support = 7, number of all the discovered patterns: 13 for the gp6 pattern, 21 for the gp7 pattern.
- Selected results
  - <Port, conditions, from, Shipping> (12, 12)
  - <Chicago, Board, of, Trade> (12, 12)
  - <Port, conditions, from, Lloyds> (12, 12)
  - <Pictures, unit, of, Seagram> (11, 11)
  - <Conseil, supérieur, de, audiovisuel> (8, 8)
  - <Pictures, unit, of, Viacom> (8, 8)
  - <Pictures, unit, of, Inc> (8, 8)
26Experiment 3.2
- Input data: the Reuters documents, minimal support = 5, number of all the discovered patterns: 20 for the gp6 pattern, 33 for the gp7 pattern.
- Selected results
  - <Port, conditions, from, Shipping> (12, 12)
  - <Port, conditions, from, Lloyds> (12, 12)
  - <Chicago, Board, of, Trade> (12, 12)
  - <Pictures, unit, of, Seagram> (11, 11)
  - <Conseil, Supérieur, de, Audiovisuel> (8, 8)
  - <Pictures, unit, of, Viacom> (8, 8)
  - <Pictures, unit, of, Inc> (8, 8)
27Experiment 3.3
- Input data: paragraphs extracted from DM papers, minimal support = 5, number of all the discovered patterns: 57 for the gp6 pattern, 101 for the gp7 pattern.
- Selected results
  - <Ontology, Infrastructure, for, Semantic> (68, 68)
  - <WonderWeb, Infrastructure, for, Semantic> (58, 58)
  - <Class, rdf, about, owl> (17, 17)
  - <owl, rdf, about, rdfs> (16, 16)
  - <National, Conference, on, Artificial> (16, 16)
  - <National, Conference, on, Intelligence> (15, 16)
  - <International, Journal, of, Computer> (14, 14)
  - <Institute, University, of, Karlsruhe> (12, 12)
  - <International, Journal, of, Human-> (11, 11)
28Experiment 1.1
- Input data: paragraphs from the Reuters documents, minimal support = 7, number of all the discovered patterns: 563.
- Selected results
  - <United, States> 263
  - <Hong, Kong> 155
  - <Prime, Minister> 143
  - <New, York> 113
  - <interest, rates> 76
  - <South, Africa> 69
  - <Walt, Disney> 20
  - <President, Bill> 33
  - <President, Bill, Clinton> 33
29Experiment 1.2
- Input data: documents from the Reuters repository, minimal support = 5, number of all the discovered patterns: 744.
- Selected results
  - <United, States> 151
  - <Prime, Minister> 114
  - <news, conference> 86
  - <New, York> 82
  - <Hong, Kong> 57
  - <interest, rates> 52
  - <President, Bill> 33
  - <President, Bill, Clinton> 33
30Experiment 1.3
- Input data: paragraphs from the papers, minimal support = 4, number of all the discovered patterns: 1406.
- Selected results
  - <Semantic, Web> 186
  - <http, www> 119
  - <Artificial, Intelligence> 89
  - <owl, Class> 75
  - <International, Conference> 73
  - <rdf, resource> 68
  - <Ontology, Infrastructure> 68
  - <knowledge, base> 68
  - <Project, WonderWeb> 67
  - <Machine, Learning> 50
  - <con-, cepts> 29
31Experiment 2.1
- Input data: paragraphs extracted from the Reuters documents, minimal support = 7, number of all discovered patterns: 37 for the gp3 pattern, 67 for the gp4 pattern, 77 for the gp5 pattern.
- Selected results
  - <press, on, Tuesday> (34, 34, 34)
  - <Board, of, Trade> (13, 13, 13)
  - <rate, of, percent> (12, 14, 17)
  - <millions, of, dollars> (12, 13, 13)
  - <unit, of, Seagram> (12, 12, 12)
  - <Union, of, Kurdistan> (7, 7, 7)
  - <percent, in, year> (7)
32Experiment 2.2
- Input data: the Reuters documents, minimal support = 5, number of all the discovered patterns: 61 for the gp3 pattern, 99 for the gp4 pattern, 137 for the gp5 pattern.
- Selected results
  - <press, on, Tuesday> (34, 34, 34)
  - <Board, of, Trade> (12, 12, 12)
  - <unit, of, Seagram> (12, 12, 12)
  - <millions, of, dollars> (11, 12, 12)
  - <conditions, from, Lloyds> (12, 12, 12)
  - <rate, of, percent> (11, 13, 16)
  - <newspapers, on, Tuesday> (11, 11, 11)
  - <Union, of, Kurdistan> (7, 7, 7)
33Experiment 2.3
- Input data: paragraphs extracted from the DM scientific papers, minimal support = 4, number of all discovered patterns: 158 for the gp3 pattern, 402 for the gp4 pattern, 529 for the gp5 pattern.
- Selected results for the pattern gp3
  - <Conference, on, Artificial> 35
  - <point, of, view> 23
  - <University, of, Karlsruhe> 19
  - <Department, of, Computer> 14
  - <Workshop, on, Ontologies> 14
  - <domain, of, interest> 14
  - <Conference, on, Knowledge> 13
  - <Applications, of, Ontologies> 7
  - <University, of, Manchester> 7
  - <instances, of, concepts> 6
  - <corpus, of, texts> 6
  - <Management, of, Data> 6
  - <level, of, abstraction> 6
  - <Conference, on, Machine> 6