Title: Conclusion
1Conclusion
- There is no single best strategy for extracting an optimal set of candidate data from a corpus.
- You need to know at least some structural and distributional properties of the phenomena you are searching for.
- Preparation of candidate data influences distributions.
- Distributional properties determine the outcome of AMs.
- Know the distributional assumptions underlying the AMs you use.
2Adj-N
Hello, I am called Richard Herring and this is my
web-site.
As for Java being slow as molasses, that's often a red herring. In the first place, the "less genes, more behavioural flexibility" argument is a total red herring.
Smadja gives the example that the probability of two adjacent words in a text being "red herring" is greater than the probability of "red" times the probability of "herring". We can, therefore, use statistical methods to find collocations (Smadja, 1993).
Okay, I think I misled you by introducing a red herring, for which I am very sorry.
Home page for the British Red Cross, includes
information on activities, donations, branch
addresses,... Welcome to Red Pepper, independent
magazine of the green and radical left. Hotel
Accommodations by Red Roof Inn where you will
find affordable Hotel Rates. Web site of the
Royal Air Force aerobatic team, the Red Arrows.
3Extraction of Cooccurrence Data
4Basic Assumptions
- The collocates of a collocation cooccur more frequently within text than arbitrary word combinations. (Recurrence)
- Stricter control of cooccurrence data leads to more meaningful results in collocation extraction.
5Word (Co)occurrence
- The distribution of words and word combinations in text is approximately described by Zipf's law.
- The distribution of combinations is more extreme than that of individual words.
6Word (Co)occurrence
- Zipf's law
- n_m is the number of different words occurring m times
- i.e., there is a large number of low-frequency words and few high-frequency ones (a sketch of the formula follows below)
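A hedged sketch of the formula behind this bullet (the slide itself does not state the constants; the exponent below is the usual textbook value): in its frequency-spectrum form, Zipf's law says

    n_m ≈ C / m^(1+a),  with a ≈ 1,

i.e. the number of types occurring exactly m times falls off roughly like 1/m^2.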
7An Example
- corpus size: 8 million words from the Frankfurter Rundschau corpus
- 569,310 PNV-combinations (types) have been selected from the extraction corpus, including main verbs, modals and auxiliaries
- considering only combinations with main verbs, the number of PNV-types reduces to 372,212 (full forms)
- corresponding to 454,088 instances (tokens)
8Distribution of PNV types according to frequency
(372,212 types)
9Distribution of PNV types according to frequency (10,430 types with f > 2)
10Word (Co)occurrence
- Collocations will preferably be found among highly recurrent word combinations extracted from text.
- Large amounts of text need to be processed to obtain a sufficient number of high-frequency combinations.
11Control of Candidate Data
- Extract collocations from relational bigrams
- Syntactic homogeneity of candidate data
- (Grammatical) cleanness of candidates
- e.g. N-V pairs: Subject-V vs. Object-V
- Text type, domain, and size of source corpus
influence the outcome of collocation extraction
12Terminology
- Extraction corpus: tokenized, pos-tagged or syntactically analysed text
- Base data: list of bigrams found in the corpus
- Cooccurrence data: bigrams with contingency tables
- Collocation candidates: ranked bigrams
13Types and Tokens
- Frequency counts (from corpora)
- identify labelled units (tokens), e.g. words, NPs, Adj-N pairs
- set of different labels (types)
- type frequency: number of tokens labelled with this type
- example: ... what the black box does ...
15Types and Tokens
- Counting cooccurrences
- bigram tokens: pairs of word tokens
- bigram types: pairs of word types
- contingency table: four-way classification of bigram tokens according to their components
16Contingency Tables
- contingency table for pair type (u,v) (a sketch of its layout follows below)
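The table itself appears as a figure on the original slide; a minimal sketch of its usual layout, consistent with the counts and the Perl code introduced on later slides, is:

                     second component = v        second component != v
    first = u        O11 = f(u,v)                O12 = f1(u) - f(u,v)
    first != u       O21 = f2(v) - f(u,v)        O22 = N - f1(u) - f2(v) + f(u,v)

where f(u,v) is the pair frequency, f1(u) and f2(v) are the marginal frequencies of u and v, and N is the total number of pair tokens.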
17Collocation Extraction Processing Steps
- Corpus preprocessing
- tokenization (orthographic words)
- pos-tagging
- morphological analysis / lemmatization
- partial parsing
- (full parsing)
18Collocation Extraction Processing Steps
- Extraction of base data from corpus
- adjacent word pairs
- Adj-N pairs from NP chunks
- Object-V / Subject-V pairs from parse trees
- Calculation of cooccurrence data
- compute contingency table for each pair type
(u,v)
19Collocation Extraction Processing Steps
- Ranking of cooccurrence data by "association scores"
- measure the statistical association between types u and v
- true collocations should obtain high scores
- using association measures (AMs)
- N-best list: listing of the N highest-ranked collocation candidates (see the Perl sketch below)
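A minimal, self-contained Perl sketch of this ranking step. The toy pair tokens and the choice of pointwise MI as the association measure are illustrative assumptions; the deck's own Perl code for the counting part appears on later slides.

    #!/usr/bin/perl
    # Sketch: build contingency counts for a few toy pair tokens, score each pair
    # type with pointwise MI as an example AM, and print the N-best list.
    use strict;
    use warnings;

    my @pairs = (["black","box"], ["big","dog"], ["black","dog"],
                 ["small","cat"], ["black","box"], ["old","box"]);

    my (%F, %F1, %F2);
    my $N = 0;
    foreach my $p (@pairs) {                    # cooccurrence data
        my ($u, $v) = @$p;
        $F{"$u,$v"}++;
        $F1{$u}++;
        $F2{$v}++;
        $N++;
    }

    my %score;
    foreach my $pair (keys %F) {
        my ($u, $v) = split /,/, $pair;
        my $E11 = $F1{$u} * $F2{$v} / $N;       # expected frequency under independence
        $score{$pair} = log($F{$pair} / $E11) / log(2);   # pointwise MI
    }

    my $n = 3;                                  # size of the N-best list
    my @nbest = (sort { $score{$b} <=> $score{$a} } keys %score)[0 .. $n - 1];
    printf "%-12s %5.2f\n", $_, $score{$_} for @nbest;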
20Base Data: How to get?
- Adj-N
- adjacency data
- numerical span
- NP chunking
- (lemmatized)
21Base Data: How to get?
- V-N
- adjacency data
- sentence window
- (partial) parsing
- identification of grammatical relations
- (lemmatized)
22Base Data: How to get?
- PP-V
- adjacency data
- PP chunking
- separable verb particles (in German)
- (full syntactic analysis)
- (lemmatization?)
23Adj-N
In the first place, the less genes, more
behavioural flexibility argument is a total red
herring. In/PRP the/ART first/ORD place/N ,/,
the/ART / less/ADJ genes/N ,/, more/ADJ
behavioural/ADJ flexibility/N / argument/N
is/V a/ART total/ADJ red/ADJ herring/N ./.
24Adj-N: pos(w_i) = N
- span size 1 (adjacency): w_j, j = -1
- first/ORD place/N
- less/ADJ genes/N
- behavioural/ADJ flexibility/N
- / argument/N
- red/ADJ herring/N
25Adj-N: pos(w_i) = N
- span size 2: w_j, j = -2, -1
- first/ORD place/N
- the/ART place/N
- less/ADJ genes/N
- / genes/N
- behavioural/ADJ flexibility/N
- more/ADJ flexibility/N
- / argument/N
- flexibility/N argument/N
- red/ADJ herring/N
- total/ADJ herring/N
26Adj-N: pos(w_j) = ADJ, pos(w_i) = N
- span size 2: w_j, j = -2, -1 (a Perl sketch of this extraction follows below)
- first/ORD place/N
- the/ART place/N
- less/ADJ genes/N
- / genes/N
- behavioural/ADJ flexibility/N
- more/ADJ flexibility/N
- / argument/N
- flexibility/N argument/N
- red/ADJ herring/N
- total/ADJ herring/N
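A minimal Perl sketch of this span-based extraction. The tagged sentence is the slides' example with punctuation and quote tokens omitted; the subroutine name and the simple tag set are illustrative assumptions.

    #!/usr/bin/perl
    # Sketch: extract Adj-N candidate pairs (pos(w_j) = ADJ, pos(w_i) = N) from a
    # POS-tagged sentence, looking back over a numerical span.
    use strict;
    use warnings;

    my @tagged = qw(In/PRP the/ART first/ORD place/N the/ART less/ADJ genes/N
                    more/ADJ behavioural/ADJ flexibility/N argument/N
                    is/V a/ART total/ADJ red/ADJ herring/N);

    sub adj_n_pairs {
        my ($span, @tokens) = @_;
        my @pairs;
        for my $i (0 .. $#tokens) {
            my ($wi, $ti) = split m{/}, $tokens[$i];
            next unless $ti eq 'N';                      # pos(w_i) = N
            my $start = $i - $span < 0 ? 0 : $i - $span;
            for my $j ($start .. $i - 1) {
                my ($wj, $tj) = split m{/}, $tokens[$j];
                next unless $tj eq 'ADJ';                # pos(w_j) = ADJ
                push @pairs, "$wj $wi";
            }
        }
        return @pairs;
    }

    print "$_\n" for adj_n_pairs(2, @tagged);   # span size 2: j = -2, -1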
27Adj-N
(S (PP In/PRP (NP the/ART first/ORD place/N
) ) ,/, (NP the/ART / less/ADJ
genes/N ,/, more/ADJ behavioural/ADJ
flexibility/N / argument/N ) (VP is/V
(NP a/ART total/ADJ red/ADJ herring/N ) ) )
./.
28Adj-N
(S (PP-mod In/PRP (NP the/ART first/ORD
place/N ) ) ,/, (NP-subj
the/ART / less/ADJ genes/N ,/, more/ADJ
behavioural/ADJ flexibility/N / argument/N
) (VP-copula is/V (NP a/ART total/ADJ
red/ADJ herring/N ) ) ) ./.
29Adj-N: NP chunks
- NP chunks
- (NP the/ART first/ORD place/N )
- (NP the/ART / less/ADJ genes/N ,/, more/ADJ behavioural/ADJ flexibility/N / argument/N )
- (NP a/ART total/ADJ red/ADJ herring/N )
- Adj-N pairs (a Perl sketch of this pairing follows below)
- less/ADJ genes/N
- more/ADJ flexibility/N
- behavioural/ADJ flexibility/N
- more/ADJ argument/N
- behavioural/ADJ argument/N
- total/ADJ herring/N
- red/ADJ herring/N
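A minimal Perl sketch of the chunk-based pairing. The rule of not pairing across a comma is inferred from the pairs listed above; the chunk data structure is an illustrative assumption.

    #!/usr/bin/perl
    # Sketch: Adj-N candidate pairs from NP chunks, pairing every adjective with
    # every noun that follows it within the same comma-separated part of the chunk.
    use strict;
    use warnings;

    my @chunks = (
        [ qw(the/ART first/ORD place/N) ],
        [ 'the/ART', 'less/ADJ', 'genes/N', ',/,', 'more/ADJ',
          'behavioural/ADJ', 'flexibility/N', 'argument/N' ],
        [ qw(a/ART total/ADJ red/ADJ herring/N) ],
    );

    foreach my $np (@chunks) {
        my @tok = map { [ split m{/} ] } @$np;      # [word, tag] pairs
        for my $a (0 .. $#tok) {
            next unless $tok[$a][1] eq 'ADJ';
            for my $n ($a + 1 .. $#tok) {
                last if $tok[$n][0] eq ',';         # do not pair across a comma
                print "$tok[$a][0] $tok[$n][0]\n" if $tok[$n][1] eq 'N';
            }
        }
    }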
30N-V Object-VERB
- spill the beans
- Good for you for guessing the puzzle but from the beans Mike spilled to me, I think those kind of twists are more maddening than fun.
- bury the hatchet
- Paul McCartney has buried the hatchet with Yoko Ono after a dispute over the songwriting credits of some of the best-known Beatles songs.
31N-V Object-Mod-VERB
- keep <one's> nose to the grindstone
- I'm very impressed with you for having kept your nose to the grindstone, I'd like to offer you a managerial position.
- We've learned from experience and kept our nose to the grindstone to make sure our future remains a bright one.
- She keeps her nose to the grindstone.
32N-V Object-Mod-VERB
- keep <one's> nose to the grindstone
- (VP kept, keeps, ...
- (NP-obj your nose),
- (NP-obj our nose),
- (NP-obj her nose), ...
- (PP-mod to the grindstone) )
33PN-V P-Object-VERB
- zur Verfügung stellen (make available)
- Peter stellt sein Auto Maria zur Verfügung (Peter makes his car available to Maria)
- in Frage stellen (question)
- Peter stellt Marias Loyalität in Frage (Peter questions Maria's loyalty)
- in Verbindung setzen (to contact)
- Peter setzt sich mit Maria in Verbindung (Peter contacts Maria)
34Contingency Tables for Relational Cooccurrences
- (big, dog)
- (black, box)
- (black, dog)
- (small, cat)
- (small, box)
- (black, box)
- (old, box)
- (tabby, cat)
pair type (u,v) = (black, box)
40Contingency Tables for Relational Cooccurrences
- (big, dog)
- (black, box)
- (black, dog)
- (small, cat)
- (small, box)
- (black, box)
- (old, box)
- (tabby, cat)
f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8
41Contingency Tables for Relational Cooccurrences
f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8 (the completed table is sketched below)
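The completed table (shown as a figure on the original slide) follows from these counts by simple arithmetic:

                   v = box      v != box
    u = black      O11 = 2      O12 = 1        R1 = 3
    u != black     O21 = 2      O22 = 3        R2 = 5
                   C1  = 4      C2  = 4        N  = 8

with O12 = f1 - f, O21 = f2 - f, and O22 = N - f1 - f2 + f.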
43Contingency Tables for Relational Cooccurrences
real data from the BNC (adjacent adj-noun pairs, lemmatised)
44Contingency Tables in Perl
%F  = ();     # pair type counts f(u,v)
%F1 = ();     # marginal counts f1(u) for first components
%F2 = ();     # marginal counts f2(v) for second components
$N  = 0;      # sample size (number of pair tokens)
while (($u, $v) = get_pair()) {    # get_pair() returns the next pair token
    $F{"$u,$v"}++;
    $F1{$u}++;
    $F2{$v}++;
    $N++;
}
45Contingency Tables in Perl
foreach $pair (keys %F) {
    ($u, $v) = split /,/, $pair;
    $f   = $F{$pair};
    $f1  = $F1{$u};
    $f2  = $F2{$v};
    $O11 = $f;
    $O12 = $f1 - $f;
    $O21 = $f2 - $f;
    $O22 = $N - $f1 - $f2 + $f;   # note: + f, so that the four cells sum to N
    ...                           # e.g. compute an association score (see the sketch below)
}
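One way the "..." in the loop above could continue; the t-score used here is just one example, the slide does not fix a particular association measure at this point.

    $E11 = $f1 * $f2 / $N;                        # expected frequency under independence
    $score{$pair} = ($O11 - $E11) / sqrt($O11);   # t-score as one example AM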
46Reminder: Contingency Table with Row and Column Sums
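The table on this slide is a figure in the original; its standard layout with row and column sums is:

                  v        not v     row sums
    u             O11      O12       R1 = O11 + O12
    not u         O21      O22       R2 = O21 + O22
    column sums   C1       C2        N  = R1 + R2 = C1 + C2

with C1 = O11 + O21 and C2 = O12 + O22.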
47Why are Positional Cooccurrences Different?
- adjectives and nouns cooccurring within sentences
- "I saw a black dog" → (black, dog); f(black, dog) = 1, f1(black) = 1, f2(dog) = 1
- "The old man with the silly brown hat saw a black dog" → (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); f(black, dog) = 1, f1(black) = 3, f2(dog) = 4
48Why are Positional Cooccurrences Different?
- "wrong" combinations could be considered extraction noise (→ association measures should distinguish noise from recurrent combinations)
- but: a very large amount of noise
- statistical models assume that noise is completely random
- but: marginal frequencies often increase in large steps
49Contingency Tables for Segment-Based
Cooccurrences
- within pre-determined segments (e.g. sentences)
- components of cooccurring pairs may be syntactically restricted (e.g. adj-noun, noun.Sg-verb.3.Sg)
- for a given pair type (u,v), the set of all sentences is classified into four categories
50Contingency Tables for Segment-Based
Cooccurrences
- u ∈ S: at least one occurrence of u in sentence S
- u ∉ S: no occurrence of u in sentence S
- v ∈ S: at least one occurrence of v in sentence S
- v ∉ S: no occurrence of v in sentence S
51Contingency Tables for Segment-Based
Cooccurrences
- fS(u,v): number of sentences containing both u and v
- fS(u): number of sentences containing u
- fS(v): number of sentences containing v
- NS: total number of sentences (the corresponding contingency table is sketched below)
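From the four categories on the previous slide, the contingency table of sentence counts for a pair type (u,v) follows directly (a reconstruction of the table that appears as a figure in the original slides):

                v ∈ S                        v ∉ S
    u ∈ S       O11 = fS(u,v)                O12 = fS(u) - fS(u,v)
    u ∉ S       O21 = fS(v) - fS(u,v)        O22 = NS - fS(u) - fS(v) + fS(u,v)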
52Frequency Counts for Segment-Based Cooccurrences
- adjectives and nouns cooccurring within sentences
- "I saw a black dog" → (black, dog); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1
- "The old man with the silly brown hat saw a black dog" → (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1
53Segment-Based Cooccurrences in Perl
foreach $S (@sentences) {
    %words = map { $_ => 1 } words($S);    # word types occurring in sentence S
    %pairs = map { $_ => 1 } pairs($S);    # pair types occurring in sentence S
    foreach $w (keys %words) {
        $FS_w{$w}++;                       # fS(w): number of sentences containing w
    }
    foreach $p (keys %pairs) {
        $FS_p{$p}++;                       # fS(u,v): number of sentences containing the pair
    }
    $NS++;                                 # NS: total number of sentences
}
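A sketch of the follow-up step, which the slide leaves implicit: deriving the contingency table of one pair type from the segment-based counts above. It assumes pairs() returns keys of the form "u,v", as in the relational Perl code earlier in the deck; the example pair is illustrative.

    ($u, $v) = ('black', 'dog');          # example pair type (illustrative)
    $f   = $FS_p{"$u,$v"} || 0;           # fS(u,v)
    $f1  = $FS_w{$u}      || 0;           # fS(u)
    $f2  = $FS_w{$v}      || 0;           # fS(v)
    $O11 = $f;
    $O12 = $f1 - $f;
    $O21 = $f2 - $f;
    $O22 = $NS - $f1 - $f2 + $f;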
54Contingency Tables for Distance-Based
Cooccurrences
- problems are similar to segment-based cooccurrence data
- but: no pre-defined segments
- accurate counting is difficult
- here sketched for a special case:
- all orthographic words
- numerical span: nL words to the left, nR words to the right (a Perl sketch of the pair tokens follows below)
- no stop word lists
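A minimal Perl sketch of generating distance-based pair tokens for this special case; the example sentence is an assumption. Turning these pair tokens into proper contingency tables is exactly the issue the following slides discuss.

    #!/usr/bin/perl
    # Sketch: pair each token with every token in a window of nL words to the left
    # and nR words to the right (all orthographic words, no stop word list).
    use strict;
    use warnings;

    my ($nL, $nR) = (3, 2);
    my @w = qw(the old man with the silly brown hat saw a black dog);

    for my $i (0 .. $#w) {
        my $lo = $i - $nL < 0   ? 0   : $i - $nL;
        my $hi = $i + $nR > $#w ? $#w : $i + $nR;
        for my $j ($lo .. $hi) {
            next if $j == $i;
            print "$w[$i]\t$w[$j]\n";       # pair token: (node word, word in span)
        }
    }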
55Contingency Tables for Distance-Based
Cooccurrences
56Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
57Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
- window W(v) around them
58Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
- window W(v) around them
- occurrences of u
59Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
- window W(v) around them
- occurrences of u
- cross-classify occurrences of u against window
W(v)
60Contingency Tables for Distance-Based
Cooccurrences
61Contingency Tables for Distance-Based
Cooccurrences
62N-V Subject-Verb
63Cooccurrence Data
64Extraction Strategies
- required
- PoS-tagging
- basic phrase chunking
- infinitives with zu (to) are treated like single words
- separated verb prefixes are reattached to the verb
65Extraction Strategies
- Full forms or base forms ?
- depends on language and collocation type
- required
- morphological analysis
66Three Extraction Strategies
- Strategy 1: retrieval of n-grams from word forms only (w_i)
- Strategy 2: retrieval of n-grams from part-of-speech annotated word forms (wt_i)
- Strategy 3: retrieval of n-grams from word forms with particular parts-of-speech, at particular positions in syntactic structure (wt_i, c_j)
67Spans tested
- w_i w_(i+1)
- w_i w_(i+1) w_(i+2)
- w_i w_(i+2) w_(i+3)
- w_i w_(i+3) w_(i+4)
68Results of Strategy 1
- Retrieval of PP-verb collocations from word forms only is clearly inappropriate, as function words such as articles, prepositions, conjunctions and pronouns outnumber content words such as nouns, adjectives and verbs.
- Blunt use of stop word lists leads to the loss of collocation-relevant information, as the accessibility of prepositions and determiners may be crucial for distinguishing collocational from non-collocational word combinations.
69Results of Strategy 1
- most useful/informative span: w_i w_(i+1) w_(i+2)
- examples (trigram: frequency)
- bis 17 Uhr (until 5 p.m.): 2222
- FRANKFURT A. M.: 949
- in diesem Jahr (in this year): 915
- um 20 Uhr (at 8 p.m.): 855
- Di. bis Fr (Tue. to Fri.): 807
- 10 bis 17 (10 to 17): 779
- Tips und Termine (tips and events): 597
- in der Nacht (at night): 582
70What we have learned
- useful/informative span size is language specific
- we find a number of different constructions
- e.g.
- NP, PP, ...
- names, time phrases, conventionalized
constructions, ...
71Results of Strategy 2: wt_i wt_(i+1) with preposition t_i and noun t_(i+1)
- PPs with arbitrary preposition-noun co-occurrences, such as
- am Samstag (on Saturday),
- am Wochenende (at the weekend),
- für Kinder (for children)
- Fixed/conventionalized? PPs, such as
- zum Beispiel (for example)
72Results of Strategy 2: wt_i wt_(i+1) with preposition t_i and noun t_(i+1)
- PPs with a strong tendency towards a particular continuation, such as
- nach Angaben + NP_gen ('according to'),
- im Jahr + Card (in the year ...)
- Potential PP-collocates of verb-object collocations, such as
- zur Verfügung (at the disposal)
73Results of Strategy 2: wt_i wt_(i+2) with preposition t_i and noun t_(i+2)
- typically covers PPs with pre-nominal modification
- Cardinal, for instance, is the most probable modifier category co-occurring with
- bis ... Uhr (until ... o'clock)
- Adjective is the predominant modifier category related to
- im ... Jahr (in the ... year): 1272 of 1276 cases in total,
- vergangenen (Adj, 'last'; 466 instances)
74Results of Strategy 2: wt_i wt_(i+3) with preposition t_i and noun t_(i+3)
- typically exceeds phrase boundaries
- im Jahres (in_Dat year_Gen), for instance, originates from PP + NP_gen
- e.g. im September dieses Jahres (in September of this year)
75Results of Strategy 2: wt_i wt_(i+1) wt_(i+2) with preposition t_i, noun t_(i+1) and verb t_(i+2)
- Frequent preposition-noun-participle or -infinitive sequences are good indicators of PP-verb collocations, especially of collocations that function as predicates, such as support-verb constructions and a number of figurative expressions.
- zur Verfügung gestellt (made available)
- in Frage gestellt (questioned)
- in Verbindung setzen (to contact)
76Results of Strategy 2: wt_i wt_(i+2) wt_(i+3) with preposition t_i, noun t_(i+2) and verb t_(i+3); wt_i wt_(i+3) wt_(i+4) with preposition t_i, noun t_(i+3) and verb t_(i+4)
- a variety of PPs with prenominal modification are covered
- but phrase boundaries are also more likely to be exceeded
- durch Frauen helfen, extracted from contexts like durch X (Y) Frauen helfen
77Results of Strategy 3: wt_i c_k, wt_j c_k, wt_l c_m
78Conclusion
- There is no single best strategy for extracting an optimal set of candidate data from a corpus.
- You need to know at least some structural and distributional properties of the phenomena you are searching for.
- Preparation of candidate data influences distributions.
- Distributional properties determine the outcome of AMs.
- Know the distributional assumptions underlying the AMs you use.