Title: Conclusion
1Conclusion
- There is no single best strategy for extracting an optimal set of candidate data from a corpus.
- You need to know at least some structural and distributional properties of the phenomena you are searching for.
- Preparation of candidate data influences distributions.
- Distributional properties determine the outcome of AMs.
- Know the distributional assumptions underlying the AMs you use.
2Adj-N
Hello, I am called Richard Herring and this is my
web-site.
As for Java being slow as molasses, that's often a red herring. In the first place, the "less genes, more behavioural flexibility" argument is a total red herring.
Smadja gives the example that the probability of two adjacent words in a text being "red herring" is greater than the probability of "red" times the probability of "herring". We can, therefore, use statistical methods to find collocations (Smadja, 1993).
Okay, I think I misled you by introducing a red herring, for which I am very sorry.
Home page for the British Red Cross, includes
information on activities, donations, branch
addresses,... Welcome to Red Pepper, independent
magazine of the green and radical left. Hotel
Accommodations by Red Roof Inn where you will
find affordable Hotel Rates. Web site of the
Royal Air Force aerobatic team, the Red Arrows.
3Extraction of Cooccurrence Data
4Basic Assumptions
- The collocates of a collocation cooccur more frequently within text than arbitrary word combinations. (Recurrence)
- Stricter control of cooccurrence data leads to more meaningful results in collocation extraction.
5Word (Co)occurrence
- The distribution of words and word combinations in text is approximately described by Zipf's law.
- The distribution of combinations is more extreme than that of individual words.
6Word (Co)occurrence
- Zipf's law
- n_m is the number of different words occurring m times
- i.e., there is a large number of low-frequency words and few high-frequency ones (a sketch of the formula follows below)
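A hedged sketch of the formula behind this bullet (the slide itself does not state the constants; the exponent below is the usual textbook value): in its frequency-spectrum form, Zipf's law says

    n_m ≈ C / m^(1+a),  with a ≈ 1,

i.e. the number of types occurring exactly m times falls off roughly like 1/m^2.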
7An Example
- corpus size: 8 million words from the Frankfurter Rundschau corpus
- 569,310 PNV-combinations (types) have been selected from the extraction corpus, including main verbs, modals and auxiliaries
- considering only combinations with main verbs, the number of PNV-types reduces to 372,212 (full forms)
- corresponding to 454,088 instances (tokens)
8Distribution of PNV types according to frequency
(372,212 types)
9Distribution of PNV types according to frequency (10,430 types with f > 2)
10Word (Co)occurrence
- Collocations will preferably be found among highly recurrent word combinations extracted from text.
- Large amounts of text need to be processed to obtain a sufficient number of high-frequency combinations.
11Control of Candidate Data
- Extract collocations from relational bigrams
- Syntactic homogeneity of candidate data
- (Grammatical) cleanness of candidates
- e.g. N-V pairs: Subject-V vs. Object-V
- Text type, domain, and size of source corpus
influence the outcome of collocation extraction
12Terminology
- Extraction corpus: tokenized, pos-tagged or syntactically analysed text
- Base data: list of bigrams found in the corpus
- Cooccurrence data: bigrams with contingency tables
- Collocation candidates: ranked bigrams
13Types and Tokens
- Frequency counts (from corpora)
- identify labelled units (tokens), e.g. words, NPs, Adj-N pairs
- set of different labels (types)
- type frequency: number of tokens labelled with this type
- example: ... what the black box does ...
15Types and Tokens
- Counting cooccurrences
- bigram tokens: pairs of word tokens
- bigram types: pairs of word types
- contingency table: four-way classification of bigram tokens according to their components
16Contingency Tables
- contingency table for pair type (u,v) (a sketch of its layout follows below)
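The table itself appears as a figure on the original slide; a minimal sketch of its usual layout, consistent with the counts and the Perl code introduced on later slides, is:

                     second component = v        second component != v
    first = u        O11 = f(u,v)                O12 = f1(u) - f(u,v)
    first != u       O21 = f2(v) - f(u,v)        O22 = N - f1(u) - f2(v) + f(u,v)

where f(u,v) is the pair frequency, f1(u) and f2(v) are the marginal frequencies of u and v, and N is the total number of pair tokens.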
17Collocation Extraction Processing Steps
- Corpus preprocessing
- tokenization (orthographic words)
- pos-tagging
- morphological analysis / lemmatization
- partial parsing
- (full parsing)
18Collocation Extraction Processing Steps
- Extraction of base data from corpus
- adjacent word pairs
- Adj-N pairs from NP chunks
- Object-V / Subject-V pairs from parse trees
- Calculation of cooccurrence data
- compute contingency table for each pair type
(u,v)
19Collocation Extraction Processing Steps
- Ranking of cooccurrence data by "association scores"
- measure the statistical association between types u and v
- true collocations should obtain high scores
- using association measures (AMs)
- N-best list: listing of the N highest-ranked collocation candidates (see the Perl sketch below)
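A minimal, self-contained Perl sketch of this ranking step. The toy pair tokens and the choice of pointwise MI as the association measure are illustrative assumptions; the deck's own Perl code for the counting part appears on later slides.

    #!/usr/bin/perl
    # Sketch: build contingency counts for a few toy pair tokens, score each pair
    # type with pointwise MI as an example AM, and print the N-best list.
    use strict;
    use warnings;

    my @pairs = (["black","box"], ["big","dog"], ["black","dog"],
                 ["small","cat"], ["black","box"], ["old","box"]);

    my (%F, %F1, %F2);
    my $N = 0;
    foreach my $p (@pairs) {                    # cooccurrence data
        my ($u, $v) = @$p;
        $F{"$u,$v"}++;
        $F1{$u}++;
        $F2{$v}++;
        $N++;
    }

    my %score;
    foreach my $pair (keys %F) {
        my ($u, $v) = split /,/, $pair;
        my $E11 = $F1{$u} * $F2{$v} / $N;       # expected frequency under independence
        $score{$pair} = log($F{$pair} / $E11) / log(2);   # pointwise MI
    }

    my $n = 3;                                  # size of the N-best list
    my @nbest = (sort { $score{$b} <=> $score{$a} } keys %score)[0 .. $n - 1];
    printf "%-12s %5.2f\n", $_, $score{$_} for @nbest;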
20Base Data: How to get?
- Adj-N
- adjacency data
- numerical span
- NP chunking
- (lemmatized)
21Base Data: How to get?
- V-N
- adjacency data
- sentence window
- (partial) parsing
- identification of grammatical relations
- (lemmatized)
22Base Data: How to get?
- PP-V
- adjacency data
- PP chunking
- separable verb particles (in German)
- (full syntactic analysis)
- (lemmatization?)
23Adj-N
In the first place, the less genes, more
behavioural flexibility argument is a total red
herring. In/PRP the/ART first/ORD place/N ,/,
the/ART / less/ADJ genes/N ,/, more/ADJ
behavioural/ADJ flexibility/N / argument/N
is/V a/ART total/ADJ red/ADJ herring/N ./.
24Adj-N: pos(w_i) = N
- span size 1 (adjacency): w_j, j = -1
- first/ORD place/N
- less/ADJ genes/N
- behavioural/ADJ flexibility/N
- / argument/N
- red/ADJ herring/N
25Adj-N: pos(w_i) = N
- span size 2: w_j, j = -2, -1
- first/ORD place/N
- the/ART place/N
- less/ADJ genes/N
- / genes/N
- behavioural/ADJ flexibility/N
- more/ADJ flexibility/N
- / argument/N
- flexibility/N argument/N
- red/ADJ herring/N
- total/ADJ herring/N
26Adj-N: pos(w_j) = ADJ, pos(w_i) = N
- span size 2: w_j, j = -2, -1 (a Perl sketch of this extraction follows below)
- first/ORD place/N
- the/ART place/N
- less/ADJ genes/N
- / genes/N
- behavioural/ADJ flexibility/N
- more/ADJ flexibility/N
- / argument/N
- flexibility/N argument/N
- red/ADJ herring/N
- total/ADJ herring/N
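A minimal Perl sketch of this span-based extraction. The tagged sentence is the slides' example with punctuation and quote tokens omitted; the subroutine name and the simple tag set are illustrative assumptions.

    #!/usr/bin/perl
    # Sketch: extract Adj-N candidate pairs (pos(w_j) = ADJ, pos(w_i) = N) from a
    # POS-tagged sentence, looking back over a numerical span.
    use strict;
    use warnings;

    my @tagged = qw(In/PRP the/ART first/ORD place/N the/ART less/ADJ genes/N
                    more/ADJ behavioural/ADJ flexibility/N argument/N
                    is/V a/ART total/ADJ red/ADJ herring/N);

    sub adj_n_pairs {
        my ($span, @tokens) = @_;
        my @pairs;
        for my $i (0 .. $#tokens) {
            my ($wi, $ti) = split m{/}, $tokens[$i];
            next unless $ti eq 'N';                      # pos(w_i) = N
            my $start = $i - $span < 0 ? 0 : $i - $span;
            for my $j ($start .. $i - 1) {
                my ($wj, $tj) = split m{/}, $tokens[$j];
                next unless $tj eq 'ADJ';                # pos(w_j) = ADJ
                push @pairs, "$wj $wi";
            }
        }
        return @pairs;
    }

    print "$_\n" for adj_n_pairs(2, @tagged);   # span size 2: j = -2, -1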
27Adj-N
(S (PP In/PRP (NP the/ART first/ORD place/N
) ) ,/, (NP the/ART / less/ADJ
genes/N ,/, more/ADJ behavioural/ADJ
flexibility/N / argument/N ) (VP is/V
(NP a/ART total/ADJ red/ADJ herring/N ) ) )
./.
28Adj-N
(S (PP-mod In/PRP (NP the/ART first/ORD
place/N ) ) ,/, (NP-subj
the/ART / less/ADJ genes/N ,/, more/ADJ
behavioural/ADJ flexibility/N / argument/N
) (VP-copula is/V (NP a/ART total/ADJ
red/ADJ herring/N ) ) ) ./.
29Adj-N: NP chunks
- NP chunks
- (NP the/ART first/ORD place/N )
- (NP the/ART / less/ADJ genes/N ,/, more/ADJ behavioural/ADJ flexibility/N / argument/N )
- (NP a/ART total/ADJ red/ADJ herring/N )
- Adj-N pairs (a Perl sketch of this pairing follows below)
- less/ADJ genes/N
- more/ADJ flexibility/N
- behavioural/ADJ flexibility/N
- more/ADJ argument/N
- behavioural/ADJ argument/N
- total/ADJ herring/N
- red/ADJ herring/N
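A minimal Perl sketch of the chunk-based pairing. The rule of not pairing across a comma is inferred from the pairs listed above; the chunk data structure is an illustrative assumption.

    #!/usr/bin/perl
    # Sketch: Adj-N candidate pairs from NP chunks, pairing every adjective with
    # every noun that follows it within the same comma-separated part of the chunk.
    use strict;
    use warnings;

    my @chunks = (
        [ qw(the/ART first/ORD place/N) ],
        [ 'the/ART', 'less/ADJ', 'genes/N', ',/,', 'more/ADJ',
          'behavioural/ADJ', 'flexibility/N', 'argument/N' ],
        [ qw(a/ART total/ADJ red/ADJ herring/N) ],
    );

    foreach my $np (@chunks) {
        my @tok = map { [ split m{/} ] } @$np;      # [word, tag] pairs
        for my $a (0 .. $#tok) {
            next unless $tok[$a][1] eq 'ADJ';
            for my $n ($a + 1 .. $#tok) {
                last if $tok[$n][0] eq ',';         # do not pair across a comma
                print "$tok[$a][0] $tok[$n][0]\n" if $tok[$n][1] eq 'N';
            }
        }
    }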
30N-V Object-VERB
- spill the beans
- Good for you for guessing the puzzle but from the beans Mike spilled to me, I think those kind of twists are more maddening than fun.
- bury the hatchet
- Paul McCartney has buried the hatchet with Yoko Ono after a dispute over the songwriting credits of some of the best-known Beatles songs.
31N-V Object-Mod-VERB
- keep <one's> nose to the grindstone
- I'm very impressed with you for having kept your nose to the grindstone, I'd like to offer you a managerial position.
- We've learned from experience and kept our nose to the grindstone to make sure our future remains a bright one.
- She keeps her nose to the grindstone.
32N-V Object-Mod-VERB
- keep <one's> nose to the grindstone
- (VP kept, keeps, ...
- (NP-obj your nose),
- (NP-obj our nose),
- (NP-obj her nose), ...
- (PP-mod to the grindstone) )
33PN-V P-Object-VERB
- zur Verfügung stellen (make available)
- Peter stellt sein Auto Maria zur Verfügung (Peter makes his car available to Maria)
- in Frage stellen (question)
- Peter stellt Marias Loyalität in Frage (Peter questions Maria's loyalty)
- in Verbindung setzen (to contact)
- Peter setzt sich mit Maria in Verbindung (Peter contacts Maria)
34Contingency Tables for Relational Cooccurrences
- (big, dog)
- (black, box)
- (black, dog)
- (small, cat)
- (small, box)
- (black, box)
- (old, box)
- (tabby, cat)
pair type (u,v) = (black, box)
40Contingency Tables for Relational Cooccurrences
- (big, dog)
- (black, box)
- (black, dog)
- (small, cat)
- (small, box)
- (black, box)
- (old, box)
- (tabby, cat)
f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8
41Contingency Tables for Relational Cooccurrences
f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8 (the completed table is sketched below)
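The completed table (shown as a figure on the original slide) follows from these counts by simple arithmetic:

                   v = box      v != box
    u = black      O11 = 2      O12 = 1        R1 = 3
    u != black     O21 = 2      O22 = 3        R2 = 5
                   C1  = 4      C2  = 4        N  = 8

with O12 = f1 - f, O21 = f2 - f, and O22 = N - f1 - f2 + f.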
43Contingency Tables for Relational Cooccurrences
real data from the BNC (adjacent adj-noun pairs, lemmatised)
44Contingency Tables in Perl
%F  = ();     # pair type counts f(u,v)
%F1 = ();     # marginal counts f1(u) for first components
%F2 = ();     # marginal counts f2(v) for second components
$N  = 0;      # sample size (number of pair tokens)
while (($u, $v) = get_pair()) {    # get_pair() returns the next pair token
    $F{"$u,$v"}++;
    $F1{$u}++;
    $F2{$v}++;
    $N++;
}
45Contingency Tables in Perl
foreach $pair (keys %F) {
    ($u, $v) = split /,/, $pair;
    $f   = $F{$pair};
    $f1  = $F1{$u};
    $f2  = $F2{$v};
    $O11 = $f;
    $O12 = $f1 - $f;
    $O21 = $f2 - $f;
    $O22 = $N - $f1 - $f2 + $f;   # note: + f, so that the four cells sum to N
    ...                           # e.g. compute an association score (see the sketch below)
}
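One way the "..." in the loop above could continue; the t-score used here is just one example, the slide does not fix a particular association measure at this point.

    $E11 = $f1 * $f2 / $N;                        # expected frequency under independence
    $score{$pair} = ($O11 - $E11) / sqrt($O11);   # t-score as one example AM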
46Reminder: Contingency Table with Row and Column Sums
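The table on this slide is a figure in the original; its standard layout with row and column sums is:

                  v        not v     row sums
    u             O11      O12       R1 = O11 + O12
    not u         O21      O22       R2 = O21 + O22
    column sums   C1       C2        N  = R1 + R2 = C1 + C2

with C1 = O11 + O21 and C2 = O12 + O22.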
47Why are Positional Cooccurrences Different?
- adjectives and nouns cooccurring within sentences
- "I saw a black dog" → (black, dog); f(black, dog) = 1, f1(black) = 1, f2(dog) = 1
- "The old man with the silly brown hat saw a black dog" → (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); f(black, dog) = 1, f1(black) = 3, f2(dog) = 4
48Why are Positional Cooccurrences Different?
- "wrong" combinations could be considered extraction noise (→ association measures should distinguish noise from recurrent combinations)
- but: a very large amount of noise
- statistical models assume that noise is completely random
- but: marginal frequencies often increase in large steps
49Contingency Tables for Segment-Based
Cooccurrences
- within pre-determined segments (e.g. sentences)
- components of cooccurring pairs may be syntactically restricted (e.g. adj-noun, noun.Sg-verb.3.Sg)
- for a given pair type (u,v), the set of all sentences is classified into four categories
50Contingency Tables for Segment-Based
Cooccurrences
- u ∈ S: at least one occurrence of u in sentence S
- u ∉ S: no occurrence of u in sentence S
- v ∈ S: at least one occurrence of v in sentence S
- v ∉ S: no occurrence of v in sentence S
51Contingency Tables for Segment-Based
Cooccurrences
- fS(u,v): number of sentences containing both u and v
- fS(u): number of sentences containing u
- fS(v): number of sentences containing v
- NS: total number of sentences (the corresponding contingency table is sketched below)
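From the four categories on the previous slide, the contingency table of sentence counts for a pair type (u,v) follows directly (a reconstruction of the table that appears as a figure in the original slides):

                v ∈ S                        v ∉ S
    u ∈ S       O11 = fS(u,v)                O12 = fS(u) - fS(u,v)
    u ∉ S       O21 = fS(v) - fS(u,v)        O22 = NS - fS(u) - fS(v) + fS(u,v)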
52Frequency Counts for Segment-Based Cooccurrences
- adjectives and nouns cooccurring within sentences
- "I saw a black dog" → (black, dog); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1
- "The old man with the silly brown hat saw a black dog" → (old, dog), (silly, dog), (brown, dog), (black, dog), ..., (black, man), (black, hat); fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1
53Segment-Based Cooccurrences in Perl
foreach $S (@sentences) {
    %words = map { $_ => 1 } words($S);    # word types occurring in sentence S
    %pairs = map { $_ => 1 } pairs($S);    # pair types occurring in sentence S
    foreach $w (keys %words) {
        $FS_w{$w}++;                       # fS(w): number of sentences containing w
    }
    foreach $p (keys %pairs) {
        $FS_p{$p}++;                       # fS(u,v): number of sentences containing the pair
    }
    $NS++;                                 # NS: total number of sentences
}
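A sketch of the follow-up step, which the slide leaves implicit: deriving the contingency table of one pair type from the segment-based counts above. It assumes pairs() returns keys of the form "u,v", as in the relational Perl code earlier in the deck; the example pair is illustrative.

    ($u, $v) = ('black', 'dog');          # example pair type (illustrative)
    $f   = $FS_p{"$u,$v"} || 0;           # fS(u,v)
    $f1  = $FS_w{$u}      || 0;           # fS(u)
    $f2  = $FS_w{$v}      || 0;           # fS(v)
    $O11 = $f;
    $O12 = $f1 - $f;
    $O21 = $f2 - $f;
    $O22 = $NS - $f1 - $f2 + $f;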
54Contingency Tables for Distance-Based
Cooccurrences
- problems are similar to segment-based cooccurrence data
- but: no pre-defined segments
- accurate counting is difficult
- here sketched for a special case:
- all orthographic words
- numerical span: nL words to the left, nR words to the right (a Perl sketch of the pair tokens follows below)
- no stop word lists
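A minimal Perl sketch of generating distance-based pair tokens for this special case; the example sentence is an assumption. Turning these pair tokens into proper contingency tables is exactly the issue the following slides discuss.

    #!/usr/bin/perl
    # Sketch: pair each token with every token in a window of nL words to the left
    # and nR words to the right (all orthographic words, no stop word list).
    use strict;
    use warnings;

    my ($nL, $nR) = (3, 2);
    my @w = qw(the old man with the silly brown hat saw a black dog);

    for my $i (0 .. $#w) {
        my $lo = $i - $nL < 0   ? 0   : $i - $nL;
        my $hi = $i + $nR > $#w ? $#w : $i + $nR;
        for my $j ($lo .. $hi) {
            next if $j == $i;
            print "$w[$i]\t$w[$j]\n";       # pair token: (node word, word in span)
        }
    }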
55Contingency Tables for Distance-Based
Cooccurrences
56Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
57Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
- window W(v) around them
58Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
- window W(v) around them
- occurrences of u
59Contingency Tables for Distance-Based
Cooccurrences
- nL = 3, nR = 2
- occurrences of v
- window W(v) around them
- occurrences of u
- cross-classify occurrences of u against window
W(v)
60Contingency Tables for Distance-Based
Cooccurrences
61Contingency Tables for Distance-Based
Cooccurrences
62N-V Subject-Verb
63Cooccurrence Data
64Extraction Strategies
- required
- PoS-tagging
- basic phrase chunking
- infinitives with zu (to) are treated like single words
- separated verb prefixes are reattached to the verb
65Extraction Strategies
- Full forms or base forms ?
- depends on language and collocation type
- required
- morphological analysis
66Three Extraction Strategies
- Strategy 1: retrieval of n-grams from word forms only (w_i)
- Strategy 2: retrieval of n-grams from part-of-speech annotated word forms (wt_i)
- Strategy 3: retrieval of n-grams from word forms with particular parts-of-speech, at particular positions in syntactic structure (wt_i, c_j)
67Spans tested
- w_i w_(i+1)
- w_i w_(i+1) w_(i+2)
- w_i w_(i+2) w_(i+3)
- w_i w_(i+3) w_(i+4)
68Results of Strategy 1
- Retrieval of PP-verb collocations from word forms only is clearly inappropriate, as function words such as articles, prepositions, conjunctions and pronouns outnumber content words such as nouns, adjectives and verbs.
- Blunt use of stop word lists leads to the loss of collocation-relevant information, as the accessibility of prepositions and determiners may be crucial for distinguishing collocational from non-collocational word combinations.
69Results of Strategy 1
- most useful/informative span: w_i w_(i+1) w_(i+2)
- examples (trigram: frequency)
- bis 17 Uhr (until 5 p.m.): 2222
- FRANKFURT A. M.: 949
- in diesem Jahr (in this year): 915
- um 20 Uhr (at 8 p.m.): 855
- Di. bis Fr (Tue. to Fri.): 807
- 10 bis 17 (10 to 17): 779
- Tips und Termine (tips and events): 597
- in der Nacht (at night): 582
70What we have learned
- useful/informative span size is language specific
- we find a number of different constructions
- e.g.
- NP, PP, ...
- names, time phrases, conventionalized
constructions, ...
71Results of Strategy 2: wt_i wt_(i+1) with preposition t_i and noun t_(i+1)
- PPs with arbitrary preposition-noun co-occurrences, such as
- am Samstag (on Saturday),
- am Wochenende (at the weekend),
- für Kinder (for children)
- Fixed/conventionalized? PPs, such as
- zum Beispiel (for example)
72Results of Strategy 2: wt_i wt_(i+1) with preposition t_i and noun t_(i+1)
- PPs with a strong tendency towards a particular continuation, such as
- nach Angaben + NP_gen ('according to'),
- im Jahr + Card (in the year ...)
- Potential PP-collocates of verb-object collocations, such as
- zur Verfügung (at the disposal)
73Results of Strategy 2: wt_i wt_(i+2) with preposition t_i and noun t_(i+2)
- typically covers PPs with pre-nominal modification
- Cardinal, for instance, is the most probable modifier category co-occurring with
- bis ... Uhr (until ... o'clock)
- Adjective is the predominant modifier category related to
- im ... Jahr (in the ... year): 1272 of 1276 cases in total,
- vergangenen (Adj, 'last'; 466 instances)
74Results of Strategy 2: wt_i wt_(i+3) with preposition t_i and noun t_(i+3)
- typically exceeds phrase boundaries
- im Jahres (in_Dat year_Gen), for instance, originates from PP + NP_gen
- e.g. im September dieses Jahres (in September of this year)
75Results of Strategy 2: wt_i wt_(i+1) wt_(i+2) with preposition t_i, noun t_(i+1) and verb t_(i+2)
- Frequent preposition-noun-participle or -infinitive sequences are good indicators of PP-verb collocations, especially of collocations that function as predicates, such as support-verb constructions and a number of figurative expressions.
- zur Verfügung gestellt (made available)
- in Frage gestellt (questioned)
- in Verbindung setzen (to contact)
76Results of Strategy 2: wt_i wt_(i+2) wt_(i+3) with preposition t_i, noun t_(i+2) and verb t_(i+3); wt_i wt_(i+3) wt_(i+4) with preposition t_i, noun t_(i+3) and verb t_(i+4)
- a variety of PPs with prenominal modification are covered
- but phrase boundaries are also more likely to be exceeded
- durch Frauen helfen, extracted from contexts like durch X (Y) Frauen helfen
77Results of Strategy 3: wt_i c_k, wt_j c_k, wt_l c_m
78Conclusion
- There is no single best strategy for extracting an optimal set of candidate data from a corpus.
- You need to know at least some structural and distributional properties of the phenomena you are searching for.
- Preparation of candidate data influences distributions.
- Distributional properties determine the outcome of AMs.
- Know the distributional assumptions underlying the AMs you use.