Conclusion - PowerPoint PPT Presentation
Provided by: brig154

Transcript and Presenter's Notes

1
Conclusion
  • There is no single best strategy to extract an
    optimal set of candidate data from a corpus.
  • You need to know at least some structural and
    distributional properties of the phenomena you
    are searching for.
  • Preparation of candidate data influences
    distributions.
  • Distributional properties determine the outcome
    of AMs.
  • Know the distributional assumptions underlying
    the AMs you use.

2
Adj-N
Hello, I am called Richard Herring and this is my
web-site.
As for Java being slow as molasses, that's often
a red herring. In the first place, the "less
genes, more behavioural flexibility" argument is
a total red herring.
Smadja gives the example that the probability
that any two adjacent words in a text will be
"red herring" is greater than the probability of
"red" times the probability of "herring". We can,
therefore, use statistical methods to find
collocations (Smadja, 1993).
Okay, I think I misled you by introducing a red
herring, for which I am very sorry.
Home page for the British Red Cross, includes
information on activities, donations, branch
addresses, ... Welcome to Red Pepper, independent
magazine of the green and radical left. Hotel
Accommodations by Red Roof Inn where you will
find affordable Hotel Rates. Web site of the
Royal Air Force aerobatic team, the Red Arrows.
3
Extraction of Cooccurrence Data
4
Basic Assumptions
  • The collocates of a collocation cooccur more
    frequently within text than arbitrary word
    combinations. (Recurrence)
  • Stricter control of cooccurrence data leads to
    more meaningful results in collocation extraction.

5
Word (Co)occurrence
  • Distribution of words and word combinations in
    text is approximately described by Zipf's law.
  • Distribution of combinations is more extreme
    than that of individual words.

6
Word (Co)occurrence
  • Zipf's law
  • nm is the number of different word types
    occurring exactly m times
  • i.e., there is a large number of low-frequency
    words, and few high-frequency ones
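The frequency spectrum nm can be computed directly from a token list. A minimal Python sketch (toy sentence, not the deck's corpus):

```python
from collections import Counter

tokens = "the cat saw the dog the dog saw the cat and the cat ran".split()
freq = Counter(tokens)             # occurrences per word type
spectrum = Counter(freq.values())  # n_m: number of types occurring exactly m times
# even in this tiny sample, low frequencies dominate: spectrum[1] >= spectrum[5]
```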

7
An Example
  • corpus size 8 million words from the Frankfurter
    Rundschau corpus
  • 569,310 PNV-combinations (types) have been
    selected from the extraction corpus including
    main verbs, modals and auxiliaries.
  • Considering only combinations with main verbs,
    the number of PNV-types reduces to 372,212 (full
    forms).
  • Corresponding to 454,088 instances

8
Distribution of PNV types according to frequency
(372,212 types)
9
Distribution of PNV types according to frequency
(10,430 types with f > 2)
10
Word (Co)occurrence
  • Collocations will be preferably found among
    highly recurrent word combinations extracted from
    text.
  • Large amounts of text need to be processed to
    obtain sufficient number of high-frequency
    combinations.

11
Control of Candidate Data
  • Extract collocations from relational bigrams
  • Syntactic homogeneity of candidate data
  • (Grammatical) cleanness of candidates
  • e.g. N-V pairs: Subject-V vs. Object-V
  • Text type, domain, and size of source corpus
    influence the outcome of collocation extraction

12
Terminology
  • Extraction corpus: tokenized, pos-tagged or
    syntactically analysed text
  • Base data: list of bigrams found in corpus
  • Cooccurrence data: bigrams with contingency
    tables
  • Collocation candidates: ranked bigrams

13
Types and Tokens
  • Frequency counts (from corpora)
  • identify labelled units (tokens), e.g. words,
    NPs, Adj-N pairs
  • set of different labels (types)
  • type frequency = number of tokens labelled with
    this type
  • example: ... what the black box does ...

15
Types and Tokens
  • Counting cooccurrences
  • bigram tokens = pairs of word tokens
  • bigram types = pairs of word types
  • contingency table = four-way classification of
    bigram tokens according to their components
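The bigram token/type distinction can be made concrete in a few lines. A Python sketch over a toy sentence (adjacent word pairs):

```python
tokens = "the black box and the black dog".split()
bigram_tokens = list(zip(tokens, tokens[1:]))  # pairs of word tokens
bigram_types = set(bigram_tokens)              # pairs of word types
# ("the", "black") occurs twice as a token but is a single type
```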

16
Contingency Tables
  • contingency table for pair type (u,v)
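The table itself did not survive the transcript. A reconstruction of the standard 2x2 layout, consistent with the O11..O22 formulas on the Perl slides later in the deck:

```
                v present        v absent          row sum
  u present     O11 = f          O12 = f1 - f      f1
  u absent      O21 = f2 - f     O22 = N-f1-f2+f   N - f1
  column sum    f2               N - f2            N
```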

17
Collocation Extraction Processing Steps
  • Corpus preprocessing
  • tokenization (orthographic words)
  • pos-tagging
  • morphological analysis / lemmatization
  • partial parsing
  • (full parsing)

18
Collocation Extraction Processing Steps
  • Extraction of base data from corpus
  • adjacent word pairs
  • Adj-N pairs from NP chunks
  • Object-V Subject-V from parse trees
  • Calculation of cooccurrence data
  • compute contingency table for each pair type
    (u,v)

19
Collocation Extraction Processing Steps
  • Ranking of cooccurrence data by "association
    scores"
  • measure statistical association between types u
    and v
  • true collocations should obtain high scores
  • using association measures (AM)
  • N-best list listing of N highest-ranked
    collocation candidates
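As an illustration of such an association score (not necessarily one the deck uses), pointwise mutual information ranks pair types by how far their joint frequency exceeds chance expectation. A Python sketch with made-up counts:

```python
import math

def mi_score(f, f1, f2, N):
    # pointwise mutual information: log2(observed / expected), expected = f1*f2/N
    return math.log2(f * N / (f1 * f2))

# hypothetical cooccurrence data: pair type -> (f, f1, f2); N = total pair tokens
data = {("red", "herring"): (10, 30, 12), ("the", "thing"): (12, 4000, 50)}
N = 10000
n_best = sorted(data, key=lambda p: mi_score(*data[p], N), reverse=True)
```

A true collocation like (red, herring) scores high because nearly every occurrence of either word is a joint occurrence, while the high-frequency but non-associated pair scores below zero.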

20
Base Data: How to get?
  • Adj-N
  • adjacency data
  • numerical span
  • NP chunking
  • (lemmatized)

21
Base Data: How to get?
  • V-N
  • adjacency data
  • sentence window
  • (partial) parsing
  • identification of grammatical relations
  • (lemmatized)

22
Base Data: How to get?
  • PP-V
  • adjacency data
  • PP chunking
  • separable verb particles (in German)
  • (full syntactic analysis)
  • (lemmatization?)

23
Adj-N
In the first place, the less genes, more
behavioural flexibility argument is a total red
herring. In/PRP the/ART first/ORD place/N ,/,
the/ART / less/ADJ genes/N ,/, more/ADJ
behavioural/ADJ flexibility/N / argument/N
is/V a/ART total/ADJ red/ADJ herring/N ./.
24
Adj-N: pos(wi) = N
  • span size 1 (adjacency): wj, j = i-1
  • first/ORD place/N
  • less/ADJ genes/N
  • behavioural/ADJ flexibility/N
  • / argument/N
  • red/ADJ herring/N

25
Adj-N: pos(wi) = N
  • more/ADJ flexibility/N
  • / argument/N
  • flexibility/N argument/N
  • red/ADJ herring/N
  • total/ADJ herring/N
  • span size 2: wj, j = i-2, i-1
  • first/ORD place/N
  • the/ART place/N
  • less/ADJ genes/N
  • / genes/N
  • behavioural/ADJ flexibility/N

26
Adj-N: pos(wj) = ADJ, pos(wi) = N
  • more/ADJ flexibility/N
  • / argument/N
  • flexibility/N argument/N
  • red/ADJ herring/N
  • total/ADJ herring/N
  • span size 2: wj, j = i-2, i-1
  • first/ORD place/N
  • the/ART place/N
  • less/ADJ genes/N
  • / genes/N
  • behavioural/ADJ flexibility/N

27
Adj-N
(S (PP In/PRP (NP the/ART first/ORD place/N
) ) ,/, (NP the/ART / less/ADJ
genes/N ,/, more/ADJ behavioural/ADJ
flexibility/N / argument/N ) (VP is/V
(NP a/ART total/ADJ red/ADJ herring/N ) ) )
./.
28
Adj-N
(S (PP-mod In/PRP (NP the/ART first/ORD
place/N ) ) ,/, (NP-subj
the/ART / less/ADJ genes/N ,/, more/ADJ
behavioural/ADJ flexibility/N / argument/N
) (VP-copula is/V (NP a/ART total/ADJ
red/ADJ herring/N ) ) ) ./.
29
Adj-N: NP chunks
  • NP chunks
  • (NP the/ART first/ORD place/N )
  • (NP the/ART / less/ADJ genes/N ,/,
    more/ADJ behavioural/ADJ flexibility/N /
    argument/N )
  • (NP a/ART total/ADJ red/ADJ herring/N )
  • Adj-N Pairs
  • less/ADJ genes/N
  • more/ADJ flexibility/N
  • behavioural/ADJ flexibility/N
  • more/ADJ argument/N
  • behavioural/ADJ argument/N
  • total/ADJ herring/N
  • red/ADJ herring/N
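The pairing scheme shown here can be sketched in Python. One plausible reading of the slide's output (each adjective paired with every noun following it in the chunk, without crossing a comma) is assumed; the tagged-tuple representation is illustrative:

```python
def adj_n_pairs(chunk):
    # pair each ADJ with every N that follows it in the chunk,
    # stopping at a comma (one plausible reading of the slide's pairs)
    pairs = []
    for i, (word, tag) in enumerate(chunk):
        if tag != "ADJ":
            continue
        for w, t in chunk[i + 1:]:
            if t == ",":
                break
            if t == "N":
                pairs.append((word, w))
    return pairs

chunk = [("a", "ART"), ("total", "ADJ"), ("red", "ADJ"), ("herring", "N")]
# adj_n_pairs(chunk) -> [("total", "herring"), ("red", "herring")]
```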

30
N-V: Object-VERB
  • spill the beans
  • Good for you for guessing the puzzle but from the
    beans Mike spilled to me, I think those kind of
    twists are more maddening than fun.
  • bury the hatchet
  • Paul McCartney has buried the hatchet with Yoko
    Ono after a dispute over the songwriting credits
    of some of the best-known Beatles songs.

31
N-V: Object-Mod-VERB
  • keep <one's> nose to the grindstone
  • I'm very impressed with you for having kept your
    nose to the grindstone, I'd like to offer you a
    managerial position.
  • We've learned from experience and kept our nose
    to the grindstone to make sure our future remains
    a bright one.
  • She keeps her nose to the grindstone.

32
N-V: Object-Mod-VERB
  • keep <one's> nose to the grindstone
  • (VP kept, keeps, ...
  • (NP-obj your nose),
  • (NP-obj our nose),
  • (NP-obj her nose), ...
  • (PP-mod to the grindstone) )

33
PN-V: P-Object-VERB
  • zur Verfügung stellen (make available)
  • Peter stellt sein Auto Maria zur Verfügung
    (Peter makes his car available to Maria)
  • in Frage stellen (question)
  • Peter stellt Marias Loyalität in Frage (Peter
    questions Maria's loyalty)
  • in Verbindung setzen (to contact)
  • Peter setzt sich mit Maria in Verbindung
    (Peter contacts Maria)

34
Contingency Tables for Relational Cooccurrences
  • (big, dog)
  • (black, box)
  • (black, dog)
  • (small, cat)
  • (small, box)
  • (black, box)
  • (old, box)
  • (tabby, cat)

pair type (u,v) = (black, box)
40
Contingency Tables for Relational Cooccurrences
  • (big, dog)
  • (black, box)
  • (black, dog)
  • (small, cat)
  • (small, box)
  • (black, box)
  • (old, box)
  • (tabby, cat)

f(u,v) = 2, f1(u) = 3, f2(v) = 4, N = 8
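These counts can be verified mechanically over the eight toy pairs. A Python sketch:

```python
pairs = [("big", "dog"), ("black", "box"), ("black", "dog"), ("small", "cat"),
         ("small", "box"), ("black", "box"), ("old", "box"), ("tabby", "cat")]
u, v = "black", "box"

f = pairs.count((u, v))                  # joint frequency f(u,v)
f1 = sum(1 for a, _ in pairs if a == u)  # marginal frequency f1(u)
f2 = sum(1 for _, b in pairs if b == v)  # marginal frequency f2(v)
N = len(pairs)

# contingency table cells; the four cells partition the N pair tokens
O11, O12, O21, O22 = f, f1 - f, f2 - f, N - f1 - f2 + f
```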
43
Contingency Tables for Relational Cooccurrences
real data from the BNC (adjacent adj-noun pairs,
lemmatised)
44
Contingency Tables in Perl
  my %F  = ();
  my %F1 = ();
  my %F2 = ();
  my $N  = 0;
  while (my ($u, $v) = get_pair()) {
      $F{"$u,$v"}++;
      $F1{$u}++;
      $F2{$v}++;
      $N++;
  }

45
Contingency Tables in Perl
  foreach my $pair (keys %F) {
      my ($u, $v) = split /,/, $pair;
      my $f  = $F{$pair};
      my $f1 = $F1{$u};
      my $f2 = $F2{$v};
      my $O11 = $f;
      my $O12 = $f1 - $f;
      my $O21 = $f2 - $f;
      my $O22 = $N - $f1 - $f2 + $f;
      ...
  }

46
Reminder: Contingency Table with Row and Column
Sums
47
Why are Positional Cooccurrences Different?
  • adjectives and nouns cooccurring within sentences
  • "I saw a black dog" → (black, dog)
    f(black, dog) = 1, f1(black) = 1, f2(dog) = 1
  • "The old man with the silly brown hat saw a
    black dog" → (old, dog), (silly, dog), (brown,
    dog), (black, dog), ..., (black, man), (black, hat)
    f(black, dog) = 1, f1(black) = 3, f2(dog) = 4

48
Why are PositionalCooccurrences Different?
  • "wrong" combinations could be considered as
    extraction noise(? association measures
    distinguish noise from recurrent combinations)
  • but very large amount of noise
  • statistical models assume that noise is
    completely random
  • but marginal frequencies often increase in large
    steps

49
Contingency Tables for Segment-Based
Cooccurrences
  • within pre-determined segments (e.g. sentences)
  • components of cooccurring pairs may be
    syntactically restricted(e.g. adj-noun,
    nounSg-verb3.Sg)
  • for given pair type (u,v),set of all sentences
    is classified into four categories

50
Contingency Tables for Segment-Based
Cooccurrences
  • u ∈ S: at least one occurrence of u in sentence S
  • u ∉ S: no occurrences of u in sentence S
  • v ∈ S: at least one occurrence of v in sentence S
  • v ∉ S: no occurrences of v in sentence S

51
Contingency Tables for Segment-Based
Cooccurrences
  • fS(u,v) = number of sentences containing both u
    and v
  • fS(u) = number of sentences containing u
  • fS(v) = number of sentences containing v
  • NS = total number of sentences

52
Frequency Counts for Segment-Based Cooccurrences
  • adjectives and nouns cooccurring within sentences
  • "I saw a black dog" → (black, dog)
    fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1
  • "The old man with the silly brown hat saw a
    black dog" → (old, dog), (silly, dog), (brown,
    dog), (black, dog), ..., (black, man), (black, hat)
    fS(black, dog) = 1, fS(black) = 1, fS(dog) = 1

53
Segment-Based Cooccurrences in Perl
  foreach my $S (@sentences) {
      my %words = map { $_ => 1 } words($S);
      my %pairs = map { $_ => 1 } pairs($S);
      foreach my $w (keys %words) { $FS_w{$w}++; }
      foreach my $p (keys %pairs) { $FS_p{$p}++; }
      $NS++;
  }

54
Contingency Tables for Distance-Based
Cooccurrences
  • problems are similar to segment-based
    cooccurrence data
  • but no pre-defined segments
  • accurate counting is difficult
  • here sketched for a special case
  • all orthographic words
  • numerical span nL left, nR right
  • no stop word lists

55
Contingency Tables for Distance-Based
Cooccurrences
  • nL = 3, nR = 2

56
Contingency Tables for Distance-Based
Cooccurrences
  • nL = 3, nR = 2
  • occurrences of v

57
Contingency Tables for Distance-Based
Cooccurrences
  • nL = 3, nR = 2
  • occurrences of v
  • window W(v) around them

58
Contingency Tables for Distance-Based
Cooccurrences
  • nL = 3, nR = 2
  • occurrences of v
  • window W(v) around them
  • occurrences of u

59
Contingency Tables for Distance-Based
Cooccurrences
  • nL = 3, nR = 2
  • occurrences of v
  • window W(v) around them
  • occurrences of u
  • cross-classify occurrences of u against window
    W(v)
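The counting procedure sketched on these slides can be written out in Python. The helper name and the toy sentence are assumptions, not the deck's implementation:

```python
def distance_cooccurrence(tokens, u, v, nL=3, nR=2):
    # positions of the occurrences of v
    v_pos = {i for i, w in enumerate(tokens) if w == v}
    # window W(v): positions within nL left / nR right of any occurrence of v
    W = set()
    for i in v_pos:
        W.update(range(max(0, i - nL), min(len(tokens), i + nR + 1)))
    W -= v_pos  # the occurrences of v themselves are not candidate slots
    # cross-classify occurrences of u against W(v)
    f = sum(1 for i in W if tokens[i] == u)
    return f, W

tokens = "a black dog saw a black cat near the black dog".split()
f, W = distance_cooccurrence(tokens, "black", "dog")
```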

60
Contingency Tables for Distance-Based
Cooccurrences
61
Contingency Tables for Distance-Based
Cooccurrences
62
N-V Subject-Verb
  • Examples

63
Cooccurrence Data
  • Token-Type Distinction

64
Extraction Strategies
  • required
  • PoS-tagging
  • basic phrase chunking
  • infinitives with zu (to) are treated like single
    words,
  • separated verb prefixes are reattached to the verb

65
Extraction Strategies
  • Full forms or base forms?
  • depends on language and collocation type
  • required
  • morphological analysis

66
3 Extraction Strategies
  • Strategy 1: retrieval of n-grams from word forms
    only (wi)
  • Strategy 2: retrieval of n-grams from
    part-of-speech annotated word forms (wi ti)
  • Strategy 3: retrieval of n-grams from word forms
    with particular parts-of-speech, at particular
    positions in syntactic structure (wi ti cj)

67
Spans tested
  • wi wi+1
  • wi wi+1 wi+2
  • wi wi+2 wi+3
  • wi wi+3 wi+4
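These span patterns can be extracted with a small helper. A Python sketch where an offsets tuple like (0, 2, 3) encodes the pattern wi wi+2 wi+3 (names and sentence are illustrative):

```python
def gapped_ngrams(tokens, offsets):
    # one tuple (tokens[i+o] for o in offsets) per valid start position i
    span = max(offsets)
    return [tuple(tokens[i + o] for o in offsets)
            for i in range(len(tokens) - span)]

tokens = "bis 17 Uhr geoeffnet".split()
# gapped_ngrams(tokens, (0, 1, 2))
#   -> [("bis", "17", "Uhr"), ("17", "Uhr", "geoeffnet")]
```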

68
Results of Strategy 1
  • Retrieval of PP-verb collocations from word forms
    only is clearly inappropriate as function words
    like articles, prepositions, conjunctions,
    pronouns, etc. outnumber content words such as
    nouns, adjectives and verbs.
  • Blunt use of stop word lists leads to the loss of
    collocation-relevant information, as
    accessibility of prepositions and determiners may
    be crucial for the distinction of collocational
    and noncollocational word combinations.

69
Results of Strategy 1
  • most useful/informative span: wi wi+1 wi+2
  • examples

bis 17 Uhr (2222)
FRANKFURT A. M. (949)
in diesem Jahr (915)
um 20 Uhr (855)
Di. bis Fr (807)
10 bis 17 (779)
Tips und Termine (597)
in der Nacht (582)
70
we have learned
  • useful/informative span size is language specific
  • we find a number of different constructions
  • e.g.
  • NP, PP, ...
  • names, time phrases, conventionalized
    constructions, ...

71
Results of Strategy 2: wti wti+1 with
preposition ti and noun ti+1
  • PPs with arbitrary preposition-noun
    co-occurrences such as
  • am Samstag (on Saturday),
  • am Wochenende (at the weekend),
  • für Kinder (for children)
  • Fixed/conventionalized? PPs such as
  • zum Beispiel (for example)

72
Results of Strategy 2: wti wti+1 with
preposition ti and noun ti+1
  • PPs with a strong tendency for particular
    continuation such as
  • nach Angaben NPgen (according to'),
  • im Jahr Card (in the year).
  • Potential PP-collocates of verb-object
    collocations such as
  • zur Verfügung (at the disposal)

73
Results of Strategy 2: wti wti+2 with
preposition ti and noun ti+2
  • typically cover PPs with pre-nominal modification
  • Cardinal, for instance, is the most probable
    modifier category co-occurring with
  • bis ... Uhr (until ... o'clock)
  • Adjective is the predominant modifier category
    related to
  • im ... Jahr (1272 of 1276 cases total),
  • vergangenen (Adj, last, 466 instances)

74
Results of Strategy 2: wti wti+3 with
preposition ti and noun ti+3
  • typically exceeds phrase boundaries
  • im Jahres (in+dat year+gen), for instance,
    originates from PP + NPgen
  • e.g. im September dieses Jahres (in the
    September of this year)

75
Results of Strategy 2: wti wti+1 wti+2 with
preposition ti and noun ti+1 and verb ti+2
  • Frequent preposition-noun-participle or
    -infinitive sequences are good indicators for
    PP-verb collocations, especially for collocations
    that function as predicates such as support-verb
    constructions and a number of figurative
    expressions.
  • zur Verfügung gestellt (made available)
  • in Frage gestellt (questioned)
  • in Verbindung setzen (to contact)

76
Results of Strategy 2: wti wti+2 wti+3 (preposition
ti, noun ti+2, verb ti+3) and wti wti+3 wti+4
(preposition ti, noun ti+3, verb ti+4)
  • a variety of PPs with prenominal modification are
    covered
  • but also phrase boundaries are more likely to be
    exceeded
  • durch Frauen helfen → durch X (Y) Frauen helfen

77
Results of Strategy 3 wtick wtjck wtlcm
78
Conclusion
  • There is no single best strategy to extract an
    optimal set of candidate data from a corpus.
  • You need to know at least some structural and
    distributional properties of the phenomena you
    are searching for.
  • Preparation of candidate data influences
    distributions.
  • Distributional properties determine the outcome
    of AMs.
  • Know the distributional assumptions underlying
    the AMs you use.