1
8. Lexical Acquisition
  • Manning & Schütze, Foundations of Statistical
    NLP

2
Contents
  • Lexical Acquisition
  • Evaluation Measures
  • Verb Subcategorization
  • Attachment Ambiguity
  • Selectional Preferences
  • Semantic Similarity
  • The Role of Lexical Acquisition in Statistical NLP

3
Lexical Acquisition
  • Acquisition of more complex syntactic and
    semantic properties of words (than in chapter 5)
  • Selectional preferences
  • Subcategorization frames
  • Semantic categorization
  • Develop algorithms and statistical techniques
    for filling the holes in existing machine-readable
    dictionaries (MRDs) by looking at the occurrence
    patterns of words in corpora

4
Evaluation Measures
  • The ultimate demonstration of success is showing
    improved performance on a task
  • Precision and recall
  • Precision = tp / (tp + fp)
  • Recall = tp / (tp + fn)
  • F measure = 2PR / (P + R), the harmonic mean of
    precision P and recall R
  • Fallout = fp / (fp + tn)
  • A measure of how hard it is to build a system
    that produces few false positives
  • ROC curve: the recall/fallout tradeoff
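
A minimal sketch of these measures in Python; the counts in
the example are invented:

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_measure(tp, fp, fn):
        # harmonic mean of precision and recall
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r)

    def fallout(fp, tn):
        return fp / (fp + tn)

    # Hypothetical system output: 50 tp, 10 fp, 25 fn, 915 tn
    print(precision(50, 10))      # 0.833...
    print(recall(50, 25))         # 0.666...
    print(f_measure(50, 10, 25))  # 0.740...
    print(fallout(10, 915))       # 0.0108...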

5
Verb Subcategorization
  • Verbs subcategorize for different syntactic
    categories
  • Verbs express their semantic arguments with
    different syntactic means
  • Subcategorization frame: a particular set of
    syntactic categories that a verb can appear with
    (examples in Table 8.2)
  • Donated a large sum of money to the church
  • Gave the church a large sum of money

6
Some subcategorization frames (Table 8.2)
7
Verb Subcategorization
  • Verb subcategorization frame helps parsing
  • She told the man where Peter grew up.
  • She found the place where Peter grew up.
  • Unfortunately, most dictionaries do not contain
    information on subcategorization frames
  • Do not cover all subcategorization frames
  • Do not have quantitative information
  • Acquisition of subcategorization information from
    corpora is necessary
  • Cope with the productivity of language
  • Supplement dictionaries

8
Verb Subcategorization
  • Learning algorithm proposed by Brent (1993):
    Lerner
  • Suppose we want to decide, based on corpus
    evidence, whether verb v takes frame f. Lerner
    makes this decision in two steps
  • Cues: define a regular pattern which indicates
    the presence of the frame with high certainty
  • Hypothesis testing: initially assume that the
    frame is not appropriate for the verb (null
    hypothesis H0); this hypothesis is rejected if the
    cues indicate with high probability that H0 is
    wrong
  • Cues are regular patterns used to find a
    subcategorization frame
  • Cue for frame NP (direct object):
    (OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
  • a. I greet Peter,
  • b. I came Thursday, before the storm started
    (the cue also fires here, a false positive; see
    the sketch below)
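
A rough sketch of this cue detection; the pronoun lists and
the punctuation/conjunction sets below are illustrative
assumptions, not Brent's exact lexicon:

    # Cue for frame NP: (OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
    OBJ = {"me", "him", "her", "us", "them"}       # object pronouns
    SUBJ_OBJ = {"you", "it"}                       # subject-or-object pronouns
    PUNC = {",", ".", ";", ":", "?", "!"}
    CC = {"if", "before", "after", "because", "while"}

    def is_cap(tok):
        return tok[:1].isupper()

    def cue_np(tokens, i):
        """True if the two tokens after position i (the verb) match the cue."""
        if i + 2 >= len(tokens):
            return False
        first, second = tokens[i + 1], tokens[i + 2]
        return ((first.lower() in OBJ or first.lower() in SUBJ_OBJ
                 or is_cap(first))
                and (second in PUNC or second.lower() in CC))

    print(cue_np("I greet Peter ,".split(), 1))                     # True
    print(cue_np("I came Thursday , before the storm".split(), 1))  # True (false positive)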

9
Verb Subcategorization
  • Hypothesis testing
  • If p_E < α, then we reject H0 (permit f_j as a
    frame of v_i)
  • Precision: close to 100% (when α = 0.02);
    Recall: 47% to 100%

p_E = Σ_{r=m}^{n} C(n, r) ε_j^r (1 − ε_j)^(n − r), where
n: number of times v_i occurs; m = C(v_i, c_j): number of
times v_i occurs with cue c_j; v_i(f_j) = 0: verb v_i does
not permit frame f_j; ε_j: error rate for cue c_j
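
A sketch of this test using scipy's binomial distribution;
the verb count, cue count, and error rate below are invented:

    from scipy.stats import binom

    def brent_test(n, m, eps_j, alpha=0.02):
        """p_E = P(cue fires >= m times in n occurrences | H0)."""
        p_e = binom.sf(m - 1, n, eps_j)   # P(X >= m) under H0
        return p_e, p_e < alpha           # reject H0 -> permit the frame

    # Verb seen 200 times, cue fired 10 times, cue error rate 2%
    p_e, permits = brent_test(200, 10, 0.02)
    print(p_e, permits)   # small p_E -> the verb permits the frame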
10
Verb Subcategorization
  • Manning's (1993) method
  • Use a tagger and run the cue detection on the
    output of the tagger
  • How reliable a cue is does not matter!
  • Even an unreliable indicator can help to
    determine the subcategorization frame of a verb
    reliably, if it occurs often enough
  • Allowing low-reliability cues and additional cues
    based on tagger output increases the number of
    cues significantly
  • Result sample: Table 8.3
  • 3 errors:
  • 2 PPs: bridge between, retire in
  • Assigning the frame NP to remark ("And .. same
    problems," Mr. Smith remarked)
  • Precision: 90% (complete set of 40 verbs);
    Recall: 43%

11
Attachment Ambiguity
  • Moscow sent more than 100,000 soldiers into
    Afghanistan
  • How to solve this?
  • Simple model
  • Uses lexical preferences (lexical statistics):
    co-occurrence counts between v and prep, and
    between n and prep
  • Ignores the bias for noun attachment in cases
    where a preposition is equally compatible with
    the verb and the noun
  • Chrysler confirmed that it would end its troubled
    venture with Maserati

12
Attachment Ambiguity
  • Hindle and Rooth (1993)
  • Event space: Vt N ... PP (a transitive verb, its
    object noun, and a following PP)
  • Only model the behavior of the first PP
  • Determine attachment counts from an unlabeled
    corpus
  • Build an initial model by counting all
    unambiguous cases
  • She sent him into the nursery to gather up his
    toys (obvious attachment)
  • Apply the initial model to all ambiguous cases
    and assign them to the appropriate count if the
    likelihood ratio λ exceeds a threshold (a sketch
    of λ follows below)
  • Divide the remaining ambiguous cases evenly
    between the counts
  • P = 80%, R = 100%; P = 91.7%, R = 55.2% (with a
    λ threshold of 3.0)
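
A sketch of the likelihood ratio, assuming the formulation
λ(v, n, p) = log2[ P(VA_p=1|v) · P(NA_p=0|n) / P(NA_p=1|n) ]
with maximum-likelihood count estimates; all counts below are
invented:

    import math

    def attachment_lambda(c_vp, c_v, c_np, c_n):
        """c_vp = C(v, p), c_v = C(v), c_np = C(n, p), c_n = C(n)."""
        p_va = c_vp / c_v      # P(verb takes a PP headed by p)
        p_na = c_np / c_n      # P(noun takes a PP headed by p)
        return math.log2(p_va * (1 - p_na) / p_na)

    # "sent ... soldiers into ...": hypothetical counts
    lam = attachment_lambda(c_vp=86, c_v=300, c_np=1, c_n=400)
    print(lam)   # large positive -> attach the PP to the verb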

13
Attachment Ambiguity
  • Limitations of the models
  • Only consider the identity of the preposition,
    the noun, and the verb
  • Sometimes other information is important (e.g.,
    the noun inside the PP)
  • I examined the man with a stethoscope
  • I examined the man with a broken leg
  • Consider only the most basic case of a PP
    immediately after an NP object, modifying either
    the immediately preceding n or v
  • The board approved its acquisition by Royal
    Trustco Ltd. of Toronto for $27 a share at
    its monthly meeting
  • Other attachment issues
  • Attachment ambiguity in noun compounds
  • Door bell manufacturer: left-branching
  • Woman aid worker: right-branching

14
Attachment Ambiguity
  • A large proportion of PPs exhibit indeterminacy
    with respect to attachment
  • We have not signed a settlement agreement with
    them
  • Motivates us to explore new ways of determining
    the contribution a PP makes to the meaning of a
    sentence
  • Suggests that it may not be a good idea to
    require that PP meaning always be mediated
    through an NP or a VP, as current syntactic
    formalisms do

15
Selectional Preference
  • Selectional restriction
  • Verbs prefer arguments of a particular type
  • Ex) Objects of the verb eat tend to be food items
  • Subjects of bark tend to be dogs
  • A preference, not a rule
  • Why is it important?
  • Infer the meaning of unknown words
  • Susan had never eaten a fresh durian before.
  • Rank the possible parses of a sentence
  • Give higher scores to parses where the verb has
    natural arguments
  • (Here, we'll consider only the case of
    verb-direct object)

16
Selectional Preference
  • Resnik's model
  • Two notions
  • Selectional preference strength S(v)
  • Measures how strongly the verb constrains its
    direct object
  • The KL divergence between the prior distribution
    of direct objects and the distribution of direct
    objects of the verb we are trying to characterize

S(v) = D( P(C|v) || P(C) ) = Σ_c P(c|v) log ( P(c|v) / P(c) ), where
P(C): overall probability distribution of noun classes;
P(C|v): probability distribution of noun classes in the
direct object position of v
17
Selectional Preference Strength (based on
hypothetical data)
18
Selectional Preference
  • Selectional association A(v, c)
  • Association between a verb and a class
  • The proportion that its summand contributes to
    the overall preference strength S(v):
    A(v, c) = P(c|v) log ( P(c|v) / P(c) ) / S(v)
  • Association strength for a noun: take A(v, c) of
    the class c that maximizes it among the classes
    the noun belongs to
  • Example
  • A(eat, food) = 1.08; A(find, action) = −0.13

19
Selectional Preference
  • Estimating the probability P(c|v) = P(v, c) / P(v),
    with P(v) = C(v) / N and
    P(v, c) ≈ (1/N) Σ_{n ∈ words(c)} C(v, n) / classes(n)
  • N: total number of verb-object pairs in the
    corpus
  • words(c): set of all nouns in class c
  • classes(n): number of noun classes that contain n
    as a member
  • C(v, n): number of verb-object pairs with v as
    the verb and n as the head of the object NP
  • Distributing C(v, n) evenly over n's classes
    bypasses the problem of disambiguating nouns (see
    the toy sketch below)
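
A toy end-to-end sketch of this model; the noun-class
inventory and the verb-object pairs are invented (Resnik uses
WordNet classes), and the prior P(c) is estimated from the
same pairs:

    import math
    from collections import defaultdict

    CLASSES = {
        "food":   {"cake", "banana", "durian"},
        "people": {"man", "nurse"},
    }

    def classes_of(n):
        return [c for c, members in CLASSES.items() if n in members]

    def estimate(pairs):
        # pairs: (verb, head-of-object-NP) tokens from a corpus
        N = len(pairs)
        c_v = defaultdict(int)      # C(v)
        p_vc = defaultdict(float)   # P(v, c), distributing C(v, n) over n's classes
        for v, n in pairs:
            c_v[v] += 1
            for c in classes_of(n):
                p_vc[(v, c)] += 1.0 / (N * len(classes_of(n)))
        # prior P(c): marginalize P(v, c) over verbs, then renormalize
        p_c = defaultdict(float)
        for (v, c), p in p_vc.items():
            p_c[c] += p
        z = sum(p_c.values())
        p_c = {c: p / z for c, p in p_c.items()}
        return c_v, p_vc, p_c, N

    def preference(v, c_v, p_vc, p_c, N):
        # S(v) = sum_c P(c|v) log2(P(c|v) / P(c)); A(v, c) = summand / S(v)
        p_v = c_v[v] / N
        s, terms = 0.0, {}
        for c in p_c:
            p_cv = p_vc.get((v, c), 0.0) / p_v
            if p_cv > 0:
                terms[c] = p_cv * math.log2(p_cv / p_c[c])
                s += terms[c]
        return s, {c: t / s for c, t in terms.items()}

    pairs = [("eat", "cake"), ("eat", "banana"), ("eat", "durian"),
             ("see", "man"), ("see", "cake"), ("see", "nurse")]
    c_v, p_vc, p_c, N = estimate(pairs)
    print(preference("eat", c_v, p_vc, p_c, N))   # S(eat) and A(eat, c) per class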

20
Selectional Preference
  • Association strength distinguishing a verb's
    plausible and implausible objects (actual data):
    Table 8.6
  • Left half: typical objects
  • Right half: atypical objects
  • In most cases, association strength A(v, n) is a
    good predictor of object typicality
  • Most errors the model makes are due to the fact
    that it performs a form of disambiguation, by
    choosing the highest A(v, c) for A(v, n)
  • Implicit object alternation prediction
  • Mike ate the cake.
  • Mike ate.
  • The more constraints a verb puts on its object,
    the more likely it is to permit the
    implicit-object construction
  • Selectional preference strength is seen as the
    more basic phenomenon which explains the
    occurrence of implicit objects as well as
    association strength

21
Semantic Similarity
  • Acquisition of meaning
  • The final goal of lexical acquisition
  • But how do we represent meaning (so that it can
    be operationally used by an automatic system)?
  • Semantic similarity
  • Automatically acquiring a relative measure of how
    similar a new word is to known words is much
    easier than determining what the meaning actually
    is
  • Most often used for generalization, under the
    assumption that semantically similar words behave
    similarly
  • Also used for query expansion: astronaut →
    cosmonaut
  • Used for k nearest neighbors classification

22
Semantic Similarity
  • Notions of semantic similarity
  • Extension of synonymy or near-synonymy
  • Dwelling / abode
  • Two words from the same semantic domain or topic
  • Doctor, nurse, fever
  • Contextually interchangeable words (Miller and
    Charles 1991)
  • Words similar to the appropriate sense (for
    ambiguous words)
  • Litigation / suit (the legal sense, not clothes)
  • Similarity measures
  • Vector space measures
  • Probabilistic measures

23
Semantic Similarity
  • Vector space measures
  • Conceptualize semantic similarity geometrically:
    the words whose semantic similarity is to be
    computed are represented as vectors in a
    multi-dimensional space
  • Document-word matrix (Figure 8.3)
  • Words are similar if they occur in the same
    documents
  • Word-word matrix (Figure 8.4)
  • Words are similar when they co-occur with the
    same words
  • Modifier-head matrix (Figure 8.5)
  • Heads are similar when they are modified by the
    same modifiers
  • Different spaces get at different types of
    semantic similarity
  • Document-word and word-word spaces capture
    topical similarity
  • The modifier-head space captures a more
    fine-grained similarity
  • Similarity of rows
  • In matrix A (document-word): document similarity
  • In matrix C (modifier-head): modifier similarity

24
Semantic Similarity
  • Similarity measures (made precise) for binary
    vectors, viewing each vector as the set X of
    dimensions where it is 1
  • Matching coefficient: |X ∩ Y|
  • Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
  • Takes into account the length of the vectors and
    the total number of non-zero entries
  • Jaccard (or Tanimoto) coefficient:
    |X ∩ Y| / |X ∪ Y|
  • Penalizes a small number of shared entries more
    than the Dice coefficient does
  • Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
  • Cosine: |X ∩ Y| / sqrt(|X| · |Y|)
  • Penalizes less in cases where the number of
    non-zero entries is very different
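
The same measures as Python set operations, treating each
binary vector as the set of dimensions where it is 1; the
context sets in the example are invented:

    import math

    def matching(x, y): return len(x & y)
    def dice(x, y):     return 2 * len(x & y) / (len(x) + len(y))
    def jaccard(x, y):  return len(x & y) / len(x | y)
    def overlap(x, y):  return len(x & y) / min(len(x), len(y))
    def cosine(x, y):   return len(x & y) / math.sqrt(len(x) * len(y))

    x = {"Soviet", "spacewalking", "orbit"}             # contexts of cosmonaut
    y = {"American", "spacewalking", "orbit", "NASA"}   # contexts of astronaut
    print(dice(x, y), jaccard(x, y), cosine(x, y))
    # 0.571..., 0.4, 0.577...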

25
Semantic Similarity
  • Real-valued vector spaces
  • A more powerful representation than binary
    vector space
  • Cosine for real-valued vectors:
    cos(x, y) = Σ x_i y_i / ( sqrt(Σ x_i²) sqrt(Σ y_i²) )
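
For real-valued vectors the cosine is the normalized dot
product; a minimal example:

    import math

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return dot / (nx * ny)

    print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 1.0]))  # ~0.976, nearly parallel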

26
Semantic Similarity
  • Table 8.8: cosine similarities computed for the
    NYT corpus
  • Word-by-word matrix (20,000 by 1,000)
  • Co-occurrence: two words occurring within 25
    words of each other
  • Summary
  • Vector space measures have been used in IR for a
    long time
  • Advantages
  • Intuitively simple and easy to visualize
  • Computationally efficient
  • Disadvantages
  • All operate on binary data except for the cosine
  • The cosine has its own problems
  • It assumes a Euclidean space
  • A Euclidean space is not a well-motivated choice
    if the vectors we are dealing with are vectors of
    probabilities or counts

27
Semantic Similarity
  • Probabilistic measures
  • Transform semantic similarity into the similarity
    of two probability distributions
  • Transform the matrices of counts in Figures 8.3,
    8.4 and 8.5 into matrices of conditional
    probabilities
  • Ex) C(American, astronaut) →
    P(American | astronaut) = 1/2 = 0.5

(Part of Figure 8.4)
28
Semantic Similarity
  • Measures of dissimilarity between probability
    distributions: Table 8.9 (Dagan et al. 1997b)
  • KL divergence: D(p || q) = Σ_i p_i log (p_i / q_i)
  • Measures how much information is lost if we
    assume distribution q when the true distribution
    is p
  • Practical problems
  • Becomes infinite when q_i = 0 and p_i ≠ 0
  • Asymmetric: D(p || q) ≠ D(q || p)
  • Information radius (IRad):
    IRad(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2)
  • Measures how much information is lost if we
    describe the two words that correspond to p and q
    with their average distribution

29
Semantic Similarity
  • L1 (Manhattan) norm: L1(p, q) = Σ_i |p_i − q_i|
  • A measure of the expected proportion of events
    that are going to be different between the
    distributions p and q
  • Example
  • p1 = P(Soviet | cosmonaut) = 0.5
  • p2 = 0
  • p3 = P(spacewalking | cosmonaut) = 0.5
  • q1 = 0
  • q2 = P(American | astronaut) = 0.5
  • q3 = P(spacewalking | astronaut) = 0.5
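
A sketch of the three measures, applied to the cosmonaut /
astronaut distributions above:

    import math

    def kl(p, q):
        # D(p || q); infinite if some q_i = 0 where p_i > 0
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def irad(p, q):
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return kl(p, m) + kl(q, m)

    def l1(p, q):
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    p = [0.5, 0.0, 0.5]   # cosmonaut: Soviet, American, spacewalking
    q = [0.0, 0.5, 0.5]   # astronaut
    print(irad(p, q))     # 1.0
    print(l1(p, q))       # 1.0
    # kl(p, q) is undefined/infinite here: q assigns 0 where p does not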

30
Semantic Similarity
  • Comparison between the three dissimilarity
    measures
  • Test: a selectional preference problem
  • Find the appropriate verb as a predicate for a
    given noun
  • Example: use the similarity of make and take to
    determine that make is the right verb to use with
    plans
  • Dagan et al. (1997b) show that IRad consistently
    performs better than KL and L1

(β, in the similarity transform sim(p, q) = 10^(−β · IRad(p, q)),
can be tuned for optimal performance)
31
The Role of Lexical Acquisition in Statistical NLP
  • Lexical acquisition plays a key role in
    statistical NLP
  • Reasons
  • The cost of building lexical resources manually
  • The need to collect quantitative information
    (humans are bad at this)
  • Many lexical resources were designed for human
    consumption
  • The inherent productivity of language
  • Lexical coverage
  • Sampson's (1989) analysis
  • Tested a 45,000-word corpus against a dictionary
    with 70,000 entries
  • 3% of tokens were not in the dictionary (Table
    8.10)
  • Half of the missing words were proper nouns
  • Lexical acquisition started to take center stage
    in the late '80s

32
The Role of Lexical Acquisition in Statistical NLP
  • Now and in the future
  • Look harder for sources of prior knowledge that
    can constrain the process of lexical acquisition
  • Linguistic theory can be a source of prior
    knowledge
  • Utilize encyclopedias, thesauri, gazetteers,
    collections of technical vocabulary, and any
    other reference work or database, in addition to
    dictionaries and text corpora