Three Approaches to Unsupervised WSD

Transcript and Presenter's Notes
1
Three Approaches to Unsupervised WSD
  • Dmitriy Dligach

2
Unsupervised WSD
  • No training corpora needed
  • No predefined tag set needed
  • Three approaches
  • Context-group Discrimination (Schütze, 1998)
  • Graph-based Algorithms (Agirre et al., 2006)
  • HyperLex (Véronis, 2004)
  • PageRank (Brin and Page, 1998)
  • Predominant Sense (McCarthy, 2006)
  • Thesaurus generation
  • Method in (Lin, 1998)
  • Earlier version in (Hindle, 1990)

3
Context-group Discrimination Algorithm
  • Sense Representations
  • Generate word vectors
  • Generate context vectors (from co-occurrence
    matrix)
  • Generate sense vectors (by clustering context
    vectors)
  • Disambiguate by computing proximity
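
The last step amounts to nearest-centroid classification. A minimal sketch in Python (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def disambiguate(context_vec, sense_vecs):
    """Assign a context to the sense whose vector (cluster centroid)
    is closest in cosine similarity."""
    sims = sense_vecs @ context_vec / (
        np.linalg.norm(sense_vecs, axis=1) * np.linalg.norm(context_vec))
    return int(np.argmax(sims))
```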

4
Word Vectors
  • Each word wi is represented by a vector of co-occurrence counts
  • Two strategies to select dimensions
  • Local: select words from the contexts of the ambiguous word within a 50-word window
  • Either the 1,000 most frequent words, or
  • Use the χ² measure of dependence to pick 1,000 words
  • Global: select from the entire corpus regardless of the target word
  • Select the 20,000 most frequent words as features
  • 2,000 as dimensions
  • 20,000-by-2,000 co-occurrence matrix
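
As an illustration, a minimal sketch of building such a matrix from a token stream (a ±25-token window approximating the 50-word window; all names are assumptions for the example):

```python
import numpy as np

def cooccurrence_matrix(tokens, features, dims, window=25):
    """Count how often each feature word (row) co-occurs with each
    dimension word (column) within a +/-window token context."""
    f_idx = {w: i for i, w in enumerate(features)}  # e.g. 20,000 words
    d_idx = {w: j for j, w in enumerate(dims)}      # e.g. 2,000 words
    M = np.zeros((len(features), len(dims)))
    for i, tok in enumerate(tokens):
        if tok not in f_idx:
            continue
        lo = max(0, i - window)
        for k, ctx in enumerate(tokens[lo:i + window + 1]):
            if ctx in d_idx and lo + k != i:   # skip the position itself
                M[f_idx[tok], d_idx[ctx]] += 1
    return M
```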

5
Context Vectors
  • Word vectors conflate the senses of ambiguous words
  • Represent a context as the centroid of the word vectors of the words in it
  • Weight the word vectors by IDF
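
A sketch of the computation under those definitions (word_vecs maps a word to its row of the co-occurrence matrix, idf maps a word to its inverse document frequency; both names are illustrative):

```python
import numpy as np

def context_vector(context_words, word_vecs, idf):
    """Centroid of the IDF-weighted word vectors of the context words;
    IDF downweights frequent, uninformative words."""
    vecs = [idf[w] * word_vecs[w] for w in context_words if w in word_vecs]
    return np.mean(vecs, axis=0)
```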

6
Sense Vectors
  • Cluster the approx. 2,000 context vectors collected for each word
  • Use a combination of group-average agglomerative clustering (GAAC) and EM
  • Choose a random sample of 50 (of the ~2,000) and cluster it using GAAC, which is O(n²)
  • Centroids of the resulting clusters become the input to EM
  • The overall procedure is still linear
  • Perform an SVD on the context vectors
  • Re-represent context vectors by their values on the 100 principal dimensions
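
A rough sketch of this Buckshot-style procedure, using SciPy's group-average linkage for GAAC and hard-assignment iterations in place of the paper's full EM (the sample size and dimension count follow the slide; everything else is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def induce_sense_vectors(contexts, n_senses, sample_size=50,
                         svd_dims=100, iters=10):
    # SVD: re-represent contexts by their values on the principal dimensions
    _, _, Vt = np.linalg.svd(contexts, full_matrices=False)
    X = contexts @ Vt[:svd_dims].T

    # GAAC, O(n^2), on a small random sample of the contexts
    sample = X[np.random.choice(len(X), sample_size, replace=False)]
    labels = fcluster(linkage(sample, method='average'),
                      n_senses, criterion='maxclust')
    centroids = np.array([sample[labels == k].mean(axis=0)
                          for k in range(1, n_senses + 1)])

    # EM-style refinement over all contexts, seeded with the GAAC centroids
    # (hard assignments here for brevity; the paper uses EM proper)
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        centroids = np.array([X[assign == k].mean(axis=0)
                              if (assign == k).any() else centroids[k]
                              for k in range(n_senses)])
    return centroids  # one sense vector per induced sense
```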

7
Evaluation
  • Hand-labeled corpus of 10 naturally ambiguous and
    10 artificial words
  • Throw out low-frequency senses, leaving only the 2 most frequent
  • Number of clusters
  • 2 clusters: use the gold standard to evaluate
  • 10 clusters: no gold standard, so use purity (see the sketch below)
  • Sense-based IR
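
Purity, under its usual definition (each cluster is credited with its majority gold sense), can be sketched as:

```python
from collections import Counter

def purity(cluster_labels, gold_labels):
    """Fraction of instances covered by their cluster's majority sense."""
    by_cluster = {}
    for c, g in zip(cluster_labels, gold_labels):
        by_cluster.setdefault(c, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(gold_labels)
```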

8
Results (highlights)
  • Overall performance for pseudo-words is higher than for naturally ambiguous words
  • Some pseudowords (wide range/consulting firm) and words (space in the area and volume senses) show poor performance due to being topically amorphous
  • IR evaluation
  • Vector-space model with senses as dimensions
  • 7.4% improvement on the TREC-1 collection

9
Graph-based Algorithms
  • Build a co-occurrence matrix
  • View it as a graph
  • Small-world properties
  • Most nodes have few connections
  • Few are highly connected
  • Look for densely populated regions
  • Known as High-Density Components
  • Map ambiguous instances to one of these regions

10
A Sample Co-Occurrence Graph
  • barrage: dam, play-off, barrier, roadblock, police cordon, barricade

11
Algorithm Details
  • Nodes correspond to words
  • Edges reflect the degree of semantic association
    between words
  • Model with conditional probabilities
  • wA,B = 1 - max(p(A|B), p(B|A))
  • Detect high-density components
  • Sort nodes by their degree
  • Take the top one (root hub) and remove along with
    all its neighbors (hoping to eliminate the entire
    component)
  • Iterate until all the high-density components are
    found
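
A minimal sketch of the edge weighting and the greedy hub extraction (the full algorithm also applies the frequency and weight thresholds mentioned on the evaluation slide; names are illustrative):

```python
def edge_weight(count_ab, count_a, count_b):
    """w(A,B) = 1 - max(p(A|B), p(B|A)); strongly associated pairs get
    small weights, i.e. short distances in the graph."""
    return 1.0 - max(count_ab / count_b, count_ab / count_a)

def root_hubs(adjacency):
    """Greedily take the highest-degree node as a root hub and remove it
    together with its neighbors (ideally a whole high-density component)."""
    adj = {n: set(nbrs) for n, nbrs in adjacency.items()}
    hubs = []
    while any(adj.values()):
        hub = max(adj, key=lambda n: len(adj[n]))
        hubs.append(hub)
        removed = adj.pop(hub) | {hub}
        for n in removed:
            adj.pop(n, None)
        for n in adj:
            adj[n] -= removed
    return hubs
```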

12
Example (figure)
13
Disambiguation
  • Delineate the high-density components
  • They need to be attached back to their root hubs
  • Attach the target word to all root hubs
  • Compute the minimum spanning tree (MST) of the graph
  • Map each ambiguous instance to one of the components
  • Examine each word in the instance's context
  • Compute the distance from each of these words to each root hub (in the MST each word falls under exactly one hub)
  • Compute the total score for each hub (see the sketch below)
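
One way to realize this, sketched with Prim's algorithm and a score that lets nearer context words count more (the exact scoring function is a reading of the paper, and all names are illustrative):

```python
import heapq
from collections import defaultdict

def prim_mst(weights, root):
    """MST as parent pointers; `weights` maps (a, b) node pairs (both
    orientations present) to edge distances."""
    neighbors = defaultdict(list)
    for (a, b), w in weights.items():
        neighbors[a].append((w, b))
    parent = {root: None}
    frontier = [(w, root, b) for w, b in neighbors[root]]
    heapq.heapify(frontier)
    while frontier:
        w, a, b = heapq.heappop(frontier)
        if b in parent:
            continue
        parent[b] = a
        for w2, c in neighbors[b]:
            if c not in parent:
                heapq.heappush(frontier, (w2, b, c))
    return parent

def best_hub(context_words, parent, target, weights):
    """Each context word falls under exactly one root hub: the child of
    the target on its path up the tree. Nearer words contribute more."""
    scores = defaultdict(float)
    for w in context_words:
        if w not in parent or w == target:
            continue
        node, dist = w, 0.0
        while parent[node] != target:        # climb until just below target
            dist += weights[(parent[node], node)]
            node = parent[node]
        scores[node] += 1.0 / (1.0 + dist)   # `node` is now a root hub
    return max(scores, key=scores.get) if scores else None
```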

14
PageRank
  • Based on PageRank (Brin and Page, 1998), adapted for weighted graphs
  • An alternative way to rank nodes
  • Algorithm
  • Initialize nodes to random values
  • Compute PageRank
  • Iterate a fixed number of times (see the sketch below)
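
A dense-matrix sketch of PageRank adapted to edge weights (the damping factor 0.85 is the conventional choice, assumed here; W[i, j] is the weight of the edge from node i to node j):

```python
import numpy as np

def weighted_pagerank(W, damping=0.85, iters=30):
    """Each node spreads its score over its out-edges in proportion to
    their weights; run a fixed number of iterations, as on the slide."""
    W = W.astype(float)
    n = W.shape[0]
    rowsum = W.sum(axis=1, keepdims=True)
    T = np.divide(W, rowsum, out=np.zeros_like(W), where=rowsum > 0)
    pr = np.random.rand(n)                 # random initial values
    pr /= pr.sum()
    for _ in range(iters):
        pr = (1 - damping) / n + damping * (T.T @ pr)
    return pr
```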

15
Evaluation
  • First need to optimize 10 parameters, among them
  • P1. Minimum frequency of edges (occurrences)
  • P2. Minimum frequency of vertices (words)
  • P3. Edges with weights above this value are removed
  • Train on Senseval2 using unsupervised metrics
  • Entropy, Purity, and F-score
  • Evaluate on Senseval3
  • Lexical sample data
  • 10-point gain over the MFS baseline
  • Beats a supervised system with lexical features by 1 point
  • All-words task
  • Little training data
  • Supervised systems barely beat the MFS baseline
  • This system is less than 1 point below the best
    system
  • The difference in performance is not
    statistically significant

16
Finding Predominant Sense
  • Predominant senses in WordNet are derived from SemCor (a relatively small subset of the Brown corpus)
  • Idiosyncrasies
  • tiger (audacious person, not the animal)
  • star (celebrity or celestial body, depending on the context)

17
Distributional Similarity
  • Nouns that occur in object position of the same verbs are similar (e.g. beer and vodka as objects of to drink)
  • Can automatically generate a thesaurus-like neighbor list for the target word (Hindle, 1990; Lin, 1998)
  • w0 (s0), w1 (s1), …, wn (sn): neighbors paired with their similarity scores
  • The neighbor list conflates the different senses of the target word
  • The quantity and quality of the neighbors must relate to the predominant sense
  • Need to compute the proximity of each neighbor to each of the senses of the target word (e.g. with Lesk or JCN similarity)

18
Algorithm
  • w: the target word
  • Nw = {n1, n2, …, nk}: the ordered set of the top k most similar neighbors of the target word
  • dss(w, n1), dss(w, n2), …, dss(w, nk): the distributional similarity score of each of the k neighbors
  • wsi ∈ senses(w): the senses of the target word
  • wnss(wsi, nj): the WordNet similarity score between sense i of the target word and the sense of neighbor nj that maximizes this score
  • PrevalenceScore(wsi): ranks sense i of the target word as a candidate predominant sense (see the sketch below)
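
Combining these, the prevalence score sums each neighbor's distributional similarity, apportioned across the target's senses by normalized WordNet similarity: PrevalenceScore(wsi) = Σj dss(w, nj) · wnss(wsi, nj) / Σi′ wnss(wsi′, nj). A sketch with dss and wnss passed in as callables (names are illustrative):

```python
def prevalence_scores(senses, neighbors, dss, wnss):
    """Each neighbor votes with its distributional similarity dss(nj),
    split across the senses in proportion to wnss(wsi, nj)."""
    scores = {s: 0.0 for s in senses}
    for n in neighbors:
        norm = sum(wnss(s, n) for s in senses)
        if norm == 0:
            continue
        for s in senses:
            scores[s] += dss(n) * wnss(s, n) / norm
    return scores  # argmax gives the predicted predominant sense
```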

19
Experiment 1
  • Derive a thesaurus from the BNC
  • SemCor experiments
  • Metric: accuracy of finding the MFS
  • Metric: WSD accuracy
  • Baseline: the accuracy of a random sense assignment
  • The upper bound for the WSD task is 67%
  • Both experiments beat the random baseline (54% and 48%, respectively)
  • Hand examination
  • Some errors are due to genre and time-period variations

20
Experiment 2
  • Use the Senseval2 all-words task
  • Label each word with the first sense computed
  • Automatically
  • According to SemCor
  • From the Senseval2 data itself (upper bound)
  • Automatic precision/recall are only a few points lower than SemCor's

21
Experiment 3
  • Investigate how the MFS changes across domains
  • SPORTS and FINANCE domains of the Reuters corpus
  • No hand-annotated data, so hand-examine the output
  • Most words displayed the expected change in MFS
  • tie changes from the draw sense to the affiliation sense

22
Discussion: Algorithms
  • Context
  • Bag-of-words: Schütze and Agirre et al.
  • Syntactic: McCarthy et al.
  • Is bag-of-words sufficient?
  • E.g. for topically amorphous words
  • Co-occurrence
  • Co-occurrence matrix: Schütze and Agirre et al.
  • Used to look for similar nouns: McCarthy et al.
  • Order of co-occurrence
  • First order: all three papers
  • Second order: Schütze and McCarthy et al.
  • Higher order: Agirre et al.
  • Advantages of the graph-based methods
  • PageRank computes global rankings
  • MST links all nodes to the root

23
Discussion: Evaluation
  • Testbeds: little ground for cross-comparison
  • Schütze: his own corpus
  • Agirre et al.: train parameters on Senseval2 and test on Senseval3 data
  • McCarthy et al.: test on SemCor, Senseval2, and Reuters
  • Methodology
  • Map clusters to the gold standard (Schütze and Agirre et al.)
  • Unsupervised evaluation (Schütze and Agirre et al.)
  • Compare to various baselines (MFS, Lesk, random)
  • Use an application (Schütze)