Distributional Clustering of English Words - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Distributional Clustering of English Words
  • Fernando Pereira - AT&T Bell Laboratories
  • Naftali Tishby - Dept. of Computer Science, Hebrew
    University
  • Lillian Lee - Dept. of Computer Science, Cornell
    University
  • Presenter - Juan Ramos, Dept. of Computer Science,
    Rutgers University, juramos@cs.rutgers.edu

2
Overview
  • Purpose: evaluate a method for clustering words
    according to their distribution in particular
    syntactic contexts.
  • Methodology: find lowest-distortion sets of
    clusters of words to determine models of word
    co-occurrence.

3
Applications
  • Scientific POV: lexical acquisition of words
  • Practical POV: classification addresses data
    sparseness in grammar models.
  • Applicable to clustering in a large corpus of documents

4
Definitions
  • Context: function of a given word in its sentence.
  • E.g., a noun as a direct object
  • Sense class: hidden model describing word
    association tendencies
  • Mix of cluster and cluster probability given a
    word
  • Cluster: probabilistic concept of a sense class

5
Problem Setting
  • Restrict the problem to verbs (V) and nouns (N) in the
    main verb-direct object relationship
  • f(v, n): frequency of occurrence of the verb-noun
    pair
  • Text must be pre-formatted to fit specifications
  • For a given noun n, the conditional distribution is
    p(v | n) = f(v, n) / sum over v' of f(v', n)
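A minimal sketch of this estimate, using hypothetical pair counts in place of the parsed verb-direct-object data:

```python
from collections import Counter

# Hypothetical (verb, noun) pairs standing in for the parsed
# verb-direct-object data described above.
pairs = [("fire", "gun"), ("fire", "gun"), ("throw", "gun"),
         ("fire", "employee"), ("hire", "employee")]

f = Counter(pairs)  # f(v, n): frequency of each (verb, noun) pair

def p_v_given_n(v, n):
    """p(v | n) = f(v, n) / sum over v' of f(v', n)."""
    total = sum(cnt for (v2, n2), cnt in f.items() if n2 == n)
    return f[(v, n)] / total if total else 0.0

print(p_v_given_n("fire", "gun"))  # 2/3: "gun" occurs twice with "fire"
```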

6
Problem Setting cont.
  • Goal: create a set C of clusters and probabilities
    p(c | n).
  • Each c in C is associated with a cluster centroid p_c
  • p_c: average of the distributions p_n over all n in N.

7
Distributional Similarity
  • Given two distributions p, q, the KL distance is
    D(p || q) = sum over x of p(x) log(p(x) / q(x))
  • D(p || q) = 0 implies p = q
  • Small D(p || q) implies the two distributions are
    likely instances of a centroid p_c.
  • D(p || q) measures the loss of information from
    using q in place of p.
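The KL distance above can be sketched directly, assuming the usual conventions that terms with p(x) = 0 contribute nothing and that the divergence is infinite when q lacks support where p has it:

```python
import math

def kl(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)).
    Terms with p(x) == 0 contribute nothing; q(x) == 0 where
    p(x) > 0 makes the divergence infinite."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

p = {"a": 0.5, "b": 0.5}
print(kl(p, p))                         # 0.0: D(p || p) = 0
print(kl(p, {"a": 0.9, "b": 0.1}) > 0)  # True: divergence is non-negative
```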

8
Theoretical Foundation
  • Given unstructured V, N, and training data of X
    independent pairs of verbs and nouns.
  • Problem: learn the joint distribution of pairs given
    X
  • Not quite unsupervised, not quite supervised
  • No internal structure in pairs
  • Learn the underlying distribution

9
Distributional Clustering
  • Approximately decompose p(n, v) as p(n, v) =
    sum over c in C of p(c | n) p(c, v).
  • p(c | n): membership probability of n in c
  • p(c, v) = p(v | c): probability of v given the centroid
    for c
  • Assuming the marginals p(n), p(v) coincide, p(n, v) =
    sum over c in C of p(c) p(n | c) p(v | c)
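A toy numeric check, with hypothetical probabilities, that the symmetric decomposition yields a proper joint distribution when each factor is normalized:

```python
# Toy check (hypothetical numbers) that the decomposition
# p(n, v) = sum over c of p(c) * p(n | c) * p(v | c)
# defines a proper joint distribution.
p_c = {0: 0.6, 1: 0.4}
p_n_c = {0: {"gun": 0.8, "employee": 0.2},
         1: {"gun": 0.1, "employee": 0.9}}
p_v_c = {0: {"fire": 0.7, "clean": 0.3},
         1: {"fire": 0.5, "hire": 0.5}}

def p_joint(n, v):
    return sum(p_c[c] * p_n_c[c].get(n, 0.0) * p_v_c[c].get(v, 0.0)
               for c in p_c)

total = sum(p_joint(n, v)
            for n in ("gun", "employee")
            for v in ("fire", "clean", "hire"))
print(round(total, 10))  # 1.0: the mixture sums to one
```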

10
Maximum Likelihood Cluster Centroids
  • Used to maximize goodness of fit between the data and
    p(n, v)
  • For a sequence of pairs S, S's model log-probability is
    l(S) = sum over (n, v) in S of
    log( sum over c in C of p(c | n) p(c, v) ).
  • Maximize with respect to p(n | c) and p(v | c).
  • The variation of l(S) vanishes at the maximum

11
Maximum Entropy Cluster Membership
  • Assume independence between the variations of p(n | c)
    and p(v | c).
  • Can find the Bayes inverses of p(n | c) given p(v | c)
    and p(v | n)
  • The p(v | c) that maximize l(S) also minimize the average
    distortion between the cluster model and the data

12
Maximum Entropy Cluster Membership cont.
  • Average cluster distortion:
    <D> = sum over n, c of p(c | n) D(p_n || p_c)
  • Entropy: H = - sum over n, c of p(c | n) log p(c | n)

13
Maximum Entropy Cluster Membership cont.
  • Class and membership distributions take a Gibbs form:
    p(c | n) = exp(-beta D(p_n || p_c)) / Z(n)
  • Z(c) and Z(n) are normalization sums
  • Substituting these equations simplifies the
    log-likelihood
  • At the maximum, the variation vanishes
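A minimal sketch of the membership distribution normalized by Z(n); here `distortions` holds hypothetical KL distances D(p_n || p_c) of one noun to each centroid:

```python
import math

def membership(distortions, beta):
    """Soft membership p(c | n) = exp(-beta * D(p_n || p_c)) / Z(n),
    where Z(n) is the normalization sum over clusters.
    `distortions` maps each cluster to its KL distance from noun n."""
    weights = {c: math.exp(-beta * d) for c, d in distortions.items()}
    z_n = sum(weights.values())
    return {c: w / z_n for c, w in weights.items()}

d = {0: 0.1, 1: 0.9}             # hypothetical distortions for one noun
print(membership(d, beta=0.0))   # beta = 0: uniform membership
print(membership(d, beta=50.0))  # large beta: nearly hard assignment to 0
```

Raising `beta` sharpens the assignment, which is exactly the lever the hierarchical procedure on a later slide exploits.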

14
KL Distortion
  • Attempt to minimize the KL distortion through the
    variation of the KL distances
  • Results in centroids that are weighted averages of the
    noun distributions.

15
Free Energy Function
  • Combined minimum distortion and maximum entropy is
    equivalent to minimizing the free energy F = <D> -
    H/beta
  • F determines <D> and H through partial
    derivatives
  • The minimum of F determines the balance between the
    disordering tendency of maximum entropy and the ordering
    tendency of distortion minimization.
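In the slide's notation, the partial-derivative relations are the standard deterministic-annealing identities (a sketch, following the usual thermodynamic analogy with temperature 1/beta):

```latex
F = \langle D \rangle - \frac{H}{\beta},
\qquad
\langle D \rangle = \frac{\partial (\beta F)}{\partial \beta},
\qquad
H = \beta^{2}\,\frac{\partial F}{\partial \beta}
```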

16
Hierarchical Clustering
  • The number of clusters is determined through a sequence
    of increases of beta.
  • Higher beta implies more local influence of a noun
    on the definition of centroids.
  • Start with low beta and a single c in C
  • Search for the lowest beta that splits c into two or
    more leaf c's.
  • Repeat until C reaches the desired size.
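The steps above can be sketched in Python-style pseudocode; `fit_centroids`, `has_split`, and `split` are hypothetical stand-ins for the EM-style centroid updates and the leaf-splitting test, which the slides do not spell out:

```python
# Pseudocode sketch of the hierarchical annealing loop (not runnable
# as-is: fit_centroids, has_split, and split are placeholders).
def anneal(nouns, beta=0.5, beta_max=64.0, rate=1.5, target_size=8):
    clusters = [nouns]                    # start with a single c in C
    while beta < beta_max and len(clusters) < target_size:
        beta *= rate                      # raise beta: more local influence
        clusters = fit_centroids(clusters, beta)
        # split any leaf whose centroid becomes unstable at this beta
        clusters = [leaf for c in clusters
                    for leaf in (split(c) if has_split(c, beta) else [c])]
    return clusters
```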

17
Experimental Results
  • Classify 64 nouns appearing as direct objects of
    verb fire in Associated Press documents, 1988,
    where V 2147.
  • Four words most similar to cluster centroid and
    KL distances for first splits.
  • Split 1 cluster of fire as discharging weapons
    vs. cluster of fire as releasing employees
  • Split 2 weapons as projectiles vs. weapons as
    guns.

18
Clustering on Verb "fire"
19
Evaluation
20
Evaluation cont.
21
Conclusions
  • Clustering is efficient, informative, and yields
    good predictions
  • Future work:
  • Make the clustering method more rigorous
  • Introduce human judgment, i.e. a more supervised
    approach
  • Extend the model to other word relationships
