1
Language Independent Methods of Clustering
Similar Contexts (with applications)
  • Ted Pedersen
  • University of Minnesota, Duluth
  • http://www.d.umn.edu/~tpederse
  • tpederse@d.umn.edu

2
Language Independent Methods
  • Do not utilize syntactic information
  • No parsers, part of speech taggers, etc. required
  • Do not utilize dictionaries or other manually
    created lexical resources
  • Based on lexical features selected from corpora
  • Assumption: word segmentation can be done by
    looking for white space between strings
  • No manually annotated data of any kind, methods
    are completely unsupervised in the strictest sense

3
Clustering Similar Contexts
  • A context is a short unit of text
  • often a phrase to a paragraph in length, although
    it can be longer
  • Input: N contexts
  • Output: K clusters
  • Where the contexts in each cluster are more
    similar to each other than to the contexts found
    in other clusters

4
Applications
  • Headed contexts (contain target word)
  • Name Discrimination
  • Word Sense Discrimination
  • Headless contexts
  • Email Organization
  • Document Clustering
  • Paraphrase identification
  • Clustering Sets of Related Words

5
Tutorial Outline
  • Identifying lexical features
  • Measures of association and tests of significance
  • Context representations
  • First and second order
  • Dimensionality reduction
  • Singular Value Decomposition
  • Clustering
  • Partitional techniques
  • Cluster stopping
  • Cluster labeling
  • Hands On Exercises

6
General Info
  • Please fill out short survey
  • Break from 4:00-4:30 pm
  • Finish at 6pm
  • Reception tonight at 7pm at Castle (?)
  • Slides and video from tutorial will be posted (I
    will send you email when that is ready)
  • Questions are welcome
  • Now, or via email to me or the SenseClusters list.
  • Comments, observations, criticisms are all
    welcome
  • Knoppix CD will give you Linux and SenseClusters
    when the computer is booted from the CD.

7
SenseClusters
  • A package for clustering contexts
  • http://senseclusters.sourceforge.net
  • SenseClusters Live! (Knoppix CD)
  • Integrates with various other tools
  • Ngram Statistics Package
  • CLUTO
  • SVDPACKC

8
Many thanks
  • Amruta Purandare (M.S., 2004)
  • Founding developer of SenseClusters (2002-2004)
  • Now a PhD student in Intelligent Systems at the
    University of Pittsburgh
    http://www.cs.pitt.edu/~amruta/
  • Anagha Kulkarni (M.S., 2006, expected)
  • Enhancing SenseClusters since Fall 2004!
  • http://www.d.umn.edu/~kulka020/
  • National Science Foundation (USA) for supporting
    Amruta, Anagha and me via CAREER award 0092784

9
Background and Motivations
10
Headed and Headless Contexts
  • A headed context includes a target word
  • Our goal is to cluster the target words based on
    their surrounding contexts
  • Target word is center of context and our
    attention
  • A headless context has no target word
  • Our goal is to cluster the contexts based on
    their similarity to each other
  • The focus is on the context as a whole

11
Headed Contexts (input)
  • I can hear the ocean in that shell.
  • My operating system shell is bash.
  • The shells on the shore are lovely.
  • The shell command line is flexible.
  • The oyster shell is very hard and black.

12
Headed Contexts (output)
  • Cluster 1
  • My operating system shell is bash.
  • The shell command line is flexible.
  • Cluster 2
  • The shells on the shore are lovely.
  • The oyster shell is very hard and black.
  • I can hear the ocean in that shell.

13
Headless Contexts (input)
  • The new version of Linux is more stable and has
    better support for cameras.
  • My Chevy Malibu has had some front end troubles.
  • Osborne made one of the first personal computers.
  • The brakes went out, and the car flew into the
    house.
  • With the price of gasoline, I think I'll be
    taking the bus more often!

14
Headless Contexts (output)
  • Cluster 1
  • The new version of Linux is more stable and has
    better support for cameras.
  • Osborne made one of the first personal computers.
  • Cluster 2
  • My Chevy Malibu has had some front end troubles.
  • The brakes went out, and the car flew into the
    house.
  • With the price of gasoline, I think I'll be
    taking the bus more often!

15
Web Search as Application
  • Web search results are headed contexts
  • Search term is target word (found in snippets)
  • Web search results are often disorganized: two
    people sharing the same name, two organizations
    sharing the same abbreviation, etc., often have
    their pages mixed up
  • If you click on search results or follow links in
    pages found, you will encounter headless contexts
    too

16
Name Discrimination
17
George Millers!
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
Email Foldering as Application
  • Email (public or private) is made up of headless
    contexts
  • Short, usually focused
  • Cluster similar email messages together
  • Automatic email foldering
  • Take all messages from sent-mail file or inbox
    and organize into categories

24
(No Transcript)
25
(No Transcript)
26
Clustering News as Application
  • News articles are headless contexts
  • Entire article or first paragraph
  • Short, usually focused
  • Cluster similar articles together

27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
What is it to be similar?
  • "You shall know a word by the company it keeps"
  • Firth, 1957 (Studies in Linguistic Analysis)
  • Meanings of words are (largely) determined by
    their distributional patterns (Distributional
    Hypothesis)
  • Harris, 1968 (Mathematical Structures of
    Language)
  • Words that occur in similar contexts will have
    similar meanings (Strong Contextual Hypothesis)
  • Miller and Charles, 1991 (Language and Cognitive
    Processes)
  • Various extensions
  • Similar contexts will have similar meanings, etc.
  • Names that occur in similar contexts will refer
    to the same underlying person, etc.

31
General Methodology
  • Represent contexts to be clustered using first or
    second order feature vectors
  • Lexical features
  • Reduce dimensionality to make vectors more
    tractable and/or understandable
  • Singular value decomposition
  • Cluster the context vectors
  • Find the number of clusters
  • Label the clusters
  • Evaluate and/or use the contexts!

32
Identifying Lexical Features
  • Measures of Association and
  • Tests of Significance

33
What are features?
  • Features represent the (hopefully) salient
    characteristics of the contexts to be clustered
  • Eventually we will represent each context as a
    vector, where the dimensions of the vector are
    associated with features
  • Vectors/contexts that include many of the same
    features will be similar to each other

34
Where do features come from?
  • In unsupervised clustering, it is common for the
    feature selection data to be the same data that
    is to be clustered
  • This is not cheating, since data to be clustered
    does not have any labeled classes that can be
    used to assist feature selection
  • It may also be necessary, since we may need to
    cluster all available data, and not hold out some
    for a separate feature identification step
  • Email or news articles

35
Feature Selection
  • Test data: the contexts to be clustered
  • Assume that the feature selection data is the
    same as the test data, unless otherwise indicated
  • Training data: a separate corpus of held-out
    feature selection data (that will not be
    clustered)
  • may need to use if you have a small number of
    contexts to cluster (e.g., web search results)
  • This sense of "training" is due to Schütze (1998)

36
Lexical Features
  • Unigram: a single word that occurs more than a
    given number of times
  • Bigram: an ordered pair of words that occur
    together more often than expected by chance
  • Consecutive, or may have intervening words
  • Co-occurrence: an unordered bigram
  • Target co-occurrence: a co-occurrence where one
    of the words is the target word

37
Bigrams
  • fine wine (window size of 2)
  • baseball bat
  • house of representatives (window size of 3)
  • president of the republic (window size of 4)
  • apple orchard
  • Selected using a small window size (2-4 words),
    trying to capture a regular (localized) pattern
    between two words (collocation?)

38
Co-occurrences
  • tropics water
  • boat fish
  • law president
  • train travel
  • Usually selected using a larger window (7-10
    words) of context, hoping to capture pairs of
    related words rather than collocations

39
Bigrams and Co-occurrences
  • Pairs of words tend to be much less ambiguous
    than unigrams
  • "bank" versus "river bank" and "bank card"
  • "dot" versus "dot com" and "dot product"
  • Trigrams and beyond occur much less frequently
    (Ngrams are very Zipfian)
  • Unigrams are noisy, but bountiful

40
"Occur together more often than expected by chance"
  • Observed frequencies for two words occurring
    together and alone are stored in a 2x2 matrix
  • Throw out bigrams that include one or two stop
    words
  • Expected values are calculated, based on the
    model of independence and observed values
  • How often would you expect these words to occur
    together, if they only occurred together by
    chance?
  • If two words occur significantly more often
    than the expected value, then the words do not
    occur together by chance.

41
2x2 Contingency Table
              Intelligence   !Intelligence
Artificial        100                             400
!Artificial
                  300                         100,000
42
2x2 Contingency Table
              Intelligence   !Intelligence
Artificial        100             300             400
!Artificial       200          99,400          99,600
                  300          99,700         100,000
43
2x2 Contingency Table
              Intelligence          !Intelligence
Artificial    100.0 (exp 1.2)       300.0 (exp 398.8)          400
!Artificial   200.0 (exp 298.8)     99,400.0 (exp 99,301.2)    99,600
              300                   99,700                     100,000
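
As a worked illustration of the expected values and association scores discussed on the preceding slides, here is a minimal Python sketch (standard library only); the variable names are mine, and this is not NSP code.

    import math

    # Observed counts from the Artificial/Intelligence table above
    observed = {("artificial", "intelligence"):    100,
                ("artificial", "!intelligence"):   300,
                ("!artificial", "intelligence"):   200,
                ("!artificial", "!intelligence"): 99400}

    n = sum(observed.values())                        # 100,000
    row = {"artificial": 400, "!artificial": 99600}   # marginal totals
    col = {"intelligence": 300, "!intelligence": 99700}

    x2 = g2 = 0.0
    for (r, c), o in observed.items():
        e = row[r] * col[c] / n            # expected count under independence
        x2 += (o - e) ** 2 / e             # Pearson's chi-squared contribution
        if o > 0:
            g2 += 2 * o * math.log(o / e)  # log-likelihood ratio contribution

    print(round(x2, 1), round(g2, 1))      # both far exceed the 3.841 value discussed below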
44
Measures of Association
45
Measures of Association
46
Interpreting the Scores
  • G2 and X2 are asymptotically approximated by
    the chi-squared distribution
  • This means: if you fix the marginal totals of a
    table, randomly generate internal cell values in
    the table, calculate the G2 or X2 scores for
    each resulting table, and plot the distribution
    of the scores, you should get...

47
(No Transcript)
48
Interpreting the Scores
  • Values above a certain level of significance can
    be considered grounds for rejecting the null
    hypothesis
  • H0: the words in the bigram are independent
  • 3.841 is associated with 95% confidence that the
    null hypothesis should be rejected (see the
    sketch below)
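
If SciPy is available, the 3.841 critical value (and a p-value for any observed score) can be checked directly; this is only a convenience for the reader, not part of NSP or SenseClusters.

    from scipy.stats import chi2

    print(chi2.ppf(0.95, df=1))     # ~3.841, the 95% critical value for 1 degree of freedom
    print(chi2.sf(8191.8, df=1))    # p-value for an X2 score like the 8191.8 computed above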

49
Measures of Association
  • There are numerous measures of association that
    can be used to identify bigram and co-occurrence
    features
  • Many of these are supported in the Ngram
    Statistics Package (NSP)
  • http://www.d.umn.edu/~tpederse/nsp.html

50
Measures Supported in NSP
  • Log-likelihood Ratio (ll)
  • True Mutual Information (tmi)
  • Pearson's Chi-squared Test (x2)
  • Pointwise Mutual Information (pmi)
  • Phi coefficient (phi)
  • T-test (tscore)
  • Fisher's Exact Test (leftFisher, rightFisher)
  • Dice Coefficient (dice)
  • Odds Ratio (odds)

51
NSP
  • Will explore NSP during practical session
  • Integrated into SenseClusters, may also be used
    in stand-alone mode
  • Can be installed easily on a Linux/Unix system
    from CD or download from
  • http://www.d.umn.edu/~tpederse/nsp.html
  • I'm told it can also be installed on Windows (via
    Cygwin or ActivePerl), but I have no personal
    experience of this

52
Summary
  • Identify lexical features based on frequency
    counts or measures of association either in the
    data to be clustered or in a separate set of
    feature selection data
  • Language independent
  • Unigrams usually only selected by frequency
  • Remember, no labeled data from which to learn, so
    somewhat less effective as features than in
    supervised case
  • Bigrams and co-occurrences can also be selected
    by frequency, or better yet measures of
    association
  • Bigrams and co-occurrences need not be
    consecutive
  • Stop words should be eliminated
  • Frequency thresholds are helpful (e.g., a
    unigram or bigram that occurs only once may be
    too rare to be useful)

53
Related Work
  • Moore, 2004 (EMNLP): follow-up to Dunning and
    Pedersen on log-likelihood and exact tests
  • http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf
  • Pedersen, 1996 (SCSUG): explanation of exact
    tests, and comparison to log-likelihood
  • http://arxiv.org/abs/cmp-lg/9608010
  • (also see Pedersen, Kayaalp, and Bruce,
    AAAI-1996)
  • Dunning, 1993 (Computational Linguistics):
    introduces the log-likelihood ratio for
    collocation identification
  • http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf

54
Context Representations
  • First and Second Order Methods

55
Once features selected
  • We will have a set of unigrams, bigrams,
    co-occurrences or target co-occurrences that we
    believe are somehow interesting and useful
  • We also have the frequency counts and measure of
    association scores that were used in their
    selection
  • Convert contexts to be clustered into a vector
    representation based on these features

56
First Order Representation
  • Each context is represented by a vector with M
    dimensions, each of which indicates whether or
    not a particular feature occurred in that context
  • Value may be binary, a frequency count, or an
    association score
  • Context by Feature representation

57
Contexts
  • Cxt1: There was an island curse of black magic
    cast by that voodoo child.
  • Cxt2: Harold, a known voodoo child, was gifted in
    the arts of black magic.
  • Cxt3: Despite their military might, it was a
    serious error to attack.
  • Cxt4: Military might is no defense against a
    voodoo child or an island curse.

58
Unigram Feature Set
  • island 1000
  • black 700
  • curse 500
  • magic 400
  • child 200
  • (assume these are frequency counts obtained from
    some corpus)

59
First Order Vectors of Unigrams
        island  black  curse  magic  child
Cxt1       1      1      1      1      1
Cxt2       0      1      0      1      1
Cxt3       0      0      0      0      0
Cxt4       1      0      1      0      1
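
A minimal sketch of how the context-by-feature matrix above can be built from the example contexts and the unigram features; the tokenization is the naive whitespace/word-character split assumed earlier in the tutorial.

    import re

    contexts = [
        "There was an island curse of black magic cast by that voodoo child.",
        "Harold, a known voodoo child, was gifted in the arts of black magic.",
        "Despite their military might, it was a serious error to attack.",
        "Military might is no defense against a voodoo child or an island curse.",
    ]
    features = ["island", "black", "curse", "magic", "child"]

    def tokens(text):
        # language independent assumption: lowercase, split into word strings
        return set(re.findall(r"\w+", text.lower()))

    matrix = [[1 if f in tokens(cxt) else 0 for f in features] for cxt in contexts]
    for row in matrix:
        print(row)   # reproduces the binary vectors shown above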
60
Bigram Feature Set
  • island curse 189.2
  • black magic 123.5
  • voodoo child 120.0
  • military might 100.3
  • serious error 89.2
  • island child 73.2
  • voodoo might 69.4
  • military error 54.9
  • black child 43.2
  • serious curse 21.2
  • (assume these are log-likelihood scores based on
    frequency counts from some corpus)

61
First Order Vectors of Bigrams
        black magic  island curse  military might  serious error  voodoo child
Cxt1         1             1               0               0              1
Cxt2         1             0               0               0              1
Cxt3         0             0               1               1              0
Cxt4         0             1               1               0              1
62
First Order Vectors
  • Can have binary values or weights associated with
    frequency, etc.
  • Forms a context by feature matrix
  • May optionally be smoothed/reduced with Singular
    Value Decomposition
  • More on that later
  • The contexts are ready for clustering
  • More on that later

63
Second Order Features
  • First order features encode the occurrence of a
    feature in a context
  • Feature occurrence represented by binary value
  • Second order features encode something extra
    about a feature that occurs in a context
  • Feature occurrence represented by word
    co-occurrences
  • Feature occurrence represented by context
    occurrences

64
Second Order Representation
  • First, build word by word matrix from features
  • Based on bigrams or co-occurrences
  • First word is row, second word is column, cell is
    score
  • (optionally) reduce dimensionality w/SVD
  • Each row forms a vector of first order
    co-occurrences
  • Second, replace each word in a context with its
    row/vector as found in the word by word matrix
  • Average all the word vectors in the context to
    create the second order representation
  • Due to Schütze (1998), related to LSI/LSA

65
Word by Word Matrix
           magic   curse   might   error   child
black      123.5     0       0       0      43.2
island       0     189.2     0       0      73.2
military     0       0     100.3    54.9     0
serious      0      21.2     0      89.2     0
voodoo       0       0      69.4     0     120.0
66
Word by Word Matrix
  • can also be used to identify sets of related
    words
  • In the case of bigrams, rows represent the first
    word in a bigram and columns represent the second
    word
  • Matrix is asymmetric
  • In the case of co-occurrences, rows and columns
    are equivalent
  • Matrix is symmetric
  • The vector (row) for each word represents a set
    of first order features for that word
  • Each word in a context to be clustered for which
    a vector exists (in the word by word matrix) is
    replaced by that vector in that context

67
There was an island curse of black magic cast by
that voodoo child.
          magic   curse   might   error   child
black     123.5     0       0       0      43.2
island      0     189.2     0       0      73.2
voodoo      0       0      69.4     0     120.0
68
Second Order Co-Occurrences
  • Word vectors for "black" and "island" show
    similarity, as both occur with "child"
  • "black" and "island" are second order
    co-occurrences of each other, since both occur
    with "child" but not with each other (i.e.,
    "black island" is not observed)

69
Second Order Representation
  • There was an [curse, child] curse of [magic,
    child] magic cast by that [might, child] child.
  • [curse, child]  [magic, child]  [might, child]

70
There was an island curse of black magic cast by
that voodoo child.
        magic   curse   might   error   child
Cxt1    41.2    63.1    24.4      0     78.8
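
A minimal sketch of the averaging step: each word of Cxt1 that has a row in the word-by-word matrix is replaced by that row, and the rows are averaged into one context vector. The rounding is mine; the result approximately matches the Cxt1 vector shown above.

    # Rows from the word-by-word matrix, over (magic, curse, might, error, child)
    word_vectors = {
        "black":  [123.5,   0.0,  0.0, 0.0,  43.2],
        "island": [  0.0, 189.2,  0.0, 0.0,  73.2],
        "voodoo": [  0.0,   0.0, 69.4, 0.0, 120.0],
    }

    context = "There was an island curse of black magic cast by that voodoo child."
    matched = [word_vectors[w] for w in context.lower().replace(".", "").split()
               if w in word_vectors]

    # average the matched rows to form the second order representation of Cxt1
    second_order = [sum(col) / len(matched) for col in zip(*matched)]
    print([round(v, 1) for v in second_order])   # ~ [41.2, 63.1, 23.1, 0.0, 78.8]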
71
Second Order Representation
  • Results in a Context by Feature (Word)
    Representation
  • Cell values do not indicate if feature occurred
    in context. Rather, they show the strength of
    association of that feature with other words that
    occur with a word in the context.

72
Summary
  • First order representations are intuitive, but
  • Can suffer from sparsity
  • Contexts represented based on the features that
    occur in those contexts
  • Second order representations are harder to
    visualize, but
  • Allows a word to be represented by the words it
    co-occurs with (i.e., the company it keeps)
  • Allows a context to be represented by the words
    that occur with the words in the context
  • Helps combat sparsity

73
Related Work
  • Pedersen and Bruce 1997 (EMNLP): presented a
    first order method of discrimination
  • http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
  • Schütze 1998 (Computational Linguistics):
    introduced the second order method
  • http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
  • Purandare and Pedersen 2004 (CoNLL): compared
    first and second order methods
  • http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
  • First order better if you have lots of data
  • Second order better with smaller amounts of data

74
Dimensionality Reduction
  • Singular Value Decomposition

75
Motivation
  • First order matrices are very sparse
  • Context by feature
  • Word by word
  • NLP data is noisy
  • No stemming performed
  • synonyms

76
Many Methods
  • Singular Value Decomposition (SVD)
  • SVDPACKC: http://www.netlib.org/svdpack/
  • Multi-Dimensional Scaling (MDS)
  • Principal Components Analysis (PCA)
  • Independent Components Analysis (ICA)
  • Linear Discriminant Analysis (LDA)
  • etc

77
Effect of SVD
  • SVD reduces a matrix to a given number of
    dimensions. This may convert a word level space
    into a semantic or conceptual space
  • If "dog", "collie" and "wolf" are
    dimensions/columns in a word co-occurrence
    matrix, after SVD they may be collapsed into a
    single dimension that represents "canines"

78
Effect of SVD
  • The dimensions of the matrix after SVD are
    principal components that represent the meaning
    of concepts
  • Similar columns are grouped together
  • SVD is a way of smoothing a very sparse matrix,
    so that there are very few zero valued cells
    after SVD

79
How can SVD be used?
  • SVD on first order contexts will reduce a context
    by feature representation down to a smaller
    number of features
  • Latent Semantic Analysis typically performs SVD
    on a feature by context representation, where the
    contexts are reduced
  • SVD used in creating second order context
    representations
  • Reduce word by word matrix

80
Word by Word Matrix
        apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc        2      0      0     1    3     1     0        0         0      0      0
body      0      3      0     0    0     0     2        0         0      2      1
disk      1      0      0     2    0     3     0        1         2      0      0
petri     0      2      1     0    0     0     2        0         1      0      1
lab       0      0      3     0    2     0     2        0         2      1      3
sales     0      0      0     2    3     0     0        1         2      0      0
linux     2      0      0     1    3     2     0        1         1      0      0
debt      0      0      0     2    3     4     0        2         0      0      0
81
Singular Value Decomposition: A = UDV'
82
U
.35 .09 -.2 .52 -.09 .40 .02 .63 .20 -.00 -.02
.05 -.49 .59 .44 .08 -.09 -.44 -.04 -.6 -.02 -.01
.35 .13 .39 -.60 .31 .41 -.22 .20 -.39 .00 .03
.08 -.45 .25 -.02 .17 .09 .83 .05 -.26 -.01 .00
.29 -.68 -.45 -.34 -.31 .02 -.21 .01 .43 -.02 -.07
.37 -.01 -.31 .09 .72 -.48 -.04 .03 .31 -.00 .08
.46 .11 -.08 .24 -.01 .39 .05 .08 .08 -.00 -.01
.56 .25 .30 -.07 -.49 -.52 .14 -.3 -.30 .00 -.07
83
D
9.19   6.36   3.99   3.25   2.52   2.30   1.26   0.66   0.00   0.00   0.00     (diagonal of singular values)
84
V
.21 .08 -.04 .28 .04 .86 -.05 -.05 -.31 -.12 .03
.04 -.37 .57 .39 .23 -.04 .26 -.02 .03 .25 .44
.11 -.39 -.27 -.32 -.30 .06 .17 .15 -.41 .58 .07
.37 .15 .12 -.12 .39 -.17 -.13 .71 -.31 -.12 .03
.63 -.01 -.45 .52 -.09 -.26 .08 -.06 .21 .08 -.02
.49 .27 .50 -.32 -.45 .13 .02 -.01 .31 .12 -.03
.09 -.51 .20 .05 -.05 .02 .29 .08 -.04 -.31 -.71
.25 .11 .15 -.12 .02 -.32 .05 -.59 -.62 -.23 .07
.28 -.23 -.14 -.45 .64 .17 -.04 -.32 .31 .12 -.03
.04 -.26 .19 .17 -.06 -.07 -.87 -.10 -.07 .22 -.20
.11 -.47 -.12 -.18 -.27 .03 -.18 .09 .12 -.58 .50
85
Word by Word Matrix After SVD
        apple  blood  cells  ibm   data  tissue  graphics  memory  organ  plasma
pc       .73    .00    .11    1.3   2.0    .01      .86      .77     .00    .09
body     .00    1.2    1.3    .00   .33    1.6      .00      .85     .84    1.5
disk     .76    .00    .01    1.3   2.1    .00      .91      .72     .00    .00
germ     .00    1.1    1.2    .00   .49    1.5      .00      .86     .77    1.4
lab      .21    1.7    2.0    .35   1.7    2.5      .18      1.7     1.2    2.3
sales    .73    .15    .39    1.3   2.2    .35      .85      .98     .17    .41
linux    .96    .00    .16    1.7   2.7    .03      1.1      1.0     .00    .13
debt     1.2    .00    .00    2.1   3.2    .00      1.5      1.1     .00    .00
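
A minimal numpy sketch of the decomposition and rank-k reconstruction illustrated on the last few slides; the choice of k = 2 here is only for illustration and is not the exact SVDPACKC computation behind the table above.

    import numpy as np

    # The original word-by-word counts (rows: pc, body, disk, petri, lab, sales, linux, debt)
    A = np.array([[2, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0],
                  [0, 3, 0, 0, 0, 0, 2, 0, 0, 2, 1],
                  [1, 0, 0, 2, 0, 3, 0, 1, 2, 0, 0],
                  [0, 2, 1, 0, 0, 0, 2, 0, 1, 0, 1],
                  [0, 0, 3, 0, 2, 0, 2, 0, 2, 1, 3],
                  [0, 0, 0, 2, 3, 0, 0, 1, 2, 0, 0],
                  [2, 0, 0, 1, 3, 2, 0, 1, 1, 0, 0],
                  [0, 0, 0, 2, 3, 4, 0, 2, 0, 0, 0]], dtype=float)

    U, d, Vt = np.linalg.svd(A, full_matrices=False)   # A = U D V'
    k = 2                                              # number of dimensions to keep
    A_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]        # smoothed reconstruction
    print(np.round(A_k, 2))                            # few zero-valued cells remain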
86
Second Order Representation
  • I got a new disk today!
  • What do you think of linux?

        apple  blood  cells  ibm   data  tissue  graphics  memory  organ  plasma
disk     .76    .00    .01    1.3   2.1    .00      .91      .72     .00    .00
linux    .96    .00    .16    1.7   2.7    .03      1.1      1.0     .00    .13
  • These two contexts share no words in common, yet
    they are similar! disk and linux both occur with
    Apple, IBM, data, graphics, and memory
  • The two contexts are similar because they share
    many second order co-occurrences

87
Relationship to LSA
  • Latent Semantic Analysis uses feature by context
    first order representation
  • Indicates all the contexts in which a feature
    occurs
  • Use SVD to reduce dimensions (contexts)
  • Cluster features based on similarity of contexts
    in which they occur
  • Represent sentences using an average of feature
    vectors

88
Feature by Context Representation
                  Cxt1  Cxt2  Cxt3  Cxt4
black magic         1     1     0     1
island curse        1     0     0     1
military might      0     0     1     0
serious error       0     0     1     0
voodoo child        1     1     0     1
89
References
  • Deerwester, S., Dumais, S.T., Furnas, G.W.,
    Landauer, T.K., and Harshman, R., Indexing by
    Latent Semantic Analysis, Journal of the American
    Society for Information Science, vol. 41, 1990
  • Landauer, T. and Dumais, S., A Solution to
    Plato's Problem: The Latent Semantic Analysis
    Theory of Acquisition, Induction and
    Representation of Knowledge, Psychological
    Review, vol. 104, 1997
  • Schütze, H., Automatic Word Sense Discrimination,
    Computational Linguistics, vol. 24, 1998
  • Berry, M.W., Drmac, Z., and Jessup, E.R.,
    Matrices, Vector Spaces, and Information
    Retrieval, SIAM Review, vol. 41, 1999

90
Clustering
  • Partitional Methods
  • Cluster Stopping
  • Cluster Labeling

91
Many many methods
  • Cluto supports a wide range of different
    clustering methods
  • Agglomerative
  • Average, single, complete link
  • Partitional
  • K-means (Direct)
  • Hybrid
  • Repeated bisections
  • SenseClusters integrates with Cluto
  • http://www-users.cs.umn.edu/~karypis/cluto/

92
General Methodology
  • Represent contexts to be clustered in first or
    second order vectors
  • Cluster the context vectors directly
  • vcluster
  • or convert to similarity matrix and then
    cluster
  • scluster

93
Agglomerative Clustering
  • Create a similarity matrix of contexts to be
    clustered
  • Results in a symmetric instance by instance
    matrix, where each cell contains the similarity
    score between a pair of instances
  • Typically a first order representation, where
    similarity is based on the features observed in
    the pair of instances

94
Measuring Similarity
  • Integer Values
  • Matching Coefficient
  • Jaccard Coefficient
  • Dice Coefficient
  • Real Values
  • Cosine
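
A minimal sketch of the measures listed above, written for the binary first order vectors used earlier (the cosine also applies to real valued vectors); these are the standard textbook definitions, not SenseClusters code.

    import math

    def matching(x, y):                 # matching coefficient
        return sum(a * b for a, b in zip(x, y))

    def jaccard(x, y):
        union = sum(1 for a, b in zip(x, y) if a or b)
        return matching(x, y) / union

    def dice(x, y):
        return 2 * matching(x, y) / (sum(x) + sum(y))

    def cosine(x, y):                   # suitable for real valued vectors
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return matching(x, y) / norm

    cxt1 = [1, 1, 1, 1, 1]              # the unigram vectors for Cxt1 and Cxt2 above
    cxt2 = [0, 1, 0, 1, 1]
    print(matching(cxt1, cxt2), jaccard(cxt1, cxt2),
          dice(cxt1, cxt2), round(cosine(cxt1, cxt2), 2))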

95
Agglomerative Clustering
  • Apply Agglomerative Clustering algorithm to
    similarity matrix
  • To start, each context is its own cluster
  • Form a cluster from the most similar pair of
    contexts
  • Repeat until the desired number of clusters is
    obtained
  • Advantages: high quality clustering
  • Disadvantages: computationally expensive, must
    carry out exhaustive pairwise comparisons

96
Average Link Clustering
Step 1: initial similarity matrix
        S1   S2   S3   S4
S1       -    3    4    2
S2       3    -    2    0
S3       4    2    -    1
S4       2    0    1    -

Step 2: merge the most similar pair (S1, S3); average link similarities
        S1S3   S2   S4
S1S3      -   2.5  1.5
S2      2.5    -    0
S4      1.5    0    -

Step 3: merge S1S3 with S2
          S1S3S2   S4
S1S3S2       -      1
S4           1      -
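
A hedged sketch of the same merge sequence with scipy; since scipy's linkage routine expects distances rather than similarities, the similarities above are first converted to distances (here simply maximum similarity minus similarity, an assumption made only for this illustration).

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    sim = np.array([[0, 3, 4, 2],          # the S1..S4 similarity matrix above
                    [3, 0, 2, 0],
                    [4, 2, 0, 1],
                    [2, 0, 1, 0]], dtype=float)

    dist = sim.max() - sim                 # convert similarity to distance
    np.fill_diagonal(dist, 0)              # an instance is distance 0 from itself

    merges = linkage(squareform(dist), method="average")
    print(merges)                          # first merge: S1 and S3, then S2 joins them
    print(fcluster(merges, t=2, criterion="maxclust"))   # cut into 2 clusters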
97
Partitional Methods
  • Randomly create centroids equal to the number of
    clusters you wish to find
  • Assign each context to nearest centroid
  • After all contexts assigned, re-compute centroids
  • best location decided by criterion function
  • Repeat until stable clusters found
  • Centroids don't shift from iteration to iteration
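
A minimal sketch of the partitional procedure just described, assuming real valued context vectors and Euclidean distance; the initialization and stability test are deliberately simplified.

    import random
    import numpy as np

    def kmeans(vectors, k, iters=100, seed=0):
        rng = random.Random(seed)
        # randomly pick k of the contexts as the initial centroids
        centroids = np.array(rng.sample(list(vectors), k), dtype=float)
        for _ in range(iters):
            # assign each context to its nearest centroid
            dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # re-compute each centroid from its assigned contexts
            new = np.array([vectors[labels == j].mean(axis=0)
                            if np.any(labels == j) else centroids[j]
                            for j in range(k)])
            if np.allclose(new, centroids):   # centroids no longer shift: stable
                break
            centroids = new
        return labels, centroids

    vectors = np.array([[1, 1, 1, 1, 1], [0, 1, 0, 1, 1],
                        [0, 0, 0, 0, 0], [1, 0, 1, 0, 1]], dtype=float)
    print(kmeans(vectors, k=2)[0])   # cluster assignments for the example contexts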

98
Partitional Methods
  • Advantages: fast
  • Disadvantages
  • Results can be dependent on the initial placement
    of centroids
  • Must specify number of clusters ahead of time
  • (maybe not, as we will see with cluster stopping)

99
Vectors to be clustered
100
Random Initial Centroids (k=2)
101
Assignment of Clusters
102
Recalculation of Centroids
103
Reassignment of Clusters
104
Recalculation of Centroid
105
Reassignment of Clusters
106
Partitional Criterion Functions
  • Intra-Cluster (Internal) similarity/distance
  • How close together are members of a cluster?
  • Closer together is better
  • Inter-Cluster (External) similarity/distance
  • How far apart are the different clusters?
  • Further apart is better

107
Intra Cluster Similarity
  • Ball of String (I1)
  • How far is each member from each other member?
  • Flower (I2)
  • How far is each member of the cluster from the
    centroid?

108
Contexts to be Clustered
109
Ball of String (I1 Internal Criterion Function)
110
Flower(I2 Internal Criterion Function)
111
Inter Cluster Similarity
  • The Fan (E1)
  • How far is each centroid from the centroid of the
    entire collection of contexts
  • Maximize that distance

112
The Fan(E1 External Criterion Function)
113
Hybrid Criterion Functions
  • Balance internal and external similarity
  • H1 = I1 / E1
  • H2 = I2 / E1
  • Want internal similarity to increase, while
    external similarity decreases
  • Want internal distances to decrease, while
    external distances increase
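
A hedged sketch of how internal (I2), external (E1), and hybrid (H2 = I2/E1) criteria along these lines could be computed for a given clustering; it follows the intuitive descriptions on these slides using cosine similarity to centroids, and is not CLUTO's exact formulation.

    import numpy as np

    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def criteria(vectors, labels):
        overall = vectors.mean(axis=0)        # centroid of the whole collection
        i2 = e1 = 0.0
        for c in np.unique(labels):
            members = vectors[labels == c]
            centroid = members.mean(axis=0)
            # internal: how close members are to their own centroid (want this large)
            i2 += sum(cos(v, centroid) for v in members)
            # external: how similar cluster centroids are to the overall centroid (want this small)
            e1 += len(members) * cos(centroid, overall)
        return i2, e1, i2 / e1                # hybrid criterion H2 = I2 / E1

    # illustrative vectors and labels (not from the tutorial data)
    vectors = np.array([[1, 1, 1, 1, 1], [0, 1, 0, 1, 1],
                        [1, 0, 1, 0, 1], [0, 0, 1, 0, 0]], dtype=float)
    print(criteria(vectors, np.array([0, 0, 1, 1])))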

114
Cluster Stopping
115
Cluster Stopping
  • Many Clustering Algorithms require that the user
    specify the number of clusters prior to
    clustering
  • But the user often doesn't know the number of
    clusters, and in fact finding that out might be
    the goal of clustering

116
Criterion Functions Can Help
  • Run the partitional algorithm for k = 1 to deltaK
  • DeltaK is a user estimated or automatically
    determined upper bound for the number of clusters
  • Find the value of k at which the criterion
    function does not significantly increase at k+1
  • Clustering can stop at this value, since no
    further improvement in the solution is apparent
    with additional clusters (increases in k); a code
    sketch of this loop follows below
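
A hedged sketch of the stopping loop described above: cluster for each k up to deltaK, record a criterion score, and stop at the first k where going to k+1 adds no appreciable improvement. The 5% improvement threshold, scipy's kmeans2, and the simple internal criterion used here are illustrative assumptions, not the exact SenseClusters rule.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def internal_criterion(vectors, labels, centroids):
        # sum of cosine similarities of each context to its cluster centroid
        score = 0.0
        for v, c in zip(vectors, labels):
            cen = centroids[c]
            score += float(v @ cen) / (np.linalg.norm(v) * np.linalg.norm(cen) + 1e-12)
        return score

    def pick_k(vectors, delta_k, tol=0.05):
        scores = []
        for k in range(1, delta_k + 1):
            centroids, labels = kmeans2(vectors, k, minit="++")
            scores.append(internal_criterion(vectors, labels, centroids))
        for k in range(1, delta_k):
            # stop at the first k where k+1 brings no significant increase
            if scores[k] - scores[k - 1] < tol * abs(scores[k - 1]):
                return k, scores
        return delta_k, scores

    vectors = np.random.RandomState(0).rand(30, 5)   # stand-in context vectors
    best_k, curve = pick_k(vectors, delta_k=6)
    print(best_k, np.round(curve, 2))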

117
SenseClusters Approach to Cluster Stopping
  • Will be subject of Demo at EACL
  • Demo Session: 25th April, 14:30-16:00
  • Ted Pedersen and Anagha Kulkarni Selecting the
    "Right" Number of Senses Based on Clustering
    Criterion Functions

118
H2 versus k (T. Blair / V. Putin / S. Hussein)
119
PK2
  • Based on Hartigan, 1975
  • When the ratio approaches 1, clustering is at a
    plateau
  • Select the value of k which is closest to, but
    outside of, the standard deviation interval

120
PK2 predicts 3 senses (T. Blair / V. Putin / S. Hussein)
121
PK3
  • Related to Salvador and Chan, 2004
  • Inspired by Dice Coefficient
  • Values close to 1 mean clustering is improving
  • Select the value of k which is closest to, but
    outside of, the standard deviation interval

122
PK3 predicts 3 senses (T. Blair / V. Putin / S. Hussein)
123
References
  • Hartigan, J., Clustering Algorithms, Wiley, 1975
  • basis for SenseClusters stopping method PK2
  • Mojena, R., Hierarchical Grouping Methods and
    Stopping Rules: An Evaluation, The Computer
    Journal, vol. 20, 1977
  • basis for SenseClusters stopping method PK1
  • Milligan, G. and Cooper, M., An Examination of
    Procedures for Determining the Number of Clusters
    in a Data Set, Psychometrika, vol. 50, 1985
  • Very extensive comparison of cluster stopping
    methods
  • Tibshirani, R., Walther, G., and Hastie, T.,
    Estimating the Number of Clusters in a Dataset
    via the Gap Statistic, Journal of the Royal
    Statistical Society (Series B), 2001
  • Pedersen, T. and Kulkarni, A. Selecting the
    "Right" Number of Senses Based on Clustering
    Criterion Functions, Proceedings of the Posters
    and Demo Program of the Eleventh Conference of
    the European Chapter of the Association for
    Computational Linguistics, 2006
  • Describes SenseClusters stopping methods

124
Cluster Labeling
125
Cluster Labeling
  • Once a cluster is discovered, how can you
    generate a description of the contexts of that
    cluster automatically?
  • In the case of contexts, you might be able to
    identify significant lexical features from the
    contents of the clusters, and use those as a
    preliminary label

126
Results of Clustering
  • Each cluster consists of some number of contexts
  • Each context is a short unit of text
  • Apply measures of association to the contents of
    each cluster to determine N most significant
    bigrams
  • Use those bigrams as a label for the cluster
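
A minimal sketch of this labeling step: count the bigrams inside a cluster's contexts and rank them with the log-likelihood ratio, keeping the top N as a descriptive label. This is a plain standard-library reimplementation for illustration, not NSP itself, and the marginal counts are taken from the bigram list.

    import math
    import re
    from collections import Counter

    def label_cluster(contexts, top_n=3):
        token_lists = [re.findall(r"\w+", cxt.lower()) for cxt in contexts]
        bigrams = Counter(p for ws in token_lists for p in zip(ws, ws[1:]))
        first = Counter(p[0] for p in bigrams.elements())    # marginal counts
        second = Counter(p[1] for p in bigrams.elements())
        n = sum(bigrams.values())

        def loglike(pair):
            o11 = bigrams[pair]
            o1_, o_1 = first[pair[0]], second[pair[1]]
            cells = [(o11, o1_ * o_1 / n),
                     (o1_ - o11, o1_ * (n - o_1) / n),
                     (o_1 - o11, (n - o1_) * o_1 / n),
                     (n - o1_ - o_1 + o11, (n - o1_) * (n - o_1) / n)]
            return sum(2 * o * math.log(o / e) for o, e in cells if o > 0 and e > 0)

        return sorted(bigrams, key=loglike, reverse=True)[:top_n]

    cluster = ["My operating system shell is bash.",
               "The shell command line is flexible."]
    print(label_cluster(cluster))   # N most significant bigrams as a descriptive label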

127
Label Types
  • The N most significant bigrams for each cluster
    will act as a descriptive label
  • The M most significant bigrams that are unique to
    each cluster will act as a discriminating label

128
Evaluation Techniques
  • Comparison to gold standard data

129
Evaluation
  • If sense-tagged text is available, it can be used
    for evaluation
  • But don't use the sense tags for clustering or
    feature selection!
  • Assume that sense tags represent true clusters,
    and compare these to discovered clusters
  • Find mapping of clusters to senses that attains
    maximum accuracy

130
Evaluation
  • Pseudo-words are especially useful, since it is
    hard to find data that has already been sense
    discriminated
  • Pick two words or names from a corpus, and
    conflate them into one name. Then see how well
    you can discriminate.
  • http://www.d.umn.edu/~tpederse/tools.html
  • Baseline algorithm: group all instances into one
    cluster; this will reach accuracy equal to the
    majority classifier

131
Evaluation
  • Pseudo-words are especially useful, since it is
    hard to find data that has already been sense
    discriminated
  • Pick two words or names from a corpus, and
    conflate them into one name. Then see how well
    you can discriminate.
  • http://www.d.umn.edu/~kulka020/kanaghaName.html

132
Baseline Algorithm
  • Baseline algorithm: group all instances into one
    cluster; this will reach accuracy equal to the
    majority classifier
  • What if the clustering said everything should be
    in the same cluster?

133
Baseline Performance
         S1   S2   S3   Totals
C1        0    0    0        0
C2        0    0    0        0
C3       80   35   55      170
Totals   80   35   55      170

         S3   S2   S1   Totals
C1        0    0    0        0
C2        0    0    0        0
C3       55   35   80      170
Totals   55   35   80      170
  • (0 + 0 + 55) / 170 = .32 if C3 is labeled S3
  • (0 + 0 + 80) / 170 = .47 if C3 is labeled S1

134
Evaluation
  • Suppose that C1 is labeled S1, C2 as S2, and C3
    as S3
  • Accuracy = (10 + 0 + 10) / 170 = 12%
  • The diagonal shows how many members of each
    cluster actually belong to the sense given on the
    column
  • Can the columns be rearranged to improve the
    overall accuracy?
  • Optimally assign clusters to senses



         S1   S2   S3   Totals
C1       10   30    5       45
C2       20    0   40       60
C3       50    5   10       65
Totals   80   35   55      170
135
Evaluation
  • The assignment of C1 to S2, C2 to S3, and C3 to
    S1 results in 120/170 = 71%
  • Find the ordering of the columns in the matrix
    that maximizes the sum of the diagonal.
  • This is an instance of the Assignment Problem
    from Operations Research, or finding the Maximal
    Matching of a Bipartite Graph from Graph Theory
    (a code sketch follows below).

         S2   S3   S1   Totals
C1       30    5   10       45
C2        0   40   20       60
C3        5   10   50       65
Totals   35   55   80      170
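
The optimal mapping of clusters to senses described above can be computed directly with an assignment problem solver; here is a small scipy sketch using the confusion matrix from these slides.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    confusion = np.array([[10, 30,  5],      # rows C1..C3, columns S1..S3
                          [20,  0, 40],
                          [50,  5, 10]])

    rows, cols = linear_sum_assignment(confusion, maximize=True)
    for c, s in zip(rows, cols):
        print(f"C{c + 1} -> S{s + 1}")       # C1 -> S2, C2 -> S3, C3 -> S1
    print(confusion[rows, cols].sum() / confusion.sum())   # 120/170, about 71%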
136
Analysis
  • Unsupervised methods may not discover clusters
    equivalent to the classes learned in supervised
    learning
  • Evaluation based on assuming that sense tags
    represent the true clusters is likely a bit
    harsh. Alternatives?
  • Humans could look at the members of each cluster
    and determine the nature of the relationship or
    meaning that they all share
  • Use the contents of the cluster to generate a
    descriptive label that could be inspected by a
    human

137
Practical Session
  • Experiments with SenseClusters

138
Things to Try
  • Feature Identification
  • Type of Feature
  • Measures of association
  • Context Representation (1st or 2nd order)
  • Automatic Stopping (or not)
  • SVD (or not)
  • Clustering Algorithm and Criterion Function
  • Evaluation
  • Labeling

139
Experimental Data
  • Available on Web Site
  • http://senseclusters.sourceforge.net
  • Available on LIVE CD
  • Mostly Name Conflate data

140
Creating Experimental Data
  • NameConflate program
  • Creates name-conflated data from the English
    GigaWord corpus
  • Text2Headless program
  • Converts plain text into headless contexts
  • http://www.d.umn.edu/~tpederse/tools.html

141
Headed Clustering
  • Name Discrimination
  • Tom Hanks
  • Russell Crowe

142
(No Transcript)
143
(No Transcript)
144
(No Transcript)
145
(No Transcript)
146
Headless Contexts
  • Email / 20 newsgroups data
  • Spanish Text

147
(No Transcript)
148
(No Transcript)
149
Thank you!
  • Questions or comments on the tutorial or
    SenseClusters are welcome at any time:
    tpederse@d.umn.edu
  • SenseClusters is freely available via LIVE CD,
    the Web, and in source code form
  • http://senseclusters.sourceforge.net
  • SenseClusters papers are available at
  • http://www.d.umn.edu/~tpederse/senseclusters-pubs.html