Title: Cross-partition Clustering
1. Cross-partition Clustering: Revealing Analogous Themes across Related Topics
Zvika Marx, Ido Dagan, Eli Shamir
2. Outline
- Motivation: identifying analogies / correspondences across different domains, a fascinating aspect of intelligence!
- Computational framework: revealing concepts / themes through word clustering
- Probabilistic clustering: the Information Distortion / Information Bottleneck methods
- The Cross-Partition Clustering method
  - Algorithm and underlying principles
  - Application to cross-religion comparison
3. What is Analogy?
- Non-obvious similarity or correspondence
- Principled, deep, systematic relation (in contrast to mere appearance)
- Analogies are discovered or revealed
  - related to problem solving
  - require insight, creativity, fluid thinking
- Concept formation, feature selection?
4. Example: Orchestra and Army
[Diagram illustrating the orchestra-army analogy; only the label "General" survives from the figure]
5. Cross-partition Clustering Setting
- Given: data pre-partitioned into distinct subsets w1, ..., wN, N ≥ 2
- Goal: revealing themes that cut across all subsets
[Diagram: Subset w1, Subset w2, ..., Subset wN, each with its own partition; clusters cut across all subsets]
- Soft (probabilistic) assignments
- (the formalism may even allow soft pre-partitioning)
- Previous works: Dagan, Marx & Shamir, CoNLL 2002; Marx, Dagan & Shamir, NIPS 2003
6. Background: Probabilistic Clustering
- Input:
  - for each x ∈ X (clustered element): relative frequency p(x)
  - for each x and each y ∈ Y (feature): conditional occurrence probability p(y|x)
- Output:
  - for each x and each cluster c: assignment probability p(c|x)
  - (and for each c, a distribution over the y's, p(y|c): a centroid representation of c in the feature space)
7. Probabilistic Clustering: Iterative Loop
- K-means style iterative loop (a sketch in code follows below):
- (1) Recalculate assignment probabilities p_t(c|x):
  - p_t(c|x) ∝ p_{t-1}(c) · exp( −β · KL[ p(y|x) ‖ p_{t-1}(y|c) ] )
  - exponentially inverse to the KL divergence between the representative feature distributions of x and c
- (2) Recalculate cluster centroids p_t(y|c) for each cluster c:
  - p_t(y|c) ∝ Σ_x p(x) · p_t(c|x) · p(y|x)
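A minimal sketch of this iterative loop, assuming p(x) is given as a probability vector and p(y|x) as a row-stochastic matrix; the function and parameter names (soft_clustering, beta, n_iter) are illustrative, not from the original work.

import numpy as np

EPS = 1e-12

def kl_rows(p, q):
    """Row-wise KL[p || q] for distributions given as rows summing to 1."""
    p = np.clip(p, EPS, None)
    q = np.clip(q, EPS, None)
    return (p * np.log(p / q)).sum(axis=-1)

def soft_clustering(p_x, p_y_given_x, n_clusters, beta=5.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n_x, n_y = p_y_given_x.shape
    # random initial centroids p_0(y|c) and a uniform cluster prior p_0(c)
    p_y_given_c = rng.dirichlet(np.ones(n_y), size=n_clusters)
    p_c = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        # (1) p_t(c|x) proportional to p_{t-1}(c) * exp(-beta * KL[p(y|x) || p_{t-1}(y|c)])
        div = np.stack([kl_rows(p_y_given_x, p_y_given_c[c]) for c in range(n_clusters)], axis=1)
        p_c_given_x = p_c * np.exp(-beta * div)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
        # (2) centroids p_t(y|c) proportional to sum_x p(x) * p_t(c|x) * p(y|x), plus updated prior p_t(c)
        p_c = p_x @ p_c_given_x
        p_y_given_c = (p_c_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_c /= np.clip(p_y_given_c.sum(axis=1, keepdims=True), EPS, None)
    return p_c_given_x, p_y_given_c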
8. Information Distortion / Bottleneck
- Data clustering interpreted as a means for conveying the relevant information in the data
- It maximizes the information the clusters provide about what's relevant (i.e., the feature variable Y)
- Subject to the above, minimize the information the data provide about the clusters ⇒ Maximum Entropy
- Optimization problem:
  - argmin H(C) − H(C|X) + β·H(Y|C), over all p(c), p(c|x), p(y|c) distributions (rewritten below in mutual-information form)
  - The optional H(C) term differentiates IB (with) from ID (without)
  - The K-means-style iterative steps are derived from this term
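One way to see that this is the familiar IB trade-off: rewrite the minimized term in mutual-information form (H(Y) is fixed by the data, so the added constant does not change the minimizer).

% minimized term rewritten using I(X;C) = H(C) - H(C|X) and H(Y|C) = H(Y) - I(C;Y)
\begin{align*}
F &= H(C) - H(C \mid X) + \beta\, H(Y \mid C) \\
  &= I(X;C) + \beta\,\bigl(H(Y) - I(C;Y)\bigr) \\
  &= I(X;C) - \beta\, I(C;Y) + \underbrace{\beta\, H(Y)}_{\text{constant}}
\end{align*}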
9. ID/IB Convergence
- The minimized term H(C) − H(C|X) + β·H(Y|C) is bounded
- Each iterative loop reduces its value (at time t) by
  - KL[ p_t(c) ‖ p_{t-1}(c) ]
  - + Σ_x p(x) · KL[ p_{t-1}(c|x) ‖ p_t(c|x) ]
  - + β · Σ_c p_t(c) · KL[ p_t(y|c) ‖ p_{t-1}(y|c) ]  > 0
- ⇒ convergence to a configuration of distributions that (locally) minimizes the term
10. CP Clustering Setting (reminder)
- Data pre-partitioned into w1, ..., wN, N ≥ 2
- Goal: revealing themes that cut across all subsets
[Diagram, as on slide 5: Subset w1, Subset w2, ..., Subset wN, each with its own partition; clusters cut across all subsets]
- Soft (probabilistic) assignments
- (the formalism may also allow soft pre-partitioning)
11. Cross-partition Clustering
- Input, in addition to p(x), p(y|x):
  - for each x and each w ∈ W (pre-given subset): probability of assignment to the pre-given subset, p(w|x)
  - In principle, p(w|x) can be any probability distribution over W; in our experiments: 0/1 hard partitioning
- Output:
  - p(c|x) (as in any probabilistic clustering)
  - A novel aspect: re-associating (re-assigning) features to clusters, p(c|y)
  - Two types of centroids: p(y|c,w), p(y|c)
12. The Cross-Partition Algorithm
- (1) Assignment probability p(c|x) (as in IB/ID):
  - p_t(c|x) ∝ exp( −β · KL[ p(y|x) ‖ p_{t-1}(y|c) ] )
- (2) Recalculate w-projected local centroids:
  - p_t(y|c,w) ∝ Σ_x p(x) · p_t(c|x) · p(y|x) · p(w|x)
- (3) Re-associate features with clusters:
  - p_t(c|y) ∝ Π_w p_t(y|c,w)^p(w)
- (4) Biased centroids, based on the above:
  - p_t(y|c) ∝ p_t(c|y) · p(y)
- (a code sketch of one iteration follows below)
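A minimal sketch of one cross-partition iteration, assuming p_x (|X|-vector), p_y_given_x (|X|×|Y|), p_w_given_x (|X|×|W|), a subset marginal p_w, a feature marginal p_y, and the previous centroids p_y_given_c (|C|×|Y|). The update rules follow steps (1)-(4) above in their unpriored form; the function name and normalization details are illustrative assumptions.

import numpy as np

EPS = 1e-12

def cp_iteration(p_x, p_y_given_x, p_w_given_x, p_w, p_y, p_y_given_c, beta=5.0):
    n_c = p_y_given_c.shape[0]
    # (1) p_t(c|x) proportional to exp(-beta * KL[p(y|x) || p_{t-1}(y|c)])
    pyx = np.clip(p_y_given_x, EPS, None)
    div = np.stack([(pyx * np.log(pyx / np.clip(p_y_given_c[c], EPS, None))).sum(axis=1)
                    for c in range(n_c)], axis=1)
    p_c_given_x = np.exp(-beta * div)
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    # (2) w-projected local centroids: p_t(y|c,w) proportional to sum_x p(x) p_t(c|x) p(y|x) p(w|x)
    p_y_cw = np.einsum('x,xc,xy,xw->cwy', p_x, p_c_given_x, p_y_given_x, p_w_given_x)
    p_y_cw /= np.clip(p_y_cw.sum(axis=2, keepdims=True), EPS, None)
    # (3) re-associate features with clusters: p_t(c|y) proportional to prod_w p_t(y|c,w)^p(w)
    log_geo = np.einsum('w,cwy->cy', p_w, np.log(np.clip(p_y_cw, EPS, None)))
    p_c_given_y = np.exp(log_geo)
    p_c_given_y /= p_c_given_y.sum(axis=0, keepdims=True)      # normalize over clusters
    # (4) biased centroids: p_t(y|c) proportional to p_t(c|y) * p(y)
    p_y_given_c_new = p_c_given_y * p_y[None, :]
    p_y_given_c_new /= np.clip(p_y_given_c_new.sum(axis=1, keepdims=True), EPS, None)
    return p_c_given_x, p_y_cw, p_c_given_y, p_y_given_c_new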
13. Cross-partition Principles
- Assignments p(c|x) as in IB/ID (relying on the MaxEnt principle)
- Look for C that, jointly with W, is informative about the Y distribution
  - ⇒ local centroids p(y|c,w)
- Re-associate features with clusters, p(c|y), so that the associations are independent of W
  - (again relying on the MaxEnt principle)
14. 4 Different Terms to Optimize
- (1) F_CP1 ≡ H(C) − H(C|X) + β·H(Y|C)
  - H(Y|C) ≡ − Σ_x p(x) Σ_c p(c|x) Σ_y p(y|x) log p(y|c)
- (2) F_CP2 ≡ H(Y|C,W)  (assuming I(C;Y;W | X) = 0)
  - ≡ − Σ_x p(x) Σ_c p(c|x) Σ_y p(y|x) Σ_w p(w|x) log p(y|c,w)
- (3) F_CP3 ≡ H(C) − H(C|Y) + β·H(Y|C,W)
  - H(C|Y) ≡ − Σ_y p(y) Σ_c p(c|y) log p(c|y)
  - H(Y|C,W) ≡ − Σ_w p(w) Σ_y p(y) Σ_c p(c|y) log p(y|c,w)
- (4) F_CP4 ≡ H(Y|C) ≡ − Σ_y p(y) Σ_c p(c|y) log p(y|c)
15. Dynamics: ID/IB versus CP
[Diagram contrasting the terms that drive the iterative dynamics]
- Information Distortion: − H(C|X), H(Y|C)
- Cross Partition: − H(C|X), H(Y|C,W), H(Y|C), − H(C|Y)
16. Priored and Unpriored Variants
- As in the IB method, it is possible to add a prior in the iterative cycle update steps (depending on the exact terms being optimized); see the sketch after this list
- In step (1):
  - p_t(c|x) ∝ p_{t-1}(c) · exp( −β · KL[ p(y|x) ‖ p_{t-1}(y|c) ] )
- In step (3):
  - p_t(c|y) ∝ p_t(c) · Π_w p_{t-1}(y|c,w)^p(w)
- So there are four variants of the CP method (with/without prior in step (1)/(3))
- The previous works mentioned before (CoNLL 2002, NIPS 2004) in fact implement two of these variants
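A sketch of the two priored update steps, assuming the arrays div (|X|×|C| KL divergences from step (1)) and log_geo (|C|×|Y| weighted log of the local centroids from step (3)) are computed as in the cp_iteration sketch above, plus a current cluster prior p_c; all names here are illustrative.

import numpy as np

def priored_step1(p_c, div, beta):
    # p_t(c|x) proportional to p_{t-1}(c) * exp(-beta * KL[p(y|x) || p_{t-1}(y|c)])
    p_c_given_x = p_c[None, :] * np.exp(-beta * div)
    return p_c_given_x / p_c_given_x.sum(axis=1, keepdims=True)

def priored_step3(p_c, log_geo):
    # p_t(c|y) proportional to p_t(c) * prod_w p(y|c,w)^p(w)
    p_c_given_y = p_c[:, None] * np.exp(log_geo)
    return p_c_given_y / p_c_given_y.sum(axis=0, keepdims=True)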
17. Religion Data
- Given: 5 corpora focused on religions: Buddhism, Christianity, Hinduism, Islam and Judaism
  - Encyclopedic entries, online magazines, introductory web articles
- Co-occurrence statistics (a counting sketch follows below):
  - 200 (automatically extracted) keywords for each religion (the same word appearing in two corpora is taken as two distinct elements)
  - Count feature words (7000) in an undirected ±5 window, truncated by sentence boundaries
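A minimal sketch of the co-occurrence counting described above, assuming the corpora are available as lists of tokenized sentences; the names (cooccurrence_counts, window) are illustrative, and keyword extraction itself is not shown.

from collections import Counter, defaultdict

def cooccurrence_counts(sentences, keywords, feature_words, window=5):
    # sentences: list of token lists; returns {keyword: Counter(feature word -> count)}
    keywords = set(keywords)
    feature_words = set(feature_words)
    counts = defaultdict(Counter)
    for sent in sentences:                       # the window never crosses a sentence boundary
        for i, tok in enumerate(sent):
            if tok not in keywords:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in feature_words:
                    counts[tok][sent[j]] += 1    # undirected: left and right contexts counted alike
    return counts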
18. Clusters Reveal Meaningful Themes
- Two clusters: spiritual vs. establishment aspects
- Seven clusters (our titles; highest-p(c|y) features that did not have a dual role as clustered keywords):
  - schools: central, dominant, mainstream, affiliate
  - divinity: omnipotent, almighty, mercy, infinite
  - religious experience: intrinsic, mental, realm, mature
  - writings: commentary, manuscript, dictionary, grammar
  - festivals and rites: annual, funeral, rebuild, feast
  - sin and suffering: vegetable, insect, penalty, quench
  - community and family: parent, nursing, spouse, elderly
- Interesting correspondence to a classical comparative-religion work: dimensions of the sacred (Smart, 1999)
  - ritual, mythic, experiential/emotional, ethical, social, material
19. Evaluation
- Measure how well the output clusters capture expert clusters of freely chosen keywords
- We did not instruct the experts on how to choose the words
- Three experts participated; each of them covered a different subset of all possible religion comparisons
20. Example: Good Match to Expert
- Our "sacred writings" cross-partition cluster
  - terms used by the expert are underlined
  - first 15 words per religion shown (p(c|x) indicated)
- Corresponding expert cluster (a marked "hit" scores high in another cluster)
21. Example: Poor Match to Expert
- Our "suffer, sin and punishment" CP cluster
- The "mysticism" expert cluster was the most closely relevant
22. Quantitative Evaluation
- Comparing religion pairs (|W| = 2), hard clustering
- Evaluation restricted to terms common to two experts
- Jaccard coefficient: the proportion between (see the sketch after this list)
  - the number of word pairs co-assigned to the same cluster (no matter which) by both expert and algorithm, and
  - that number plus the number of pairs co-assigned by one but not by the other
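A minimal sketch of this pairwise Jaccard coefficient, assuming each clustering is given as a dict mapping a term to its (hard) cluster label; the function name and input format are illustrative.

from itertools import combinations

def pairwise_jaccard(expert, algorithm):
    # restrict to terms clustered by both; count word pairs co-assigned by both vs. by at least one
    common = sorted(set(expert) & set(algorithm))
    both = either = 0
    for w1, w2 in combinations(common, 2):
        same_expert = expert[w1] == expert[w2]
        same_algo = algorithm[w1] == algorithm[w2]
        if same_expert and same_algo:
            both += 1
        if same_expert or same_algo:
            either += 1
    return both / either if either else 0.0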
23. Conclusion
- Focus on particular features: ones that are important in the context of identifying commonalities across domains
- Applicable to real-world data
- Principled information-theoretic approach
- Future work:
  - Detect relational structure
  - Problem solving
  - More applications: commercial products, legal
24. Thank You
- Time
- Last session
- Patience
- (Last session!)