Title: CS 430: Information Discovery
1CS 430 Information Discovery
Lecture 23 Cluster Analysis 2 Thesaurus
Construction
2Course Administration
Next week: Guest lecture on Thursday, Thorsten Joachims.
Final examination: The final examination will include questions on all
lectures, including the guest lectures, and the readings for the
discussion classes.
Examination date: Wednesday, December 18, 12:00 noon - 1:30 p.m.
Early examination: Thursday, December 12, 12:00 noon - 1:30 p.m.
Contact Anat Nidar-Levi (anat_at_cs.cornell.edu) if you plan to take the
early examination.
3Example 2 Concept Spaces for Scientific Terms
Large-scale searches can only match terms
specified by the user to terms appearing in
documents. Cluster analysis can be used to
provide information retrieval by concepts, rather
than by terms.
Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin,
Ann P. Bishop (University of Illinois), Hsinchun Chen (University of
Arizona), "Federating Diverse Collections of Scientific Literature",
IEEE Computer, May 1996.
4Concept Spaces Methodology
Concept space: A similarity matrix based on co-occurrence of terms.
Approach: Use cluster analysis to generate "concept spaces"
automatically, i.e., clusters of terms that embrace a single semantic
concept. Arrange concepts in a hierarchical classification.
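
As a minimal sketch of a similarity matrix based on co-occurrence of terms, the Python below scores each pair of terms by the Jaccard coefficient of the sets of documents in which they appear. The choice of Jaccard, the function name `cooccurrence_similarity`, and the toy documents are assumptions made for illustration; they are not the measure or data used in the concept space experiments.

```python
from itertools import combinations


def cooccurrence_similarity(docs):
    """Map each co-occurring term pair to the Jaccard similarity of their document sets."""
    postings = {}  # term -> set of ids of the documents that contain it
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)

    sims = {}
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:  # keep only pairs that co-occur in at least one document
            sims[(a, b)] = shared / len(postings[a] | postings[b])
    return sims


# Hypothetical toy collection: one list of index terms per document.
docs = [
    ["bridge", "wind", "load"],
    ["bridge", "wind"],
    ["cable", "current", "load"],
]
sims = cooccurrence_similarity(docs)
```

Pairs that never co-occur are simply absent from the result, which keeps the matrix sparse; that matters at the scale of hundreds of thousands of terms.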
5Concept Spaces INSPEC Data
Data set 1: All terms in 400,000 records from INSPEC, containing
270,000 terms with 4,000,000 links. 24.5 hours of CPU on a 16-node
Silicon Graphics supercomputer.
computer-aided instruction
  see also education
  UF teaching machines
  BT educational computing
  TT computer applications
  RT education
  RT teaching
6Concept Space Compendex Data
Data set 2:
(a) 4,000,000 abstracts from the Compendex database, covering all of
engineering, as the collection, partitioned along classification code
lines into some 600 community repositories. Four days of CPU on a
64-processor Convex Exemplar.
(b) In the largest experiment, 10,000,000 abstracts were divided into
sets of 100,000 and the concept space for each set was generated
separately. The sets were selected by the existing classification scheme.
7Objectives
- Semantic retrieval (using concept spaces for term suggestion)
- Semantic interoperability (vocabulary switching across subject domains)
- Semantic indexing (concept identification of document content)
- Information representation (information units for uniform manipulation)
8Use of Concept Space Term Suggestion
9Future Use of Concept Space Vocabulary Switching
"I'm a civil engineer who designs bridges. I'm
interested in using fluid dynamics to compute the
structural effects of wind currents on long
structures. Ocean engineers who design undersea
cables probably do similar computations for the
structural effects of water currents on long
structures. I want you [the system] to change my
civil engineering fluid dynamics terms into the
ocean engineering terms and search the undersea
cable literature."
10Example 3 Visual thesaurus for browsing large
collections of geographic images
Methodology:
- Divide images into small regions.
- Create a similarity measure based on properties of these images.
- Use cluster analysis tools to generate clusters of similar images.
- Provide alternative representations of clusters.
Marshall Ramsey, Hsinchun Chen, Bin Zhu, "A Collection of Visual
Thesauri for Browsing Large Collections of Geographic Images", May 1997.
(http://ai.bpa.arizona.edu/mramsey/papers/visualThesaurus/visualThesaurus.html)
12Information Visualization
The human eye is excellent at identifying patterns in graphical data.
- Trends in time-dependent data.
- Broad patterns in complex data.
- Anomalies in scientific data.
- Visualizing information spaces for browsing.
13Pad
Concept: A large collection of information viewed at many different
scales. Imagine a collection of documents spread out on an enormous wall.
Zoom: Zoom out and see the whole collection with little detail. Zoom in
part way to see sections of the collection. Zoom in to see every detail.
Semantic zooming: Objects change appearance when they change size, so as
to be most meaningful. (Compare maps.)
Performance: Rendering operations are timed so that the frame refresh
rate remains constant during pans and zooms.
14Pad File Browser
15Pad File Browser
16Pad File Browser
17Example Tilebars
The figure represents a set of hits from a text
search. Each large rectangle represents a
document or section of text. Each row represents
a search term or subquery. The density of each
small square indicates the frequency with which a
term appears in a section of a document.
Hearst 1995
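
The grid described above can be sketched in a few lines of Python: one row per search term, one column per section of text, and a cell whose shade grows with the term count. The helper names `tilebar` and `render`, the whitespace tokenization, the four-level shading string, and the sample segments are all assumptions for illustration; this is not Hearst's implementation, which segments the text by topic rather than taking sections as given.

```python
def tilebar(segments, query_terms):
    """Return a grid: rows are query terms, columns are segments, cells are term counts."""
    grid = []
    for term in query_terms:
        row = [segment.lower().split().count(term.lower()) for segment in segments]
        grid.append(row)
    return grid


def render(grid, shades=" .:#"):
    """Draw each count as a character; darker characters mean higher frequency."""
    darkest = len(shades) - 1
    return ["".join(shades[min(count, darkest)] for count in row) for row in grid]


# Hypothetical document split into three sections.
segments = [
    "cluster analysis of terms",
    "terms and more terms",
    "maps and images",
]
grid = tilebar(segments, ["terms", "cluster"])
rows = render(grid)
```

Reading across a row shows where in the document a term is concentrated; reading down a column shows which terms overlap in the same section.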
18Self Organizing Maps (SOM)
19Automatic Thesaurus Construction
Approach:
- Select a subject domain.
- Choose a corpus of documents that cover the domain.
- Create a vocabulary by extracting terms, normalization,
  precoordination of phrases, etc.
- Devise a measure of similarity between terms and thesaurus classes.
- Cluster terms into thesaurus classes, using complete linkage or
  another clustering method that generates compact clusters.
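
The clustering step can be illustrated with a small agglomerative procedure using complete linkage, where the similarity between two clusters is that of their least similar pair of members, which is what keeps clusters compact. This is a sketch under stated assumptions: the function name `complete_linkage`, the stopping threshold, and the toy similarity values are hypothetical, and the quadratic search is only suitable for small vocabularies.

```python
def complete_linkage(items, sim, threshold):
    """Merge clusters greedily while the best complete-linkage score stays >= threshold."""
    clusters = [[item] for item in items]

    def linkage(c1, c2):
        # Complete linkage: a cluster pair is only as similar as its weakest member pair.
        return min(sim(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best_score, (i, j) = max(
            (linkage(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term-term similarities; unlisted pairs count as 0.0.
toy = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["car", "vehicle"]): 0.7,
    frozenset(["automobile", "vehicle"]): 0.6,
    frozenset(["river", "stream"]): 0.8,
}


def toy_sim(a, b):
    return toy.get(frozenset([a, b]), 0.0)


classes = complete_linkage(
    ["car", "automobile", "vehicle", "river", "stream"], toy_sim, threshold=0.5
)
```

With these values "vehicle" joins the "car"/"automobile" class only because it is similar to both members; under single linkage one strong link would have been enough.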
20Decisions in creating a thesaurus
1. Which terms should be included in the thesaurus?
2. How should the terms be grouped?
21Terms to include
- Only terms that are likely to be of interest for content
  identification.
- Ambiguous terms should be coded for the senses likely to be important
  in the document collection.
- Each thesaurus class should have approximately the same frequency of
  occurrence.
- Terms of negative discrimination should be eliminated.
(after Salton and McGill)
22Discriminant value
Discriminant value is the degree to which a term is able to discriminate
between the documents of a collection:
$$\text{DV}_k = (\text{average document similarity without term } k) - (\text{average document similarity with term } k)$$
Good discriminators decrease the average document similarity, so
removing one raises the average and its discriminant value is positive.
Note that this definition uses the document similarity.
23Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf   length
D1      1      1       1       1      1      1       1      √7
D2      1                      1                     1      √3
D3             1       1              1      1              √4
D4      1                      1             1       1      √4
24Document similarity matrix
       D1     D2     D3     D4
D1            0.65   0.76   0.76
D2     0.65          0.00   0.87
D3     0.76   0.00          0.25
D4     0.76   0.87   0.25

Average similarity = 0.55
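
The values in this matrix are consistent with the cosine measure applied to the binary incidence array (shared terms divided by the square root of the product of the two term counts). The sketch below recomputes them from the four example documents; the function names are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf".split(),
    "D2": "golf golf golf delta alpha".split(),
    "D3": "bravo charlie bravo echo foxtrot bravo".split(),
    "D4": "foxtrot alpha alpha golf golf delta".split(),
}


def cosine_binary(a, b):
    """Cosine similarity of two documents treated as sets of terms (binary incidence)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def similarity_matrix(docs):
    """Return {(doc_p, doc_q): similarity} for every unordered pair of documents."""
    return {
        (p, q): cosine_binary(docs[p], docs[q])
        for p, q in combinations(sorted(docs), 2)
    }


matrix = similarity_matrix(DOCS)
average = sum(matrix.values()) / len(matrix)
```

For example, D1 and D2 share three terms and contain seven and three distinct terms, giving 3 / sqrt(21), which rounds to 0.65.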
25Discriminant value
Average similarity = 0.55

term      average similarity without term     DV
alpha                  0.53                  -0.02
bravo                  0.56                   0.01
charlie                0.56                   0.01
delta                  0.53                  -0.02
echo                   0.56                   0.01
foxtrot                0.52                  -0.03
golf                   0.53                  -0.02
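bravo, charlie and echo have positive discriminant values: removing
them raises the average similarity, so they are the good discriminators.
alpha, delta, foxtrot and golf have negative discriminant values.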
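
The table can be reproduced by removing each term in turn and recomputing the average document similarity. The sketch below assumes the binary cosine measure, which matches the averages shown; the function names are hypothetical. Because it subtracts unrounded averages, a computed discriminant value can differ in the last digit from one obtained by subtracting the rounded figures in the table.

```python
from itertools import combinations
from math import sqrt

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf".split(),
    "D2": "golf golf golf delta alpha".split(),
    "D3": "bravo charlie bravo echo foxtrot bravo".split(),
    "D4": "foxtrot alpha alpha golf golf delta".split(),
}


def average_similarity(docs):
    """Mean binary cosine similarity over all unordered document pairs."""
    sets = [set(terms) for terms in docs.values()]
    pairs = list(combinations(sets, 2))
    total = sum(len(a & b) / sqrt(len(a) * len(b)) for a, b in pairs if a and b)
    return total / len(pairs)


def discriminant_values(docs):
    """Return the baseline average and {term: (average without term, DV)}."""
    baseline = average_similarity(docs)
    vocabulary = sorted({t for terms in docs.values() for t in terms})
    values = {}
    for term in vocabulary:
        reduced = {d: [t for t in terms if t != term] for d, terms in docs.items()}
        without = average_similarity(reduced)
        values[term] = (without, without - baseline)
    return baseline, values


baseline, values = discriminant_values(DOCS)
```

Each entry of `values` pairs the average similarity without the term with the resulting difference from the baseline, so the sign of the second element shows directly whether removing the term pulls documents together or pushes them apart.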
26Phrase construction
In a thesaurus, term classes may contain phrases.
Informal definitions:
pair-frequency(i, j) is the frequency with which a pair of words occurs
in context (e.g., in succession within a sentence).
A phrase is a pair of words, i and j, that occur in context with a
higher frequency than would be expected from their overall frequency.
$$\text{cohesion}(i, j) = \frac{\text{pair-frequency}(i, j)}{\text{frequency}(i) \cdot \text{frequency}(j)}$$
27Phrase construction
Salton and McGill algorithm:
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair as a
   phrase.
Automatic phrase construction by statistical methods is rarely used in
practice. There is promising research on phrase identification using
methods of computational linguistics.
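
The four steps can be sketched as follows, taking "in context" to mean adjacent words within a sentence and cohesion to be the pair frequency divided by the product of the two word frequencies. The function name `find_phrases`, the whitespace tokenization, both threshold values, and the sample sentences are assumptions chosen for illustration.

```python
from collections import Counter


def find_phrases(sentences, min_pair_freq, min_cohesion):
    """Return {(word_i, word_j): cohesion} for adjacent word pairs that pass both thresholds."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        pair_freq.update(zip(words, words[1:]))  # step 1: count adjacent pairs

    phrases = {}
    for (i, j), count in pair_freq.items():
        if count < min_pair_freq:  # step 2: reject infrequent pairs
            continue
        cohesion = count / (word_freq[i] * word_freq[j])  # step 3
        if cohesion >= min_cohesion:  # step 4: keep cohesive pairs as phrases
            phrases[(i, j)] = cohesion
    return phrases


# Hypothetical sample text.
sentences = [
    "information retrieval uses cluster analysis",
    "cluster analysis groups the terms",
    "the thesaurus groups the terms",
    "information retrieval needs the thesaurus",
]
phrases = find_phrases(sentences, min_pair_freq=2, min_cohesion=0.4)
```

In this sample, pairs containing "the" occur as often as "cluster analysis" but score a lower cohesion because "the" is frequent overall, so only the two content-word pairs survive the second threshold.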