Title: Chapter 7 Text Operations
1Chapter 7 Text Operations
- Hsin-Hsi Chen
- Department of Computer Science and Information Engineering
- National Taiwan University
2Logical View of a Document
[Figure: logical view of a document. Labels: document; text + structure; structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing. The representation ranges from full text to index terms.]
3Document Preprocessing
- Lexical analysis
- Elimination of stopwords
- Stemming of the remaining words
- Selection of index terms
- Construction of term categorization structures
4Lexical Analysis for Automatic Indexing
- Lexical analysis: convert an input stream of characters into a stream of words or tokens.
- What is a word or a token? Tokens consist of letters.
- digits: Most numbers are not good index terms. Counterexamples: case numbers in a legal database; B6 and B12 in a vitamin database.
- hyphens
  - break hyphenated words: state-of-the-art → state of the art
  - keep hyphenated words as a token: Jean-Claude, F-16
5Lexical Analysis for Automatic Indexing (Continued)
- punctuation marks: often used as parts of terms, e.g., OS/2, 510B.C.
- case: usually not significant in index terms
- Issues: recall and precision
  - breaking up hyphenated terms: increases recall but decreases precision
  - preserving case distinctions: enhances precision but decreases recall
  - commercial information systems: usually take a recall-enhancing approach (numbers and words containing digits are index terms, and all are case insensitive)
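The design choices above (digits, hyphens, case) can be sketched as switches on a small tokenizer. This is only an illustration: the function name `tokenize` and its three flags are hypothetical, and the regular expression is one possible definition of a token, not the one used by any particular system.

```python
import re

# One possible token pattern: runs of letters/digits, optionally joined by hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text, keep_digits=False, split_hyphens=True, fold_case=True):
    """Turn a character stream into tokens under the chosen indexing policy."""
    if fold_case:
        text = text.lower()          # case is usually not significant
    tokens = TOKEN_PATTERN.findall(text)
    if split_hyphens:
        # state-of-the-art -> state, of, the, art (favors recall)
        tokens = [part for tok in tokens for part in tok.split("-")]
    if not keep_digits:
        # drop any token containing a digit (most numbers are poor index terms)
        tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    return tokens

recall_oriented = tokenize("State-of-the-art OS/2")
precision_oriented = tokenize("Jean-Claude flew an F-16",
                              keep_digits=True, split_hyphens=False, fold_case=False)
```

With the default flags the hyphenated term is broken up and the digit token is dropped; with the second set of flags Jean-Claude and F-16 survive as single tokens.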
6Lexical Analysis for Query Processing
- Tasks
  - depend on the design strategies of the lexical analyzer for automatic indexing (search terms must match index terms)
  - distinguish operators such as Boolean operators, stemming or truncating operators, and weighting functions
  - distinguish grouping indicators such as parentheses and brackets
7stoplist (negative dictionary)
- Avoid retrieving almost every item in a database regardless of its relevance.
- Examples
  - conservative approach (ORBIT Search Service): and, an, by, from, of, the, with
  - (derived from Brown corpus) 425 words: a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...
- Articles, prepositions, conjunctions, ...
8Chinese Stop Words?
- ? Neu 58388 ? Nh 40332 ? D 39014
- ? Di 31873 ? Nh 30025 ? D 29646
- ? D 29211 ? Na 24269 ? D 20403
- ? VE 19625 ?? Nh 18152 ? Nh 17298
- ? D 15955 ? D 14066 ? Dfa 13013
- ? VH 11577 ? D 11125 ? Di 11026
- ? Nh 10776 ? D 9698 ?? D 9670
- ? Dfa 9416 ?? Nh 9069 ? D 8992
- ? D 8869 ?? Nh 8818 ? Neu 8692
- ? D 8508 ? VG 8369 ? VH 8304
- ? D 8037 ? D 7858 ? D 7298
- ? Da 7266 ? D 7256 ...
9Implementing Stoplists
- approaches
  - examine lexical analyzer output and remove any stopwords
    - Every token must be looked up in the stoplist, and removed from further analysis if found.
    - A standard list searching problem
  - remove stopwords as part of lexical analysis
    - best implementation of stoplist
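The first approach (look up every token after lexical analysis) can be sketched with a hash-based set, which makes each lookup constant time on average. The stoplist below is the conservative ORBIT list quoted earlier; the function name `remove_stopwords` is a hypothetical choice for illustration.

```python
# Conservative stoplist from the ORBIT Search Service example.
STOPLIST = frozenset({"and", "an", "by", "from", "of", "the", "with"})

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop every token found in the stoplist (case-insensitive lookup)."""
    return [tok for tok in tokens if tok.lower() not in stoplist]

filtered = remove_stopwords(["The", "retrieval", "of", "documents"])
```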
10Stemming
- stem
  - Portion of a word which is left after the removal of its affixes
  - connect ← connected, connecting, connection, connections
- benefits of stemming?
  - Some favor the usage of stemming
  - Many Web search engines do not adopt any stemming algorithm
- issues
  - correctness
  - retrieval performance
  - compression performance
11Stemmers
- programs that relate morphologically similar indexing and search terms
- stem at indexing time
  - advantage: efficiency and index file compression
  - disadvantage: information about the full terms is lost
- example (CATALOG system), stem at search time
    Look for: system users
    Search Term: users
    Term        Occurrences
    1. user     15
    2. users    1
    3. used     3
    4. using    2
    Which terms (0=none, CR=all):
  The user selects the terms he wants by numbers.
12Conflation Methods
- manual
- automatic (stemmers)
  - affix removal: longest match vs. simple removal
  - successor variety
  - table lookup
  - n-gram
- evaluation
  - correctness
  - retrieval effectiveness
  - compression performance
Table lookup example:
    Term          Stem
    engineering   engineer
    engineered    engineer
    engineer      engineer
13Successor Variety
- Definition (successor variety of a string): the number of different characters that follow it in words in some body of text
- Example: a body of text: able, axle, accident, ape, about
  successor variety of "apple":
    1st letter (a): 4 (b, x, c, p)
    2nd letter (ap): 1 (e)
14Successor Variety (Continued)
- Idea: The successor variety of substrings of a term decreases as more characters are added until a segment boundary is reached, at which point the successor variety sharply increases.
- Example
    Test word: READABLE
    Corpus: ABLE, BEATABLE, FIXABLE, READS, READABLE, READING, RED, ROPE, RIPE
    Prefix      Successor Variety   Letters
    R           3                   E, O, I
    RE          2                   A, D
    REA         1                   D
    READ        3                   A, I, S
    READA       1                   B
    READAB      1                   L
    READABL     1                   E
    READABLE    1                   blank
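The table for READABLE can be reproduced with a few lines of code. This is a minimal sketch: the function name `successor_variety` and the use of the string "blank" as an end-of-word marker are illustrative assumptions.

```python
CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READS", "READABLE",
          "READING", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus, end_marker="blank"):
    """Return the set of distinct characters that follow `prefix` in the corpus."""
    successors = set()
    for word in corpus:
        if word.startswith(prefix):
            if len(word) > len(prefix):
                successors.add(word[len(prefix)])   # next letter after the prefix
            else:
                successors.add(end_marker)          # the prefix is a whole word
    return successors

word = "READABLE"
varieties = [len(successor_variety(word[:i], CORPUS)) for i in range(1, len(word) + 1)]
```

Here `varieties` holds one count per prefix of the test word, in the same order as the rows of the table.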
15The successor variety stemming process
- Determine the successor variety for a word.
- Use this information to segment the word.
  - cutoff method: a boundary is identified whenever the cutoff value is reached
  - peak and plateau method: a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it
  - complete word method: a break is made after a segment that is a complete word
  - entropy method
- Select one of the segments as the stem.
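As a sketch of the peak and plateau method, the function below takes a word and the successor varieties of its prefixes (here hard-coded from the READABLE example) and breaks the word after every strict peak. The name `peak_plateau_segments` is hypothetical, and treating only strict peaks as boundaries is an assumption; published variants differ in how plateaus are handled.

```python
def peak_plateau_segments(word, varieties):
    """Split `word` after each position whose successor variety is a strict peak.

    varieties[i] is the successor variety of the prefix word[:i + 1].
    """
    segments, start = [], 0
    # The first and last positions have no two-sided neighborhood, so skip them.
    for i in range(1, len(word) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

# Successor varieties for R, RE, REA, READ, READA, READAB, READABL, READABLE.
segments = peak_plateau_segments("READABLE", [3, 2, 1, 3, 1, 1, 1, 1])
```

The only peak is at READ (variety 3 between two 1s), so the word is segmented into READ and ABLE; the final step of the process would then pick one of these segments as the stem.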
16n-gram stemmers
- digram: a pair of consecutive letters
- shared digram method: association measures are calculated between pairs of terms
    $S = \dfrac{2C}{A + B}$
  where A is the number of unique digrams in the first word, B is the number of unique digrams in the second, and C is the number of unique digrams shared by the two words.
17n-gram stemmers (Continued)
- Example
    statistics  => st ta at ti is st ti ic cs
      unique digrams => at cs ic is st ta ti
    statistical => st ta at ti is st ti ic ca al
      unique digrams => al at ca ic is st ta ti
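The digram lists above can be generated and compared mechanically. The sketch below assumes the association measure is Dice's coefficient, S = 2C / (A + B), over the sets of unique digrams; the function names are hypothetical.

```python
def unique_digrams(word):
    """Set of distinct pairs of consecutive letters in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(word1, word2):
    """Dice coefficient over unique digrams: 2C / (A + B)."""
    d1, d2 = unique_digrams(word1), unique_digrams(word2)
    if not d1 and not d2:
        return 0.0
    return 2 * len(d1 & d2) / (len(d1) + len(d2))

a = unique_digrams("statistics")    # 7 unique digrams
b = unique_digrams("statistical")   # 8 unique digrams
similarity = digram_similarity("statistics", "statistical")
```

For this pair, A = 7, B = 8, and C = 6 (at, ic, is, st, ta, ti), so the measure is 12/15.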
18n-gram stemmers (Continued)
- similarity matrix: determine the similarity measures for all pairs of terms in the database
              word1   word2   word3   ...   wordn-1
    word1
    word2     S21
    word3     S31     S32
    ...
    wordn     Sn1     Sn2     Sn3     ...   Sn(n-1)
- terms are clustered using a single-link clustering method
- more a term clustering procedure than a stemming one
19Affix Removal Stemmers
- procedure: Remove suffixes and/or prefixes from terms, leaving a stem, and transform the resultant stem. E.g., Porter algorithm
- example: plural forms
    If a word ends in "ies" but not "eies" or "aies", then "ies" → "y"
    If a word ends in "es" but not "aes", "ees", or "oes", then "es" → "e"
    If a word ends in "s" but not "us" or "ss", then "s" → NULL
- ambiguity
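The three plural rules translate directly into a rule table. The sketch below makes one assumption that the slide leaves open: rules are tried in order, a rule applies only if the suffix matches and none of its exceptions do, and the first rule that applies wins. The names `PLURAL_RULES` and `strip_plural` are hypothetical.

```python
# (suffix, exception endings, replacement), tried in order.
PLURAL_RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]

def strip_plural(word):
    """Apply the first plural rule whose suffix matches and whose exceptions do not."""
    for suffix, exceptions, replacement in PLURAL_RULES:
        if word.endswith(suffix) and not word.endswith(exceptions):
            return word[:-len(suffix)] + replacement
    return word

examples = {w: strip_plural(w) for w in ["ponies", "horses", "cats", "class", "corpus"]}
```

Words ending in "ss" or "us" pass through unchanged, which illustrates how the exception lists guard against over-stemming.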
20Affix Removal Stemmers (Continued)
- longest match stemmer: remove the longest possible string of characters from a word according to a set of rules
  - recoding: AxC → AyC, e.g., ki → ky
  - partial matching: only the n initial characters of stems are used in comparing
- different versions: Lovins, Salton, Dawson, Porter, ...
  Students can refer to the rules listed in the appendix of the textbook (pp. 433-436).
21Index Term Selection(see Chapter 2)
22Fast Statistical Parsing of Noun Phrases for
Document Indexing
- Chengxiang Zhai
- Laboratory for Computational Linguistics
- Carnegie Mellon University
- (ANLP97, pp. 312-319)
23Phrases for Document Indexing
- Indexing by single words
  - single words are often ambiguous and not specific enough for accurate discrimination of documents
  - bank terminology vs. terminology bank
- Indexing by phrases
  - Syntactic phrases are almost always more specific than single words
- Indexing by single words and phrases
24No significant improvement?
- Fagan, Joel L., Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-syntactic Methods, Ph.D. thesis, Cornell University, 1987.
- Lewis, D., Representation and Learning in Information Retrieval, Ph.D. thesis, University of Massachusetts, 1991.
- Many syntactic phrases have very low frequency and tend to be over-weighted by normal weighting methods.
25Authors' points
- A larger document collection may increase the frequency of most phrases, and thus alleviate the problem of low frequency.
- The phrases are used only to supplement, not replace, the single words for indexing.
- The new issue is how to parse gigabytes of text in practically feasible time. (133 MHz DEC Alpha workstation, 8 hours/GB, 20 hours of training with 1 GB of text.)
26Experiment Design
- CLARIT commercial retrieval system
- original document set → CLARIT NP Extractor → Raw Noun Phrases → Statistical NP Parser, Phrase Extractor → Indexing Term Set → CLARIT Retrieval Engine
27Different Indexing Units
- example
  - heavy construction industry group (WSJ90)
- single words
  - heavy, construction, industry, group
- head-modifier pairs
  - heavy construction, construction industry, industry group
- full noun phrases
  - heavy construction industry group
28Different Indexing Units (Continued)
- WD-SET
  - single words only (no phrases, baseline)
- WD-HM-SET
  - single words + head-modifier pairs
- WD-NP-SET
  - single words + full NPs
- WD-HM-NP-SET
  - single words + head-modifier pairs + full NPs
29Result Analysis
- Collection: Tipster Disk 2 (250MB)
- Query: TREC-5 ad hoc topics (251-300)
- relevance feedback: top 10 documents returned from initial retrieval
- evaluation
  - total number of relevant documents retrieved
  - highest level of precision over all the points of recall
  - average precision
30Effects of phrases with feedback and TREC-5
31Summary
- When only one kind of phrase is used to supplement the single words, each can lead to a great improvement in precision.
- When we combine the two kinds of phrases, the effect is a greater improvement in recall rather than precision.
- How to combine and weight different phrases effectively becomes an important issue.
32Thesaurus Construction
- IR thesaurus (coordinate indexing and retrieval): a list of terms (words or phrases) along with relationships among them
- INSPEC thesaurus (1979): physics, EE, electronics, computers and control
    cesium (Cs)
      USE caesium                   (the preferred form)
    computer-aided instruction
      see also education            (cross-referenced terms)
      UF teaching machines          (a set of alternatives)
      BT educational computing      (broader terms, cf. NT)
      TT computer applications      (root node/top term)
      RT education, teaching        (related terms)
      CC C7810C                     (subject area)
      FC C7810Cf                    (subject area)
- For indexer and searcher
33Roget thesaurus
- example
  - cowardly (adjective)
  - Ignobly lacking in courage: "cowardly turncoats"
  - Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
34Functions of thesauri
- Provide a standard vocabulary for indexing and searching
- Assist users with locating terms for proper query formulation
- Provide classified hierarchies that allow the broadening and narrowing of the current query request
35Usage
- Indexing: Select the most appropriate thesaurus entries for representing the document.
- Searching: Design the most appropriate search strategy.
  - If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
  - If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
36Features of Thesauri
- Coordination Level: construction of phrases from individual terms
  - pre-coordination: phrases
    - phrases are available for indexing and retrieval
    - advantage: reducing ambiguity in indexing and searching
    - disadvantage: the searcher has to know the phrase formulation rules
  - post-coordination: words
    - phrases are constructed while searching
    - advantage: users do not worry about the exact word ordering
    - disadvantage: the search precision may fall, e.g., library school vs. school library
  - intermediate level: phrases and single words
    - the higher the level of coordination, the greater the precision of the vocabulary but the larger the vocabulary size
    - length of phrases? Two or three words, or more
37Features of Thesauri (Continued)
- Term Relationships
  - Aitchison and Gilchrist (1972)
    - equivalence relationships
      - synonymy: trade names, popular and local usage, superseded terms
      - quasi-synonymy, e.g., harshness and tenderness
    - hierarchical relationships, e.g., genus-species (dog and German shepherd), BT vs. NT
    - nonhierarchical relationships, e.g., thing-part (bus and seat), thing-attribute (rose and fragrance)
38Features of Thesauri (Continued)
- Wang, Vandendorpe, and Evens (1985)
  - parts-wholes, e.g., set-element, count-mass
  - collocation relations: words that frequently co-occur in the same phrase or sentence
  - paradigmatic relations: words that have the same semantic core, e.g., moon and lunar
  - taxonomy and synonymy
  - antonymy relations
39Features of Thesauri (Continued)
- Number of entries for each term
  - homographs: words with multiple meanings
  - each homograph entry is associated with its own set of relations
  - problem: how to select between alternative meanings
- Specificity of vocabulary
  - the precision associated with the component terms
  - a highly specific vocabulary promotes precision in retrieval (rules of phrase construction)
40Features of Thesauri (Continued)
- Control on term frequency of class members
  - for statistical thesaurus construction methods
  - terms included in the same thesaurus class have roughly equal frequencies
  - the total frequency in each class should also be roughly similar
- Normalization of vocabulary
  - terms should be in noun form
  - noun phrases should avoid prepositions unless they are commonly known
  - a limited number of adjectives should be used
  - singularity, ordering, spelling, capitalization, transliteration, abbreviations, ...
41Thesaurus Construction
- manual thesaurus construction
  - define the boundaries of the subject area
  - collect the terms for each subarea
    sources: indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, relevant thesauri, vocabulary systems, ...
  - organize the terms and their relationships into structures
  - review (and refine) the entire thesaurus for consistency
- automatic thesaurus construction
  - from a collection of document items
  - by merging existing thesauri
42Thesaurus Construction from Texts
1. Construction of vocabulary: normalization and selection of terms; phrase construction depending on the coordination level desired
2. Similarity computations between terms: identify the significant statistical associations between terms
3. Organization of vocabulary: organize the selected vocabulary into a hierarchy on the basis of the associations computed in step 2
43Construction of Vocabulary
- Objective: identify the most informative terms (words and phrases)
- Procedure
  (1) Identify an appropriate document collection. The document collection should be sizable and representative of the subject area.
  (2) Determine the required specificity for the thesaurus.
  (3) Normalize the vocabulary terms.
      (a) Eliminate very trivial words such as prepositions and conjunctions.
      (b) Stem the vocabulary.
  (4) Select the most interesting stems, and create interesting phrases for a higher coordination level.
44Stem evaluation and selection
- selection by frequency of occurrence
  - each term may belong to a category of high, medium, or low frequency
  - terms in the mid-frequency range are the best for indexing and searching
45Stem evaluation and selection (Continued)
- selection by discrimination value (DV)
  - the more discriminating a term, the higher its value as an index term
  - procedure
    - Compute the average inter-document similarity in the collection.
    - Remove the term K from the indexing vocabulary, and recompute the average similarity.
    - DV(K) = (average similarity without K) - (average similarity with K)
    - The DV for good discriminators is positive.
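The discrimination value procedure can be sketched directly from its definition. The example assumes documents are represented as term-weight dictionaries and uses cosine similarity as the inter-document measure; both choices, the function names, and the three toy documents are illustrative assumptions rather than part of the original procedure.

```python
from itertools import combinations
from math import sqrt

def cosine(d1, d2):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = sqrt(sum(w * w for w in d1.values()))
    n2 = sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def average_similarity(docs):
    """Average similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(term, docs):
    """DV(K) = (average similarity without K) - (average similarity with K)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return average_similarity(without) - average_similarity(docs)

# Toy collection: "the" occurs everywhere, the other terms in one document each.
docs = [{"the": 1, "cat": 1}, {"the": 1, "dog": 1}, {"the": 1, "fish": 1}]
dv_the = discrimination_value("the", docs)
dv_cat = discrimination_value("cat", docs)
```

Removing the ubiquitous term makes the documents less similar (negative DV, a poor discriminator), while removing a rare term makes them more similar (positive DV, a good discriminator).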
46Phrase Construction
Decrease the frequency of high-frequency terms and increase their value for retrieval
- Salton and McGill procedure (vs. syntactic/semantic methods)
  1. Compute pairwise co-occurrence for high-frequency words.
  2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.
  3. For pairs that qualify, compute the cohesion value:
       $\mathrm{COHESION}(t_i, t_j) = \dfrac{\text{co-occurrence-frequency}}{\sqrt{\mathrm{frequency}(t_i) \times \mathrm{frequency}(t_j)}}$
       $\mathrm{COHESION}(t_i, t_j) = \text{size-factor} \times \dfrac{\text{co-occurrence-frequency}}{\mathrm{frequency}(t_i) \times \mathrm{frequency}(t_j)}$
     where size-factor is the size of the thesaurus vocabulary.
  4. If cohesion is above a second threshold, retain the phrase.
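Both cohesion formulas are one-liners once the frequencies are known. The sketch below uses hypothetical function names and made-up counts purely to show the arithmetic; the thresholds in steps 2 and 4 would be chosen per collection.

```python
from math import sqrt

def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))."""
    return cooccurrence / sqrt(freq_i * freq_j)

def cohesion_size_factor(cooccurrence, freq_i, freq_j, size_factor):
    """size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))."""
    return size_factor * cooccurrence / (freq_i * freq_j)

# Made-up counts: the pair co-occurs 4 times; the terms occur 8 and 2 times;
# the thesaurus vocabulary has 100 entries.
c1 = cohesion_sqrt(4, 8, 2)
c2 = cohesion_size_factor(4, 8, 2, size_factor=100)
```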
47Phrase Construction (Continued)
- Choueka procedure
  1. Select the range of length allowed for each collocational expression, e.g., 2-6 words.
  2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency.
  3. Delete sequences that begin or end with a trivial word (e.g., prepositions, pronouns, articles, conjunctions, etc.).
  4. Delete expressions that contain high-frequency nontrivial words.
  5. Given an expression, evaluate any potential sub-expressions for relevance (e.g., abcd → abc and bcd). Discard any that are not sufficiently relevant.
  6. Try to merge smaller expressions into larger and more meaningful ones.
48Similarity Computation
- Cosine: compute the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second term.
- Dice: compute the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.
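Both measures can be computed from the sets of documents in which each term occurs. The sketch follows the wording above literally; note that the Dice coefficient is often defined with an extra factor of 2 in the numerator, which is not used here. Function names and the example posting sets are hypothetical.

```python
from math import sqrt

def cosine_term_similarity(docs_a, docs_b):
    """|A and B| / sqrt(|A| * |B|), where A and B are sets of document ids."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / sqrt(len(docs_a) * len(docs_b))

def dice_term_similarity(docs_a, docs_b):
    """|A and B| / (|A| + |B|), as worded on the slide (no factor of 2)."""
    total = len(docs_a) + len(docs_b)
    return len(docs_a & docs_b) / total if total else 0.0

# Hypothetical postings: documents containing each term.
term1_docs = {1, 2, 3, 4}
term2_docs = {3, 4, 5, 6}
cos_value = cosine_term_similarity(term1_docs, term2_docs)
dice_value = dice_term_similarity(term1_docs, term2_docs)
```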
49Vocabulary Organization
Assumptions
(1) High-frequency words have broad meaning, while low-frequency words have narrow meaning.
(2) If the density functions of two terms have the same shape, then the two words have similar meaning.
1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1.
3. The highest frequency class is assigned level 0, the next, level 1, and so on.
4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute the similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents.
5. After all terms in level i have been linked to level i-1 terms, check the level i-1 terms and identify those that have no children. Propagate such terms to level i by creating an identical dummy term as its child.
6. Perform steps 4 and 5 for each level, starting with level 1.
50Merging Existing Thesauri
- simple merge: link hierarchies wherever they have terms in common
- complex merge
  - link terms from different hierarchies if they are similar enough
  - similarity is a function of the number of parent and child terms in common
51Document Clustering
- Searching vs. Browsing
- Disadvantages in using inverted index files
  - information pertaining to a document is scattered among many different inverted-term lists
  - information relating to different documents with similar term assignment is not in close proximity in the file system
- Approaches
  - inverted-index files (for searching) + clustered document collection (for browsing)
  - clustered file organization (for searching and browsing)
52Typical Clustered File Organization
[Figure: typical clustered file organization. The complete space is divided into superclusters and clusters; the hierarchy runs Hypercentroid, Supercentroids, Centroids, Documents.]
53Search Strategy for Clustered Documents
[Figure: search strategy for clustered documents. A typical search path runs from the highest-level centroid through supercentroids and centroids down to documents.]
54Cluster Generation VS Cluster Search
- Cluster structure is generated only once.
- Cluster maintenance can be carried out at relatively infrequent intervals.
- Cluster generation process may be slower and more expensive.
- Cluster search operations may have to be performed continually.
- Cluster search operations must be carried out efficiently.
55Hierarchical Cluster Generation
- Two strategies
  - pairwise item similarities
  - heuristic methods
- Models
  - Divisive Clustering (top down)
    - The complete collection is assumed to represent one complete cluster.
    - Then the collection is subsequently broken down into smaller pieces.
  - Agglomerative Clustering (bottom up)
    - Individual item similarities are used as a starting point.
    - A gluing operation collects similar items, or groups, into larger groups.
56Term clustering from column viewpoint; document clustering from row viewpoint
57A Naive Program for Hierarchical Agglomerative
Clustering
1. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients).
2. Place each of the N documents into a class of its own.
3. Form a new cluster by combining the most similar pair of current clusters i and j;
   update the similarity matrix by deleting the rows and columns corresponding to i and j;
   calculate the entries in the row corresponding to the new cluster i+j.
4. Repeat step 3 if the number of clusters left is greater than 1.
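A naive version of this program fits in a few lines. The sketch below is an illustration under stated assumptions: instead of updating a similarity matrix in place, it recomputes each cluster-to-cluster similarity from the item pairs, which gives the same result for single link (max) and complete link (min). The pairwise values for items A-F are read off the worked example in the following slides, taking AC = 0.5 as in the single-link matrices; all function names are hypothetical.

```python
from itertools import combinations

# Pairwise item similarities for A-F (AC = 0.5, as in the single-link matrices).
SIM = {
    ("A", "F"): 0.9, ("A", "E"): 0.8, ("B", "F"): 0.8, ("B", "E"): 0.7,
    ("A", "D"): 0.6, ("A", "C"): 0.5, ("B", "D"): 0.5, ("C", "E"): 0.5,
    ("B", "C"): 0.4, ("D", "E"): 0.4, ("A", "B"): 0.3, ("C", "D"): 0.3,
    ("E", "F"): 0.3, ("C", "F"): 0.2, ("D", "F"): 0.1,
}

def item_sim(a, b):
    """Symmetric lookup of the similarity between two items."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]

def cluster_sim(c1, c2, linkage):
    """linkage=max gives single link; linkage=min gives complete link."""
    return linkage(item_sim(a, b) for a in c1 for b in c2)

def agglomerate(items, linkage):
    """Merge the most similar pair of clusters until one cluster remains."""
    clusters = [(x,) for x in items]          # step 2: one class per item
    merges = []
    while len(clusters) > 1:                  # step 4
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_sim(pair[0], pair[1], linkage))
        level = cluster_sim(c1, c2, linkage)
        clusters.remove(c1)                   # step 3: replace i and j by i+j
        clusters.remove(c2)
        merged = tuple(sorted(c1 + c2))
        clusters.append(merged)
        merges.append((level, merged))
    return merges

single_link = agglomerate("ABCDEF", max)
complete_link = agglomerate("ABCDEF", min)
```

Each entry of `single_link` and `complete_link` records the similarity level of a merge and the resulting cluster. When two pairs tie, the order in which they are merged depends on iteration order, but the sequence of merge levels does not.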
58How to Combine Clusters?
- Single-link clustering
  - Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class.
  - The similarity between a pair of clusters is taken to be the similarity between the most similar pair of items.
  - Each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster.
[Figure: two clusters, c1 = {e11, e12, e13} and c2 = {e21, e22, e23, e24}.]
Let (e13, e21) be the most similar pair between c1 and c2, and let its distance be dist(e13, e21). Then
  $\forall p \in c_1\,(c_2),\ \exists q \in c_1\,(c_2),\ p \neq q$, such that $\mathrm{dist}(p, q) < \mathrm{dist}(e_{13}, e_{21})$.
Because dist(e13, e21) is the smallest distance between the two clusters, dist(p, q) < dist(e13, e21) implies dist(p, q) < dist(p, r) for every r in the other cluster.
59How to Combine Clusters? (Continued)
- Complete-link clustering
  - Each document has a similarity to all other documents in the same class that exceeds the threshold value.
  - The similarity between the least similar pair of items from the two clusters is used as the cluster similarity.
  - Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster.
[Figure: two clusters, c1 = {e11, e12, e13} and c2 = {e21, e22, e23, e24}.]
Let (e12, e24) be the least similar pair between c1 and c2, and let its distance be dist(e12, e24). For every p in c1 (c2), let q be the member of c1 most dissimilar to p, i.e., dist(p, q) > dist(p, r) for all r in c1 (r ≠ q). Because dist(p, q) < dist(e12, e24), it follows that dist(p, r) < dist(e12, e24).
60How to Combine Clusters? (Continued)
- Group-average clustering
  - a compromise between the extremes of single-link and complete-link systems
  - each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster
61Example for Agglomerative Clustering
A-F (6 items): 6(6-1)/2 = 15 pairwise similarities, listed in decreasing order
62Single-link Clustering
1. AF 0.9: A and F are joined at level 0.9.
   sim(AF,X) = max(sim(A,X), sim(F,X))
          AF   B    C    D    E
     AF   .    .8   .5   .6   .8
     B    .8   .    .4   .5   .7
     C    .5   .4   .    .3   .5
     D    .6   .5   .3   .    .4
     E    .8   .7   .5   .4   .
2. AE 0.8: E joins AF at level 0.8.
   sim(AEF,X) = max(sim(AF,X), sim(E,X))
63Single-link Clustering (Continued)
          AEF  B    C    D
     AEF  .    .8   .5   .6
     B    .8   .    .4   .5
     C    .5   .4   .    .3
     D    .6   .5   .3   .
3. BF 0.8: B joins AEF at level 0.8.
   sim(ABEF,X) = max(sim(AEF,X), sim(B,X))
   Note: E and B are on the same level (i.e., same similarity value).
          ABEF C    D
     ABEF .    .5   .6
     C    .5   .    .3
     D    .6   .3   .
4. BE 0.7
   sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X))
[Dendrogram so far: A and F joined at 0.9; E and B attached at 0.8.]
64Single-link Clustering (Continued)
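          ABDEF  C
     ABDEF  .    .5
     C      .5   .
5. AD 0.6: D joins ABEF at level 0.6.
6. AC 0.5: C joins ABDEF at level 0.5.
[Dendrogram: A and F joined at 0.9; E and B attached at 0.8; D attached at 0.6; C attached at 0.5.]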
65Single-Link Clusters
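- Similarity level 0.7 (i.e., similarity threshold)
  [Figure: A, B, E, F linked by edges with similarities .9, .8, .8, .7; C and D remain separate.]
- Similarity level 0.5 (i.e., similarity threshold)
  [Figure: A, B, C, D, E, F form one cluster, linked by edges with similarities .9, .8, .8, .7, .6, .5.]
Small number of large, poorly linked clusters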
66Complete-link cluster generation
Columns: Step Number; Similarity Pair; Check Operations; Complete Link Structure & Pairs Covered; Similarity Matrix
1. AF 0.9: new cluster; A and F are joined at 0.9.
   sim(AF,X) = min(sim(A,X), sim(F,X))
2. AE 0.8: check EF?
   pairs covered: (A,E) (A,F)
3. BF 0.8: check AB?
   pairs covered: (A,E) (A,F) (B,F)
67Complete-link cluster generation (Continued)
Columns: Step Number; Similarity Pair; Check Operations; Complete Link Structure & Pairs Covered; Similarity Matrix
          AF   B    C    D    E
     AF   .    .3   .2   .1   .3
     B    .3   .    .4   .5   .7
     C    .2   .4   .    .3   .5
     D    .1   .5   .3   .    .4
     E    .3   .7   .5   .4   .
4. BE 0.7: new cluster; B and E are joined at 0.7.
5. AD 0.6: check DF?
   pairs covered: (A,D)(A,E)(A,F) (B,E)(B,F)
6. AC 0.6: check CF?
   pairs covered: (A,C)(A,D)(A,E)(A,F) (B,E)(B,F)
7. BD 0.5: check DE?
   pairs covered: (A,C)(A,D)(A,E)(A,F) (B,D)(B,E)(B,F)
68Complete-link cluster generation (Continued)
Columns: Step Number; Similarity Pair; Check Operations; Complete Link Structure & Pairs Covered; Similarity Matrix
          AF   BE   C    D
     AF   .    .3   .2   .1
     BE   .3   .    .4   .4
     C    .2   .4   .    .3
     D    .1   .4   .3   .
8. CE 0.5: check BC?
   pairs covered: (A,C)(A,D)(A,E)(A,F) (B,D)(B,E)(B,F) (C,E)
9. BC 0.4: check CE 0.5 (in the checklist); C joins BE at 0.4.
10. DE 0.4: check BD 0.5, CD?
   pairs covered: (A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,E) (D,E)
11. AB 0.3: check AC 0.5, AE 0.8, BF 0.8, CF?, EF?
   pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,E) (D,E)
69Complete-link cluster generation (Continued)
Columns: Step Number; Similarity Pair; Check Operations; Complete Link Structure & Pairs Covered; Similarity Matrix
          AF   BCE  D
     AF   .    .2   .1
     BCE  .2   .    .3
     D    .1   .3   .
12. CD 0.3: check BD 0.5, DE 0.4; D joins BCE at 0.3.
13. EF 0.3: check BF 0.8, CF?, DF?
   pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,D)(C,E) (D,E) (E,F)
14. CF 0.2: check BF 0.8, EF 0.3, DF?
   pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F) (B,C)(B,D)(B,E)(B,F) (C,D)(C,E)(C,F) (D,E) (E,F)
70Complete-link cluster generation (Continued)
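          AF   BCDE
     AF   .    .1
     BCDE .1   .
15. DF 0.1: last pair; AF and BCDE are joined at 0.1.
[Dendrogram: A and F joined at 0.9; B and E joined at 0.7; C attached at 0.4; D attached at 0.3; the two groups joined at 0.1.]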
71Complete link clusters
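- Similarity level 0.7
  [Figure: A and F linked (0.9); B and E linked (0.7); C and D remain separate.]
- Similarity level 0.4
  [Figure: A and F linked (0.9); B, C, E linked (0.7, 0.5, 0.4); D remains separate.]
- Similarity level 0.3
  [Figure: A and F linked (0.9); B, C, D, E linked (0.7, 0.5, 0.5, 0.4, 0.4, 0.3).]
Larger number of small, tightly linked clusters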
72The Behavior of Single-Link Cluster
- The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect.
- Each element is usually attached to only one other member of the same cluster at each similarity level.
- It is sufficient to remember the list of previously clustered single items.
73The Behavior of Complete-Link Cluster
- Complete-link process produces a much larger number of small, tightly linked groupings.
- Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level.
- It is necessary to remember the list of all item pairs previously considered in the clustering process.
74The Behavior of Complete-Link Cluster(Continued)
- The complete-link clustering system may be better adapted to retrieval than the single-link clusters.
- A complete-link cluster generation is more expensive to perform than a comparable single-link process.
75How to Generate Similarity
D_i = (d_i1, d_i2, ..., d_it): document vector for D_i
L_j = (l_j1, l_j2, ..., l_jn_j): inverted list for term T_j
  l_ji denotes the document identifier of the ith document listed under term T_j
  n_j denotes the number of postings for term T_j (number of documents containing term T_j)

for j = 1 to t                    (for each of t possible terms)
  for i = 1 to n_j                (for all n_j entries on the jth list)
    compute sim(D_l_ji, D_l_jk), i+1 <= k <= n_j
  end for
end for
76Similarity without Recomputation
for j = 1 to N                    (for each document in collection)
  set S_ji = 0, 1 <= i <= N
  for k = 1 to n_j                (for each term in document)
    take up inverted list L_k
    for i = 1 to n_k              (for each document identifier on list)
      if i < j or if S_ji = 1
        take up next document D_i
      else
        compute sim(D_j, D_i)
        set S_ji = 1
    end for
  end for
end for
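The bookkeeping in this procedure can be sketched as follows. The example assumes documents are given as sets of terms, builds the inverted lists on the fly, and uses a per-document `seen` set in the role of the flags S_ji. It skips a document when its position is less than or equal to j, so that a document is never compared with itself. The function names and the toy similarity (`overlap`, the number of shared terms) are stand-ins chosen for illustration.

```python
def overlap(terms_a, terms_b):
    """Stand-in similarity: number of shared terms."""
    return len(terms_a & terms_b)

def pairwise_similarities(docs, sim=overlap):
    """Compute sim once for every pair of documents sharing at least one term.

    docs maps a document id to its set of terms.
    """
    inverted = {}                                  # term -> list of document ids
    for doc_id, terms in docs.items():
        for term in terms:
            inverted.setdefault(term, []).append(doc_id)
    position = {doc_id: n for n, doc_id in enumerate(docs)}
    sims = {}
    for j, terms in docs.items():
        seen = set()                               # plays the role of S_ji = 1
        for term in terms:
            for i in inverted[term]:
                if position[i] <= position[j] or i in seen:
                    continue                       # take up next document
                sims[(j, i)] = sim(docs[j], docs[i])
                seen.add(i)
    return sims

docs = {"D1": {"a", "b"}, "D2": {"b", "c"}, "D3": {"d"}, "D4": {"a", "b", "c"}}
sims = pairwise_similarities(docs)
```

Pairs with no term in common (everything involving D3 here) are never touched, which is the saving over computing all N(N-1)/2 coefficients.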
77Heuristic Clustering Methods
- Hierarchical clustering strategies
  - use all pairwise similarities between items
  - cluster generation is relatively expensive
  - produce a unique set of well-formed clusters for each set of data, regardless of the order in which the similarity pairs are introduced into the clustering process
- Heuristic clustering methods
  - produce rough cluster arrangements at relatively little expense
78Single-Pass Heuristic Clustering Methods
- Item 1 is first taken and placed into a cluster of its own.
- Each subsequent item is then compared against all existing clusters.
- It is placed in a previously existing cluster whenever it is similar to any existing cluster.
  - Compute the similarities between all existing centroids and the new incoming item.
  - When an item is added to an existing cluster, the corresponding centroid must then be appropriately updated.
- If a new item is not sufficiently similar to any existing cluster, the new item forms a cluster of its own.
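The single-pass method can be sketched as below. Several details are assumptions made for the example: items are dense vectors, similarity to a cluster means cosine similarity to its centroid, an item joins the most similar cluster whose similarity reaches a fixed threshold, and the centroid is kept as the running mean of the members. The function names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(items, threshold):
    """Assign each item, in order, to the best matching cluster or to a new one."""
    clusters = []                      # each cluster: {"members": [...], "centroid": [...]}
    for index, vector in enumerate(items):
        best, best_sim = None, threshold
        for cluster in clusters:
            s = cosine(vector, cluster["centroid"])
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            # not sufficiently similar to any existing cluster
            clusters.append({"members": [index], "centroid": list(vector)})
        else:
            best["members"].append(index)
            n = len(best["members"])
            # update the centroid as the running mean of the members
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], vector)]
    return clusters

items = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters = single_pass(items, threshold=0.8)
```

Because items are processed in order and centroids move as members are added, presenting the same items in a different order can yield a different arrangement.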
79Single-Pass Heuristic Clustering Methods (Continued)
- Produce uneven cluster structures.
  - Solutions
    - cluster splitting: cluster sizes
    - variable similarity thresholds: the number of clusters, and the overlap among clusters
- Produce cluster arrangements that vary according to the order of individual items.
80Cluster Splitting
- Addition of one more item to cluster A
- Splitting cluster A into two pieces A' and A''
- Splitting supercluster S into two pieces S' and S''
81Cluster Searching
- Cluster centroid: the average vector of all the documents in a given cluster
- strategies
  - top down: the query is first compared with the highest-level centroids
  - bottom up: only the lowest-level centroids are stored; the higher-level cluster structure is disregarded
82Top-down entire-clustering search
1. Initialize by adding the top item to the active node list.
2. Take the centroid with the highest query similarity from the active node list.
   If the number of singleton items in the subtree headed by that centroid is not larger than the number of items wanted,
   then retrieve these singleton items and eliminate the centroid from the active node list;
   else eliminate the centroid with the highest query similarity from the active node list and add its sons to the active node list.
3. If the number of retrieved items >= the number wanted, then stop; else repeat step 2.
83Top-down entire-clustering search (Continued)
Active node list                        Number of single items in subtree   Retrieved items
(1,0.2)                                 14 (too big)
(2,0.5), (4,0.7), (3,0)                 6 (too big)
(2,0.5), (8,0.8), (9,0.3), (3,0)        2                                   I, J
(2,0.5), (9,0.3), (3,0)                 4 (too big)
(5,0.6), (6,0.5), (9,0.3), (3,0)        2                                   A, B
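The trace above can be replayed in code. The tree below is a reconstruction, not the slide's figure: the node numbers, query similarities, and subtree sizes come from the trace, and documents I, J (under centroid 8) and A, B (under centroid 5) are given there, but the remaining leaf assignments and the value of 3 for the number of items wanted are assumptions chosen to be consistent with it.

```python
# Cluster tree reconstructed from the trace (partly hypothetical, see above).
CHILDREN = {1: [2, 4, 3], 2: [5, 6], 4: [8, 9]}
DOCS = {
    5: ["A", "B"], 6: ["C", "D"],          # under centroid 2 (4 items)
    8: ["I", "J"], 9: ["K", "L", "M", "N"],  # under centroid 4 (6 items)
    3: ["E", "F", "G", "H"],               # centroid 3 (4 items)
}
QUERY_SIM = {1: 0.2, 2: 0.5, 3: 0.0, 4: 0.7, 5: 0.6, 6: 0.5, 8: 0.8, 9: 0.3}

def documents_under(node):
    """All singleton items in the subtree headed by `node`."""
    if node in DOCS:
        return list(DOCS[node])
    return [d for child in CHILDREN.get(node, []) for d in documents_under(child)]

def top_down_search(root, wanted):
    """Top-down entire-clustering search over the active node list."""
    active, retrieved = [root], []
    while active and len(retrieved) < wanted:
        best = max(active, key=QUERY_SIM.get)   # centroid with highest query similarity
        active.remove(best)
        docs = documents_under(best)
        if len(docs) <= wanted:
            retrieved.extend(docs)              # small enough: retrieve the items
        else:
            active.extend(CHILDREN.get(best, []))  # too big: expand into its sons
    return retrieved

retrieved = top_down_search(root=1, wanted=3)
```

The search expands nodes 1 and 4, retrieves I and J from centroid 8, expands node 2, and then retrieves A and B from centroid 5, matching the rows of the table.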
84Bottom-up Individual-Cluster Search
Take a specified number of low-level centroids.
If there are enough singleton items in those clusters to equal the number of items wanted,
then retrieve the number of items wanted in ranked order;
else add additional low-level centroids to the list and repeat the test.
85Bottom-up Individual-Cluster Search (Continued)
- Active centroid list: (8,.8), (4,.7), (5,.6)
- Ranked documents from clusters: (I,.9), (L,.8), (A,.8), (K,.6), (B,.5), (J,.4), (N,.4), (M,.2)
- Retrieved items: I, L, A