Chapter 7 Text Operations - PowerPoint PPT Presentation

About This Presentation
Title:

Chapter 7 Text Operations

Description:

Chapter 7 Text Operations Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University Document Preprocessing Lexical analysis ... – PowerPoint PPT presentation


1
Chapter 7 Text Operations
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

2
Logical View of a Document
[Figure: a document (text + structure) passes through structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; and automatic or manual indexing. The logical view ranges from full text and structure down to index terms.]
3
Document Preprocessing
  • Lexical analysis
  • Elimination of stopwords
  • Stemming of the remaining words
  • Selection of index terms
  • Construction of term categorization structures

4
Lexical Analysis for Automatic Indexing
  • Lexical analysis: convert an input stream of
    characters into a stream of words or tokens.
  • What is a word or a token? Tokens consist of
    letters.
  • digits: most numbers are not good index
    terms. Counterexamples: case numbers in a legal
    database, B6 and B12 in a vitamin database.
  • hyphens
  • break hyphenated words: state-of-the-art -> state
    of the art
  • keep hyphenated words as a token: Jean-Claude,
    F-16

5
Lexical Analysis for Automatic Indexing (Continued)
  • punctuation marks: often used as parts of terms,
    e.g., OS/2, 510 B.C.
  • case: usually not significant in index terms
  • Issues: recall and precision
  • breaking up hyphenated terms: increases recall but
    decreases precision
  • preserving case distinctions: enhances precision
    but decreases recall
  • commercial information systems: usually take the
    recall-enhancing approach (numbers and words
    containing digits are index terms, and all are
    case insensitive)
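
The policy choices above (hyphens, digits, case) can be sketched as switches on a small tokenizer. This is only an illustration: the function name `tokenize` and its parameters are assumptions, not part of the slides, and the regular expression treats a token as a run of ASCII letters or digits.

```python
import re

def tokenize(text, break_hyphens=True, keep_digits=False, lowercase=True):
    """Split text into candidate index terms (illustrative policy switches)."""
    if lowercase:
        text = text.lower()          # case is usually not significant
    if break_hyphens:
        text = text.replace("-", " ")  # state-of-the-art -> state of the art
    # A token is a run of letters/digits, optionally joined by internal hyphens.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    if not keep_digits:
        # Drop every token containing a digit (most numbers are poor index terms).
        tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return tokens

recall_oriented = tokenize("State-of-the-art F-16")
precision_oriented = tokenize("State-of-the-art F-16", break_hyphens=False,
                              keep_digits=True, lowercase=False)
```

With the defaults the hyphenated words are broken up and lowercased; with the second call both hyphenated terms survive as single tokens.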

6
Lexical Analysis for Query Processing
  • Tasks
  • depend on the design strategies of the lexical
    analyzer for automatic indexing (search terms
    must match index terms)
  • distinguish operators like Boolean operators,
    stemming or truncating operators, and weighting
    functions
  • distinguish grouping indicators like parentheses
    and brackets

7
stoplist (negative dictionary)
  • Avoid retrieving almost every item in a database
    regardless of its relevance.
  • Examples
  • conservative approach (ORBIT Search Service):
    and, an, by, from, of, the, with
  • (derived from Brown corpus) 425 words: a, about,
    above, across, after, again, against, all,
    almost, alone, along, already, also, although,
    always, among, an, and, another, any, anybody,
    anyone, anything, anywhere, are, area, areas,
    around, as, ask, asked, asking, asks, at, away,
    b, back, backed, backing, backs, be, because,
    became, ...
  • Articles, prepositions, conjunctions,

8
Chinese Stop Words?
  • ? Neu 58388 ? Nh 40332 ? D 39014
  • ? Di 31873 ? Nh 30025 ? D 29646
  • ? D 29211 ? Na 24269 ? D 20403
  • ? VE 19625 ?? Nh 18152 ? Nh 17298
  • ? D 15955 ? D 14066 ? Dfa 13013
  • ? VH 11577 ? D 11125 ? Di 11026
  • ? Nh 10776 ? D 9698 ?? D 9670
  • ? Dfa 9416 ?? Nh 9069 ? D 8992
  • ? D 8869 ?? Nh 8818 ? Neu 8692
  • ? D 8508 ? VG 8369 ? VH 8304
  • ? D 8037 ? D 7858 ? D 7298
  • ? Da 7266 ? D 7256 ...

9
Implementing Stoplists
  • approaches
  • examine lexical analyzer output and remove any
    stopwords
  • Every token must be looked up in the stoplist,
    and removed from further analysis if found
  • A standard list searching problem
  • remove stopwords as part of lexical analysis
  • best implementation of stoplist
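
The first approach (filter the lexical analyzer's output) can be sketched with a hash-based set, which makes each lookup constant time on average. The names `STOPLIST` and `remove_stopwords` are illustrative; the word list is the conservative ORBIT list quoted earlier.

```python
# Conservative stoplist quoted on the earlier slide (ORBIT Search Service).
STOPLIST = frozenset({"and", "an", "by", "from", "of", "the", "with"})

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Look every token up in the stoplist and drop it if found."""
    return [t for t in tokens if t.lower() not in stoplist]

filtered = remove_stopwords(["the", "retrieval", "of", "documents"])
```

Folding this check into the lexical analyzer itself, as the last bullet suggests, would avoid creating the stopword tokens in the first place.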

10
Stemming
  • stem
  • Portion of a word which is left after the removal
    of its affixes
  • e.g., connect is the stem of connected, connecting,
    connection, connections
  • benefits of stemming?
  • Some favor the usage of stemming
  • Many Web search engines do not adopt any stemming
    algorithm
  • issues
  • correctness
  • retrieval performance
  • compression performance

11
Stemmers
  • programs that relate morphologically similar
    indexing and search terms
  • stem at indexing time
  • advantage: efficiency and index file compression
  • disadvantage: information about the full terms is
    lost
  • example (CATALOG system), stem at search time:
      Look for: system users
      Search Term: users
      Term        Occurrences
      1. user     15
      2. users    1
      3. used     3
      4. using    2
      Which terms (0 = none, CR = all):
    The user selects the terms he wants by numbers.

12
Conflation Methods
  • manual
  • automatic (stemmers)
  • affix removal: longest match vs. simple removal
  • successor variety
  • table lookup
  • n-gram
  • evaluation
  • correctness
  • retrieval effectiveness
  • compression performance

Term          Stem
engineering   engineer
engineered    engineer
engineer      engineer
13
Successor Variety
  • Definition (successor variety of a string): the
    number of different characters that follow it in
    words in some body of text
  • Example: a body of text: able, axle, accident,
    ape, about
    successor variety of apple:
      1st (a): 4 (b, x, c, p)
      2nd (ap): 1 (e)

14
Successor Variety (Continued)
  • Idea: the successor variety of substrings of a
    term decreases as more characters are added
    until a segment boundary is reached, at which
    point the successor variety sharply increases.
  • Example: test word READABLE
    Corpus: ABLE, BEATABLE, FIXABLE, READS, READABLE,
    READING, RED, ROPE, RIPE
      Prefix     Successor Variety  Letters
      R          3                  E, O, I
      RE         2                  A, D
      REA        1                  D
      READ       3                  A, I, S
      READA      1                  B
      READAB     1                  L
      READABL    1                  E
      READABLE   1                  blank
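
The table above can be reproduced with a few lines of code. This is a minimal sketch: the function name is an assumption, and the end of a word is counted as one successor (the "blank" in the table) by using a marker character that is assumed not to occur in the corpus.

```python
def successor_varieties(word, corpus):
    """Successor variety of every prefix of `word` with respect to `corpus`."""
    END = "$"  # stands for the blank that follows a complete word
    varieties = []
    for n in range(1, len(word) + 1):
        prefix = word[:n]
        # Collect the distinct characters that follow the prefix in the corpus.
        successors = {(w[n] if len(w) > n else END)
                      for w in corpus if w.startswith(prefix)}
        varieties.append(len(successors))
    return varieties

CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READS", "READABLE",
          "READING", "RED", "ROPE", "RIPE"]
sv = successor_varieties("READABLE", CORPUS)
```

The jump from 1 (REA) to 3 (READ) marks READ as a segment boundary.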

15
The successor variety stemming process
  • Determine the successor variety for a word.
  • Use this information to segment the word.
  • cutoff method: a boundary is identified whenever
    the cutoff value is reached
  • peak and plateau method: a segment break is made
    after a character whose successor variety exceeds
    that of the character immediately preceding it and
    the character immediately following it
  • complete word method: a break is made after a
    segment if the segment is a complete word
  • entropy method
  • Select one of the segments as the stem.

16
n-gram stemmers
  • digram: a pair of consecutive letters
  • shared digram method: association measures are
    calculated between pairs of terms,
    S = 2C / (A + B) (Dice's coefficient),
    where A is the number of unique digrams in the
    first word, B is the number of unique digrams in
    the second, and C is the number of unique digrams
    shared by the two words.

17
n-gram stemmers (Continued)
  • Example
    statistics => st ta at ti is st ti ic cs
      unique digrams => at cs ic is st ta ti
    statistical => st ta at ti is st ti ic ca al
      unique digrams => al at ca ic is st ta ti
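
For this example the counts are A = 7, B = 8, and C = 6 shared digrams. Assuming the association measure is Dice's coefficient, S = 2C / (A + B), which is the usual choice for the shared digram method, a minimal sketch looks like this (function names are illustrative):

```python
def digrams(word):
    """Set of unique pairs of consecutive letters in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    """Dice coefficient over unique digrams: 2C / (A + B)."""
    a, b = digrams(w1), digrams(w2)
    if not a and not b:
        return 0.0  # words shorter than two letters have no digrams
    return 2 * len(a & b) / (len(a) + len(b))

s = digram_similarity("statistics", "statistical")  # 2*6 / (7+8)
```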

18
n-gram stemmers (Continued)
  • similarity matrix: determine the similarity measures
    for all pairs of terms in the database
              word1  word2  word3  ...  wordn-1
      word1
      word2   S21
      word3   S31    S32
      ...
      wordn   Sn1    Sn2    Sn3    ...  Sn(n-1)
  • terms are clustered using a single-link
    clustering method
  • more a term clustering procedure than a stemming
    one

19
Affix Removal Stemmers
  • procedure: remove suffixes and/or prefixes from
    terms, leaving a stem, and transform the resultant
    stem. E.g., Porter algorithm
  • example: plural forms
    If a word ends in "ies" but not "eies" or "aies"
      then "ies" -> "y"
    If a word ends in "es" but not "aes", "ees", or "oes"
      then "es" -> "e"
    If a word ends in "s" but not "us" or "ss"
      then "s" -> NULL
  • ambiguity
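
The three plural rules can be written down directly. The sketch below assumes the rules are tried in order and that only the first rule whose suffix matches is considered, so a word that hits an exception (e.g., "goes") is left unchanged rather than falling through to the next rule. The slide does not spell this out, so treat it as one possible reading.

```python
def stem_plural(word):
    """Apply the three plural-form rules; first matching suffix wins."""
    if word.endswith("ies"):
        if not word.endswith(("eies", "aies")):
            return word[:-3] + "y"      # ies -> y
    elif word.endswith("es"):
        if not word.endswith(("aes", "ees", "oes")):
            return word[:-2] + "e"      # es -> e
    elif word.endswith("s"):
        if not word.endswith(("us", "ss")):
            return word[:-1]            # s -> NULL
    return word

stems = [stem_plural(w) for w in ["ponies", "plates", "cats", "corpus", "glass"]]
```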

20
Affix Removal Stemmers (Continued)
  • longest match stemmer: remove the longest possible
    string of characters from a word according to a
    set of rules
  • recoding: AxC -> AyC, e.g., ki -> ky
  • partial matching: only the n initial characters of
    stems are used in comparing
  • different versions: Lovins, Salton, Dawson,
    Porter, ... Students can refer to the rules listed
    in the appendix of the textbook (pp. 433-436)

21
Index Term Selection (see Chapter 2)
22
Fast Statistical Parsing of Noun Phrases for
Document Indexing
  • Chengxiang Zhai
  • Laboratory for Computational Linguistics
  • Carnegie Mellon University
  • (ANLP97, pp. 312-319)

23
Phrases for Document Indexing
  • Indexing by single words
  • single words are often ambiguous and not specific
    enough for accurate discrimination of documents
  • bank terminology vs. terminology bank
  • Indexing by phrases
  • Syntactic phrases are almost always more specific
    than single words
  • Indexing by single words and phrases

24
No significant improvement?
  • Fagan, Joel L., Experiments in Automatic Phrase
    Indexing for Document Retrieval: A Comparison of
    Syntactic and Non-syntactic Methods, Ph.D.
    thesis, Cornell University, 1987.
  • Lewis, D., Representation and Learning in
    Information Retrieval, Ph.D. thesis, University
    of Massachusetts, 1991.
  • Many syntactic phrases have very low frequency
    and tend to be over-weighted by normal weighting
    method.

25
Author's points
  • A larger document collection may increase the
    frequency of most phrases, and thus alleviate the
    problem of low frequency.
  • The phrases are used only to supplement, not
    replace the single words for indexing.
  • The new issue is how to parse gigabytes of text
    in practically feasible time. (133 MHz DEC Alpha
    workstation, 8 hours/GB, 20 hours of training
    with 1 GB of text.)

26
Experiment Design
  • CLARIT: commercial retrieval system
  • original document set -> CLARIT NP Extractor
    -> Raw Noun Phrases -> Statistical NP
    Parser, Phrase Extractor -> Indexing Term
    Set -> CLARIT Retrieval Engine

27
Different Indexing Units
  • example
  • heavy construction industry group (WSJ90)
  • single words
  • heavy, construction, industry, group
  • head modifier pairs
  • heavy construction, construction industry,
    industry group
  • full noun phrases
  • heavy construction industry group

28
Different Indexing Units (Continued)
  • WD-SET
  • single word only (no phrases, baseline)
  • WD-HM-SET
  • single word + head modifier pair
  • WD-NP-SET
  • single word + full NP
  • WD-HM-NP-SET
  • single word + head modifier pair + full NP

29
Result Analysis
  • Collection: Tipster Disk 2 (250 MB)
  • Query: TREC-5 ad hoc topics (251-300)
  • relevance feedback: top 10 documents returned
    from initial retrieval
  • evaluation
  • total number of relevant documents retrieved
  • highest level of precision over all the points of
    recall
  • average precision

30
Effects of phrases with feedback and TREC-5
31
Summary
  • When only one kind of phrase is used to
    supplement the single words, each can lead to a
    great improvement in precision.
  • When we combine the two kinds of phrases, the
    effect is a greater improvement in recall rather
    than precision.
  • How to combine and weight different phrases
    effectively becomes an important issue.

32
Thesaurus Construction
  • IR thesaurus (coordinate indexing and retrieval): a
    list of terms (words or phrases) along with
    relationships among them
  • INSPEC thesaurus (1979): physics, EE, electronics,
    computer and control; for indexer and searcher
      cesium (Cs)
        USE caesium (the preferred form)
      computer-aided instruction
        see also education (cross-referenced terms)
        UF teaching machines (a set of alternatives)
        BT educational computing (broader terms, cf. NT)
        TT computer applications (root node/top term)
        RT education, teaching (related terms)
        CC C7810C (subject area)
        FC C7810Cf (subject area)

33
Roget thesaurus
  • example
  • cowardly, adjective
  • Ignobly lacking in courage: "cowardly turncoats"
  • Syns: chicken (slang), chicken-hearted, craven,
    dastardly, faint-hearted, gutless, lily-livered,
    pusillanimous, unmanly, yellow (slang),
    yellow-bellied (slang)

34
Functions of thesauri
  • Provide a standard vocabulary for indexing and
    searching
  • Assist users with locating terms for proper query
    formulation
  • Provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request

35
Usage
  • Indexing: select the most appropriate thesaurus
    entries for representing the document.
  • Searching: design the most appropriate search
    strategy.
  • If the search does not retrieve enough documents,
    the thesaurus can be used to expand the query.
  • If the search retrieves too many items, the
    thesaurus can suggest more specific search
    vocabulary.

36
Features of Thesauri
Construction of phrases from individual terms
  • Coordination Level
  • pre-coordination: phrases
  • phrases are available for indexing and retrieval
  • advantage: reduces ambiguity in indexing and
    searching
  • disadvantage: the searcher has to know the phrase
    formulation rules
  • post-coordination: words
  • phrases are constructed while searching
  • advantage: users do not worry about the exact
    word ordering
  • disadvantage: the search precision may fall,
    e.g., library school vs. school library
  • intermediate level: phrases and single words
  • the higher the level of coordination, the greater
    the precision of the vocabulary but the larger
    the vocabulary size

Length of phrases? Two or three words, or more
37
Features of Thesauri (Continued)
  • Term Relationships
  • Aitchison and Gilchrist (1972)
  • equivalence relationships
  • synonymy: trade names, popular and local usage,
    superseded terms
  • quasi-synonymy, e.g., harshness and tenderness
  • hierarchical relationships, e.g., genus-species
    (dog and German shepherd), BT vs. NT
  • nonhierarchical relationships, e.g., thing-part
    (bus and seat), thing-attribute (rose and
    fragrance)

38
Features of Thesauri (Continued)
  • Wang, Vandendorpe, and Evens (1985)
  • parts-wholes, e.g., set-element, count-mass
  • collocation relations: words that frequently
    co-occur in the same phrase or sentence
  • paradigmatic relations: words that have the same
    semantic core, e.g., moon and lunar
  • taxonomy and synonymy
  • antonymy relations

39
Features of Thesauri (Continued)
  • Number of entries for each term
  • homographs: words with multiple meanings
  • each homograph entry is associated with its own
    set of relations
  • problem: how to select between alternative
    meanings
  • Specificity of vocabulary
  • the precision associated with the component terms
  • a highly specific vocabulary promotes precision
    in retrieval (rules of phrase construction)

40
Features of Thesauri (Continued)
  • Control on term frequency of class members
  • for statistical thesaurus construction methods
  • terms included in the same thesaurus class have
    roughly equal frequencies
  • the total frequency in each class should also be
    roughly similar
  • Normalization of vocabulary
  • terms should be in noun form
  • noun phrases should avoid prepositions unless
    they are commonly known
  • a limited number of adjectives should be used
  • singularity, ordering, spelling, capitalization,
    transliteration, abbreviations, ...

41
Thesaurus Construction
  • manual thesaurus construction
  • define the boundaries of the subject area
  • collect the terms for each subarea; sources:
    indexes, encyclopedias, handbooks, textbooks,
    journal titles and abstracts, catalogues,
    relevant thesauri, vocabulary systems, ...
  • organize the terms and their relationships into
    structures
  • review (and refine) the entire thesaurus for
    consistency
  • automatic thesaurus construction
  • from a collection of document items
  • by merging existing thesauri

42
Thesaurus Construction from Texts
1. Construction of vocabulary: normalization and
   selection of terms; phrase construction depending
   on the coordination level desired
2. Similarity computations between terms: identify
   the significant statistical associations between
   terms
3. Organization of vocabulary: organize the selected
   vocabulary into a hierarchy on the basis of the
   associations computed in step 2.
43
Construction of Vocabulary
  • Objective: identify the most informative terms
    (words and phrases)
  • Procedure
    (1) Identify an appropriate document collection.
        The document collection should be sizable and
        representative of the subject area.
    (2) Determine the required specificity for the
        thesaurus.
    (3) Normalize the vocabulary terms.
        (a) Eliminate very trivial words such as
            prepositions and conjunctions.
        (b) Stem the vocabulary.
    (4) Select the most interesting stems, and create
        interesting phrases for a higher coordination
        level.

44
Stem evaluation and selection
  • selection by frequency of occurrence
  • each term belongs to one of three categories:
    high, medium, or low frequency
  • terms in the mid-frequency range are the best for
    indexing and searching

45
Stem evaluation and selection (Continued)
  • selection by discrimination value (DV)
  • the more discriminating a term, the higher its
    value as an index term
  • procedure
  • Compute the average inter-document similarity in
    the collection
  • Remove the term K from the indexing vocabulary,
    and recompute the average similarity
  • DV(K) = (average similarity without K) - (average
    similarity with K)
  • The DV for good discriminators is positive.
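
The procedure can be sketched as follows. Documents are assumed to be dictionaries mapping terms to weights, and cosine is assumed as the inter-document similarity; the slides fix neither choice, and all function names are illustrative.

```python
import math
from itertools import combinations

def cosine(d1, d2):
    """Cosine similarity of two term->weight dictionaries."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def average_similarity(docs):
    """Average similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, term):
    """DV(K) = (average similarity without K) - (average similarity with K)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return average_similarity(without) - average_similarity(docs)

docs = [{"the": 1, "cat": 1}, {"the": 1, "dog": 1}, {"the": 1, "fish": 1}]
dv_the = discrimination_value(docs, "the")   # appears everywhere: poor discriminator
dv_cat = discrimination_value(docs, "cat")   # appears in one document: good discriminator
```

Removing "the" makes the documents less alike, so its DV is negative; removing "cat" makes them more alike, so its DV is positive.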

46
Phrase Construction
Decrease the frequency of high-frequency terms
and increase their value of retrieval
  • Salton and McGill procedure
    1. Compute pairwise co-occurrence for
       high-frequency words.
    2. If this co-occurrence is lower than a
       threshold, then do not consider the pair any
       further.
    3. For pairs that qualify, compute the cohesion
       value:
       COHESION(ti, tj) = co-occurrence-frequency /
         sqrt(frequency(ti) * frequency(tj))
       or
       COHESION(ti, tj) = size-factor *
         co-occurrence-frequency /
         (frequency(ti) * frequency(tj))
       where size-factor is the size of the thesaurus
       vocabulary
    4. If cohesion is above a second threshold,
       retain the phrase
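
A small numeric sketch of the two cohesion variants. The function names and the sample frequencies are made up for illustration; only the formulas come from the slide.

```python
import math

def cohesion_sqrt(cooccurrence, freq_i, freq_j):
    """co-occurrence-frequency / sqrt(frequency(ti) * frequency(tj))"""
    return cooccurrence / math.sqrt(freq_i * freq_j)

def cohesion_size(cooccurrence, freq_i, freq_j, size_factor):
    """size-factor * co-occurrence-frequency / (frequency(ti) * frequency(tj))"""
    return size_factor * cooccurrence / (freq_i * freq_j)

# Hypothetical counts: the pair co-occurs 4 times, the terms occur 8 and 2 times.
c1 = cohesion_sqrt(4, 8, 2)          # 4 / sqrt(16)
c2 = cohesion_size(4, 8, 2, 100)     # 100 * 4 / 16
```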

(vs. syntactic/semantic methods)
47
Phrase Construction (Continued)
  • Choueka procedure
    1. Select the range of length allowed for each
       collocational expression, e.g., 2-6 words.
    2. Build a list of all potential expressions from
       the collection with the prescribed length that
       have a minimum frequency.
    3. Delete sequences that begin or end with a
       trivial word (e.g., prepositions, pronouns,
       articles, conjunctions, etc.).
    4. Delete expressions that contain high-frequency
       nontrivial words.
    5. Given an expression, evaluate any potential
       sub-expressions for relevance (e.g., abcd ->
       abc and bcd). Discard any that are not
       sufficiently relevant.
    6. Try to merge smaller expressions into larger
       and more meaningful ones.

48
Similarity Computation
  • Cosine: compute the number of documents associated
    with both terms divided by the square root of the
    product of the number of documents associated
    with the first term and the number of documents
    associated with the second term.
  • Dice: compute the number of documents associated
    with both terms divided by the sum of the number
    of documents associated with one term and the
    number associated with the other.
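
Both measures can be sketched over sets of document identifiers. The Dice variant below follows the wording on the slide literally; note that Dice's coefficient is more commonly written with the numerator doubled, which only rescales the values. Function names and the sample postings are illustrative.

```python
import math

def cosine_terms(docs_a, docs_b):
    """|A and B| / sqrt(|A| * |B|) over sets of document ids."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))

def dice_terms(docs_a, docs_b):
    """|A and B| / (|A| + |B|), as worded on the slide (no factor of 2)."""
    if not docs_a and not docs_b:
        return 0.0
    return len(docs_a & docs_b) / (len(docs_a) + len(docs_b))

term1 = {1, 2, 3, 4}   # hypothetical ids of documents containing the first term
term2 = {3, 4, 5, 6}   # hypothetical ids of documents containing the second term
cos_sim = cosine_terms(term1, term2)   # 2 / sqrt(16)
dice_sim = dice_terms(term1, term2)    # 2 / 8
```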

49
Vocabulary Organization
Assumptions
(1) High-frequency words have broad meaning, while
    low-frequency words have narrow meaning.
(2) If the density functions of two terms have the
    same shape, then the two words have similar
    meaning.
1. Identify a set of frequency ranges.
2. Group the vocabulary terms into different classes
   based on their frequencies and the ranges selected
   in step 1.
3. The highest frequency class is assigned level 0,
   the next, level 1, and so on.
4. Parent-child links are determined between adjacent
   levels as follows. For each term t in level i,
   compute the similarity between t and every term in
   level i-1. Term t becomes the child of the most
   similar term in level i-1. If more than one term
   in level i-1 qualifies for this, then each becomes
   a parent of t. In other words, a term is allowed
   to have multiple parents.
5. After all terms in level i have been linked to
   level i-1 terms, check level i-1 terms and identify
   those that have no children. Propagate such terms
   to level i by creating an identical dummy term as
   its child.
6. Perform steps 4 and 5 for each level starting with
   level.
50
Merging Existing Thesauri
  • simple merge: link hierarchies wherever they have
    terms in common
  • complex merge
  • link terms from different hierarchies if they are
    similar enough.
  • similarity is a function of the number of parent
    and child terms in common

51
Document Clustering
  • Searching vs. Browsing
  • Disadvantages in using inverted index files
  • information pertaining to a document is scattered
    among many different inverted-term lists
  • information relating to different documents with
    similar term assignment is not in close proximity
    in the file system
  • Approaches
  • inverted-index files (for searching) + clustered
    document collection (for browsing)
  • clustered file organization (for searching and
    browsing)

52
Typical Clustered File Organization
[Figure: the complete document space is organized as a hierarchy of hypercentroid, supercentroids (superclusters), centroids (clusters), and documents.]
53
Search Strategy for Clustered Documents
[Figure: a typical search path runs from the highest-level centroid through supercentroids and centroids down to documents.]
54
Cluster Generation VS Cluster Search
  • Cluster structure is generated only once.
  • Cluster maintenance can be carried out at
    relatively infrequent intervals.
  • Cluster generation process may be slower and more
    expensive.
  • Cluster search operations may have to be
    performed continually.
  • Cluster search operations must be carried out
    efficiently.

55
Hierarchical Cluster Generation
  • Two strategies
  • pairwise item similarities
  • heuristic methods
  • Models
  • Divisive Clustering (top down)
  • The complete collection is assumed to represent
    one complete cluster.
  • Then the collection is subsequently broken down
    into smaller pieces.
  • Agglomerative Clustering (bottom up)
  • Individual item similarities are used as a
    starting point.
  • A gluing operation collects similar items, or
    groups, into larger groups.

56
Term clustering: from the column viewpoint
Document clustering: from the row viewpoint
57
A Naive Program for Hierarchical Agglomerative
Clustering
1. Compute all pairwise document-document similarity
   coefficients. (N(N-1)/2 coefficients)
2. Place each of N documents into a class of its own.
3. Form a new cluster by combining the most similar
   pair of current clusters i and j;
   update the similarity matrix by deleting the rows
   and columns corresponding to i and j;
   calculate the entries in the row corresponding to
   the new cluster i+j.
4. Repeat step 3 if the number of clusters left is
   greater than 1.
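
The naive program can be sketched as below. For brevity the sketch recomputes cluster similarities from the item pairs on every pass instead of updating matrix rows, which gives the same merges. The pairwise values are the ones used in the six-item A-F example on the later slides, with AC taken as 0.5 as in the single-link slides; the names `hac` and `PAIRS` are illustrative.

```python
from itertools import combinations

def hac(sim, items, link=max):
    """Naive agglomerative clustering.

    sim maps frozenset({a, b}) -> similarity.
    link=max gives single link, link=min gives complete link.
    Returns a list of (members of the new cluster, merge level).
    """
    clusters = [frozenset([x]) for x in items]

    def cluster_sim(c1, c2):
        return link(sim[frozenset([a, b])] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # Step 3: combine the most similar pair of current clusters.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        level = cluster_sim(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1 | c2), level))
    return merges

PAIRS = {"AF": .9, "AE": .8, "BF": .8, "BE": .7, "AD": .6, "AC": .5,
         "BD": .5, "CE": .5, "BC": .4, "DE": .4, "AB": .3, "CD": .3,
         "EF": .3, "CF": .2, "DF": .1}
SIM = {frozenset(pair): value for pair, value in PAIRS.items()}

single_levels = [level for _, level in hac(SIM, "ABCDEF", link=max)]
complete_levels = [level for _, level in hac(SIM, "ABCDEF", link=min)]
```

Ties (for example, the two candidate merges at 0.8 under single link) are broken by enumeration order here; the merge levels do not depend on that choice for this data.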
58
How to Combine Clusters?
  • Single-link clustering
  • Each document must have a similarity exceeding a
    stated threshold value with at least one other
    document in the same class.
  • similarity between a pair of clusters is taken to
    be the similarity between the most similar pair
    of items
  • each cluster member will be more similar to at
    least one member in that same cluster than to any
    member of another cluster

[Figure: two clusters, c1 = {e11, e12, e13} and c2 = {e21, e22, e23, e24}.]
Let (e13,e21) be the most similar pair between c1
and c2, and let its distance be dist(e13,e21).
For every p in c1 (c2), there exists q in c1 (c2),
p != q, such that dist(p,q) < dist(e13,e21).
Since dist(e13,e21) is the smallest distance between
the two clusters, dist(p,q) < dist(e13,e21) implies
dist(p,q) < dist(p,r) for every r in c2 (r != q).
59
How to Combine Clusters? (Continued)
  • Complete-link Clustering
  • Each document has a similarity to all other
    documents in the same class that exceeds the
    threshold value.
  • similarity between the least similar pair of
    items from the two clusters is used as the
    cluster similarity
  • each cluster member is more similar to the most
    dissimilar member of that cluster than to the
    most dissimilar member of any other cluster

[Figure: two clusters, c1 = {e11, e12, e13} and c2 = {e21, e22, e23, e24}.]
Let (e12,e24) be the least similar pair between
c1 and c2, and let its distance be dist(e12,e24).
For every p in c1 (c2), let q be the most dissimilar
member to p in c1, i.e., dist(p,q) > dist(p,r) for
every r in c1 (r != q).
Because dist(p,q) < dist(e12,e24), dist(p,r) <
dist(e12,e24).
60
How to Combine Clusters? (Continued)
  • Group-average clustering
  • a compromise between the extremes of single-link
    and complete-link systems
  • each cluster member has a greater average
    similarity to the remaining members of that
    cluster than it does to all members of any other
    cluster

61
Example for Agglomerative Clustering
A-F (6 items): 6(6-1)/2 = 15 pairwise similarities,
in decreasing order
62
Single-link Clustering
1. AF 0.9: A and F are joined at level 0.9.
sim(AF,X) = max(sim(A,X), sim(F,X))
        AF   B    C    D    E
  AF    .    .8   .5   .6   .8
  B     .8   .    .4   .5   .7
  C     .5   .4   .    .3   .5
  D     .6   .5   .3   .    .4
  E     .8   .7   .5   .4   .
2. AE 0.8: E joins the cluster AF at level 0.8.
sim(AEF,X) = max(sim(AF,X), sim(E,X))
63
Single-link Clustering (Continued)
        AEF  B    C    D
  AEF   .    .8   .5   .6
  B     .8   .    .4   .5
  C     .5   .4   .    .3
  D     .6   .5   .3   .
3. BF 0.8: B joins the cluster AEF at level 0.8.
sim(ABEF,X) = max(sim(AEF,X), sim(B,X))
Note: E and B are on the same level (i.e., same
similarity value).
        ABEF C    D
  ABEF  .    .5   .6
  C     .5   .    .3
  D     .6   .3   .
4. BE 0.7: B and E are already in the same cluster,
so the structure does not change.
sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X))
64
Single-link Clustering (Continued)
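         ABDEF C
  ABDEF  .     .5
  C      .5    .
5. AD 0.6: D joins the cluster ABEF at level 0.6.
6. AC 0.5: C joins the cluster ABDEF at level 0.5.
[Dendrogram: A and F join at 0.9; E and B join at 0.8; D joins at 0.6; C joins at 0.5.]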
65
Single-Link Clusters
  • Similarity level 0.7 (i.e., similarity threshold):
    clusters {A, B, E, F} (links AF .9, AE .8, BF .8,
    BE .7), {C}, {D}
  • Similarity level 0.5 (i.e., similarity threshold):
    one cluster {A, B, C, D, E, F} (links .9, .8, .8,
    .7, .6, .5)

Small number of large, poorly linked clusters
66
Complete-link cluster generation
Columns: step number and similarity pair; check
operations; complete-link structure; pairs covered;
similarity matrix
sim(AF,X) = min(sim(A,X), sim(F,X))
1. AF 0.9   new: A and F are joined at 0.9
2. AE 0.8   check EF?
            pairs covered: (A,E) (A,F)
3. BF 0.8   check AB?
            pairs covered: (A,E) (A,F) (B,F)
67
Complete-link cluster generation (Continued)
Similarity matrix:
        AF   B    C    D    E
  AF    .    .3   .2   .1   .3
  B     .3   .    .4   .5   .7
  C     .2   .4   .    .3   .5
  D     .1   .5   .3   .    .4
  E     .3   .7   .5   .4   .
4. BE 0.7   new: B and E are joined at 0.7
5. AD 0.6   check DF?
            pairs covered: (A,D)(A,E)(A,F) (B,E)(B,F)
6. AC 0.6   check CF?
            pairs covered: (A,C)(A,D)(A,E)(A,F) (B,E)(B,F)
7. BD 0.5   check DE?
            pairs covered: (A,C)(A,D)(A,E)(A,F)
            (B,D)(B,E)(B,F)
68
Complete-link cluster generation (Continued)
Similarity matrix:
        AF   BE   C    D
  AF    .    .3   .2   .1
  BE    .3   .    .4   .4
  C     .2   .4   .    .3
  D     .1   .4   .3   .
8. CE 0.5   check BC?
            pairs covered: (A,C)(A,D)(A,E)(A,F)
            (B,D)(B,E)(B,F) (C,E)
9. BC 0.4   check CE 0.5 (in the checklist):
            C joins the cluster BE at 0.4
10. DE 0.4  check BD 0.5, CD?
            pairs covered: (A,C)(A,D)(A,E)(A,F)
            (B,C)(B,D)(B,E)(B,F) (C,E)(D,E)
11. AB 0.3  check AC 0.5, AE 0.8, BF 0.8, CF?, EF?
            pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F)
            (B,C)(B,D)(B,E)(B,F) (C,E)(D,E)
69
Complete-link cluster generation (Continued)
Similarity matrix:
        AF   BCE  D
  AF    .    .2   .1
  BCE   .2   .    .3
  D     .1   .3   .
12. CD 0.3  check BD 0.5, DE 0.4:
            D joins the cluster BCE at 0.3
13. EF 0.3  check BF 0.8, CF?, DF?
            pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F)
            (B,C)(B,D)(B,E)(B,F) (C,D)(C,E)(D,E)(E,F)
14. CF 0.2  check BF 0.8, EF 0.3, DF?
            pairs covered: (A,B)(A,C)(A,D)(A,E)(A,F)
            (B,C)(B,D)(B,E)(B,F) (C,D)(C,E)(C,F)(D,E)(E,F)
[Dendrogram so far: B and E join at 0.7; C joins at 0.4; D joins at 0.3.]
70
Complete-link cluster generation (Continued)
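        AF   BCDE
  AF    .    .1
  BCDE  .1   .
15. DF 0.1  last pair: AF joins the cluster BCDE at 0.1
[Dendrogram: A and F join at 0.9; B and E join at 0.7; C joins BE at 0.4; D joins BCE at 0.3; AF and BCDE join at 0.1.]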
71
Complete link clusters
Similarity level 0.7: clusters {A, F} (0.9), {B, E}
(0.7), {C}, {D}
Similarity level 0.4: clusters {A, F} (0.9),
{B, C, E} (0.7, 0.5, 0.4), {D}
Similarity level 0.3: clusters {A, F} (0.9),
{B, C, D, E} (0.7, 0.5, 0.5, 0.4, 0.4, 0.3)
Larger number of small, tightly linked clusters
72
The Behavior of Single-Link Cluster
  • The single-link process tends to produce a small
    number of large clusters that are characterized
    by a chaining effect.
  • Each element is usually attached to only one
    other member of the same cluster at each
    similarity level.
  • It is sufficient to remember the list of
    previously clustered single items.

73
The Behavior of Complete-Link Cluster
  • Complete-link process produces a much larger
    number of small, tightly linked groupings.
  • Each item in a complete-link cluster is
    guaranteed to resemble all other items in that
    cluster at the stated similarity level.
  • It is necessary to remember the list of all item
    pairs previously considered in the clustering
    process.

74
The Behavior of Complete-Link Cluster (Continued)
  • The complete-link clustering system may be better
    adapted to retrieval than the single-link
    clusters.
  • A complete-link cluster generation is more
    expensive to perform than a comparable
    single-link process.

75
How to Generate Similarity
Di = (di1, di2, ..., dit): document vector for Di
Lj = (lj1, lj2, ..., ljnj): inverted list for term Tj
  lji denotes the document identifier of the ith
  document listed under term Tj
  nj denotes the number of postings for term Tj
  (number of documents containing term Tj)
for j = 1 to t (for each of t possible terms)
  for i = 1 to nj (for all nj entries on the jth list)
    compute sim(D_lji, D_ljk), i+1 <= k <= nj
  end for
end for
76
Similarity without Recomputation
for j = 1 to N (for each document in collection)
  set S(j) = 0, 1 <= j <= N
  for k = 1 to nj (for each term in document)
    take up inverted list Lk
    for i = 1 to nk (for each document identifier on list)
      if i < j or if Sji = 1
        take up next document Di
      else
        compute sim(Dj, Di)
        set Sji = 1
    end for
  end for
end for
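
A minimal sketch of this loop, assuming documents are given as sets of terms and `sim` is any function of two document ids. A per-document set plays the role of the flags Sji, and i <= j is skipped (the slide writes i < j) so that a document is never compared with itself. All names are illustrative.

```python
def pairwise_via_inverted_lists(docs, sim):
    """Compute sim once for every pair of documents sharing at least one term."""
    # Build the inverted lists: term -> document ids in increasing order.
    inverted = {}
    for doc_id, terms in enumerate(docs):
        for term in terms:
            inverted.setdefault(term, []).append(doc_id)

    results = {}
    for j, terms in enumerate(docs):
        done = set()                      # flags S_ji for the current document j
        for term in terms:
            for i in inverted[term]:
                if i <= j or i in done:   # already handled: take up next document
                    continue
                results[(j, i)] = sim(j, i)
                done.add(i)
    return results

DOCS = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a", "b"}]
calls = []

def overlap(j, i):
    calls.append((j, i))                  # record each similarity computation
    return len(DOCS[j] & DOCS[i])

sims = pairwise_via_inverted_lists(DOCS, overlap)
```

Documents 0 and 3 share two terms, but their similarity is computed only once; documents with no term in common (2 and 3) are never compared.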
77
Heuristic Clustering Methods
  • Hierarchical clustering strategies
  • use all pairwise similarities between items
  • cluster generation is relatively expensive
  • produce a unique set of well-formed clusters for
    each set of data, regardless of the order in
    which the similarity pairs are introduced into
    the clustering process
  • Heuristic clustering methods
  • produce rough cluster arrangements at relatively
    little expense

78
Single-Pass Heuristic Clustering Methods
  • Item 1 is first taken and placed into a cluster
    of its own.
  • Each subsequent item is then compared against all
    existing clusters.
  • It is placed in a previously existing cluster
    whenever it is sufficiently similar to that
    cluster.
  • Compute the similarities between all existing
    centroids and the new incoming item.
  • When an item is added to an existing cluster, the
    corresponding centroid must then be appropriately
    updated.
  • If a new item is not sufficiently similar to any
    existing cluster, the new item forms a cluster of
    its own.
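
The single-pass procedure can be sketched as follows. Items are assumed to be dense vectors compared by cosine similarity against cluster centroids, an item joins the most similar cluster at or above the threshold, and the centroid is the running mean of its members. These choices and all names are assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(items, threshold):
    """Return clusters as lists of item indices, in order of creation."""
    clusters = []  # each cluster: {"members": [indices], "centroid": [floats]}
    for idx, vec in enumerate(items):
        best, best_sim = None, threshold
        for cluster in clusters:
            s = cosine(vec, cluster["centroid"])
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            # Not sufficiently similar to any cluster: start a new one.
            clusters.append({"members": [idx], "centroid": list(vec)})
        else:
            best["members"].append(idx)
            n = len(best["members"])
            # Update the centroid as the running mean of the members.
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], vec)]
    return [c["members"] for c in clusters]

grouped = single_pass([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], threshold=0.8)
```

Because items are processed in arrival order, feeding the same vectors in a different order can give a different arrangement, which is the order dependence noted on the next slide.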

79
Single-Pass Heuristic Clustering Methods (Continued)
  • Produce uneven cluster structures.
  • Solutions
  • cluster splitting: controls cluster sizes
  • variable similarity thresholds: control the number
    of clusters and the overlap among clusters
  • Produce cluster arrangements that vary according
    to the order of individual items.

80
Cluster Splitting
[Figure: three stages of cluster splitting]
Addition of one more item to cluster A
Splitting cluster A into two pieces A' and A''
Splitting supercluster S into two pieces S' and S''
81
Cluster Searching
  • Cluster centroid: the average vector of all the
    documents in a given cluster
  • strategies
  • top down: the query is first compared with the
    highest-level centroids
  • bottom up: only the lowest-level centroids are
    stored; the higher-level cluster structure is
    disregarded

82
Top-down entire-clustering search
1. Initialize by adding the top item to the active
   node list.
2. Take the centroid with the highest query similarity
   from the active node list.
   if the number of singleton items in the subtree
   headed by that centroid is not larger than the
   number of items wanted,
     then retrieve these singleton items and eliminate
     the centroid from the active node list
     else eliminate the centroid with the highest query
     similarity from the active node list and add its
     sons to the active node list
3. if number of retrieved >= number wanted
     then stop
     else repeat step 2
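
The search can be sketched with a priority queue as the active node list. The tree below is hypothetical: its shape and the document names C through H are made up so that the subtree sizes (14, 6, 2, 4, 2) and centroid similarities match the trace on the next slide; the slides do not give the actual tree. Three items are assumed to be wanted.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    sim: float = 0.0                      # query similarity of this centroid
    children: list = field(default_factory=list)

def leaves(node):
    """Singleton items (documents) in the subtree headed by node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def top_down_search(root, wanted):
    active = [(-root.sim, 0, root)]       # max-heap via negated similarity
    counter = 1                           # tie-breaker so nodes are never compared
    retrieved = []
    while active and len(retrieved) < wanted:
        _, _, node = heapq.heappop(active)
        docs = leaves(node)
        if len(docs) <= wanted:           # subtree not larger than number wanted
            retrieved.extend(d.name for d in docs)
        else:                             # too big: replace centroid by its sons
            for child in node.children:
                heapq.heappush(active, (-child.sim, counter, child))
                counter += 1
    return retrieved

def docs_named(names):
    return [Node(n) for n in names]

tree = Node("1", 0.2, [
    Node("2", 0.5, [Node("5", 0.6, docs_named("AB")),
                    Node("6", 0.5, docs_named("CD"))]),
    Node("4", 0.7, [Node("8", 0.8, docs_named("IJ")),
                    Node("9", 0.3, docs_named("KLMN"))]),
    Node("3", 0.0, docs_named("EFGH")),
])
found = top_down_search(tree, wanted=3)
```

On this tree the search expands nodes 1 and 4, retrieves I and J from node 8, expands node 2, and retrieves A and B from node 5, at which point at least three items have been retrieved.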
83
Active node list                 Number of single     Retrieved
                                 items in subtree     items
(1,0.2)                          14 (too big)
(2,0.5), (4,0.7), (3,0)          6 (too big)
(2,0.5), (8,0.8), (9,0.3),       2                    I, J
(3,0)
(2,0.5), (9,0.3), (3,0)          4 (too big)
(5,0.6), (6,0.5), (9,0.3),       2                    A, B
(3,0)
84
Bottom-up Individual-Cluster Search
Take a specified number of low-level centroids.
if there are enough singleton items in those clusters
to equal the number of items wanted,
  then retrieve the number of items wanted in ranked
  order
  else add additional low-level centroids to the list
  and repeat the test
85
Active centroid list: (8,.8), (4,.7), (5,.6)
Ranked documents from clusters: (I,.9), (L,.8),
(A,.8), (K,.6), (B,.5), (J,.4), (N,.4), (M,.2)
Retrieved items: I, L, A