Transcript and Presenter's Notes

Title: Text Similarity


1
Text Similarity
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu
2
(No Transcript)
3
(No Transcript)
4
Information Retrieval
  • Task Statement:
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries.
  • This assumption underlies the field of
    Information Retrieval.

5
(Diagram: an information need is expressed as text input to form a
query ("How is the query constructed?"), the text is processed ("How
is the text processed?"), and the query is matched against the
collections)
6
Terminology
Token: a natural language word, e.g., Swim, Simpson, 92513, etc.
Document: usually a web page, but more generally any file.
7
Some IR History
  • Roots in the scientific Information Explosion
    following WWII
  • Interest in computer-based IR from the mid-1950s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns)
    (1960)
  • Boolean system development at Lockheed (60s)
  • Vector Space Model (Salton at Cornell, 1965)
  • Statistical Weighting methods and theoretical
    advances (70s)
  • Refinements and Advances in application (80s)
  • User Interfaces, Large-scale testing and
    application (90s)

8
Relevance
  • In what ways can a document be relevant to a
    query?
  • Answer precise question precisely.
  • Who is Homer's boss? Montgomery Burns.
  • Partially answer question.
  • Where does Homer work? Power Plant.
  • Suggest a source for more information.
  • What is Bart's middle name? Look in Issue 234 of
    the Fanzine.
  • Give background information.
  • Remind the user of other knowledge.
  • Others ...

9
Information need
Collections
How is the query constructed?
text input
How is the text processed?
The section that follows is about Content
Analysis (transforming raw text into a
computationally more manageable form)
10
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never change grammatical class
  • dog, dogs
  • Bike, Biking
  • Swim, Swimmer, Swimming

What about build, building?
11
Examples of Stemming (using Porter's algorithm)
Original Words: consign, consigned, consigning, consignment, consist,
consisted, consistency, consistent, consistently, consisting, consists
Stemmed Words: consign, consign, consign, consign, consist, consist,
consist, consist, consist, consist, consist
Porter's algorithm is available in Java, C, Lisp, Perl, Python etc.
from http://www.tartarus.org/martin/PorterStemmer/
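The same grouping can be reproduced with any of those implementations.
A minimal sketch in Python, assuming the NLTK port of the Porter
stemmer is installed (NLTK is not mentioned on the slide; it is just
one convenient option):

  # Minimal sketch: reproduce the grouping above with NLTK's Porter stemmer.
  # Assumes the nltk package is installed (pip install nltk).
  from nltk.stem import PorterStemmer

  stemmer = PorterStemmer()
  words = ["consign", "consigned", "consigning", "consignment",
           "consist", "consisted", "consistency", "consistent",
           "consistently", "consisting", "consists"]
  for w in words:
      print(w, "->", stemmer.stem(w))
  # Every consign* form maps to "consign", every consist* form to "consist".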
12
Errors Generated by Porter Stemmer (Krovetz 93)
Homework!! Play with the following URL:
http://fusion.scs.carleton.ca/dquesnel/java/stuff/PorterApplet.html
13
Statistical Properties of Text
  • Token occurrences in text are not uniformly
    distributed
  • They are also not normally distributed
  • They do exhibit a Zipf distribution

14
Government documents: 157,734 tokens, 32,259 unique

Most frequent tokens:
8164 the   4771 of   4005 to    2834 a      2827 and      2802 in
1592 The   1370 for  1326 is    1324 s      1194 that      973 by
 969 on     915 FT    883 Mr     860 was     855 be        849 Pounds
 798 TEXT   798 PUB   798 PROFILE  798 PAGE  798 HEADLINE  798 DOCNO

Tokens occurring once:
1 ABC   1 ABFT   1 ABOUT   1 ACFT   1 ACI   1 ACQUI
1 ACQUISITIONS   1 ACSIS   1 ADFT   1 ADVISERS   1 AE
15
Plotting Word Frequency by Rank
  • Main idea: count
  • How many times tokens occur in the text
  • Over all texts in the collection
  • Now rank these according to how often they occur.
    This is called the rank.

16
Rank  Freq  Term          Rank  Freq  Term
 1     37   system         11    11   expert
 2     32   knowledg       12    11   analysi
 3     24   base           13    10   rule
 4     20   problem        14    10   program
 5     18   abstract       15    10   oper
 6     15   model          16    10   evalu
 7     15   languag        17    10   comput
 8     15   implem         18    10   case
 9     13   reason         19     9   gener
10     13   inform         20     9   form

(Figure: the corresponding Zipf curve)
17
Zipf Distribution
  • The Important Points
  • a few elements occur very frequently
  • a medium number of elements have medium frequency
  • many elements occur very infrequently

18
Zipf Distribution
  • The product of the frequency of words (f) and
    their rank (r) is approximately constant
  • Rank = the order of words by frequency of
    occurrence
  • Another way to state this is with an
    approximately correct rule of thumb:
  • Say the most common term occurs C times
  • The second most common occurs C/2 times
  • The third most common occurs C/3 times

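Writing the rule of thumb as a formula (the slide's own equation is not
transcribed; this is the standard statement of Zipf's law):

  f(r) \approx \frac{C}{r}, \qquad \text{equivalently} \quad f(r) \cdot r \approx C

where f(r) is the frequency of the word with rank r and C is roughly
the frequency of the most common word.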
19
Zipf Distribution (linear and log scale)
20
What Kinds of Data Exhibit a Zipf Distribution?
  • Words in a text collection
  • Virtually any language usage
  • Library book checkout patterns
  • Incoming Web Page Requests
  • Outgoing Web Page Requests
  • Document Size on Web
  • City Sizes

21
Consequences of Zipf
  • There are always a few very frequent tokens that
    are not good discriminators.
  • Called stop words in IR
  • English examples: to, from, on, and, the, ...
  • There are always a large number of tokens that
    occur once and can mess up algorithms.
  • Medium-frequency words are the most descriptive

22
Word Frequency vs. Resolving Power (from van
Rijsbergen 79)
The most frequent words are not the most
descriptive.
23
Statistical Independence
  • Two events x and y are statistically
    independent if the product of the probabilities
    of their happening individually equals the
    probability of their happening together.

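In symbols (the slide's formula is not transcribed; this is the
standard definition):

  P(x, y) = P(x) \, P(y)

Two events are independent exactly when the joint probability factors
into the product of the individual probabilities.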
24
Lexical Associations
  • Subjects write first word that comes to mind
  • doctor/nurse, black/white (Palermo & Jenkins 64)
  • Text Corpora yield similar associations
  • One measure: Mutual Information (Church and
    Hanks 89)
  • If word occurrences were independent, the
    numerator and denominator would be equal (if
    measured across a large collection)

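The Church and Hanks (89) measure referred to here is pointwise mutual
information; a standard form (the slide's own formula is not
transcribed) is:

  I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

Under independence the ratio is 1 and I(x, y) = 0; strongly associated
pairs such as doctor/nurse score well above 0.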
25
Statistical Independence
  • Compute for a window of words

(Diagram: windows w1, w11, w21 of consecutive words slid over the
token sequence a b c d e f g h i j k l m n o p)
26
Interesting Associations with "Doctor" (AP
Corpus, N = 15 million, Church & Hanks 89)
27
Un-Interesting Associations with "Doctor" (AP
Corpus, N = 15 million, Church & Hanks 89)
These associations were likely to happen because
the non-doctor words shown here are very
common and therefore likely to co-occur with any
noun.
28
Associations Are Important Because
  • We may be able to discover phrases that should
    be treated as a single word, e.g., data mining.
  • We may be able to automatically discover
    synonyms, e.g., Bike and Bicycle.

29
Content Analysis Summary
  • Content Analysis: transforming raw text into more
    computationally useful forms
  • Words in text collections exhibit interesting
    statistical properties
  • Word frequencies have a Zipf distribution
  • Word co-occurrences exhibit dependencies
  • Text documents are transformed to vectors
  • Pre-processing includes tokenization, stemming,
    collocations/phrases

30
(No Transcript)
31
(Diagram: information need, text input, collections, "How is the index
constructed?")
The section that follows is about Index Construction
32
Inverted Index
  • This is the primary data structure for text
    indexes
  • Main Idea:
  • Invert documents into a big index
  • Basic steps:
  • Make a dictionary of all the tokens in the
    collection
  • For each token, list all the docs it occurs in.
  • Do a few things to reduce redundancy in the data
    structure

33
Inverted Indexes
  • We have seen Vector files conceptually. An
    Inverted File is a vector file "inverted" so
    that rows become columns and columns become rows.

34
How Are Inverted Files Created
  • Documents are parsed to extract tokens. These are
    saved with the Document ID.

Doc 1
Doc 2
Now is the time for all good men to come to the
aid of their country
It was a dark and stormy night in the country
manor. The time was past midnight
35
How Inverted Files are Created
  • After all documents have been parsed the inverted
    file is sorted alphabetically.

36
How Inverted Files are Created
  • Multiple term entries for a single document are
    merged.
  • Within-document term frequency information is
    compiled.

37
How Inverted Files are Created
  • Then the file can be split into
  • A Dictionary file
  • and
  • A Postings file

38
How Inverted Files are Created
  • (Figure: Dictionary and Postings files)

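The steps on slides 34-38 can be sketched in a few lines of Python;
this is an illustrative toy (names such as build_inverted_index are
mine, not from the course), using the two example documents from
slide 34:

  # Toy sketch of inverted-file construction: parse, sort, merge
  # duplicates, and keep within-document term frequencies.
  from collections import defaultdict

  docs = {
      1: "Now is the time for all good men to come to the aid of their country",
      2: "It was a dark and stormy night in the country manor. The time was past midnight",
  }

  def build_inverted_index(docs):
      index = defaultdict(lambda: defaultdict(int))          # term -> {doc_id: tf}
      for doc_id, text in docs.items():
          for token in text.lower().replace(".", "").split():  # crude tokenization
              index[token][doc_id] += 1                        # merge duplicates, keep tf
      # dictionary = the sorted terms; postings = the per-term (doc_id, tf) lists
      return {term: sorted(tfs.items()) for term, tfs in sorted(index.items())}

  index = build_inverted_index(docs)
  print(index["country"])   # [(1, 1), (2, 1)]
  print(index["time"])      # [(1, 1), (2, 1)]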
39
Inverted Indexes
  • Permit fast search for individual terms
  • For each term, you get a list consisting of
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc (optional)
  • These lists can be used to solve Boolean queries
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2
  • Also used for statistical ranking algorithms

40
How Inverted Files are Used
Query on time AND dark:
  2 docs with time in dictionary -> IDs 1 and 2 from postings file
  1 doc with dark in dictionary -> ID 2 from postings file
  Therefore, only doc 2 satisfies the query.
  • (Figure: Dictionary and Postings files)

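Continuing the sketch above, the "time AND dark" lookup is just an
intersection of the two postings lists (again a toy illustration, not
code from the course):

  # Boolean AND: intersect the sets of document ids from the two postings lists.
  def boolean_and(index, term1, term2):
      docs1 = {doc_id for doc_id, tf in index.get(term1, [])}
      docs2 = {doc_id for doc_id, tf in index.get(term2, [])}
      return sorted(docs1 & docs2)

  print(boolean_and(index, "time", "dark"))   # [2] -- only doc 2 satisfies the query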
41
(No Transcript)
42
Information need
Collections
text input
How is the index constructed?
The section that follows is about Querying (and
ranking)
43
Simple query language: Boolean
  • Terms and Connectors (or operators)
  • terms
  • words
  • normalized (stemmed) words
  • phrases
  • connectors
  • AND
  • OR
  • NOT
  • NEAR (Pseudo Boolean)
  Word     Doc
  Cat       x
  Dog
  Collar    x
  Leash

44
Boolean Queries
  • Cat
  • Cat OR Dog
  • Cat AND Dog
  • (Cat AND Dog)
  • (Cat AND Dog) OR Collar
  • (Cat AND Dog) OR (Collar AND Leash)
  • (Cat OR Dog) AND (Collar OR Leash)

45
Boolean Searching
Information need: "Measurement of the width of cracks in prestressed
concrete beams"
Formal Query: cracks AND beams AND Width_measurement AND
Prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P)
OR (B AND W AND P)
(Venn diagram of the four concepts: Cracks, Beams, Width measurement,
Prestressed concrete)
46
Ordering of Retrieved Documents
  • Pure Boolean has no ordering
  • In practice
  • order chronologically
  • order by total number of hits on query terms
  • What if one term has more hits than others?
  • Is it better to have one of each term or many of
    one term?

47
Boolean Model
  • Advantages
  • simple queries are easy to understand
  • relatively easy to implement
  • Disadvantages
  • difficult to specify what is wanted
  • too much returned, or too little
  • ordering not well determined
  • Dominant language in commercial Information
    Retrieval systems until the WWW

Since the Boolean model is limited, let's consider
a generalization
48
Vector Model
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating point numbers
  • Has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse
  • Smithers secretly loves Monty Burns
  • Monty Burns secretly loves Smithers
  • Both map to
  • Burns, loves, Monty, secretly, Smithers

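A small sketch (mine, not from the slides) showing why the two
Smithers/Burns sentences become the same vector once word order is
discarded:

  # Bag-of-words: each document becomes a vector of term counts over the
  # collection vocabulary, so word order is lost.
  from collections import Counter

  vocab = ["Burns", "loves", "Monty", "secretly", "Smithers"]

  def to_vector(text):
      counts = Counter(text.split())
      return [counts[term] for term in vocab]

  print(to_vector("Smithers secretly loves Monty Burns"))   # [1, 1, 1, 1, 1]
  print(to_vector("Monty Burns secretly loves Smithers"))   # [1, 1, 1, 1, 1]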
49
Document Vectors: One location for each word
Terms: nova, galaxy, heat, hwood, film, role, diet, fur
Document ids A-I, one row of term counts per document:
  A: 10 5 3
  B: 5 10
  C: 10 8 7
  D: 9 10 5
  E: 10 10
  F: 9 10
  G: 5 7 9
  H: 6 10 2 8
  I: 7 5 1 3
50
We Can Plot the Vectors
(Plot: documents plotted against two term axes, "Star" and "Diet"; a
doc about movie stars, a doc about astronomy, and a doc about mammal
behavior sit in different regions of the space)
51
Documents in 3D Vector Space
(Plot: documents D1-D11 placed in a 3D space with term axes t1, t2, t3)
52
Vector Space Model
Note that the query is projected into the same
vector space as the documents. The query here is
for Marge. We can use a vector similarity
model to determine the best match to our query
(details in a few slides). But what weights
should we use for the terms?
53
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

54
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

We have already seen and discussed this model.
55
Raw Term Weights
  • The frequency of occurrence for the term in each
    document is included in the vector

This model is open to exploitation by websites:
sex sex sex sex sex sex sex sex sex sex sex sex
sex sex sex sex sex sex sex sex sex sex sex sex
sex sex sex sex sex sex
Counts can be normalized by document lengths.
56
tf x idf Weights
  • tf x idf measure:
  • term frequency (tf)
  • inverse document frequency (idf) -- a way to deal
    with the problems of the Zipf distribution
  • Goal: assign a tf x idf weight to each term in
    each document

57
tf idf
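The formula on this slide is not in the transcript; a common
formulation (the exact notation here is an assumption) is:

  w_{ij} = tf_{ij} \times \log \frac{N}{n_j}

where tf_{ij} is the frequency of term j in document i, N is the
number of documents in the collection, and n_j is the number of
documents that contain term j.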
58
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

For a collection of 10,000 documents:
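A worked example with hypothetical document frequencies, assuming
idf_j = log10(N / n_j) and N = 10,000:

  term occurs in 10,000 docs:  idf = log10(10000/10000) = 0
  term occurs in  5,000 docs:  idf = log10(10000/5000)  = 0.301
  term occurs in     20 docs:  idf = log10(10000/20)    = 2.699
  term occurs in      1 doc:   idf = log10(10000/1)     = 4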
59
Similarity Measures
  • Simple matching (coordination level match)
  • Dice's Coefficient
  • Jaccard's Coefficient
  • Cosine Coefficient
  • Overlap Coefficient
60
Cosine
(Plot: illustration of the cosine measure in a 2D term space; both
axes are ticked from 0.2 to 1.0)
61
Vector Space Similarity Measure
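The measure itself is not transcribed; the standard cosine similarity
between a query q and a document d_i is:

  sim(q, d_i) = \frac{\sum_j w_{qj}\, w_{ij}}{\sqrt{\sum_j w_{qj}^2}\;\sqrt{\sum_j w_{ij}^2}}

i.e. the normalized dot product, which depends only on the angle
between the two vectors and not on their lengths.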
62
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • it is more for visualization than having any real
    basis
  • most similarity measures work about the same
    regardless of model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms

63
Probabilistic Models
  • Rigorous formal model: attempts to predict the
    probability that a given document will be
    relevant to a given query
  • Ranks retrieved documents according to this
    probability of relevance (Probability Ranking
    Principle)
  • Rely on accurate estimates of probabilities

64
(No Transcript)
65
Relevance Feedback
  • Main Idea:
  • Modify existing query based on relevance
    judgements
  • Query Expansion: extract terms from relevant
    documents and add them to the query
  • Term Re-weighting: re-weight the terms
    already in the query
  • Two main approaches:
  • Automatic (pseudo-relevance feedback)
  • Users select relevant documents
  • Users/system select terms from an
    automatically-generated list

66
Definition: Relevance Feedback is the reformulation of a search query
in response to feedback provided by the user for the results of
previous versions of the query.
Suppose you are interested in bovine agriculture on the banks of the
river Jordan.
(Diagram: Search -> Display Results -> Gather Feedback -> Update
Weights, looping back to Search)
67
Rocchio Method
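The formula itself is not in the transcript; the standard Rocchio
update (with the usual alpha/beta/gamma weights) is:

  Q' = \alpha Q_0 + \frac{\beta}{|D_r|} \sum_{d \in D_r} d - \frac{\gamma}{|D_{nr}|} \sum_{d \in D_{nr}} d

where Q_0 is the original query vector, D_r the set of known relevant
documents, and D_nr the set of known non-relevant documents.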
68
Rocchio Illustration
Although we usually work in vector space for
text, it is easier to visualize Euclidean space
Original Query
Term Re-weighting: note that both the location of
the center and the shape of the query have
changed
Query Expansion
69
Rocchio Method
  • Rocchio automatically:
  • re-weights terms
  • adds in new terms (from relevant docs)
  • Most methods perform similarly
  • results heavily dependent on test collection
  • Machine learning methods are proving to work
    better than standard IR approaches like Rocchio

70
Using Relevance Feedback
  • Known to improve results
  • People don't seem to like giving feedback!

71
(Diagram: information need, text input, collections, "How is the index
constructed?")
The section that follows is about Evaluation
72
Evaluation
  • Why Evaluate?
  • What to Evaluate?
  • How to Evaluate?

73
Why Evaluate?
  • Determine if the system is desirable
  • Make comparative assessments

74
What to Evaluate?
  • How much of the information need is satisfied.
  • How much was learned about a topic.
  • Incidental learning
  • How much was learned about the collection.
  • How much was learned about other topics.
  • How inviting the system is.

75
What to Evaluate?
  • What can be measured that reflects users'
    ability to use the system? (Cleverdon 66)
  • Coverage of Information
  • Form of Presentation
  • Effort required/Ease of Use
  • Time and Space Efficiency
  • Recall
  • proportion of relevant material actually
    retrieved
  • Precision
  • proportion of retrieved material actually relevant

(Recall and Precision together measure effectiveness)
76
Relevant vs. Retrieved
(Venn diagram: the retrieved set and the relevant set overlap inside
the set of all docs)
77
Precision vs. Recall
(Venn diagram: the same retrieved and relevant sets, used to define
precision and recall)
78
Why Precision and Recall?
  • Intuition
  • Get as much good stuff while at the same time
    getting as little junk as possible.

79
Retrieved vs. Relevant Documents
Very high precision, very low recall
80
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
81
Retrieved vs. Relevant Documents
High recall, but low precision
82
Retrieved vs. Relevant Documents
High precision, high recall (at last!)
83
Precision/Recall Curves
  • There is a tradeoff between Precision and Recall
  • So measure Precision at different levels of
    Recall
  • Note this is an AVERAGE over MANY queries

(Plot: precision on the y-axis vs. recall on the x-axis, with measured
points along the averaged curve)
84
Precision/Recall Curves
  • Difficult to determine which of these two
    hypothetical results is better

(Plot: two hypothetical precision/recall curves plotted on the same
precision vs. recall axes)
85
Document Cutoff Levels
  • Another way to evaluate
  • Fix the number of documents retrieved at several
    levels
  • top 5
  • top 10
  • top 20
  • top 50
  • top 100
  • top 500
  • Measure precision at each of these levels
  • Take (weighted) average over results
  • This is a way to focus on how well the system
    ranks the first k documents.

86
Problems with Precision/Recall
  • Can't know true recall value
  • except in small collections
  • Precision/Recall are related
  • A combined measure sometimes more appropriate
  • Assumes batch mode
  • Interactive IR is important and has different
    criteria for successful searches
  • Assumes a strict rank ordering matters.

87
Relation to Contingency Table
                         Doc is Relevant   Doc is NOT relevant
  Doc is retrieved              a                   b
  Doc is NOT retrieved          c                   d

  • Accuracy = (a + d) / (a + b + c + d)
  • Precision = a / (a + b)
  • Recall = a / (a + c)
  • Why don't we use Accuracy for IR?
  • (Assuming a large collection)
  • Most docs aren't relevant
  • Most docs aren't retrieved
  • Inflates the accuracy value

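A worked example with made-up numbers shows the inflation: suppose a
collection of 1,000,000 documents contains 100 relevant ones, and the
system retrieves 100 documents of which 20 are relevant. Then a = 20,
b = 80, c = 80, d = 999,820, so Precision = Recall = 0.20 while
Accuracy = (20 + 999,820) / 1,000,000 = 0.99984, which looks excellent
despite the poor retrieval.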
88
The E-Measure
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

P = precision, R = recall, b = measure of the relative importance of P
or R. For example, b = 0.5 means the user is twice as interested in
precision as recall.
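The slide's formula is not transcribed; van Rijsbergen's E-measure is
usually written as:

  E = 1 - \frac{(1 + b^2)\, P R}{b^2 P + R}

so smaller E is better; with b = 1 it reduces to one minus the
familiar F-measure (the harmonic mean of P and R).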
89
How to Evaluate? Test Collections
90
TREC
  • Text REtrieval Conference/Competition
  • Run by NIST (National Institute of Standards and
    Technology)
  • 2004 (November) will be the 13th year
  • Collection: >6 Gigabytes (5 CD-ROMs), >1.5
    Million Docs
  • Newswire and full-text news (AP, WSJ, Ziff, FT)
  • Government documents (Federal Register,
    Congressional Record)
  • Radio Transcripts (FBIS)
  • Web subsets

91
TREC (cont.)
  • Queries and Relevance Judgments
  • Queries devised and judged by Information
    Specialists
  • Relevance judgments done only for those documents
    retrieved -- not entire collection!
  • Competition
  • Various research and commercial groups compete
    (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  • Results judged on precision and recall, going up
    to a recall level of 1000 documents

92
TREC
  • Benefits
  • made research systems scale to large collections
    (pre-WWW)
  • allows for somewhat controlled comparisons
  • Drawbacks
  • emphasis on high recall, which may be unrealistic
    for what most users want
  • very long queries, also unrealistic
  • comparisons still difficult to make, because
    systems are quite different on many dimensions
  • focus on batch ranking rather than interaction
  • no focus on the WWW

93
TREC is changing
  • Emphasis on specialized tracks
  • Interactive track
  • Natural Language Processing (NLP) track
  • Multilingual tracks (Chinese, Spanish)
  • Filtering track
  • High-Precision
  • High-Performance
  • http://trec.nist.gov/

94
Homework