Title: Term and Document Frequency
1. CS 6633 Information Retrieval and Web Search
- Lecture 5
- Term and Document Frequency
- Based on ppt files by Hinrich Schütze
2. This lecture
- Parametric and field searches
- Zones in documents
- Scoring documents: zone weighting
- Index support for scoring
- Term weighting
3. Parametric search
- Documents have text (data) and metadata (data about data)
- Metadata: a set of field-value pairs
- Examples
- Language = French
- Format = pdf
- Subject = Physics etc.
- Date = Feb 2000
- Parametric search interface
- Combine a full-text query with selection of values for some fields
4. Parametric search example
Fixed field values or range
5. Parametric search example
Full-text search
7. Parametric/field search
- In these examples, we select field values
- Values can be hierarchical, e.g.,
- Geography: Continent → Country → State → City
- Domain: tw, edu.tw, nthu.edu.tw
- Use field/value to navigate through the document collection, e.g., Aerospace companies in Brazil
- Select Geography = Brazil
- Select Line of Business = Aerospace
- Use fields to filter docs, then run text searches to narrow down
8. Index support for field search
- Must be able to support queries of the form
- Find pdf documents that contain "stanford university"
- A field selection (on doc format) and a phrase query
- Field selection: use an inverted index of field values → docids
- Organized by field name
- Use compression
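A minimal Python sketch of this idea (the toy data and names are illustrative, not from the lecture): one inverted index per field maps each value to a sorted docid list, which can then be intersected with the postings of a text query. Phrase matching is omitted here; only co-occurrence is checked.

```python
# Toy field index: field name -> value -> sorted list of docids.
field_index = {
    "format": {"pdf": [1, 3, 7], "html": [2, 5]},
    "language": {"French": [3, 5], "English": [1, 2, 7]},
}

# Toy text index: term -> sorted list of docids.
text_index = {"stanford": [1, 2, 3], "university": [3, 7]}

def intersect(p1, p2):
    """Linear merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# "Find pdf documents that contain stanford AND university"
result = intersect(intersect(text_index["stanford"], text_index["university"]),
                   field_index["format"]["pdf"])
print(result)  # -> [3]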
9. Parametric index support
- Optional features
- Wildcards (e.g., Author = s*trup)
- Ranges (e.g., Date between 2003 and 2005)
- Use a B-tree
- Use query optimization heuristics
- Process the part that has a small df first
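One hedged illustration of the range feature: sorted field values with binary search stand in for the real B-tree here, but the lookup pattern is the same. All data is toy data.

```python
import bisect

# Sorted (value, docid) pairs play the role of B-tree leaves;
# a real engine would use an actual B-tree over field values.
date_index = sorted([(2002, 4), (2003, 1), (2004, 7), (2004, 2), (2006, 5)])

def range_query(index, lo, hi):
    """Return docids whose field value v satisfies lo <= v <= hi."""
    left = bisect.bisect_left(index, (lo, -1))
    right = bisect.bisect_right(index, (hi, float("inf")))
    return sorted(doc for _, doc in index[left:right])

print(range_query(date_index, 2003, 2005))  # -> [1, 2, 7]
```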
10. Zones
- A zone is an identified region within a doc
- E.g., Title, Abstract, Bibliography
- Generally culled from marked-up input or document metadata (e.g., powerpoint)
- Zones contain free text
- Not a database field with a finite vocabulary
- Indexes for each zone allow queries like
- sorting in Title AND smith in Bibliography AND recur in Body
- Not queries like "all papers whose authors cite themselves". Why?
11One index per zone
etc.
Author
Body
Title
12. Comparing information retrieval and database query
- Databases do lots of things we don't need
- Transactions
- Recovery (our index is not the system of record; if it breaks, simply reconstruct it from the original source)
- Indeed, we never have to store text in a search engine, only indexes
- In information retrieval, we focus on
- Optimizing indexes for text-oriented queries
- Not SQL commands (matching field with value)
13. Scoring and Ranking
14. Scoring
- The nature of Boolean queries
- Docs either match or not: score of 0 or 1
- Do people like Boolean queries?
- Experts with a good understanding of their needs and the doc collection can use Boolean queries effectively
- Difficult for casual users
- Good for small collections
- For large collections (the Web), search results can be thousands of documents
- Difficult to go through thousands of results
15. Scoring
- We wish to return ranked results where the most likely documents are placed at the top
- How can we rank-order the docs in the corpus with respect to a query?
- Give each document a score in [0,1] for the query
- Assume a perfect world without (keyword) spammers
- No stuffing keywords into a doc to make it match queries
- Will talk about adversarial IR (in the Web Search part)
16. Linear zone combinations
- First generation of scoring methods
- Use a linear combination of Booleans, e.g.,
- Score = 0.6·<sorting in Title> + 0.3·<sorting in Abstract> + 0.05·<sorting in Body> + 0.05·<sorting in Boldface>
- Each expression (e.g., <sorting in Title>) is given a value in {0,1}, so the overall score is in [0,1]
- AND queries will still return docs when only one keyword matches something (like an OR query)
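A minimal Python sketch of this linear combination (the weights and zone names are the slide's; the matching test is a simple stand-in for the Boolean zone check):

```python
# Zone weights from the slide; they sum to 1 so the score stays in [0, 1].
ZONE_WEIGHTS = {"title": 0.6, "abstract": 0.3, "body": 0.05, "boldface": 0.05}

def zone_score(term, doc):
    """Linear combination of Boolean zone matches for one term."""
    score = 0.0
    for zone, weight in ZONE_WEIGHTS.items():
        # 1 if the term occurs in this zone of the doc, else 0.
        if term in doc.get(zone, "").lower().split():
            score += weight
    return score

doc = {"title": "sorting algorithms", "abstract": "we study sorting", "body": "..."}
print(round(zone_score("sorting", doc), 2))  # -> 0.9 (title + abstract)
```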
17. Linear zone combinations
- How to generate weights such as 0.6?
- The user?
- The IR system?
- Mathematical models for weights and ranking
- Term frequency
- Document frequency
18. Exercise
[Figure: one postings index per zone]
- Author: bill → 1, 2; rights → (none shown)
- Title: bill → 3, 5, 8; rights → 3, 5, 9
- Body: bill → 1, 2, 5, 9; rights → 3, 5, 8, 9
19. Combining Boolean and Ranking
- Perform Boolean query processing (AND query)
- Process keywords in the query in order of df (smallest first)
- Merge posting lists for the keywords
- Stop when we have more docs than necessary
- Instead of score 1 for all docs, give each doc a new score for ranking (see the sketch below)
- A keyword with small df is more important
- A doc with high tf is more relevant
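A hedged sketch of the AND processing order (toy postings; here df of a term is simply the length of its postings list):

```python
# Toy postings lists: term -> sorted docids. df(term) = len(postings[term]).
postings = {
    "bill":   [1, 2, 5, 9],
    "rights": [3, 5, 8, 9],
    "new":    [1, 3, 5, 7, 8, 9],
}

def and_query(terms):
    """Intersect postings, starting from the rarest term (smallest df)."""
    terms = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[terms[0]]
    for t in terms[1:]:
        result = [d for d in result if d in postings[t]]  # simple merge
    return result

print(and_query(["new", "bill", "rights"]))  # -> [5, 9]
```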
20. General idea
- Assign a score to each doc/keyword
- Given a weight vector with a weight for each zone/field
- Combine the weights of keywords and zones/fields
- Present the top K highest-scoring docs
- K = 10, 20, 50, 100
21. Index support for zone combinations
- One index per zone
- Alternative: one single index
- Qualify each term with its zone in the dictionary
- E.g.:
- bill.author → 1, 2
- bill.title → 3, 5, 8
- bill.body → 1, 2, 5, 9
- The above scheme is still wasteful
- Each term is potentially replicated for each zone
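In code, the "qualify the term with the zone" dictionary might look like this (a toy sketch, not the lecture's implementation):

```python
# One dictionary; the zone is folded into the key,
# so "bill" is replicated once per zone.
index = {
    "bill.author": [1, 2],
    "bill.title":  [3, 5, 8],
    "bill.body":   [1, 2, 5, 9],
}

# The query "bill in Title" becomes a plain dictionary lookup.
print(index.get("bill.title", []))  # -> [3, 5, 8]
```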
22. Zone combinations index
- Yet another alternative
- Encode the zone in the postings
- At query time, merge postings and
- Match zones between the query and the postings
- Accumulate scores from matched zones
- bill → 1.author, 1.body → 2.author, 2.body → 3.title
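A hedged sketch of this second layout: one postings list per term, with the matching zones stored alongside each docid (structure assumed, mirroring the figure above).

```python
# One postings list per term; each entry carries the doc's matching zones.
index = {
    "bill":   [(1, ["author", "body"]), (2, ["author", "body"]), (3, ["title"])],
    "rights": [(3, ["title", "body"]),  (5, ["title", "body"])],
}

# "bill in title": walk bill's postings and keep docs whose zones match.
hits = [doc for doc, zones in index["bill"] if "title" in zones]
print(hits)  # -> [3]
```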
23. Score accumulation
- bill → 1.author, 1.body → 2.author, 2.body → 3.title
- rights → 3.title, 3.body → 5.title, 5.body
- As we walk the postings for the query bill OR rights, we accumulate scores for docs 1, 2, 3, and 5 in a linear merge as before.
- Note we get both bill and rights in the Title zone of doc 3, but score it no higher.
- Should we give more weight to more hits?
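A small Python sketch of the accumulation step. The postings are the ones from the slide; the zone weights are illustrative assumptions, not the lecture's values.

```python
from collections import defaultdict

# Illustrative zone weights (an assumption, not from the slide).
WEIGHTS = {"author": 0.2, "title": 0.5, "body": 0.3}

postings = {
    "bill":   [(1, ["author", "body"]), (2, ["author", "body"]), (3, ["title"])],
    "rights": [(3, ["title", "body"]),  (5, ["title", "body"])],
}

def accumulate(query_terms):
    """OR query: sum matched-zone weights into one accumulator per doc."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc, zones in postings[term]:
            scores[doc] += sum(WEIGHTS[z] for z in zones)
    return dict(scores)

print(accumulate(["bill", "rights"]))
# -> {1: 0.5, 2: 0.5, 3: 1.3, 5: 0.8}
```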
24. Where do these weights come from?
- Machine-learned relevance
- Given
- A test corpus
- A suite of test queries
- A set of relevance judgments
- Learn a set of weights such that the relevance judgments are matched
- Can be formulated as ordinal regression
- More in next week's lecture
25. Full text queries
- We just scored the Boolean query bill OR rights
- Most users are more likely to type bill rights or bill of rights (a free text query without Boolean connectives)
- Interpret such queries as AND (large collection) or OR (small collection)
- Google uses AND as the default
- Yahoo! probably uses OR: it matches docs with missing keywords
26. Full text queries
- Combining zones with free text queries, we need
- A way of assigning a score to a pair <free text query, zone>
- Zero query terms in the zone should mean a zero score
- More query terms in the zone should mean a higher score
- Scores are in [0,1]
- Will look at some alternatives now
27. Incidence matrices
- Recall: a document (or a zone in it) is a binary vector X in {0,1}^v
- A query is also such a vector
- Score: overlap measure (count of keyword hits)
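As a worked illustration (toy term sets, not from the slide), the overlap score is just the size of the intersection between the query's terms and the document's terms:

```python
def overlap(query_terms, doc_terms):
    """Overlap measure: number of query terms that appear in the doc."""
    return len(set(query_terms) & set(doc_terms))

julius_caesar = {"ides", "of", "march", "caesar"}
print(overlap({"ides", "of", "march"}, julius_caesar))  # -> 3
```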
28. Example
- On the query ides of march, Shakespeare's Julius Caesar has a score of 3
- All other Shakespeare plays have a score of 2 (because they contain march) or 1
- Thus in a rank order, Julius Caesar would come out on top
29. Overlap matching
- What's wrong with the overlap measure?
- It doesn't consider
- Term frequency in the document
- Term specificity in the collection (document frequency): "of" is more common than "ides" or "march"
- Length of documents
- Longer docs have an advantage
30. Overlap matching
- One can normalize in various ways
- Jaccard coefficient
- Cosine measure
- What documents would score best using Jaccard against a typical query?
- Does the cosine measure fix this problem?
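For reference, the two normalizations in their standard set/binary-vector form (notation added here, consistent with the overlap count above):

```latex
% Jaccard: overlap normalized by the union of the two term sets.
\mathrm{Jaccard}(q,d) = \frac{|q \cap d|}{|q \cup d|}

% Cosine (for binary vectors): overlap normalized by the vectors'
% lengths, a gentler length penalty than Jaccard's.
\mathrm{cosine}(q,d) = \frac{|q \cap d|}{\sqrt{|q|}\cdot\sqrt{|d|}}
```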
31. Scoring: density-based
- Thus far: position and overlap of terms in a doc or its zones (title, author)
- Intuitively, if a document talks about a keyword more
- The doc is more relevant
- Even when we only have a single query term
- A document is relevant if it has a lot of instances of the term
- This leads to the idea of term weighting.
32. Term weighting
33. Term-document count matrices
- Consider the number of occurrences of a term in a document
- Bag of words model
- A document is a vector in N^v (a column below)
34. Bag of words view of a doc
- Thus the doc
- John is quicker than Mary.
- is indistinguishable from the doc
- Mary is quicker than John.
- Which of the indexes discussed distinguish these two docs?
35. Counts vs. frequencies
- Consider again the ides of march query.
- Julius Caesar has 5 occurrences of ides
- No other play has ides
- march occurs in over a dozen plays
- All the plays contain of
- By a frequency-based scoring measure
- The top-scoring play is likely to be the one with the most occurrences of "of"
36. Digression: terminology
- WARNING: In a lot of IR literature, "frequency" is used to mean "count"
- Thus "term frequency" in IR literature is used to mean the number of occurrences in a doc
- Not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer
- In saying "term frequency" we mean the number of occurrences of a term in a document.
37. Term frequency tf
- Long docs are favored because they're more likely to contain query terms
- Can fix this to some extent by normalizing for document length
- But is raw tf the right measure?
38. Weighting term frequency tf
- What is the relative importance of
- 0 vs. 1 occurrence of a term in a doc?
- 1 vs. 2 occurrences?
- 2 vs. 3 occurrences?
- Unclear: while it seems that more is better, maybe not proportionally better
- Can just use raw tf
- Another option commonly used in practice:
- wf = 0 if tf = 0
- wf = 1 + log(tf) if tf > 0
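A one-function sketch of this sublinear scaling. Base-10 log is assumed here, as is common; the slide does not specify a base.

```python
import math

def wf(tf):
    """Sublinear tf scaling: 0 if the term is absent, else 1 + log10(tf)."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in [0, 1, 2, 10, 1000]:
    print(tf, round(wf(tf), 2))  # 0->0, 1->1.0, 2->1.3, 10->2.0, 1000->4.0
```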
39. Score computation
- Score for a query q: Score(q, d) = Σ_{t in q} tf_{t,d}
- Note: 0 if no query terms occur in the document
- This score can be zone-combined
- Can use wf instead of tf in the above
- Still doesn't consider term scarcity in the collection (ides is rarer than of)
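A sketch of this summation in Python (reusing the sublinear wf from the previous slide; the document counts are toy data):

```python
import math

def wf(tf):
    """Sublinear tf scaling from the previous slide (base-10 assumed)."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    """Score(q, d): sum of scaled term frequencies over the query terms."""
    return sum(wf(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"ides": 5, "of": 120, "march": 2}   # toy counts for one doc
print(round(score(["ides", "of", "march"], doc_tf), 2))  # -> 6.08
```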
40. Weighting should depend on the term overall
- Which of these tells you more about a doc?
- 10 occurrences of hernia?
- 10 occurrences of the?
- Would like to lower the weight of a common term
- But what is "common"?
- Suggestion: look at collection frequency (cf)
- The total number of occurrences of the term in the entire collection of documents
41. Document frequency
- But document frequency (df) may be better
- df = number of docs in the corpus containing the term
- Word: cf, df
- ferrari: cf = 10422, df = 17
- insurance: cf = 10440, df = 3997
- Document/collection frequency weighting is only possible in a known (static) collection.
- So how do we make use of df?
42. tf x idf term weights
- The tf x idf measure combines
- term frequency (tf)
- or wf, some measure of term density in a doc
- inverse document frequency (idf)
- a measure of the informativeness of a term: its rarity across the whole corpus
- could just be the inverse of the raw count of documents the term occurs in (idf_i = 1/df_i)
- but by far the most commonly used version is idf_i = log(N / df_i), where N is the number of docs in the corpus
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification
43. Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d: w_{i,d} = tf_{i,d} × log(N / df_i)
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs?
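A compact sketch of the weighting, and of the answer to the last question, assuming the canonical log formulation above (corpus size N is a toy value):

```python
import math

def tf_idf(tf, df, N):
    """tf.idf weight: term count scaled by log of inverse document frequency."""
    return tf * math.log10(N / df)

N = 1000  # toy corpus size (an assumption)
print(round(tf_idf(5, 17, N), 2))   # rare term: high weight (~8.85)
print(tf_idf(5, 1000, N))           # term in ALL docs: idf = log(1) = 0, weight 0
```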
44. Real-valued term-document matrices
- Function (scaling) of the count of a word in a document
- Bag of words model
- Each doc is a vector in R^v
- Here: log-scaled tf.idf
- Note: entries can be > 1!
45. Documents as vectors
- Each doc j can now be viewed as a vector of wf × idf values, one component for each term
- So we have a vector space
- terms are axes
- docs live in this space
- may have 20,000 dimensions (even with stemming)
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)
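A minimal illustration of this vector view (tiny vocabulary; the document frequencies and corpus size are toy assumptions, and the tf.idf formulation is the canonical one from slide 43):

```python
import math

vocab = ["bill", "rights", "march"]             # the axes of the space
df = {"bill": 300, "rights": 100, "march": 20}  # toy document frequencies
N = 1000                                        # toy corpus size

def doc_vector(term_counts):
    """Map a doc's term counts to a tf.idf vector, one component per axis."""
    return [term_counts.get(t, 0) * math.log10(N / df[t]) for t in vocab]

print(doc_vector({"bill": 2, "march": 1}))  # -> [~1.05, 0.0, ~1.70]
```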
46. Recap
- We began by looking at zones in scoring
- Ended up viewing documents as vectors in a vector space
- We will pursue this view next time.