Title: Term and Document Frequency
1. CS 6633 Information Retrieval and Web Search
- Lecture 5
- Term and Document Frequency
- Based on ppt files by Hinrich Schütze
2. This lecture
- Parametric and field searches
- Zones in documents
- Scoring documents: zone weighting
- Index support for scoring
- Term weighting
3. Parametric search
- Documents have text (data) and metadata (data about data)
- Metadata: a set of field-value pairs
- Examples
- Language = French
- Format = pdf
- Subject = Physics etc.
- Date = Feb 2000
- Parametric search interface
- Combine a full-text query with selection of values for some fields
4. Parametric search example
Fixed field values or range
5. Parametric search example
Full-text search
7. Parametric/field search
- In these examples, we select field values
- Values can be hierarchical, e.g.,
- Geography: Continent → Country → State → City
- Domain: tw, edu.tw, nthu.edu.tw
- Use field/value to navigate through the document collection, e.g., Aerospace companies in Brazil
- Select Geography = Brazil
- Select Line of Business = Aerospace
- Use fields to filter docs, then run text searches to narrow down
8. Index support for field search
- Must be able to support queries of the form
- Find pdf documents that contain "stanford university"
- A field selection (on doc format) and a phrase query
- Field selection: use an inverted index of field values → docids
- Organized by field name
- Use compression
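A minimal Python sketch of this idea (the toy data and names are illustrative, not from the lecture): one inverted index per field maps each value to a sorted docid list, which can then be intersected with the postings of a text query. Phrase matching is omitted here; only co-occurrence is checked.

```python
# Toy field index: field name -> value -> sorted list of docids.
field_index = {
    "format": {"pdf": [1, 3, 7], "html": [2, 5]},
    "language": {"French": [3, 5], "English": [1, 2, 7]},
}

# Toy text index: term -> sorted list of docids.
text_index = {"stanford": [1, 2, 3], "university": [3, 7]}

def intersect(p1, p2):
    """Linear merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# "Find pdf documents that contain stanford AND university"
result = intersect(intersect(text_index["stanford"], text_index["university"]),
                   field_index["format"]["pdf"])
print(result)  # -> [3]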
9. Parametric index support
- Optional features
- Wildcards (e.g., Author = s*trup)
- Ranges (e.g., Date between 2003 and 2005)
- Use a B-tree
- Use query optimization heuristics
- Process the part that has a small df first
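One hedged illustration of the range feature: sorted field values with binary search stand in for the real B-tree here, but the lookup pattern is the same. All data is toy data.

```python
import bisect

# Sorted (value, docid) pairs play the role of B-tree leaves;
# a real engine would use an actual B-tree over field values.
date_index = sorted([(2002, 4), (2003, 1), (2004, 7), (2004, 2), (2006, 5)])

def range_query(index, lo, hi):
    """Return docids whose field value v satisfies lo <= v <= hi."""
    left = bisect.bisect_left(index, (lo, -1))
    right = bisect.bisect_right(index, (hi, float("inf")))
    return sorted(doc for _, doc in index[left:right])

print(range_query(date_index, 2003, 2005))  # -> [1, 2, 7]
```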
10. Zones
- A zone is an identified region within a doc
- E.g., Title, Abstract, Bibliography
- Generally culled from marked-up input or document metadata (e.g., powerpoint)
- Zones contain free text
- Not a database field with a finite vocabulary
- Indexes for each zone allow queries like
- sorting in Title AND smith in Bibliography AND recur in Body
- Not queries like "all papers whose authors cite themselves". Why?
11One index per zone
etc.
Author
Body
Title
12. Comparing information retrieval and database query
- Databases do lots of things we don't need
- Transactions
- Recovery (our index is not the system of record; if it breaks, simply reconstruct it from the original source)
- Indeed, we never have to store text in a search engine, only indexes
- In information retrieval, we focus on
- Optimizing indexes for text-oriented queries
- Not SQL commands (matching field with value)
13. Scoring and Ranking
14. Scoring
- The nature of Boolean queries
- Docs either match or not: score of 0 or 1
- Do people like Boolean queries?
- Experts with a good understanding of their needs and the doc collection can use Boolean queries effectively
- Difficult for casual users
- Good for small collections
- For large collections (the Web), search results can be thousands of documents
- Difficult to go through thousands of results
15. Scoring
- We wish to return ranked results where the most likely documents are placed at the top
- How can we rank-order the docs in the corpus with respect to a query?
- Give each document a score in [0,1] for the query
- Assume a perfect world without (keyword) spammers
- No stuffing keywords into a doc to make it match queries
- Will talk about adversarial IR (in the Web Search part)
16. Linear zone combinations
- First generation of scoring methods
- Use a linear combination of Booleans, e.g.,
- Score = 0.6·<sorting in Title> + 0.3·<sorting in Abstract> + 0.05·<sorting in Body> + 0.05·<sorting in Boldface>
- Each expression (e.g., <sorting in Title>) is given a value in {0,1}, so the overall score is in [0,1]
- AND queries will still return docs when only one keyword matches something (like an OR query)
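A minimal Python sketch of this linear combination (the weights and zone names are the slide's; the matching test is a simple stand-in for the Boolean zone check):

```python
# Zone weights from the slide; they sum to 1 so the score stays in [0, 1].
ZONE_WEIGHTS = {"title": 0.6, "abstract": 0.3, "body": 0.05, "boldface": 0.05}

def zone_score(term, doc):
    """Linear combination of Boolean zone matches for one term."""
    score = 0.0
    for zone, weight in ZONE_WEIGHTS.items():
        # 1 if the term occurs in this zone of the doc, else 0.
        if term in doc.get(zone, "").lower().split():
            score += weight
    return score

doc = {"title": "sorting algorithms", "abstract": "we study sorting", "body": "..."}
print(round(zone_score("sorting", doc), 2))  # -> 0.9 (title + abstract)
```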
17. Linear zone combinations
- How to generate weights such as 0.6?
- The user?
- The IR system?
- Mathematical models for weights and ranking
- Term frequency
- Document frequency
18. Exercise
[Figure: one postings index per zone]
- Author: bill → 1, 2; rights → (none shown)
- Title: bill → 3, 5, 8; rights → 3, 5, 9
- Body: bill → 1, 2, 5, 9; rights → 3, 5, 8, 9
19. Combining Boolean and Ranking
- Perform Boolean query processing (AND query)
- Process keywords in the query in order of df (smallest first)
- Merge posting lists for the keywords
- Stop when we have more docs than necessary
- Instead of score 1 for all docs, give each doc a new score for ranking (see the sketch below)
- A keyword with small df is more important
- A doc with high tf is more relevant
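A hedged sketch of the AND processing order (toy postings; here df of a term is simply the length of its postings list):

```python
# Toy postings lists: term -> sorted docids. df(term) = len(postings[term]).
postings = {
    "bill":   [1, 2, 5, 9],
    "rights": [3, 5, 8, 9],
    "new":    [1, 3, 5, 7, 8, 9],
}

def and_query(terms):
    """Intersect postings, starting from the rarest term (smallest df)."""
    terms = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[terms[0]]
    for t in terms[1:]:
        result = [d for d in result if d in postings[t]]  # simple merge
    return result

print(and_query(["new", "bill", "rights"]))  # -> [5, 9]
```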
20. General idea
- Assign a score to each doc/keyword
- Given a weight vector with a weight for each zone/field
- Combine the weights of keywords and zones/fields
- Present the top K highest-scoring docs
- K = 10, 20, 50, 100
21. Index support for zone combinations
- One index per zone
- Alternative: one single index
- Qualify each term with its zone in the dictionary
- E.g.:
- bill.author → 1, 2
- bill.title → 3, 5, 8
- bill.body → 1, 2, 5, 9
- The above scheme is still wasteful
- Each term is potentially replicated for each zone
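In code, the "qualify the term with the zone" dictionary might look like this (a toy sketch, not the lecture's implementation):

```python
# One dictionary; the zone is folded into the key,
# so "bill" is replicated once per zone.
index = {
    "bill.author": [1, 2],
    "bill.title":  [3, 5, 8],
    "bill.body":   [1, 2, 5, 9],
}

# The query "bill in Title" becomes a plain dictionary lookup.
print(index.get("bill.title", []))  # -> [3, 5, 8]
```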
22. Zone combinations index
- Yet another alternative
- Encode the zone in the postings
- At query time, merge postings and
- Match zones between the query and the postings
- Accumulate scores from matched zones
- bill → 1.author, 1.body → 2.author, 2.body → 3.title
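A hedged sketch of this second layout: one postings list per term, with the matching zones stored alongside each docid (structure assumed, mirroring the figure above).

```python
# One postings list per term; each entry carries the doc's matching zones.
index = {
    "bill":   [(1, ["author", "body"]), (2, ["author", "body"]), (3, ["title"])],
    "rights": [(3, ["title", "body"]),  (5, ["title", "body"])],
}

# "bill in title": walk bill's postings and keep docs whose zones match.
hits = [doc for doc, zones in index["bill"] if "title" in zones]
print(hits)  # -> [3]
```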
23. Score accumulation
- bill → 1.author, 1.body → 2.author, 2.body → 3.title
- rights → 3.title, 3.body → 5.title, 5.body
- As we walk the postings for the query bill OR rights, we accumulate scores for docs 1, 2, 3, and 5 in a linear merge as before.
- Note we get both bill and rights in the Title zone of doc 3, but score it no higher.
- Should we give more weight to more hits?
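A small Python sketch of the accumulation step. The postings are the ones from the slide; the zone weights are illustrative assumptions, not the lecture's values.

```python
from collections import defaultdict

# Illustrative zone weights (an assumption, not from the slide).
WEIGHTS = {"author": 0.2, "title": 0.5, "body": 0.3}

postings = {
    "bill":   [(1, ["author", "body"]), (2, ["author", "body"]), (3, ["title"])],
    "rights": [(3, ["title", "body"]),  (5, ["title", "body"])],
}

def accumulate(query_terms):
    """OR query: sum matched-zone weights into one accumulator per doc."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc, zones in postings[term]:
            scores[doc] += sum(WEIGHTS[z] for z in zones)
    return dict(scores)

print(accumulate(["bill", "rights"]))
# -> {1: 0.5, 2: 0.5, 3: 1.3, 5: 0.8}
```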
24. Where do these weights come from?
- Machine-learned relevance
- Given
- A test corpus
- A suite of test queries
- A set of relevance judgments
- Learn a set of weights such that the relevance judgments are matched
- Can be formulated as ordinal regression
- More in next week's lecture
25. Full text queries
- We just scored the Boolean query bill OR rights
- Most users are more likely to type bill rights or bill of rights (a free text query without Boolean connectives)
- Interpret such queries as AND (large collection) or OR (small collection)
- Google uses AND as the default
- Yahoo! probably uses OR: it matches docs with missing keywords
26. Full text queries
- Combining zones with free text queries, we need
- A way of assigning a score to a pair <free text query, zone>
- Zero query terms in the zone should mean a zero score
- More query terms in the zone should mean a higher score
- Scores are in [0,1]
- Will look at some alternatives now
27. Incidence matrices
- Recall: a document (or a zone in it) is a binary vector X in {0,1}^v
- A query is also such a vector
- Score: overlap measure (count of keyword hits)
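As a worked illustration (toy term sets, not from the slide), the overlap score is just the size of the intersection between the query's terms and the document's terms:

```python
def overlap(query_terms, doc_terms):
    """Overlap measure: number of query terms that appear in the doc."""
    return len(set(query_terms) & set(doc_terms))

julius_caesar = {"ides", "of", "march", "caesar"}
print(overlap({"ides", "of", "march"}, julius_caesar))  # -> 3
```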
28. Example
- On the query ides of march, Shakespeare's Julius Caesar has a score of 3
- All other Shakespeare plays have a score of 2 (because they contain march) or 1
- Thus in a rank order, Julius Caesar would come out on top
29. Overlap matching
- What's wrong with the overlap measure?
- It doesn't consider
- Term frequency in the document
- Term specificity in the collection (document frequency): "of" is more common than "ides" or "march"
- Length of documents
- Longer docs have an advantage
30. Overlap matching
- One can normalize in various ways
- Jaccard coefficient
- Cosine measure
- What documents would score best using Jaccard against a typical query?
- Does the cosine measure fix this problem?
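For reference, the two normalizations in their standard set/binary-vector form (notation added here, consistent with the overlap count above):

```latex
% Jaccard: overlap normalized by the union of the two term sets.
\mathrm{Jaccard}(q,d) = \frac{|q \cap d|}{|q \cup d|}

% Cosine (for binary vectors): overlap normalized by the vectors'
% lengths, a gentler length penalty than Jaccard's.
\mathrm{cosine}(q,d) = \frac{|q \cap d|}{\sqrt{|q|}\cdot\sqrt{|d|}}
```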
31. Scoring: density-based
- Thus far: position and overlap of terms in a doc or its zones (title, author)
- Intuitively, if a document talks about a keyword more
- The doc is more relevant
- Even when we only have a single query term
- A document is relevant if it has a lot of instances of the term
- This leads to the idea of term weighting.
32. Term weighting
33. Term-document count matrices
- Consider the number of occurrences of a term in a document
- Bag of words model
- A document is a vector in N^v (a column below)
34. Bag of words view of a doc
- Thus the doc
- John is quicker than Mary.
- is indistinguishable from the doc
- Mary is quicker than John.
- Which of the indexes discussed distinguish these two docs?
35. Counts vs. frequencies
- Consider again the ides of march query.
- Julius Caesar has 5 occurrences of ides
- No other play has ides
- march occurs in over a dozen plays
- All the plays contain of
- By a frequency-based scoring measure
- The top-scoring play is likely to be the one with the most occurrences of "of"
36. Digression: terminology
- WARNING: In a lot of IR literature, "frequency" is used to mean "count"
- Thus "term frequency" in IR literature is used to mean the number of occurrences in a doc
- Not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer
- In saying "term frequency" we mean the number of occurrences of a term in a document.
37. Term frequency tf
- Long docs are favored because they're more likely to contain query terms
- Can fix this to some extent by normalizing for document length
- But is raw tf the right measure?
38. Weighting term frequency tf
- What is the relative importance of
- 0 vs. 1 occurrence of a term in a doc?
- 1 vs. 2 occurrences?
- 2 vs. 3 occurrences?
- Unclear: while it seems that more is better, maybe not proportionally better
- Can just use raw tf
- Another option commonly used in practice:
- wf = 0 if tf = 0
- wf = 1 + log(tf) if tf > 0
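A one-function sketch of this sublinear scaling. Base-10 log is assumed here, as is common; the slide does not specify a base.

```python
import math

def wf(tf):
    """Sublinear tf scaling: 0 if the term is absent, else 1 + log10(tf)."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in [0, 1, 2, 10, 1000]:
    print(tf, round(wf(tf), 2))  # 0->0, 1->1.0, 2->1.3, 10->2.0, 1000->4.0
```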
39. Score computation
- Score for a query q: Score(q, d) = Σ_{t in q} tf_{t,d}
- Note: 0 if no query terms occur in the document
- This score can be zone-combined
- Can use wf instead of tf in the above
- Still doesn't consider term scarcity in the collection (ides is rarer than of)
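A sketch of this summation in Python (reusing the sublinear wf from the previous slide; the document counts are toy data):

```python
import math

def wf(tf):
    """Sublinear tf scaling from the previous slide (base-10 assumed)."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    """Score(q, d): sum of scaled term frequencies over the query terms."""
    return sum(wf(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"ides": 5, "of": 120, "march": 2}   # toy counts for one doc
print(round(score(["ides", "of", "march"], doc_tf), 2))  # -> 6.08
```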
40. Weighting should depend on the term overall
- Which of these tells you more about a doc?
- 10 occurrences of hernia?
- 10 occurrences of the?
- Would like to lower the weight of a common term
- But what is "common"?
- Suggestion: look at collection frequency (cf)
- The total number of occurrences of the term in the entire collection of documents
41. Document frequency
- But document frequency (df) may be better
- df = number of docs in the corpus containing the term
- Word: cf, df
- ferrari: cf = 10422, df = 17
- insurance: cf = 10440, df = 3997
- Document/collection frequency weighting is only possible in a known (static) collection.
- So how do we make use of df?
42. tf x idf term weights
- The tf x idf measure combines
- term frequency (tf)
- or wf, some measure of term density in a doc
- inverse document frequency (idf)
- a measure of the informativeness of a term: its rarity across the whole corpus
- could just be the inverse of the raw count of documents the term occurs in (idf_i = 1/df_i)
- but by far the most commonly used version is idf_i = log(N / df_i), where N is the number of docs in the corpus
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification
43. Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d: w_{i,d} = tf_{i,d} × log(N / df_i)
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs?
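A compact sketch of the weighting, and of the answer to the last question, assuming the canonical log formulation above (corpus size N is a toy value):

```python
import math

def tf_idf(tf, df, N):
    """tf.idf weight: term count scaled by log of inverse document frequency."""
    return tf * math.log10(N / df)

N = 1000  # toy corpus size (an assumption)
print(round(tf_idf(5, 17, N), 2))   # rare term: high weight (~8.85)
print(tf_idf(5, 1000, N))           # term in ALL docs: idf = log(1) = 0, weight 0
```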
44. Real-valued term-document matrices
- Function (scaling) of the count of a word in a document
- Bag of words model
- Each doc is a vector in R^v
- Here: log-scaled tf.idf
- Note: entries can be > 1!
45. Documents as vectors
- Each doc j can now be viewed as a vector of wf × idf values, one component for each term
- So we have a vector space
- terms are axes
- docs live in this space
- may have 20,000 dimensions (even with stemming)
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)
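A minimal illustration of this vector view (tiny vocabulary; the document frequencies and corpus size are toy assumptions, and the tf.idf formulation is the canonical one from slide 43):

```python
import math

vocab = ["bill", "rights", "march"]             # the axes of the space
df = {"bill": 300, "rights": 100, "march": 20}  # toy document frequencies
N = 1000                                        # toy corpus size

def doc_vector(term_counts):
    """Map a doc's term counts to a tf.idf vector, one component per axis."""
    return [term_counts.get(t, 0) * math.log10(N / df[t]) for t in vocab]

print(doc_vector({"bill": 2, "march": 1}))  # -> [~1.05, 0.0, ~1.70]
```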
46. Recap
- We began by looking at zones in scoring
- Ended up viewing documents as vectors in a vector space
- We will pursue this view next time.