Title: Term and Document Frequency
CS 6633 Information Retrieval and Web Search
- Lecture 5
- Term and Document Frequency
- Based on ppt files by Hinrich Schütze
This lecture
- Parametric and field searches 
- Zones in documents 
- Scoring documents: zone weighting
- Index support for scoring 
- Term weighting
Parametric search
- Documents have text (data) and metadata (data 
 about data)
- Metadata: a set of field-value pairs
- Examples
- Language = French
- Format = pdf
- Subject = Physics, etc.
- Date = Feb 2000
- Parametric search interface
- Combine a full-text query with a selection of values for some fields
Parametric search example
- Fixed field values or ranges [screenshot]
Parametric search example
- Full-text search [screenshot]
Parametric/field search
- In these examples, we select field values 
- Values can be hierarchical, e.g., 
- Geography: Continent → Country → State → City
- Domain: tw → edu.tw → nthu.edu.tw
- Use field/value to navigate through the document 
 collection, e.g.,
- Aerospace companies in Brazil 
- Select Geography = Brazil
- Select Line of Business = Aerospace
- Use fields to filter docs and run text searches 
 to narrow down
Index support for field search
- Must be able to support queries of the form 
- Find pdf documents that contain "stanford university"
- A field selection (on doc format) and a phrase 
 query
- Field selection: use an inverted index of field values → docIDs (see the sketch below)
- Organized by field name 
- Use compression
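A minimal sketch of such a field index in Python; the postings and the names FIELD_INDEX and TEXT_INDEX are invented for illustration, and the phrase query is approximated by a conjunction of terms since positions are omitted here.

```python
# Field index: field name -> value -> docIDs; text index: term -> docIDs.
# Toy data; a real phrase query would also check term positions.
FIELD_INDEX = {
    "format": {"pdf": {1, 3, 7}, "html": {2, 5}},
    "language": {"French": {3, 5}},
}
TEXT_INDEX = {"stanford": {1, 2, 7}, "university": {1, 7}}

def field_and_text(field, value, terms):
    """Docs whose field equals value and that contain all query terms."""
    docs = set(FIELD_INDEX.get(field, {}).get(value, set()))
    for t in terms:
        docs &= TEXT_INDEX.get(t, set())
    return docs

# pdf documents containing "stanford university" (as an AND of terms)
print(field_and_text("format", "pdf", ["stanford", "university"]))  # {1, 7}
```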
Parametric index support
- Optional features 
- Wildcards (Author = s*trup)
- Ranges (Date between 2003 and 2005; sketch below)
- Use a B-tree
- Use query optimization heuristics
- Process the part that has a small df first
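A hedged sketch of the range lookup: a sorted list of (value, docID) pairs stands in for the B-tree, and bisect gives logarithmic lookups; the dates and docIDs are invented.

```python
import bisect

# (value, docID) pairs kept sorted by value, as a B-tree scan would yield
DATE_INDEX = sorted([(2001, 4), (2003, 1), (2004, 7), (2004, 2), (2005, 9), (2007, 3)])

def date_range(lo, hi):
    """docIDs whose Date field falls in [lo, hi]."""
    left = bisect.bisect_left(DATE_INDEX, (lo, -1))
    right = bisect.bisect_right(DATE_INDEX, (hi, float("inf")))
    return {doc for _, doc in DATE_INDEX[left:right]}

print(date_range(2003, 2005))  # {1, 2, 7, 9}
```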
Zones
- A zone is an identified region within a doc 
- E.g., Title, Abstract, Bibliography 
- Generally culled from marked-up input or document metadata (e.g., PowerPoint)
- Zones contain free text
- Not a database field with a finite vocabulary
- Indexes for each zone allow queries like
- sorting in Title AND smith in Bibliography AND recur in Body
- Not queries like "all papers whose authors cite themselves". Why not?
One index per zone
[Diagram: separate inverted indexes, one per zone: Author, Title, Body, etc.]
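A minimal sketch of the one-index-per-zone layout, with invented postings; a zone-restricted Boolean query becomes an intersection across the relevant indexes.

```python
# One inverted index (term -> docIDs) per zone; data is illustrative.
ZONE_INDEX = {
    "title":        {"sorting": {2, 4}},
    "bibliography": {"smith": {2, 7}},
    "body":         {"recur": {2, 4, 7}},
}

def in_zone(term, zone):
    return ZONE_INDEX.get(zone, {}).get(term, set())

# sorting in Title AND smith in Bibliography AND recur in Body
print(in_zone("sorting", "title")
      & in_zone("smith", "bibliography")
      & in_zone("recur", "body"))  # {2}
```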
Comparing information retrieval and database queries
- Databases do lots of things we don't need
- Transactions
- Recovery (our index is not the system of record; if it breaks, simply reconstruct it from the original source)
- Indeed, we never have to store the text in a search engine, only the indexes
- In information retrieval, we focus on
- Optimizing indexes for text-oriented queries
- Not SQL commands (matching a field with a value)
Scoring and Ranking
Scoring
- The nature of Boolean queries 
- Docs either match or not: score of 0 or 1
- Do people like Boolean queries?
- Experts with good understanding of needs and the 
 doc collection can use Boolean query effectively
- Difficult for casual users 
- Good for small collections 
- For large collections (the Web), search results 
 can be thousands of documents
- Difficult to go through thousands of results
Scoring
- We wish to return ranked results where the documents most likely to be relevant are placed at the top
- How can we rank order the docs in the corpus with 
 respect to a query?
- Give each document a score in [0, 1] for the query
- Assume a perfect world without (keyword) spammers
- No stuffing keywords into a doc to make it match queries
- We will talk about adversarial IR in the Web Search part
Linear zone combinations
- First generation of scoring methods 
- Use a linear combination of Booleans, e.g.,
- Score = 0.6·⟨sorting in Title⟩ + 0.3·⟨sorting in Abstract⟩ + 0.05·⟨sorting in Body⟩ + 0.05·⟨sorting in Boldface⟩
- Each expression (e.g., ⟨sorting in Title⟩) takes a value in {0, 1} → the overall score is in [0, 1]
- AND queries will still return docs where only one keyword matches something (so this behaves like an OR query)
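A minimal sketch of this linear combination, assuming the weights above; the toy document and the word-membership test are simplifications, not the slides' implementation.

```python
# Zone weights from the slide; they sum to 1, so the score stays in [0, 1].
WEIGHTS = {"title": 0.6, "abstract": 0.3, "body": 0.05, "boldface": 0.05}

def zone_score(doc_zones, term):
    """Sum the weights of the zones in which the term occurs."""
    return sum(w for zone, w in WEIGHTS.items()
               if term in doc_zones.get(zone, "").split())

doc = {"title": "sorting algorithms", "body": "we discuss sorting"}
print(zone_score(doc, "sorting"))  # 0.6 (title) + 0.05 (body) = ~0.65
```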
Linear zone combinations
- How to generate weights such as 0.6?
- The user?
- The IR system?
- Mathematical model for weights and ranking 
- Term frequency 
- Document frequency
Exercise
Per-zone postings for "bill" and "rights":
- Author: bill → 1 → 2; rights → (no postings)
- Title: bill → 3 → 5 → 8; rights → 3 → 5 → 9
- Body: bill → 1 → 2 → 5 → 9; rights → 3 → 5 → 8 → 9
Combining Boolean and Ranking
- Perform Boolean query processing (AND query) 
- Process keywords in order of increasing df (smallest first)
- Merge posting lists for the keywords
- Stop once we have as many docs as we need
- Instead of giving every matching doc the score 1, give each doc a new score for ranking
- Keyword with small df is more important 
- Doc with high tf is more relevant 
General idea
- Assign a score to each doc/keyword pair
- Given a weight vector with a weight for each zone/field
- Combine the weights of keywords and zones/fields
- Present the top K highest-scoring docs (sketch below)
- K = 10, 20, 50, 100
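A small sketch of the top-K step; heapq.nlargest selects in O(n log K) instead of sorting all scores. The score values are invented.

```python
import heapq

scores = {1: 0.65, 2: 0.10, 3: 0.90, 5: 0.35, 9: 0.72}  # docID -> score

def top_k(scores, k):
    # Partial selection of the K best, avoiding a full sort
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(top_k(scores, 3))  # [(3, 0.9), (9, 0.72), (1, 0.65)]
```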
Index support for zone combinations
- One index per zone 
- Alternative 
- one single index 
- Qualify term with zone in dictionary 
- E.g.,
bill.author → 1 → 2
bill.title → 3 → 5 → 8
bill.body → 1 → 2 → 5 → 9
- The above scheme is still wasteful
- Each term is potentially replicated for each zone
Zone combinations index
- Yet another alternative 
- Encode the zone in the postings as numbers 
- At query time, merge postings and 
- Match zone in query and postings 
- Accumulate from matched zones 
bill → 1.author, 1.body → 2.author, 2.body → 3.title
Score accumulation
bill → 1.author, 1.body → 2.author, 2.body → 3.title
rights → 3.title, 3.body → 5.title, 5.body
- As we walk the postings for the query "bill OR rights", we accumulate scores for docs 1, 2, 3, and 5 in a linear merge as before.
- Note we get both "bill" and "rights" in the Title zone of doc 3, but score it no higher.
- Should we give more weight to more hits?
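A hedged sketch of the accumulation over zone-encoded postings; the zone weights are assumptions (the author weight in particular is invented), and multi-term scores can exceed 1 unless normalized.

```python
# Zone-encoded postings: term -> list of (docID, zones where the term occurs)
POSTINGS = {
    "bill":   [(1, ["author", "body"]), (2, ["author", "body"]), (3, ["title"])],
    "rights": [(3, ["title", "body"]), (5, ["title", "body"])],
}
WEIGHTS = {"author": 0.3, "title": 0.6, "body": 0.05}  # assumed weights

def accumulate(query_terms):
    scores = {}
    for t in query_terms:
        for doc, zones in POSTINGS.get(t, []):
            scores[doc] = scores.get(doc, 0.0) + sum(WEIGHTS[z] for z in zones)
    return scores

print(accumulate(["bill", "rights"]))  # {1: 0.35, 2: 0.35, 3: 1.25, 5: 0.65}
```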
Where do these weights come from?
- Machine learned relevance 
- Given 
- A test corpus 
- A suite of test queries 
- A set of relevance judgments 
- Learn a set of weights such that the relevance judgments are matched
- Can be formulated as ordinal regression
- More in next week's lecture
Full text queries
- We just scored the Boolean query "bill OR rights"
- Most users are more likely to type "bill rights" or "bill of rights" (a free-text query without Boolean connectives)
- Interpret such queries as AND (large collection) or OR (small collection)
- Google uses AND as the default
- Yahoo! probably OR: it matches docs with some keywords missing
Full text queries
- Combining zones with free-text queries, we need
- A way of assigning a score to a pair ⟨free-text query, zone⟩
- Zero query terms in the zone should mean a zero 
 score
- More query terms in the zone should mean a higher 
 score
- Scores are in [0, 1]
- Will look at some alternatives now
Incidence matrices
- Recall: a document (or a zone in it) is a binary vector X in {0, 1}^|V|
- The query is a vector, too
- Score: overlap measure (count of query terms present in the doc)
Example
- On the query "ides of march", Shakespeare's Julius Caesar has a score of 3
- All other Shakespeare plays have a score of 2 (because they contain "march") or 1
- Thus in a rank order, Julius Caesar would come out on top
Overlap matching
- What's wrong with the overlap measure?
- It doesn't consider
- Term frequency in the document
- Term specificity in the collection (document frequency): "of" is more common than "ides" or "march"
- Length of documents
- Longer docs have an advantage
30Overlap matching
- One can normalize in various ways 
- Jaccard coefficient
- Cosine measure (both sketched below)
- What documents would score best using Jaccard 
 against a typical query?
- Does the cosine measure fix this problem? 
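A small sketch of the two normalizations on binary (set) vectors, using the standard definitions Jaccard(Q, D) = |Q ∩ D| / |Q ∪ D| and cosine(Q, D) = |Q ∩ D| / sqrt(|Q|·|D|); the toy documents show why Jaccard favors docs that contain little beyond the query, and why cosine penalizes length more gently.

```python
from math import sqrt

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine_binary(q, d):
    return len(q & d) / sqrt(len(q) * len(d))

q = {"ides", "of", "march"}
short_doc = {"ides", "of", "march"}                      # exactly the query
long_doc = {"ides", "of", "march"} | set("abcdefghij")   # query plus 10 extra terms

print(jaccard(q, short_doc), jaccard(q, long_doc))              # 1.0 vs ~0.23
print(cosine_binary(q, short_doc), cosine_binary(q, long_doc))  # 1.0 vs ~0.48
```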
Scoring: density-based
- Thus far: position and overlap of terms in a doc or its zones (title, author)
- Intuitively, if a document talks about a keyword more
- the doc is more relevant
- Even when we only have a single query term 
- Document relevant if it has a lot of instances 
- This leads to the idea of term weighting.
Term weighting
Term-document count matrices
- Consider the number of occurrences of a term in a 
 document
- Bag of words model 
- A document is a vector in ℕ^|V| (a column below)
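A minimal sketch of such count vectors with collections.Counter; the two toy docs anticipate the next slide.

```python
from collections import Counter

docs = ["John is quicker than Mary", "Mary is quicker than John"]
vectors = [Counter(d.lower().split()) for d in docs]

print(vectors[0])                # counts per term (the doc's column)
print(vectors[0] == vectors[1])  # True: the bag of words loses word order
```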
Bag of words view of a doc
- Thus the doc 
- John is quicker than Mary. 
- is indistinguishable from the doc 
- Mary is quicker than John. 
- Which of the indexes discussed distinguish these 
 two docs?
Counts vs. frequencies
- Consider again the "ides of march" query
- Julius Caesar has 5 occurrences of "ides"
- No other play has "ides"
- "march" occurs in over a dozen plays
- All the plays contain "of"
- By a frequency-based scoring measure
- the top-scoring play is likely to be the one with the most occurrences of "of"
Digression: terminology
- WARNING: In a lot of IR literature, "frequency" is used to mean "count"
- Thus "term frequency" in IR literature is used to mean the number of occurrences of a term in a doc
- Not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer 
- In saying term frequency we mean the number of 
 occurrences of a term in a document.
Term frequency tf
- Long docs are favored because they're more likely to contain query terms
- Can fix this to some extent by normalizing for 
 document length
- But is raw tf the right measure?
Weighting term frequency tf
- What is the relative importance of 
- 0 vs. 1 occurrence of a term in a doc 
- 1 vs. 2 occurrences 
- 2 vs. 3 occurrences  
- Unclear: while it seems that more is better, it may not be proportionally better
- Can just use raw tf
- Another option commonly used in practice:
- wf = 0 if tf = 0
- wf = 1 + log tf if tf > 0
Score computation
- Score for a query q: sum over the terms t in q, Score(q, d) = Σ_{t ∈ q} tf_{t,d}
- Note: the score is 0 if no query terms occur in the document
- This score can be zone-combined 
- Can use wf instead of tf in the above 
- Still doesn't consider term scarcity in the collection ("ides" is rarer than "of")
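A sketch of this score using the wf variant from the previous slide; the base-10 logs and the toy counts are assumptions.

```python
from math import log10

def wf(tf):
    return 1 + log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """doc_tf maps term -> raw count in the document."""
    return sum(wf(doc_tf.get(t, 0)) for t in query_terms)

doc = {"ides": 5, "of": 120, "march": 2}
print(score(["ides", "of", "march"], doc))  # ~1.70 + ~3.08 + ~1.30 = ~6.08
```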
Weighting should depend on the term overall
- Which of these tells you more about a doc? 
- 10 occurrences of hernia? 
- 10 occurrences of the? 
- Would like to lower the weight of a common term 
- But what is common? 
- Suggest looking at collection frequency (cf ) 
- The total number of occurrences of the term in 
 the entire collection of documents
Document frequency
- But document frequency (df ) may be better 
- df = number of docs in the corpus containing the term
-   Word        cf       df
-   ferrari     10422    17
-   insurance   10440    3997
- Document/collection frequency weighting is only possible in a known (static) collection
- So how do we make use of df?
tf x idf term weights
- The tf x idf measure combines
- term frequency (tf ) 
- or wf, some measure of term density in a doc 
- inverse document frequency (idf ) 
- a measure of the informativeness of a term: its rarity across the whole corpus
- could just be the inverse of the raw count of documents the term occurs in (idf_i = 1/df_i)
- but by far the most commonly used version is idf_i = log(N / df_i), with N the total number of docs
- See Kishore Papineni, NAACL 2, 2002 for 
 theoretical justification
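Worked numbers for the df table above, assuming a hypothetical corpus of N = 1,000,000 documents and base-10 logs:

```python
from math import log10

N = 1_000_000
for word, df in [("ferrari", 17), ("insurance", 3997)]:
    print(word, round(log10(N / df), 2))
# ferrari 4.77   (rare term: high idf)
# insurance 2.4  (common term: lower idf)
```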
Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d: w_{i,d} = tf_{i,d} × log(N / df_i)
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs?
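A sketch of the weight, which also answers the question: a term occurring in all N docs gets idf = log(N/N) = 0, hence weight 0. The counts are toy values.

```python
from math import log10

def tfidf(tf, df, N):
    return tf * log10(N / df)

N = 1_000_000
print(tfidf(3, 17, N))   # ~14.31: frequent in the doc, rare in the corpus
print(tfidf(10, N, N))   # 0.0: the term occurs in every doc
```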
Real-valued term-document matrices
- A function (scaling) of the count of a word in a document
- Bag of words model
- Each doc is a vector in ℝ^|V|
- Here: log-scaled tf.idf
- Note: entries can be > 1!
Documents as vectors
- Each doc j can now be viewed as a vector of wf × idf values, one component for each term
- So we have a vector space
- terms are axes
- docs live in this space
- may have 20,000 dimensions (even with stemming)
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)
Recap
- We began by looking at zones in scoring 
- Ended up viewing documents as vectors in a vector 
 space
- We will pursue this view next time.