Term and Document Frequency

1
CS 6633 Information Retrieval and Web Search
  • Lecture 5
  • Term and Document Frequency

Based on ppt files by Hinrich Schütze
2
This lecture
  • Parametric and field searches
  • Zones in documents
  • Scoring documents zone weighting
  • Index support for scoring
  • Term weighting

3
Parametric search
  • Documents have text (data) and metadata (data
    about data)
  • Metadata: a set of field-value pairs
  • Examples
  • Language = French
  • Format = pdf
  • Subject = Physics, etc.
  • Date = Feb 2000
  • Parametric search interface
  • Combine a full-text query with a selection of
    values for some fields

4
Parametric search example
Fixed field values or range
5
Parametric search example
Full-text search
6
(No Transcript)
7
Parametric/field search
  • In these examples, we select field values
  • Values can be hierarchical, e.g.,
  • Geography: Continent → Country → State → City
  • Domain: tw, edu.tw, nthu.edu.tw
  • Use field/value to navigate through the document
    collection, e.g.,
  • Aerospace companies in Brazil
  • Select Geography = Brazil
  • Select Line of Business = Aerospace
  • Use fields to filter docs and run text searches
    to narrow down

8
Index support for field search
  • Must be able to support queries of the form
  • Find pdf documents that contain "stanford
    university"
  • A field selection (on doc format) plus a phrase
    query
  • Field selection: use an inverted index of field
    values → docids (see the sketch below)
  • Organized by field name
  • Use compression
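
A minimal sketch of this kind of index support, with hypothetical documents and field names; the text query here is a word-level AND rather than a true positional phrase query:

```python
from collections import defaultdict

# Toy documents: free text plus metadata fields (hypothetical example data).
docs = {
    1: {"text": "stanford university admissions", "format": "pdf", "language": "en"},
    2: {"text": "stanford university history", "format": "html", "language": "en"},
    3: {"text": "university rankings", "format": "pdf", "language": "fr"},
}

# Inverted index of field values -> docids, organized by field name.
field_index = defaultdict(lambda: defaultdict(set))
# Inverted index over the free text (word level only).
text_index = defaultdict(set)

for doc_id, doc in docs.items():
    for field in ("format", "language"):
        field_index[field][doc[field]].add(doc_id)
    for term in doc["text"].split():
        text_index[term].add(doc_id)

# "Find pdf documents that contain stanford university":
# a field selection ANDed with the text terms.
hits = field_index["format"]["pdf"] & text_index["stanford"] & text_index["university"]
print(sorted(hits))  # [1]
```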

9
Parametric index support
  • Optional features
  • Wildcards (e.g., Author = s*trup)
  • Ranges: Date between 2003 and 2005 (sketched
    below)
  • Use a B-tree
  • Use query optimization heuristics
  • Process the part that has a small df first
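
The B-tree itself is not sketched here; as a stand-in for its range lookups, a sorted list of (date, docid) pairs with Python's bisect illustrates the Date-range selection (all data is hypothetical):

```python
import bisect

# Stand-in for a B-tree over the Date field: a sorted list of (date, docid)
# pairs supports range selection in O(log n + k).
date_index = sorted([
    ("2002-11-03", 7), ("2003-04-01", 2), ("2004-06-15", 5),
    ("2005-01-20", 9), ("2006-02-02", 4),
])

def date_range(lo, hi):
    """Return docids whose Date field falls in [lo, hi]."""
    left = bisect.bisect_left(date_index, (lo, -1))
    right = bisect.bisect_right(date_index, (hi, float("inf")))
    return {doc_id for _, doc_id in date_index[left:right]}

# Range: Date between 2003 and 2005.
print(sorted(date_range("2003-01-01", "2005-12-31")))  # [2, 5, 9]
```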

10
Zones
  • A zone is an identified region within a doc
  • E.g., Title, Abstract, Bibliography
  • Generally culled from marked-up input or document
    metadata (e.g., powerpoint)
  • Zones contain free text
  • Not a database field with a finite vocabulary
  • Indexes for each zone allow queries like
  • sorting in Title AND smith in Bibliography AND
    recur in Body (see the sketch below)
  • Not queries like "all papers whose authors cite
    themselves". Why?
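
A minimal sketch of per-zone indexes answering such a query, with hypothetical postings:

```python
# One inverted index per zone (hypothetical postings).
zone_index = {
    "title":        {"sorting": {2, 4, 7}, "smith": {3}},
    "bibliography": {"smith": {2, 5, 7}},
    "body":         {"recur": {1, 2, 7, 9}},
}

def zone_and(*term_zone_pairs):
    """AND of (term, zone) conditions, e.g. sorting in Title AND smith in Bibliography."""
    result = None
    for term, zone in term_zone_pairs:
        postings = zone_index.get(zone, {}).get(term, set())
        result = postings if result is None else result & postings
    return result or set()

print(sorted(zone_and(("sorting", "title"),
                      ("smith", "bibliography"),
                      ("recur", "body"))))  # [2, 7]
```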

11
One index per zone
Separate inverted indexes for the Author, Title, Body zones, etc.
12
Comparing information retrieval and database
queries
  • Databases do lots of things we don't need
  • Transactions
  • Recovery (our index is not the system of record;
    if it breaks, simply reconstruct it from the
    original source)
  • Indeed, we never have to store text in a search
    engine, only indexes
  • In information retrieval, we focus on
  • Optimizing indexes for text-oriented queries
  • Not SQL commands (matching fields with values)

13
Scoring and Ranking
14
Scoring
  • The nature of Boolean queries
  • Docs either match or not: a score of 0 or 1
  • Do people like Boolean queries?
  • Experts with a good understanding of their needs
    and of the doc collection can use Boolean queries
    effectively
  • Difficult for casual users
  • Good for small collections
  • For large collections (the Web), search results
    can be thousands of documents
  • Difficult to go through thousands of results

15
Scoring
  • We wish to return ranked results where the most
    likely documents are placed at the top
  • How can we rank order the docs in the corpus with
    respect to a query?
  • Give each document a score in [0, 1] for the query
  • Assume a perfect world without (keyword) spammers
  • No stuffing keywords into a doc to make it match
    queries
  • Will talk about adversarial IR (in Web Search
    part)

16
Linear zone combinations
  • First generation of scoring methods
  • Use a linear combination of Booleans, e.g.,
  • Score = 0.6·<sorting in Title> + 0.3·<sorting in
    Abstract> + 0.05·<sorting in Body> +
    0.05·<sorting in Boldface> (sketched below)
  • Each expression (e.g., <sorting in Title>) is
    given a value in {0, 1} → the overall score is in
    [0, 1]
  • AND queries will still return docs when only one
    keyword matches something (like an OR query)
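
A minimal sketch of the weighted combination above, using the slide's weights and hypothetical zone-hit indicators:

```python
# Weighted linear combination of Boolean zone matches.
weights = {"title": 0.6, "abstract": 0.3, "body": 0.05, "boldface": 0.05}

def zone_score(hits):
    """hits maps zone -> 1 if the keyword occurs in that zone of the doc, else 0."""
    return sum(weights[zone] * hits.get(zone, 0) for zone in weights)

# "sorting" occurs in the Title and Body of this doc, but not in the other zones.
print(round(zone_score({"title": 1, "body": 1}), 2))  # 0.65
```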

17
Linear zone combinations
  • How to generate weights such as 0.6?
  • The user?
  • The IR system?
  • Mathematical model for weights and ranking
  • Term frequency
  • Document frequency

18
Exercise
  • Query bill OR rights

(Posting lists per zone, in docID order)
Author index: bill → 1 → 2;  rights → (none)
Title index:  bill → 3 → 5 → 8;  rights → 3 → 5 → 9
Body index:   bill → 1 → 2 → 5 → 9;  rights → 3 → 5 → 8 → 9
19
Combining Boolean and Ranking
  • Perform Boolean query processing (AND query)
  • Process query keywords in descending order of df
  • Merge the posting lists for the keywords
  • Stop when we have more docs than necessary
  • Instead of a score of 1 for all docs, give each
    doc a new score for ranking
  • A keyword with small df is more important
  • A doc with high tf is more relevant

20
General idea
  • Assign a score to each doc/keyword
  • Given a weight vector with a weight for each
    zone/field.
  • Combine weights of keyword and zones/fields
  • Present the top K highest-scoring docs
  • K = 10, 20, 50, 100

21
Index support for zone combinations
  • One index per zone
  • Alternative
  • one single index
  • Qualify term with zone in dictionary
  • E.g.,
  • The above scheme is still wasteful
  • Each term is potentially replicated for each zone

bill.author → 1 → 2
bill.title → 3 → 5 → 8
bill.body → 1 → 2 → 5 → 9
22
Zone combinations index
  • Yet another alternative
  • Encode the zone in the postings as numbers
  • At query time, merge postings and
  • Match zone in query and postings
  • Accumulate from matched zones

bill → 1.author, 1.body → 2.author, 2.body → 3.title
23
Score accumulation
Score accumulators are kept for docs 1, 2, 3, 5
  • As we walk the postings for the query bill OR
    rights, we accumulate scores for each doc in a
    linear merge as before.
  • Note we get both bill and rights in the Title
    field of doc 3, but score it no higher.
  • Should we give more weight to more hits?

bill → 1.author, 1.body → 2.author, 2.body → 3.title
rights → 3.title, 3.body → 5.title, 5.body
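
A minimal sketch of this accumulation over zone-encoded postings, using the postings above and hypothetical zone weights:

```python
# Each posting is (docid, zone); matched zones contribute their weight
# to the doc's accumulator.
zone_weight = {"author": 0.2, "title": 0.5, "body": 0.3}

postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}

def score_or_query(terms):
    """Accumulate zone-weighted scores for an OR query over the terms."""
    acc = {}
    for term in terms:
        for doc_id, zone in postings.get(term, []):
            acc[doc_id] = acc.get(doc_id, 0.0) + zone_weight[zone]
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(score_or_query(["bill", "rights"]))
# [(3, 1.3), (5, 0.8), (1, 0.5), (2, 0.5)]
```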
24
Where do these weights come from?
  • Machine learned relevance
  • Given
  • A test corpus
  • A suite of test queries
  • A set of relevance judgments
  • Learn a set of weights such that the relevance
    judgments are matched
  • Can be formulated as ordinal regression
  • More in next week's lecture

25
Full text queries
  • We just scored the Boolean query bill OR rights
  • Most users are more likely to type bill rights or
    bill of rights (a free-text query without Boolean
    connectives)
  • Interpret such queries as AND (large collection)
    or OR (small collection)
  • Google uses AND as the default
  • Yahoo! probably uses OR: may match docs with
    missing keywords

26
Full text queries
  • Combining zones with free-text queries, we need
  • A way of assigning a score to a pair <free-text
    query, zone>
  • Zero query terms in the zone should mean a zero
    score
  • More query terms in the zone should mean a higher
    score
  • Scores are in [0, 1]
  • Will look at some alternatives now

27
Incidence matrices
  • Recall: a document (or a zone of it) is a binary
    vector X in {0, 1}^v, where v is the vocabulary
    size
  • The query is a vector, too
  • Score: overlap measure (count of keyword hits)

28
Example
  • On the query ides of march, Shakespeare's Julius
    Caesar has a score of 3
  • All other Shakespeare plays have a score of 2
    (because they contain march) or 1
  • Thus, in a rank order, Julius Caesar would come
    out on top

29
Overlap matching
  • What's wrong with the overlap measure?
  • It doesn't consider
  • Term frequency in the document
  • Term specificity in the collection (document
    frequency): of is more common than ides or march
  • Length of documents
  • Longer docs have an advantage

30
Overlap matching
  • One can normalize in various ways
  • Jaccard coefficient
  • Cosine measure
  • What documents would score best using Jaccard
    against a typical query?
  • Does the cosine measure fix this problem?
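
A small illustration of the three measures on binary term sets (hypothetical doc and query); note how the Jaccard denominator rewards short documents that contain the query terms:

```python
import math

# Binary "bag of terms" views of a doc (or zone) and a query.
doc = {"ides", "of", "march", "caesar", "rome"}
query = {"ides", "of", "march"}

overlap = len(doc & query)                                    # raw keyword-hit count
jaccard = len(doc & query) / len(doc | query)                 # normalizes by the union size
cosine = len(doc & query) / math.sqrt(len(doc) * len(query))  # normalizes by vector lengths

print(overlap, round(jaccard, 3), round(cosine, 3))  # 3 0.6 0.775
```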

31
Scoring: density-based
  • Thus far: position and overlap of terms in a doc
    or its zones (title, author)
  • Intuitively, if a document talks about a keyword
    more
  • The doc is more relevant
  • Even when we only have a single query term
  • A document is relevant if it has many instances
    of the term
  • This leads to the idea of term weighting.

32
Term weighting
33
Term-document count matrices
  • Consider the number of occurrences of a term in a
    document
  • Bag of words model
  • Each document is a vector in N^v (a column of the
    term-document count matrix)

34
Bag of words view of a doc
  • Thus the doc
  • John is quicker than Mary.
  • is indistinguishable from the doc
  • Mary is quicker than John.
  • Which of the indexes discussed distinguish these
    two docs?

35
Counts vs. frequencies
  • Consider again the ides of march query.
  • Julius Caesar has 5 occurrences of ides
  • No other play has ides
  • march occurs in over a dozen plays
  • All the plays contain of
  • By a frequency-based scoring measure
  • The top-scoring play is likely to be the one with
    the most occurrences of of

36
Digression terminology
  • WARNING: in a lot of IR literature, "frequency"
    is used to mean "count"
  • Thus "term frequency" in the IR literature is
    used to mean the number of occurrences of a term
    in a doc
  • Not divided by document length (which would
    actually make it a frequency)
  • We will conform to this misnomer
  • In saying term frequency we mean the number of
    occurrences of a term in a document.

37
Term frequency tf
  • Long docs are favored because they're more likely
    to contain query terms
  • Can fix this to some extent by normalizing for
    document length
  • But is raw tf the right measure?

38
Weighting term frequency tf
  • What is the relative importance of
  • 0 vs. 1 occurrence of a term in a doc
  • 1 vs. 2 occurrences
  • 2 vs. 3 occurrences
  • Unclear: while it seems that more is better,
    maybe not proportionally better
  • Can just use raw tf
  • Another option commonly used in practice
    (sketched below)
  • wf = 0 if tf = 0
  • wf = 1 + log(tf) if tf > 0
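
A minimal sketch of this log-scaled weighting (natural log here; the base of the log varies across textbooks):

```python
import math

def wf(tf):
    """Log-scaled term-frequency weight: 0 if tf == 0, else 1 + log(tf)."""
    return 0.0 if tf == 0 else 1.0 + math.log(tf)

for tf in (0, 1, 2, 10):
    print(tf, round(wf(tf), 2))  # 0 -> 0.0, 1 -> 1.0, 2 -> 1.69, 10 -> 3.3
```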

39
Score computation
  • Score for a query q: sum over the terms t in q,
    i.e. Score(q, d) = sum of tf_t,d over t in q
    (sketched below)
  • Note: the score is 0 if no query terms occur in
    the document
  • This score can be zone-combined
  • Can use wf instead of tf in the above
  • Still doesn't consider term scarcity in the
    collection (ides is rarer than of)
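
A minimal sketch of the score as a plain sum of tf over the query terms, with hypothetical counts; wf could be substituted for tf:

```python
# Hypothetical term counts for one document.
tf = {"ides": 5, "march": 2, "caesar": 12}

def score(query_terms, tf_counts):
    """Score(q, d) = sum over terms t in q of tf(t, d); wf could replace tf."""
    return sum(tf_counts.get(t, 0) for t in query_terms)

print(score(["ides", "of", "march"], tf))  # 7 ("of" is absent and contributes 0)
```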

40
Weighting should depend on the term overall
  • Which of these tells you more about a doc?
  • 10 occurrences of hernia?
  • 10 occurrences of the?
  • Would like to lower the weight of a common term
  • But what is common?
  • Suggest looking at collection frequency (cf)
  • The total number of occurrences of the term in
    the entire collection of documents

41
Document frequency
  • But document frequency (df) may be better
  • df = number of docs in the corpus containing the
    term
  • Word         cf      df
  • ferrari      10422   17
  • insurance    10440   3997
  • Document/collection frequency weighting is only
    possible in a known (static) collection
  • So how do we make use of df? (A small df/cf
    computation is sketched below.)
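
A small sketch computing cf and df from a toy, hypothetical corpus:

```python
from collections import Counter

# Each document is a list of tokens.
corpus = [
    "the ferrari crashed the race".split(),
    "insurance for the ferrari".split(),
    "insurance insurance claims".split(),
]

cf = Counter()  # collection frequency: total occurrences across all docs
df = Counter()  # document frequency: number of docs containing the term
for doc in corpus:
    cf.update(doc)
    df.update(set(doc))

for term in ("ferrari", "insurance", "the"):
    print(term, cf[term], df[term])
# ferrari 2 2 / insurance 3 2 / the 3 2
```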

42
tf x idf term weights
  • The tf x idf measure combines
  • term frequency (tf)
  • or wf, some measure of term density in a doc
  • inverse document frequency (idf)
  • a measure of the informativeness of a term: its
    rarity across the whole corpus
  • could just be based on the raw count of documents
    the term occurs in (idf_i = 1/df_i)
  • but by far the most commonly used version is
    idf_i = log(N / df_i), with N the number of docs
    in the collection
  • See Kishore Papineni, NAACL 2, 2002 for
    theoretical justification

43
Summary tf x idf (or tf.idf)
  • Assign a tf.idf weight to each term i in each
    document d: w_i,d = tf_i,d x log(N / df_i)
  • Increases with the number of occurrences within a
    doc
  • Increases with the rarity of the term across the
    whole corpus
  • What is the weight of a term that occurs in all
    of the docs? (See the sketch below.)
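
A minimal sketch of the tf.idf weight, assuming a hypothetical collection size N, hypothetical df values, and a base-10 log:

```python
import math

N = 1_000_000                                        # hypothetical collection size
df = {"ides": 17, "of": 1_000_000, "march": 12_000}  # hypothetical df values

def tf_idf(tf, df_t):
    """tf.idf weight: term count scaled by log(N / df_t)."""
    return tf * math.log10(N / df_t)

print(round(tf_idf(5, df["ides"]), 2))   # 23.85
print(round(tf_idf(5, df["of"]), 2))     # 0.0  (a term in every doc gets idf = 0)
print(round(tf_idf(2, df["march"]), 2))  # 3.84
```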

44
Real-valued term-document matrices
  • Function (scaling) of count of a word in a
    document
  • Bag of words model
  • Each doc is a vector in R^v
  • Here: log-scaled tf.idf

Note: entries can be > 1!
45
Documents as vectors
  • Each doc j can now be viewed as a vector of
    wf × idf values, one component for each term
  • So we have a vector space
  • terms are axes
  • docs live in this space
  • may have 20,000 dimensions (even with stemming)
  • (The corpus of documents gives us a matrix, which
    we could also view as a vector space in which
    words live: transposable data)

46
Recap
  • We began by looking at zones in scoring
  • Ended up viewing documents as vectors in a vector
    space
  • We will pursue this view next time.