Title: Basics: Task Definition
1. Basics: Task Definition, Evaluation, Characteristics of Texts
2. Outline
- Task definition
  - Tasks, types of systems, terminology
- Evaluation
  - Issues, test collections, metrics
- Statistical properties of text
  - Zipf's Law
3. Terminology
- Document
  - An information object with unknown structure
  - Types of documents: text (default), hypertext, multimedia
- Collection (database, corpus)
  - Examples: document database, text collection, corpus
  - An unordered set of documents
- Corpora
  - Several text databases
4. Information Needs
- Short-term information need (ad hoc retrieval)
  - Temporary need, e.g., info about used cars
  - Information source is relatively static
  - User pulls information
  - Application examples: library search, Web search
- Long-term information need (filtering)
  - Stable need, e.g., news stories about the war in Iraq
  - Information source is dynamic
  - System pushes information to the user
  - Application example: news filter
5. Relevance
- Relevance is difficult to define satisfactorily
- A relevant document is one judged useful in the context of a query
  - Who judges?
  - What is useful?
  - Judgment depends on more than the document and the query
- All retrieval models include an implicit definition of relevance
  - Satisfiability of a FOL expression
  - Distance
  - P(Relevance | query, document)
6. Relevance: Information Need vs. Query
- Information need i
  - You are looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- Query q
  - red or white wine related to heart attack
- Document d
  - "He then launched into the heart of his speech and attacked the wine industry lobby for downplaying the role of red and white wine in drunk driving."
- d is relevant to the query q ...
- ... but d is not relevant to the information need i.
7. Formal Formulation
- Vocabulary V = {w1, w2, ..., wN} of a language
- Query q = q1, ..., qm, where qi ∈ V
- Document di = di1, ..., di,mi, where dij ∈ V
- Collection C = {d1, ..., dk}
- Set of relevant documents R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query is a hint about which documents are in R(q)
- Task: compute R'(q), an approximation of R(q)
8. Computing R'(q)
- Strategy 1: Document selection (see the sketch below)
  - Classification function f(d, q) ∈ {0, 1}
  - Outputs 1 for relevance, 0 for irrelevance
  - R'(q) is determined as the set {d ∈ C | f(d, q) = 1}
  - The system must decide whether a document is relevant or not (absolute relevance)
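A minimal sketch of Strategy 1, assuming a toy classifier that marks a document relevant only if it contains every query term; this particular f is an illustration, not a definition given on the slides.

```python
def f_select(doc_tokens, query_tokens):
    """Toy classification function f(d, q) -> {0, 1}:
    1 if the document contains every query term, else 0."""
    return 1 if set(query_tokens) <= set(doc_tokens) else 0

def select(collection, query_tokens):
    """R'(q) = {d in C | f(d, q) = 1}: an absolute relevance decision per document."""
    return [d for d in collection if f_select(d, query_tokens) == 1]

# Toy collection of tokenized documents
C = [["red", "wine", "heart", "attack"], ["white", "wine", "tasting"]]
print(select(C, ["wine", "heart"]))  # -> [['red', 'wine', 'heart', 'attack']]
```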
9. Computing R'(q)
- Strategy 2: Document ranking (see the sketch below)
  - Similarity function f(d, q) ∈ ℝ
  - Outputs a similarity score between document d and query q
  - Cutoff θ
    - The minimum similarity for a document to be considered relevant to the query
  - R'(q) is determined as the set {d ∈ C | f(d, q) ≥ θ}
  - The system must decide whether one document is more likely to be relevant than another (relative relevance)
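A minimal sketch of Strategy 2, assuming a toy similarity function (query-term overlap) and a cutoff θ; the scoring function is illustrative, not one prescribed by the slides.

```python
def f_sim(doc_tokens, query_tokens):
    """Toy similarity function f(d, q) -> R: count of query terms present in d."""
    return len(set(query_tokens) & set(doc_tokens))

def rank(collection, query_tokens, theta=1):
    """R'(q) = {d in C | f(d, q) >= theta}, sorted by decreasing similarity."""
    scored = [(f_sim(d, query_tokens), d) for d in collection]
    kept = [(score, d) for score, d in scored if score >= theta]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

C = [["red", "wine", "heart", "attack"], ["white", "wine", "tasting"], ["used", "cars"]]
print(rank(C, ["wine", "heart", "attack"], theta=1))
# -> [(3, ['red', 'wine', 'heart', 'attack']), (1, ['white', 'wine', 'tasting'])]
```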
10. Document Selection vs. Ranking
[Figure comparing document selection and document ranking against the true R(q); figure omitted]
11. Which Strategy is Better?
12. Ranking is often preferred
- The similarity function is more general than the classification function
- Relevance is a subjective concept
- Factors other than the query and the document can be included in the ranking strategy through the cutoff θ
- The classifier is unlikely to be accurate
  - Ambiguous information needs
  - Over-constrained query (terms are too specific)
  - Under-constrained query (terms are too general)
  - The query is the only evidence for a user's information need
14. Ranking is often preferred
- Relevance is a subjective concept
  - A user can stop browsing anywhere, so the boundary is controlled by the user
  - High-recall users would view more items
  - High-precision users would view only a few
- Theoretical justification: the Probability Ranking Principle [Robertson 77]
15. Probability Ranking Principle [Robertson 77]
- As stated by Cooper (quoted below)
- Robertson provides two formal justifications
- Assumptions: independent relevance and sequential browsing

"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
16. Ad-hoc Retrieval
- Search a large collection of documents to find the ones that satisfy an information need (relevant documents)
- Example: Web search systems
17. Ad-hoc Retrieval
- Ranked ad-hoc retrieval
  - Return a set of documents that satisfy the query, ordered by (presumed) relevance/similarity
  - Good queries are still important, but large result sets are not a problem
  - Less time spent crafting queries
- Unranked ad-hoc retrieval
  - Return an unordered set of documents that satisfy the query
  - Usually used only in Boolean systems
  - It is important to create a good query so that the result set is small
  - But a small set may not have enough relevant documents
18. Cross-lingual Retrieval (CLIR)
- Query in one language (e.g., English)
- Return documents in other languages (e.g., Chinese)
- Sometimes called translingual or cross-language retrieval
19. Distributed Retrieval
- Ad-hoc retrieval in an environment with many text databases
- More complicated than centralized ad-hoc retrieval
  - Database selection
  - Merging results from different databases
20. Test Collections
- Retrieval performance is compared using a test collection
  - A set of documents, a set of queries, and a set of relevance judgments
- To compare two techniques
  - Each technique is used to evaluate the queries
  - Results (a set or a ranked list) are compared using some metric
  - Usually use multiple measures to get different perspectives
  - Usually test with multiple test collections, because performance is collection-dependent to some extent
21. Sample Test Collections
22. Test Collection I: Cranfield
- First testbed allowing precise quantitative measures of information retrieval effectiveness (late 1950s)
- 1,398 abstracts of aerodynamics journal articles
- A set of 225 queries
- Exhaustive relevance judgments of all query-document pairs
- Too small and too untypical for serious IR evaluation today
23. Test Collection II: TREC
- TREC: Text Retrieval Conference, organized by the U.S. National Institute of Standards and Technology (NIST)
- TREC Ad-hoc
  - 1.89 million documents, mainly newswire articles
  - 450 information needs
  - Relevance judgments are available only for the documents that were among the top k returned by the systems that entered the TREC evaluation
24. Test Collection III: Others
- GOV2
  - Another TREC/NIST collection
  - 25 million web pages
  - Largest collection that is easily available
  - But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index
- NTCIR
  - East Asian language and cross-language information retrieval
- Cross Language Evaluation Forum (CLEF)
  - This evaluation series has concentrated on European languages and cross-language information retrieval
- Many others
25. Finding Relevant Documents
- Two factors make finding relevant documents difficult
  - Given a large collection, it is impossible to judge every document for a query
  - Relevance judgment is subjective
- How to solve this problem?
26. Finding Relevant Documents
[Figure: a single query against a collection of 1,000,000 documents; figure omitted]
27. Finding Relevant Documents
- Pooling strategy (see the sketch below)
  - Retrieve documents using several techniques
  - Judge the top K documents for each technique
  - The relevant set is the union of the relevant documents found by each technique
  - The relevant set is a subset of the true relevant set
  - Problem: an incomplete set of relevant documents for a given query
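A minimal sketch of the pooling strategy, assuming each system's output is a ranked list of document IDs and that human judgments are available as a simple lookup; the run data and the `judge` function are illustrative.

```python
def pool(runs, K):
    """Pool the top-K document IDs from each system's ranked run."""
    pooled = set()
    for run in runs:
        pooled.update(run[:K])
    return pooled

def judged_relevant(pooled, judge):
    """Keep only pooled documents a human assessor judges relevant.
    This approximates R(q); anything outside the pool stays unjudged."""
    return {doc_id for doc_id in pooled if judge(doc_id)}

# Illustrative runs from three systems and a toy judgment function
runs = [["d3", "d1", "d7"], ["d1", "d9", "d3"], ["d2", "d3", "d5"]]
relevant = judged_relevant(pool(runs, K=2), judge=lambda d: d in {"d1", "d3"})
print(relevant)  # -> {'d1', 'd3'} (set order may vary)
```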
28. Finding Relevant Documents
- Relevance judgment is subjective
  - Disagreement among assessors
29. Finding Relevant Documents
- Judges disagree a lot. How can judgments from multiple reviewers be combined? (see the sketch below)
  - Average
  - Union
  - Intersection
  - Majority vote
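A minimal sketch of the combination rules listed above, assuming each assessor provides a binary judgment (0/1) per document; the sample judgments are illustrative, and interpreting "average" as a fractional relevance score is an assumption.

```python
def combine(judgments, rule="majority"):
    """Combine one document's binary judgments from several assessors."""
    votes = sum(judgments)
    if rule == "union":         # relevant if any assessor says so
        return int(votes > 0)
    if rule == "intersection":  # relevant only if all assessors agree
        return int(votes == len(judgments))
    if rule == "average":       # fractional relevance score
        return votes / len(judgments)
    return int(votes * 2 > len(judgments))  # majority vote

print(combine([1, 0, 1], "majority"))      # -> 1
print(combine([1, 0, 1], "intersection"))  # -> 0
```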
30. Finding Relevant Documents
- Assessor disagreement has a large impact on absolute performance numbers
- But virtually no impact on the relative ranking of systems
31. Evaluation Criteria
- Effectiveness
  - Precision, Recall
- Efficiency
  - Space and time complexity
- Usability
  - How useful for real users?
32. Evaluation Criteria
- Effectiveness (Ad-hoc Task)
  - Precision, Recall
- Efficiency
  - Space and time complexity
- Usability (Interactive Task)
  - How useful for real users?
33. Evaluation Metrics
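For reference, the two effectiveness metrics used throughout the following slides are defined in the standard way:

```latex
\[
\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})},
\qquad
\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}
\]
```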
34. Precision and Recall Curve
- Evaluate the precision at every retrieved document
- Plot a precision-recall curve
35. Precision and Recall Curve
- Evaluate the precision at every retrieved document
- Plot a precision-recall curve
[Figure: precision-recall curve (precision vs. recall); figure omitted]
36. Precision and Recall Curve
- Interpolation: take the maximum of all the future points
- Why?
[Figure: interpolated precision-recall curve (precision vs. recall); figure omitted]
37. Precision and Recall Curve
- Interpolation: take the maximum of all the future points
- Why? Users are willing to read more if precision and recall keep getting better (see the sketch below)
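A minimal sketch of the interpolation step, assuming precision/recall pairs have already been computed at each retrieved relevant document; the sample values are illustrative.

```python
def interpolate(points):
    """Interpolated precision at recall r = max precision at any recall >= r.
    `points` is a list of (recall, precision) pairs in increasing recall order."""
    interpolated = []
    best = 0.0
    for recall, precision in reversed(points):  # sweep from high recall to low
        best = max(best, precision)
        interpolated.append((recall, best))
    return list(reversed(interpolated))

# Illustrative raw curve: precision dips, then recovers
raw = [(0.2, 1.0), (0.4, 0.5), (0.6, 0.75), (0.8, 0.4)]
print(interpolate(raw))  # -> [(0.2, 1.0), (0.4, 0.75), (0.6, 0.75), (0.8, 0.4)]
```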
38. Precision and Recall Curve
- How do we compare two precision-recall curves?
[Figure: precision-recall curves for System 1, System 2, System 3, and System 4; figure omitted]
39. 11-point Interpolated Average Precision
[Figure: worked example of 11-point interpolated average precision; Avg. Prec. = 0.425; figure omitted]
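For reference, the 11-point measure reads interpolated precision at the recall levels 0.0, 0.1, ..., 1.0 and averages them; this per-query value is then typically averaged over all queries.

```latex
\[
\text{AvgPrec}_{11\text{-pt}} = \frac{1}{11}\sum_{j=0}^{10} p_{\text{interp}}\!\left(\tfrac{j}{10}\right),
\qquad
p_{\text{interp}}(r) = \max_{r' \ge r} p(r')
\]
```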
40. Multiple Evaluation Criteria
- To obtain a comprehensive view of IR performance, it is often necessary to examine multiple criteria
41. Evaluating Web Search Engines
- Ultimate goal of a web search engine
  - Make the user happy
- Factors include
  - Speed of response
  - Number of web pages being indexed
  - User interface
  - Most important: relevance
42. Evaluating Web Search Engines
- The web pages retrieved on the first results page matter most
- Commonly used metrics
  - Precision at rank 5, 10, and 20
  - Mean average precision (MAP)
    - For each query, compute its average precision across all recall levels
    - Then average these average precisions across all the queries
    - Given K queries Q = {q1, q2, ..., qK}, with relevant documents {dj1, dj2, ..., djmj} for query qj, MAP(Q) is computed as shown below
43. Zipf's Law
[Figure: word frequency data collected from the WSJ 1987 collection; figure omitted]
44. Zipf's Law
[Figure: log-log rank-frequency plot with a fitted line of slope 1 (in magnitude); figure omitted]
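For reference, Zipf's law states that the collection frequency of the r-th most frequent term is roughly inversely proportional to r, which is why the rank-frequency plot above is close to a straight line of slope -1 on log-log axes:

```latex
\[
f(r) \propto \frac{1}{r}
\quad\Longleftrightarrow\quad
\log f(r) = \log c - \log r
\]
```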
45. Zipf's Law
- Excerpted from Jamie Callan's slides
46. Implication of Zipf's Law
- Term usage is highly skewed
- Important for retrieval algorithms
47. Statistical Profile
48. Zipf's Law
- Question: how to estimate the probability of a word that does not appear in a collection?
49. Heaps' Law
- Estimate the vocabulary size for a collection based on the number of tokens found in the collection
- M = k * T^b, where M is the vocabulary size, T is the number of tokens, b is the slope (about 0.5, i.e., sub-linear growth), and k is typically between 30 and 100 (a worked example follows)
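A minimal worked example of Heaps' law, with k and b set to illustrative values inside the ranges quoted above, not fitted to any real collection.

```python
def heaps_vocabulary_size(num_tokens, k=50, b=0.5):
    """Heaps' law estimate M = k * T**b; k and b here are illustrative defaults."""
    return k * num_tokens ** b

# Estimated vocabulary size for a collection of 1,000,000 tokens
print(int(heaps_vocabulary_size(1_000_000)))  # -> 50000 with k=50, b=0.5
```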