Lecture 20: Evaluation
1
Lecture 20 Evaluation
SIMS 202 Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday, 10:30 am - 12:00 pm
  • Fall 2002
  • http://www.sims.berkeley.edu/academics/courses/is202/f02/

2
Lecture Overview
  • Review
  • Lexical Relations
  • WordNet
  • Can Lexical and Semantic Relations be Exploited
    to Improve IR?
  • Evaluation of IR systems
  • Precision vs. Recall
  • Cutoff Points
  • Test Collections/TREC
  • Blair & Maron Study

Credit for some of the slides in this lecture
goes to Marti Hearst and Warren Sack
3
Syntax
  • The syntax of a language is to be understood as a
    set of rules which accounts for the distribution
    of word forms throughout the sentences of a
    language
  • These rules codify permissible combinations of
    classes of word forms

4
Semantics
  • Semantics is the study of linguistic meaning
  • Two standard approaches to lexical semantics
    (cf., sentential semantics and logical
    semantics)
  • (1) Compositional
  • (2) Relational

5
Pragmatics
  • Deals with the relation between signs or
    linguistic expressions and their users
  • Deixis (literally "pointing out")
  • E.g., "I'll be back in an hour" depends upon the
    time of the utterance
  • Conversational implicature
  • A: Can you tell me the time?
  • B: Well, the milkman has come. (I don't know
    exactly, but perhaps you can deduce it from some
    extra information I give you.)
  • Presupposition
  • Are you still such a bad driver?
  • Speech acts
  • Constatives vs. performatives
  • E.g., I second the motion.
  • Conversational structure
  • E.g., turn-taking rules

6
Major Lexical Relations
  • Synonymy
  • Polysemy
  • Metonymy
  • Hyponymy/Hyperonymy
  • Meronymy
  • Antonymy

7
Thesauri and Lexical Relations
  • Polysemy: same word, different senses of meaning
  • Slightly different concepts expressed similarly
  • Synonymy: different words, related senses of
    meaning
  • Different ways to express similar concepts
  • Thesauri help draw all these together
  • Thesauri also commonly define a set of relations
    between terms that is similar to lexical
    relations
  • BT (broader term), NT (narrower term), RT
    (related term)

8
WordNet
  • Started in 1985 by George Miller, students, and
    colleagues at the Cognitive Science Laboratory,
    Princeton University
  • Can be downloaded for free
  • www.cogsci.princeton.edu/wn/
  • "In terms of coverage, WordNet's goals differ
    little from those of a good standard
    college-level dictionary, and the semantics of
    WordNet is based on the notion of word sense that
    lexicographers have traditionally used in writing
    dictionaries. It is in the organization of that
    information that WordNet aspires to innovation."
  • (Miller, 1998, Chapter 1)

9
WordNet Size
WordNet uses synsets: sets of synonymous terms
  • POS         Unique Strings   Synsets
  • Noun             107,930      74,488
  • Verb              10,806      12,754
  • Adjective         21,365      18,523
  • Adverb             4,583       3,612
  • Totals           144,684     109,377

10
Structure of WordNet
11
Structure of WordNet
12
Structure of WordNet
13
Lexical Relations and IR
  • Recall that most IR research has primarily looked
    at statistical approaches to inferring the
    topicality or meaning of documents
  • I.e., Statistics imply Semantics
  • Is this really true or correct?
  • How has (or might) WordNet be used to provide
    more functionality in searching?
  • What about other thesauri, classification
    schemes, and ontologies?

14
Using NLP
  • Strzalkowski

(Diagram, after Strzalkowski: text → tagger → parser →
terms/NLP representation → database search)
15
NLP & IR: Possible Approaches
  • Indexing
  • Use of NLP methods to identify phrases
  • Test weighting schemes for phrases
  • Use of more sophisticated morphological analysis
  • Searching
  • Use of two-stage retrieval
  • Statistical retrieval
  • Followed by more sophisticated NLP filtering

16
Can Statistics Approach Semantics?
  • One approach is the Entry Vocabulary Index (EVI)
    work being done here
  • (The following slides are from my presentation at
    JCDL 2002)

17
What is an Entry Vocabulary Index?
  • EVIs are a means of mapping from users'
    vocabulary to the controlled vocabulary of a
    collection of documents

18
Solution: Entry Level Vocabulary Indexes
(Diagram: the EVI maps the user term "Automobile" to
the index entry "pass mtr veh spark ign eng")
19
(Diagram: digital library resources connected to user
vocabulary by statistical association)
20
Lecture Overview
  • Review
  • Lexical Relations
  • WordNet
  • Can Lexical and Semantic Relations be Exploited
    to Improve IR?
  • Evaluation of IR systems
  • Precision vs. Recall
  • Cutoff Points
  • Test Collections/TREC
  • Blair & Maron Study

Credit for some of the slides in this lecture
goes to Marti Hearst and Warren Sack
21
IR Evaluation
  • Why Evaluate?
  • What to Evaluate?
  • How to Evaluate?

22
Why Evaluate?
  • Determine if the system is desirable
  • Make comparative assessments
  • Is system X better than system Y?
  • Others?

23
What to Evaluate?
  • How much of the information need is satisfied
  • How much was learned about a topic
  • Incidental learning
  • How much was learned about the collection
  • How much was learned about other topics
  • How inviting the system is

24
Relevance
  • In what ways can a document be relevant to a
    query?
  • Answer precise question precisely
  • Partially answer question
  • Suggest a source for more information
  • Give background information
  • Remind the user of other knowledge
  • Others...

25
Relevance
  • How relevant is the document?
  • For this user for this information need
  • Subjective, but
  • Measurable to some extent
  • How often do people agree a document is relevant
    to a query?
  • How well does it answer the question?
  • Complete answer? Partial?
  • Background Information?
  • Hints for further exploration?

26
What to Evaluate?
  • What can be measured that reflects users' ability
    to use the system? (Cleverdon 66)
  • Coverage of information
  • Form of presentation
  • Effort required/ease of use
  • Time and space efficiency
  • Recall
  • Proportion of relevant material actually
    retrieved
  • Precision
  • Proportion of retrieved material actually relevant

Effectiveness
27
Relevant vs. Retrieved
(Venn diagram: within the set of all docs, the
retrieved set and the relevant set partially overlap)
28
Precision vs. Recall
  • Precision = |relevant ∩ retrieved| / |retrieved|
  • Recall = |relevant ∩ retrieved| / |relevant|
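These two measures can be computed directly from sets of document IDs. A minimal sketch (the IDs and counts below are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: IDs of documents the system returned
    relevant:  IDs of documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 10 docs retrieved, 8 relevant docs exist in the
# collection, and 4 of the retrieved docs are relevant.
p, r = precision_recall(retrieved=range(10),
                        relevant=[0, 1, 2, 3, 10, 11, 12, 13])
# p = 0.4 (4/10), r = 0.5 (4/8)
```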
29
Why Precision and Recall?
  • Get as much good stuff as possible while at the
    same time getting as little junk as possible

30
Retrieved vs. Relevant Documents
Very high precision, very low recall
31
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
32
Retrieved vs. Relevant Documents
High recall, but low precision
33
Retrieved vs. Relevant Documents
High precision, high recall (at last!)
34
Precision/Recall Curves
  • There is a tradeoff between Precision and Recall
  • So measure Precision at different levels of
    Recall
  • Note this is an AVERAGE over MANY queries

35
Precision/Recall Curves
  • Difficult to determine which of these two
    hypothetical results is better

(Plot: two hypothetical precision/recall curves that
cross, so neither dominates the other)
36
TREC (Manual Queries)
37
Document Cutoff Levels
  • Another way to evaluate
  • Fix the number of documents retrieved at
    several levels
  • Top 5
  • Top 10
  • Top 20
  • Top 50
  • Top 100
  • Top 500
  • Measure precision at each of these levels
  • Take (weighted) average over results
  • This is a way to focus on how well the system
    ranks the first k documents
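The cutoff idea above amounts to "precision at k." A minimal sketch (the toy ranking and judgments are made up):

```python
def precision_at_k(ranking, relevant, k):
    """Precision over the top-k documents of a ranking."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# Toy ranking in which docs 1 and 3 are the relevant ones:
ranking, relevant = [1, 2, 3, 4, 5], {1, 3}
scores = {k: precision_at_k(ranking, relevant, k) for k in (1, 2, 5)}
# {1: 1.0, 2: 0.5, 5: 0.4}
```

A system-level score would then average these values over many queries at each cutoff.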

38
Problems with Precision/Recall
  • Can't know the true recall value
  • Except in small collections
  • Precision/Recall are related
  • A combined measure sometimes more appropriate
  • Assumes batch mode
  • Interactive IR is important and has different
    criteria for successful searches
  • We will touch on this in the UI section
  • Assumes a strict rank ordering matters

39
Relation to Contingency Table
                          Doc is Relevant   Doc is NOT relevant
  Doc is retrieved               a                   b
  Doc is NOT retrieved           c                   d

  • Accuracy = (a+d) / (a+b+c+d)
  • Precision = a / (a+b)
  • Recall = a / (a+c)
  • Why don't we use Accuracy for IR Evaluation?
    (Assuming a large collection)
  • Most docs aren't relevant
  • Most docs aren't retrieved
  • Inflates the accuracy value
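The accuracy-inflation point is easy to see numerically. A sketch, with made-up counts for a large collection:

```python
def contingency_measures(a, b, c, d):
    """a: relevant & retrieved      b: not relevant & retrieved
       c: relevant & not retrieved  d: not relevant & not retrieved"""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b)
    recall = a / (a + c)
    return accuracy, precision, recall

# In a 1,000,000-doc collection almost everything lands in cell d,
# so accuracy is near 1.0 even though precision and recall are modest.
acc, p, r = contingency_measures(a=10, b=10, c=40, d=999_940)
# acc = 0.99995, p = 0.5, r = 0.2
```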

40
The E-Measure
  • Combine Precision and Recall into one number (van
    Rijsbergen 79)

E = 1 - (1 + b²) P R / (b² P + R)

P = precision, R = recall, b = measure of the relative
importance of P or R. For example, b = 0.5 means the
user is twice as interested in precision as recall.
41
F Measure (Harmonic Mean)

F = 2 P R / (P + R)

The harmonic mean of precision and recall (F = 1 - E
when b = 1); it is high only when both P and R are high.
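Both measures can be sketched in a few lines, using the van Rijsbergen form of E (where b weights recall relative to precision, so b = 0.5 favors precision):

```python
def e_measure(p, r, b=1.0):
    """van Rijsbergen's E-measure: E = 1 - (1 + b^2)PR / (b^2 P + R).
    Lower is better; b < 1 means the user cares more about precision."""
    return 1.0 - (1 + b * b) * p * r / (b * b * p + r)

def f_measure(p, r):
    """Harmonic mean of precision and recall; equals 1 - E at b = 1."""
    return 2 * p * r / (p + r)

# With P = 0.2 and R = 0.8: F = 0.32 and E = 0.68.
```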
42
Test Collections
  • Cranfield 2
  • 1,400 Documents, 221 Queries
  • 200 Documents, 42 Queries
  • INSPEC: 542 Documents, 97 Queries
  • UKCIS: > 10,000 Documents, multiple sets, 193
    Queries
  • ADI: 82 Documents, 35 Queries
  • CACM: 3,204 Documents, 50 Queries
  • CISI: 1,460 Documents, 35 Queries
  • MEDLARS (Salton): 273 Documents, 18 Queries

43
TREC
  • Text REtrieval Conference/Competition
  • Run by NIST (National Institute of Standards and
    Technology)
  • 1999 was the 8th year; the 9th TREC is in early
    November
  • Collection: > 6 gigabytes (5 CD-ROMs), > 1.5
    million docs
  • Newswire and full-text news (AP, WSJ, Ziff, FT)
  • Government documents (Federal Register,
    Congressional Record)
  • Radio transcripts (FBIS)
  • Web subsets (the Large Web track is separate,
    with 18.5 million pages of Web data, 100 GB)
  • Patents

44
TREC (cont.)
  • Queries & Relevance Judgments
  • Queries devised and judged by information
    specialists
  • Relevance judgments done only for those documents
    retrieved, not the entire collection!
  • Competition
  • Various research and commercial groups compete
    (TREC-6 had 51, TREC-7 had 56, TREC-8 had 66)
  • Results judged on precision and recall, going up
    to a recall level of 1000 documents
  • Following slides from TREC overviews by Ellen
    Voorhees of NIST

45-49
(Charts from the TREC overview slides; no transcript)
50
Sample TREC Query (Topic)
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: A document will
address the role of the Federal Government in
financing the operation of the National Railroad
Transportation Corporation (AMTRAK)
<narr> Narrative: A relevant document must provide
information on the government's responsibility to
make AMTRAK an economically viable entity. It
could also discuss the privatization of AMTRAK as
an alternative to continuing government
subsidies. Documents comparing government
subsidies given to air and bus transportation
with those provided to AMTRAK would also be
relevant.
51-55
(Charts from the TREC overview slides; no transcript)
56
TREC
  • Benefits
  • Made research systems scale to large collections
    (pre-WWW)
  • Allows for somewhat controlled comparisons
  • Drawbacks
  • Emphasis on high recall, which may be unrealistic
    for what most users want
  • Very long queries, also unrealistic
  • Comparisons still difficult to make, because
    systems are quite different on many dimensions
  • Focus on batch ranking rather than interaction
  • There is an interactive track

57
TREC is Changing
  • Emphasis on specialized tracks
  • Interactive track
  • Natural Language Processing (NLP) track
  • Multilingual tracks (Chinese, Spanish)
  • Filtering track
  • High-Precision
  • High-Performance
  • http://trec.nist.gov/

58
Blair and Maron 1985
  • A classic study of retrieval effectiveness
  • Earlier studies were on unrealistically small
    collections
  • Studied an archive of documents for a legal suit
  • 350,000 pages of text
  • 40 queries
  • Focus on high recall
  • Used IBM's STAIRS full-text system
  • Main result:
  • The system retrieved less than 20% of the
    relevant documents for a particular information
    need
  • Lawyers thought they had 75%
  • But many queries had very high precision

59
Blair and Maron (cont.)
  • How they estimated recall
  • Generated partially random samples of unseen
    documents
  • Had users (unaware these were random) judge them
    for relevance
  • Other results:
  • Two lawyers' searches had similar performance
  • Lawyers' recall was not much different from
    paralegals'
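Their sampling approach can be sketched as follows: judge a random sample of the unretrieved documents, scale the relevant fraction up to the whole unseen set, and divide. All numbers below are invented, not from the study:

```python
def estimated_recall(found_relevant, unseen_size, sample_size, sample_relevant):
    """Estimate recall when the full set of relevant docs is unknown,
    in the spirit of Blair and Maron's sampling method.

    found_relevant:  relevant docs among those retrieved
    unseen_size:     number of documents never retrieved
    sample_relevant: relevant docs found in a random sample
                     of sample_size unseen documents
    """
    # Scale the sample's relevant fraction up to the whole unseen set.
    est_unseen_relevant = (sample_relevant / sample_size) * unseen_size
    return found_relevant / (found_relevant + est_unseen_relevant)

# 100 relevant docs retrieved; 5 of 1,000 sampled unseen docs are
# relevant, suggesting roughly 500 relevant docs were missed:
# estimated_recall(100, 100_000, 1_000, 5) is about 0.167
```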

60
Blair and Maron (cont.)
  • Why recall was low:
  • Users can't foresee the exact words and phrases
    that will indicate relevant documents
  • "accident" referred to by those responsible as
    "event," "incident," "situation," "problem," ...
  • Differing technical terminology
  • Slang, misspellings
  • Perhaps the value of higher recall decreases as
    the number of relevant documents grows, so more
    detailed queries were not attempted once the
    users were satisfied