Title: IR and NLP
1IR and NLP
- Jimmy Lin
- College of Information Studies
- Institute for Advanced Computer Studies
- University of Maryland
- Wednesday, March 15, 2006
2On the Menu
- Overview of information retrieval
- Evaluation
- Three IR models
- Boolean
- Vector space
- Language modeling
- NLP for IR
3Types of Information Needs
- Retrospective
- Searching the past
- Different queries posed against a static collection
- Time invariant
- Prospective
- Searching the future
- Static query posed against a dynamic collection
- Time dependent
4Retrospective Searches (I)
- Ad hoc retrieval: "find documents about this"
- Known item search
- Directed exploration
Identify positive accomplishments of the Hubble
telescope since it was launched in 1991. Compile
a list of mammals that are considered to be
endangered, identify their habitat and, if
possible, specify what threatens them.
Find Jimmy Lin's homepage. What's the ISBN
number of Modern Information Retrieval?
Who makes the best chocolates? What video
conferencing systems exist for digital reference
desk services?
5Retrospective Searches (II)
6Prospective Searches
- Filtering
- Make a binary decision about each incoming document (e.g., spam or not spam?)
- Routing
- Sort incoming documents into different bins (e.g., categorize news headlines: World? Nation? Metro? Sports?)
7The Information Retrieval Cycle
Source Selection
Query Formulation
Search
Selection
Examination
Delivery
8Supporting the Search Process
Source Selection
Resource
Query Formulation
Query
Search
Ranked List
Selection
Indexing
Documents
Index
Examination
Acquisition
Documents
Collection
Delivery
9Evaluation
10IR is an experimental science!
- Formulate a research question: the hypothesis
- Questions about the system
- Questions about the system user
- Design an experiment to answer the question
- Perform the experiment
- Compare with a baseline
- Does the experiment answer the question?
- Are the results significant? Or is it just luck?
- Report the results!
- Rinse, repeat
11The Importance of Evaluation
- The ability to measure differences underlies experimental science
- How well do our systems work?
- Is A better than B?
- Is it really?
- Under what conditions?
- Evaluation drives what to research
- Identify techniques that work and don't work
- Build on techniques that work
12Evaluating the Black Box
Search
13Automatic Evaluation Model
Documents
Query
IR Black Box
Ranked List
Evaluation Module
Relevance Judgments
Measure of Effectiveness
14Test Collections
- Reusable test collections consist of:
- Collection of documents
- Should be representative
- Things to consider: size, sources, genre, topics, ...
- Sample of information needs
- Should be randomized and representative
- Usually formalized as topic statements
- Known relevance judgments
- Assessed by humans, for each topic-document pair (topic, not query!)
- Binary judgments make evaluation easier
- Measure of effectiveness
- Usually a numeric score for quantifying performance
- Used to compare different systems
15Which is the Best Rank Order?
[Figure: six candidate rankings of the same documents, labeled A through F]
16Set-Based Measures
- Precision = A / (A + B)
- Recall = A / (A + C)
- Miss = C / (A + C)
- False alarm (fallout) = B / (B + D)

                 Relevant    Not relevant
Retrieved           A             B
Not retrieved       C             D

Collection size = A + B + C + D
Relevant = A + C
Retrieved = A + B

When is precision important? When is recall important?
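As a concrete illustration of these set-based measures, here is a minimal Python sketch (not from the original slides) that computes them from a set of retrieved and a set of relevant document IDs; the IDs and collection size are invented examples.

```python
def set_measures(retrieved, relevant, collection_size):
    """Compute set-based effectiveness measures from the contingency table."""
    a = len(retrieved & relevant)     # relevant and retrieved
    b = len(retrieved - relevant)     # not relevant but retrieved
    c = len(relevant - retrieved)     # relevant but not retrieved
    d = collection_size - a - b - c   # neither relevant nor retrieved
    return {
        "precision": a / (a + b) if (a + b) else 0.0,
        "recall":    a / (a + c) if (a + c) else 0.0,
        "miss":      c / (a + c) if (a + c) else 0.0,
        "fallout":   b / (b + d) if (b + d) else 0.0,
    }

# Hypothetical example: 10-document collection
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}
print(set_measures(retrieved, relevant, collection_size=10))
```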
17Another View
[Figure: Venn diagram over the space of all documents, showing the overlap between the set of relevant documents and the set of retrieved documents]
18ROC Curves
Adapted from a presentation by Ellen Voorhees at
the University of Maryland, March 29, 1999
19Building Test Collections
- Where do test collections come from?
- Someone goes out and builds them (expensive)
- As the byproduct of large-scale evaluations
- TREC: Text REtrieval Conference
- Sponsored by NIST
- Series of annual evaluations, started in 1992
- Organized into tracks
- Larger tracks may draw a few dozen participants
See proceedings online at http://trec.nist.gov/
20Ad Hoc Topics
- In TREC, a statement of information need is
called a topic
Title: Health and Computer Terminals
Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.
21Obtaining Judgments
- Exhaustive assessment is usually impractical
- TREC has 50 queries
- Collection has > 1 million documents
- Random sampling won't work
- If relevant docs are rare, none may be found!
- IR systems can help focus the sample
- Each system finds some relevant documents
- Different systems find different relevant documents
- Together, enough systems will find most of them
- Leverages cooperative evaluations
22Pooling Methodology
- Systems submit top 1000 documents per topic
- Top 100 documents from each are judged
- Single pool, duplicates removed, arbitrary order
- Judged by the person who developed the topic
- Treat unevaluated documents as not relevant
- Evaluate down to 1000 documents
- To make pooling work:
- Systems must do reasonably well
- Systems must not all do the same thing
- Gather topics and relevance judgments to create a
reusable test collection
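A minimal sketch of the pooling step described above, assuming each run is a ranked list of document IDs for a topic; the run names and IDs are invented.

```python
def build_pool(runs, depth=100):
    """Merge the top-`depth` documents from each run into a single judgment pool.

    `runs` maps a run name to a ranked list of document IDs (best first).
    Duplicates are removed; the pool is returned in arbitrary (sorted) order.
    """
    pool = set()
    for ranked_list in runs.values():
        pool.update(ranked_list[:depth])
    return sorted(pool)

# Hypothetical runs for a single topic
runs = {
    "run_A": ["d12", "d7", "d3", "d99"],
    "run_B": ["d7", "d55", "d12", "d4"],
}
print(build_pool(runs, depth=3))   # the documents a human assessor would judge
```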
23Retrieval Models
- Boolean
- Vector space
- Language Modeling
24What is a model?
- A model is a construct designed to help us understand a complex system
- A particular way of looking at things
- Models inevitably make simplifying assumptions
- What are the limitations of the model?
- Different types of models
- Conceptual models
- Physical analog models
- Mathematical models
25The Central Problem in IR
Information Seeker
Authors
Concepts
Concepts
Query Terms
Document Terms
Do these represent the same concepts?
26The IR Black Box
Documents
Query
Representation Function
Representation Function
Query Representation
Document Representation
Index
Comparison Function
Hits
27How do we represent text?
- How do we represent the complexities of language?
- Keeping in mind that computers don't understand documents or queries
- Simple, yet effective approach: bag of words
- Treat unique words as independent features of the document
28Sample Document
- McDonald's slims down spuds
- Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
- NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
- But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
- But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
- Shares of Oak Brook, Ill.-based McDonald's (MCD: down 0.54 to 23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down 0.80 to 34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
Bag of Words (term frequencies):
- 16: said
- 14: McDonalds
- 12: fat
- 11: fries
- 8: new
- 6: company, french, nutrition
- 5: food, oil, percent, reduce, taste, Tuesday
- ...
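Term counts like those above can be produced with a few lines of Python; this is an illustrative sketch (the lowercasing and regexp tokenizer are my own simplifications, not the exact procedure behind the slide's counts).

```python
import re
from collections import Counter

def bag_of_words(text, stopwords=frozenset()):
    """Tokenize on alphanumeric runs and count term occurrences."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

doc = "McDonald's Corp. is cutting the amount of bad fat in its french fries ..."
counts = bag_of_words(doc, stopwords={"the", "of", "in", "is", "its"})
print(counts.most_common(5))
```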
29What's the point?
- Retrieving relevant information is hard!
- Evolving, ambiguous user needs, context, etc.
- Complexities of language
- To operationalize information retrieval, we must vastly simplify the picture
- Bag-of-words approach
- Information retrieval is all (and only) about matching words in documents with words in queries
- Obviously, not true
- But it works pretty well!
30Representing Documents
Document 1: "The quick brown fox jumped over the lazy dog's back."
Document 2: "Now is the time for all good men to come to the aid of their party."
Stopword list: for, is, of, the, to
[Figure: term-document table indicating which non-stopword terms appear in Document 1 and Document 2]
31Boolean Retrieval
- Weights assigned to terms are either 0 or 1
- 0 represents absence: term isn't in the document
- 1 represents presence: term is in the document
- Build queries by combining terms with Boolean operators
- AND, OR, NOT
- The system returns all documents that satisfy the query
32Boolean View of a Collection
Each column represents the view of a particular
document What terms are contained in this
document?
Each row represents the view of a particular
term What documents contain this term?
To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.
33Sample Queries
dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good      0      1      0      1      0      1      0      1
party     0      0      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1

good AND party → Doc 6, Doc 8
g ∧ p     0      0      0      0      0      1      0      1

good AND party NOT over → Doc 6
g ∧ p ∧ ¬o  0    0      0      0      0      1      0      0
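A minimal sketch of Boolean retrieval over the incidence table above, using Python sets as the term rows; only the terms shown in the table (good, party, over) are included, so the dog/fox queries are omitted.

```python
# Postings taken from the incidence table above: term -> documents containing it.
index = {
    "good":  {2, 4, 6, 8},
    "party": {6, 8},
    "over":  {1, 3, 5, 7, 8},
}
all_docs = set(range(1, 9))

def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return all_docs - a

print(AND(index["good"], index["party"]))                            # {6, 8}
print(AND(AND(index["good"], index["party"]), NOT(index["over"])))   # {6}
```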
34Proximity Operators
- More precise versions of AND
- NEAR n allows at most n-1 intervening terms
- WITH requires terms to be adjacent and in order
- Other extensions: within n sentences, within n paragraphs, etc.
- Relatively easy to implement, but less efficient
- Store position information for each word in the document vectors
- Perform normal Boolean computations, but treat WITH and NEAR as extra constraints
35Why Boolean Retrieval Works
- Boolean operators approximate natural language
- Find documents about a "good party" that is not "over"
- AND can discover relationships between concepts
- good party
- OR can discover alternate terminology
- excellent party, wild party, etc.
- NOT can discover alternate meanings
- Democratic party
36Why Boolean Retrieval Fails
- Natural language is way more complex
- AND discovers nonexistent relationships
- Terms in different sentences, paragraphs, ...
- Guessing terminology for OR is hard
- good, nice, excellent, outstanding, awesome, ...
- Guessing terms to exclude is even harder!
- Democratic party, party to a lawsuit, ...
37Strengths and Weaknesses
- Strengths
- Precise, if you know the right strategies
- Precise, if you have an idea of what you're looking for
- Efficient for the computer
- Weaknesses
- Users must learn Boolean logic
- Boolean logic is insufficient to capture the richness of language
- No control over size of result set: either too many documents or none
- When do you stop reading? All documents in the result set are considered equally good
- What about partial matches? Documents that don't quite match the query may be useful also
38The Vector Space Model
- Let's replace "relevance" with "similarity"
- Rank documents by their similarity with the query
- Treat the query as if it were a document
- Create a query bag-of-words
- Find its similarity to each document
- Rank order the documents by similarity
- Surprisingly, this works pretty well!
39Vector Space Model
[Figure: documents d1-d5 represented as vectors in a space whose axes are terms t1, t2, t3; the angle between vectors measures how similar they are]

Postulate: Documents that are close together in vector space talk about the same things.
Therefore, retrieve documents based on how close each document is to the query (i.e., similarity ≈ closeness).
40Similarity Metric
- How about |d1 - d2|?
- This is the Euclidean distance between the vectors
- Why is this not a good idea?
- Instead of distance, use the angle between the vectors
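A minimal sketch of the angle-based comparison: cosine similarity between sparse term-weight vectors, sim(u, v) = (u · v) / (|u| |v|). The example query and document vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = {"fat": 1.0, "fries": 1.0}
doc   = {"fat": 0.8, "fries": 0.5, "oil": 0.3}
print(cosine(query, doc))
```

Unlike raw Euclidean distance, the cosine is insensitive to vector magnitude, so long documents are not penalized simply for containing more words.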
41How do we weight doc terms?
- Here's the intuition:
- Terms that appear often in a document should get high weights
- Terms that appear in many documents should get low weights
- How do we capture this mathematically?
- Term frequency
- Inverse document frequency
The more often a document contains the term "dog", the more likely it is that the document is about dogs.
Words like "the", "a", "of" appear in (nearly) all documents.
42TF.IDF Term Weighting
w_{i,j} = tf_{i,j} × log(N / df_i)

where
w_{i,j}  = weight assigned to term i in document j
tf_{i,j} = number of occurrences of term i in document j
N        = number of documents in the entire collection
df_i     = number of documents containing term i
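A sketch of tf.idf weighting following the formula above; the tiny toy collection is invented, and real systems typically add length normalization and other refinements.

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox jumped over the lazy dog".split(),
    "d2": "now is the time for all good men to come to the aid of their party".split(),
    "d3": "the dog barked at the lazy cat".split(),
}

N = len(docs)
# document frequency: in how many documents does each term appear?
df = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf(doc_id):
    tf = Counter(docs[doc_id])
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf("d1")["dog"])   # appears in 2 of 3 docs -> lower idf
print(tfidf("d1")["fox"])   # appears in 1 of 3 docs -> higher idf
```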
43What is a Language Model?
- Probability distribution over strings of text
- How likely is a string in a given language?
- Probabilities depend on what language we're modeling

p1 = P("a quick brown dog")
p2 = P("dog quick a brown")
p3 = P("[Russian word] brown dog")
p4 = P("[Russian words]")

In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4
44Noisy-Channel Model of IR
[Figure: the user's information need on one side, a document collection d1, d2, ..., dn on the other, and the query in between]

The user has an information need, thinks of a relevant document, and writes down some queries.
Task of information retrieval: given the query, figure out which document it came from.
45Retrieval w/ Language Models
- Build a model for every document
- Rank document d based on P(M_D | q)
- Expand using Bayes' Theorem: P(M_D | q) = P(q | M_D) P(M_D) / P(q)
- Same as ranking by P(q | M_D)
P(q) is the same for all documents, so it doesn't change ranks.
P(M_D), the prior, is assumed to be the same for all d.
46What does it mean?
Ranking by P(M_D | q)
is the same as ranking by P(q | M_D)
47Ranking Models?
Ranking by P(q | M_D)
is the same as ranking documents (each document has its own model M_D)
48Unigram Language Model
- Assume each word is generated independently
- Obviously, this is not true
- But it seems to work well in practice!
- The probability of a string, given a model:

P(q | M_D) = P(q1 q2 ... qn | M_D) = P(q1 | M_D) × P(q2 | M_D) × ... × P(qn | M_D)

The probability of a sequence of words decomposes into a product of the probabilities of the individual words.
49Modeling
- How do we build a language model for a document?
What's in the urn?
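A minimal sketch of query-likelihood retrieval with unigram document models. Maximum-likelihood estimates from each document are mixed with a collection-wide model (linear interpolation smoothing) so that query terms missing from a document don't zero out the product; the documents and the λ value are made up.

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox jumped over the lazy dog".split(),
    "d2": "now is the time for all good men to come to the aid of their party".split(),
}

collection = Counter(t for tokens in docs.values() for t in tokens)
coll_len = sum(collection.values())

def score(query, doc_id, lam=0.5):
    """log P(q | M_D) under a smoothed unigram model of the document."""
    tf = Counter(docs[doc_id])
    dlen = len(docs[doc_id])
    logp = 0.0
    for term in query.split():
        p_doc = tf[term] / dlen              # maximum-likelihood estimate
        p_coll = collection[term] / coll_len  # background model
        logp += math.log(lam * p_doc + (1 - lam) * p_coll)
    return logp

for d in docs:
    print(d, score("lazy dog", d))   # d1 scores higher: it generated these words
```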
50NLP for IR
51The Central Problem in IR
Information Seeker
Authors
Concepts
Concepts
Query Terms
Document Terms
Do these represent the same concepts?
52Why is IR hard?
- IR is hard because natural language is so rich (among other reasons)
- What are the issues?
- Tokenization
- Morphological Variation
- Synonymy
- Polysemy
- Paraphrase
- Ambiguity
- Anaphora
53Possible Solutions
- Vary the unit of indexing
- Strings and segments
- Tokens and words
- Phrases and entities
- Senses and concepts
- Manipulate queries and results
- Term expansion
- Post-processing of results
54Tokenization
- What's a word?
- First try: words are separated by spaces
- What about clitics?
- What about languages without spaces?
- Same problem with speech!
"I'm not saying that I don't want John's input on this."
[Example: a Chinese sentence, written without spaces between words]
55Word-Level Issues
- Morphological variation
- different forms of the same concept
- Inflectional morphology: same part of speech
- Derivational morphology: different parts of speech
- Synonymy
- different words, same meaning
- Polysemy
- same word, different meanings

break, broke, broken; sing, sang, sung; etc.
destroy, destruction; invent, invention, reinvention; etc.
dog, canine, doggy, puppy, etc. → the concept of "dog"
"Bank": financial institution or side of a river? "Crane": bird or construction equipment? "Is": depends on what the meaning of "is" is!
56Paraphrase
- Language provides different ways of saying the
same thing
- Who killed Abraham Lincoln?
- John Wilkes Booth killed Abraham Lincoln.
- John Wilkes Booth altered history with a bullet.
He will forever be known as the man who ended
Abraham Lincoln's life.
- When did Wilt Chamberlain score 100 points?
- Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
- On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn't expect it to last long, saying, "He'll get 100 points someday." McGuire's prediction came true just a few months later in a game against the New York Knicks on March 2.
57Ambiguity
- What exactly do you mean?
- Why don't we have problems (most of the time)?
58Ambiguity in Action
- Different documents with the same keywords may
have different meanings
What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
What do frogs eat? (keywords: frogs, eat)
- Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
- Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
- Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
All three passages contain the keywords "frogs" and "eat", but only the first answers the question.
59Anaphora
- Language provides different ways of referring to
the same entity
- Who killed Abraham Lincoln?
- John Wilkes Booth killed Abraham Lincoln.
- John Wilkes Booth altered history with a bullet.
He will forever be known as the man who ended
Abraham Lincoln's life.
- When did Wilt Chamberlain score 100 points?
- Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
- On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn't expect it to last long, saying, "He'll get 100 points someday." McGuire's prediction came true just a few months later in a game against the New York Knicks on March 2.
60More Anaphora
- Terminology
- Anaphor: an expression that refers to another
- Anaphora: the phenomenon
- Other types of referring expressions exist
- Anaphora resolution can be hard!

Fujitsu and NEC said they were still investigating, and that knowledge of more such bids could emerge... Other major Japanese computer companies contacted yesterday said they have never made such bids.

The hotel recently went through a $200 million restoration; original artworks include an impressive collection of Greek statues in the lobby.

The city council denied the demonstrators a permit because they feared violence. / ... because they advocated violence.
61What can we do?
- Here are some of the problems:
- Tokenization
- Morphological variation, synonymy, polysemy
- Paraphrase, ambiguity
- Anaphora
- General approaches
- Vary the unit of indexing
- Manipulate queries and results
62What do we index?
- In information retrieval, we are after the concepts represented in the documents
- But we can only index strings
- So what's the best unit of indexing?
63The Tokenization Problem
- In many languages, words are not separated by spaces
- Tokenization: separating a string into words
- Simple greedy approach:
- Start with a list of every possible term (e.g., from a dictionary)
- Look for the longest word at the start of the unsegmented string
- Take the longest matching term as the next word and repeat
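A sketch of the greedy longest-match segmenter just described, with a tiny invented dictionary; characters that match no dictionary entry fall back to single-character tokens.

```python
def greedy_segment(text, dictionary):
    """Repeatedly take the longest dictionary entry that prefixes the remaining string."""
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)   # single characters are the fallback
                i += length
                break
    return tokens

dictionary = {"information", "retrieval", "inform", "nation"}
print(greedy_segment("informationretrieval", dictionary))
# ['information', 'retrieval']
```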
64Probabilistic Segmentation
- For an input string c1 c2 c3 ... cn
- Try all possible partitions
- Choose the highest-probability partition
- E.g., compute P(c1 c2 c3) using a language model
- Challenges: search, probability estimation
(Possible partitions: c1 | c2 c3 c4 ... cn, c1 c2 | c3 c4 ... cn, c1 c2 c3 | c4 ... cn, etc.)
65Indexing N-Grams
- Consider a Chinese document c1 c2 c3 ... cn
- Don't segment (you could be wrong!)
- Instead, treat every character bigram as a term
- Break up queries the same way
- Works at least as well as trying to segment correctly!
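A sketch of character-bigram indexing: instead of segmenting, every overlapping character pair becomes an index term, and queries are broken up the same way. A short made-up string stands in for Chinese text.

```python
def char_bigrams(text):
    """Treat every overlapping character pair as an index term."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc_terms = set(char_bigrams("abcde"))      # stand-in for an unsegmented document
query_terms = char_bigrams("bcd")           # the query is processed the same way
print(doc_terms)                            # {'ab', 'bc', 'cd', 'de'}
print(all(t in doc_terms for t in query_terms))   # True: all query bigrams match
```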
66Morphological Variation
- Handling morphology: related concepts have different forms
- Inflectional morphology: same part of speech
- Derivational morphology: different parts of speech
- Different morphological processes
- Prefixing
- Suffixing
- Infixing
- Reduplication

dogs = dog + PLURAL
broke = break + PAST
destruction = destroy + -ion
researcher = research + -er
67Stemming
- Dealing with morphological variation: index stems instead of words
- Stem: a word equivalence class that preserves the central concept
- How much to stem?
- organization → organize → organ?
- resubmission → resubmit/submission → submit?
- reconstructionism?
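A toy suffix-stripping stemmer to illustrate mapping words to stem equivalence classes; this is a simplified sketch, not the Porter or Krovetz algorithm discussed in the references below.

```python
SUFFIXES = ["ational", "ization", "ations", "ation", "ing", "ers", "er", "s"]

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["organizations", "organizing", "researcher", "dogs"]:
    print(w, "->", crude_stem(w))
# organizations and organizing collapse to the same stem "organiz"
```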
68Does Stemming Work?
- Generally, yes! (in English)
- Helps more for longer queries
- Lots of work done in this area
Donna Harman. (1991) "How Effective is Suffixing?" Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) "Viewing Morphology as an Inference Process." Proceedings of SIGIR 1993.
David A. Hull. (1996) "Stemming Algorithms: A Case Study for Detailed Evaluation." Journal of the American Society for Information Science, 47(1):70-84.
And others.
69Stemming in Other Languages
- Arabic makes frequent use of infixes
- What's the most effective stemming strategy in Arabic? Open research question!
maktab (office), kitaab (book), kutub (books), kataba (he wrote), naktubu (we write), etc. all share the root ktb
70Words are the wrong indexing unit!
- Synonymy
- different words, same meaning
- Polysemy
- same word, different meanings
- It'd be nice if we could index concepts!
- Word sense: a coherent cluster in semantic space
- Indexing word senses achieves the effect of conceptual indexing

dog, canine, doggy, puppy, etc. → the concept of "dog"
"Bank": financial institution or side of a river? "Crane": bird or construction equipment?
71Indexing Word Senses
- How does indexing word senses solve the synonymy/polysemy problem?
- Okay, so where do we get the word senses?
- WordNet
- Automatically find clusters of words that describe the same concepts
- Other methods have also been tried

dog, canine, doggy, puppy, etc. → concept 112986
"I deposited my check in the bank." bank → concept 76529
"I saw the sailboat from the bank." bank → concept 53107
72Word Sense Disambiguation
- Given a word in context, automatically determine the sense (concept)
- This is the Word Sense Disambiguation (WSD) problem
- Context is the key
- For each ambiguous word, note the surrounding words
- Learn a classifier from a collection of examples
- Use the classifier to determine the senses of words in the documents

bank + {river, sailboat, water, etc.} → side of a river
bank + {check, money, account, etc.} → financial institution
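A sketch of context-based disambiguation in the spirit of this slide: each sense is represented by a bag of indicative context words (hand-picked here, rather than learned from labeled examples as a real classifier would be), and the sense whose signature overlaps the context most wins. As the next slide notes, this kind of disambiguation has not reliably helped retrieval.

```python
SENSE_SIGNATURES = {
    "bank/finance": {"check", "money", "account", "deposit", "loan"},
    "bank/river":   {"river", "sailboat", "water", "shore", "fishing"},
}

def disambiguate(context_words):
    """Pick the sense whose signature shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    return max(SENSE_SIGNATURES, key=lambda s: len(SENSE_SIGNATURES[s] & context))

print(disambiguate("I deposited my check in the bank".split()))   # bank/finance
print(disambiguate("I saw the sailboat from the bank".split()))   # bank/river
```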
73Does it work?
- Nope!
- Examples of limited success:

Ellen M. Voorhees. (1993) "Using WordNet to Disambiguate Word Senses for Text Retrieval." Proceedings of SIGIR 1993.
Mark Sanderson. (1994) "Word-Sense Disambiguation and Information Retrieval." Proceedings of SIGIR 1994.
Hinrich Schütze and Jan O. Pedersen. (1995) "Information Retrieval Based on Word Senses." Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan. (2000) "Semantic Indexing Using WordNet Senses." Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
And others.
74Why Disambiguation Hurts
- Bag-of-words techniques already disambiguate
- Context for each term is established in the query
- WSD is hard!
- Many words are highly polysemous, e.g., interest
- Granularity of senses is often domain/application specific
- WSD tries to improve precision
- But incorrect sense assignments would hurt recall
- Slight gains in precision do not offset large drops in recall
75An Alternate Approach
- Indexing word senses freezes concepts at index time
- What if we expanded query terms at query time instead?
- Two approaches
- Manual thesaurus, e.g., WordNet, UMLS, etc.
- Automatically derived thesaurus, e.g., from co-occurrence statistics

dog AND cat → (dog OR canine) AND (cat OR feline)
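A sketch of query-time term expansion with a small hand-built synonym table (standing in for WordNet, UMLS, or a co-occurrence-derived thesaurus); each query term is rewritten as an OR of itself and its expansions.

```python
THESAURUS = {
    "dog": ["canine"],
    "cat": ["feline"],
}

def expand_query(terms):
    """Rewrite each term as (term OR synonym OR ...); join terms with AND."""
    clauses = []
    for term in terms:
        alternatives = [term] + THESAURUS.get(term, [])
        clauses.append("( " + " OR ".join(alternatives) + " )")
    return " AND ".join(clauses)

print(expand_query(["dog", "cat"]))
# ( dog OR canine ) AND ( cat OR feline )
```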
76Does it work?
- Yes, if done carefully
- User should be involved in the process
- Otherwise, poor choice of terms can hurt
performance
77Handling Anaphora
- Anaphora resolution: finding what the anaphor refers to (i.e., the antecedent)
- Most common example: pronominal anaphora resolution
- Simplest method works pretty well: find the previous noun phrase matching in gender and number

John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life. (He = John Wilkes Booth)
78Expanding Anaphors
- When indexing, replace anaphors with their antecedents
- Does it work?
- Somewhat...
- but it can be computationally expensive
- helps more if you want to retrieve sub-document segments
79Beyond Word-Level Indexing
- Words are the wrong unit to index
- Many multi-word combinations identify entities
- Persons: George W. Bush, Dr. Jones
- Organizations: Red Cross, United Way
- Corporations: Hewlett Packard, Kraft Foods
- Locations: Easter Island, New York City
- Entities often have finer-grained structures

Professor Stephen W. Hawking → title (Professor), first name (Stephen), middle initial (W.), last name (Hawking)
Cambridge, Massachusetts → city (Cambridge), state (Massachusetts)
80Indexing Named Entities
- Why would we want to index named entities?
- Index named entities as special tokens
- And treat special tokens like query terms
- Works pretty well for question answering
"Who patented the light bulb?" → patent light bulb PERSON
"When was the light bulb patented?" → patent light bulb DATE

John Prager, Eric Brown, and Anni Coden. (2000) "Question-Answering by Predictive Annotation." Proceedings of SIGIR 2000.
81Indexing Phrases
- Two types of phrases
- Those that make sense, e.g., "school bus", "hot dog"
- Those that don't, e.g., bigrams in Chinese
- Treat multi-word tokens as index terms
- Three sources of evidence
- Dictionary lookup
- Linguistic analysis
- Statistical analysis (e.g., co-occurrence)
82Known Phrases
- Compile a term list that includes phrases
- Technical terminology can be very helpful
- Index any phrase that occurs in the list
- Most effective in a limited domain
- Otherwise hard to capture most useful phrases
83Syntactic Phrases
- Parsing: automatically assign structure to a sentence
- Walk the tree and extract phrases
- Index all noun phrases
- Index subjects and verbs
- Index verbs and objects
- etc.

[Figure: parse tree for "The quick brown fox jumped over the lazy black dog", with a sentence node spanning a noun phrase, a verb, and a prepositional phrase containing another noun phrase; leaves are tagged Det, Adj, Noun, Verb, Prep]
84Syntactic Variations
- What does linguistic analysis buy?
- Coordinations
- Substitutions
- Permutations
lung and breast cancer → lung cancer, breast cancer
inflammatory sinonasal disease → inflammatory disease, sinonasal disease
addition of calcium → calcium addition
85Statistical Analysis
- Automatically discover phrases based on co-occurrence probabilities
- If terms are not independent, they may form a phrase
- Use this method to automatically learn a phrase dictionary

P(kick the bucket) >> P(kick) × P(the) × P(bucket)?
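A sketch of statistical phrase discovery: compare the observed probability of a bigram with what independence would predict (pointwise mutual information). The tiny corpus here is invented, and real systems would also apply frequency cutoffs.

```python
import math
from collections import Counter

tokens = ("kick the bucket and kick the ball and fill the bucket "
          "and kick the habit").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """log [ P(w1 w2) / (P(w1) * P(w2)) ]: large values suggest a phrase."""
    p_joint = bigrams[(w1, w2)] / n_bi
    p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
    return math.log(p_joint / p_indep)

print(pmi("kick", "the"))    # the pair occurs together far more than chance
print(pmi("the", "bucket"))
```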
86Does Phrasal Indexing Work?
- Yes
- But the gains are so small they're not worth the cost
- Primary drawback: too slow!
87What about ambiguity?
- Different documents with the same keywords may
have different meanings
What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
What do frogs eat? (keywords: frogs, eat)
- Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
- Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
- Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
All three passages contain the keywords "frogs" and "eat", but only the first answers the question.
88Indexing Relations
- Instead of terms, index syntactic relations
between entities in the text
Adult frogs eat mainly insects and other small
animals, including earthworms, minnows, and
spiders.
< frogs subject-of eat >  < insects object-of eat >  < animals object-of eat >  < adult modifies frogs >  < small modifies animals >
Alligators eat many kinds of small animals that
live in or near the water, including fish,
snakes, frogs, turtles, small mammals, and birds.
< alligators subject-of eat >  < kinds object-of animals >  < small modifies animals >

From the relations, it is clear who's eating whom!
89Are syntactic relations enough?
- Consider this example:

John broke the window. / The window broke.
< John subject-of break >  < window subject-of break >
"John" and "window" are both subjects, but John is the person doing the breaking (the "agent") and the window is the thing being broken (the "theme").

- Syntax sometimes isn't enough: we need semantics (or meaning)!
- Semantics, for example, allows us to relate the following two fragments:

The barbarians destroyed the city. / The destruction of the city by the barbarians.
event: destroy; agent: barbarians; theme: city
90Semantic Roles
- Semantic roles are invariant with respect to syntactic expression
- The idea:
- Identify semantic roles
- Index frame structures with filled slots
- Retrieve answers based on semantic-level matching

Mary loaded the truck with hay. / Hay was loaded onto the truck by Mary.
event: load; agent: Mary; material: hay; destination: truck
91Does it work?
- No, not really
- Why not?
- Syntactic and semantic analysis is difficult: errors offset whatever gain there is
- As with WSD, these techniques are precision-enhancers; recall usually takes a dive
- It's slow!
92Alternative Approach
- Sophisticated linguistic analysis is slow!
- Unnecessary processing can be avoided by query-time analysis
- Two-stage retrieval
- Use standard document retrieval techniques to fetch a candidate set of documents
- Use passage retrieval techniques to choose a few promising passages (e.g., paragraphs)
- Apply sophisticated linguistic techniques to pinpoint the answer
- Passage retrieval
- Find good passages within documents
- Key idea: locate areas where lots of query terms appear close together
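A sketch of the passage-retrieval idea: slide a fixed-size window over a document's tokens and score each window by how many query terms it contains. The window size and the example text are arbitrary choices for illustration.

```python
def best_passage(doc_tokens, query_terms, window=10):
    """Return the highest-scoring window of tokens and its query-term count."""
    query = set(t.lower() for t in query_terms)
    best_score, best_span = -1, (0, 0)
    for start in range(max(1, len(doc_tokens) - window + 1)):
        span = doc_tokens[start:start + window]
        score = sum(1 for t in span if t.lower() in query)
        if score > best_score:
            best_score, best_span = score, (start, start + window)
    return best_score, doc_tokens[best_span[0]:best_span[1]]

doc = ("adult frogs eat mainly insects and other small animals "
       "including earthworms minnows and spiders").split()
print(best_passage(doc, ["frogs", "eat"], window=5))
```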
93Key Ideas
- IR is hard because language is rich and complex (among other reasons)
- Two general approaches to the problem
- Attempt to find the best unit of indexing
- Try to fix things at query time
- It is hard to predict a priori what techniques work
- Questions must be answered experimentally
- Words are really the wrong thing to index
- But there isn't really a better alternative