Title: What is Information Retrieval
1What is Information Retrieval?
- (Process of) fetching information relevant to a user's information need
- Fetch: detect (return pointer to)
- Fetch: extract, compose, summarize, deduce
- Information: documents (text, multimedia, web)
- Relevant: on topic, useful, just right for the task
- Information need: query, question, problem statement, profile
2Information Retrieval: User's Perspective
Information need: What recent disasters occurred in tunnels used for transportation?
[Diagram labels: FORM A QUERY; SEARCH QUERY; SEARCH; DB; RESULTS RANKED BY RELEVANCE; RELEVANCE FEEDBACK; EXTRACT, ORGANIZE, SUMMARIZE; ANSWER]
3Information Structures
Raw Data
User Query
Data Index
Indexed Query
4Information Index
- Access data and documents by content
- inverted index (like subject index in a book)
- usually a hash table of descriptors
- used for retrieval operations
- Access documents by id
- straight index (like table of contents)
- used for relevance feedback operations
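The two access paths above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: documents are already reduced to lists of descriptors, the inverted index is a hash table (a dict) from descriptor to document ids, and the straight index is a dict from document id to descriptors. The function name `build_indexes` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs maps a document id to its list of descriptors (terms)."""
    inverted = defaultdict(set)  # descriptor -> ids of documents containing it
    straight = {}                # document id -> descriptors of that document
    for doc_id, terms in docs.items():
        straight[doc_id] = list(terms)
        for term in terms:
            inverted[term].add(doc_id)
    return inverted, straight

# Hypothetical toy collection
docs = {
    "d1": ["tunnel", "fire", "locomotive"],
    "d2": ["tunnel", "worker", "train"],
    "d3": ["garden", "spring"],
}
inverted, straight = build_indexes(docs)

print(sorted(inverted["tunnel"]))  # access by content (retrieval)
print(straight["d2"])              # access by id (e.g. for relevance feedback)
```

The inverted index answers "which documents mention this descriptor?", while the straight index answers "which descriptors does this document carry?".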
5Information Retrieval System
Raw data
Raw needs
6Origins
- Communication theory revisited
- Problems with transmission of meaning
Noise
7Structure of an IR System
Search Line
Adapted from Soergel, p. 19
8The Standard Retrieval Interaction Model
9Standard Model of IR
- Assumptions
- The goal is maximizing precision and recall simultaneously
- The information need remains static
- The value is in the resulting document set
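For reference, precision and recall can be computed as below. The sketch uses the usual set-based definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that were retrieved); the function name and the document ids are illustrative assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set overlap."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgments
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7", "d8", "d9"],
)
print(p, r)
```

Returning more documents tends to raise recall and lower precision, which is why maximizing both at once is a demanding goal.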
10Problems with Standard Model
- Users learn during the search process
- Scanning titles of retrieved documents
- Reading retrieved documents
- Viewing lists of related topics/thesaurus terms
- Navigating hyperlinks
- Some users don't like long, (apparently) disorganized lists of documents
11IR is an Iterative Process
12IR is a Dialog
- The exchange doesn't end with the first answer
- Users can recognize elements of a useful answer, even when incomplete
- Questions and understanding change as the process continues
13Bates' Berry-Picking Model
- Standard IR model
- Assumes the information need remains the same throughout the search process
- Berry-picking model
- Interesting information is scattered like berries among bushes
- The query is continually shifting
14Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure labels: successive queries Q0, Q1, Q2, Q3, Q4, Q5]
15Berry-Picking Model (cont.)
- The query is continually shifting
- New information may yield new ideas and new directions
- The information need
- Is not satisfied by a single, final retrieved set
- Is satisfied by a series of selections and bits of information found along the way
16Restricted Form of the IR Problem
- The system has available only pre-existing, canned text passages
- Its response is limited to selecting from these passages and presenting them to the user
- It must select, say, 10 or 20 passages out of millions or billions!
17Information Retrieval
- Revised Task Statement
- Build a system that retrieves documents that users are likely to find relevant to their queries
- This set of assumptions underlies the field of Information Retrieval
18Text Indexing
- Controlled vocabulary
- pre-defined terms assigned to a document
- usually a manual process
- requires non-trivial cognitive abilities
- apparent limitations
- Full text indexing
- use all words in text
- linguistic processing to normalize forms
- map onto concepts, events and relationships
19Document Indexing and Querying
- Text: Snow Falling on Cedars by David Guterson
- Index (Library of Congress)
- Fiction, Washington State, Japanese Americans, Trials (Murder), Journalists
- User Request
- Find a book about a Puget Sound fisherman who is found drowned and a Japanese American charged with his murder.
Washington State
20Full Text Indexing
- Text: Gardening, the perennial pleasures of spring. Robin Lane Fox prepares to strike an economic blow for a better garden on a shoestring.
- Terms: gardening, perennial, pleasure, spring, robin, lane, fox, prepare, strike, economic, blow, better, garden, shoestring.
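The step from text to terms can be approximated with a short Python sketch. It assumes a tiny hand-picked stopword list and a deliberately crude normalization rule (strip one trailing "s"); real systems use fuller stopword lists and proper stemmers or lemmatizers. The names `STOPWORDS`, `normalize` and `index_terms` are made up for this example.

```python
import re

# Assumed minimal stopword list, just enough for this example
STOPWORDS = {"a", "an", "and", "for", "of", "on", "the", "to"}

def normalize(word):
    """Crude normalization: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase, tokenize, drop stopwords, normalize, keep first occurrences."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        term = normalize(token)
        if term not in terms:
            terms.append(term)
    return terms

TEXT = ("Gardening, the perennial pleasures of spring. Robin Lane Fox "
        "prepares to strike an economic blow for a better garden on a shoestring.")
terms = index_terms(TEXT)
print(terms)
```

On this sentence the sketch reproduces the term list from the slide; on other text the trailing-"s" rule will make mistakes that a real stemmer would avoid.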
21Automated Text Processing
- Noun phrases: perennial-pleasure, spring, robin-lane-fox, economic-blow, better-garden, shoestring
- Names: robin-lane-fox
- Operator-Argument
- prepare(strike(RLF,economic-blow))
22Querying Indexed Data
- Ask: What is the economic and financial situation of the gardening supplies retailers in New York?
- Query: economic, financial, situation, gardening, supplies, retailers, new, york
- Retrieve: add terms in common, calculate score
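A minimal sketch of the "terms in common" retrieval step follows. It assumes the simplest possible score, the number of distinct query terms that also occur in the document; the second document, `retail_report`, is hypothetical and exists only to give the ranking something to compare.

```python
def overlap_score(query_terms, doc_terms):
    """Score = number of distinct terms shared by query and document."""
    return len(set(query_terms) & set(doc_terms))

query = ["economic", "financial", "situation", "gardening",
         "supplies", "retailers", "new", "york"]

docs = {
    # Terms of the gardening text from the full-text indexing slide
    "gardening_article": ["gardening", "perennial", "pleasure", "spring",
                          "robin", "lane", "fox", "prepare", "strike",
                          "economic", "blow", "better", "garden", "shoestring"],
    # Hypothetical second document
    "retail_report": ["financial", "situation", "retailers", "new", "york",
                      "supplies"],
}

scores = {doc_id: overlap_score(query, terms) for doc_id, terms in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

The gardening article shares only "economic" and "gardening" with the query, so plain overlap ranks it below the hypothetical retail report. Treating every shared term as equally important is the weakness that term weighting addresses.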
23Term Weighting
- How much does a term contribute to content?
- gardening 0.05
- perennial 0.03
- pleasure 0.0002
- spring 0.00007
- robin 0.01
- How often used in a document?
- How often used in the database?
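One common way to combine the two questions above (frequency in the document, frequency in the database) is TF-IDF weighting. The sketch below is an assumption about how such weights could be computed, not the scheme behind the numbers on the slide: it uses relative term frequency multiplied by the natural log of (number of documents / number of documents containing the term). The toy collection is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict of id -> list of terms. Returns id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

# Hypothetical toy collection
docs = {
    "d1": ["gardening", "gardening", "spring", "garden"],
    "d2": ["spring", "market", "economic"],
    "d3": ["spring", "tunnel", "fire"],
}
weights = tf_idf(docs)
for term, w in sorted(weights["d1"].items()):
    print(term, round(w, 4))
```

A term that occurs in every document ("spring" here) gets weight zero, while a term that is frequent in one document and rare elsewhere ("gardening") gets the highest weight.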
24The Notion of Relevance
- Relevant: supplies information asked for by the user.
- Subjective
- Complete information
- Necessary information
- Form of information
- Interpretation of user need
- Representation of user need: the query
25Computational Relevance
- A degree of similarity between documents
- Query to Database: document retrieval
- One document to another: clustering
- Content overlap
- Common descriptors
- Closeness in document space
- Conceptual overlap
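"Closeness in document space" is often measured with cosine similarity between term-count vectors. The slides do not name a specific measure, so the following is a sketch under that assumption; the function name and the example term lists are illustrative.

```python
import math
from collections import Counter

def cosine(a_terms, b_terms):
    """Cosine similarity between two bags of terms (0.0 if either is empty)."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical query and documents
query = ["tunnel", "disaster", "transportation"]
doc_a = ["tunnel", "fire", "locomotive", "tunnel"]
doc_b = ["garden", "spring", "pleasure"]

print(cosine(query, doc_a))  # some overlap
print(cosine(query, doc_b))  # no shared terms
```

The same function covers both uses on the slide: comparing a query with a document (retrieval) and comparing one document with another (clustering).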
26Closeness in Concept Space
What recent disasters occurred in tunnels for
transportation?
27Retrieval Results
- A (ranked) list of hits: relevant documents (according to system)
- Typical format
- Rank, DocId, Score, Title, Abstract
- 1 WSJ910426-1234 0.95738 Locomotive Catches Fire in Swiss Tunnel. A Swiss Railroad locomotive caught fire in a tunnel near Zurich on Wednesday
- 2 APW890714-097841 0.89567 Tragedy in Chunnel.
-
- Ranking by
- Similarity score (Infoseek)
- Linking score (popularity) (Google); hybrid (Lycos)
- Document style, document type, date/update, etc.
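As a small illustration of the typical result format, the sketch below sorts hits by score and prints rank, document id, score and title. The two hits reuse the document ids, scores and titles shown on the slide; the dictionary layout and the function name `format_ranked` are assumptions made for this example.

```python
def format_ranked(hits):
    """Sort hits by descending score and render 'Rank DocId Score Title' lines."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    return [
        f"{rank} {h['doc_id']} {h['score']:.5f} {h['title']}"
        for rank, h in enumerate(ranked, start=1)
    ]

# Hits in arbitrary order, as a scoring component might produce them
hits = [
    {"doc_id": "APW890714-097841", "score": 0.89567,
     "title": "Tragedy in Chunnel."},
    {"doc_id": "WSJ910426-1234", "score": 0.95738,
     "title": "Locomotive Catches Fire in Swiss Tunnel."},
]
lines = format_ranked(hits)
for line in lines:
    print(line)
```

Swapping the sort key (for example to a link-based or combined score) changes the ranking criterion without touching the presentation.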
28Search Results (Google)
GoogleMM.htm
1. DBLP Miroslav Martinovic dblp.uni-trier.de Miroslav Martinovic. ... 2002. 2, EE, G. Sampath, Miroslav Martinovic A Multilevel Text Processing Model of Newsgroup Dynamics. NLDB 2002 208-212. ... www.informatik.uni-trier.de/ley/db/indices/a-tree/m/MartinovicMiroslav.html - 3k - Cached - Similar pages
2. Home Page of Dr. M. Martinovic Miroslav Martinovic, Ph.D. ... www.tcnj.edu/mmmartin/ - 16k - Cached - Similar pages
3. CRA-W ... Design and Development of a Word Conflation Module Student Researchers Emily Gibson, Christina Grape Advisor Dr. Miroslav Martinovic Institution The College ... www.cra.org/Activities/craw/crew/crewReports/2002/newjersey_final.html - 14k - Cached - Similar pages
4. tomek strzalkowski - ResearchIndex document query ... 171-186 acl.ldc.upenn.edu/J/J93/J93-4009.pdf Comparing Two Grammar-Based Generation - Case Study Miroslav (Correct) A Case Study Miroslav Martinovic And Tomek ... citeseer.nj.nec.com/cs?q TomekStrzalkowski - 19k - Cached - Similar pages
5. SpringerLink - Volume ... 208 - 212. G. Sampath, Miroslav Martinovic. Best Feature Selection for Maximum Entropy-Based Word Sense Disambiguation, pp. 213 - 217. ... link.springer.de/link/service/series/0558/tocs/t2553.htm - 37k - Cached - Similar pages
29Search Results (Alta Vista)
AltaVistaMM.htm
1. DBLP Miroslav Martinovic Miroslav Martinovic List of publications from the DBLP Bibliography Server - FAQ Ask others ACM - CiteSeer - CSB - Google ... 2002 2 EE G. Sampath, Miroslav Martinovic A Multilevel Text ... www.acm.org/sigs/sigmod/dblp/db/indices/...cMiroslav.html More pages from www.acm.org
2. ABSTRACT ... AND LINGUISTIC APPROACHES IN BUILDING INTELLIGENT QUESTION ANSWERING SYSTEMS Miroslav Martinovic TOPIC AREA Question Answering, Mathematical Linguistics, Quantitative/Qualitative Linguistics ... www.trenton.edu/mmmartin/SSGRR-ABS.HTML Related Pages More pages from www.trenton.edu
3. ABSTRACT To Appear Transforming A Word Conflation Algorithm into A Minimal Stem Algorithm Miroslav Martinovic TOPIC AREA Word Conflation, Information Retrieval, Stem Dictionaries, NLP Tools and Resources ... www.trenton.edu/mmmartin/CREW-ABS.html
4. SpringerLink Lecture Notes in Computer Science 2553 A Multilevel Text Processing Model of Newsgroup Dynamics G. Sampath and Miroslav Martinovic Department of Computer Science, The College of New Jersey, Ewing, NJ 08628 sampath,mmmartin_at_tcnj.edu ... link.springer.de/link/service/series/055...53/25530208.htm Related Pages More pages from link.springer.de
5. COMPARING TWO GRAMMAR-BASED GENERATION ALGORITHMS A CASE STUDY File type: PDF - Download PDF Reader title COMPARING TWO GRAMMAR-BASED GENERATION ALGORITHMS A CASE STUDY author Miroslav Martinovic Tomek Strzaikowski creation data D20020326132025-05'00' revision date D20020403103942-05'00' www.ldc.upenn.edu/acl/P/P92/P92-1011.pdf More pages from www.ldc.upenn.edu
30Search Results (Yahoo)
YahooMM.htm
1. DBLP Miroslav Martinovic dblp.uni-trier.de Miroslav Martinovic. ... 2002. 2, EE, G. Sampath, Miroslav Martinovic A Multilevel Text Processing Model of Newsgroup Dynamics. NLDB 2002 208-212. ... www.informatik.uni-trier.de/ley/db/indices/a-tree/m/MartinovicMiroslav.html cached more results from this site
2. Home Page of Dr. M. Martinovic Miroslav Martinovic, Ph.D. ... www.tcnj.edu/mmmartin/ cached more results from this site
3. CRA-W ... Design and Development of a Word Conflation Module Student Researchers Emily Gibson, Christina Grape Advisor Dr. Miroslav Martinovic Institution The College ... www.cra.org/Activities/craw/crew/crewReports/2002/newjersey_final.html cached more results from this site
4. tomek strzalkowski - ResearchIndex document query ... 171-186 acl.ldc.upenn.edu/J/J93/J93-4009.pdf Comparing Two Grammar-Based Generation - Case Study Miroslav (Correct) A Case Study Miroslav Martinovic And Tomek ... citeseer.nj.nec.com/cs?q TomekStrzalkowski cached
5. SpringerLink - Volume ... 208 - 212. G. Sampath, Miroslav Martinovic. Best Feature Selection for Maximum Entropy-Based Word Sense Disambiguation, pp. 213 - 217. ... link.springer.de/link/service/series/0558/tocs/t2553.htm cached more results from this site
6. Michael E. Locasto Projects ... projects Cohorts are in 's. QASTIIR Question/Answer System for Intelligent Information Retrieval Michael Hulme, Miroslav Martinovic ... www1.cs.columbia.edu/locasto/projects/ cached
7. List of Papers ... 34. Integrating Statistical and Linguistic Approaches in Building Intelligent Question Answering Systems. Miroslav Martinovic. 35. ... www.ssgrr.it/en/ssgrr2002w/papers.htm cached more results from this site
31The IR Tasks
- Ad-hoc Querying
- Filtering and Routing
- Topic Detection and Tracking
- Question Answering
- Automated Summarization
- Information Fusion
32Ad Hoc Querying
- Ask arbitrary queries against database
- Most Internet search is ad hoc
- Probably the hardest of all IR tasks
- Reflects real-life tasks
- Literature search/research
- Legal case preparation
- Medical diagnosis
33Cross-Lingual IR
- Ask query in the user's native language
- E.g., English, French
- Database consists of documents
- in another language (e.g., Mandarin)
- multiple languages (e.g., FBIS, WWW)
- Full-text indexing in source language
- Machine translation unreliable
34Filtering and Routing
- Fixed queries against a data stream
- news broadcast, newswire service
- real time/no collection (filtering) or floating collection (routing)
- one assignment per document (classification), or multiple assignments
- Adaptive filtering
- Relevance threshold
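Filtering with a fixed query and a relevance threshold can be sketched as a generator over a document stream. The score used here (fraction of profile terms found in the document) and the threshold value are assumptions chosen for illustration; the stream items and the name `filter_stream` are hypothetical.

```python
def filter_stream(stream, profile_terms, threshold):
    """Yield (doc_id, score) for stream documents scoring at or above threshold.

    The score is the fraction of profile terms present in the document.
    """
    profile = set(profile_terms)
    for doc_id, terms in stream:
        score = len(profile & set(terms)) / len(profile) if profile else 0.0
        if score >= threshold:
            yield doc_id, score

# Hypothetical newswire stream: (document id, terms)
stream = [
    ("n1", ["tunnel", "fire", "swiss"]),
    ("n2", ["garden", "spring"]),
    ("n3", ["tunnel", "accident", "train", "fire"]),
]
profile = ["tunnel", "fire", "accident"]  # the fixed query

accepted = list(filter_stream(stream, profile, threshold=0.6))
print(accepted)
```

Each document is judged once as it arrives and nothing is stored, which matches the "real time/no collection" case. An adaptive filter would additionally update `profile` or `threshold` as feedback comes in.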
35Topic Detection and Tracking
- A special form of filtering
- Abstraction of real-life tasks
- News reporting (e.g., NBC)
- Intelligence gathering (e.g., CIA)
- Financial markets
- Detect and track stories on topics of interest in continuous data stream
- Sources: text, audio, video, multimedia
36TDT Concept
[Diagram labels: sources (ABC, NBC, PRI, APW, Reuters, TsingHua); time]
37TDT Baseline Tasks
- Segmentation: Disjoint, Homogeneous Regions (Stories or Non-Stories)
- Detection (with or without Segmentation): Stories about Some Topic
- Tracking (with or without Segmentation): Training Story, More Stories about Same Topic
38TDT Application: Source Coordination
newswire feeds
satellite video feeds
text
audio
text
stories
video
39Topics Formats vs. Ratings
time
40Question Answering
- Supply actual answers to user questions
- How long does it take to fly from Paris to New York on a Concorde?
- 3.5 hours
- Find relevant information, not documents
- Extract information, convert into an answer
- Ranges from trivia to research problems
- Commercial service: AskJeeves
41Why Question-Answering?
Users want information, not lists of documents
Query: What disasters have occurred in tunnels used for transportation?
Answer: A Swiss locomotive caught fire in a Zurich railway tunnel in 1991 and more than 50 passengers were injured. In 1992, a tunnel worker died after being hit by a works train at the British end of the Channel.
42Complexity of Question Answering
- Question Scope
- Simple factual to template/compounds to exploratory
- Question context
- Isolated trivia-style to task context to user knowledge context
- Complexity
- Atomic questions to decomposable compound problems
- Interpretation (of answer)
- Explicit facts to groups of facts to hypothesis forming
- Answer fusion
- Concatenation to clustering to composition
- Sources of data
- Single source to multiple sources to multimedia/multilingual
43QA Application: Automated Call Center
Dialogue Agent
telephony interface
telephone Internet multilingual
Automatic Call Router
- I need the part number for a suction cup, please.
BackEnd Service Support service specialists and
information
- What product do you need this part for?
44Automated Summarization
- Summarize content of a text/media document(s)
- Informative vs. Indicative
- Generic vs. Topical
- main content vs. topic-related
- Single text vs. Cross-document
- topic aspect detection
- topical briefs, topic tracking
- Info Organization and Visualization
- Multiple views of info space
- Rapid comprehension
45Cross-Document Summarization
- Cluster documents into topics
- Derive cluster summaries
- Form a cross-document summary
46Cross-media Summarization
language
ontology
video
47Fused Information Retrieval
1. NYT991028 2. 3. ...
What's the current financial situation at Airbus?
NYTimes
APWire
DJFin
Medline
USPTO
48Information Understanding Tools