Title: What is Information Retrieval
1What is Information Retrieval?
- (Process of) fetching information relevant to a user's information need
- Fetch: detect (return pointer to)
- Fetch: extract, compose, summarize, deduce
- Information: documents (text, multimedia, web)
- Relevant: on topic, useful, just right for the task
- Information need: query, question, problem statement, profile
2Information Retrieval: User's Perspective
[Figure: the retrieval cycle from the user's side. Labels: FORM A QUERY, SEARCH QUERY, SEARCH, DB, RESULTS RANKED BY RELEVANCE, RELEVANCE FEEDBACK, EXTRACT / ORGANIZE / SUMMARIZE, ANSWER.]
Information need: What recent disasters occurred in tunnels used for transportation?
3Information Structures
Raw Data
User Query
Data Index
Indexed Query
4Information Index
- Access data and documents by content
- inverted index (like subject index in a book)
- usually a hash table of descriptors
- used for retrieval operations
- Access documents by id
- straight index (like table of contents)
- used for relevance feedback operations
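The two access paths above can be sketched in a few lines. This is only an illustration, not the design of any particular system: the function name `build_indexes` and the three sample documents are hypothetical, and Python dictionaries stand in for the hash table of descriptors.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build both indexes from a mapping of doc id -> list of descriptors."""
    inverted = defaultdict(set)  # descriptor -> doc ids (access by content)
    straight = {}                # doc id -> descriptors (access by id)
    for doc_id, terms in docs.items():
        straight[doc_id] = list(terms)
        for term in terms:
            inverted[term].add(doc_id)
    return inverted, straight

# Hypothetical sample collection, for illustration only.
docs = {
    "d1": ["gardening", "spring", "garden"],
    "d2": ["tunnel", "fire", "locomotive"],
    "d3": ["garden", "retailers", "economic"],
}
inverted, straight = build_indexes(docs)

print(sorted(inverted["garden"]))  # documents that mention "garden"
print(straight["d2"])              # descriptors stored for document d2
```

The inverted index answers "which documents contain this term?", which is what retrieval needs; the straight index answers "which terms does this document contain?", which is what relevance feedback needs once the user has marked a document.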
5Information Retrieval System
Raw data
Raw needs
6Text Indexing
- Controlled vocabulary
- pre-defined terms assigned to a document
- usually a manual process
- requires non-trivial cognitive abilities
- apparent limitations
- Full text indexing
- use all words in text
- linguistic processing to normalize forms
- map onto concepts, events and relationships
7Document Indexing and Querying
- Text: Snow Falling on Cedars by David Guterson
- Index (Library of Congress): Fiction, Washington State, Japanese Americans, Trials (Murder), Journalists
- User Request: Find a book about a Puget Sound fisherman who is found drowned and a Japanese American charged with his murder.
Washington State
8Full Text Indexing
- Text: Gardening, the perennial pleasures of spring. Robin Lane Fox prepares to strike an economic blow for a better garden on a shoestring.
- Terms: gardening, perennial, pleasure, spring, robin, lane, fox, prepare, strike, economic, blow, better, garden, shoestring.
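A rough sketch of how such a term list could be derived from the text. The stopword list and the suffix rule are simplifying assumptions chosen for this one example; real linguistic normalization (stemming, lemmatization) is considerably more involved.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "to", "for", "on"}

def normalize(word):
    """Crude normalization: drop a trailing plural or verb "s"."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase, tokenize, drop stopwords, normalize, keep first occurrences."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        term = normalize(word)
        if term not in terms:
            terms.append(term)
    return terms

text = ("Gardening, the perennial pleasures of spring. Robin Lane Fox prepares "
        "to strike an economic blow for a better garden on a shoestring.")
terms = index_terms(text)
print(terms)
```

On this sentence the sketch yields the same fourteen terms listed on the slide; on other text the naive suffix rule would make mistakes (for example, it would turn "news" into "new").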
9Automated Text Processing
- Noun phrases: perennialpleasure, spring, robinlanefox, economicblow, bettergarden, shoestring
- Names: robinlanefox
- Operator-Argument
- prepare(strike(RLF,economic-blow))
10Querying Indexed Data
- Ask: What is the economic and financial situation of the gardening supplies retailers in New York?
- Query: economic, financial, situation, gardening, supplies, retailers, new, york
- Retrieve: add terms in common, calculate score
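The retrieve step can be sketched as counting the terms a query and a document share. The scoring function below is the simplest possible choice and is an assumption for illustration; the document ids are hypothetical, and the first document reuses the terms from the full-text indexing slide.

```python
def overlap_score(query_terms, doc_terms):
    """Score = number of distinct terms the query and the document share."""
    return len(set(query_terms) & set(doc_terms))

def retrieve(query_terms, indexed_docs):
    """Return (doc_id, score) pairs with score > 0, best first."""
    scored = [(doc_id, overlap_score(query_terms, terms))
              for doc_id, terms in indexed_docs.items()]
    return sorted([s for s in scored if s[1] > 0],
                  key=lambda pair: pair[1], reverse=True)

query = ["economic", "financial", "situation", "gardening",
         "supplies", "retailers", "new", "york"]

indexed_docs = {
    "garden-article": ["gardening", "perennial", "pleasure", "spring", "robin",
                       "lane", "fox", "prepare", "strike", "economic", "blow",
                       "better", "garden", "shoestring"],
    "tunnel-article": ["tunnel", "fire", "locomotive"],  # hypothetical
}
results = retrieve(query, indexed_docs)
print(results)
```

Here the gardening article matches on "economic" and "gardening" even though it says nothing about retailers in New York, which is why raw overlap is usually refined with term weights.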
11Term Weighting
- How much does a term contribute to content?
- gardening 0.05
- perennial 0.03
- pleasure 0.0002
- spring 0.00007
- robin 0.01
- How often used in a document?
- How often used in the database?
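One common way to combine the two questions above (frequency in a document, frequency in the database) is tf-idf weighting. The slides do not say which scheme produced the weights shown, so the formula and the sample collection below are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc id -> list of terms. Returns doc id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            # term frequency in this document times inverse document frequency
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

# Hypothetical collection: "spring" occurs everywhere, "gardening" only in d1.
docs = {
    "d1": ["gardening", "gardening", "spring", "garden"],
    "d2": ["spring", "tunnel"],
    "d3": ["spring", "garden"],
}
weights = tf_idf(docs)
print(weights["d1"])
```

A term used in every document ("spring" here) gets weight zero because it does not help tell documents apart, while a term that is frequent in one document and rare in the database gets the highest weight.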
12The Notion of Relevance
- Relevant: supplies information asked for by the user.
- Subjective
- Complete information
- Necessary information
- Form of information
- Interpretation of user need
- Representation of user need: the query
13Computational Relevance
- A degree of similarity between documents
- Query to database: document retrieval
- One document to another: clustering
- Content overlap
- Common descriptors
- Closeness in document space
- Conceptual overlap
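Closeness in document space is often measured as the cosine of the angle between term vectors. The sketch below assumes raw term counts as vector components and uses hypothetical documents; it treats the query as just another short document, which is how the same measure serves both retrieval and clustering.

```python
import math
from collections import Counter

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of terms (0.0 if either is empty)."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = ["disasters", "tunnels", "transportation"]
doc1 = ["locomotive", "fire", "tunnels", "transportation"]  # hypothetical
doc2 = ["garden", "spring"]                                 # hypothetical

print(cosine(query, doc1))  # shares two terms with the query
print(cosine(query, doc2))  # shares none
```

Note that this only captures content overlap through common descriptors; conceptual overlap (for example "disasters" versus "fire") needs some mapping of terms onto concepts before the comparison.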
14Closeness in Concept Space
What recent disasters occurred in tunnels for
transportation?
15Retrieval Results
- A (ranked) list of hits: relevant documents (according to system)
- Typical format
- Rank DocId Score Title Abstract
- 1 WSJ910426-1234 0.95738 Locomotive Catches Fire in Swiss Tunnel. A Swiss Railroad locomotive caught fire in a tunnel near Zurich on Wednesday
- 2 APW890714-097841 0.89567 Tragedy in Chunnel.
- Ranking by
- Similarity score (Infoseek)
- Linking score (popularity) (Google)
- Hybrid (Lycos)
- Document style, document type, date/update, etc.
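As a small illustration of the typical format, the sketch below sorts hits by score and prints one "Rank DocId Score Title" row per hit. The two hits are the ones shown on the slide; the dictionary layout and the function name `rank_hits` are assumptions.

```python
def rank_hits(hits):
    """Sort hits by descending score and format them as result rows."""
    ordered = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    return [
        f'{rank} {hit["doc_id"]} {hit["score"]:.5f} {hit["title"]}'
        for rank, hit in enumerate(ordered, start=1)
    ]

hits = [
    {"doc_id": "APW890714-097841", "score": 0.89567,
     "title": "Tragedy in Chunnel."},
    {"doc_id": "WSJ910426-1234", "score": 0.95738,
     "title": "Locomotive Catches Fire in Swiss Tunnel."},
]
rows = rank_hits(hits)
for row in rows:
    print(row)
```

Swapping the sort key for a link-based or combined score would give the other ranking styles listed above without changing the output format.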
16Search Results (Google)
1. DBLP Tomek Strzalkowski. Tomek Strzalkowski List of publications from the DBLP... www.informatik.uni-trier.de/ley/db/indices/a-tree/s/StrzalkowskiTomek.html
2. Tomek Strzalkowski, Jose Perez Carballo and Mihnea Marinescu. citeseer.nj.nec.com/did/51110 - 16k - Cached - Similar pages
3. Amit Bagga's Publications. pp. 207-210, 2000. Tomek Strzalkowski, G. Bowden Wise,... www.cs.duke.edu/amit/publications.html - 12k
4. Log for 1998. Talked to Tomek Strzalkowski, GE research... www.cs.columbia.edu/min/logs/sched98.html - 34k
5. Topic Japanese Regulation of Insider Trading. LANGUAGE TEXT RETRIEVAL Tomek Strzalkowski and Jose Perez. trec.nist.gov/pubs/trec2/papers/txt/12.txt - 67k
6. TDT Email Correspondence. From "Strzalkowski, Tomek (CRD)"... www.ldc.upenn.edu/Projects/TDT2/email/email_127.html - 2k
17Search Results (Alta Vista)
1. NATURAL LANGUAGE INFORMATION RETRIEVAL IN DIGITAL LIBRARIES. Tomek Strzalkowski GE Corporate Research Developm. URL: searchpdf.adobe.com/proxies/0/73/47/17.html
2. qa-track Mailing List Archive. Thread Index. Re Questions on TREC QA track. From Amitabh Singhal. URL: www.research.att.com/lists/qa-track/maillist.html
3. Jussi Karlgren's Reasonably Complete List of Publications. SICS HUMLE ILE DigLib NoDaLiNe NYU HY Authors. Year. Title. Publication.... URL: www.sics.se/jussi/Artiklar/
4. Humanist.Archives.Vol.12 12.0584 new publications. Humanist Discussion Group (humanist_at_kcl.ac.uk) Mon, 26 Apr 1999 221701 0100 (BST). URL: lists.village.virginia.edu/lists_archive...t/v12/0583.html
5. Cha-Cha A System for Organizing Intranet Search Results. URL: cha-cha.berkeley.edu/papers/usits99/index.html
6. Modern Information Retrieval - Bibliography. URL: sunsite.dcc.uchile.cl/irbook/biblio.html
18Search Results (Infoseek)
1. AAAI Technical Report SS-98-06. Relevance: 41, Date: 12 Jul 1999, Size: 9.5K, http://www.aaai.org/Press/Reports/Symposia/Spring/SS-98-06/SS-98-06.ht...
2. Corpora Jan 1998 to Present: Corpora Final CFP COLING-ACL'98. Relevance: 41, Date: 17 Feb 2000, Size: 6.5K, http://www.hd.uib.no/corpora/1998-1/0135.html
3. ILE Publications. Relevance: 36, Date: 18 Jul 2000, Size: 19.3K, http://www.sics.se/humle/ile/publications/
4. LINGUIST List 9.878 COMPUTERM WS at COLING-ACL'98. Relevance: 36, Date: 14 Jun 1998, Size: 7.0K, http://www.linguistlist.org/issues/9/9-878.html
5. Recent Natural Language Processing Publications. Relevance: 34, Date: 1 Feb 2000, Size: 4.9K, http://www.cs.utah.edu/riloff/publications.html
6. ACL2000. Relevance: 34, Date: 11 Feb 2000, Size: 5.4K, http://www.geocities.com/Athens/Forum/1373/acl-2000-summarization
19The IR Tasks
- Ad-hoc Querying
- Filtering and Routing
- Topic Detection and Tracking
- Question Answering
- Automated Summarization
- Information Fusion
20Ad Hoc Querying
- Ask arbitrary queries against database
- Most Internet search is ad hoc
- Probably the hardest of all IR tasks
- Reflects real-life tasks
- Literature search/research
- Legal case preparation
- Medical diagnosis
21Cross-Lingual IR
- Ask a query in the user's native language
- E.g., English, French
- Database consists of documents
- in another language (e.g., Mandarin)
- multiple languages (e.g., FBIS, WWW)
- Full-text indexing in source language
- Machine translation unreliable
22Filtering and Routing
- Fixed queries against a data stream
- news broadcast, newswire service
- real time/no collection (filtering) or floating collection (routing)
- one assignment per document (classification), or multiple assignments
- Adaptive filtering
- Relevance threshold
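A minimal sketch of filtering with a fixed query and a relevance threshold. The scoring rule (fraction of profile terms found in the document), the threshold value, and the sample stream are all assumptions for illustration; the generator processes one document at a time, so no collection is kept.

```python
def filter_stream(stream, profile_terms, threshold):
    """Yield (doc_id, score) for each incoming document that clears the threshold."""
    profile = set(profile_terms)
    for doc_id, terms in stream:
        # Fraction of the profile's terms that appear in this document.
        score = len(profile & set(terms)) / len(profile)
        if score >= threshold:
            yield doc_id, score

# Hypothetical newswire items arriving in order.
stream = [
    ("n1", ["tunnel", "fire", "locomotive"]),
    ("n2", ["garden", "spring"]),
    ("n3", ["tunnel", "accident", "transportation", "disaster"]),
]
profile = ["tunnel", "disaster", "transportation", "fire"]

accepted = list(filter_stream(stream, profile, threshold=0.5))
print(accepted)
```

Adaptive filtering would adjust the profile terms or the threshold as the user judges the documents that get through; that feedback loop is left out here.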
23Topic Detection and Tracking
- A special form of filtering
- Abstraction of real-life tasks
- News reporting (e.g., NBC)
- Intelligence gathering (e.g., CIA)
- Financial markets
- Detect and track stories on topics of interest in a continuous data stream
- Sources: text, audio, video, multimedia
24TDT Concept
[Figure: stories from several sources (ABC, NBC, PRI, APW, Reuters, TsingHua) plotted along a time axis.]
25TDT Baseline Tasks
- Segmentation: disjoint, homogeneous regions (stories or non-stories)
- Detection (with or without segmentation): stories about some topic
- Tracking (with or without segmentation): from a training story, find more stories about the same topic
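Detection and tracking can both be sketched as comparing each incoming story with the first story seen on a topic. The single-pass grouping below, the Jaccard measure, the threshold of 0.3, and the sample stories are assumptions for illustration, not the method used in the TDT evaluations; segmentation is assumed to have been done already.

```python
def jaccard(terms_a, terms_b):
    """Share of distinct terms two stories have in common."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_topics(stories, threshold=0.3):
    """Assign each story to the most similar existing topic, or start a new one."""
    topics = []  # each topic is a list of (story_id, terms); item 0 is its seed
    for story_id, terms in stories:
        best, best_sim = None, 0.0
        for topic in topics:
            sim = jaccard(terms, topic[0][1])  # compare with the seed story
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best.append((story_id, terms))      # tracking: same topic
        else:
            topics.append([(story_id, terms)])  # detection: new topic
    return [[story_id for story_id, _ in topic] for topic in topics]

# Hypothetical stories in arrival order.
stories = [
    ("s1", ["tunnel", "fire", "swiss", "locomotive"]),
    ("s2", ["election", "vote", "senate"]),
    ("s3", ["tunnel", "fire", "zurich", "locomotive"]),
    ("s4", ["senate", "vote", "budget"]),
]
topics = detect_topics(stories)
print(topics)
```

The seed story plays the role of the training story in tracking: later stories join a topic only if they are close enough to it.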
26TDT Application Source Coordination
[Figure labels: newswire feeds (text); satellite video feeds (audio, text, video); stories.]
27Topics Formats vs. Ratings
time
28Question Answering
- Supply actual answers to user questions
- How long does it take to fly from Paris to New York on a Concorde?
- 3.5 hours
- Find relevant information, not documents
- Extract information, convert into an answer
- Ranges from trivia to research problems
- Commercial service: AskJeeves
29Why Question-Answering?
Users want information, not lists of documents
Query: What disasters have occurred in tunnels used for transportation?
Answer: A Swiss locomotive caught fire in a Zurich railway tunnel in 1991 and more than 50 passengers were injured. In 1992, a tunnel worker died after being hit by a works train at the British end of the Channel Tunnel.
30Complexity of Question Answering
- Question scope
- Simple factual to template/compounds to exploratory
- Question context
- Isolated trivia-style to task context to user knowledge context
- Complexity
- Atomic questions to decomposable compound problems
- Interpretation (of answer)
- Explicit facts to groups of facts to hypothesis forming
- Answer fusion
- Concatenation to clustering to composition
- Sources of data
- Single source to multiple sources to multimedia/multilingual
31QA Application Automated Call Center
Dialogue Agent
telephony interface
telephone Internet multilingual
Automatic Call Router
- I need the part number for a suction cup, please.
BackEnd Service Support service specialists and
information
- What product do you need this part for?
32Automated Summarization
- Summarize content of a text/media document(s)
- Informative vs. Indicative
- Generic vs. Topical
- main content vs. topic-related
- Single text vs. Cross-document
- topic aspect detection
- topical briefs, topic tracking
- Info Organization Visualization
- Multiple views of info space
- Rapid comprehension
33Cross-Document Summarization
- Cluster documents into topics
- Derive cluster summaries
- Form a cross-document summary
34Cross-media Summarization
language
ontology
video
35Fused Information Retrieval
Query: What's the current financial situation at Airbus?
Sources: NYTimes, APWire, DJFin, Medline, USPTO
Results: 1. NYT991028 2. 3. ...
36Information Understanding Tools