What is Information Retrieval - PowerPoint PPT Presentation

Slides: 37
Provided by: tcnj
Transcript and Presenter's Notes

Title: What is Information Retrieval


1
What is Information Retrieval?
  • (Process of) fetching information relevant to
    a user's information need
  • Fetch: detect (return pointer to)
  • Fetch: extract, compose, summarize, deduce
  • Information: documents (text, multimedia, web)
  • Relevant: on topic, useful, just right for the
    task
  • Information need: query, question, problem
    statement, profile

2
Information Retrieval: User's Perspective
Information need: What recent disasters occurred in tunnels used for transportation?
[Diagram labels: FORM A QUERY; SEARCH QUERY; SEARCH; DB; RESULTS RANKED BY RELEVANCE; RELEVANCE FEEDBACK; EXTRACT, ORGANIZE, SUMMARIZE; ANSWER]
3
Information Structures
[Diagram labels: Raw Data, User Query, Data Index, Indexed Query]
4
Information Index
  • Access data and documents by content
    • inverted index (like the subject index in a book)
    • usually a hash table of descriptors
    • used for retrieval operations
  • Access documents by id
    • straight index (like a table of contents)
    • used for relevance feedback operations
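
The two access paths above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the slides: the function name `build_indexes` and the two-document collection are hypothetical, and a Python dict stands in for the "hash table of descriptors".

```python
from collections import defaultdict

def build_indexes(docs):
    """Build both index structures from a {doc_id: [terms]} mapping."""
    inverted = defaultdict(set)   # descriptor -> ids of documents containing it
    straight = {}                 # doc id -> that document's descriptors
    for doc_id, terms in docs.items():
        straight[doc_id] = list(terms)
        for term in terms:
            inverted[term].add(doc_id)
    return dict(inverted), straight

# Hypothetical two-document collection, for illustration only.
docs = {
    "d1": ["tunnel", "fire", "locomotive"],
    "d2": ["tunnel", "channel", "train"],
}
inverted, straight = build_indexes(docs)

print(sorted(inverted["tunnel"]))  # access by content
print(straight["d2"])              # access by id
```

The inverted index answers "which documents mention this descriptor?", while the straight index answers "what does this document contain?", which is the lookup relevance feedback needs once a user marks a document as relevant.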

5
Information Retrieval System
[Diagram labels: Raw data, Raw needs]
6
Text Indexing
  • Controlled vocabulary
    • pre-defined terms assigned to a document
    • usually a manual process
    • requires non-trivial cognitive abilities
    • apparent limitations
  • Full text indexing
    • use all words in text
    • linguistic processing to normalize forms
    • map onto concepts, events and relationships

7
Document Indexing and Querying
  • Text: Snow Falling on Cedars by David Guterson
  • Index (Library of Congress): Fiction, Washington
    State, Japanese Americans, Trials (Murder),
    Journalists
  • User request: Find a book about a Puget Sound
    fisherman who is found drowned and a Japanese
    American charged with his murder.

[Callout: Washington State]
8
Full Text Indexing
  • Text: Gardening, the perennial pleasures of
    spring. Robin Lane Fox prepares to strike an
    economic blow for a better garden on a
    shoestring.
  • Terms: gardening, perennial, pleasure, spring,
    robin, lane, fox, prepare, strike, economic,
    blow, better, garden, shoestring.
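
A rough sketch of how the term list above could be derived from the text. The stopword list and the one-rule `normalize` function are assumptions made for this example; real linguistic normalization (stemming or lemmatization) is considerably more involved, and this rule would mangle words such as "bus".

```python
import re

# Assumed stopword list: just enough for this example sentence.
STOPWORDS = {"the", "of", "to", "an", "for", "a", "on"}

def normalize(word):
    """Crude stand-in for linguistic normalization: drop a trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase, tokenize, drop stopwords, normalize, de-duplicate in order."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        term = normalize(token)
        if term not in terms:
            terms.append(term)
    return terms

text = ("Gardening, the perennial pleasures of spring. Robin Lane Fox prepares "
        "to strike an economic blow for a better garden on a shoestring.")
print(index_terms(text))
```

On this sentence the sketch yields the same fourteen terms listed on the slide, with "pleasures" and "prepares" reduced to "pleasure" and "prepare".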

9
Automated Text Processing
  • Noun phrases: perennial-pleasure, spring,
    robin-lane-fox, economic-blow, better-garden,
    shoestring
  • Names: robin-lane-fox
  • Operator-Argument
    • prepare(strike(RLF,economic-blow))

10
Querying Indexed Data
  • Ask: What is the economic and financial situation
    of the gardening supplies retailers in New York?
  • Query: economic, financial, situation, gardening,
    supplies, retailers, new, york
  • Retrieve: add terms in common, calculate score
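
The "add terms in common" step can be illustrated with a simple overlap count. This is a sketch under the assumption that the score is just the number of shared terms; the function name `overlap_score` is hypothetical, and the document terms are the ones from the Full Text Indexing slide.

```python
def overlap_score(query_terms, doc_terms):
    """Score a document by the number of distinct terms it shares with the query."""
    return len(set(query_terms) & set(doc_terms))

query = ["economic", "financial", "situation", "gardening",
         "supplies", "retailers", "new", "york"]
doc = ["gardening", "perennial", "pleasure", "spring", "robin", "lane", "fox",
       "prepare", "strike", "economic", "blow", "better", "garden", "shoestring"]

print(overlap_score(query, doc))  # 2: "economic" and "gardening"
```

Every shared term counts equally here, so a gardening column scores on a query about retailers' finances simply because two words coincide.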

11
Term Weighting
  • How much does a term contribute to content?
    • gardening 0.05
    • perennial 0.03
    • pleasure 0.0002
    • spring 0.00007
    • robin 0.01
  • How often is it used in a document?
  • How often is it used in the database?
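
One common way to combine the two questions above is tf-idf weighting: frequency within the document, discounted by how many documents in the database contain the term. The slides do not name a formula, so the sketch below is an assumption, and it does not reproduce the example weights listed above. The function name `tf_idf` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf(doc_terms, collection):
    """Weight each term of one document by tf * log(N / df).

    doc_terms:  list of terms in the document being weighted
    collection: list of term lists, one per document in the database
    """
    n_docs = len(collection)
    tf = Counter(doc_terms)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for terms in collection if term in terms)
        df = max(df, 1)  # guard for a document that is not in the collection
        weights[term] = (count / len(doc_terms)) * math.log(n_docs / df)
    return weights

# Toy database of three documents.
collection = [
    ["gardening", "spring", "garden"],
    ["spring", "robin"],
    ["spring", "fox", "fox"],
]
weights = tf_idf(collection[0], collection)
print(weights)
```

A term that appears in every document ("spring" here) gets weight zero, because it does nothing to distinguish one document from another.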

12
The Notion of Relevance
  • Relevant: supplies information asked for by the
    user.
  • Subjective
  • Complete information
  • Necessary information
  • Form of information
  • Interpretation of user need
  • Representation of user need: the query

13
Computational Relevance
  • A degree of similarity between documents
  • Query to database: document retrieval
  • One document to another: clustering
  • Content overlap
  • Common descriptors
  • Closeness in document space
  • Conceptual overlap
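
"Closeness in document space" is often measured as the cosine of the angle between term-weight vectors. The slides do not specify a measure, so the following is a sketch under that assumption; the vectors and the function name `cosine` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = {"tunnel": 1.0, "disaster": 1.0}
doc_a = {"tunnel": 1.0, "fire": 1.0}
doc_b = {"garden": 1.0, "spring": 1.0}

print(cosine(query, doc_a))  # shares "tunnel" with the query
print(cosine(query, doc_b))  # no common descriptors
```

The same function covers both uses named on the slide: comparing a query against a document for retrieval, or comparing two documents for clustering.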

14
Closeness in Concept Space
Query: What recent disasters occurred in tunnels for transportation?
15
Retrieval Results
  • A (ranked) list of hits: relevant documents
    (according to the system)
  • Typical format
    • Rank, DocId, Score, Title, Abstract
    • 1 WSJ910426-1234 0.95738 Locomotive Catches Fire
      in Swiss Tunnel. A Swiss Railroad locomotive
      caught fire in a tunnel near Zurich on Wednesday
    • 2 APW890714-097841 0.89567 Tragedy in Chunnel.
  • Ranking by
    • Similarity score (Infoseek)
    • Linking score (popularity) (Google)
    • Hybrid (Lycos)
    • Document style, document type, date/update, etc.
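
As an illustration of the "typical format" above, the sketch below sorts scored hits and numbers them by rank. The tuple layout and the function name `rank_hits` are assumptions; the document ids, scores, and titles are the two examples from the slide.

```python
def rank_hits(hits):
    """Sort (doc_id, score, title) hits by descending score and number them."""
    ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [(rank, doc_id, score, title)
            for rank, (doc_id, score, title) in enumerate(ordered, start=1)]

hits = [
    ("APW890714-097841", 0.89567, "Tragedy in Chunnel"),
    ("WSJ910426-1234", 0.95738, "Locomotive Catches Fire in Swiss Tunnel"),
]
for rank, doc_id, score, title in rank_hits(hits):
    print(f"{rank} {doc_id} {score:.5f} {title}")
```

Only the ordering by score is shown here. A link-based or hybrid ranking would replace the sort key with a different score.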

16
Search Results (Google)
1. DBLP Tomek Strzalkowski Tomek Strzalkowski List of publications from the DBLP... www.informatik.uni-trier.de/ley/db/indices/a-tree/s/StrzalkowskiTomek.html
2. Tomek Strzalkowski, Jose Perez Carballo and Mihnea Marinescu citeseer.nj.nec.com/did/51110 - 16k - Cached - Similar pages
3. Amit Bagga's Publications pp. 207-210, 2000. Tomek Strzalkowski, G. Bowden Wise,... www.cs.duke.edu/amit/publications.html - 12k
4. Log for 1998 Talked to Tomek Strzalkowski, GE research... www.cs.columbia.edu/min/logs/sched98.html - 34k
5. Topic Japanese Regulation of Insider Trading LANGUAGE TEXT RETRIEVAL Tomek Strzalkowski and Jose Perez trec.nist.gov/pubs/trec2/papers/txt/12.txt - 67k
6. TDT Email Correspondence From "Strzalkowski, Tomek (CRD)"... www.ldc.upenn.edu/Projects/TDT2/email/email_127.html - 2k
17
Search Results (Alta Vista)
1. NATURAL LANGUAGE INFORMATION RETRIEVAL IN DIGITAL LIBRARIES Tomek Strzalkowski GE Corporate Research Developm URL: searchpdf.adobe.com/proxies/0/73/47/17.html
2. qa-track Mailing List Archive. Thread Index. Re: Questions on TREC QA track. From: Amitabh Singhal URL: www.research.att.com/lists/qa-track/maillist.html
3. Jussi Karlgren's Reasonably Complete List of Publications. SICS HUMLE ILE DigLib NoDaLiNe NYU HY Authors. Year. Title. Publication.... URL: www.sics.se/jussi/Artiklar/
4. Humanist.Archives.Vol.12 12.0584 new publications. Humanist Discussion Group (humanist_at_kcl.ac.uk) Mon, 26 Apr 1999 22:17:01 +0100 (BST) URL: lists.village.virginia.edu/lists_archive...t/v12/0583.html
5. Cha-Cha: A System for Organizing Intranet Search Results URL: cha-cha.berkeley.edu/papers/usits99/index.html
6. Modern Information Retrieval - Bibliography URL: sunsite.dcc.uchile.cl/irbook/biblio.html
18
Search Results (Infoseek)
1. AAAI Technical Report SS-98-06. Relevance: 41, Date: 12 Jul 1999, Size: 9.5K, http://www.aaai.org/Press/Reports/Symposia/Spring/SS-98-06/SS-98-06.ht...
2. Corpora Jan 1998 to Present Corpora Final CFP COLING-ACL'98. Relevance: 41, Date: 17 Feb 2000, Size: 6.5K, http://www.hd.uib.no/corpora/1998-1/0135.html
3. ILE Publications. Relevance: 36, Date: 18 Jul 2000, Size: 19.3K, http://www.sics.se/humle/ile/publications/
4. LINGUIST List 9.878 COMPUTERM WS at COLING-ACL'98. Relevance: 36, Date: 14 Jun 1998, Size: 7.0K, http://www.linguistlist.org/issues/9/9-878.html
5. Recent Natural Language Processing Publications. Relevance: 34, Date: 1 Feb 2000, Size: 4.9K, http://www.cs.utah.edu/riloff/publications.html
6. ACL2000. Relevance: 34, Date: 11 Feb 2000, Size: 5.4K, http://www.geocities.com/Athens/Forum/1373/acl-2000-summarization
19
The IR Tasks
  • Ad-hoc Querying
  • Filtering and Routing
  • Topic Detection and Tracking
  • Question Answering
  • Automated Summarization
  • Information Fusion

20
Ad Hoc Querying
  • Ask arbitrary queries against a database
  • Most Internet search is ad hoc
  • Probably the hardest of all IR tasks
  • Reflects real-life tasks
    • Literature search/research
    • Legal case preparation
    • Medical diagnosis

21
Cross-Lingual IR
  • Ask query in a user's native language
    • e.g., English, French
  • Database consists of documents
    • in another language (e.g., Mandarin)
    • in multiple languages (e.g., FBIS, WWW)
  • Full-text indexing in source language
  • Machine translation unreliable

22
Filtering and Routing
  • Fixed queries against a data stream
    • news broadcast, newswire service
    • real time/no collection (filtering) or floating
      collection (routing)
    • one assignment per document (classification), or
      multiple assignments
  • Adaptive filtering
  • Relevance threshold
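
A minimal sketch of filtering: a fixed profile is compared with each document as it arrives, and only documents that reach a relevance threshold are passed on. The scoring rule (fraction of profile terms present), the function name `filter_stream`, and the sample stream are assumptions for illustration.

```python
def filter_stream(stream, profile, threshold):
    """Yield (doc_id, score) for stream documents that meet the threshold.

    stream:    iterable of (doc_id, terms) pairs arriving one at a time
    profile:   set of terms standing in for the fixed query
    threshold: minimum fraction of profile terms a document must contain
    """
    for doc_id, terms in stream:
        score = len(profile & set(terms)) / len(profile)
        if score >= threshold:
            yield doc_id, score

profile = {"tunnel", "fire", "accident"}
stream = [
    ("n1", ["tunnel", "fire", "zurich"]),
    ("n2", ["garden", "spring"]),
    ("n3", ["tunnel", "opening", "ceremony"]),
]
passed = list(filter_stream(stream, profile, threshold=0.5))
print(passed)
```

Because the function is a generator, each document is judged when it arrives and nothing is stored, which matches the "real time/no collection" case. An adaptive filter would additionally update `profile` or `threshold` from user feedback.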

23
Topic Detection and Tracking
  • A special form of filtering
  • Abstraction of real-life tasks
    • News reporting (e.g., NBC)
    • Intelligence gathering (e.g., CIA)
    • Financial markets
  • Detect and track stories on topics of interest
    in continuous data stream
  • Sources: text, audio, video, multimedia
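
Detection and tracking over a stream can be sketched as single-pass clustering: each incoming story either joins the closest existing topic (tracking) or starts a new one (detection). The similarity measure (Jaccard overlap of term sets), the threshold value, and the function names are assumptions for this example, not the method used in the TDT evaluations.

```python
def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def detect_and_track(stories, threshold=0.3):
    """Assign each incoming story to the closest known topic, or start a new one."""
    topics = []        # each topic is the union of its stories' terms
    assignments = []   # topic index chosen for each story, in arrival order
    for terms in stories:
        terms = set(terms)
        scores = [jaccard(terms, topic) for topic in topics]
        best = max(range(len(topics)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            topics[best] |= terms      # tracking: story joins an existing topic
        else:
            topics.append(terms)       # detection: first story of a new topic
            best = len(topics) - 1
        assignments.append(best)
    return assignments

stories = [
    ["tunnel", "fire", "zurich"],
    ["market", "stocks", "fall"],
    ["tunnel", "fire", "injured"],
]
print(detect_and_track(stories))
```

The first and third stories end up in the same topic, while the second starts a topic of its own.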

24
TDT Concept
[Diagram: sources (ABC, NBC, PRI, APW, Reuters, TsingHua) shown against a time axis]
25
TDT Baseline Tasks
  • Segmentation: Disjoint, Homogeneous Regions
    (Stories or Non-Stories)
  • Detection (with or without Segmentation):
    Stories about Some Topic
  • Tracking (with or without Segmentation):
    Training Story, More Stories about Same Topic
26
TDT Application: Source Coordination
[Diagram labels: newswire feeds, satellite video feeds, text, audio, video, stories]
27
Topics Formats vs. Ratings
[Chart axis: time]
28
Question Answering
  • Supply actual answers to user questions
    • How long does it take to fly from Paris to New
      York on a Concorde?
    • 3.5 hours
  • Find relevant information, not documents
  • Extract information, convert into an answer
  • Ranges from trivia to research problems
  • Commercial service: AskJeeves

29
Why Question-Answering?
Users want information, not lists of documents
Query: What disasters have occurred in tunnels used for transportation?
Answer: A Swiss locomotive caught fire in a Zurich railway tunnel in 1991 and more than 50 passengers were injured. In 1992, a tunnel worker died after being hit by a works train at the British end of the Channel.
30
Complexity of Question Answering
  • Question scope
    • Simple factual to template/compounds to
      exploratory
  • Question context
    • Isolated trivia-style to task context to user
      knowledge context
  • Complexity
    • Atomic questions to decomposable compound
      problems
  • Interpretation (of answer)
    • Explicit facts to groups of facts to hypothesis
      forming
  • Answer fusion
    • Concatenation to clustering to composition
  • Sources of data
    • Single source to multiple sources to
      multimedia/multilingual

31
QA Application: Automated Call Center
[Diagram labels: Dialogue Agent; telephony interface (telephone, Internet, multilingual); Automatic Call Router; BackEnd Service Support (service specialists and information)]
  • I need the part number for a suction cup, please.
  • What product do you need this part for?
  • It's called 3MLasercam

32
Automated Summarization
  • Summarize content of a text/media document(s)
  • Informative vs. Indicative
  • Generic vs. Topical
    • main content vs. topic-related
  • Single text vs. Cross-document
    • topic aspect detection
    • topical briefs, topic tracking
  • Info Organization and Visualization
    • Multiple views of info space
    • Rapid comprehension

33
Cross-Document Summarization
  • Cluster documents into topics
  • Derive cluster summaries
  • Form a cross-document summary

34
Cross-media Summarization
[Diagram labels: language, ontology, video]
35
Fused Information Retrieval
Query: What's the current financial situation at Airbus?
[Diagram labels: sources NYTimes, APWire, DJFin, Medline, USPTO; result list 1. NYT991028 2. 3. ...]
36
Information Understanding Tools