What is Information Retrieval - PowerPoint PPT Presentation

Slides: 49
Provided by: tcnj
Transcript and Presenter's Notes


1
What is Information Retrieval?
  • (Process of) fetching information relevant to a
    user's information need
  • Fetch: detect (return a pointer to)
  • Fetch: extract, compose, summarize, deduce
  • Information: documents (text, multimedia, web)
  • Relevant: on topic, useful, just right for the
    task
  • Information need: query, question, problem
    statement, profile

2
Information Retrieval: User's Perspective
Information need: What recent disasters
occurred in tunnels used for transportation?
[Diagram labels: FORM A QUERY; SEARCH QUERY; RELEVANCE FEEDBACK; DB; SEARCH; RESULTS RANKED BY RELEVANCE; EXTRACT, ORGANIZE, SUMMARIZE; ANSWER]
3
Information Structures
Raw Data
User Query
Data Index
Indexed Query
4
Information Index
  • Access data and documents by content
      • inverted index (like the subject index in a book)
      • usually a hash table of descriptors
      • used for retrieval operations
  • Access documents by id
      • straight index (like a table of contents)
      • used for relevance feedback operations
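
The two access paths above can be sketched in a few lines. This is only an illustration under assumptions: the toy collection, the whitespace tokenizer, and the name build_indexes are invented here and are not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
docs = {
    "d1": "fire in swiss tunnel",
    "d2": "tragedy in channel tunnel",
    "d3": "gardening on a shoestring",
}

def build_indexes(collection):
    """Return (inverted, straight) indexes for a dict of id -> text."""
    inverted = defaultdict(set)  # term -> ids of documents containing it
    straight = {}                # id -> list of terms in that document
    for doc_id, text in collection.items():
        terms = text.lower().split()  # assumed tokenizer: split on spaces
        straight[doc_id] = terms
        for term in terms:
            inverted[term].add(doc_id)
    return dict(inverted), straight

inverted, straight = build_indexes(docs)
print(sorted(inverted["tunnel"]))  # access by content
print(straight["d3"])              # access by id
```

The inverted index is a hash table keyed by descriptor, which is what makes lookup by content cheap; the straight index returns a document's own terms, the kind of lookup relevance feedback needs.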

5
Information Retrieval System
Raw data
Raw needs
6
Origins
  • Communication theory revisited
  • Problems with transmission of meaning

Noise
7
Structure of an IR System
Search Line
Adapted from Soergel, p. 19
8
The Standard Retrieval Interaction Model
9
Standard Model of IR
  • Assumptions
  • The goal is maximizing precision and recall
    simultaneously
  • The information need remains static
  • The value is in the resulting document set
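
For reference, precision and recall can be computed from a retrieved set and a set of documents judged relevant. The sketch below uses the usual set-based definitions; the function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents returned, 3 documents are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

Returning more documents tends to raise recall and lower precision, which is why maximizing both at once is a demanding goal.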

10
Problems with Standard Model
  • Users learn during the search process
  • Scanning titles of retrieved documents
  • Reading retrieved documents
  • Viewing lists of related topics/thesaurus terms
  • Navigating hyperlinks
  • Some users don't like long, (apparently)
    disorganized lists of documents

11
IR is an Iterative Process
12
IR is a Dialog
  • The exchange doesn't end with the first answer
  • Users can recognize elements of a useful answer,
    even when incomplete
  • Questions and understanding change as the
    process continues

13
Bates' Berry-Picking Model
  • Standard IR model
  • Assumes the information need remains the same
    throughout the search process
  • Berry-picking model
  • Interesting information is scattered like berries
    among bushes
  • The query is continually shifting

14
Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure labels: successive queries Q0, Q1, Q2, Q3, Q4, Q5 along the searcher's path]
15
Berry-Picking Model (cont.)
  • The query is continually shifting
  • New information may yield new ideas and new
    directions
  • The information need
  • Is not satisfied by a single, final retrieved set
  • Is satisfied by a series of selections and bits
    of information found along the way

16
Restricted Form of the IR Problem
  • The system has available only pre-existing,
    canned text passages
  • Its response is limited to selecting from these
    passages and presenting them to the user
  • It must select, say, 10 or 20 passages out of
    millions or billions!
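
Selecting a handful of passages from a very large scored pool does not require sorting everything. A minimal sketch, assuming each passage already has a relevance score (the scores and ids below are invented):

```python
import heapq

def top_k(scored_passages, k=10):
    """Return the k highest-scoring (score, passage_id) pairs, best first."""
    # heapq.nlargest keeps only k candidates in memory while scanning.
    return heapq.nlargest(k, scored_passages)

# Hypothetical (score, passage_id) pairs.
scores = [(0.12, "p1"), (0.95, "p2"), (0.40, "p3"), (0.89, "p4")]
best = top_k(scores, k=2)
print(best)
```

How the scores are produced is the hard part of the problem and is not shown here.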

17
Information Retrieval
  • Revised Task Statement
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries
  • This set of assumptions underlies the field of
    Information Retrieval

18
Text Indexing
  • Controlled vocabulary
  • pre-defined terms assigned to a document
  • usually a manual process
  • requires non-trivial cognitive abilities
  • apparent limitations
  • Full text indexing
  • use all words in text
  • linguistic processing to normalize forms
  • map onto concepts, events and relationships

19
Document Indexing and Querying
  • Text: Snow Falling on Cedars by David Guterson
  • Index (Library of Congress)
  • Fiction, Washington State, Japanese Americans,
    Trials (Murder), Journalists
  • User Request
  • Find a book about a Puget Sound fisherman who is
    found drowned and a Japanese American charged
    with his murder.

Washington State
20
Full Text Indexing
  • Text: Gardening, the perennial pleasures of
    spring. Robin Lane Fox prepares to strike an
    economic blow for a better garden on a
    shoestring.
  • Terms: gardening, perennial, pleasure, spring,
    robin, lane, fox, prepare, strike, economic,
    blow, better, garden, shoestring.
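
The step from text to terms can be approximated as lowercasing, dropping function words, and normalizing word forms. The sketch below is a rough stand-in, not the processing used in the slides: the stopword list is hand-picked for this one sentence, and the suffix rule is far cruder than a real stemmer.

```python
import re

# Assumed, deliberately tiny stopword list (enough for this example only).
STOPWORDS = {"the", "of", "to", "an", "for", "a", "on"}

def normalize(word):
    """Crude normalization: strip a final 's' (plural or verb ending)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase, tokenize on letters, drop stopwords, normalize forms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [normalize(w) for w in words if w not in STOPWORDS]

text = ("Gardening, the perennial pleasures of spring. Robin Lane Fox "
        "prepares to strike an economic blow for a better garden on a "
        "shoestring.")
terms = index_terms(text)
print(terms)
```

On this sentence the sketch reproduces the term list above; on other text the naive suffix rule will make mistakes that a proper stemmer or lemmatizer would avoid.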

21
Automated Text Processing
  • Noun phrases: perennial_pleasure, spring,
    robin_lane_fox, economic_blow, better_garden,
    shoestring
  • Names: robin_lane_fox
  • Operator-Argument
      • prepare(strike(RLF, economic-blow))

22
Querying Indexed Data
  • Ask: What is the economic and financial situation
    of the gardening supplies retailers in New York?
  • Query: economic, financial, situation, gardening,
    supplies, retailers, new, york
  • Retrieve: add up terms in common, calculate score
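
The simplest reading of "terms in common" is a set intersection between query terms and document terms. A minimal sketch, assuming unweighted terms; overlap_score is a name chosen here, and the document terms are taken from the gardening example.

```python
def overlap_score(query_terms, doc_terms):
    """Score a document by the number of distinct terms shared with the query."""
    return len(set(query_terms) & set(doc_terms))

query = ["economic", "financial", "situation", "gardening",
         "supplies", "retailers", "new", "york"]
doc = ["gardening", "perennial", "pleasure", "spring", "robin", "lane",
       "fox", "prepare", "strike", "economic", "blow", "better",
       "garden", "shoestring"]
score = overlap_score(query, doc)
print(score)
```

The gardening column matches on "economic" and "gardening" even though it says nothing about retailers in New York, which shows how weak raw overlap is as evidence of relevance.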

23
Term Weighting
  • How much does a term contribute to content?
      • gardening 0.05
      • perennial 0.03
      • pleasure 0.0002
      • spring 0.00007
      • robin 0.01
  • How often is it used in a document?
  • How often is it used in the database?
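
One common way to combine the two frequencies is TF-IDF. The slides do not name a weighting scheme, so the sketch below is an assumption: it multiplies relative frequency within a document by the log of inverse document frequency, and the toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict id -> list of terms. Returns dict id -> {term: weight}."""
    n = len(docs)
    df = Counter()  # number of documents each term occurs in
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # occurrences of each term in this document
        weights[doc_id] = {
            t: (count / len(terms)) * math.log(n / df[t])
            for t, count in tf.items()
        }
    return weights

# Hypothetical two-document database.
w = tf_idf({
    "d1": ["gardening", "spring", "gardening"],
    "d2": ["spring", "tunnel"],
})
print(w["d1"])
```

A term that appears in every document ("spring" here) gets weight zero, while a term concentrated in one document ("gardening") gets a high weight.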

24
The Notion of Relevance
  • Relevant: supplies the information asked for by the
    user.
  • Subjective
  • Complete information
  • Necessary information
  • Form of information
  • Interpretation of user need
  • Representation of user need: the query

25
Computational Relevance
  • A degree of similarity between documents
  • Query to database: document retrieval
  • One document to another: clustering
  • Content overlap
  • Common descriptors
  • Closeness in document space
  • Conceptual overlap
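
"Closeness in document space" is often measured as the cosine of the angle between term-weight vectors. The slides do not commit to a measure, so treat this as one possible choice; the vectors below are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dict term -> weight."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical query and document vectors sharing one of two terms.
query_vec = {"tunnel": 1.0, "disaster": 1.0}
doc_vec = {"tunnel": 1.0, "fire": 1.0}
sim = cosine(query_vec, doc_vec)
print(round(sim, 3))
```

The same function serves both uses listed above: query against document for retrieval, and document against document for clustering.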

26
Closeness in Concept Space
What recent disasters occurred in tunnels for
transportation?
27
Retrieval Results
  • A (ranked) list of hits: relevant documents
    (according to the system)
  • Typical format
      • Rank | DocId | Score | Title | Abstract
      • 1 | WSJ910426-1234 | 0.95738 | Locomotive Catches Fire
        in Swiss Tunnel. | A Swiss Railroad locomotive
        caught fire in a tunnel near Zurich on Wednesday
      • 2 | APW890714-097841 | 0.89567 | Tragedy in Chunnel.
  • Ranking by
      • Similarity score (Infoseek)
      • Linking score (popularity) (Google), hybrid
        (Lycos)
      • Document style, document type, date/update, etc.

28
Search Results (Google)
GoogleMM.htm
1. DBLP Miroslav Martinovic
   dblp.uni-trier.de Miroslav Martinovic. ... 2002. 2, EE, G. Sampath, Miroslav Martinovic A Multilevel Text Processing Model of Newsgroup Dynamics. NLDB 2002 208-212. ...
   www.informatik.uni-trier.de/ley/db/indices/a-tree/m/MartinovicMiroslav.html - 3k - Cached - Similar pages
2. Home Page of Dr. M. Martinovic
   Miroslav Martinovic, Ph.D. ...
   www.tcnj.edu/mmmartin/ - 16k - Cached - Similar pages
3. CRA-W
   ... Design and Development of a Word Conflation Module Student Researchers Emily Gibson, Christina Grape Advisor Dr. Miroslav Martinovic Institution The College ...
   www.cra.org/Activities/craw/crew/crewReports/2002/newjersey_final.html - 14k - Cached - Similar pages
4. tomek strzalkowski - ResearchIndex document query
   ... 171-186 acl.ldc.upenn.edu/J/J93/J93-4009.pdf Comparing Two Grammar-Based Generation - Case Study Miroslav (Correct) A Case Study Miroslav Martinovic And Tomek ...
   citeseer.nj.nec.com/cs?q TomekStrzalkowski - 19k - Cached - Similar pages
5. SpringerLink - Volume
   ... 208 - 212. G. Sampath, Miroslav Martinovic. Best Feature Selection for Maximum Entropy-Based Word Sense Disambiguation, pp. 213 - 217. ...
   link.springer.de/link/service/series/0558/tocs/t2553.htm - 37k - Cached - Similar pages
29
Search Results (Alta Vista)
AltaVistaMM.htm
1. DBLP Miroslav Martinovic
   Miroslav Martinovic List of publications from the DBLP Bibliography Server - FAQ Ask others ACM - CiteSeer - CSB - Google ... 2002 2 EE G. Sampath, Miroslav Martinovic A Multilevel Text ...
   www.acm.org/sigs/sigmod/dblp/db/indices/...cMiroslav.html
   More pages from www.acm.org
2. ABSTRACT
   ... AND LINGUISTIC APPROACHES IN BUILDING INTELLIGENT QUESTION ANSWERING SYSTEMS Miroslav Martinovic TOPIC AREA Question Answering, Mathematical Linguistics, Quantitative/Qualitative Linguistics ...
   www.trenton.edu/mmmartin/SSGRR-ABS.HTML
   Related Pages More pages from www.trenton.edu
3. ABSTRACT
   To Appear Transforming A Word Conflation Algorithm into A Minimal Stem Algorithm Miroslav Martinovic TOPIC AREA Word Conflation, Information Retrieval, Stem Dictionaries, NLP Tools and Resources ...
   www.trenton.edu/mmmartin/CREW-ABS.html
4. SpringerLink
   Lecture Notes in Computer Science 2553 A Multilevel Text Processing Model of Newsgroup Dynamics G. Sampath and Miroslav Martinovic Department of Computer Science, The College of New Jersey, Ewing, NJ 08628 sampath,mmmartin_at_tcnj.edu ...
   link.springer.de/link/service/series/055...53/25530208.htm
   Related Pages More pages from link.springer.de
5. COMPARING TWO GRAMMAR-BASED GENERATION ALGORITHMS A CASE STUDY
   File type PDF - Download PDF Reader title COMPARING TWO GRAMMAR-BASED GENERATION ALGORITHMS A CASE STUDY author Miroslav Martinovic Tomek Strzaikowski creation data D20020326132025-05'00' revision date D20020403103942-05'00'
   www.ldc.upenn.edu/acl/P/P92/P92-1011.pdf
   More pages from www.ldc.upenn.edu
30
Search Results (Yahoo)
YahooMM.htm
1. DBLP Miroslav Martinovic
   dblp.uni-trier.de Miroslav Martinovic. ... 2002. 2, EE, G. Sampath, Miroslav Martinovic A Multilevel Text Processing Model of Newsgroup Dynamics. NLDB 2002 208-212. ...
   www.informatik.uni-trier.de/ley/db/indices/a-tree/m/MartinovicMiroslav.html cached more results from this site
2. Home Page of Dr. M. Martinovic
   Miroslav Martinovic, Ph.D. ...
   www.tcnj.edu/mmmartin/ cached more results from this site
3. CRA-W
   ... Design and Development of a Word Conflation Module Student Researchers Emily Gibson, Christina Grape Advisor Dr. Miroslav Martinovic Institution The College ...
   www.cra.org/Activities/craw/crew/crewReports/2002/newjersey_final.html cached more results from this site
4. tomek strzalkowski - ResearchIndex document query
   ... 171-186 acl.ldc.upenn.edu/J/J93/J93-4009.pdf Comparing Two Grammar-Based Generation - Case Study Miroslav (Correct) A Case Study Miroslav Martinovic And Tomek ...
   citeseer.nj.nec.com/cs?q TomekStrzalkowski cached
5. SpringerLink - Volume
   ... 208 - 212. G. Sampath, Miroslav Martinovic. Best Feature Selection for Maximum Entropy-Based Word Sense Disambiguation, pp. 213 - 217. ...
   link.springer.de/link/service/series/0558/tocs/t2553.htm cached more results from this site
6. Michael E. Locasto Projects
   ... projects Cohorts are in 's. QASTIIR Question/Answer System for Intelligent Information Retrieval Michael Hulme, Miroslav Martinovic ...
   www1.cs.columbia.edu/locasto/projects/ cached
7. List of Papers
   ... 34. Integrating Statistical and Linguistic Approaches in Building Intelligent Question Answering Systems. Miroslav Martinovic. 35. ...
   www.ssgrr.it/en/ssgrr2002w/papers.htm cached more results from this site
31
The IR Tasks
  • Ad-hoc Querying
  • Filtering and Routing
  • Topic Detection and Tracking
  • Question Answering
  • Automated Summarization
  • Information Fusion

32
Ad Hoc Querying
  • Ask arbitrary queries against database
  • Most Internet search is ad hoc
  • Probably the hardest of all IR tasks
  • Reflects real-life tasks
  • Literature search/research
  • Legal case preparation
  • Medical diagnosis

33
Cross-Lingual IR
  • Ask a query in the user's native language
  • E.g., English, French
  • Database consists of documents
  • in another language (e.g., Mandarin)
  • multiple languages (e.g., FBIS, WWW)
  • Full-text indexing in source language
  • Machine translation unreliable

34
Filtering and Routing
  • Fixed queries against a data stream
  • news broadcast, newswire service
  • real time/no collection (filtering) or floating
    collection (routing)
  • one assignment per document (classification), or
    multiple assignments
  • Adaptive filtering
  • Relevance threshold
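
Filtering with a fixed query and a relevance threshold can be sketched as a generator over a stream. Everything specific here is assumed for illustration: the profile terms, the overlap-based score, and the threshold value.

```python
def filter_stream(stream, profile_terms, threshold=0.5):
    """Yield (doc_id, score) for streamed documents that pass the threshold."""
    profile = set(profile_terms)  # the fixed query
    for doc_id, terms in stream:
        # Assumed score: fraction of profile terms present in the document.
        score = len(profile & set(terms)) / len(profile) if profile else 0.0
        if score >= threshold:
            yield doc_id, score

# Hypothetical newswire stream of (id, terms) pairs.
stream = [
    ("n1", ["tunnel", "fire", "zurich"]),
    ("n2", ["garden", "spring"]),
    ("n3", ["tunnel", "accident", "channel"]),
]
hits = list(filter_stream(stream, ["tunnel", "fire"], threshold=0.5))
print(hits)
```

Each document is judged once as it arrives and nothing is stored, which matches the "no collection" case; an adaptive filter would additionally update the profile or the threshold from user feedback.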

35
Topic Detection and Tracking
  • A special form of filtering
  • Abstraction of real-life tasks
  • News reporting (e.g., NBC)
  • Intelligence gathering (e.g., CIA)
  • Financial markets
  • Detect and track stories on topics of interest
    in continuous data stream
  • Sources: text, audio, video, multimedia

36
TDT Concept
[Figure: axes "sources" (ABC, NBC, PRI, APW, Reuters, TsingHua) and "time"]
37
TDT Baseline Tasks
  • Segmentation: disjoint, homogeneous regions
    (stories or non-stories)
  • Detection (with or without segmentation): stories
    about some topic
  • Tracking (with or without segmentation): from a
    training story, more stories about the same topic
38
TDT Application: Source Coordination
[Diagram labels: newswire feeds (text); satellite video feeds (audio, text, video); stories]
39
Topics Formats vs. Ratings
time
40
Question Answering
  • Supply actual answers to user questions
  • How long does it take to fly from Paris to New
    York on a Concorde?
  • 3.5 hours
  • Find relevant information, not documents
  • Extract information, convert into an answer
  • Ranges from trivia to research problems
  • Commercial service: AskJeeves

41
Why Question-Answering?
Users want information, not lists of documents
Query: What disasters have occurred in tunnels
used for transportation?
Answer: A Swiss locomotive caught fire in a
Zurich railway tunnel in 1991 and
more than 50 passengers were
injured. In 1992, a tunnel worker
died after being hit by a works
train at the British end of the
Channel.
42
Complexity of Question Answering
  • Question Scope
  • Simple factual to template/compounds to
    exploratory
  • Question context
  • Isolated trivia-style to task context to user
    knowledge context
  • Complexity
  • Atomic questions to decomposable compound
    problems
  • Interpretation (of answer)
  • Explicit facts to groups of facts to hypothesis
    forming
  • Answer fusion
  • Concatenation to clustering to composition
  • Sources of data
  • Single source to multiple sources to
    multimedia/multilingual

43
QA Application: Automated Call Center
[Diagram labels: Dialogue Agent; telephony interface (telephone, Internet, multilingual); Automatic Call Router; BackEnd Service Support (service specialists and information)]
  • I need the part number for a suction cup, please.
  • What product do you need this part for?
  • It's called 3MLasercam

44
Automated Summarization
  • Summarize content of a text/media document(s)
  • Informative vs. Indicative
  • Generic vs. Topical
  • main content vs. topic-related
  • Single text vs. Cross-document
  • topic aspect detection
  • topical briefs, topic tracking
  • Info Organization and Visualization
  • Multiple views of info space
  • Rapid comprehension

45
Cross-Document Summarization
  • Cluster documents into topics
  • Derive cluster summaries
  • Form a cross-document summary

46
Cross-media Summarization
[Diagram labels: language, ontology, video]
47
Fused Information Retrieval
[Diagram labels: query "What's the current financial situation at Airbus?"; sources NYTimes, APWire, DJFin, Medline, USPTO; ranked results 1. NYT991028 2. 3. ...]
48
Information Understanding Tools