Prof. Ray Larson

About This Presentation

Description:

Title: PowerPoint Presentation Author: Valued Gateway Client Last modified by: ray Created Date: 9/3/2002 3:52:45 AM Document presentation format – PowerPoint PPT presentation

Slides: 63

Provided by: ValuedGate1518

Learn more at: https://courses.ischool.berkeley.edu

Transcript and Presenter's Notes

1
Lecture 16: Intro to Information Retrieval
SIMS 202: Information Organization and Retrieval

Prof. Ray Larson and Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday, 10:30 am - 12:00 pm
Fall 2003
http://www.sims.berkeley.edu/academics/courses/is202/f03/

2
Lecture Overview

Review
MPEG-7
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Prep for Presentations
MMM Status, Web interface, Flamenco

Credit for some of the slides in this lecture
goes to Marti Hearst and Fred Gey
4
Review: Information Overload

"The world's total yearly production of print,
film, optical, and magnetic content would require
roughly 1.5 billion gigabytes of storage. This is
the equivalent of 250 megabytes per person for
each man, woman, and child on earth." (Varian &
Lyman)
"The greatest problem of today is how to teach
people to ignore the irrelevant, how to refuse to
know things, before they are suffocated. For too
many facts are as bad as none at all." (W.H.
Auden)

5
Course Outline

Organization
Overview
Categorization
Knowledge Representation
Metadata Introduction
Controlled Vocabularies Introduction
Thesaurus Design and Construction
Multimedia Information Organization and Retrieval
Metadata for Media
Database Design
XML

Retrieval
Introduction to Search Process
Boolean Queries and Text Processing
Statistical Properties of Text and Vector
Representation
Probabilistic Ranking and Relevance Feedback
Evaluation
Web Search Issues and Architecture
Interfaces for Information Retrieval

6
Key Issues In This Course

How to describe information resources or
information-bearing objects in ways so that they
may be effectively used by those who need to use
them
Organizing
How to find the appropriate information resources
or information-bearing objects for someone's (or
your own) needs
Retrieving

7
Key Issues
8
Modern IR Textbook Topics
9
More Detailed View
10
What We'll Cover
A Lot
A Little
11
IR Topics for 202

The Search Process
Information Retrieval Models
Boolean, Vector, and Probabilistic
Content Analysis/Zipf Distributions
Evaluation of IR Systems
Precision/Recall
Relevance
User Studies
Web-Specific Issues
User Interface Issues
Special Kinds of Search

12
Lecture Overview

Review
MPEG-7
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Prep for Presentations
MMM Status, Web interface, Flamenco

13
The Standard Retrieval Interaction Model
14
Standard Model of IR

Assumptions
The goal is maximizing precision and recall
simultaneously
The information need remains static
The value is in the resulting document set
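
The first assumption can be made concrete with a minimal sketch of how precision and recall are computed for a single query. The function name and the document IDs are illustrative assumptions, not part of the lecture.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7", "d8", "d9"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Retrieving more documents tends to raise recall and lower precision, which is why maximizing both at once is stated as a goal rather than a given.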

15
Problems with Standard Model

Users learn during the search process
Scanning titles of retrieved documents
Reading retrieved documents
Viewing lists of related topics/thesaurus terms
Navigating hyperlinks
Some users don't like long (apparently)
disorganized lists of documents

16
IR is an Iterative Process
17
IR is a Dialog

The exchange doesn't end with the first answer
Users can recognize elements of a useful answer,
even when incomplete
Questions and understanding change as the
process continues

18
Bates' Berry-Picking Model

Standard IR model
Assumes the information need remains the same
throughout the search process
Berry-picking model
Interesting information is scattered like berries
among bushes
The query is continually shifting

19
Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure: a search path passing through successive query states Q0, Q1, Q2, Q3, Q4, Q5]
20
Berry-Picking Model (cont.)

The query is continually shifting
New information may yield new ideas and new
directions
The information need
Is not satisfied by a single, final retrieved set
Is satisfied by a series of selections and bits
of information found along the way
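
The shifting query can be sketched as a loop in which each stop yields a few documents and suggests the next query term. This is a toy illustration only: the collection, the single-term queries, and the rule for choosing the next term (alphabetically first unseen term) are all assumptions made for the example, not Bates' model itself.

```python
# Toy collection: document id -> set of index terms (hypothetical data).
DOCS = {
    "d1": {"berrypicking", "search"},
    "d2": {"search", "browsing"},
    "d3": {"browsing", "hypertext"},
    "d4": {"hypertext", "navigation"},
}


def berry_pick(start_term, max_steps=5):
    """Follow a shifting query, collecting documents along the way."""
    query, seen_terms, basket = start_term, {start_term}, []
    for _ in range(max_steps):
        # Retrieve not-yet-collected documents matching the current query.
        hits = [d for d in sorted(DOCS) if query in DOCS[d] and d not in basket]
        if not hits:
            break
        basket.extend(hits)  # keep the "berries" found at this stop
        # New information suggests a new direction: pick an unseen term.
        new_terms = sorted(set().union(*(DOCS[d] for d in hits)) - seen_terms)
        if not new_terms:
            break
        query = new_terms[0]
        seen_terms.add(query)
    return basket


print(berry_pick("berrypicking"))  # ['d1', 'd2', 'd3', 'd4']
```

No single query in this run retrieves more than one document; the useful set is the accumulation across all the stops.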

21
Information Seeking Behavior

Two parts of a process
Search and retrieval
Analysis and synthesis of search results
This is a fuzzy area
We will look (briefly) at some different
working theories

22
Search Tactics and Strategies

Search Tactics
Bates 1979
Search Strategies
Bates 1989
O'Day and Jeffries 1993

23
Tactics vs. Strategies

Tactic: short-term goals and maneuvers
Operators, actions
Strategy: overall planning
Link a sequence of operators together to achieve
some end

24
Information Search Tactics

Monitoring tactics
Keep search on track
Source-level tactics
Navigate to and within sources
Term and Search Formulation tactics
Designing search formulation
Selection and revision of specific terms within
search formulation

25
Monitoring Tactics (Strategy-Level)

Check
Compare original goal with current state
Weigh
Make a cost/benefit analysis of current or
anticipated actions
Pattern
Recognize common strategies
Correct Errors
Record
Keep track of (incomplete) paths

26
Source-Level Tactics

Bibble
Look for a pre-defined result set
E.g., a good link page on web
Survey
Look ahead, review available options
E.g., don't simply use the first term or first
source that comes to mind
Cut
Eliminate large proportion of search domain
E.g., search on rarest term first
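
The "search on rarest term first" example of the Cut tactic can be sketched against an inverted index. The index contents and the function name below are hypothetical, chosen only to show why starting from the smallest posting list shrinks the search domain fastest.

```python
# Hypothetical inverted index: term -> set of document ids containing it.
INDEX = {
    "information": {1, 2, 3, 4, 5, 6, 7, 8},
    "retrieval": {2, 3, 5, 8},
    "zatocoding": {5},
}


def cut_search(terms, index):
    """AND the terms together, starting from the rarest one."""
    if not terms:
        return [], set()
    # Sort by posting-list size so the candidate set starts as small as possible.
    ordered = sorted(terms, key=lambda t: len(index.get(t, ())))
    candidates = set(index.get(ordered[0], ()))
    for term in ordered[1:]:
        if not candidates:
            break  # nothing left to narrow
        candidates &= index.get(term, set())
    return ordered, candidates


order, found = cut_search(["information", "retrieval", "zatocoding"], INDEX)
print(order)  # ['zatocoding', 'retrieval', 'information']
print(found)  # {5}
```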

27
Search Formulation Tactics

Specify
Use terms that are as specific as possible
Exhaust
Use all possible elements in a query
Reduce
Subtract elements from a query
Parallel
Use synonyms and parallel terms
Pinpoint
Reduce parallel terms and refocus the query
Block
Reject or block some terms, even at the cost
of losing some relevant documents
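
One way to picture several of these tactics is as set operations in a Boolean system. The mapping below (Exhaust as AND, Parallel as OR, Block as NOT) is an interpretation offered for illustration, and the postings are made-up data.

```python
# Hypothetical postings: term -> set of document ids.
POSTINGS = {
    "car": {1, 2, 3},
    "automobile": {3, 4},
    "safety": {2, 3, 4, 5},
    "racing": {3, 5},
}


def docs(term):
    """Return the documents indexed under a term (empty set if unknown)."""
    return POSTINGS.get(term, set())


# EXHAUST: add another element to the query (AND narrows the result).
exhaust = docs("car") & docs("safety")
# REDUCE: subtract an element again (the result broadens).
reduce_ = docs("car")
# PARALLEL: OR in a synonym for one of the concepts.
parallel = (docs("car") | docs("automobile")) & docs("safety")
# BLOCK: reject a term, accepting that relevant documents may be lost.
block = parallel - docs("racing")

print(exhaust, reduce_, parallel, block)
```

In this toy data, blocking "racing" drops document 3, which also matched "car" and "safety": the cost the slide warns about.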

28
Term Tactics

Move around the thesaurus
Superordinate, subordinate, coordinate
Neighbor (semantic or alphabetic)
Trace: pull out terms from information already
seen as part of search (titles, etc.)
Morphological and other spelling variants
Antonyms (contrary)
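
The thesaurus moves (superordinate, subordinate, coordinate) can be sketched over a tiny broader-term table. The vocabulary and function names are hypothetical; a real thesaurus would also carry related-term and use-for links.

```python
# Hypothetical mini-thesaurus: term -> broader term (None at the top).
BROADER = {
    "vehicle": None,
    "car": "vehicle",
    "truck": "vehicle",
    "sedan": "car",
    "convertible": "car",
}


def superordinate(term):
    """Move up: the broader term, if any."""
    return BROADER.get(term)


def subordinate(term):
    """Move down: all narrower terms."""
    return sorted(t for t, b in BROADER.items() if b == term)


def coordinate(term):
    """Move sideways: siblings that share the same broader term."""
    parent = BROADER.get(term)
    if parent is None:
        return []
    return sorted(t for t in subordinate(parent) if t != term)


print(superordinate("car"))  # vehicle
print(subordinate("car"))    # ['convertible', 'sedan']
print(coordinate("car"))     # ['truck']
```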

29
Additional Considerations (Bates 79)

More detail is needed about short-term
cost/benefit decision rule strategies
When to stop?
How to judge when enough information has been
gathered?
How to decide when to give up an unsuccessful
search?
When to stop searching in one source and move to
another?

30
Implications

Search interfaces should make it easy to store
intermediate results
Interfaces should make it easy to follow trails
with unanticipated results (and find your way
back)
This all makes evaluation of the search, the
interface and the search process more difficult

31
More Later

Later in the course
More on Search Process and Strategies
User interfaces to improve IR process
Incorporation of Content Analysis into better
systems

32
Restricted Form of the IR Problem

The system has available only pre-existing,
canned text passages
Its response is limited to selecting from these
passages and presenting them to the user
It must select, say, 10 or 20 passages out of
millions or billions!
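
The selection step can be sketched as a top-k problem: given a score for each passage, keep only the best few. The scores here are invented, and how they are computed is deliberately left outside the sketch.

```python
import heapq


def top_k(scores, k=10):
    """Return the k highest-scoring passage ids, best first.

    `scores` maps passage id -> relevance score from some scoring function
    (the scoring function itself is outside this sketch).
    """
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical scores for a tiny collection.
scores = {"p1": 0.12, "p2": 0.87, "p3": 0.45, "p4": 0.91, "p5": 0.05}
print(top_k(scores, k=2))  # ['p4', 'p2']
```

Using a heap avoids sorting the whole collection, which matters when k is 10 or 20 and the collection runs to millions of passages.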

33
Information Retrieval

Revised Task Statement
Build a system that retrieves documents that
users are likely to find relevant to their
queries
This set of assumptions underlies the field of
Information Retrieval

34
Relevance (Introduction)

In what ways can a document be relevant to a
query?
Answer precise question precisely
Who is buried in Grant's Tomb? Grant.
Partially answer question
Where is Danville? Near Walnut Creek.
Where is Dublin?
Suggest a source for more information.
What is lymphedema? Look in this medical
dictionary.
Give background information
Remind the user of other knowledge
Others...

35
Relevance

"Intuitively, we understand quite well what
relevance means. It is a primitive 'y'know'
concept, as is information, for which we hardly
need a definition. ... if and when any productive
contact in communication is desired,
consciously or not, we involve and use this
intuitive notion of relevance."
(Saracevic, 1975, p. 324)

36
Define your own relevance

Relevance is the (A) gage of relevance of an (B)
aspect of relevance existing between an (C)
object judged and a (D) frame of reference as
judged by an (E) assessor
Where

From Saracevic, 1975 and Schamber 1990
37
A. Gages

Measure
Degree
Extent
Judgement
Estimate
Appraisal
Relation

38
B. Aspect

Utility
Matching
Informativeness
Satisfaction
Appropriateness
Usefulness
Correspondence

39
C. Object judged

Document
Document representation
Reference
Textual form
Information provided
Fact
Article

40
D. Frame of reference

Question
Question representation
Research stage
Information need
Information used
Point of view
Request

41
E. Assessor

Requester
Intermediary
Expert
User
Person
Judge
Information specialist
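
The "define your own relevance" template can be sketched as a fill-in-the-blanks generator over the five lists A to E. The option lists are copied from the slides; the sentence wording is lightly adapted (using "the" throughout) so that every combination reads grammatically, and the function name is an assumption.

```python
from itertools import product

# Option lists A-E as given on the slides.
OPTIONS = {
    "gage": ["measure", "degree", "extent", "judgement", "estimate",
             "appraisal", "relation"],
    "aspect": ["utility", "matching", "informativeness", "satisfaction",
               "appropriateness", "usefulness", "correspondence"],
    "object": ["document", "document representation", "reference",
               "textual form", "information provided", "fact", "article"],
    "frame": ["question", "question representation", "research stage",
              "information need", "information used", "point of view",
              "request"],
    "assessor": ["requester", "intermediary", "expert", "user", "person",
                 "judge", "information specialist"],
}


def define_relevance(gage, aspect, obj, frame, assessor):
    """Fill the five slots of the definition template."""
    return (f"Relevance is the {gage} of {aspect} existing between the {obj} "
            f"and the {frame} as judged by the {assessor}.")


all_definitions = [define_relevance(*combo) for combo in product(*OPTIONS.values())]
print(len(all_definitions))  # 16807
print(all_definitions[0])
```

Seven choices in each of five slots give 7^5 = 16,807 possible definitions, which is one way to see why "relevance" resists a single agreed meaning.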

42
Lecture Overview

Review
MPEG-7
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
(view from 100,000 Ft.)
Discussion
Prep for Presentations
MMM Status, Web interface, Flamenco

43
Visions of IR Systems

Rev. John Wilkins, 1600s: The Philosophic
Language and tables
Wilhelm Ostwald and Paul Otlet, 1910s: The
monographic principle and Universal
Classification
Emanuel Goldberg, 1920s - 1940s
H.G. Wells, World Brain: The idea of a permanent
World Encyclopedia. (Introduction to the
Encyclopédie Française, 1937)
Vannevar Bush, "As We May Think." Atlantic
Monthly, 1945.

44
Card-Based IR Systems

Uniterm (Casey, Perry, Berry, Kent 1958)
(Developed and used from the mid-1940s)

[Figure: two Uniterm cards, headed EXCURSION and LUNAR, each listing the numbers of the documents posted under that term]
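
A search with Uniterm cards amounts to pulling the card for each term and looking for document numbers that appear on all of them. The sketch below shows that coordination step as a set intersection; the card contents are illustrative numbers only, and the function name is an assumption.

```python
# Hypothetical Uniterm cards: term -> document numbers posted on that card.
CARDS = {
    "EXCURSION": {17, 44, 52, 90, 241},
    "LUNAR": {12, 17, 44, 110, 241},
}


def coordinate_terms(cards, *terms):
    """Return the document numbers that appear on every named card."""
    if not terms:
        return []
    postings = [cards[t] for t in terms]
    return sorted(set.intersection(*postings))


print(coordinate_terms(CARDS, "EXCURSION", "LUNAR"))  # [17, 44, 241]
```

This is the same operation a Boolean AND performs in the computer-based systems that followed.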
45
Card Systems

Batten Optical Coincidence Cards (Peek-a-Boo
Cards), 1948

46
Card Systems

Zatocode (edge-notched cards), Mooers, 1951

47
Computer-Based Systems

Bagley's 1951 MS thesis from MIT suggested that
searching 50 million item records, each
containing 30 index terms, would take
approximately 41,700 hours
Due to the need to move and shift the text in
core memory while carrying out the comparisons
1957: Desk Set, with Katharine Hepburn and
Spencer Tracy (EMERAC)
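
Taking the figures in Bagley's estimate at face value, a quick calculation shows what they imply per record and per index term. This is plain arithmetic on the numbers quoted on the slide, not a reconstruction of the thesis's own analysis.

```python
records = 50_000_000      # item records to search (from the slide)
terms_per_record = 30     # index terms per record (from the slide)
total_hours = 41_700      # estimated search time (from the slide)

total_seconds = total_hours * 3600
per_record = total_seconds / records
per_term = per_record / terms_per_record

print(f"{per_record:.2f} s per record, {per_term:.2f} s per term comparison")
print(f"{total_hours / 24 / 365:.1f} years of continuous running")
```

That works out to roughly 3 seconds per record, about a tenth of a second per term, and close to five years of uninterrupted machine time for a single search.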

48
Historical Milestones in IR Research

1958 Statistical Language Properties (Luhn)
1960 Probabilistic Indexing (Maron & Kuhns)
1961 Term association and clustering (Doyle)
1965 Vector Space Model (Salton)
1968 Query expansion (Rocchio, Salton)
1972 Statistical Weighting (Sparck Jones)
1975 2-Poisson Model (Harter, Bookstein,
Swanson)
1976 Relevance Weighting (Robertson,
Sparck Jones)
1980 Fuzzy sets (Bookstein)
1981 Probability without training (Croft)

49
Historical Milestones in IR Research (cont.)

1983 Linear Regression (Fox)
1983 Probabilistic Dependence (Salton, Yu)
1985 Generalized Vector Space Model (Wong,
Raghavan)
1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et
al.)
1990 Latent Semantic Indexing (Dumais,
Deerwester)
1991 Polynomial Logistic Regression (Cooper,
Gey, Fuhr)
1992 TREC (Harman)
1992 Inference networks (Turtle, Croft)
1994 Neural networks (Kwok)

50
Boolean IR Systems

Synthex at SDC, 1960
Project MAC at MIT, 1963 (interactive)
BOLD at SDC, 1964 (Harold Borko)
1964 New York World's Fair: Becker and Hayes
produced a system to answer questions (based on
airline reservation equipment)
SDC began production for a commercial service in
1967: ORBIT
NASA-RECON (1966) becomes DIALOG
1972: Data Central/Mead introduced LEXIS (full
text)
Online catalogs: late 1970s and 1980s

51
The Internet and the WWW

Gopher, Archie, Veronica, WAIS
Tim Berners-Lee, 1991: creates WWW at CERN
(originally hypertext only)
WebCrawler
Lycos
Alta Vista
Inktomi
Google
(and many others)

52
Information Retrieval Historical View

Research
Boolean model, statistics of language (1950s)
Vector space model, probabilistic indexing,
relevance feedback (1960s)
Probabilistic querying (1970s)
Fuzzy set/logic, evidential reasoning (1980s)
Regression, neural nets, inference networks,
latent semantic indexing, TREC (1990s)

Industry
DIALOG, Lexis-Nexis,
STAIRS (Boolean based)
Information industry (O(B))
Verity TOPIC (fuzzy logic)
Internet search engines (O(100B?)) (vector
space, probabilistic)

53
Lecture Overview

Review
MPEG-7
Introduction to Information Retrieval
The Information Seeking Process
Information Retrieval History and Developments
Discussion
Prep for Presentations
MMM Status, Web interface, Flamenco

54
Discussion: Joe Hall on MIR

Why does there have to be such a schism between
computer-centered and human-centered IR? Would
it not be wiser to approach IR from both
directions simultaneously?
How do you find information on a regular basis?
Is Google your first-order attack? What do you
do when Google doesn't return anything useful...
for example, if Kate was looking for information
on music from "The The" or "Peaches"? What are
some useful, domain-specific tools out there that
you use (like IMDB, or The All Music Guide)?

55
Discussion: Joe Hall on MIR

What would a Venn diagram of Information
Retrieval and Information Organization look like?
With systems like Google that rely on a very
simplistic ranking system, complex Information
Organization seems unnecessary for certain
types of information. There seems to be an IO/IR
trade-off here... that is, the more organized
your information, the less sophisticated a
retrieval system needs to be.

56
Paul Laskowski on Berlin

How many people can participate in a group
memory? I would happily share my 202-related
emails with my phone project group (Go
MonkeyBots!!!), but I might want to be more
selective when writing to the entire class;
there might be strange people here I haven't met
yet. Can a group memory benefit from some notion
of social distance and privacy?

57
Paul Laskowski on Berlin

TeamInfo demonstrates that separating discussions
into categories is difficult, and expensive to
maintain. Part of the problem is that categories
are always evolving. Is there a way to exploit
references, keywords, or shared language among
emails to automatically infer a structure in
subject space?

58
David Schlossberg on Munro

While the article points out that we lack
knowledge in social navigation, it implies we
also lack technology to make this social
navigation possible. Are improvements in social
navigation limited by current technology? If so,
what innovations are needed to make those
improvements? What are the limits of technology
to solve these problems?

59
David Schlossberg on Munro

What information domains lend themselves best to
social navigation? Which domains are not well
suited for social navigation? Another way of
thinking about this is where would you like to
see changes in interaction or information
retrieval with your computer? For instance, the
article mentions that chatting could be much more
natural with avatars or virtual spaces.

60
David Schlossberg on Munro

One example of existing social navigation is how
Google does its ranking based on how people
previously chose from the search results. What
other examples of social navigation of
information space already exist either on the
Internet or in the physical world?

61
Lecture Overview