Prof. Ray Larson - PowerPoint PPT Presentation

About This Presentation
Title:

Prof. Ray Larson

Description:

Title: PowerPoint Presentation Author: Valued Gateway Client Last modified by: ray Created Date: 9/3/2002 3:52:45 AM Document presentation format – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Prof. Ray Larson


1
Lecture 16: Intro to Information Retrieval
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Fall 2003
  • http://www.sims.berkeley.edu/academics/courses/is202/f03/

2
Lecture Overview
  • Review
  • MPEG-7
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • Information Retrieval History and Developments
  • Discussion
  • Prep for Presentations
  • MMM Status, Web interface, Flamenco

Credit for some of the slides in this lecture
goes to Marti Hearst and Fred Gey
4
Review: Information Overload
  • "The world's total yearly production of print,
    film, optical, and magnetic content would require
    roughly 1.5 billion gigabytes of storage. This is
    the equivalent of 250 megabytes per person for
    each man, woman, and child on earth." (Varian &
    Lyman)
  • "The greatest problem of today is how to teach
    people to ignore the irrelevant, how to refuse to
    know things, before they are suffocated. For too
    many facts are as bad as none at all." (W.H.
    Auden)
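
The two figures in the first quote can be checked against each other. The short calculation below is only a sanity check; it assumes decimal units (1 gigabyte = 1000 megabytes), which the slide does not state, and the variable names are illustrative.

```python
# Figures taken from the quote above.
total_gigabytes = 1_500_000_000   # "roughly 1.5 billion gigabytes"
megabytes_per_person = 250        # "250 megabytes per person"

# Assumption (not in the slide): decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000

# Population implied by dividing the total by the per-person share.
implied_population = total_gigabytes * MB_PER_GB // megabytes_per_person
print(f"{implied_population:,}")  # 6,000,000,000
```

Under that assumption the quote implies a world population of about 6 billion.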

5
Course Outline
  • Organization
  • Overview
  • Categorization
  • Knowledge Representation
  • Metadata Introduction
  • Controlled Vocabularies Introduction
  • Thesaurus Design and Construction
  • Multimedia Information Organization and Retrieval
  • Metadata for Media
  • Database Design
  • XML
  • Retrieval
  • Introduction to Search Process
  • Boolean Queries and Text Processing
  • Statistical Properties of Text and Vector
    Representation
  • Probabilistic Ranking and Relevance Feedback
  • Evaluation
  • Web Search Issues and Architecture
  • Interfaces for Information Retrieval

6
Key Issues In This Course
  • How to describe information resources or
    information-bearing objects in ways so that they
    may be effectively used by those who need to use
    them
  • Organizing
  • How to find the appropriate information resources
    or information-bearing objects for someone's (or
    your own) needs
  • Retrieving

7
Key Issues
8
Modern IR Textbook Topics
9
More Detailed View
10
What We'll Cover
A Lot
A Little
11
IR Topics for 202
  • The Search Process
  • Information Retrieval Models
  • Boolean, Vector, and Probabilistic
  • Content Analysis/Zipf Distributions
  • Evaluation of IR Systems
  • Precision/Recall
  • Relevance
  • User Studies
  • Web-Specific Issues
  • User Interface Issues
  • Special Kinds of Search

12
Lecture Overview
  • Review
  • MPEG-7
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • Information Retrieval History and Developments
  • Discussion
  • Prep for Presentations
  • MMM Status, Web interface, Flamenco

13
The Standard Retrieval Interaction Model
14
Standard Model of IR
  • Assumptions
  • The goal is maximizing precision and recall
    simultaneously
  • The information need remains static
  • The value is in the resulting document set
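
The first assumption refers to precision and recall, which are usually defined over sets of documents. The sketch below is a minimal illustration of those set-based definitions; the function name and the document IDs are made up, and it assumes relevance is a simple yes/no judgment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved documents that are also relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d9", "d10"]

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were retrieved (recall 0.4).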

15
Problems with Standard Model
  • Users learn during the search process
  • Scanning titles of retrieved documents
  • Reading retrieved documents
  • Viewing lists of related topics/thesaurus terms
  • Navigating hyperlinks
  • Some users don't like long (apparently)
    disorganized lists of documents

16
IR is an Iterative Process
17
IR is a Dialog
  • The exchange doesn't end with the first answer
  • Users can recognize elements of a useful answer,
    even when incomplete
  • Questions and understanding change as the
    process continues

18
Bates' Berry-Picking Model
  • Standard IR model
  • Assumes the information need remains the same
    throughout the search process
  • Berry-picking model
  • Interesting information is scattered like berries
    among bushes
  • The query is continually shifting

19
Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure: search path through successive queries Q0 to Q5]
20
Berry-Picking Model (cont.)
  • The query is continually shifting
  • New information may yield new ideas and new
    directions
  • The information need
  • Is not satisfied by a single, final retrieved set
  • Is satisfied by a series of selections and bits
    of information found along the way
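
The contrast with a single retrieved set can be shown with a toy simulation. This is only a sketch of the idea, not Bates's model itself: the collection, the one-pick-per-round rule, and the way the query shifts toward newly seen terms are all assumptions made for illustration.

```python
# Toy collection: hypothetical document IDs mapped to their index terms.
DOCS = {
    "d1": {"berry", "picking", "search"},
    "d2": {"search", "tactics"},
    "d3": {"tactics", "thesaurus"},
    "d4": {"thesaurus", "terms"},
}


def search(query):
    """Return IDs of documents sharing at least one term with the query."""
    return sorted(d for d, terms in DOCS.items() if terms & query)


def berry_pick(query, rounds):
    """Collect documents over several rounds; each pick reshapes the query."""
    basket = []
    for _ in range(rounds):
        new = [d for d in search(query) if d not in basket]
        if not new:
            break
        picked = new[0]  # the searcher keeps one "berry" per round
        basket.append(picked)
        # Shift the query toward terms seen for the first time in the pick.
        query = (DOCS[picked] - query) or query
    return basket


print(search({"berry"}))         # ['d1']
print(berry_pick({"berry"}, 4))  # ['d1', 'd2', 'd3', 'd4']
```

A single query for "berry" finds one document, while the shifting query gathers four along the way.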

21
Information Seeking Behavior
  • Two parts of a process
  • Search and retrieval
  • Analysis and synthesis of search results
  • This is a fuzzy area
  • We will look (briefly) at some different
    working theories

22
Search Tactics and Strategies
  • Search Tactics
  • Bates 1979
  • Search Strategies
  • Bates 1989
  • O'Day and Jeffries 1993

23
Tactics vs. Strategies
  • Tactic: short-term goals and maneuvers
  • Operators, actions
  • Strategy: overall planning
  • Link a sequence of operators together to achieve
    some end

24
Information Search Tactics
  • Monitoring tactics
  • Keep search on track
  • Source-level tactics
  • Navigate to and within sources
  • Term and Search Formulation tactics
  • Designing search formulation
  • Selection and revision of specific terms within
    search formulation

25
Monitoring Tactics (Strategy-Level)
  • Check
  • Compare original goal with current state
  • Weigh
  • Make a cost/benefit analysis of current or
    anticipated actions
  • Pattern
  • Recognize common strategies
  • Correct Errors
  • Record
  • Keep track of (incomplete) paths

26
Source-Level Tactics
  • Bibble
  • Look for a pre-defined result set
  • E.g., a good link page on web
  • Survey
  • Look ahead, review available options
  • E.g., don't simply use the first term or first
    source that comes to mind
  • Cut
  • Eliminate large proportion of search domain
  • E.g., search on rarest term first
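
The "search on rarest term first" example for Cut can be sketched with a small inverted index. The index contents and the function name below are hypothetical; the point is only that starting with the shortest posting set shrinks the candidate set as early as possible.

```python
# Hypothetical inverted index: term -> set of document numbers.
INDEX = {
    "information": {1, 2, 3, 4, 5, 6, 7, 8},
    "retrieval": {2, 3, 5, 8},
    "uniterm": {5},
}


def and_query(terms, index):
    """Intersect posting sets, starting with the rarest term (the Cut tactic)."""
    # Order terms by how many documents they occur in, fewest first.
    ordered = sorted(terms, key=lambda t: len(index.get(t, ())))
    result = None
    for term in ordered:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        if not result:
            break  # nothing left to narrow down
    return (result or set()), ordered


docs, order = and_query(["information", "retrieval", "uniterm"], INDEX)
print(order)  # ['uniterm', 'retrieval', 'information']
print(docs)   # {5}
```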

27
Search Formulation Tactics
  • Specify
  • Use as specific terms as possible
  • Exhaust
  • Use all possible elements in a query
  • Reduce
  • Subtract elements from a query
  • Parallel
  • Use synonyms and parallel terms
  • Pinpoint
  • Reducing parallel terms and refocusing query
  • Block
  • To reject or block some terms, even at the cost
    of losing some relevant documents
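
One possible reading of these tactics is as edits to a Boolean query. The sketch below is an interpretation, not something the slides specify: it assumes a query is an AND over concept groups, an OR within each group, and a NOT over blocked terms. The class and method names are invented for illustration, and Specify is left out because it is a choice of vocabulary rather than a structural edit.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """AND over concept groups, OR within a group, NOT over blocked terms."""
    groups: list = field(default_factory=list)  # each group: set of parallel terms
    blocked: set = field(default_factory=set)

    def exhaust(self, term):
        """EXHAUST: add another required element to the query."""
        self.groups.append({term})
        return self

    def reduce(self, term):
        """REDUCE: drop the required element that contains this term."""
        self.groups = [g for g in self.groups if term not in g]
        return self

    def parallel(self, term, synonym):
        """PARALLEL: widen a concept with a synonym or parallel term."""
        for g in self.groups:
            if term in g:
                g.add(synonym)
        return self

    def pinpoint(self, term):
        """PINPOINT: remove a parallel term but keep the concept itself."""
        for g in self.groups:
            if term in g and len(g) > 1:
                g.discard(term)
        return self

    def block(self, term):
        """BLOCK: reject any document containing this term."""
        self.blocked.add(term)
        return self

    def matches(self, doc_terms):
        """True if every group is satisfied and no blocked term appears."""
        doc_terms = set(doc_terms)
        return (all(g & doc_terms for g in self.groups)
                and not (self.blocked & doc_terms))


q = Query().exhaust("car").exhaust("repair")
q.parallel("car", "automobile").block("toy")
print(q.matches({"automobile", "repair", "manual"}))  # True
print(q.matches({"car", "repair", "toy"}))            # False
```

The second document is rejected by Block even though it is otherwise a match, which is the cost the last bullet mentions.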

28
Term Tactics
  • Move around the thesaurus
  • Superordinate, subordinate, coordinate
  • Neighbor (semantic or alphabetic)
  • Trace: pull out terms from information already
    seen as part of search (titles, etc.)
  • Morphological and other spelling variants
  • Antonyms (contrary)

29
Additional Considerations (Bates 79)
  • More detail is needed about short-term
    cost/benefit decision rule strategies
  • When to stop?
  • How to judge when enough information has been
    gathered?
  • How to decide when to give up an unsuccessful
    search?
  • When to stop searching in one source and move to
    another?

30
Implications
  • Search interfaces should make it easy to store
    intermediate results
  • Interfaces should make it easy to follow trails
    with unanticipated results (and find your way
    back)
  • This all makes evaluation of the search, the
    interface and the search process more difficult

31
More Later
  • Later in the course
  • More on Search Process and Strategies
  • User interfaces to improve IR process
  • Incorporation of Content Analysis into better
    systems

32
Restricted Form of the IR Problem
  • The system has available only pre-existing,
    canned text passages
  • Its response is limited to selecting from these
    passages and presenting them to the user
  • It must select, say, 10 or 20 passages out of
    millions or billions!
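
The restricted problem can be sketched as a selection function: score every stored passage against the query and return the best few, without composing any new text. Word overlap is used below only as a stand-in scoring rule; the slides do not prescribe one, and the function name and sample passages are hypothetical.

```python
import heapq
import re


def words(text):
    """Lowercase a string and split it into a set of simple word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def top_passages(query, passages, k=10):
    """Select the k stored passages sharing the most distinct words with the query."""
    q_words = words(query)
    scored = [(len(q_words & words(p)), p) for p in passages]
    # Keep only passages with some overlap, then take the k highest scores.
    candidates = (sp for sp in scored if sp[0] > 0)
    best = heapq.nlargest(k, candidates, key=lambda sp: sp[0])
    return [p for _, p in best]


# Hypothetical canned passages.
passages = [
    "Grant is buried in Grant's tomb",
    "Danville is near Walnut Creek",
    "A medical dictionary defines many terms",
]

print(top_passages("Where is Danville?", passages, k=1))
# ['Danville is near Walnut Creek']
```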

33
Information Retrieval
  • Revised Task Statement
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries
  • This set of assumptions underlies the field of
    Information Retrieval

34
Relevance (Introduction)
  • In what ways can a document be relevant to a
    query?
  • Answer precise question precisely
  • Who is buried in Grant's tomb? Grant.
  • Partially answer question
  • Where is Danville? Near Walnut Creek.
  • Where is Dublin?
  • Suggest a source for more information.
  • What is lymphedema? Look in this Medical
    Dictionary.
  • Give background information
  • Remind the user of other knowledge
  • Others...

35
Relevance
  • "Intuitively, we understand quite well what
    relevance means. It is a primitive 'y' know'
    concept, as is information for which we hardly
    need a definition. ... if and when any productive
    contact in communication is desired,
    consciously or not, we involve and use this
    intuitive notion of relevance."
  • Saracevic, 1975, p. 324

36
Define your own relevance
  • Relevance is the (A) gage of relevance of an (B)
    aspect of relevance existing between an (C)
    object judged and a (D) frame of reference as
    judged by an (E) assessor
  • Where

From Saracevic, 1975 and Schamber 1990
37
A. Gages
  • Measure
  • Degree
  • Extent
  • Judgement
  • Estimate
  • Appraisal
  • Relation

38
B. Aspect
  • Utility
  • Matching
  • Informativeness
  • Satisfaction
  • Appropriateness
  • Usefulness
  • Correspondence

39
C. Object judged
  • Document
  • Document representation
  • Reference
  • Textual form
  • Information provided
  • Fact
  • Article

40
D. Frame of reference
  • Question
  • Question representation
  • Research stage
  • Information need
  • Information used
  • Point of view
  • Request

41
E. Assessor
  • Requester
  • Intermediary
  • Expert
  • User
  • Person
  • Judge
  • Information specialist
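
The "define your own relevance" template and the five lists (A through E) can be combined mechanically. The sketch below fills the template with every combination of choices; the lists are copied from the slides, while the function name and the "a(n)" article handling are illustrative simplifications.

```python
from itertools import product

GAGES = ["measure", "degree", "extent", "judgement", "estimate",
         "appraisal", "relation"]
ASPECTS = ["utility", "matching", "informativeness", "satisfaction",
           "appropriateness", "usefulness", "correspondence"]
OBJECTS = ["document", "document representation", "reference",
           "textual form", "information provided", "fact", "article"]
FRAMES = ["question", "question representation", "research stage",
          "information need", "information used", "point of view", "request"]
ASSESSORS = ["requester", "intermediary", "expert", "user", "person",
             "judge", "information specialist"]


def define_relevance(gage, aspect, obj, frame, assessor):
    """Fill the five slots of the template with one choice from each list."""
    return (f"Relevance is the {gage} of {aspect} existing between a(n) {obj} "
            f"and a(n) {frame} as judged by a(n) {assessor}.")


definitions = [define_relevance(*choice)
               for choice in product(GAGES, ASPECTS, OBJECTS, FRAMES, ASSESSORS)]

print(len(definitions))  # 16807, i.e. 7 ** 5
print(definitions[0])
```

With seven options in each of the five slots, the template yields 16,807 distinct definitions, which suggests how loosely the term is pinned down.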

42
Lecture Overview
  • Review
  • MPEG-7
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • Information Retrieval History and Developments
    (view from 100,000 Ft.)
  • Discussion
  • Prep for Presentations
  • MMM Status, Web interface, Flamenco

43
Visions of IR Systems
  • Rev. John Wilkins, 1600s: The Philosophic
    Language and tables
  • Wilhelm Ostwald and Paul Otlet, 1910s: The
    monographic principle and Universal
    Classification
  • Emanuel Goldberg, 1920s - 1940s
  • H.G. Wells, "World Brain: The Idea of a Permanent
    World Encyclopedia." (Introduction to the
    Encyclopédie Française, 1937)
  • Vannevar Bush, "As We May Think." Atlantic
    Monthly, 1945.

44
Card-Based IR Systems
  • Uniterm (Casey, Perry, Berry, Kent 1958)
  • (Developed and used from the mid-1940s)

[Figure: sample Uniterm cards for the terms EXCURSION and LUNAR, each listing document numbers]
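
A search with two Uniterm cards amounts to finding the numbers that appear on both. The sketch below uses the numbers shown on the EXCURSION and LUNAR cards in the slide; it assumes they are document numbers posted under each term, and it leaves out the leading five-digit number on each card because its role is not clear from the transcript.

```python
# Numbers transcribed from the two sample cards (assumed to be document numbers).
EXCURSION = {
    90, 241, 52, 63, 34, 25, 66, 17, 58, 49,
    130, 281, 92, 83, 44, 75, 86, 57, 88, 119,
    640, 122, 93, 104, 115, 146, 97, 158, 139, 870,
    342, 157, 178, 199, 207, 248, 269, 298,
}
LUNAR = {
    110, 181, 12, 73, 44, 15, 46, 7, 28, 39,
    430, 241, 42, 113, 74, 85, 76, 17, 78, 79,
    820, 761, 602, 233, 134, 95, 136, 37, 118, 109,
    901, 982, 194, 165, 127, 198, 179, 377, 288, 407,
}

# Documents indexed under both terms: the numbers common to both cards.
common = sorted(EXCURSION & LUNAR)
print(common)  # [17, 44, 241]
```

On paper the same comparison was done by eye, scanning the two cards side by side for matching numbers.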
45
Card Systems
  • Batten Optical Coincidence Cards (Peek-a-Boo
    Cards), 1948

46
Card Systems
  • Zatocode (edge-notched cards) Mooers, 1951

47
Computer-Based Systems
  • Bagley's 1951 MS thesis from MIT suggested that
    searching 50 million item records, each
    containing 30 index terms, would take
    approximately 41,700 hours
  • Due to the need to move and shift the text in
    core memory while carrying out the comparisons
  • 1957: Desk Set, with Katharine Hepburn and
    Spencer Tracy (EMERAC)
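
Bagley's figures above can be turned into a rough per-comparison cost. The calculation below assumes one comparison per stored index term in a single pass over all records; that reading is an assumption, since the slide gives only the totals.

```python
# Figures from the slide.
records = 50_000_000
terms_per_record = 30
hours = 41_700

# Assumption: one comparison per stored index term.
comparisons = records * terms_per_record
seconds_per_comparison = hours * 3600 / comparisons
years = hours / (24 * 365)

print(f"{comparisons:,} comparisons")                       # 1,500,000,000 comparisons
print(f"{seconds_per_comparison:.3f} s per comparison")     # 0.100 s per comparison
print(f"{years:.2f} years of continuous running")           # 4.76 years of continuous running
```

Under that assumption the estimate works out to about a tenth of a second per term comparison, or nearly five years of uninterrupted machine time for one pass.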

48
Historical Milestones in IR Research
  • 1958 Statistical Language Properties (Luhn)
  • 1960 Probabilistic Indexing (Maron & Kuhns)
  • 1961 Term association and clustering (Doyle)
  • 1965 Vector Space Model (Salton)
  • 1968 Query expansion (Rocchio, Salton)
  • 1972 Statistical Weighting (Sparck Jones)
  • 1975 2-Poisson Model (Harter, Bookstein,
    Swanson)
  • 1976 Relevance Weighting (Robertson,
    Sparck Jones)
  • 1980 Fuzzy sets (Bookstein)
  • 1981 Probability without training (Croft)

49
Historical Milestones in IR Research (cont.)
  • 1983 Linear Regression (Fox)
  • 1983 Probabilistic Dependence (Salton, Yu)
  • 1985 Generalized Vector Space Model (Wong,
    Raghavan)
  • 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et
    al.)
  • 1990 Latent Semantic Indexing (Dumais,
    Deerwester)
  • 1991 Polynomial Logistic Regression (Cooper,
    Gey, Fuhr)
  • 1992 TREC (Harman)
  • 1992 Inference networks (Turtle, Croft)
  • 1994 Neural networks (Kwok)

50
Boolean IR Systems
  • Synthex at SDC, 1960
  • Project MAC at MIT, 1963 (interactive)
  • BOLD at SDC, 1964 (Harold Borko)
  • 1964 New York World's Fair: Becker and Hayes
    produced a system to answer questions (based on
    airline reservation equipment)
  • SDC began production for a commercial service in
    1967: ORBIT
  • NASA-RECON (1966) becomes DIALOG
  • 1972: Data Central/Mead introduced LEXIS (full
    text)
  • Online catalogs: late 1970s and 1980s

51
The Internet and the WWW
  • Gopher, Archie, Veronica, WAIS
  • Tim Berners-Lee, 1991: creates WWW at CERN,
    originally hypertext only
  • WebCrawler
  • Lycos
  • Alta Vista
  • Inktomi
  • Google
  • (and many others)

52
Information Retrieval: Historical View
Research
  • Boolean model, statistics of language (1950s)
  • Vector space model, probabilistic indexing,
    relevance feedback (1960s)
  • Probabilistic querying (1970s)
  • Fuzzy set/logic, evidential reasoning (1980s)
  • Regression, neural nets, inference networks,
    latent semantic indexing, TREC (1990s)
Industry
  • DIALOG, Lexis-Nexis
  • STAIRS (Boolean based)
  • Information industry (O(B))
  • Verity TOPIC (fuzzy logic)
  • Internet search engines (O(100B?)) (vector
    space, probabilistic)

53
Lecture Overview
  • Review
  • MPEG-7
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • Information Retrieval History and Developments
  • Discussion
  • Prep for Presentations
  • MMM Status, Web interface, Flamenco

54
Discussion: Joe Hall on MIR
  • Why does there have to be such a schism between
    computer-centered and human-centered IR? Would
    it not be wiser to approach IR from both
    directions simultaneously?
  • How do you find information on a regular basis?
    Is Google your first-order attack? What do you
    do when Google doesn't return anything useful...
    for example, if Kate was looking for information
    on music from "The The" or "Peaches"? What are
    some useful, domain-specific tools out there that
    you use (like IMDB, or The All Music Guide)?

55
Discussion: Joe Hall on MIR
  • What would a Venn diagram of Information
    Retrieval and Information Organization look like?
    With systems like Google that rely on a very
    simplistic ranking system, complex Information
    Organization seems unnecessary for certain
    types of information. There seems to be an IO/IR
    trade-off here... that is, the more organized
    your information, the less sophisticated a
    retrieval system needs to be.

56
Paul Laskowski on Berlin
  • How many people can participate in a group
    memory? I would happily share my 202-related
    emails with my phone project group (Go
    MonkeyBots!!!), but I might want to be more
    selective when writing to the entire class:
    there might be strange people here I haven't met
    yet. Can a group memory benefit from some notion
    of social distance and privacy?

57
Paul Laskowski on Berlin
  • TeamInfo demonstrates that separating discussions
    into categories is difficult, and expensive to
    maintain. Part of the problem is that categories
    are always evolving. Is there a way to exploit
    references, keywords, or shared language among
    emails to automatically infer a structure in
    subject space?

58
David Schlossberg on Munro
  • While the article points out that we lack
    knowledge in social navigation, it implies we
    also lack technology to make this social
    navigation possible. Are improvements in social
    navigation limited by current technology? If so,
    what innovations are needed to make those
    improvements? What are the limits of Technology
    to solve these problems?

59
David Schlossberg on Munro
  • What information domains lend themselves best to
    social navigation? Which domains are not well
    suited for social navigation? Another way of
    thinking about this is where would you like to
    see changes in interaction or information
    retrieval with your computer? For instance, the
    article mentions that chatting could be much more
    natural with avatars or virtual spaces.

60
David Schlossberg on Munro
  • One example of existing social navigation is how
    Google does its ranking based on how people
    previously chose from the search results. What
    other examples of social navigation of
    information space already exist either on the
    Internet or in the physical world?

61
Lecture Overview
  • Review
  • MPEG-7
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • Information Retrieval History and Developments
  • Discussion
  • Prep for Presentations
  • MMM Status, Web interface, Flamenco

62
Next Time
  • Project Presentations
  • (no readings)