Prof. Ray Larson - PowerPoint PPT Presentation

Title: PowerPoint Presentation Author: Valued Gateway Client Last modified by: Ray R. Larson Created Date: 9/3/2002 3:52:45 AM Document presentation format – PowerPoint PPT presentation

Slides: 56

1
Lecture 4: Boolean IR and Text Processing
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Fall 2004
  • http://www.sims.berkeley.edu/academics/courses/is202/f04/

2
Advertisement
  • Not doing anything on Friday afternoon?
  • Please come to the Friday Afternoon Seminar
    Open to ALL
  • This Week:
  • Clifford Lynch, director of the Coalition for
    Networked Information and Adjunct Professor of
    SIMS, on "Research Questions in Digital
    Stewardship"
  • See:
  • http://www.sims.berkeley.edu/academics/courses/is296a-1/f04/

3
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

Credit for some of the slides in this lecture
goes to Marti Hearst
4
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

5
IR is an Iterative Process
6
Berry-Picking Model
A sketch of a searcher moving through many
actions towards a general goal of satisfactory
completion of research related to an information
need. (after Bates 89)
[Figure: the search path passes through a sequence of queries, Q0 through Q5]
7
Restricted Form of the IR Problem
  • The system has available only pre-existing,
    canned text passages
  • Its response is limited to selecting from these
    passages and presenting them to the user
  • It must select, say, 10 or 20 passages out of
    millions or billions!

8
Information Retrieval
  • Revised Task Statement:
  • Build a system that retrieves documents that
    users are likely to find relevant to their
    queries
  • This set of assumptions underlies the field of
    Information Retrieval

9
Paradox
  • The fundamental paradox of Information
    Retrieval, as stated by Roland Hjerppe:
  • The need to describe that which you do not know
    in order to find it

10
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

11
Structure of an IR System
  • Search Line
  • Interest profiles & Queries
  • Formulating query in terms of descriptors
  • Storage of profiles
  • Store1: Profiles/Search requests
  • Storage Line
  • Documents & data
  • Indexing (Descriptive and Subject)
  • Storage of Documents
  • Store2: Document representations
  • Rules of the game: Rules for subject indexing and
    Thesaurus (which consists of Lead-In Vocabulary
    and Indexing Language)
  • Comparison/Matching
  • Potentially Relevant Documents
Adapted from Soergel, p. 19
12
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

13
Central Concepts in IR
  • Documents
  • Queries
  • Collections
  • Evaluation
  • Relevance

14
Documents
  • What do we mean by a document?
  • Full document?
  • Document surrogates?
  • Pages?
  • Buckland (JASIS, Sept. 1997): "What is a Document?"
  • Are IR systems better called Document Retrieval
    systems?
  • A document is a representation of some
    aggregation of information, treated as a unit

15
Collection
  • A collection is some physical or logical
    aggregation of documents
  • A database
  • A Library
  • An index?
  • Others?

16
Queries
  • A query is some expression of a user's
    information needs
  • Can take many forms
  • Natural language description of need
  • Formal query in a query language
  • Queries may not be accurate expressions of the
    information need
  • Differences between conversation with a person
    and formal query expression

17
Evaluation: Why Evaluate?
  • Determine if the system is desirable
  • Make comparative assessments
  • Others?

18
What To Evaluate?
  • How much of the information need was satisfied
  • How much was learned about a topic
  • Incidental learning
  • How much was learned about the collection
  • How much was learned about other topics
  • How inviting the system is

19
What To Evaluate?
  • What can be measured that reflects users'
    ability to use the system? (Cleverdon 66)
  • Coverage of information
  • Form of presentation
  • Effort required/ease of use
  • Time and space efficiency
  • Recall: proportion of relevant material actually
    retrieved
  • Precision: proportion of retrieved material
    actually relevant
  • Recall and precision together measure
    effectiveness
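
The two effectiveness measures can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are identified by ids and that the set of relevant documents is already known; the function name and the sample ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(p, r)  # 0.5 0.4
```

Returning 0.0 for an empty set is a convention chosen here to avoid division by zero, not something the measures themselves define.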
20
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

21
Query Languages
  • A way to express the question (information need)
  • Types
  • Boolean
  • Natural Language
  • Stylized Natural Language
  • Form-Based (GUI)

22
Simple Query Language: Boolean
  • Terms + Operators
  • Terms
  • Words
  • Normalized (stemmed) words
  • Phrases
  • Thesaurus terms
  • Boolean Operators
  • AND
  • OR
  • NOT

23
Boolean Queries
  • Cat
  • Cat OR Dog
  • Cat AND Dog
  • (Cat AND Dog)
  • (Cat AND Dog) OR Collar
  • (Cat AND Dog) OR (Collar AND Leash)
  • (Cat OR Dog) AND (Collar OR Leash)
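
One way to see what these queries return is to treat each term as the set of documents that contain it, so that AND, OR and NOT become set intersection, union and difference. The sketch below assumes a tiny invented collection of five documents; the helper name `t` and the document texts are illustrative only.

```python
# Toy collection (invented for illustration): doc id -> text
docs = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "collar leash",
    5: "cat dog collar leash",
}

# Inverted index: term -> set of ids of the documents containing it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def t(term):
    """Posting set for a term (empty if the term never occurs)."""
    return index.get(term.lower(), set())


all_docs = set(docs)

cat_or_dog = t("Cat") | t("Dog")                                   # Cat OR Dog
cat_and_dog = t("Cat") & t("Dog")                                  # Cat AND Dog
either_pair = (t("Cat") & t("Dog")) | (t("Collar") & t("Leash"))   # (Cat AND Dog) OR (Collar AND Leash)
faceted = (t("Cat") | t("Dog")) & (t("Collar") | t("Leash"))       # (Cat OR Dog) AND (Collar OR Leash)
not_cat = all_docs - t("Cat")                                      # NOT Cat

print(sorted(cat_or_dog), sorted(cat_and_dog))
print(sorted(either_pair), sorted(faceted), sorted(not_cat))
```

Note that the last two queries use the same four terms but return different documents, because the grouping changes which combinations qualify.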

24
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • Each of the following combinations works

Recall the card based systems? They mechanically
implement Boolean AND
25
Boolean Queries
  • (Cat OR Dog) AND (Collar OR Leash)
  • None of the following combinations works

26
Boolean Logic
[Figure: diagram of two sets, A and B]
27
Boolean Queries
  • Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
  • NOT is a UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
  • AND and OR can be n-ary operators
  • (a AND b AND c AND d)
  • Some rules (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
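
The rules above can be checked mechanically by trying every truth assignment. The following is a small sketch; the function name `check_rules` is invented for the example.

```python
from itertools import product


def check_rules():
    """Verify De Morgan's laws and double negation for all values of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True


print(check_rules())  # True
```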

28
Boolean Logic
[Table: m1 through m8 are the eight conjunctions of t1, t2 and t3 in which each term appears either plain or negated]
29
Boolean Searching
30
Pseudo-Boolean Queries
  • A new notation, from web search
  • cat dog collar leash
  • Does not mean the same thing!
  • Need a way to group combinations
  • Phrases
  • "stray cat" AND "frayed collar"
  • "stray cat" "frayed collar"

31
Another View of IR
[Figure: Information Need -> Text Input -> Parse -> Query; Collections -> Pre-Process -> Index; Query and Index -> Rank]
32
Result Sets
  • Run a query, get a result set
  • Two choices
  • Reformulate query, run on entire collection
  • Reformulate query, run on result set
  • Example: Dialog query
  • (Redford AND Newman)
  • -> S1: 1450 documents
  • (S1 AND Sundance)
  • -> S2: 898 documents
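
Running a reformulated query against a saved result set amounts to intersecting the saved set with the new term's postings. The sketch below uses made-up posting sets with a handful of ids, not the document counts from the Dialog example.

```python
# Hypothetical postings: term -> set of document ids
postings = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 9},
    "sundance": {3, 5, 7},
}

# (Redford AND Newman) -> S1
s1 = postings["redford"] & postings["newman"]

# (S1 AND Sundance) -> S2: only S1 is searched, not the whole collection
s2 = s1 & postings["sundance"]

print(len(s1), len(s2))  # 3 2
```

Because S2 is built from S1, it can never be larger than S1; each refinement step only narrows the set.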

33
Feedback Queries
34
Ordering of Retrieved Documents
  • Pure Boolean has no ordering
  • In practice
  • Order chronologically
  • Order by total number of hits on query terms
  • What if one term has more hits than others?
  • Is it better to have one of each term or many of
    one term?
  • Fancier methods have been investigated
  • p-norm is most famous
  • Usually impractical to implement
  • Usually hard for user to understand
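
As an illustration of ordering by total number of hits, here is a minimal sketch that counts query-term occurrences per document. The function name, the whitespace tokenization and the sample documents are all assumptions made for the example.

```python
from collections import Counter


def rank_by_hits(docs, query_terms):
    """Order doc ids by total occurrences of the query terms, most hits first."""
    terms = [term.lower() for term in query_terms]
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        total = sum(counts[term] for term in terms)
        if total:  # documents with no hits are not retrieved at all
            scored.append((total, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


sample = {"d1": "cat cat cat", "d2": "cat dog", "d3": "bird"}
print(rank_by_hits(sample, ["Cat", "Dog"]))  # ['d1', 'd2']
```

The sample shows the question raised above: d1 has many hits on one term and outranks d2, which has one hit on each term.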

35
Boolean
  • Advantages
  • Simple queries are easy to understand
  • Relatively easy to implement
  • Disadvantages
  • Difficult to specify what is wanted
  • Too much returned, or too little
  • Ordering not well determined
  • Dominant language in commercial IR systems until
    the WWW, and still the language of Database
    Management Systems

36
Faceted Boolean Query
  • Strategy: break query into facets (polysemous
    with earlier meaning of facets)
  • Conjunction of disjunctions
  • (a1 OR a2 OR a3) AND
  • (b1 OR b2) AND
  • (c1 OR c2 OR c3 OR c4)
  • Each facet expresses a topic
  • rain forest OR jungle OR amazon
  • medicine OR remedy OR cure
  • Smith OR Zhou

Also known as Conjunctive Normal Form or CNF
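
A conjunction of disjunctions can be evaluated with nested all/any tests. The sketch below assumes each document has already been reduced to a set of lowercase terms and phrases; the function name and the sample documents are invented.

```python
def matches_cnf(doc_terms, facets):
    """True if every facet (an OR-group) has at least one term in the document."""
    return all(any(term in doc_terms for term in facet) for facet in facets)


facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

print(matches_cnf({"jungle", "cure", "zhou"}, facets))  # True
print(matches_cnf({"jungle", "cure"}, facets))          # False: third facet missing
```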
37
Faceted Boolean Query
  • Query still fails if one facet is missing
  • Alternative: coordination level ranking
  • Order results in terms of how many facets
    (disjunctions) are satisfied
  • Also called Quorum ranking, Overlap ranking, and
    Best Match
  • Problem: facets still undifferentiated
  • Alternative: assign weights to facets
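
Coordination level ranking can be sketched by counting satisfied facets instead of requiring all of them. As before, documents are assumed to be sets of lowercase terms, and all names and data here are hypothetical.

```python
def coordination_level(doc_terms, facets):
    """Number of facets with at least one term present in the document."""
    return sum(1 for facet in facets if any(term in doc_terms for term in facet))


def rank_by_coordination(docs, facets):
    """Return (level, doc_id) pairs, highest level first, dropping level 0."""
    scored = [(coordination_level(terms, facets), doc_id) for doc_id, terms in docs.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]
docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"jungle", "cure"},
    "d3": {"amazon"},
    "d4": {"poetry"},
}
print(rank_by_coordination(docs, facets))
```

Here d2 is still returned, ranked below d1, even though a strict Boolean AND of the facets would reject it. Every facet counts equally, which is the undifferentiated-facets problem noted above.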

38
Proximity Searches
  • Proximity: terms occur within K positions of one
    another
  • pen w/5 paper
  • A Near function can be more vague
  • near(pen, paper)
  • Sometimes order can be specified
  • Also, Phrases and Collocations
  • "United Nations", "Bill Clinton"
  • Phrase Variants
  • "retrieval of information", "information
    retrieval"
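
A proximity test needs word positions, not just document membership. The sketch below reads "pen w/5 paper" as "pen and paper occur at most 5 token positions apart", which is an assumption about the operator. The function names, the optional `ordered` flag and the sample sentence are illustrative.

```python
def positions(tokens, term):
    """All positions at which a term occurs in a token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and of b lie within k positions of each other.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            dist = j - i
            if ordered and dist <= 0:
                continue
            if 0 < abs(dist) <= k:
                return True
    return False


tokens = "put the pen down on the paper".split()
print(within(tokens, "pen", "paper", 5))                # True (4 positions apart)
print(within(tokens, "pen", "paper", 3))                # False
print(within(tokens, "paper", "pen", 5, ordered=True))  # False: wrong order
```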

39
Filters
  • Filters: reduce the set of candidate docs
  • Often specified simultaneously with the query
  • Usually restrictions on metadata
  • Restrict by
  • Date range
  • Internet domain (.edu .com .berkeley.edu)
  • Author
  • Size
  • Limit number of documents returned
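
Filters of this kind operate on metadata rather than on text. A minimal sketch follows, assuming each document carries a small metadata record; the field names, parameter names and sample records are invented for illustration.

```python
from datetime import date

# Hypothetical metadata records
docs = [
    {"id": 1, "author": "Smith", "domain": "berkeley.edu", "date": date(2003, 5, 1)},
    {"id": 2, "author": "Zhou", "domain": "example.com", "date": date(2004, 1, 15)},
    {"id": 3, "author": "Smith", "domain": "stanford.edu", "date": date(2004, 6, 30)},
]


def apply_filters(docs, after=None, domain_suffix=None, author=None, limit=None):
    """Keep documents that pass every filter that was specified."""
    kept = []
    for doc in docs:
        if after is not None and doc["date"] < after:
            continue
        if domain_suffix is not None and not doc["domain"].endswith(domain_suffix):
            continue
        if author is not None and doc["author"] != author:
            continue
        kept.append(doc)
    return kept if limit is None else kept[:limit]


edu_ids = [d["id"] for d in apply_filters(docs, domain_suffix=".edu")]
recent_edu_ids = [d["id"] for d in apply_filters(docs, after=date(2004, 1, 1), domain_suffix=".edu")]
print(edu_ids, recent_edu_ids)  # [1, 3] [3]
```

In a real system the filters would normally be applied together with the query, so that filtered-out documents are never scored.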

40
Boolean Systems
  • Most of the commercial database search systems
    that pre-date the WWW are based on Boolean search
  • Dialog, Lexis-Nexis, etc.
  • Most Online Library Catalogs are Boolean systems
  • E.g., MELVYL
  • Database systems use Boolean logic for searching
  • Many of the search engines sold for intranet
    search of web sites are Boolean

41
Why Boolean?
  • Easy to implement
  • Efficient searching across very large databases
  • Easy to explain results
  • Has to have all of the words (AND)
  • Has to have at least one of the words (OR)

42
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic and Boolean IR Systems
  • Text Processing
  • Discussion

43
Content Analysis
  • Automated Transformation of raw text into a form
    that represents some aspect(s) of its meaning
  • Including, but not limited to
  • Automated Thesaurus Generation
  • Phrase Detection
  • Categorization
  • Clustering
  • Summarization

44
Techniques for Content Analysis
  • Statistical
  • Single Document
  • Full Collection
  • Linguistic
  • Syntactic
  • Semantic
  • Pragmatic
  • Knowledge-Based (Artificial Intelligence)
  • Hybrid (Combinations)

45
Text Processing
  • Standard Steps
  • Recognize document structure
  • Titles, sections, paragraphs, etc.
  • Break into tokens
  • Usually space and punctuation delineated
  • Special issues with Asian languages
  • Stemming/morphological analysis
  • Store in inverted index (to be discussed later)
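
The tokenizing and indexing steps can be sketched as follows. This is a simplified illustration that assumes English-like text where spaces and punctuation separate tokens. It skips structure recognition and stemming, and it would not work for languages written without spaces. The function names and sample documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


docs = {1: "Stray cat, frayed collar!", 2: "The cat sat."}
index = build_index(docs)
print(tokenize(docs[1]))     # ['stray', 'cat', 'frayed', 'collar']
print(sorted(index["cat"]))  # [1, 2]
```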

46
Content Analysis Areas
47
Document Processing Steps
From Modern IR Textbook
48
Stemming and Morphological Analysis
  • Goal: normalize similar words
  • Morphology (form of words)
  • Inflectional Morphology
  • E.g., inflect verb endings and noun number
  • Never change grammatical class
  • dog, dogs
  • tengo, tienes, tiene, tenemos, tienen
  • Derivational Morphology
  • Derive one word from another
  • Often change grammatical class
  • build, building; health, healthy

49
Automated Methods
  • Powerful multilingual tools exist for
    morphological analysis
  • PCKimmo, Xerox Lexical technology
  • Require a grammar and dictionary
  • Use two-level automata
  • Stemmers
  • Very dumb rules work well (for English)
  • Porter Stemmer: iteratively remove suffixes
  • Improvement: pass results through a lexicon
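
To give a feel for iterative suffix removal, here is a deliberately naive sketch. It is not the Porter algorithm: the three-suffix list, the minimum stem length and the function name are all assumptions chosen to keep the example short.

```python
SUFFIXES = ["ing", "es", "s"]  # tried in this order on every pass


def toy_stem(word, min_stem=3):
    """Repeatedly strip a known suffix while at least min_stem letters remain."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


for w in ["dogs", "buildings", "building", "sing", "news"]:
    print(w, "->", toy_stem(w))
```

The rules conflate "dogs" with "dog" and "buildings" with "build", but they also turn "news" into "new". Mistakes of this kind are what a lexicon check on the results is meant to catch.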

50
Errors Generated by Porter Stemmer
From Krovetz 93
51
Lecture Overview
  • Review
  • Introduction to Information Retrieval
  • The Information Seeking Process
  • History of IR Research
  • IR System Structure (revisited)
  • Central Concepts in IR
  • Boolean Logic
  • Boolean IR Systems
  • Discussion

52
Kavita Mittal on Bates
  • Given that Yahoo Search categorizes its search
    results while Google does not, do you think they
    use different types of controlled vocabularies?
    What kind/s do you think they use?
  • Can a Faceted Classification be used in a
    traditional library setting?

53
Sarita Yardi on Hearst
  • Marti's article was written in '96. I wanted to
    test how well her Simple Proximity Filter theory
    worked with Google. I want to know how popular
    Segways are in Europe and America. Which search
    query do you think will give better results and
    why?
  • Query 1 - Segway popular Europe America
  • Query 2 - "Segway" "popularity OR popular"
    "Europe OR America OR USA OR "United States"
  • (Hint: the results were inconclusive and
    arbitrary; there is no wrong answer)

54
Mini-Assignment
  • Log on to your new LexisNexis account
  • Go to http://www.nexis.com
  • Your ID is the string of letters and numbers from
    the signup sheet
  • Your password is your last name
  • Learn how to perform Boolean operations on
    LexisNexis (use the online help pages)
  • Do some searches on a topic interesting to you in
    different databases.
  • (There will be a full assignment next week)

55
Next Time
  • Web Crawling
  • Readings
  • The Anatomy of a Large-Scale Hypertextual Web
    Search Engine (Brin and Page)
  • Mercator: A Scalable, Extensible Web Crawler
    (Heydon and Najork)