Title: Introduction to Information Retrieval (cont.): Boolean Model
1Introduction to Information Retrieval (cont.)
Boolean Model
- University of California, Berkeley
- School of Information Management and Systems
- SIMS 202 Information Organization and Retrieval
- Lecture authors: Marti Hearst and Ray Larson
2The Standard Retrieval Interaction Model
3IR is an Iterative Process
4A sketch of a searcher moving through many actions towards a general goal of satisfactory completion of research related to an information need (after Bates 89).
(Figure: the searcher's path passes through a series of queries labeled Q0 to Q5.)
5Restricted Form of the IR Problem
- The system has available only pre-existing, canned text passages.
- Its response is limited to selecting from these passages and presenting them to the user.
- It must select, say, 10 or 20 passages out of millions or billions!
6Information Retrieval
- Revised Task Statement
- Build a system that retrieves documents that users are likely to find relevant to their queries.
- This set of assumptions underlies the field of Information Retrieval.
7Some IR History
- Roots in the scientific Information Explosion following WWII
- Interest in computer-based IR from mid 1950s
- H.P. Luhn at IBM (1958)
- Probabilistic models at Rand (Maron and Kuhns) (1960)
- Boolean system development at Lockheed (60s)
- Vector Space Model (Salton at Cornell 1965)
- Statistical Weighting methods and theoretical advances (70s)
- Refinements and Advances in application (80s)
- User Interfaces, Large-scale testing and application (90s)
8Structure of an IR System
(Diagram of an Information Storage and Retrieval System, adapted from Soergel, p. 19.)
- Search Line: interest profiles and queries; formulating query in terms of descriptors; storage of profiles; Store1: Profiles / Search requests
- Storage Line: documents and data; indexing (descriptive and subject); storage of documents; Store2: Document representations
- Rules of the game: rules for subject indexing, plus a thesaurus (which consists of a Lead-In Vocabulary and an Indexing Language)
- Comparison / Matching of Store1 against Store2 yields Potentially Relevant Documents
12Relevance (introduction)
- In what ways can a document be relevant to a query?
- Answer precise question precisely.
- Who is buried in Grant's tomb? Grant.
- Partially answer question.
- Where is Danville? Near Walnut Creek.
- Suggest a source for more information.
- What is lymphedema? Look in this Medical Dictionary.
- Give background information.
- Remind the user of other knowledge.
- Others ...
- Ideally, IR systems should retrieve ALL and ONLY the RELEVANT documents for a user.
13Query Languages
- A way to express the question (information need)
- Types
- Boolean
- Natural Language
- Stylized Natural Language
- Form-Based (GUI)
14Simple query language: Boolean
- Terms + Connectors (or operators)
- terms
- words
- normalized (stemmed) words
- phrases
- thesaurus terms
- connectors
- AND
- OR
- NOT
15Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
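One way to read the queries above is as set operations over the documents that contain each term: AND is intersection and OR is union. The following is a minimal sketch under that reading; the five-document collection and the `postings` helper are made up for illustration.

```python
# Hypothetical toy collection: document id -> set of terms it contains.
docs = {
    1: {"cat"},
    2: {"cat", "dog"},
    3: {"dog", "collar"},
    4: {"collar", "leash"},
    5: {"cat", "dog", "collar", "leash"},
}

def postings(term):
    """Return the set of document ids that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

cat, dog = postings("cat"), postings("dog")
collar, leash = postings("collar"), postings("leash")

# AND is set intersection (&), OR is set union (|).
q_or = cat | dog                            # Cat OR Dog
q_and = cat & dog                           # Cat AND Dog
q_mixed = (cat & dog) | collar              # (Cat AND Dog) OR Collar
q_two_ands = (cat & dog) | (collar & leash) # (Cat AND Dog) OR (Collar AND Leash)
q_two_ors = (cat | dog) & (collar | leash)  # (Cat OR Dog) AND (Collar OR Leash)

for name, result in [("Cat OR Dog", q_or), ("Cat AND Dog", q_and),
                     ("(Cat OR Dog) AND (Collar OR Leash)", q_two_ors)]:
    print(name, "->", sorted(result))
```

On this toy collection the last query keeps only documents 3 and 5, the ones that have at least one animal term and at least one accessory term.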
16Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- Each of the following combinations works
- Cat x x x x
- Dog x x x x x
- Collar x x x x
- Leash x x x x
17Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
- None of the following combinations work
- Cat x x
- Dog x x
- Collar x x
- Leash x x
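The working and failing combinations for (Cat OR Dog) AND (Collar OR Leash) can be enumerated mechanically. This sketch tries all 16 presence/absence patterns of the four terms; the function and variable names are illustrative only.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")

def matches(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one document."""
    return (cat or dog) and (collar or leash)

works, fails = [], []
for combo in product([False, True], repeat=len(TERMS)):
    # Record which terms are present in this combination.
    present = tuple(t for t, flag in zip(TERMS, combo) if flag)
    (works if matches(*combo) else fails).append(present)

print(len(works), "combinations satisfy the query")
print(len(fails), "combinations do not")
```

Nine of the sixteen patterns satisfy the query (three ways to satisfy each OR group), and seven do not, including any document with only animal terms or only accessory terms.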
18Boolean Logic
(Venn diagram of sets A and B.)
19Boolean Queries
- Usually expressed as INFIX operators in IR
- ((a AND b) OR (c AND b))
- NOT is a UNARY PREFIX operator
- ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
- (a AND b AND c AND d)
- Some rules (De Morgan revisited)
- NOT(a) AND NOT(b) = NOT(a OR b)
- NOT(a) OR NOT(b) = NOT(a AND b)
- NOT(NOT(a)) = a
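The De Morgan rules and double negation can be checked by brute force over every truth assignment, and again in set form, where NOT is the complement relative to the whole collection. This is a small sketch; the six-document collection and the sets `A` and `B` are arbitrary examples.

```python
from itertools import product

def holds_for_all(law):
    """True if a two-variable law holds for every truth assignment."""
    return all(law(a, b) for a, b in product([False, True], repeat=2))

# NOT(a) AND NOT(b) = NOT(a OR b)
law_1 = holds_for_all(lambda a, b: ((not a) and (not b)) == (not (a or b)))
# NOT(a) OR NOT(b) = NOT(a AND b)
law_2 = holds_for_all(lambda a, b: ((not a) or (not b)) == (not (a and b)))
# NOT(NOT(a)) = a
law_3 = all((not (not a)) == a for a in (False, True))

# The same rules over document sets (arbitrary example sets).
collection = {1, 2, 3, 4, 5, 6}
A, B = {1, 2, 3}, {3, 4}

def NOT(docs):
    """Complement relative to the whole collection."""
    return collection - docs

set_law_1 = (NOT(A) & NOT(B)) == NOT(A | B)
set_law_2 = (NOT(A) | NOT(B)) == NOT(A & B)

print(law_1, law_2, law_3, set_law_1, set_law_2)
```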
20Boolean Logic
(Venn diagram: three overlapping term sets t1, t2 and t3 divide the collection into eight regions, m1 to m8, one for each combination of presence or absence of the three terms. Documents D1 to D11 are placed in these regions.)
21Boolean Searching
Information need: Measurement of the width of cracks in prestressed concrete beams
Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
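The relaxed query is the OR of every three-term AND, which amounts to "at least three of the four concepts are present". The sketch below builds the relaxed query from the concept letters and checks that equivalence over all 16 cases; the helper names are illustrative.

```python
from itertools import combinations, product

CONCEPTS = "CBWP"  # Cracks, Beams, Width measurement, Prestressed concrete

def relaxed_query_text(concepts, k):
    """Write out the OR of all k-term ANDs over the given concepts."""
    groups = combinations(concepts, k)
    return " OR ".join("(" + " AND ".join(g) + ")" for g in groups)

def relaxed(c, b, w, p):
    """The relaxed query exactly as written on the slide."""
    return (c and b and p) or (c and b and w) or (c and w and p) or (b and w and p)

def at_least(k, *flags):
    """True if at least k of the flags are set."""
    return sum(flags) >= k

equivalent = all(
    bool(relaxed(*flags)) == at_least(3, *flags)
    for flags in product([False, True], repeat=4)
)

print(relaxed_query_text(CONCEPTS, 3))
print("same as 'at least 3 of 4':", equivalent)
```

The generated text lists the same four groups as the slide, in a different order.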
22Pseudo-Boolean Queries
- A new notation, from web search
- cat dog collar leash
- Does not mean the same thing!
- Need a way to group combinations.
- Phrases
- "stray cat" AND "frayed collar"
- "stray cat" "frayed collar"
23Information need
Collections
text input
24Result Sets
- Run a query, get a result set
- Two choices
- Reformulate query, run on entire collection
- Reformulate query, run on result set
- Example: Dialog query
- (Redford AND Newman)
- -> S1: 1450 documents
- (S1 AND Sundance)
- -> S2: 898 documents
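The second choice, running the reformulated query on the previous result set, can be sketched with named result sets held as Python sets. The tiny index below is invented, so the counts do not match the Dialog example; only the S1/S2 pattern is the point.

```python
# Hypothetical postings: term -> ids of documents containing it.
index = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 7, 8},
    "sundance": {3, 4, 8, 9},
}

result_sets = {}

def save(name, docs):
    """Store a named result set and report its size, Dialog style."""
    result_sets[name] = docs
    print(f"-> {name}: {len(docs)} documents")
    return docs

# (Redford AND Newman)
s1 = save("S1", index["redford"] & index["newman"])

# (S1 AND Sundance): the new term is intersected with the stored set,
# so the first part of the query is not evaluated again.
s2 = save("S2", result_sets["S1"] & index["sundance"])
```

Because S2 is computed from S1, it can only be the same size or smaller, which is the narrowing visible in the example dialog.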
25Information need
Collections
text input
Reformulated Query
26Ordering of Retrieved Documents
- Pure Boolean has no ordering
- In practice
- order chronologically
- order by total number of hits on query terms
- What if one term has more hits than others?
- Is it better to have one of each term or many of one term?
- Fancier methods have been investigated
- p-norm is most famous
- usually impractical to implement
- usually hard for user to understand
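The question of "one of each term or many of one term" shows up as soon as results are ordered by hit counts. This sketch compares two simple orderings on three invented documents: total hits on query terms, and number of distinct query terms matched. Neither is p-norm; both are only illustrations.

```python
from collections import Counter

# Hypothetical documents and query.
docs = {
    "d1": "cat cat cat cat",
    "d2": "cat dog",
    "d3": "dog leash",
}
query = ["cat", "dog"]

def total_hits(text, terms):
    """Total number of occurrences of any query term."""
    counts = Counter(text.split())
    return sum(counts[t] for t in terms)

def distinct_terms(text, terms):
    """Number of different query terms that occur at least once."""
    words = set(text.split())
    return sum(t in words for t in terms)

by_hits = sorted(docs, key=lambda d: total_hits(docs[d], query), reverse=True)
by_distinct = sorted(docs, key=lambda d: distinct_terms(docs[d], query), reverse=True)

print("by total hits:    ", by_hits)
print("by distinct terms:", by_distinct)
```

Ordering by total hits puts d1 first even though it never mentions "dog"; ordering by distinct terms prefers d2, which has one of each.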
27Boolean
- Advantages
- simple queries are easy to understand
- relatively easy to implement
- Disadvantages
- difficult to specify what is wanted
- too much returned, or too little
- ordering not well determined
- Dominant language in commercial systems until the
WWW
28Faceted Boolean Query
- Strategy: break query into facets (polysemous with earlier meaning of facets)
- conjunction of disjunctions
- (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
- each facet expresses a topic
- (rain forest OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)
29Faceted Boolean Query
- Query still fails if one facet missing
- Alternative: Coordination level ranking
- Order results in terms of how many facets (disjunctions) are satisfied
- Also called Quorum ranking, Overlap ranking, and Best Match
- Problem: Facets still undifferentiated
- Alternative: assign weights to facets
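Coordination level ranking can be sketched as counting, for each document, how many facets have at least one of their terms present. The facets reuse the rain forest example; the documents are invented, and phrases such as "rain forest" are treated as single index terms, which is an assumption of this sketch.

```python
# Each facet is a set of alternative terms (a disjunction).
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents: id -> set of index terms.
docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"river"},
}

def coordination_level(doc_terms):
    """Number of facets with at least one matching term."""
    return sum(1 for facet in facets if facet & doc_terms)

# Strict faceted Boolean query: every facet must be satisfied.
strict = [d for d in docs if coordination_level(docs[d]) == len(facets)]

# Coordination level ranking: keep partial matches, best first.
partial = [d for d in docs if coordination_level(docs[d]) > 0]
ranked = sorted(partial, key=lambda d: coordination_level(docs[d]), reverse=True)

print("strict:", strict)
print("ranked:", ranked)
```

The strict query returns only d1, while the ranked list also keeps d2 and d3. Every facet counts the same here, which is the "undifferentiated" problem noted above.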
30Proximity Searches
- Proximity: terms occur within K positions of one another
- pen w/5 paper
- A "Near" function can be more vague
- near(pen, paper)
- Sometimes order can be specified
- Also, Phrases and Collocations
- "United Nations", "Bill Clinton"
- Phrase Variants
- "retrieval of information", "information retrieval"
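A proximity test needs word positions, not just term presence. The sketch below checks whether two terms occur within K token positions of each other, with an optional ordering constraint. The `within` function is hypothetical, and real systems differ on exactly how the distance in operators like w/5 is counted; here it is the difference between token positions.

```python
def positions(tokens, term):
    """All positions at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if a and b occur within k positions of one another.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            if ordered and j < i:
                continue
            if abs(i - j) <= k:
                return True
    return False

far = "the pen lay on top of the paper".split()   # pen at 1, paper at 7
near = "paper and pen".split()                     # paper at 0, pen at 2

print(within(far, "pen", "paper", 5))                  # distance 6, too far
print(within(near, "pen", "paper", 5))                 # distance 2, close enough
print(within(near, "pen", "paper", 5, ordered=True))   # pen does not precede paper
```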
31Filters
- Filters: Reduce the set of candidate docs
- Often specified simultaneously with the query
- Usually restrictions on metadata
- restrict by
- date range
- internet domain (.edu .com .berkeley.edu)
- author
- size
- limit number of documents returned
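Filters of this kind can be sketched as predicates over document metadata, applied to the candidate set before or alongside the query. The `Doc` record, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: int
    domain: str
    author: str
    published: date
    size_kb: int

def apply_filters(docs, domain_suffix=None, author=None,
                  start=None, end=None, max_size_kb=None, limit=None):
    """Keep documents that pass every filter that was specified."""
    kept = []
    for d in docs:
        if domain_suffix and not d.domain.endswith(domain_suffix):
            continue
        if author and d.author != author:
            continue
        if start and d.published < start:
            continue
        if end and d.published > end:
            continue
        if max_size_kb is not None and d.size_kb > max_size_kb:
            continue
        kept.append(d)
    # Limit the number of documents returned, if requested.
    return kept[:limit] if limit is not None else kept

candidates = [
    Doc(1, "sims.berkeley.edu", "Author A", date(1999, 9, 1), 40),
    Doc(2, "example.com", "Author B", date(1998, 1, 15), 900),
    Doc(3, "cornell.edu", "Author C", date(1997, 6, 30), 120),
    Doc(4, "berkeley.edu", "Author A", date(1999, 2, 2), 75),
]

edu = apply_filters(candidates, domain_suffix=".edu")
recent_berkeley = apply_filters(candidates, domain_suffix="berkeley.edu",
                                start=date(1999, 1, 1))
print([d.doc_id for d in edu], [d.doc_id for d in recent_berkeley])
```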
32How are the texts handled?
- What happens if you take the words exactly as they appear in the original text?
- What about punctuation, capitalization, etc.?
- What about spelling errors?
- What about plural vs. singular forms of words?
- What about cases and declension in non-English languages?
- What about non-Roman alphabets?
33Content Analysis
- Automated Transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to:
- Automated Thesaurus Generation
- Phrase Detection
- Categorization
- Clustering
- Summarization
34Techniques for Content Analysis
- Statistical
- Single Document
- Full Collection
- Linguistic
- Syntactic
- Semantic
- Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)
35Text Processing
- Standard Steps
- Recognize document structure
- titles, sections, paragraphs, etc.
- Break into tokens
- usually space and punctuation delineated
- special issues with Asian languages
- Stemming/morphological analysis
- Store in inverted index (to be discussed later)
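The tokenizing and indexing steps can be sketched in a few lines. This version lowercases the text, splits on anything that is not a letter or digit, and records which documents contain each token. It skips structure recognition and stemming, and the space-and-punctuation rule is only adequate for languages written with word separators. The sample documents are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase, then split on spaces and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "Cats, dogs & leashes!",
    2: "The dog's collar.",
    3: "Dogs and cats",
}
index = build_index(docs)

print(tokenize(docs[1]))
print(sorted(index["dogs"]), sorted(index["dog"]))
```

Note that "dogs" and "dog" end up as separate index entries, which is the problem that stemming and morphological analysis address.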
36Information need
Collections
How is the query constructed?
How is the text processed?
text input
37Document Processing Steps
38Stemming and Morphological Analysis
- Goal: normalize similar words
- Morphology (form of words)
- Inflectional Morphology
- E.g., inflect verb endings and noun number
- Never change grammatical class
- dog, dogs
- tengo, tienes, tiene, tenemos, tienen
- Derivational Morphology
- Derive one word from another
- Often change grammatical class
- build, building; health, healthy
39Automated Methods
- Powerful multilingual tools exist for morphological analysis
- PCKimmo, Xerox Lexical technology
- Require a grammar and dictionary
- Use two-level automata
- Stemmers
- Very dumb rules work well (for English)
- Porter Stemmer: Iteratively remove suffixes
- Improvement: pass results through a lexicon
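To give a feel for how "very dumb rules" behave, here is a toy suffix stripper. It is not the Porter stemmer: it removes at most one suffix from a short, made-up list, with no conditions beyond a minimum stem length. The lexicon check is one possible reading of "pass results through a lexicon": keep the stem only if it is a known word.

```python
# Toy rule list (an assumption of this sketch), checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

def stem_with_lexicon(word, lexicon):
    """Accept the stem only if the lexicon knows it; else keep the word."""
    stem = toy_stem(word)
    return stem if stem in lexicon else word

lexicon = {"dog", "build", "house", "walk"}

for w in ["dogs", "building", "walked", "houses", "sing"]:
    print(w, "->", toy_stem(w), "/", stem_with_lexicon(w, lexicon))
```

The rules conflate "dogs" with "dog" and "building" with "build", but they also turn "houses" into the non-word "hous"; the lexicon check rejects that stem and keeps "houses" unchanged.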
40Errors Generated by Porter Stemmer (Krovetz 93)
41Next
- Statistical Properties of Text
- Preparing information for search: Lexical analysis
- Introduction to the Vector Space model of IR.