Title: Information Retrieval and Web Search
1 Information Retrieval and Web Search
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- Sept 16, 2008
Acknowledgement: some slides are from Jimmy Lin and Victor Lavrenko
2 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
- Project Topic Discussion and Finalization
3 What is Information Retrieval?
- Most people equate IR with web-search
- highly visible, commercially successful endeavors
- leverage 3 decades of academic research
- IR: finding any kind of relevant information
- web pages, news events, answers, images, ...
- relevance is a key notion
4 IR System
5 What types of information?
- Text (Documents and portions thereof)
- XML and structured documents
- Images
- Audio (sound effects, songs, etc.)
- Video
- Source code
- Applications/Web services
6 Interesting Examples
- Google image search: http://images.google.com/
- Google video search: http://video.google.com/
- NYU Prof. Sekine's n-gram search: http://linserv1.cims.nyu.edu:23232/ngram/
- INDRI demo: http://www.lemurproject.org/indri/
7 What about databases?
- What are examples of databases?
- Banks storing account information
- Retailers storing inventories
- Universities storing student grades
- What exactly is a (relational) database?
- Think of them as a collection of tables
- They model some aspect of the world
8 A (Simple) Database Example
Student Table
Department Table
Course Table
Enrollment Table
9 Database Queries
- What would you want to know from a database?
- What classes is John Arrow enrolled in?
- Who has the highest grade in LBSC 690?
- Who's in the history department?
- Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest email address?
10 Comparing IR to databases
11 The IR Black Box
(diagram: documents and a query go into the black box; hits come out)
12 Inside The IR Black Box
(diagram: a representation function turns the query into a query representation and the documents into document representations stored in an index; a comparison function matches the two to produce hits)
13 Building the IR Black Box
- Different models of information retrieval
- Boolean model
- Vector space model
- Language models
- Representing the meaning of documents
- How do we capture the meaning of documents?
- Is meaning just the sum of all terms?
- Indexing
- How do we actually store all those words?
- How do we access indexed terms quickly?
14 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
15 The Central Problem in IR
(diagram: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms)
- Do these represent the same concepts?
16 Relevance
- Relevance is a subjective judgment and may include:
- Being on the proper subject.
- Being timely (recent information).
- Being authoritative (from a trusted source).
- Satisfying the goals of the user and his/her intended use of the information (information need).
17 IR Ranking
- Early IR focused on set-based retrieval
- Boolean queries: a set of conditions to be satisfied
- a document either matches the query or not
- like classifying the collection into relevant / non-relevant sets
- still used by professional searchers
- advanced search in many systems
- Modern IR: ranked retrieval
- a free-form query expresses the user's information need
- rank documents by decreasing likelihood of relevance
- many studies show it is superior
18 A heuristic formula for IR
- Rank docs by similarity to the query
- suppose the query is "cryogenic labs"
- Similarity = number of query words in the doc
- favors documents containing both "labs" and "cryogenic"
- mathematically: Sim(D, Q) = |D ∩ Q|
- Logical variations (set-based; see the sketch below)
- Boolean AND (require all words)
- Boolean OR (accept any of the words)
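A minimal sketch of the word-overlap similarity and its Boolean variants described above (not from the original slides; the tiny example documents are made up):

def overlap_score(query, doc):
    """Number of distinct query words that appear in the document."""
    q_words, d_words = set(query.split()), set(doc.split())
    return len(q_words & d_words)

def boolean_and(query, doc):
    """True only if every query word occurs in the document."""
    return set(query.split()) <= set(doc.split())

def boolean_or(query, doc):
    """True if any query word occurs in the document."""
    return bool(set(query.split()) & set(doc.split()))

docs = ["the cryogenic labs opened", "new physics labs", "weather report"]
query = "cryogenic labs"
ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
print(ranked[0])   # "the cryogenic labs opened"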
19 Term Frequency (TF)
- Observation
- key words tend to be repeated in a document
- Modify our similarity measure
- give more weight if word occurs multiple times
- Problem
- biased towards long documents
- spurious occurrences
- normalize by length
20 Inverse Document Frequency (IDF)
- Observation
- rare words carry more meaning: cryogenic, apollo
- frequent words are linguistic glue: of, the, said, went
- Modify our similarity measure
- give more weight to rare words, but don't be too aggressive (why?)
- C: total number of documents
- df(q): number of documents that contain q
21 TF normalization
- Observation
- D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
- which document is more relevant?
- which one is ranked higher? (df(labs) > df(cryogenic))
- Correction
- the first occurrence is more important than a repeat (why?)
- squash the linearity of TF
22 State-of-the-art Formula
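The formula image from this slide is not preserved in the text. The sketch below is one common way to combine the preceding ideas (log-dampened TF, IDF, and length normalization); the slide's exact formula may differ.

import math
from collections import Counter

def score(query_terms, doc_terms, collection):
    """Rank score for one document; collection is a list of term lists."""
    C = len(collection)
    tf = Counter(doc_terms)
    s = 0.0
    for q in set(query_terms):
        if tf[q] == 0:
            continue
        df = sum(1 for d in collection if q in d)   # document frequency of q
        idf = math.log(C / df)                      # rare words weigh more
        s += (1 + math.log(tf[q])) * idf            # squashed TF times IDF
    return s / len(doc_terms)                       # normalize by document length

docs = [["cryogenic", "labs"], ["cryogenic", "cryogenic"], ["weather", "report"]]
print(score(["cryogenic", "labs"], docs[0], docs))  # D1 scores higher than D2
print(score(["cryogenic", "labs"], docs[1], docs))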
23 Vector-space approach to IR
- Documents and the query are represented as vectors in a term space; documents are ranked by their similarity to the query vector (see slide 44 for similarity formulas).
(diagram: a three-dimensional term space with axes cat, pig, dog)
24 Language-modeling Approach
- the query is a random sample from a "perfect" document
- words are sampled independently of each other
- rank documents by the probability of generating the query (see the sketch below)
(diagram: query words drawn from a document D with term probabilities such as 4/9, 2/9, 3/9)
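A minimal sketch of query-likelihood ranking under the independence assumption above; the add-one smoothing and the example documents are illustrative choices, not taken from the slides.

from collections import Counter

def query_likelihood(query_terms, doc_terms, vocab_size):
    """P(query | document) under a smoothed unigram language model."""
    tf = Counter(doc_terms)
    p = 1.0
    for q in query_terms:
        # Laplace smoothing keeps unseen query words from zeroing the product.
        p *= (tf[q] + 1) / (len(doc_terms) + vocab_size)
    return p

docs = [["cryogenic", "labs", "open"], ["weather", "report", "today"]]
vocab = {w for d in docs for w in d}
query = ["cryogenic", "labs"]
for d in docs:
    print(d, query_likelihood(query, d, len(vocab)))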
25 PageRank in Google
26 PageRank in Google (cont.)
(diagram: a small web graph with pages A and B and in-links I1 and I2)
- Assign a numeric value to each page
- The more a page is referred to by important pages, the more important this page is
- d: damping factor (0.85); a sketch of the iteration follows below
- Many other criteria are also used, e.g. proximity of query words: "information retrieval" is better than "information ... retrieval"
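A sketch of the PageRank iteration with the damping factor d = 0.85 mentioned above; the link graph is hypothetical and the update rule is the standard (1 - d) + d * sum(...) form, not necessarily the exact variant shown on the slide.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum the rank flowing in from every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical link graph: I1 and I2 both link to A, and A links to B.
print(pagerank({"I1": ["A"], "I2": ["A"], "A": ["B"], "B": []}))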
27 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
28 Keyword Search
- The simplest notion of relevance is that the query string appears verbatim in the document.
- A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
29 Problems with Keywords
- May not retrieve relevant documents that include synonymous terms.
- restaurant vs. café
- PRC vs. China
- May retrieve irrelevant documents that include ambiguous terms.
- bat (baseball vs. mammal)
- Apple (company vs. fruit)
- bit (unit of data vs. act of eating)
30 Query Expansion
- http://www.lemurproject.org/lemur/IndriQueryLanguage.php
- Most errors are caused by vocabulary mismatch
- query: "cars", document: "automobiles"
- solution: automatically add highly related words
- Thesaurus / WordNet lookup (see the sketch below)
- add semantically related words (synonyms)
- cannot take context into account
- rail car vs. race car vs. car and cdr
- Statistical expansion
- add statistically related words (co-occurrence)
- very successful
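A sketch of thesaurus-based expansion using NLTK's WordNet interface (NLTK is not mentioned in the slides; any thesaurus lookup would do). It also shows the context problem noted above: synonyms from every sense of "car" are added.

from nltk.corpus import wordnet as wn

def expand(term):
    """Collect synonyms of a term across all of its WordNet senses."""
    synonyms = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return synonyms

print(expand("car"))   # e.g. {'car', 'auto', 'automobile', 'railcar', 'gondola', ...}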
31 IR Query Examples
- http://nlp.cs.qc.cuny.edu/ir.zip
- Query:
<parameters><query>#combine( #weight( 0.063356 #1(explosion) 0.187417 #1(blast) 0.411817 #1(wounded) 0.101370 #1(injured) 0.161191 #1(death) 0.074849 #1(deaths)) #weight( 0.311760 #1(Davao City international airport) 0.311760 #1(Tuesday) 0.103044 #1(DAVAO) 0.195505 #1(Philippines) 0.019817 #1(DXDC) 0.058113 #1(Davao Medical Center)))</query></parameters>
32 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
33 Document indexing
- Goal: find the important meanings and create an internal representation
- Factors to consider
- Accuracy in representing meanings (semantics)
- Exhaustiveness (cover all the contents)
- Facility for a computer to manipulate
- What is the best representation of contents?
- Character string (character trigrams): not precise enough
- Word: good coverage, not precise
- Phrase: poor coverage, more precise
- Concept: poor coverage, precise
(diagram: moving from string to word to phrase to concept increases accuracy/precision but decreases coverage/recall)
34 Indexer steps
- Sequence of (modified token, document ID) pairs (a sketch follows below).
- Doc 1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me."
- Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"
35 Indexer steps (cont.)
- Multiple term entries in a single document are merged.
- Frequency information is added.
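A minimal sketch of the indexer steps on slides 34-35: emit (token, docID) pairs, merge duplicate entries within a document, and record frequency information. Tokenization here is a crude whitespace split, purely for illustration.

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# token -> {doc_id: term frequency}
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in text.lower().split():   # crude tokenization and normalization
        index[token][doc_id] += 1

print(dict(index["caesar"]))   # {1: 1, 2: 2}
print(dict(index["killed"]))   # {1: 2}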
36 Stopwords / Stoplist
- Function words do not bear useful information for IR
- of, in, about, with, I, although, ...
- A stoplist contains stopwords that are not to be used as index terms
- Prepositions
- Articles
- Pronouns
- Some adverbs and adjectives
- Some frequent words (e.g. "document")
- The removal of stopwords usually improves IR effectiveness
- A few standard stoplists are commonly used.
37 Stemming
- Reason
- Different word forms may bear similar meaning (e.g. search, searching): create a standard representation for them
- Stemming
- Removing some endings of words (see the sketch below)
- computer, compute, computes, computing, computed, computation → comput
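A sketch of the example above using NLTK's Porter stemmer (the slides do not name a specific stemmer; any suffix-stripping stemmer behaves similarly):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: stemmer.stem(w) for w in words})
# All of these reduce to the stem "comput".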
38 Lemmatization
- Transform words to a standard form according to their syntactic category.
- E.g. verb + ing → verb; noun + s → noun
- Needs POS tagging
- More accurate than stemming, but needs more resources (see the sketch below)
- It is crucial to choose the stemming/lemmatization rules
- noise vs. recognition rate
- a compromise between precision and recall
- light/no stemming: higher precision, lower recall; severe stemming: higher recall, lower precision
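A sketch of POS-aware lemmatization using NLTK's WordNet lemmatizer (not named in the slides), showing why the syntactic category matters:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars", pos="n"))       # car    (noun + s -> noun)
print(lemmatizer.lemmatize("searching", pos="v"))  # search (verb + ing -> verb)
print(lemmatizer.lemmatize("searching", pos="n"))  # searching (left unchanged as a noun)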
39 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
40 IR on the Web
- No stable document collection (spider, crawler)
- Invalid document, duplication, etc.
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation of document quality
- Multilingual problem
- ...
41 Web Search
- Application of IR to HTML documents on the World Wide Web.
- Differences:
- Must assemble the document corpus by spidering the web (a minimal crawler sketch follows below).
- Can exploit the structural layout information in HTML (XML).
- Documents change uncontrollably.
- Can exploit the link structure of the web.
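A minimal sketch of assembling a corpus by spidering, as mentioned above; the seed URL, breadth-first strategy, and regex link extraction are illustrative choices, not a production crawler.

import re
import urllib.request
from collections import deque

def crawl(seed, max_pages=10):
    """Fetch pages breadth-first starting from a seed URL."""
    seen, queue, corpus = {seed}, deque([seed]), {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                      # invalid or unreachable documents are common
        corpus[url] = html                # store the page text for later indexing
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus

pages = crawl("http://www.lemurproject.org/", max_pages=5)
print(list(pages))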
42 Web Search System
(diagram: a web search system built around a core IR system)
44 Some formulas for Sim
- Dot product
- Cosine
- Dice
- Jaccard
(diagram: document vector D and query vector Q in a two-dimensional term space with axes t1 and t2)
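The formula images for these measures are not preserved in the text; the following are the standard definitions for a document vector D = (d_1, ..., d_t) and query vector Q = (q_1, ..., q_t), which the slide's formulas are expected to match.

\[
\mathrm{Sim}_{\mathrm{dot}}(D, Q) = \sum_{i=1}^{t} d_i q_i
\qquad
\mathrm{Sim}_{\cos}(D, Q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}}
\]
\[
\mathrm{Sim}_{\mathrm{Dice}}(D, Q) = \frac{2 \sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2}
\qquad
\mathrm{Sim}_{\mathrm{Jaccard}}(D, Q) = \frac{\sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2 - \sum_i d_i q_i}
\]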