Title: Information Retrieval and Web Search
1 Information Retrieval and Web Search
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- Sept 16, 2008
Acknowledgement: some slides are from Jimmy Lin and Victor Lavrenko
2 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
- Project Topic Discussion and Finalization
3 What is Information Retrieval?
- Most people equate IR with web-search
- highly visible, commercially successful endeavors
- leverage 3 decades of academic research
- IR: finding any kind of relevant information
- web pages, news events, answers, images, ...
- relevance is a key notion
4 IR System
5 What types of information?
- Text (Documents and portions thereof)
- XML and structured documents
- Images
- Audio (sound effects, songs, etc.)
- Video
- Source code
- Applications/Web services
6 Interesting Examples
- Google image search: http://images.google.com/
- Google video search: http://video.google.com/
- NYU Prof. Sekine's n-gram search: http://linserv1.cims.nyu.edu:23232/ngram/
- INDRI demo: http://www.lemurproject.org/indri/
7 What about databases?
- What are examples of databases?
- Banks storing account information
- Retailers storing inventories
- Universities storing student grades
- What exactly is a (relational) database?
- Think of them as a collection of tables
- They model some aspect of the world
8 A (Simple) Database Example
Student Table
Department Table
Course Table
Enrollment Table
9 Database Queries
- What would you want to know from a database?
- What classes is John Arrow enrolled in?
- Who has the highest grade in LBSC 690?
- Who's in the history department?
- Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest email address?
10 Comparing IR to databases
11 The IR Black Box
(diagram: documents and a query go into the black box; hits come out)
12 Inside The IR Black Box
(diagram: a representation function turns the query into a query representation and the documents into document representations stored in an index; a comparison function matches the two to produce hits)
13 Building the IR Black Box
- Different models of information retrieval
- Boolean model
- Vector space model
- Language models
- Representing the meaning of documents
- How do we capture the meaning of documents?
- Is meaning just the sum of all terms?
- Indexing
- How do we actually store all those words?
- How do we access indexed terms quickly?
14 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
15 The Central Problem in IR
(diagram: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms)
- Do these represent the same concepts?
16 Relevance
- Relevance is a subjective judgment and may include:
- Being on the proper subject.
- Being timely (recent information).
- Being authoritative (from a trusted source).
- Satisfying the goals of the user and his/her intended use of the information (information need).
17 IR Ranking
- Early IR focused on set-based retrieval
- Boolean queries: a set of conditions to be satisfied
- a document either matches the query or not
- like classifying the collection into relevant / non-relevant sets
- still used by professional searchers
- advanced search in many systems
- Modern IR: ranked retrieval
- a free-form query expresses the user's information need
- rank documents by decreasing likelihood of relevance
- many studies show it is superior
18 A heuristic formula for IR
- Rank docs by similarity to the query
- suppose the query is "cryogenic labs"
- Similarity = number of query words in the doc
- favors documents containing both "labs" and "cryogenic"
- mathematically: Sim(D, Q) = |D ∩ Q|
- Logical variations (set-based; see the sketch below)
- Boolean AND (require all words)
- Boolean OR (accept any of the words)
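A minimal sketch of the word-overlap similarity and its Boolean variants described above (not from the original slides; the tiny example documents are made up):

def overlap_score(query, doc):
    """Number of distinct query words that appear in the document."""
    q_words, d_words = set(query.split()), set(doc.split())
    return len(q_words & d_words)

def boolean_and(query, doc):
    """True only if every query word occurs in the document."""
    return set(query.split()) <= set(doc.split())

def boolean_or(query, doc):
    """True if any query word occurs in the document."""
    return bool(set(query.split()) & set(doc.split()))

docs = ["the cryogenic labs opened", "new physics labs", "weather report"]
query = "cryogenic labs"
ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
print(ranked[0])   # "the cryogenic labs opened"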
19 Term Frequency (TF)
- Observation
- key words tend to be repeated in a document
- Modify our similarity measure
- give more weight if word occurs multiple times
- Problem
- biased towards long documents
- spurious occurrences
- normalize by length
20 Inverse Document Frequency (IDF)
- Observation
- rare words carry more meaning: cryogenic, apollo
- frequent words are linguistic glue: of, the, said, went
- Modify our similarity measure
- give more weight to rare words, but don't be too aggressive (why?)
- C: total number of documents
- df(q): number of documents that contain q
21 TF normalization
- Observation
- D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
- which document is more relevant?
- which one is ranked higher? (df(labs) > df(cryogenic))
- Correction
- the first occurrence is more important than a repeat (why?)
- squash the linearity of TF
22 State-of-the-art Formula
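The formula image from this slide is not preserved in the text. The sketch below is one common way to combine the preceding ideas (log-dampened TF, IDF, and length normalization); the slide's exact formula may differ.

import math
from collections import Counter

def score(query_terms, doc_terms, collection):
    """Rank score for one document; collection is a list of term lists."""
    C = len(collection)
    tf = Counter(doc_terms)
    s = 0.0
    for q in set(query_terms):
        if tf[q] == 0:
            continue
        df = sum(1 for d in collection if q in d)   # document frequency of q
        idf = math.log(C / df)                      # rare words weigh more
        s += (1 + math.log(tf[q])) * idf            # squashed TF times IDF
    return s / len(doc_terms)                       # normalize by document length

docs = [["cryogenic", "labs"], ["cryogenic", "cryogenic"], ["weather", "report"]]
print(score(["cryogenic", "labs"], docs[0], docs))  # D1 scores higher than D2
print(score(["cryogenic", "labs"], docs[1], docs))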
23 Vector-space approach to IR
- Documents and the query are represented as vectors in a term space; documents are ranked by their similarity to the query vector (see slide 44 for similarity formulas).
(diagram: a three-dimensional term space with axes cat, pig, dog)
24 Language-modeling Approach
- the query is a random sample from a "perfect" document
- words are sampled independently of each other
- rank documents by the probability of generating the query (see the sketch below)
(diagram: query words drawn from a document D with term probabilities such as 4/9, 2/9, 3/9)
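A minimal sketch of query-likelihood ranking under the independence assumption above; the add-one smoothing and the example documents are illustrative choices, not taken from the slides.

from collections import Counter

def query_likelihood(query_terms, doc_terms, vocab_size):
    """P(query | document) under a smoothed unigram language model."""
    tf = Counter(doc_terms)
    p = 1.0
    for q in query_terms:
        # Laplace smoothing keeps unseen query words from zeroing the product.
        p *= (tf[q] + 1) / (len(doc_terms) + vocab_size)
    return p

docs = [["cryogenic", "labs", "open"], ["weather", "report", "today"]]
vocab = {w for d in docs for w in d}
query = ["cryogenic", "labs"]
for d in docs:
    print(d, query_likelihood(query, d, len(vocab)))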
25 PageRank in Google
26 PageRank in Google (cont.)
(diagram: a small web graph with pages A and B and in-links I1 and I2)
- Assign a numeric value to each page
- The more a page is referred to by important pages, the more important this page is
- d: damping factor (0.85); a sketch of the iteration follows below
- Many other criteria are also used, e.g. proximity of query words: "information retrieval" is better than "information ... retrieval"
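A sketch of the PageRank iteration with the damping factor d = 0.85 mentioned above; the link graph is hypothetical and the update rule is the standard (1 - d) + d * sum(...) form, not necessarily the exact variant shown on the slide.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum the rank flowing in from every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical link graph: I1 and I2 both link to A, and A links to B.
print(pagerank({"I1": ["A"], "I2": ["A"], "A": ["B"], "B": []}))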
27 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
28 Keyword Search
- The simplest notion of relevance is that the query string appears verbatim in the document.
- A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
29 Problems with Keywords
- May not retrieve relevant documents that include synonymous terms.
- restaurant vs. café
- PRC vs. China
- May retrieve irrelevant documents that include ambiguous terms.
- bat (baseball vs. mammal)
- Apple (company vs. fruit)
- bit (unit of data vs. act of eating)
30 Query Expansion
- http://www.lemurproject.org/lemur/IndriQueryLanguage.php
- Most errors are caused by vocabulary mismatch
- query: "cars", document: "automobiles"
- solution: automatically add highly related words
- Thesaurus / WordNet lookup (see the sketch below)
- add semantically related words (synonyms)
- cannot take context into account
- rail car vs. race car vs. car and cdr
- Statistical expansion
- add statistically related words (co-occurrence)
- very successful
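A sketch of thesaurus-based expansion using NLTK's WordNet interface (NLTK is not mentioned in the slides; any thesaurus lookup would do). It also shows the context problem noted above: synonyms from every sense of "car" are added.

from nltk.corpus import wordnet as wn

def expand(term):
    """Collect synonyms of a term across all of its WordNet senses."""
    synonyms = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return synonyms

print(expand("car"))   # e.g. {'car', 'auto', 'automobile', 'railcar', 'gondola', ...}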
31 IR Query Examples
- http://nlp.cs.qc.cuny.edu/ir.zip
- Query:
<parameters><query>#combine( #weight( 0.063356 #1(explosion) 0.187417 #1(blast) 0.411817 #1(wounded) 0.101370 #1(injured) 0.161191 #1(death) 0.074849 #1(deaths)) #weight( 0.311760 #1(Davao City international airport) 0.311760 #1(Tuesday) 0.103044 #1(DAVAO) 0.195505 #1(Philippines) 0.019817 #1(DXDC) 0.058113 #1(Davao Medical Center)))</query></parameters>
32 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
33 Document indexing
- Goal: find the important meanings and create an internal representation
- Factors to consider
- Accuracy in representing meanings (semantics)
- Exhaustiveness (cover all the contents)
- Facility for a computer to manipulate
- What is the best representation of contents?
- Character string (character trigrams): not precise enough
- Word: good coverage, not precise
- Phrase: poor coverage, more precise
- Concept: poor coverage, precise
(diagram: moving from string to word to phrase to concept increases accuracy/precision but decreases coverage/recall)
34 Indexer steps
- Sequence of (modified token, document ID) pairs (a sketch follows below).
- Doc 1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me."
- Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"
35 Indexer steps (cont.)
- Multiple term entries in a single document are merged.
- Frequency information is added.
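A minimal sketch of the indexer steps on slides 34-35: emit (token, docID) pairs, merge duplicate entries within a document, and record frequency information. Tokenization here is a crude whitespace split, purely for illustration.

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# token -> {doc_id: term frequency}
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in text.lower().split():   # crude tokenization and normalization
        index[token][doc_id] += 1

print(dict(index["caesar"]))   # {1: 1, 2: 2}
print(dict(index["killed"]))   # {1: 2}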
36 Stopwords / Stoplist
- Function words do not bear useful information for IR
- of, in, about, with, I, although, ...
- A stoplist contains stopwords that are not to be used as index terms
- Prepositions
- Articles
- Pronouns
- Some adverbs and adjectives
- Some frequent words (e.g. "document")
- The removal of stopwords usually improves IR effectiveness
- A few standard stoplists are commonly used.
37 Stemming
- Reason
- Different word forms may bear similar meaning (e.g. search, searching): create a standard representation for them
- Stemming
- Removing some endings of words (see the sketch below)
- computer, compute, computes, computing, computed, computation → comput
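A sketch of the example above using NLTK's Porter stemmer (the slides do not name a specific stemmer; any suffix-stripping stemmer behaves similarly):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: stemmer.stem(w) for w in words})
# All of these reduce to the stem "comput".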
38 Lemmatization
- Transform words to a standard form according to their syntactic category.
- E.g. verb + ing → verb; noun + s → noun
- Needs POS tagging
- More accurate than stemming, but needs more resources (see the sketch below)
- It is crucial to choose the stemming/lemmatization rules
- noise vs. recognition rate
- a compromise between precision and recall
- light/no stemming: higher precision, lower recall; severe stemming: higher recall, lower precision
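A sketch of POS-aware lemmatization using NLTK's WordNet lemmatizer (not named in the slides), showing why the syntactic category matters:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars", pos="n"))       # car    (noun + s -> noun)
print(lemmatizer.lemmatize("searching", pos="v"))  # search (verb + ing -> verb)
print(lemmatizer.lemmatize("searching", pos="n"))  # searching (left unchanged as a noun)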
39 Outline
- Introduction
- IR Approaches and Ranking
- Query Construction
- Document Indexing
- Web Search
40 IR on the Web
- No stable document collection (spider, crawler)
- Invalid document, duplication, etc.
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation of document quality
- Multilingual problem
- ...
41 Web Search
- Application of IR to HTML documents on the World Wide Web.
- Differences:
- Must assemble the document corpus by spidering the web (a minimal crawler sketch follows below).
- Can exploit the structural layout information in HTML (XML).
- Documents change uncontrollably.
- Can exploit the link structure of the web.
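A minimal sketch of assembling a corpus by spidering, as mentioned above; the seed URL, breadth-first strategy, and regex link extraction are illustrative choices, not a production crawler.

import re
import urllib.request
from collections import deque

def crawl(seed, max_pages=10):
    """Fetch pages breadth-first starting from a seed URL."""
    seen, queue, corpus = {seed}, deque([seed]), {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                      # invalid or unreachable documents are common
        corpus[url] = html                # store the page text for later indexing
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus

pages = crawl("http://www.lemurproject.org/", max_pages=5)
print(list(pages))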
42 Web Search System
(diagram: a web search system built around a core IR system)
44 Some formulas for Sim
- Dot product
- Cosine
- Dice
- Jaccard
(diagram: document vector D and query vector Q in a two-dimensional term space with axes t1 and t2)
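The formula images for these measures are not preserved in the text; the following are the standard definitions for a document vector D = (d_1, ..., d_t) and query vector Q = (q_1, ..., q_t), which the slide's formulas are expected to match.

\[
\mathrm{Sim}_{\mathrm{dot}}(D, Q) = \sum_{i=1}^{t} d_i q_i
\qquad
\mathrm{Sim}_{\cos}(D, Q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}}
\]
\[
\mathrm{Sim}_{\mathrm{Dice}}(D, Q) = \frac{2 \sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2}
\qquad
\mathrm{Sim}_{\mathrm{Jaccard}}(D, Q) = \frac{\sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2 - \sum_i d_i q_i}
\]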