Title: Web Search
1. Web Search: Information Retrieval
2. Boolean queries: Examples
- Simple queries involving relationships between terms and documents
  - Documents containing the word Java
  - Documents containing the word Java but not the word coffee
- Proximity queries
  - Documents containing the phrase "Java beans" or the term API
  - Documents where Java and island occur in the same sentence
3. Document preprocessing
- Tokenization
  - Filtering away tags
  - Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
  - Each token represented by a suitable integer, tid, typically 32 bits
  - Optional stemming/conflation of words
  - Result: the document (did) is transformed into a sequence of (tid, pos) integer pairs
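As a rough sketch of this preprocessing step (the helper function and the `lexicon` mapping are illustrative names, not from the source), a minimal Python version might look like:

```python
import re

def tokenize(did, text, lexicon):
    """Strip tags, lowercase, and map each token to a (tid, pos) pair.
    `lexicon` assigns a fresh integer tid to every previously unseen token."""
    text = re.sub(r"<[^>]+>", " ", text)               # filter away tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # nonempty runs of letters/digits
    pairs = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))    # small integer id (fits in 32 bits)
        pairs.append((tid, pos))
    return did, pairs

lexicon = {}
print(tokenize(1, "<p>Java beans and the Java island</p>", lexicon))
```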
4. Storing tokens
- Straightforward implementation using a relational database table of (tid, did, pos) rows
  - Space scales to almost 10 times the size of the original corpus
  - Accesses to the table show a common pattern: look up all rows for a given tid
- Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
- Indexing = transposing the document-term matrix
5. Two variants of the inverted index data structure, usually stored on disk. The simpler version (shown in the middle of the figure) does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as document/position) may be implemented using a B-tree or a hash table.
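A minimal positional inverted index along these lines, assuming the documents are already tokenized (the function and variable names are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> list of (did, pos) postings.
    `docs` maps document ids to lists of terms."""
    index = defaultdict(list)
    for did, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].append((did, pos))
    return index

docs = {1: ["java", "island", "coffee"], 2: ["java", "beans", "api"]}
index = build_index(docs)

# Boolean query from the earlier examples: documents containing java but not coffee.
java_docs = {did for did, _ in index["java"]}
coffee_docs = {did for did, _ in index["coffee"]}
print(java_docs - coffee_docs)   # {2}
```

With the position lists retained (the right-hand variant), phrase and proximity queries can also be answered by checking whether postings for adjacent query terms have consecutive positions in the same document.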
6. Stopwords
- Function words and connectives
  - Appear in a large number of documents and are of little use in pinpointing documents
- Indexing stopwords
  - Stopwords are not indexed, to reduce index space and improve performance
  - Replace stopwords with a placeholder, to remember the offset (see the sketch after this list)
- Issues
  - Queries containing only stopwords are ruled out
  - Polysemous words that are stopwords in one sense but not in others
    - E.g. "can" as a verb vs. "can" as a noun
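A small sketch of the placeholder idea, assuming a toy stopword list (the `<stop>` marker and the word list are illustrative, not from the source):

```python
STOPWORDS = {"the", "a", "of", "and", "or", "in"}   # toy list for illustration

def index_terms(tokens):
    """Yield (term, pos) pairs, replacing stopwords with a placeholder so the
    offsets of the remaining terms are preserved for proximity queries."""
    for pos, tok in enumerate(tokens):
        yield ("<stop>", pos) if tok in STOPWORDS else (tok, pos)

print(list(index_terms(["java", "and", "the", "island"])))
# [('java', 0), ('<stop>', 1), ('<stop>', 2), ('island', 3)]
```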
7. Stemming
- Conflating words to help match a query term with a morphological variant in the corpus
- Remove inflections that convey part of speech, tense and number
  - E.g. university and universal both stem to universe
- Techniques
  - Morphological analysis (e.g., Porter's algorithm)
  - Dictionary lookup (e.g., WordNet)
- Stemming may increase recall, but at the price of precision
  - Abbreviations, polysemy and names coined in the technical and commercial sectors
  - E.g. stemming ides to IDE, SOCKS to sock, or gated to gate may be bad!
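The following is a deliberately crude suffix-stripping sketch to illustrate conflation and its failure modes; it is not Porter's algorithm, and the suffix list is invented for the example:

```python
def crude_stem(word):
    """Very rough suffix stripping, for illustration only; Porter's algorithm
    applies a much more careful sequence of rewrite rules."""
    for suffix in ("ization", "ities", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ("universities", "connected", "socks"):
    print(w, "->", crude_stem(w))   # univers, connect, sock
```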
8. Maintaining indices over dynamic collections.
9. Relevance ranking
- Keyword queries
  - In natural language
  - Not precise, unlike SQL
  - A Boolean yes/no decision for each response is unacceptable
- Solution
  - Rate each document for how likely it is to satisfy the user's information need
  - Sort in decreasing order of this score
  - Present the results as a ranked list
- No algorithmic way of ensuring that the ranking strategy always favors the information need
  - The query is only a part of the user's information need
10. Responding to queries
- Set-valued response
  - The response set may be very large
  - (E.g., by recent estimates, over 12 million Web pages contain the word java.)
- Demanding a more selective query from the user
- Guessing the user's information need and ranking responses
- Evaluating rankings
11. Evaluation procedure
- Given a benchmark
  - A corpus of n documents D
  - A set of queries Q
  - For each query $q \in Q$, an exhaustive set of relevant documents $D_q \subseteq D$, identified manually
- Query q is submitted to the system
  - A ranked list of documents $(d_1, d_2, \ldots, d_n)$ is retrieved
  - Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$
    - $r_i = 1$ iff $d_i \in D_q$
    - $r_i = 0$ otherwise
12. Recall and precision
- Recall at rank $k \ge 1$
  - Fraction of all relevant documents included in $(d_1, \ldots, d_k)$
  - $\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{i=1}^{k} r_i$
- Precision at rank $k$
  - Fraction of the top $k$ responses that are actually relevant
  - $\mathrm{precision}(k) = \frac{1}{k} \sum_{i=1}^{k} r_i$
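Assuming the 0/1 relevance list and $|D_q|$ defined above, both measures can be computed directly (the toy relevance judgments are illustrative):

```python
def recall_at(k, rel, num_relevant):
    """rel is the 0/1 relevance list for the ranked results; num_relevant is |D_q|."""
    return sum(rel[:k]) / num_relevant

def precision_at(k, rel):
    return sum(rel[:k]) / k

rel = [1, 0, 1, 1, 0]                                # toy judgments for the top 5 results
print(recall_at(3, rel, 4), precision_at(3, rel))    # 0.5 and 0.666...
```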
13. Other measures
- Average precision
  - Sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents
  - $\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{k=1}^{n} r_k \cdot \mathrm{precision}(k)$
  - avg.precision $= 1$ iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
- Interpolated precision
  - Used to combine precision values from multiple queries
  - Gives a precision-vs.-recall curve for the benchmark
  - For each query, take the maximum precision obtained for the query at any recall greater than or equal to $\rho$
  - Average these together over all queries
- Others, like measures of authority, prestige, etc.
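A direct implementation of the average-precision formula above, summing precision at each relevant hit position (the toy relevance list is illustrative):

```python
def average_precision(rel, num_relevant):
    """Average of precision@k taken at every rank k where r_k = 1."""
    total = 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            total += sum(rel[:k]) / k
    return total / num_relevant

print(average_precision([1, 0, 1, 1, 0], 4))   # (1 + 2/3 + 3/4) / 4 ≈ 0.604
```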
14. Precision-recall tradeoff
- Interpolated precision cannot increase with recall
  - Interpolated precision at recall level 0 may be less than 1
- At rank $k = 0$
  - Precision (by convention) $= 1$, recall $= 0$
- Inspecting more documents
  - Can increase recall
  - Precision may decrease, since we start encountering more and more irrelevant documents
- A search engine with a good ranking function will generally show a negative relation between recall and precision
  - The higher the curve, the better the engine
15. Precision and interpolated precision plotted against recall for the given relevance vector. Missing $r_k$ values are zeroes.
16. The vector space model
- Documents represented as vectors in a multi-dimensional Euclidean space
  - Each axis corresponds to a term (token)
- Coordinate of document d in the direction of term t determined by
  - Term frequency TF(d,t)
    - Number of times term t occurs in document d, scaled in a variety of ways to normalize for document length
  - Inverse document frequency IDF(t)
    - Scales down the coordinates of terms that occur in many documents
17. Term frequency
- Simplest form: the count of t in d, normalized by document length
  - $\mathrm{TF}(d,t) = \frac{n(d,t)}{\sum_{\tau} n(d,\tau)}$
- The Cornell SMART system uses a smoothed version
  - $\mathrm{TF}(d,t) = 0$ if $n(d,t) = 0$, and $1 + \log\bigl(1 + \log n(d,t)\bigr)$ otherwise
18. Inverse document frequency
- Given
  - D is the document collection and $D_t$ is the set of documents containing t
- Formulae
  - Mostly dampened functions of $\frac{|D|}{|D_t|}$
- SMART
  - $\mathrm{IDF}(t) = \log \frac{1 + |D|}{|D_t|}$
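Putting the SMART-style TF smoothing and the IDF formula quoted above together gives a small TF-IDF sketch (the toy corpus and the function names are illustrative):

```python
import math
from collections import Counter

def tf_smart(count):
    """Smoothed term frequency: 0 if absent, else 1 + log(1 + log n(d,t))."""
    return 0.0 if count == 0 else 1.0 + math.log(1.0 + math.log(count))

def idf(term, docs):
    """IDF(t) = log((1 + |D|) / |D_t|)."""
    dt = sum(1 for terms in docs.values() if term in terms)
    return math.log((1 + len(docs)) / dt) if dt else 0.0

def tfidf_vector(did, docs):
    """Sparse TFIDF vector of a document, as a {term: weight} dict."""
    counts = Counter(docs[did])
    return {t: tf_smart(c) * idf(t, docs) for t, c in counts.items()}

docs = {1: ["java", "island", "java"], 2: ["java", "beans"], 3: ["coffee", "island"]}
print(tfidf_vector(1, docs))
```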
19. Vector space model
- Coordinate of document d along axis t
  - $d_t = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t)$
  - Document d is transformed to the vector $\vec{d}$ in TFIDF-space
- Query q
  - Interpreted as a document
  - Transformed to the vector $\vec{q}$ in the same TFIDF-space as $\vec{d}$
20. Measures of proximity
- Distance measure
  - Magnitude of the vector difference, $|\vec{d} - \vec{q}|$
  - Document vectors must be normalized to unit ($L_1$ or $L_2$) length
    - Else shorter documents dominate (since queries are short)
- Cosine similarity
  - Cosine of the angle between $\vec{d}$ and $\vec{q}$: $\frac{\vec{d} \cdot \vec{q}}{|\vec{d}|\,|\vec{q}|}$
  - Shorter documents are penalized
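A cosine-similarity sketch over sparse {term: weight} vectors such as those produced by the TF-IDF sketch earlier (the toy vectors are illustrative):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc = {"java": 1.5, "island": 0.7}
query = {"java": 1.0}
print(cosine(doc, query))
```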
21. Relevance feedback
- Users learn how to modify queries
  - The response list must have at least some relevant documents
- Relevance feedback
  - 'Correcting' the ranks to the user's taste
  - Automates the query refinement process
- Rocchio's method
  - Folding in user feedback: to the query vector,
    - Add a weighted sum of the vectors of the relevant documents $D_+$
    - Subtract a weighted sum of the vectors of the irrelevant documents $D_-$
  - $\vec{q}' = \alpha \vec{q} + \beta \sum_{d \in D_+} \vec{d} - \gamma \sum_{d \in D_-} \vec{d}$
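A minimal Rocchio update over sparse vectors; the default α, β, γ values below are illustrative choices, not values from the source:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio update: q' = alpha*q + beta*sum(D+) - gamma*sum(D-),
    over sparse {term: weight} vectors."""
    new_q = {t: alpha * w for t, w in query.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q

q = {"java": 1.0}
print(rocchio(q, relevant=[{"java": 0.8, "island": 0.6}], irrelevant=[{"coffee": 0.9}]))
```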
22. Relevance feedback (contd.)
- Pseudo-relevance feedback
  - $D_+$ and $D_-$ generated automatically
  - E.g. the Cornell SMART system
    - The top 10 documents reported by the first round of query execution are included in $D_+$
    - $\gamma$ is typically set to 0, so $D_-$ is not used
- Not a commonly available feature
  - Web users want instant gratification
  - System complexity
    - Executing the second-round query is slower and expensive for major search engines
23. Ranking by odds ratio
- R: a Boolean random variable which represents the relevance of document d w.r.t. query q
- Rank documents by their odds ratio for relevance
  - $\frac{\Pr(R \mid d, q)}{\Pr(\bar{R} \mid d, q)} = \frac{\Pr(d \mid R, q)\,\Pr(R \mid q)}{\Pr(d \mid \bar{R}, q)\,\Pr(\bar{R} \mid q)}$
- Approximate the probability of d by the product of the probabilities of the individual terms in d
  - $\Pr(d \mid R, q) \approx \prod_{t \in d} \Pr(t \mid R, q)$
- The odds ratio is then approximately proportional to $\prod_{t \in d} \frac{\Pr(t \mid R, q)}{\Pr(t \mid \bar{R}, q)}$
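Under the term-independence approximation, ranking by the odds ratio reduces to summing per-term log ratios; the term-probability dictionaries below are made-up placeholders, not estimates from any corpus:

```python
import math

def log_odds_score(doc_terms, p_rel, p_irr, eps=1e-6):
    """Sum of log(P(t | R, q) / P(t | not R, q)) over the terms of a document.
    `eps` is a tiny smoothing constant for terms missing from either dict."""
    return sum(math.log(p_rel.get(t, eps) / p_irr.get(t, eps)) for t in doc_terms)

p_rel = {"java": 0.6, "island": 0.3}
p_irr = {"java": 0.1, "island": 0.05, "coffee": 0.5}
print(log_odds_score(["java", "island"], p_rel, p_irr))
```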
24. Meta-search systems
- Take the search engine to the document
  - Forward queries to many geographically distributed repositories
    - Each has its own search service
  - Consolidate their responses
- Advantages
  - Perform non-trivial query rewriting
    - Suit a single user query to many search engines with different query syntax
  - Surprisingly small overlap between crawls
- Consolidating responses
  - The function goes beyond just eliminating duplicates
  - Search services do not provide standard ranks which can be combined meaningfully
25. Similarity search
- Cluster hypothesis
  - Documents similar to relevant documents are also likely to be relevant
- Handling "find similar" queries
  - Replication or duplication of pages
  - Mirroring of sites
26. Document similarity
- Jaccard coefficient of similarity between documents $d_1$ and $d_2$
  - T(d): the set of tokens in document d
  - $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
  - Symmetric and reflexive, but not a metric
  - Forgives any number of occurrences and any permutations of the terms
  - $1 - r'(d_1, d_2)$ is a metric
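A direct implementation of the Jaccard coefficient over token sets (the toy documents are illustrative):

```python
def jaccard(tokens1, tokens2):
    """r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)| over token sets."""
    t1, t2 = set(tokens1), set(tokens2)
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

a = ["java", "island", "coffee"]
b = ["java", "beans", "coffee"]
print(jaccard(a, b))        # 2 shared tokens out of 4 distinct -> 0.5
print(1 - jaccard(a, b))    # the corresponding distance, which is a metric
```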