Title: Web Search
1. Web Search: Information Retrieval
2. Boolean queries: Examples
- Simple queries involving relationships between terms and documents
  - Documents containing the word Java
  - Documents containing the word Java but not the word coffee
- Proximity queries
  - Documents containing the phrase "Java beans" or the term API
  - Documents where Java and island occur in the same sentence
3. Document preprocessing
- Tokenization
  - Filtering away tags
  - Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
  - Each token represented by a suitable integer, tid, typically 32 bits
  - Optional stemming/conflation of words
  - Result: the document (did) is transformed into a sequence of (tid, pos) integer pairs
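As a rough sketch of this preprocessing step (the helper function and the `lexicon` mapping are illustrative names, not from the source), a minimal Python version might look like:

```python
import re

def tokenize(did, text, lexicon):
    """Strip tags, lowercase, and map each token to a (tid, pos) pair.
    `lexicon` assigns a fresh integer tid to every previously unseen token."""
    text = re.sub(r"<[^>]+>", " ", text)               # filter away tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # nonempty runs of letters/digits
    pairs = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))    # small integer id (fits in 32 bits)
        pairs.append((tid, pos))
    return did, pairs

lexicon = {}
print(tokenize(1, "<p>Java beans and the Java island</p>", lexicon))
```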
4. Storing tokens
- Straightforward implementation using a relational database table of (tid, did, pos) rows
  - Space scales to almost 10 times the size of the original corpus
  - Accesses to the table show a common pattern: look up all rows for a given tid
- Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
- Indexing = transposing the document-term matrix
5. Two variants of the inverted index data structure, usually stored on disk. The simpler version (shown in the middle of the figure) does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as document/position) may be implemented using a B-tree or a hash table.
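A minimal positional inverted index along these lines, assuming the documents are already tokenized (the function and variable names are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> list of (did, pos) postings.
    `docs` maps document ids to lists of terms."""
    index = defaultdict(list)
    for did, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].append((did, pos))
    return index

docs = {1: ["java", "island", "coffee"], 2: ["java", "beans", "api"]}
index = build_index(docs)

# Boolean query from the earlier examples: documents containing java but not coffee.
java_docs = {did for did, _ in index["java"]}
coffee_docs = {did for did, _ in index["coffee"]}
print(java_docs - coffee_docs)   # {2}
```

With the position lists retained (the right-hand variant), phrase and proximity queries can also be answered by checking whether postings for adjacent query terms have consecutive positions in the same document.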
6. Stopwords
- Function words and connectives
  - Appear in a large number of documents and are of little use in pinpointing documents
- Indexing stopwords
  - Stopwords are not indexed, to reduce index space and improve performance
  - Replace stopwords with a placeholder, to remember the offset (see the sketch after this list)
- Issues
  - Queries containing only stopwords are ruled out
  - Polysemous words that are stopwords in one sense but not in others
    - E.g. "can" as a verb vs. "can" as a noun
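A small sketch of the placeholder idea, assuming a toy stopword list (the `<stop>` marker and the word list are illustrative, not from the source):

```python
STOPWORDS = {"the", "a", "of", "and", "or", "in"}   # toy list for illustration

def index_terms(tokens):
    """Yield (term, pos) pairs, replacing stopwords with a placeholder so the
    offsets of the remaining terms are preserved for proximity queries."""
    for pos, tok in enumerate(tokens):
        yield ("<stop>", pos) if tok in STOPWORDS else (tok, pos)

print(list(index_terms(["java", "and", "the", "island"])))
# [('java', 0), ('<stop>', 1), ('<stop>', 2), ('island', 3)]
```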
7. Stemming
- Conflating words to help match a query term with a morphological variant in the corpus
- Remove inflections that convey part of speech, tense and number
  - E.g. university and universal both stem to universe
- Techniques
  - Morphological analysis (e.g., Porter's algorithm)
  - Dictionary lookup (e.g., WordNet)
- Stemming may increase recall, but at the price of precision
  - Abbreviations, polysemy and names coined in the technical and commercial sectors
  - E.g. stemming ides to IDE, SOCKS to sock, or gated to gate may be bad!
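The following is a deliberately crude suffix-stripping sketch to illustrate conflation and its failure modes; it is not Porter's algorithm, and the suffix list is invented for the example:

```python
def crude_stem(word):
    """Very rough suffix stripping, for illustration only; Porter's algorithm
    applies a much more careful sequence of rewrite rules."""
    for suffix in ("ization", "ities", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ("universities", "connected", "socks"):
    print(w, "->", crude_stem(w))   # univers, connect, sock
```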
8. Maintaining indices over dynamic collections.
9. Relevance ranking
- Keyword queries
  - In natural language
  - Not precise, unlike SQL
  - A Boolean yes/no decision for each response is unacceptable
- Solution
  - Rate each document for how likely it is to satisfy the user's information need
  - Sort in decreasing order of this score
  - Present the results as a ranked list
- No algorithmic way of ensuring that the ranking strategy always favors the information need
  - The query is only a part of the user's information need
10. Responding to queries
- Set-valued response
  - The response set may be very large
  - (E.g., by recent estimates, over 12 million Web pages contain the word java.)
- Demanding a more selective query from the user
- Guessing the user's information need and ranking responses
- Evaluating rankings
11. Evaluation procedure
- Given a benchmark
  - A corpus of n documents D
  - A set of queries Q
  - For each query $q \in Q$, an exhaustive set of relevant documents $D_q \subseteq D$, identified manually
- Query q is submitted to the system
  - A ranked list of documents $(d_1, d_2, \ldots, d_n)$ is retrieved
  - Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$
    - $r_i = 1$ iff $d_i \in D_q$
    - $r_i = 0$ otherwise
12. Recall and precision
- Recall at rank $k \ge 1$
  - Fraction of all relevant documents included in $(d_1, \ldots, d_k)$
  - $\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{i=1}^{k} r_i$
- Precision at rank $k$
  - Fraction of the top $k$ responses that are actually relevant
  - $\mathrm{precision}(k) = \frac{1}{k} \sum_{i=1}^{k} r_i$
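Assuming the 0/1 relevance list and $|D_q|$ defined above, both measures can be computed directly (the toy relevance judgments are illustrative):

```python
def recall_at(k, rel, num_relevant):
    """rel is the 0/1 relevance list for the ranked results; num_relevant is |D_q|."""
    return sum(rel[:k]) / num_relevant

def precision_at(k, rel):
    return sum(rel[:k]) / k

rel = [1, 0, 1, 1, 0]                                # toy judgments for the top 5 results
print(recall_at(3, rel, 4), precision_at(3, rel))    # 0.5 and 0.666...
```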
13. Other measures
- Average precision
  - Sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents
  - $\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{k=1}^{n} r_k \cdot \mathrm{precision}(k)$
  - avg.precision $= 1$ iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
- Interpolated precision
  - Used to combine precision values from multiple queries
  - Gives a precision-vs.-recall curve for the benchmark
  - For each query, take the maximum precision obtained for the query at any recall greater than or equal to $\rho$
  - Average these together over all queries
- Others, like measures of authority, prestige, etc.
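A direct implementation of the average-precision formula above, summing precision at each relevant hit position (the toy relevance list is illustrative):

```python
def average_precision(rel, num_relevant):
    """Average of precision@k taken at every rank k where r_k = 1."""
    total = 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            total += sum(rel[:k]) / k
    return total / num_relevant

print(average_precision([1, 0, 1, 1, 0], 4))   # (1 + 2/3 + 3/4) / 4 ≈ 0.604
```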
14. Precision-recall tradeoff
- Interpolated precision cannot increase with recall
  - Interpolated precision at recall level 0 may be less than 1
- At rank $k = 0$
  - Precision (by convention) $= 1$, recall $= 0$
- Inspecting more documents
  - Can increase recall
  - Precision may decrease, since we start encountering more and more irrelevant documents
- A search engine with a good ranking function will generally show a negative relation between recall and precision
  - The higher the curve, the better the engine
15. Precision and interpolated precision plotted against recall for the given relevance vector. Missing $r_k$ values are zeroes.
16. The vector space model
- Documents represented as vectors in a multi-dimensional Euclidean space
  - Each axis corresponds to a term (token)
- Coordinate of document d in the direction of term t determined by
  - Term frequency TF(d,t)
    - Number of times term t occurs in document d, scaled in a variety of ways to normalize for document length
  - Inverse document frequency IDF(t)
    - Scales down the coordinates of terms that occur in many documents
17. Term frequency
- Simplest form: the count of t in d, normalized by document length
  - $\mathrm{TF}(d,t) = \frac{n(d,t)}{\sum_{\tau} n(d,\tau)}$
- The Cornell SMART system uses a smoothed version
  - $\mathrm{TF}(d,t) = 0$ if $n(d,t) = 0$, and $1 + \log\bigl(1 + \log n(d,t)\bigr)$ otherwise
18. Inverse document frequency
- Given
  - D is the document collection and $D_t$ is the set of documents containing t
- Formulae
  - Mostly dampened functions of $\frac{|D|}{|D_t|}$
- SMART
  - $\mathrm{IDF}(t) = \log \frac{1 + |D|}{|D_t|}$
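Putting the SMART-style TF smoothing and the IDF formula quoted above together gives a small TF-IDF sketch (the toy corpus and the function names are illustrative):

```python
import math
from collections import Counter

def tf_smart(count):
    """Smoothed term frequency: 0 if absent, else 1 + log(1 + log n(d,t))."""
    return 0.0 if count == 0 else 1.0 + math.log(1.0 + math.log(count))

def idf(term, docs):
    """IDF(t) = log((1 + |D|) / |D_t|)."""
    dt = sum(1 for terms in docs.values() if term in terms)
    return math.log((1 + len(docs)) / dt) if dt else 0.0

def tfidf_vector(did, docs):
    """Sparse TFIDF vector of a document, as a {term: weight} dict."""
    counts = Counter(docs[did])
    return {t: tf_smart(c) * idf(t, docs) for t, c in counts.items()}

docs = {1: ["java", "island", "java"], 2: ["java", "beans"], 3: ["coffee", "island"]}
print(tfidf_vector(1, docs))
```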
19. Vector space model
- Coordinate of document d along axis t
  - $d_t = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t)$
  - Document d is transformed to the vector $\vec{d}$ in TFIDF-space
- Query q
  - Interpreted as a document
  - Transformed to the vector $\vec{q}$ in the same TFIDF-space as $\vec{d}$
20. Measures of proximity
- Distance measure
  - Magnitude of the vector difference, $|\vec{d} - \vec{q}|$
  - Document vectors must be normalized to unit ($L_1$ or $L_2$) length
    - Else shorter documents dominate (since queries are short)
- Cosine similarity
  - Cosine of the angle between $\vec{d}$ and $\vec{q}$: $\frac{\vec{d} \cdot \vec{q}}{|\vec{d}|\,|\vec{q}|}$
  - Shorter documents are penalized
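A cosine-similarity sketch over sparse {term: weight} vectors such as those produced by the TF-IDF sketch earlier (the toy vectors are illustrative):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc = {"java": 1.5, "island": 0.7}
query = {"java": 1.0}
print(cosine(doc, query))
```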
21. Relevance feedback
- Users learn how to modify queries
  - The response list must have at least some relevant documents
- Relevance feedback
  - 'Correcting' the ranks to the user's taste
  - Automates the query refinement process
- Rocchio's method
  - Folding in user feedback: to the query vector,
    - Add a weighted sum of the vectors of the relevant documents $D_+$
    - Subtract a weighted sum of the vectors of the irrelevant documents $D_-$
  - $\vec{q}' = \alpha \vec{q} + \beta \sum_{d \in D_+} \vec{d} - \gamma \sum_{d \in D_-} \vec{d}$
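A minimal Rocchio update over sparse vectors; the default α, β, γ values below are illustrative choices, not values from the source:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio update: q' = alpha*q + beta*sum(D+) - gamma*sum(D-),
    over sparse {term: weight} vectors."""
    new_q = {t: alpha * w for t, w in query.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q

q = {"java": 1.0}
print(rocchio(q, relevant=[{"java": 0.8, "island": 0.6}], irrelevant=[{"coffee": 0.9}]))
```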
22. Relevance feedback (contd.)
- Pseudo-relevance feedback
  - $D_+$ and $D_-$ generated automatically
  - E.g. the Cornell SMART system
    - The top 10 documents reported by the first round of query execution are included in $D_+$
    - $\gamma$ is typically set to 0, so $D_-$ is not used
- Not a commonly available feature
  - Web users want instant gratification
  - System complexity
    - Executing the second-round query is slower and expensive for major search engines
23. Ranking by odds ratio
- R: a Boolean random variable which represents the relevance of document d w.r.t. query q
- Rank documents by their odds ratio for relevance
  - $\frac{\Pr(R \mid d, q)}{\Pr(\bar{R} \mid d, q)} = \frac{\Pr(d \mid R, q)\,\Pr(R \mid q)}{\Pr(d \mid \bar{R}, q)\,\Pr(\bar{R} \mid q)}$
- Approximate the probability of d by the product of the probabilities of the individual terms in d
  - $\Pr(d \mid R, q) \approx \prod_{t \in d} \Pr(t \mid R, q)$
- The odds ratio is then approximately proportional to $\prod_{t \in d} \frac{\Pr(t \mid R, q)}{\Pr(t \mid \bar{R}, q)}$
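Under the term-independence approximation, ranking by the odds ratio reduces to summing per-term log ratios; the term-probability dictionaries below are made-up placeholders, not estimates from any corpus:

```python
import math

def log_odds_score(doc_terms, p_rel, p_irr, eps=1e-6):
    """Sum of log(P(t | R, q) / P(t | not R, q)) over the terms of a document.
    `eps` is a tiny smoothing constant for terms missing from either dict."""
    return sum(math.log(p_rel.get(t, eps) / p_irr.get(t, eps)) for t in doc_terms)

p_rel = {"java": 0.6, "island": 0.3}
p_irr = {"java": 0.1, "island": 0.05, "coffee": 0.5}
print(log_odds_score(["java", "island"], p_rel, p_irr))
```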
24. Meta-search systems
- Take the search engine to the document
  - Forward queries to many geographically distributed repositories
    - Each has its own search service
  - Consolidate their responses
- Advantages
  - Perform non-trivial query rewriting
    - Suit a single user query to many search engines with different query syntax
  - Surprisingly small overlap between crawls
- Consolidating responses
  - The function goes beyond just eliminating duplicates
  - Search services do not provide standard ranks which can be combined meaningfully
25. Similarity search
- Cluster hypothesis
  - Documents similar to relevant documents are also likely to be relevant
- Handling "find similar" queries
  - Replication or duplication of pages
  - Mirroring of sites
26. Document similarity
- Jaccard coefficient of similarity between documents $d_1$ and $d_2$
  - T(d): the set of tokens in document d
  - $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
  - Symmetric and reflexive, but not a metric
  - Forgives any number of occurrences and any permutations of the terms
  - $1 - r'(d_1, d_2)$ is a metric
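A direct implementation of the Jaccard coefficient over token sets (the toy documents are illustrative):

```python
def jaccard(tokens1, tokens2):
    """r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)| over token sets."""
    t1, t2 = set(tokens1), set(tokens2)
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

a = ["java", "island", "coffee"]
b = ["java", "beans", "coffee"]
print(jaccard(a, b))        # 2 shared tokens out of 4 distinct -> 0.5
print(1 - jaccard(a, b))    # the corresponding distance, which is a metric
```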