Title: Text and Web Search
1. Text and Web Search
2. Text Databases and IR
- Text databases (document databases)
  - Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- Information retrieval (IR)
  - A field developed in parallel with database systems
  - Information is organized into (a large number of) documents
  - The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
3. Information Retrieval
- Typical IR systems:
  - Online library catalogs
  - Online document management systems
- Information retrieval vs. database systems:
  - Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
  - Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
4. Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
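The two measures as a minimal Python sketch (the function name and the example document-id sets are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant              # correct responses
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# e.g., 3 of the 4 retrieved docs are relevant, out of 6 relevant overall
print(precision_recall({"d1", "d2", "d3", "d9"},
                       {"d1", "d2", "d3", "d4", "d5", "d6"}))  # (0.75, 0.5)
```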
5. Information Retrieval Techniques
- Index term (attribute) selection:
  - Stop list
  - Word stemming
  - Index term weighting methods
  - Term x document frequency matrices
- Information retrieval models:
  - Boolean model
  - Vector model
  - Probabilistic model
6. Problem - Motivation
- Given a database of documents, find the documents containing the terms "data", "retrieval"
- Applications:
  - Web
  - law + patent offices
  - digital libraries
  - information filtering
7. Problem - Motivation
- Types of queries:
  - boolean ("data" AND "retrieval" AND NOT ...)
  - additional features ("data" ADJACENT "retrieval")
  - keyword queries ("data", "retrieval")
- How to search a large collection of documents?
8. Full-text scanning
- for a single term:
  - naive: O(N*M) (N: text length, M: pattern length)
[Figure: sliding the pattern CAB along the text ABRACADABRA.]
9. Full-text scanning
- for a single term:
  - naive: O(N*M)
  - Knuth, Morris and Pratt ('77):
    - build a small FSA; visit every text letter once only, by carefully shifting more than one step (see the sketch below)
[Figure: the pattern CAB compared against the text ABRACADABRA.]
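A sketch of the KMP idea in Python, assuming the usual failure-function formulation of the "small FSA" (this is not code from the slides):

```python
def kmp_search(text, pattern):
    """Return the first index of pattern in text, or -1, in O(N + M)."""
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also its suffix -- how far we can safely shift the pattern
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    k = 0
    for i, ch in enumerate(text):            # each text letter visited once
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(kmp_search("ABRACADABRA", "CAD"))      # 4
print(kmp_search("ABRACADABRA", "CAB"))      # -1 (no match)
```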
10. Full-text scanning
[Figure: successive shifts of the pattern CAB along the text ABRACADABRA.]
11. Full-text scanning
- for a single term:
  - naive: O(N*M)
  - Knuth, Morris and Pratt ('77)
  - Boyer and Moore ('77):
    - preprocess the pattern; compare from right to left; skip! (see the sketch below)
[Figure: the pattern CAB compared right-to-left against the text ABRACADABRA.]
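A sketch of the Boyer-Moore "skip" in Python, using only the bad-character heuristic (the full algorithm adds a good-suffix rule; illustrative, not the slides' code):

```python
def bm_search(text, pattern):
    """Right-to-left matching with bad-character skips."""
    m = len(pattern)
    last = {ch: i for i, ch in enumerate(pattern)}   # rightmost occurrence
    s = 0                                            # current shift
    while s <= len(text) - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            return s                                 # full match at shift s
        # align the mismatching text char with its rightmost occurrence
        # in the pattern, or skip past it entirely
        s += max(1, j - last.get(text[s + j], -1))
    return -1

print(bm_search("ABRACADABRA", "CAD"))               # 4
```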
12. Text - Detailed outline
- text
  - problem
  - full text scanning
  - inversion
  - signature files
  - clustering
  - information filtering and LSI
13. Text - Inverted Files
14. Text - Inverted Files
- Q: space overhead?
- A: mainly, the postings lists
15. Text - Inverted Files
- how to organize the dictionary?
- stemming - Y/N?
  - keep only the root of each word, e.g., "inverted", "inversion" -> "invert"
- insertions?
16. Text - Inverted Files
- how to organize the dictionary?
  - B-tree, hashing, TRIEs, PATRICIA trees, ...
- stemming - Y/N?
- insertions?
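A toy inverted file in Python: a hash map standing in for the dictionary (a B-tree or trie in practice), mapping each term to its postings list of document ids (illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted postings list}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "data retrieval systems", 2: "image retrieval", 3: "data mining"}
index = build_inverted_index(docs)
print(index["retrieval"])                            # [1, 2]
# an AND query is an intersection of postings lists
print(set(index["data"]) & set(index["retrieval"]))  # {1}
```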
17. Text - Inverted Files
- postings lists follow a Zipf distribution; e.g., the rank-frequency plot of the Bible:
  freq ~ 1 / (rank * ln(1.78*V))   (V: vocabulary size)
[Figure: log(freq) vs. log(rank), a straight line.]
18. Text - Inverted Files
- postings lists:
  - [Cutting & Pedersen]: keep the first 4 postings in the B-tree leaves
  - how to allocate space: [Faloutsos '92], geometric progression
  - compression (Elias codes) [Zobel]: down to ~2% overhead! (sketch below)
- Conclusions: needs space overhead (~2%-300%), but it is the fastest
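A sketch of Elias gamma coding, one member of the Elias family, applied to the gaps of a postings list (illustrative; real systems often use Elias delta or Golomb codes):

```python
def elias_gamma(n):
    """Encode an integer n >= 1: (len(binary)-1) zeros, then the binary."""
    b = bin(n)[2:]                       # binary representation of n
    return "0" * (len(b) - 1) + b

# postings lists are stored as gaps; Zipf keeps most gaps small
postings = [3, 7, 11, 23]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
print(gaps)                                       # [3, 4, 4, 12]
print(" ".join(elias_gamma(g) for g in gaps))     # 011 00100 00100 0001100
```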
19. Vector Space Model and Clustering
- keyword (free-text) queries (vs. Boolean)
- each document -> vector (HOW?)
- each query -> vector
- search for similar vectors
20. Vector Space Model and Clustering
- main idea: each document is a vector of size d; d is the number of different terms in the database
[Figure: a document containing "...data..." mapped to a d-dimensional vector with one coordinate per vocabulary term (aaron, ..., data, ..., indexing, ..., zoo); d = vocabulary size.]
21. Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating-point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
22. Document Vectors: One location for each word.

         nova  galaxy  heat  h'wood  film  role  diet  fur
    A     10     5      3
    B      5    10
    C                  10      8      7
    D                          9     10     5
    E                                10    10
    F      9    10
    G      5     7      9
    H                          6     10            2    8
    I                          7      5            1    3

"Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)
23. Document Vectors: One location for each word.
[Same term-document table as slide 22.]
"Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.
24. Document Vectors
[Same term-document table as slide 22; the row labels A-I are the document ids.]
25. We Can Plot the Vectors
[Figure: documents plotted in the plane with axes "Star" and "Diet"; a doc about movie stars, a doc about astronomy, and a doc about mammal behavior fall in different regions.]
26. Vector Space Model and Clustering
- Then, group nearby vectors together
  - Q1: cluster search?
  - Q2: cluster generation?
- Two significant contributions:
  - ranked output
  - relevance feedback
27. Vector Space Model and Clustering
- cluster search: visit the (k) closest superclusters; continue recursively
[Figure: clusters of MD TRs and CS TRs (technical reports).]
28. Vector Space Model and Clustering
[Figure: the MD TRs and CS TRs clusters.]
29. Vector Space Model and Clustering
- relevance feedback (brilliant idea) [Rocchio '73]
[Figure: the MD TRs and CS TRs clusters.]
30. Vector Space Model and Clustering
- relevance feedback (brilliant idea) [Rocchio '73]
- How?
[Figure: the MD TRs and CS TRs clusters.]
31. Vector Space Model and Clustering
- How? A: by adding the "good" vectors and subtracting the "bad" ones (see the sketch below)
[Figure: the MD TRs and CS TRs clusters.]
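A minimal sketch of the Rocchio update: add the centroid of the "good" vectors, subtract the centroid of the "bad" ones (the weights alpha, beta, gamma are conventional defaults, not from the slides):

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant docs and away from non-relevant ones."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)       # add the good centroid
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)   # subtract the bad one
    return q

q = np.array([1.0, 0.0])
good = np.array([[0.9, 0.1], [0.8, 0.2]])
bad = np.array([[0.1, 0.9]])
print(rocchio(q, good, bad))        # the query drifts toward the good cluster
```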
32. Cluster generation
- Problem:
  - given N points in V dimensions,
  - group them
33. Cluster generation
- Problem:
  - given N points in V dimensions,
  - group them (typically k-means or AGNES is used; see the sketch below)
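A compact k-means sketch in NumPy for grouping the N points (the parameters and data are illustrative):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center,
    then recompute each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

pts = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
print(kmeans(pts, 2)[0])   # first two points form one group, last two the other
```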
34. Assigning Weights to Terms
- Binary weights
- Raw term frequency
- tf x idf
- Recall the Zipf distribution
- We want to weight terms highly if they are:
  - frequent in relevant documents, BUT
  - infrequent in the collection as a whole
35. Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector
36. Raw Term Weights
- The frequency of occurrence of the term in each document is included in the vector
37. Assigning Weights
- the tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf) - a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
38. tf x idf
- w(i,j) = tf(i,j) * log(N / df(j)), where tf(i,j) is the frequency of term j in document i, N is the number of documents, and df(j) is the number of documents containing term j
39. Inverse Document Frequency
- IDF provides high values for rare words and low values for common words
- For a collection of 10000 documents:
  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.699
  log(10000 / 1)     = 4
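Checking the example values in Python (log base 10, as in the table above):

```python
import math

N = 10000                                # collection size
for df in (10000, 5000, 20, 1):
    print(df, round(math.log10(N / df), 3))
# prints: 10000 0.0 / 5000 0.301 / 20 2.699 / 1 4.0
```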
40. Similarity Measures for document vectors
- Simple matching (coordination level match)
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
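Sketches of these coefficients for the binary (set-of-terms) case; X and Y are the term sets of two documents (weighted variants exist but are omitted here):

```python
def similarities(X, Y):
    X, Y = set(X), set(Y)
    both = len(X & Y)                    # terms the documents share
    return {
        "simple matching": both,         # coordination level match
        "Dice":    2 * both / (len(X) + len(Y)),
        "Jaccard": both / len(X | Y),
        "cosine":  both / (len(X) * len(Y)) ** 0.5,
        "overlap": both / min(len(X), len(Y)),
    }

print(similarities({"data", "retrieval", "systems"}, {"data", "mining"}))
```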
41. tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
- "normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1, inclusive.
42. Vector space similarity (use the weights to compare the documents)
43. Computing Similarity Scores
[Figure: a query and two document vectors plotted in the unit square.]
44. Vector Space with Term Weights and Cosine Matching
- D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit); Q = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit)
[Figure: the Term A / Term B plane (axes 0 to 1.0) showing Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).]
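Working through the cosine match for the vectors in the figure (Q, D1, D2 as given on the slide):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two weighted term vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

Q  = np.array([0.4, 0.8])
D1 = np.array([0.8, 0.3])
D2 = np.array([0.2, 0.7])
print(round(cosine(Q, D1), 3))           # 0.733
print(round(cosine(Q, D2), 3))           # 0.983 -> D2 matches Q better
```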
45. Text - Detailed outline
- Text databases
  - problem
  - full text scanning
  - inversion
  - signature files (a.k.a. Bloom filters)
  - vector model and clustering
  - information filtering and LSI
46. Information Filtering + LSI
- [Foltz '92] Goal:
  - users specify their interests (= keywords)
  - the system alerts them on suitable news documents
- Major contribution: LSI = Latent Semantic Indexing
  - latent ("hidden") concepts
47. Information Filtering + LSI
- Main idea:
  - map each document into some "concepts"
  - map each term into some "concepts"
- "Concept": a set of terms, with weights, e.g.
  - "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept
48. Information Filtering + LSI
- Pictorially: term-document matrix (BEFORE)
49. Information Filtering + LSI
- Pictorially: concept-document matrix and...
50. Information Filtering + LSI
- ... and concept-term matrix
51. Information Filtering + LSI
- Q: How to search, e.g., for "system"?
52. Information Filtering + LSI
- A: find the corresponding concept(s), and the corresponding documents
53. Information Filtering + LSI
- A: find the corresponding concept(s), and the corresponding documents
54. Information Filtering + LSI
- Thus it works like an (automatically constructed) thesaurus:
- we may retrieve documents that DON'T have the term "system", but do contain almost everything else ("data", "retrieval")
55. SVD
56. SVD - Definition
- A[n x m] = U[n x r] L[r x r] (V[m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- L: r x r diagonal matrix (strength of each concept) (r: rank of the matrix)
- V: m x r matrix (m terms, r concepts)
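A sketch of the decomposition in NumPy on a small document-term matrix (the entries are made up: two "CS-like" documents on the first two terms, two "MD-like" documents on the third):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],           # n=4 documents x m=3 terms
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 4.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s.round(2))                        # strengths of the concepts: [5. 3.16 0.]

# keep only the first k concepts (here k=2 already captures everything)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(A, A_k))               # True: rank-2 reconstruction is exact
```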
57. SVD - Example
[Figure: a term-document matrix over the terms (data, inf., retrieval, brain, lung) and documents (CS, MD) written as the product of three matrices.]
58. SVD - Example
[Figure: the same decomposition, with the two columns of U labeled "CS-concept" and "MD-concept".]
59. SVD - Example
[Figure: the same decomposition; U is the doc-to-concept similarity matrix.]
60. SVD - Example
[Figure: the same decomposition; the first diagonal entry of L is the strength of the CS-concept.]
61. SVD - Example
[Figure: the same decomposition; V is the term-to-concept similarity matrix, with the "CS-concept" column highlighted.]
62. SVD - Example
[Figure: the same decomposition; V is the term-to-concept similarity matrix.]
63. SVD for LSI
- documents, terms and concepts:
  - U: document-to-concept similarity matrix
  - V: term-to-concept similarity matrix
  - L: its diagonal elements give the "strength" of each concept
64. SVD for LSI
- Do we need to keep all the eigenvectors?
- NO, just keep the first k ("concepts")
65. Web Search
- What about web search?
  - First you need to get all the documents of the web: crawlers.
  - Then you have to index them (inverted files, etc.)
  - Find the web pages that are relevant to the query
  - Report the pages with their links in a sorted order
- Main difference from IR: web pages have links
  - it may be possible to exploit the link structure for sorting the relevant documents
66. Kleinberg's Algorithm (HITS)
- Main idea: in many cases, when you search the web using some terms, the most relevant pages may not contain these terms (or contain them only a few times)
  - Harvard: www.harvard.edu
  - Search engines: yahoo, google, altavista
- Authorities and hubs
67. Kleinberg's algorithm
- Problem definition: given the web and a query,
- find the most "authoritative" web pages for this query
- Step 0: find all pages containing the query terms ("root set")
- Step 1: expand by one move forward and backward ("base set")
68. Kleinberg's algorithm
- Step 1: expand by one move forward and backward
69. Kleinberg's algorithm
- on the resulting graph, give a high score (= "authorities") to nodes that many important nodes point to
- give a high importance score ("hubs") to nodes that point to good "authorities"
[Figure: hubs on one side pointing to authorities on the other.]
70. Kleinberg's algorithm
- observations:
  - recursive definition!
  - each node (say, the i-th node) has both an authoritativeness score a_i and a hubness score h_i
71. Kleinberg's algorithm
- Let E be the set of edges and A be the adjacency matrix:
  - entry (i,j) is 1 if the edge from i to j exists
- Let h and a be n x 1 vectors with the "hubness" and "authoritativeness" scores.
- Then:
72. Kleinberg's algorithm
- Then:
  - a_i = h_k + h_l + h_m
- that is:
  - a_i = Sum(h_j), over all j such that the edge (j,i) exists
- or:
  - a = A^T h
[Figure: nodes k, l, m pointing to node i.]
73. Kleinberg's algorithm
- symmetrically, for the "hubness":
  - h_i = a_n + a_p + a_q
- that is:
  - h_i = Sum(a_j), over all j such that the edge (i,j) exists
- or:
  - h = A a
[Figure: node i pointing to nodes n, p, q.]
74. Kleinberg's algorithm
- In conclusion, we want vectors h and a such that:
  - h = A a
  - a = A^T h
- Start with a and h all set to 1. Then apply the following trick:
  h = A a = A (A^T h) = (A A^T) h = ... = (A A^T)^2 h = ... = (A A^T)^k h
  and similarly a = (A^T A)^k a
(see the sketch below)
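A sketch of this iteration in NumPy, with a normalization step so the scores stay bounded (toy adjacency matrix, not Kleinberg's code):

```python
import numpy as np

def hits(A, iters=50):
    """Power iteration for hub (h) and authority (a) scores."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h                      # authorities: pointed to by good hubs
        h = A @ a                        # hubs: point to good authorities
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return h, a

# toy graph: nodes 0 and 1 both point to node 2
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
h, a = hits(A)
print(a.round(2))                        # node 2 gets the top authority score
```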
75. Kleinberg's algorithm
- In short, the solutions to
  - h = A a
  - a = A^T h
- are the left and right singular vectors of the adjacency matrix A.
- Starting from random a and iterating, we'll eventually converge
- (Q: to which of all the eigenvectors? why?)
76. Kleinberg's algorithm
- (Q: to which of all the eigenvectors? why?)
- A: to the ones of the strongest eigenvalue, because of the property:
  (A^T A)^k v ~ (constant) v_1, where v_1 is the strongest eigenvector
- So, we can find the a and h vectors, and the pages with the highest a values are reported!
77. Kleinberg's algorithm - results
- E.g., for the query "java":
  - 0.328 www.gamelan.com
  - 0.251 java.sun.com
  - 0.190 www.digitalfocus.com ("the java developer")
78. Kleinberg's algorithm - discussion
- the "authority" score can be used to find "similar pages" to page p
- closely related to citation analysis, social networks / "small world" phenomena
79. google / page-rank algorithm
- closely related: the Web is a directed graph of connected nodes
- imagine a particle randomly moving along the edges (*)
- compute its steady-state probabilities; that gives the PageRank of each page (= the importance of this page)
- (*) with occasional random jumps
80. PageRank Definition
- Assume a page A and pages T1, T2, ..., Tm that point to A. Let d be a damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:
  PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tm)/C(Tm) )
81. google / page-rank algorithm
- Computing the PR of each page is an identical problem to: given a Markov Chain, compute the steady-state probabilities p1 ... p5
[Figure: a 5-node directed graph (nodes 1-5).]
82. Computing PageRank
- Iterative procedure
- Alternatively: navigate the web by randomly following links, or, with probability d, jump to a random page. Let A be the adjacency matrix (n x n) and c_i the out-degree of page i. Then:
  Prob(A_i -> A_j) = d * n^(-1) + (1 - d) * c_i^(-1) * A_ij
  A'[i,j] = Prob(A_i -> A_j)
(see the sketch below)
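A sketch of the iterative procedure in NumPy, building the modified transition matrix from the formula above and iterating to the steady state (toy graph; d and the graph are illustrative):

```python
import numpy as np

def pagerank(A, d=0.15, iters=100):
    """A: 0/1 adjacency matrix (n x n); d: random-jump probability."""
    n = A.shape[0]
    c = A.sum(axis=1)                              # out-degree of each page
    P = np.empty((n, n))
    for i in range(n):
        if c[i] > 0:
            P[i] = d / n + (1 - d) * A[i] / c[i]   # the formula above
        else:
            P[i] = 1.0 / n                         # dangling page: jump anywhere
    p = np.ones(n) / n
    for _ in range(iters):
        p = P.T @ p                    # steady state satisfies A'^T p = p
    return p

# 3-page toy web: pages 0 and 1 point to page 2; page 2 points back to 0
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(pagerank(A).round(3))            # page 2 collects the most PageRank
```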
83. google / page-rank algorithm
- Let A be the transition matrix (= adjacency matrix, row-normalized: the sum of each row = 1)
[Figure: the 5-node graph and its row-normalized transition matrix.]
84. google / page-rank algorithm
- A^T p = p
[Figure: the 5-node graph next to the equation A^T p = p.]
85. google / page-rank algorithm
- A^T p = p
- thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized)
86. Kleinberg / google - conclusions
- SVD helps in graph analysis:
  - hub/authority scores: the strongest left and right singular vectors of the adjacency matrix
  - random walk on a graph: steady-state probabilities are given by the strongest eigenvector of the (transposed) transition matrix
87. References
- Brin, S. and L. Page (1998). "The Anatomy of a Large-Scale Hypertextual Web Search Engine." 7th Intl. World Wide Web Conf.