1
Text and Web Search
2
Text Databases and IR
  • Text databases (document databases)
  • Large collections of documents from various
    sources: news articles, research papers, books,
    digital libraries, e-mail messages, Web pages,
    library databases, etc.
  • Information retrieval (IR)
  • A field developed in parallel with database
    systems
  • Information is organized into (a large number of)
    documents
  • The information retrieval problem: locating relevant
    documents based on user input, such as keywords
    or example documents

3
Information Retrieval
  • Typical IR systems
  • Online library catalogs
  • Online document management systems
  • Information retrieval vs. database systems
  • Some DB problems are not present in IR, e.g.,
    update, transaction management, complex objects
  • Some IR problems are not addressed well in DBMS,
    e.g., unstructured documents, approximate search
    using keywords and relevance

4
Basic Measures for Text Retrieval
  • Precision: the percentage of retrieved documents
    that are in fact relevant to the query (i.e.,
    correct responses)
  • Recall: the percentage of documents that are
    relevant to the query and were, in fact, retrieved

5
Information Retrieval Techniques
  • Index Terms (Attribute) Selection
  • Stop list
  • Word stem
  • Index terms weighting methods
  • Term-Document frequency matrices
  • Information Retrieval Models
  • Boolean Model
  • Vector Model
  • Probabilistic Model

6
Problem - Motivation
  • Given a database of documents, find the documents
    containing the terms "data", "retrieval"
  • Applications
  • Web
  • law / patent offices
  • digital libraries
  • information filtering

7
Problem - Motivation
  • Types of queries
  • boolean (data AND retrieval AND NOT ...)
  • additional features (data ADJACENT retrieval)
  • keyword queries (data, retrieval)
  • How to search a large collection of documents?

8
Full-text scanning
  • for single term
  • (naive O(NM))

text:    ABRACADABRA
pattern: CAB
9
Full-text scanning
  • for single term
  • (naive O(NM))
  • Knuth, Morris and Pratt (77)
  • build a small FSA; visit every text letter once
    only, by carefully shifting more than one step

text:    ABRACADABRA
pattern: CAB
10
Full-text scanning
text:    ABRACADABRA
pattern: CAB

(figure: the pattern CAB is slid across the text, one
 position at a time)
11
Full-text scanning
  • for single term
  • (naive O(NM))
  • Knuth Morris and Pratt (77)
  • Boyer and Moore (77)
  • preprocess the pattern; compare from right to left,
    allowing skips!

text:    ABRACADABRA
pattern: CAB
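The slides give no code; a minimal Python sketch of the right-to-left
comparison with skipping (the simplified Boyer-Moore-Horspool variant,
not the full Boyer-Moore algorithm) might look like this:

def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.
    Compares right-to-left and skips ahead using a shift table."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # shift for the text character aligned with the pattern's last position
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                      # compare right to left
        if j < 0:
            return pos                  # full match found
        pos += shift.get(text[pos + m - 1], m)
    return -1

print(horspool_search("ABRACADABRA", "CAB"))   # -1 (CAB does not occur)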
12
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

13
Text Inverted Files
14
Text Inverted Files
Q: space overhead?
A: mainly, the postings lists
15
Text Inverted Files
  • how to organize dictionary?
  • stemming Y/N?
  • Keep only the root of each word; e.g., inverted,
    inversion -> invert
  • insertions?

16
Text Inverted Files
  • how to organize dictionary?
  • B-tree, hashing, TRIEs, PATRICIA trees, ...
  • stemming Y/N?
  • insertions?
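As a rough illustration (not from the slides), a hash-table dictionary with
postings lists can be sketched in a few lines of Python; a B-tree or trie
would replace the plain dict when prefix or range lookups are needed:

from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():    # (no stemming / stop list here)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {"d1": "data retrieval systems", "d2": "inverted files for text retrieval"}
index = build_inverted_index(docs)
print(index["retrieval"])    # ['d1', 'd2']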

17
Text Inverted Files
  • postings lists follow a Zipf distribution; e.g., the
    rank-frequency plot of the Bible

(figure: log(freq) vs. log(rank); approximately
 freq = 1 / (rank * ln(1.78 V)), where V is the vocabulary size)
18
Text Inverted Files
  • postings lists
  • [Cutting+Pedersen]
  • (keep the first 4 postings in the B-tree leaves)
  • how to allocate space: [Faloutsos, 92]
  • geometric progression
  • compression (Elias codes) [Zobel]: down to 2%
    overhead!
  • Conclusions: needs space overhead (2%-300%), but
    it is the fastest
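The slide only names Elias codes; as a toy sketch (illustrative, not the
exact scheme evaluated by Zobel), a postings list can be gap-encoded and
each gap written in Elias gamma code:

def elias_gamma(n: int) -> str:
    """Elias gamma code: (len-1) zeros followed by n in binary (n >= 1)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def compress_postings(doc_ids):
    """Encode the gaps between consecutive (sorted) document ids."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(elias_gamma(g) for g in gaps)

print(compress_postings([3, 5, 20, 21, 23]))
# gaps 3, 2, 15, 1, 2  ->  011 010 0001111 1 010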

19
Vector Space Model and Clustering
  • Keyword (free-text) queries (vs Boolean)
  • each document -> vector (HOW?)
  • each query -> vector
  • search for similar vectors

20
Vector Space Model and Clustering
  • main idea: each document is a vector of size d, where d
    is the number of different terms in the database

(figure: a document mapped to a d-dimensional vector, with one
 coordinate per vocabulary term: aaron, ..., data, ...,
 indexing, ..., zoo; d = vocabulary size)
21
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • A vector is like an array of floating points
  • Has direction and magnitude
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse

22
Document Vectors: One location for each word.

(term columns: nova, galaxy, heat, h'wood, film, role, diet, fur)
  A: 10 5 3
  B: 5 10
  C: 10 8 7
  D: 9 10 5
  E: 10 10
  F: 9 10
  G: 5 7 9
  H: 6 10 2 8
  I: 7 5 1 3

"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A,
"Heat" occurs 3 times in text A. (Blank means 0 occurrences.)
23
Document Vectors: One location for each word.

(same term-count table as the previous slide)

"Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I,
"Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.
24
Document Vectors

(same term-count table; the row labels A through I are the document ids)
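A minimal Python sketch of building such raw term-count vectors (the toy
texts below are made up, not the ones behind the table above):

from collections import Counter

def term_count_vectors(docs):
    """One coordinate per vocabulary term, one (sparse) vector per document."""
    vocab = sorted({t for text in docs.values() for t in text.lower().split()})
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    return vocab, {d: [counts[d][t] for t in vocab] for d in docs}

docs = {"A": "nova nova galaxy heat", "B": "galaxy galaxy nova"}
vocab, vectors = term_count_vectors(docs)
print(vocab)          # ['galaxy', 'heat', 'nova']
print(vectors["A"])   # [1, 1, 2]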
25
We Can Plot the Vectors
(figure: 2-D plot with axes "star" and "diet"; a document about
 movie stars, one about astronomy, and one about mammal behavior
 are plotted as vectors)
26
Vector Space Model and Clustering
  • Then, group nearby vectors together
  • Q1: cluster search?
  • Q2: cluster generation?
  • Two significant contributions
  • ranked output
  • relevance feedback

27
Vector Space Model and Clustering
  • cluster search: visit the (k) closest
    superclusters; continue recursively

(figure: clusters of CS technical reports and MD technical reports)
28
Vector Space Model and Clustering
  • ranked output: easy!

(figure: clusters of CS and MD technical reports)
29
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, 73]

(figure: clusters of CS and MD technical reports)
30
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, 73]
  • How?

(figure: clusters of CS and MD technical reports)
31
Vector Space Model and Clustering
  • How? A: by adding the "good" vectors and
    subtracting the "bad" ones

(figure: clusters of CS and MD technical reports)
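A minimal sketch of this update (the alpha/beta/gamma weights below are
illustrative defaults, not values from the slides):

import numpy as np

def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the relevant ('good') documents
    and away from the centroid of the non-relevant ('bad') ones."""
    query = np.asarray(query, dtype=float)
    good = np.mean(relevant, axis=0) if len(relevant) else 0.0
    bad = np.mean(nonrelevant, axis=0) if len(nonrelevant) else 0.0
    return alpha * query + beta * good - gamma * bad

print(rocchio_update([0.4, 0.8], relevant=[[0.8, 0.3]], nonrelevant=[[0.2, 0.7]]))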
32
Cluster generation
  • Problem
  • given N points in V dimensions,
  • group them

33
Cluster generation
  • Problem
  • given N points in V dimensions,
  • group them (typically k-means or AGNES is used)

34
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

35
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

36
Raw Term Weights
  • The frequency of occurrence for the term in each
    document is included in the vector

37
Assigning Weights
  • tf x idf measure
  • term frequency (tf)
  • inverse document frequency (idf) -- a way to deal
    with the problems of the Zipf distribution
  • Goal: assign a tf x idf weight to each term in
    each document

38
tf x idf
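The formula image on this slide is not preserved in the transcript; a
standard formulation (one of several variants) is

    w_ik = tf_ik * log(N / n_k)

where tf_ik is the frequency of term k in document i, N is the number of
documents in the collection, and n_k is the number of documents that
contain term k.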
39
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

For a collection of 10000 documents
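The table of example values is not preserved in the transcript; assuming
the common definition idf_k = log10(N / n_k), a collection of N = 10,000
documents gives, for instance:

    term in 1 document:        idf = log10(10000 / 1)     = 4
    term in 100 documents:     idf = log10(10000 / 100)   = 2
    term in 10,000 documents:  idf = log10(10000 / 10000) = 0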
40
Similarity Measures for document vectors
Simple matching (coordination level match)
Dice's coefficient
Jaccard's coefficient
Cosine coefficient
Overlap coefficient
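The formula images are not preserved in the transcript; for two documents
represented as term sets X and Y, the usual definitions are:

    Simple matching:  |X ∩ Y|
    Dice:             2 |X ∩ Y| / (|X| + |Y|)
    Jaccard:          |X ∩ Y| / |X ∪ Y|
    Cosine:           |X ∩ Y| / sqrt(|X| * |Y|)
    Overlap:          |X ∩ Y| / min(|X|, |Y|)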
41
tf x idf normalization
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
  • normalize usually means force all values to fall
    within a certain range, usually between 0 and 1,
    inclusive.

42
Vector space similarity (use the weights to
compare the documents)
43
Computing Similarity Scores
(figure: example vectors plotted on 2-D axes scaled from 0 to 1.0)
44
Vector Space with Term Weights and Cosine Matching
D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit)
Q   = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit)

(figure: 2-D plot over Term A and Term B, with
 Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7))
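Worked out with the cosine measure on the plotted vectors:

    cos(Q, D1) = (0.4*0.8 + 0.8*0.3) / (sqrt(0.4^2 + 0.8^2) * sqrt(0.8^2 + 0.3^2))
               = 0.56 / (0.894 * 0.854)  ≈  0.73
    cos(Q, D2) = (0.4*0.2 + 0.8*0.7) / (sqrt(0.4^2 + 0.8^2) * sqrt(0.2^2 + 0.7^2))
               = 0.64 / (0.894 * 0.728)  ≈  0.98

so D2 is ranked above D1 for the query Q.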
45
Text - Detailed outline
  • Text databases
  • problem
  • full text scanning
  • inversion
  • signature files (a.k.a. Bloom Filters)
  • Vector model and clustering
  • information filtering and LSI

46
Information Filtering LSI
  • [Foltz, 92] Goal:
  • users specify interests (keywords)
  • the system alerts them on suitable news documents
  • Major contribution: LSI = Latent Semantic
    Indexing
  • latent (hidden) concepts

47
Information Filtering LSI
  • Main idea
  • map each document into some concepts
  • map each term into some concepts
  • Concept: a set of terms, with weights, e.g.
  • "data" (0.8), "system" (0.5), "retrieval" (0.6)
    -> DBMS_concept

48
Information Filtering LSI
  • Pictorially: term-document matrix (BEFORE)

49
Information Filtering LSI
  • Pictorially: concept-document matrix and...

50
Information Filtering LSI
  • ... and concept-term matrix

51
Information Filtering LSI
  • Q: How to search, e.g., for "system"?

52
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

53
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

54
Information Filtering LSI
  • Thus it works like an (automatically constructed)
    thesaurus
  • we may retrieve documents that DON'T have the
    term "system", but contain almost everything
    else ("data", "retrieval")

55
SVD
  • LSI: find concepts

56
SVD - Definition
  • A[n x m] = U[n x r] L[r x r] (V[m x r])^T
  • A: n x m matrix (e.g., n documents, m terms)
  • U: n x r matrix (n documents, r concepts)
  • L: r x r diagonal matrix (strength of each
    concept) (r = rank of the matrix)
  • V: m x r matrix (m terms, r concepts)

57
SVD - Example
  • A = U L V^T - example

(figure: a term-document matrix over the terms data, inf., retrieval,
 brain, lung, with CS and MD documents as rows, written as the
 product of three matrices)
58
SVD - Example
  • A = U L V^T - example

(same figure; the two columns of U are labeled CS-concept and MD-concept)
59
SVD - Example
  • A = U L V^T - example

(same figure; U is annotated as the doc-to-concept similarity matrix)
60
SVD - Example
  • A = U L V^T - example

(same figure; the first diagonal entry of L is annotated as the
 strength of the CS-concept)
61
SVD - Example
  • A = U L V^T - example

(same figure; V^T is annotated as the term-to-concept similarity matrix)
62
SVD - Example
  • A = U L V^T - example

(same figure as the previous slide)
63
SVD for LSI
  • documents, terms and concepts
  • U: document-to-concept similarity matrix
  • V: term-to-concept similarity matrix
  • L: its diagonal elements give the strength of each
    concept

64
SVD for LSI
  • Need to keep all the eigenvectors?
  • NO, just keep the first k (concepts)
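A minimal numpy sketch of keeping only the first k concepts (the matrix
values below are made up, just in the spirit of the CS / MD example):

import numpy as np

def lsi(A, k):
    """Truncated SVD: keep only the k strongest concepts.
    A is the (documents x terms) matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]     # doc-concept, strengths, concept-term

A = np.array([[2., 2., 1., 0., 0.],   # CS documents: data, inf., retrieval
              [3., 3., 2., 0., 0.],
              [0., 0., 0., 2., 3.],   # MD documents: brain, lung
              [0., 0., 0., 1., 2.]])
Uk, sk, Vtk = lsi(A, k=2)
print(np.round(Uk, 2))                # each document in concept space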

65
Web Search
  • What about web search?
  • First you need to get all the documents of the
    web. Crawlers.
  • Then you have to index them (inverted files, etc)
  • Find the web pages that are relevant to the query
  • Report the pages with their links in a sorted
    order
  • Main difference with IR: web pages have links
  • may be possible to exploit the link structure for
    sorting the relevant documents

66
Kleinberg's Algorithm (HITS)
  • Main idea: in many cases, when you search the web
    using some terms, the most relevant pages may not
    contain these terms (or contain them only a few times)
  • "Harvard": www.harvard.edu
  • "Search Engines": yahoo, google, altavista
  • Authorities and hubs

67
Kleinberg's algorithm
  • Problem definition: given the web and a query,
  • find the most authoritative web pages for this
    query
  • Step 0: find all pages containing the query terms
    (root set)
  • Step 1: expand by one move forward and backward
    (base set)

68
Kleinberg's algorithm
  • Step 1: expand by one move forward and backward

69
Kleinberg's algorithm
  • on the resulting graph, give a high score (= authorities)
    to nodes that many important nodes point to
  • give a high importance score (= hubs) to nodes that
    point to good authorities

(figure: hubs on the left point to authorities on the right)
70
Kleinberg's algorithm
  • observations
  • recursive definition!
  • each node (say, the i-th node) has both an
    authoritativeness score a_i and a hubness score h_i

71
Kleinberg's algorithm
  • Let E be the set of edges and A be the adjacency
    matrix
  • the (i, j) entry is 1 if the edge from i to j exists
  • Let h and a be n x 1 vectors with the
    hubness and authoritativeness scores.
  • Then

72
Kleinberg's algorithm
  • Then:
  • a_i = h_k + h_l + h_m
  • that is:
  • a_i = Sum (h_j) over all j such that the edge (j, i)
    exists
  • or
  • a = A^T h

(figure: nodes k, l, m point to node i)
73
Kleinberg's algorithm
  • symmetrically, for the hubness:
  • h_i = a_n + a_p + a_q
  • that is:
  • h_i = Sum (a_j) over all j such that the edge (i, j)
    exists
  • or
  • h = A a

(figure: node i points to nodes n, p, q)
74
Kleinberg's algorithm
  • In conclusion, we want vectors h and a such that
  • h = A a
  • a = A^T h

Start from a and h set to all 1s. Then apply the following trick:
h = A a = A (A^T h) = (A A^T) h = ... = (A A^T)^2 h = ... = (A A^T)^k h
a = (A^T A)^k a
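A rough Python sketch of this iteration (not the slides' code; the toy
graph is made up):

import numpy as np

def hits(A, iters=50):
    """Iterate a = A^T h, h = A a (normalizing each step); the scores converge
    to the strongest eigenvectors of A^T A and A A^T."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return a, h

# toy graph: pages 0 and 1 (hubs) point to pages 2 and 3 (authorities)
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
auth, hub = hits(A)
print(np.round(auth, 2), np.round(hub, 2))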
75
Kleinberg's algorithm
  • In short, the solutions to
  • h = A a
  • a = A^T h
  • are the left- and right- eigenvectors of the
    adjacency matrix A.
  • Starting from random a and iterating, we'll
    eventually converge
  • (Q: to which of all the eigenvectors? why?)

76
Kleinberg's algorithm
  • (Q: to which of all the eigenvectors? why?)
  • A: to the ones with the strongest eigenvalue,
    because of the property
  • (A^T A)^k v ≈ (constant) v_1

So, we can find the a and h vectors, and the pages
with the highest a values are reported!
77
Kleinberg's algorithm - results
  • E.g., for the query "java":
  • 0.328 www.gamelan.com
  • 0.251 java.sun.com
  • 0.190 www.digitalfocus.com (the java developer)

78
Kleinberg's algorithm - discussion
  • the authority score can be used to find pages
    similar to a page p
  • closely related to citation analysis and to social
    networks / small-world phenomena

79
google/page-rank algorithm
  • closely related: the Web is a directed graph of
    connected nodes
  • imagine a particle randomly moving along the
    edges (*)
  • compute its steady-state probabilities; that
    gives the PageRank of each page (the importance
    of this page)
  • (*) with occasional random jumps

80
PageRank Definition
  • Assume a page A and pages T1, T2, ..., Tm that
    point to A. Let d be a damping factor, PR(A) the
    PageRank of A, and C(A) the out-degree of A. Then:
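The formula image is not in the transcript; as given in the Brin and Page
paper cited in the references:

    PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tm)/C(Tm) )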

81
google/page-rank algorithm
  • Compute the PR of each page: identical problem to
    "given a Markov Chain, compute the steady-state
    probabilities p1 ... p5"

(figure: a 5-node example graph, nodes 1 through 5)
82
Computing PageRank
  • Iterative procedure
  • Also: navigate the web by randomly following links, or,
    with prob. d, jump to a random page. Let A be the
    adjacency matrix (n x n) and ci the out-degree of page i
  • Prob(Ai -> Aj) = d * (1/n) + (1 - d) * (1/ci) * Aij
  • A[i, j] = Prob(Ai -> Aj)
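A rough power-iteration sketch (not the slides' code), with d used as the
random-jump probability as in the formula above:

import numpy as np

def pagerank(A, d=0.15, iters=100):
    """Steady-state probabilities of the 'random surfer' chain:
    with prob. d jump to a random page, otherwise follow a random out-link."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    out[out == 0] = 1                    # guard against pages with no out-links
    M = d / n + (1 - d) * A / out        # M[i, j] = Prob(Ai -> Aj)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = p @ M                        # one step of the Markov chain
    return p / p.sum()

# tiny 3-page example: page 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)
print(np.round(pagerank(A), 3))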

83
google/page-rank algorithm
  • Let A be the transition matrix (= adjacency
    matrix, row-normalized: the sum of each row is 1)

(figure: the 5-node example graph and its row-normalized
 transition matrix)
84
google/page-rank algorithm
  • A p = p

(figure: the 5-node example; the transition matrix A times
 the steady-state vector p equals p)
85
google/page-rank algorithm
  • A p = p
  • thus, p is the eigenvector that corresponds to
    the highest eigenvalue (= 1, since the matrix is
    row-normalized)

86
Kleinberg/google - conclusions
  • SVD helps in graph analysis:
  • hub/authority scores: the strongest left- and right-
    eigenvectors of the adjacency matrix
  • random walk on a graph: the steady-state
    probabilities are given by the strongest
    eigenvector of the transition matrix

87
References
  • Brin, S. and L. Page (1998). The Anatomy of a
    Large-Scale Hypertextual Web Search Engine. 7th
    Intl. World Wide Web Conf.