Web search engines

1
Web search engines
  • Rooted in Information Retrieval (IR) systems
  • Prepare a keyword index for a corpus
  • Respond to keyword queries with a ranked list of
    documents.
  • ARCHIE
  • Earliest application of rudimentary IR systems to
    the Internet
  • Title search across sites serving files over FTP

2
Boolean queries: Examples
  • Simple queries involving relationships between
    terms and documents
  • Documents containing the word "Java"
  • Documents containing the word "Java" but not the
    word "coffee"
  • Proximity queries
  • Documents containing the phrase "Java beans" or
    the term "API"
  • Documents where "Java" and "island" occur in the
    same sentence

3
Document preprocessing
  • Tokenization
  • Filtering away tags
  • Tokens regarded as nonempty sequences of
    characters excluding spaces and punctuation
  • Each token represented by a suitable integer, tid,
    typically 32 bits
  • Optional stemming/conflation of words
  • Result: document (did) transformed into a
    sequence of (tid, pos) integer pairs (see the
    sketch below)
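
A minimal sketch of this preprocessing step. The tag-stripping regex, the tokenizer, and the on-the-fly term-id assignment are illustrative assumptions, not the exact pipeline from the slides:

```python
import re

def preprocess(did, text, lexicon):
    """Tokenize document did into (tid, pos) pairs, assigning new tids on the fly."""
    text = re.sub(r"<[^>]+>", " ", text)             # filter away tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # nonempty runs, no spaces/punctuation
    out = []
    for pos, tok in enumerate(tokens):
        tid = lexicon.setdefault(tok, len(lexicon))  # map token -> integer tid
        out.append((tid, pos))
    return out

lexicon = {}
print(preprocess(1, "<p>Java beans and the Java API</p>", lexicon))
# [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4), (4, 5)]
```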

4
Storing tokens
  • Straightforward implementation using a
    relational database
  • Example figure
  • Space blows up to almost 10 times the size of
    the text
  • Accesses to the table show a common pattern
  • Reduce the storage by mapping tids to a
    lexicographically sorted buffer of (did, pos)
    tuples
  • Indexing amounts to transposing the
    document-term matrix (a sketch follows)
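
A minimal sketch of this transposition, assuming the per-document (tid, pos) lists produced by the preprocessing step above; the in-memory dict stands in for the sorted on-disk buffer:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Transpose (did -> [(tid, pos), ...]) into (tid -> sorted [(did, pos), ...])."""
    index = defaultdict(list)
    for did, postings in docs.items():
        for tid, pos in postings:
            index[tid].append((did, pos))
    for tid in index:
        index[tid].sort()  # lexicographic (did, pos) order, as in the sorted buffer
    return index

docs = {1: [(0, 0), (1, 1)], 2: [(0, 0), (2, 1)]}
print(dict(build_inverted_index(docs)))
# {0: [(1, 0), (2, 0)], 1: [(1, 1)], 2: [(2, 1)]}
```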

5

Two variants of the inverted index data
structure, usually stored on disk. The
simpler version in the middle does not store term
offset information; the version to the right
stores term offsets. The mapping from terms to
documents and positions (written as
document/position) may be implemented using a
B-tree or a hash table.
6
Storage
  • For dynamic corpora
  • Storage manager such as Berkeley DB
  • Can frequently add, modify and delete documents
  • For static collections
  • Index compression techniques (to be discussed)

7
Stopwords
  • Function words and connectives
  • Appear in a large number of documents and are of
    little use in pinpointing documents
  • Indexing stopwords
  • Stopwords not indexed
  • Reduces index space and improves performance
  • Replace stopwords with a placeholder (to remember
    the offset)
  • Issues
  • Queries containing only stopwords are ruled out
  • Polysemous words that are stopwords in one sense
    but not in others
  • E.g. "can" as a verb vs. "can" as a noun

8
Stemming
  • Conflating words to help match a query term with
    a morphological variant in the corpus
  • Remove inflections that convey parts of speech,
    tense and number
  • E.g. "university" and "universal" both stem to
    "universe"
  • Techniques
  • morphological analysis (e.g., Porter's algorithm)
  • dictionary lookup (e.g., WordNet)
  • Stemming may increase recall, but at the price of
    precision
  • Abbreviations, polysemy and names coined in the
    technical and commercial sectors
  • E.g. stemming "ides" to "IDE", "SOCKS" to
    "sock", "gated" to "gate" may be bad!
    (see the sketch below)
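
A quick illustration of Porter-style conflation using NLTK's implementation (assuming nltk is installed); it shows both the useful conflation and the over-stemming the slide warns about:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Useful conflation: morphological variants map to one token.
print(stemmer.stem("university"), stemmer.stem("universal"))  # univers univers
# Over-stemming hurts precision: distinct meanings collapse.
print(stemmer.stem("gated"), stemmer.stem("gate"))            # gate gate
```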

9
Batch indexing and updates
  • Incremental indexing
  • Time-consuming due to random disk IO
  • High level of disk block fragmentation
  • Simple sort-merges
  • Replace the indexed update of variable-length
    postings
  • For a dynamic collection
  • A single document-level change may need to update
    hundreds to thousands of records
  • Solution: create an additional "stop-press"
    index

10
Maintaining indices over dynamic collections.
11
Stop-press index
  • Collection of documents in flux
  • Model document modification as deletion followed
    by insertion
  • Documents in flux represented by a signed record
    (d, t, s)
  • s specifies whether d has been deleted or
    inserted
  • Getting the final answer to a query
  • Main index returns a document set D0
  • Stop-press index returns two document sets
  • D+: documents not yet indexed in D0 matching the
    query
  • D-: documents matching the query removed from
    the collection since D0 was constructed
  • Final answer: (D0 ∪ D+) − D− (see the sketch
    below)
  • Stop-press index getting too large
  • Rebuild the main index
  • signed (d, t, s) records are sorted in (t, d, s)
    order and merge-purged into the master (t, d)
    records
  • Stop-press index can be emptied out
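
A minimal sketch of the query-time combination, working directly on the three document sets (the sets here are made up for illustration):

```python
def answer(d0, d_plus, d_minus):
    """Combine main-index hits d0 with stop-press insertions/deletions."""
    return (d0 | d_plus) - d_minus

# Main index knows docs 1-3; doc 4 was added and doc 2 deleted since the rebuild.
print(answer({1, 2, 3}, {4}, {2}))  # {1, 3, 4}
```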

12
Relevance ranking
  • Keyword queries
  • In natural language
  • Not precise, unlike SQL
  • A Boolean yes/no decision for each response is
    unacceptable
  • Solution
  • Rate each document for how likely it is to
    satisfy the user's information need
  • Sort in decreasing order of this score
  • Present results in a ranked list
  • No algorithmic way of ensuring that the ranking
    strategy always favors the information need
  • The query is only a part of the user's
    information need

13
Responding to queries
  • Set-valued response
  • Response set may be very large
  • (E.g., by recent estimates, over 12 million Web
    pages contain the word "java")
  • Demanding a more selective query from the user
  • Guessing the user's information need and ranking
    responses
  • Evaluating rankings

14
Evaluating procedure
  • Given benchmark
  • Corpus of n documents D
  • A set of queries Q
  • For each query q ∈ Q, an exhaustive set of
    relevant documents D_q ⊆ D identified manually
  • Query q submitted to the system
  • Ranked list of documents (d_1, d_2, ..., d_n)
    retrieved
  • Compute a 0/1 relevance list (r_1, r_2, ..., r_n)
  • r_i = 1 iff d_i ∈ D_q
  • r_i = 0 otherwise

15
Recall and precision
  • Recall at rank k ≥ 1
  • Fraction of all relevant documents included in
    (d_1, ..., d_k)
  • recall(k) = (1/|D_q|) Σ_{i=1..k} r_i
  • Precision at rank k
  • Fraction of the top k responses that are actually
    relevant
  • precision(k) = (1/k) Σ_{i=1..k} r_i
    (see the sketch below)
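
A minimal sketch computing both measures from a 0/1 relevance list, following the definitions above:

```python
def recall_at(r, k, num_relevant):
    """Fraction of all relevant documents found in the top k."""
    return sum(r[:k]) / num_relevant

def precision_at(r, k):
    """Fraction of the top k responses that are relevant."""
    return sum(r[:k]) / k

r = [1, 0, 1, 1, 0]        # relevance list for one ranked response
print(recall_at(r, 3, 4))  # 0.5  (2 of 4 relevant docs appear in the top 3)
print(precision_at(r, 3))  # 0.666...
```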

16
Other measures
  • Average precision
  • Sum of precision at each relevant hit position in
    the response list, divided by the total number of
    relevant documents
  • avg.precision = (1/|D_q|) Σ_{k: r_k = 1} precision(k)
  • avg.precision = 1 iff the engine retrieves all
    relevant documents and ranks them ahead of any
    irrelevant document
  • Interpolated precision
  • To combine precision values from multiple queries
  • Gives the precision-vs.-recall curve for the
    benchmark
  • For each query, take the maximum precision
    obtained for the query at any recall greater
    than or equal to ρ
  • Average them together for all queries
    (see the sketch below)
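
A sketch of both quantities for a single query, under the definitions above; the sample relevance list is illustrative:

```python
def average_precision(r, num_relevant):
    """Mean of precision(k) over ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, rel in enumerate(r, start=1):
        if rel:
            hits += 1
            total += hits / k          # precision at this relevant rank
    return total / num_relevant

def interpolated_precision(r, num_relevant, rho):
    """Maximum precision at any recall level >= rho."""
    best, hits = 0.0, 0
    for k, rel in enumerate(r, start=1):
        hits += rel
        if hits / num_relevant >= rho:
            best = max(best, hits / k)
    return best

r = [1, 0, 1, 1, 0]
print(average_precision(r, 4))            # (1/1 + 2/3 + 3/4) / 4 ≈ 0.604
print(interpolated_precision(r, 4, 0.5))  # 0.75 (precision 3/4 at recall 3/4)
```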

17
Precision-Recall tradeoff
  • Interpolated precision cannot increase with
    recall
  • Interpolated precision at recall level 0 may be
    less than 1
  • At rank k = 0
  • Precision (by convention) = 1, Recall = 0
  • Inspecting more documents
  • Can increase recall
  • Precision may decrease
  • We will start encountering more and more
    irrelevant documents
  • A search engine with a good ranking function will
    generally show a negative relation between recall
    and precision
  • The higher the curve, the better the engine

18
Precision and interpolated precision plotted
against recall for the given relevance vector.
Missing r_k's are zeroes.
19
The vector space model
  • Documents represented as vectors in a
    multi-dimensional Euclidean space
  • Each axis corresponds to a term (token)
  • Coordinate of document d in the direction of term
    t determined by two quantities
  • Term frequency TF(d, t)
  • Number of times term t occurs in document d,
    scaled in a variety of ways to normalize document
    length
  • Inverse document frequency IDF(t)
  • Scales down the coordinates of terms that occur
    in many documents

20
Term frequency
  • Simplest choice: the raw count n(d, t) of term t
    in document d, often normalized by document
    length, e.g. TF(d, t) = n(d, t) / Σ_τ n(d, τ)
  • Cornell SMART system uses a smoothed version
  • TF(d, t) = 0 if n(d, t) = 0,
    1 + log(1 + log(n(d, t))) otherwise

21
Inverse document frequency
  • Given
  • D is the document collection and D_t is the set
    of documents containing t
  • Formulae
  • Mostly dampened functions of |D| / |D_t|
  • SMART
  • IDF(t) = log(1 + |D| / |D_t|)

22
Vector space model
  • Coordinate of document d along axis t
  • d_t = TF(d, t) · IDF(t)
  • Document d is thus transformed to a vector in
    TFIDF-space
  • Query q
  • Interpreted as a document
  • Transformed to a vector in the same TFIDF-space
    as d (a sketch follows)
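
A minimal sketch of the TFIDF mapping together with the cosine measure from the next slide, using raw counts for TF and the SMART-style IDF above; the function names and the tiny corpus are illustrative:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """Map a token list to a sparse {term: TF*IDF} vector."""
    n_docs = len(corpus)
    vec = {}
    for t, n in Counter(doc_tokens).items():
        df = sum(1 for d in corpus if t in d)   # |D_t|
        if df == 0:
            continue                            # ignore terms absent from the corpus
        vec[t] = n * math.log(1 + n_docs / df)  # SMART-style IDF
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

corpus = [["java", "island"], ["java", "api"], ["coffee"]]
q = tfidf_vector(["java"], corpus)
d = tfidf_vector(corpus[1], corpus)
print(cosine(q, d))  # ≈ 0.55
```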

23
Measures of proximity
  • Distance measure
  • Magnitude of the vector difference, |d − q|
  • Document vectors must be normalized to unit
    length
  • Else shorter documents dominate (since queries
    are short)
  • Cosine similarity
  • Cosine of the angle between d and q:
    cos(d, q) = (d · q) / (|d| |q|)
  • Shorter documents are penalized

24
Relevance feedback
  • Users learn how to modify queries
  • Response list must have at least some relevant
    documents
  • Relevance feedback
  • 'Correcting' the ranks to the user's taste
  • Automates the query refinement process
  • Rocchio's method
  • Folding in user feedback
  • To the query vector q
  • Add a weighted sum of the vectors for the
    relevant documents D+
  • Subtract a weighted sum of the vectors for the
    irrelevant documents D-
  • q' = α q + β Σ_{d ∈ D+} d − γ Σ_{d ∈ D-} d
    (see the sketch below)
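
A minimal numpy sketch of Rocchio's update under the formula above; the weight values are illustrative defaults, not from the slides:

```python
import numpy as np

def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Fold feedback into the query: q' = alpha*q + beta*sum(D+) - gamma*sum(D-)."""
    q_new = alpha * q
    if relevant:
        q_new = q_new + beta * np.sum(relevant, axis=0)
    if irrelevant:
        q_new = q_new - gamma * np.sum(irrelevant, axis=0)
    return np.maximum(q_new, 0.0)  # negative coordinates are usually clipped

q = np.array([1.0, 0.0, 0.0])
d_plus = [np.array([0.5, 0.5, 0.0])]   # vectors of relevant documents
d_minus = [np.array([0.0, 0.0, 1.0])]  # vectors of irrelevant documents
print(rocchio(q, d_plus, d_minus))     # [1.375 0.375 0.   ]
```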

25
Relevance feedback (contd.)
  • Pseudo-relevance feedback
  • D+ and D- generated automatically
  • E.g. Cornell SMART system
  • Top 10 documents reported by the first round of
    query execution are included in D+
  • γ typically set to 0 (D- not used)
  • Not a commonly available feature
  • Web users want instant gratification
  • System complexity
  • Executing the second-round query is slower and
    more expensive for major search engines

26
Bayesian Inferencing
Bayesian inference network for relevance ranking.
A document is relevant to the extent that setting
its corresponding belief node to true lets us
assign a high degree of belief in the node
corresponding to the query.
Manual specification of mappings from terms to
approximate concepts.
27
Bayesian Inferencing (contd.)
  • Four layers
  • Document layer
  • Representation layer
  • Query concept layer
  • Query
  • Each node is associated with a random Boolean
    variable, reflecting belief
  • Directed arcs signify that the belief of a node
    is a function of the beliefs of its immediate
    parents (and so on)

28
Bayesian Inferencing systems
  • Layers 2 and 3 are the same for basic
    vector-space IR systems
  • Verity's Search97
  • Allows administrators and users to define
    hierarchies of concepts in files
  • Estimation of relevance of a document d w.r.t.
    the query q
  • Set the belief of the corresponding node to 1
  • Set all other document beliefs to 0
  • Compute the belief of the query
  • Rank documents in decreasing order of belief that
    they induce in the query

29
Other issues
  • Spamming
  • Adding popular query terms to a page unrelated to
    those terms
  • E.g. adding "Hawaii vacation rental" to a page
    about Internet gambling
  • Suffered a setback thanks to hyperlink-based
    ranking
  • Titles, headings, meta tags and anchor-text
  • The TFIDF framework treats all terms the same
  • Search engines assign extra weightage to text
    occurring in titles, headings and meta-tags
  • Using anchor-text on pages u which link to v
  • Anchor-text on u offers valuable editorial
    judgment about v as well

30
Other issues (contd..)
  • Including phrases to rank complex queries
  • Operators to specify word inclusions and
    exclusions
  • With operators and phrases, queries/documents can
    no longer be treated as ordinary points in vector
    space
  • Dictionary of phrases
  • Could be cataloged manually
  • Could be derived from the corpus itself using
    statistical techniques
  • Two separate indices
  • One for single terms and another for phrases

31
Corpus derived phrase dictionary
  • Two terms t1 and t2
  • Null hypothesis: occurrences of t1 and t2 are
    independent
  • To the extent the pair violates the null
    hypothesis, it is likely to be a phrase
  • Measuring violation with the likelihood ratio of
    the hypothesis
  • Pick phrases that violate the null hypothesis
    with large confidence
  • Contingency table built from co-occurrence counts
    of t1 and t2

32
Corpus-derived phrase dictionary (contd.)
  • Hypotheses
  • Null hypothesis: t2 occurs with the same
    probability whether or not t1 precedes it
  • Alternative hypothesis: t2 occurs with a
    different probability after t1 than after other
    tokens
  • Likelihood ratio
  • λ = max over the null hypothesis of the
    likelihood of the observed counts, divided by the
    max over all hypotheses; a small λ (large
    −2 log λ) flags a likely phrase (see the sketch
    below)
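
A sketch of the likelihood-ratio test for one candidate bigram, using Dunning-style binomial likelihoods; the counts in the example are made up for illustration:

```python
import math

def log_l(k, n, p):
    """Log-likelihood of k successes in n Bernoulli(p) trials."""
    eps = 1e-12  # avoid log(0)
    return k * math.log(p + eps) + (n - k) * math.log(1 - p + eps)

def likelihood_ratio(k11, k1_, k_1, n):
    """-2 log(lambda) for 't2 follows t1' vs. 't2 independent of t1'.
    k11: bigram (t1 t2) count; k1_: t1 count; k_1: t2 count; n: total tokens."""
    p = k_1 / n                    # null: P(t2 | anything) = p
    p1 = k11 / k1_                 # alt: P(t2 | after t1)
    p2 = (k_1 - k11) / (n - k1_)   # alt: P(t2 | not after t1)
    null = log_l(k11, k1_, p) + log_l(k_1 - k11, n - k1_, p)
    alt = log_l(k11, k1_, p1) + log_l(k_1 - k11, n - k1_, p2)
    return -2 * (null - alt)       # large value => likely phrase

# "strong tea": t1 occurs 200 times, t2 150 times, the bigram 50 times,
# in a corpus of 100,000 tokens.
print(likelihood_ratio(50, 200, 150, 100_000))
```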

33
Approximate string matching
  • Non-uniformity of word spellings
  • Dialects of English
  • Transliteration from other languages
  • Two ways to reduce this problem
  • Aggressive conflation mechanism to collapse
    variant spellings into the same token
  • Decompose terms into sequences of q-grams, i.e.
    sequences of q characters

34
Approximate string matching
  • Aggressive conflation mechanism to collapse
    variant spellings into the same token
  • E.g. Soundex takes phonetics and pronunciation
    details into account
  • Used with great success in indexing and searching
    last names in census and telephone directory
    data
  • Decompose terms into sequences of q-grams
  • Check for similarity in the space of q-grams
  • Looking up the inverted index becomes a two-stage
    affair
  • A smaller index of q-grams is consulted to expand
    each query term into a set of slightly distorted
    query terms
  • These terms are submitted to the regular index
  • Used by Google for spelling correction
  • Idea also adopted for eliminating near-duplicate
    pages (a q-gram sketch follows)
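
A minimal sketch of q-gram decomposition and of overlap between two spelling variants; the overlap measure shown (Jaccard over gram sets) is one common choice, not necessarily the one the slides assume:

```python
def qgrams(term, q=3):
    """Set of all q-character substrings of a term."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def gram_overlap(a, b, q=3):
    """Jaccard overlap between the q-gram sets of two terms."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

print(qgrams("colour"))                 # {'col', 'olo', 'lou', 'our'}
print(gram_overlap("colour", "color"))  # 0.4
```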

35
Meta-search systems
  • Take the search engine to the document
  • Forward queries to many geographically
    distributed repositories
  • Each has its own search service
  • Consolidate their responses
  • Advantages
  • Perform non-trivial query rewriting
  • Adapt a single user query to many search engines
    with different query syntaxes
  • Exploit the surprisingly small overlap between
    crawls
  • Consolidating responses
  • Function goes beyond just eliminating duplicates
  • Search services do not provide standard ranks
    which can be combined meaningfully

36
Similarity search
  • Cluster hypothesis
  • Documents similar to relevant documents are also
    likely to be relevant
  • Handling "find-similar" queries
  • Replication or duplication of pages
  • Mirroring of sites

37
Document similarity
  • Jaccard coefficient of similarity between
    documents d1 and d2
  • T(d): set of tokens in document d
  • r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
    (see the sketch below)
  • Symmetric, reflexive, not a metric
  • Forgives any number of occurrences and any
    permutations of the terms
  • 1 − r'(d1, d2) is a metric
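
A direct sketch of the coefficient over token sets; the sample sets are illustrative:

```python
def jaccard(t1, t2):
    """Jaccard coefficient between two token sets T(d1), T(d2)."""
    return len(t1 & t2) / len(t1 | t2)

t1 = {"java", "island", "coffee"}
t2 = {"java", "api", "coffee"}
print(jaccard(t1, t2))  # 0.5 (2 shared tokens out of 4 distinct tokens)
```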

38
Estimating the Jaccard coefficient with random
permutations
  1. Generate a set of m random permutations π of
     the token universe
  2. for each permutation π do
  3.   compute min π(T(d1)) and min π(T(d2))
  4.   check if min π(T(d1)) = min π(T(d2))
  5. end for
  6. if equality was observed in k cases, estimate
     r(d1, d2) = k / m (a runnable sketch follows)
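
A runnable sketch of this estimator; random permutations of the token universe are simulated by shuffling, and m and the seed are illustrative choices:

```python
import random

def estimate_jaccard(t1, t2, universe, m=200, seed=42):
    """Estimate r(d1, d2) as the fraction of permutations with equal minima."""
    rng = random.Random(seed)
    universe = list(universe)
    k = 0
    for _ in range(m):
        rng.shuffle(universe)  # one random permutation pi
        rank = {tok: i for i, tok in enumerate(universe)}
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            k += 1             # minima agree under pi
    return k / m

t1 = {"java", "island", "coffee"}
t2 = {"java", "api", "coffee"}
print(estimate_jaccard(t1, t2, t1 | t2))  # close to the true value 0.5
```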

39
Fast similarity search with random permutations
  1. for each random permutation π do
  2.   create a file f_π
  3.   for each document d do
  4.     write out (s = min π(T(d)), d) to f_π
  5.   end for
  6.   sort f_π using key s -- this results in
       contiguous blocks with fixed s containing all
       associated documents
  7.   create a file g_π
  8.   for each pair (d1, d2) within a run of f_π
       having a given s do
  9.     write out a document-pair record (d1, d2)
         to g_π
  10.  end for
  11.  sort g_π on key (d1, d2)
  12. end for
  13. merge g_π for all π in (d1, d2) order, counting
      the number of (d1, d2) entries (an in-memory
      sketch follows)
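
An in-memory sketch of the same idea, assuming everything fits in RAM: the files f_π and g_π become a dict of buckets and a pair counter; the corpus is illustrative:

```python
import random
from collections import defaultdict
from itertools import combinations

def similar_pairs(docs, universe, m=100, seed=0):
    """For each document pair, estimate the fraction of permutations with equal minima."""
    rng = random.Random(seed)
    universe = list(universe)
    pair_counts = defaultdict(int)
    for _ in range(m):                  # one pass per permutation pi
        rng.shuffle(universe)
        rank = {tok: i for i, tok in enumerate(universe)}
        buckets = defaultdict(list)     # plays the role of the sorted f_pi
        for d, tokens in docs.items():
            buckets[min(rank[t] for t in tokens)].append(d)
        for run in buckets.values():    # runs with a fixed s
            for d1, d2 in combinations(sorted(run), 2):
                pair_counts[(d1, d2)] += 1  # the merged records of g_pi
    return {pair: c / m for pair, c in pair_counts.items()}

docs = {"a": {"java", "coffee"}, "b": {"java", "api"}, "c": {"perl"}}
print(similar_pairs(docs, {"java", "coffee", "api", "perl"}))
# only ("a", "b") appears, with an estimate near its true Jaccard 1/3
```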

40
Eliminating near-duplicates via shingling
  • The find-similar algorithm reports all
    duplicate/near-duplicate pages
  • Eliminating duplicates
  • Maintain a checksum with every page in the corpus
  • Eliminating near-duplicates
  • Represent each document as a set T(d) of q-grams
    (shingles)
  • Find the Jaccard similarity between T(d1) and
    T(d2)
  • Eliminate the pair from step 9 if its similarity
    is above a threshold

41
Detecting locally similar sub-graphs of the Web
  • Similarity search and duplicate elimination on
    the graph structure of the web
  • To improve the quality of hyperlink-assisted
    ranking
  • Detecting mirrored sites
  • Approach 1: collapse equivalent URLs
  • Start the process with textual duplicate
    detection
  • Cleaned URLs are listed and sorted to find
    duplicates/near-duplicates
  • Each set of equivalent URLs is assigned a unique
    token ID
  • Each page is stripped of all text and represented
    as a sequence of outlink IDs
  • Continue using the link sequence representation
  • Until no further collapse of multiple URLs is
    possible
  • Approach 2: bottom-up approach
  • Identify single nodes which are near-duplicates
    (using text-shingling)
  • Extend single-node mirrors to two-node mirrors
  • Continue on to larger and larger subgraphs which
    are likely mirrors of one another

42
Detecting mirrored sites (contd.)
  • Approach 3: applied as a step before fetching
    all pages
  • Uses regularity in URL strings to identify
    host-pairs which are mirrors
  • Preprocessing
  • Hosts are represented as sets of positional
    bigrams (see the sketch below)
  • Convert host and path to all lowercase characters
  • Let any punctuation or digit sequence be a token
    separator
  • Tokenize the URL into a sequence of tokens
    (e.g., www6.infoseek.com gives www, infoseek,
    com)
  • Eliminate stop terms such as htm, html, txt,
    main, index, home, bin, cgi
  • Form positional bigrams from the token sequence
  • Two hosts are said to be mirrors if
  • A large fraction of paths are valid on both web
    sites
  • These common paths link to pages that are
    near-duplicates
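
A minimal sketch of the host preprocessing just described; the exact bigram encoding (token pair plus position) is an assumption based on the steps above, and the overlap measure is illustrative:

```python
import re

STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}

def positional_bigrams(host):
    """Represent a hostname as a set of positional token bigrams."""
    tokens = [t for t in re.split(r"[^a-z]+", host.lower())  # punctuation/digits separate
              if t and t not in STOP_TERMS]
    return {(tokens[i], tokens[i + 1], i) for i in range(len(tokens) - 1)}

a = positional_bigrams("www6.infoseek.com")  # tokens: www, infoseek, com
b = positional_bigrams("www.infoseek.com")
print(a)                                     # {('www', 'infoseek', 0), ('infoseek', 'com', 1)}
print(len(a & b) / len(a | b))               # identical bigram sets => overlap 1.0
```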