Search - on the Web and Locally - PowerPoint PPT Presentation

Slides: 17
Provided by: lilliann
Category:
Tags: locally | search | web

1
Search - on the Web and Locally
Related directly to "Web Search Engines: Part 1"
and "Part 2", IEEE Computer, June and August 2006
2
First -- projects
  • The class web page suggests these types of
    projects
  • A detailed literature review on one of the topics
    of this course. This involves discovering,
    reading, summarizing and comparing published
    material about either search technology or
    personal information management. Conference
    papers are an appropriate source of materials.
    Materials found on the web are fine, as long as
    you do a suitable evaluation of the credibility
    of the resource.
  • A comparative review of a number of tools for one
    type of information management. For example, you
    might compare several photo management tools,
    describing each and listing the features that set
    each apart from the others and then summarizing
    their strengths and weaknesses. Your report
    would conclude with your evaluation of the state
    of the art of this type of information management
    based on your review of these materials.
  • A significant contribution to an open source
    project related to our topics. Do you have a way
    to improve Lucene? Can you find a tool for
    managing e-mail that you can improve? You must
    prepare your project for evaluation by the class
    and for submission to the open source project
    organization.
  • A totally new tool that you have created. Have
    you had an idea for a useful tool and never got
    around to doing anything about it? Maybe this
    will be the beginning of an important product.

3
First - the search
  • Describe your experience in finding the required
    reading
  • What steps did you take?
  • Were there any problems?
  • Was anything about the search difficult?
  • Was anything different from what you expected?

4
Initial discussion
  • What surprised you in these articles?
  • What did you recognize from previous courses but
    did not expect to see in discussion of Web
    Search?
  • What works differently from the image you had?
  • What would you like to have learned that was not
    included?
  • What are the biggest areas of challenge to the
    Web Search enterprise?
  • Are there things that cannot be solved?
  • Are there issues of scale that are just
    impossible?
  • Are there limitations that just cannot be
    overcome?
  • Are there problems to solve that require more
    work but are within the range of manageable
    improvements?

5
The Web Search
  • Three Distinct Phases
  • Crawling
  • Indexing
  • Searching
  • Each has specific challenges to address

6
(No Transcript)
7
Crawlers
  • Basic process
  • Open an HTML page that has at least one anchor
    tag
  • (<a href=".."> link description </a>)
  • Send HTTP request to the site and receive the
    page.
  • Parse the page, looking for other anchor tags
  • Place anchors on a queue for further processing
  • Submit the actual page for indexing and storing
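
The basic process above can be sketched in a few lines of Python. This is only an illustration, not the crawler described in the articles: the `fetch` function, the `fake_web` dictionary and the `example.test` URLs are assumptions made so the sketch runs without a network. A real crawler would send an HTTP request where `fetch` is called.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects the href value of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Crawl from seed_url; fetch(url) returns HTML text, or None on failure."""
    queue = deque([seed_url])   # anchors waiting for further processing
    seen = {seed_url}           # avoid queueing the same URL twice
    pages = {}                  # stands in for "submit for indexing and storing"
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)       # a real crawler sends an HTTP request here
        if html is None:
            continue
        parser = AnchorParser()
        parser.feed(html)       # parse the page, looking for other anchor tags
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
        pages[url] = html
    return pages


# Hypothetical three-page "web" held in memory, used in place of real HTTP.
fake_web = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": "no links here",
}
pages = crawl("http://example.test/", fake_web.get)
```

The `seen` set is what keeps the link from page A back to the home page from causing an endless loop.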

8
Indexers
  • Scanning
  • For each indexable term the indexer writes a
    posting consisting of a document number and a
    term number to a temporary file.
  • Parse this sentence: What is an indexable term?
    A posting? A document number? A term number? What
    does a posting look like?
  • Invert the file
  • Sort by term, secondarily by document number
  • Record start location and list length for each
    term
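
One way to make the scanning and inverting steps concrete is the sketch below. It assumes that an indexable term is any run of lowercase letters or digits, that the document number is the document's position in a list, and that a repeated term in one document yields a single posting. These are simplifications chosen for illustration, not definitions from the articles.

```python
import re


def build_index(documents):
    """documents: list of strings; the list position is the document number."""
    term_numbers = {}   # term -> term number
    postings = []       # the "temporary file" of (term number, document number)

    # Scanning: write one posting per indexable term.
    for doc_number, text in enumerate(documents):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            term_number = term_numbers.setdefault(term, len(term_numbers))
            postings.append((term_number, doc_number))

    # Inverting: sort by term, secondarily by document number.
    postings = sorted(set(postings))

    # Record start location and list length for each term.
    doc_list = [doc for _, doc in postings]
    by_number = {number: term for term, number in term_numbers.items()}
    dictionary = {}
    for position, (term_number, _) in enumerate(postings):
        term = by_number[term_number]
        start, length = dictionary.get(term, (position, 0))
        dictionary[term] = (start, length + 1)
    return dictionary, doc_list


dictionary, doc_list = build_index(["web search", "local search", "web crawler"])
start, length = dictionary["search"]
search_docs = doc_list[start:start + length]   # documents containing "search"
```

Here a posting is just a pair of integers, and the term dictionary stores only where each term's slice of the sorted posting file begins and how long it is.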

9
Searching (Query Processing)
  • Look up query term in term dictionary
  • Get the postings list
  • Find documents that match all search terms
  • Find documents for each term and merge lists
    where common documents occur
  • Rank documents and report
  • As many as required or until end of the list
  • Still possible to find a result on one search and
    not find that same item on a subsequent search of
    the same terms
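
The lookup and merge steps can be sketched as follows. The `index` dictionary here is a hypothetical stand-in for the term dictionary and postings lists, with each list already sorted by document number. Ranking is left out, so the function reports matches in document-number order.

```python
def intersect(list_a, list_b):
    """Merge two sorted postings lists, keeping documents common to both."""
    i = j = 0
    common = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            common.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return common


def search(index, query_terms):
    """Return the documents that contain every query term."""
    lists = []
    for term in query_terms:
        if term not in index:      # a term with no postings matches nothing
            return []
        lists.append(index[term])
    if not lists:
        return []
    lists.sort(key=len)            # start from the shortest list
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical index: term -> sorted list of document numbers.
index = {
    "web": [0, 2, 5],
    "search": [0, 1, 2, 7],
    "engine": [2, 5, 7],
}
```

Starting from the shortest list keeps the intermediate result small, which is a common design choice rather than something the slides prescribe.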

10
Expanding from the basics
  • Each of the phases of web searching is simple in
    concept, but complicated by the sheer magnitude
    of the task.
  • The same ideas applied on a smaller scale -- in a
    company intranet, for example -- can be done
    efficiently.
  • The Web presents special challenges.

11
Crawling
  • A single machine running a simple crawling
    algorithm would not do well in finding all Web
    pages.
  • Large data centers
  • Redundancy and fault tolerance
  • Parallel operation
  • (SIGCSE talk by Marissa Mayer of Google)

12
Crawling reality
  • Speed - amazing numbers
  • At 1 sec per HTTP request, max 86,400 per day:
    634 years for 20 billion pages
  • Politeness -
  • Overwhelming web servers
  • Excluded content
  • robots.txt
  • Duplicate content
  • Identifying duplicates can be tricky - why?
  • Continuous crawling
  • Keeping current
  • Note comment about current time - how would you
    fix that?
  • Priority queue for crawling schedule - why?
  • Spam
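
The speed figures can be checked with a little arithmetic. The sketch below assumes a single crawler that issues one request at a time and never pauses. The 86,400 pages per day figure corresponds to one request per second, since a day has 86,400 seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def crawl_time(total_pages, seconds_per_request):
    """Return (pages per day, years needed) for one sequential crawler."""
    pages_per_day = SECONDS_PER_DAY / seconds_per_request
    days = total_pages / pages_per_day
    return pages_per_day, days / 365


pages_per_day, years = crawl_time(20_000_000_000, 1.0)
# pages_per_day is 86,400 and years is roughly 634

faster_pages_per_day, faster_years = crawl_time(20_000_000_000, 0.5)
# halving the request time doubles the rate and halves the years
```

Even at half a second per request, the crawl would still take centuries, which is why parallel operation across many machines is needed.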

13
Indexing large collections
  • The Web is the ultimate large collection
  • Estimating 500 terms in each of 20 billion
    pages --> 10 trillion entries!
  • Divide and conquer, as the crawler did
  • Each indexer builds a partial file in memory
  • Stops when memory is full
  • Write to disk, clear memory, and start over
  • Merge the partial files to make the full index
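
The build, flush and merge cycle might look like the sketch below. It makes two simplifying assumptions: `memory_limit` counts postings rather than bytes, and each sorted run is kept in a Python list where a real indexer would write a partial file to disk.

```python
import heapq


def build_runs(postings, memory_limit):
    """Split a stream of postings into sorted runs of at most memory_limit."""
    runs = []
    buffer = []
    for posting in postings:
        buffer.append(posting)
        if len(buffer) >= memory_limit:   # "memory is full"
            runs.append(sorted(buffer))   # stands in for "write to disk"
            buffer = []                   # clear memory and start over
    if buffer:
        runs.append(sorted(buffer))
    return runs


def merge_runs(runs):
    """Merge the sorted partial files into one fully sorted index."""
    return list(heapq.merge(*runs))


# Hypothetical postings as (term, document number) pairs.
postings = [("web", 3), ("search", 1), ("web", 1), ("crawl", 2), ("search", 3)]
runs = build_runs(postings, memory_limit=2)
full_index = merge_runs(runs)
```

`heapq.merge` reads its inputs lazily, so the same pattern works when each run is an iterator over a file too large to hold in memory.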

14
Data structures for indexing
  • Trees, tries, hash tables
  • Various ways to organize the terms for easy
    lookup
  • Numbers of terms
  • Not just all words in all languages
  • Acronyms, proper names, etc.
  • Must deal with common phrases also
  • Separate index entries (postings) for common word
    combinations
  • Compression
  • Saves space, increases processing work
  • Anchor text -- fie on those who use "click
    here"!!
  • Link popularity score
  • Give a score to a page based on popularity, also
    on query-independent factors.
  • Think about the implications of this.
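
To illustrate the compression trade-off, here is a sketch of one common approach: store the gaps between successive document numbers, and write each gap in a variable number of bytes. The slides do not say which scheme the search engines use, so treat this as an example technique only. It assumes the postings list is sorted and contains no negative numbers.

```python
def encode(doc_numbers):
    """Compress a sorted postings list into bytes using gaps and 7-bit groups."""
    out = bytearray()
    previous = 0
    for number in doc_numbers:
        gap = number - previous
        previous = number
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)   # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)                       # high bit clear: last byte of gap
    return bytes(out)


def decode(data):
    """Rebuild the original document numbers from the compressed bytes."""
    numbers = []
    current = 0
    shift = 0
    previous = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += current               # undo the gap encoding
            numbers.append(previous)
            current = 0
            shift = 0
    return numbers


docs = [3, 7, 300, 100000]
packed = encode(docs)
```

The four document numbers fit in seven bytes here, but every lookup now has to run `decode`, which is the extra processing the slide refers to.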

15
Query Processing
  • Most queries are short, do not provide much
    context
  • Result quality -- use some of the techniques from
    information retrieval
  • Once a preliminary list of responses is obtained,
    treat that as the collection and use IR
    techniques to improve the quality of the
    response.
  • Some limitations. No way to judge how complete
    the initial list is.
  • Techniques are part of the trade secrets of the
    companies
  • Speeding
  • Skipping
  • Early termination
  • Document numbering
  • Caching
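
Three of the speed-ups can be shown together in a small sketch. It assumes document numbers were assigned in order of decreasing query-independent score, so that lower numbers are better results, and it uses a hypothetical toy `INDEX`. The actual techniques are trade secrets, as noted above, so this is only one plausible reading.

```python
from functools import lru_cache

# Hypothetical index: postings in increasing document number, where the
# numbering already reflects decreasing query-independent score.
INDEX = {
    "web": (0, 2, 3, 5, 8),
    "search": (0, 1, 3, 5, 9),
}


@lru_cache(maxsize=1024)           # caching: repeated queries skip the work
def top_k(terms, k):
    """terms: tuple of query terms. Return the best k documents with all terms."""
    lists = [INDEX.get(term, ()) for term in terms]
    if not lists:
        return ()
    # Sets are used for brevity; a real engine would walk the lists in step.
    others = [set(postings) for postings in lists[1:]]
    results = []
    for doc in lists[0]:
        if all(doc in postings for postings in others):
            results.append(doc)
            if len(results) == k:
                break              # early termination: best k already found
    return tuple(results)
```

Calling `top_k(("web", "search"), 2)` stops after the second match without examining the rest of the list, and a second identical call is answered from the cache.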

16
(No Transcript)