1
CS/INFO 430 Information Retrieval
Lecture 17: Web Search 3
2
Course Administration
3
Information Retrieval Using PageRank
Simple Method: Rank by Popularity. Consider all hits (i.e., all documents that match the query in the Boolean sense) as equal. Display the hits ranked by PageRank. The disadvantage of this method is that it gives no attention to how closely a document matches the query.
4
Combining Term Weighting with Reference Pattern Ranking
Combined Method:
1. Find all documents that contain the terms in the query vector.
2. Let sj be the similarity between the query and document j, calculated using tf.idf or a related method.
3. Let pj be the popularity of document j, calculated using PageRank or another measure of importance.
4. The combined rank cj = λ·sj + (1 - λ)·pj, where λ is a constant.
5. Display the hits ranked by cj.
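The five steps map directly to code. The following is a minimal Python sketch of the combined method, assuming the similarity scores and PageRank values are precomputed and already normalized to comparable ranges; the function names, data structures, and example scores are illustrative, not any search engine's actual pipeline.

# Minimal sketch of the combined method above. The inputs
# (a tf.idf similarity table and a PageRank table) are assumed
# to be precomputed and normalized to comparable ranges.

def combined_rank(hits, similarity, popularity, lam=0.7):
    """Rank hits by cj = lam * sj + (1 - lam) * pj.

    hits       -- document ids that contain the query terms (step 1)
    similarity -- dict: doc id -> tf.idf similarity sj (step 2)
    popularity -- dict: doc id -> PageRank pj (step 3)
    lam        -- the constant lambda weighting the two scores (step 4)
    """
    combined = {j: lam * similarity[j] + (1 - lam) * popularity[j]
                for j in hits}
    # Step 5: display the hits ranked by cj, highest first.
    return sorted(hits, key=lambda j: combined[j], reverse=True)

# Example with made-up scores: a close textual match with modest
# popularity ("a") outranks a popular but less relevant page ("b").
hits = ["a", "b"]
print(combined_rank(hits, {"a": 0.9, "b": 0.3}, {"a": 0.1, "b": 0.8}))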
5
Questions about PageRank
Most pages have very small PageRanks. For searches that return large numbers of hits, there are usually a reasonable number of pages with high PageRank. For searches that return smaller numbers of hits, e.g., highly specific queries, all the pages may have very small PageRanks, so that it is difficult to rank them in a sensible order.
Example: A search by a customer for information about a product may rank a large number of mail order businesses that sell the product above the manufacturer's site that provides a specification for the product.
Small numbers of links may make big changes to rank.
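To see why most pages end up with very small PageRanks, here is a toy power-iteration PageRank over a made-up five-page link graph (purely illustrative): one well-linked hub page absorbs most of the rank, and the remaining pages end up with small, similar values that are hard to order sensibly.

# Toy PageRank by power iteration on a small hypothetical link graph,
# illustrating the skew discussed above.

damping = 0.85
links = {               # page -> pages it links to (made-up graph)
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["a"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                      # iterate until roughly stable
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += damping * rank[p] / len(outs)
    rank = new

# The hub gets most of the rank; the other pages get small values.
for p, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{p}: {r:.3f}")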
6
Advanced Graphical Methods: www.teoma.com
• Carry out a search
• Divide the Web sites found by the search into clusters, known as communities
• Calculate authority within communities
• Calculate hubs within communities, known as experts
Note: Teoma does not publish the precise algorithms it uses.
7
Other Factors in Ranking
The coefficients sj and pj may be varied by adding other evidence.
Similarity ranking sj might weight:
• structural mark-up, e.g., headings, bold, etc.
• meta-tags
• anchor text and adjacent text in the linking page
• file names
Popularity ranking pj might weight:
• usage data of the page
• previous searches by the same user
8
Anchor Text and Adjacent Text
[Diagram: Document A links to Document B; the anchor text and the adjacent text in Document A provide information about Document B.]
9
Anchor Text and File Names
The source of Document A contains the marked-up text:
<a href="http://www.cis.cornell.edu/">The Faculty of Computing and Information Science</a>
This string provides the following index terms about Document B:
Anchor text: faculty, computing, information, science
File name: cis, cornell
Note: A specific stop list is needed for each category of text.
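A short Python sketch of this extraction, with a separate stop list applied to each category of text. The regular expression, tokenizer, and stop lists are illustrative assumptions; a real crawler would use a full HTML parser.

# Sketch of extracting index terms for Document B from the markup
# in Document A. The regex and the per-category stop lists are
# illustrative assumptions.
import re

ANCHOR_STOPS = {"the", "of", "and", "a", "an"}     # ordinary stop words
FILENAME_STOPS = {"www", "edu", "com", "org", "index", "html", "http"}

def index_terms(html):
    m = re.search(r'<a\s+href="([^"]+)">(.*?)</a>', html)
    url, anchor = m.group(1), m.group(2)
    anchor_terms = [t for t in re.findall(r"[a-z]+", anchor.lower())
                    if t not in ANCHOR_STOPS]
    # File-name terms come from the host and path components.
    file_terms = [t for t in re.findall(r"[a-z]+", url.lower())
                  if t not in FILENAME_STOPS]
    return anchor_terms, file_terms

src = '<a href="http://www.cis.cornell.edu/">The Faculty ' \
      'of Computing and Information Science</a>'
print(index_terms(src))
# -> (['faculty', 'computing', 'information', 'science'], ['cis', 'cornell'])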
10
Indexing Non-Textual Materials
Factors that can be used to index non-textual materials:
• anchor text, including <alt> tags
• text adjacent to an anchor
• file names
• PageRank
This is the concept behind image searching on the Web.
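As a concrete illustration of indexing an image by its textual context (the idea shown on the next slide), here is a hedged Python sketch that collects terms from an image's alt text, its file name, and an adjacent caption. The HTML sample and the extraction rules are invented for the example.

# Sketch of indexing an image from its textual context: the alt
# attribute, the file name, and a caption near the <img> tag.
# The HTML sample and extraction rules are illustrative assumptions.
import re

def image_index_terms(html):
    terms = set()
    for m in re.finditer(r'<img\s+src="([^"]+)"\s+alt="([^"]*)"', html):
        src, alt = m.group(1), m.group(2)
        terms.update(re.findall(r"[a-z]+", alt.lower()))       # alt text
        name = src.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        terms.update(re.findall(r"[a-z]+", name.lower()))      # file name
    for m in re.finditer(r'<p class="caption">(.*?)</p>', html):
        terms.update(re.findall(r"[a-z]+", m.group(1).lower()))  # caption
    return sorted(terms)

page = ('<img src="photos/gates-hall.jpg" alt="Gates Hall">'
        '<p class="caption">Home of Information Science</p>')
print(image_index_terms(page))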
11
Context Image Searching
[Screenshot from the Information Science web site: the HTML source, and the captions and other adjacent text on the web page, supply index terms for the images.]
12
Evaluation: Web Searching
• Test corpus must be dynamic:
  • The web is dynamic: 10-20% of URLs change every month
  • Spam methods change continually
• Queries are time sensitive: topics are hot and then not; need to have a sample of real queries
• Languages: at least 90 different languages, reflected in cultural and technical differences
Amit Singhal, Google, 2004
13
Evaluation: Search and Browse
• Users give queries of 2 to 4 words
• Most users click only on the first few results; few go beyond the fold on the first page
• 80% of users use a search engine to find sites: search to find the site, browse to find the information
Amit Singhal, Google, 2004
Browsing is a major topic in the lectures on Usability.
14
Evaluation: The Human in the Loop
[Diagram of the user in the retrieval loop: search the index, return hits, browse documents, return objects.]
15
Scalability
• Question: How big is the Web and how fast is it growing?
• Answer: Nobody knows
• Estimates of the Crawled Web:
  • 1994: 100,000 pages
  • 1997: 1,000,000 pages
  • 2000: 1,000,000,000 pages
  • 2005: 8,000,000,000 pages
• Rough estimates of the Crawlable Web suggest at least 4x
• Rough estimates of the Deep Web suggest at least 100x

16
Scalability: Software and Hardware Replication
[Diagram: a replicated search service. Queries fan out from the search service front end to many parallel index servers, document servers, advertisement servers, and spell-checking servers.]
17
Scalability: Large-scale Clusters of Commodity Computers
"Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...."
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October 2003. http://portal.acm.org/citation.cfm?doid=945445.945450
18
Scalability: Performance
• Very large numbers of commodity computers
• Algorithms and data structures scale linearly
• Storage:
  • Scales with the size of the Web
  • Compression/decompression
• System:
  • Crawling, indexing, and sorting run simultaneously
• Searching:
  • Bounded by disk I/O

19
Scalability of Staff: Growth of Google
• In 2000: 85 people
  • 50 technical; 14 with a Ph.D. in Computer Science
• Equipment in 2000:
  • 2,500 Linux machines
  • 80 terabytes of spinning disks
  • 30 new machines installed daily
  (Reported by Larry Page, Google, March 2000)
• At that time, Google was handling 5.5 million searches per day; the rate of increase was 20% per month
• By fall 2002, Google had grown to over 400 people.
• By fall 2006, Google had over 9,000 people.

20
Scalability: Numbers of Computers
Very rough calculation: In March 2000, 5.5 million searches per day required 2,500 computers. In fall 2004, computers were about 8 times more powerful. Estimated number of computers for 250 million searches per day:
(250 / 5.5) × 2,500 / 8 ≈ 15,000
Some industry estimates (based on Google's capital expenditure) suggest that Google and Yahoo may have had as many as 250,000 computers in fall 2006.
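The rough calculation above, spelled out in a few lines of Python. All inputs are the rough figures quoted on this slide, not measured values.

# Back-of-envelope estimate from the slide above.
searches_2000 = 5.5e6       # searches/day handled by 2,500 machines
machines_2000 = 2500
searches_2004 = 250e6       # searches/day to support in fall 2004
speedup = 8                 # machines ~8x more powerful by fall 2004

machines_2004 = (searches_2004 / searches_2000) * machines_2000 / speedup
print(f"~{machines_2004:,.0f} machines")   # ~14,200, i.e. about 15,000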
21
Scalability: Staff
• Programming: As the number of programmers grows, it becomes increasingly difficult to maintain the quality of software. Have very well trained staff. Isolate complex code. Most coding is single image.
• System maintenance: Organize for minimal staff (e.g., automated log analysis; do not fix broken computers).
• Customer service: Automate everything possible, but complaints, large collections, etc. still require staff.

22
Scalability of Staff: The Neptune Project
• The Neptune Clustering Software: a programming API and runtime support that allow a network service to be programmed quickly for execution on a large-scale cluster handling high-volume user traffic.
• The system shields application programmers from the complexities of replication, service discovery, failure detection and recovery, load balancing, and resource monitoring and management.
Tao Yang, University of California, Santa Barbara
http://www.cs.ucsb.edu/projects/neptune/

23
Scalability: the Long Term
Web search services are centralized systems. Over the past 12 years, Moore's Law has enabled Web search services to keep pace with the growth of the Web and the number of users, while adding extra function. Will this continue? Possible areas for concern are staff costs, telecommunications costs, disk and memory access rates, and equipment costs.
24
Growth of Web Searching
• In November 1997:
  • AltaVista was handling 20 million searches/day.
  • Google's forecast for 2000 was 100s of millions of searches/day.
• In 2004, Google reported 250 million web searches/day, and estimated that the total over all engines was 500 million searches/day.
• Moore's Law and Web searching:
  • Over 7 years, Moore's Law (doubling roughly every 21 months) predicts computer power increased by a factor of at least 2^4 = 16.
  • It appears that computing power is growing at least as fast as web searching.

25
Other Uses of Web Crawling and Associated Technology
• The technology developed for Web search services has many other applications.
• Conversely, technology developed for other Internet applications can be applied in Web searching:
  • Related objects (e.g., Amazon's "Other people bought the following").
  • Recommender and reputation systems (e.g., Epinions' reputation system).