1
ISP 433/633 Week 7
  • Web IR

2
Web is a unique collection
  • Largest repository of data
  • Unedited
  • Can be anything
  • Information type
  • Sources
  • Changing
  • Growing exponentially
  • 320 million Web pages (Lawrence & Giles, 1998)
  • 800 million Web pages, 15 TB (Lawrence & Giles,
    1999)
  • 3 billion Web pages indexed (Google, 2003)

3
Web serves a unique user base
  • Virtually Anyone
  • No training
  • All kinds of information needs

4
What Do People Search for on the Web? (from Spink
et al. 1998 study)
  • Topics (% of queries)
  • Genealogy/Public figure 12%
  • Computer related 12%
  • Business 12%
  • Entertainment 8%
  • Medical 8%
  • Politics/Government 7%
  • News 7%
  • Hobbies 6%
  • General info/surfing 6%
  • Science 6%
  • Travel 5%
  • Arts/education/shopping/images 14%

5
Web Queries
  • Short
  • 2.4 words on average (Aug 2000)
  • Up from 1.7 words on average in 1997
  • User Expectations
  • Many say "the first item shown should be what I
    want to see!"
  • This works if the user has the most
    popular/common notion in mind

6
How to do Web IR?
Hyperlinks
  • Take advantage of
  • Social network analysis
  • E.g., the small-world phenomenon ("six degrees of
    separation")
  • Some people are more popular than others
  • Citation analysis
  • ISI's Impact Factor = NumOfCitations / NumOfPapers
  • The same type of analysis can be applied to Web
    page linkage
  • Link analysis

7
Link Analysis
  • Assumptions
  • If the pages pointing to this page are good, then
    this is also a good page.
  • The words on the links pointing to this page are
    useful indicators of what this page is about.
  • Does it work?
  • Apparently, Google uses it

8
PageRank
  • Google's trademarked algorithm (Page et al., 1998)
  • Named after Larry Page, co-founder of Google
  • Rank importance of a page based on the Web graph
  • 3 billion nodes (pages) and 20 billion edges
    (links)
  • Independent of query

9
PageRank Intuition
  • A page's rank is determined by the sum of its
    citing pages' ranks

10
PageRank calculation
  • Assume page A has pages T1...Tn which point to it
    (i.e., citations). The parameter d is a damping
    factor which can be set between 0 and 1 (usually
    0.85). C(A) is defined as the number of links
    going out of page A. The PageRank of page A is
  • PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))
  • Start with random guesses of the PageRanks
  • Iteratively recompute PageRanks for all pages
  • Until the values stabilize (see the sketch below)
  • The average PageRank over all pages is always 1.0
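A minimal Python sketch of this iterative computation, assuming the link graph is given as a dictionary; the example graph and iteration count are illustrative, not from the slides:

```python
# Iterative PageRank sketch following the slide's formula:
#   PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
# Assumes every page has at least one outgoing link.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                  # initial guesses
    for _ in range(iterations):                   # repeat until values stabilize
        new_pr = {}
        for page in pages:
            # Sum contributions from every page T that points to `page`
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Tiny illustrative graph: A and B cite each other and C; C cites A
print(pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}))
```

With this formulation (and no dangling pages), the PageRank values average to 1.0 over the graph, matching the last bullet above.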

11
PageRank
  • PageRank calculator
  • http://webworkshop.net/pagerank_calculator.php3
  • Use this knowledge to enhance site ranking in
    Google
  • Structure your site links to improve the main
    page's PageRank
  • http://www.iprcom.com/papers/pagerank/

12
Anchors
  • Words on the links
  • Often accurate description of the page
  • Helpful for non-text based information
  • Assign a high term-document weight to anchor
    words (see the sketch below)
  • Google does this
  • Abuse
  • Google bombing
  • Try "miserable failure" with Google
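A toy sketch of crediting a page with the words on links that point to it, assuming a simple per-document term-weight dictionary; the boost factor is illustrative, not Google's actual weighting scheme:

```python
# Index a page using both its own text and the anchor text of incoming links.
from collections import defaultdict

ANCHOR_BOOST = 3.0   # illustrative: anchor terms count more than body terms

def index_page(doc_weights, doc_id, body_text, incoming_anchor_texts):
    weights = doc_weights[doc_id]
    for term in body_text.lower().split():
        weights[term] += 1.0
    # Credit the target page with the words on links pointing to it
    for anchor in incoming_anchor_texts:
        for term in anchor.lower().split():
            weights[term] += ANCHOR_BOOST

doc_weights = defaultdict(lambda: defaultdict(float))
index_page(doc_weights, "page42", "company home page",
           ["best search engine", "search engine"])
print(doc_weights["page42"]["search"])   # 6.0: anchor text dominates
```

This is also why concerted linking with a chosen anchor phrase ("Google bombing") can push an unrelated page up for that phrase.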

13
HITS
  • Query-dependent model (Kleinberg, 1997)
  • Hubs
  • Pages that have many outgoing links
  • Authorities
  • Pages that have many links pointing to them
  • Interconnected
  • A positive two-way feedback
  • Can be used to calculate each other

14
HITS
  • Algorithm
  • Obtain a root set using the input query (via a
    regular search engine)
  • Expand the root set by radius one
  • Run iterations on the hub and authority scores
    together (see the sketch after this list)
  • Report top-ranking authorities and hubs
  • Can find relevant authorities that do not even
    contain the original query words
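A minimal sketch of the mutual hub/authority iteration over the expanded root set, assuming the link graph is given as a dictionary (the graph representation and iteration count are assumptions for illustration):

```python
# HITS iteration sketch: authorities are endorsed by good hubs,
# hubs are pages that point to good authorities.
import math

def hits(links, iterations=50):
    """links: dict mapping each page in the expanded root set to the pages it points to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub score: sum of authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so the scores do not grow without bound
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

The top-ranking entries of `auth` and `hub` are then reported as the authorities and hubs for the query.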

15
Subject-specific popularity
  • Similar to HITS idea
  • Without prior query
  • Ranks a site based on the number of same-subject
    pages that reference it
  • Clustering sites into communities
  • http://www.teoma.com/

16
Other Useful Information
  • Directories and categories
  • E.g. Yahoo
  • Capitalization, font, title, etc.
  • E.g., Google uses this information
  • "Click popularity": the number of clicks on the site
  • "Stickiness": the time spent on the site

17
Web Search Architecture
  • Preprocessing
  • Collection gathering phase
  • Web crawling
  • Collection indexing phase
  • Online
  • Query servers
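A toy illustration of the two phases above, assuming the crawled collection is already available as docId-to-text pairs; the tokenization and AND-style query semantics are simplifications for the sketch:

```python
# Offline: build an inverted index from crawled pages.
# Online: answer a query against that index.
from collections import defaultdict

def build_inverted_index(pages):
    """pages: dict mapping docId -> page text gathered by the crawler."""
    index = defaultdict(set)                      # term -> set of docIds
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query(index, terms):
    """Return docIds containing all query terms (simple AND semantics)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

pages = {1: "web information retrieval", 2: "web crawling and indexing"}
idx = build_inverted_index(pages)
print(query(idx, ["web", "indexing"]))            # -> {2}
```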

18
Standard Web Search Engine Architecture
[Diagram: offline, a crawler crawls the Web, duplicates are eliminated, and an inverted index keyed by DocIds is created; online, the search engine servers take the user query, consult the inverted index, and show results]
19
Google Architecture
20
Google Indexing Data Structure
  • A hit is an occurrence of a term in a document
  • Each forward barrel holds a range of wordIDs
  • Short barrels for fancy (title, big font) and
    anchor hits
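A simplified, purely illustrative data-structure sketch in the spirit of these bullets; the field names and layout are assumptions, not Google's actual on-disk format:

```python
# Toy model of hits and forward barrels: a hit records one occurrence of a
# term in a document, and each forward barrel holds a range of wordIDs.
from dataclasses import dataclass, field

@dataclass
class Hit:
    position: int          # where the term occurs in the document
    fancy: bool = False    # True for title/large-font/anchor occurrences

@dataclass
class ForwardBarrel:
    wordid_range: range                       # this barrel's range of wordIDs
    hits: dict = field(default_factory=dict)  # (docId, wordId) -> [Hit, ...]

    def add(self, doc_id, word_id, hit):
        assert word_id in self.wordid_range
        self.hits.setdefault((doc_id, word_id), []).append(hit)

barrel = ForwardBarrel(range(0, 1000))
barrel.add(doc_id=7, word_id=42, hit=Hit(position=3, fancy=True))  # a title hit
```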

21
Google Query Evaluation
22
Google Statistics (1998)
23
Web Crawlers
  • Main idea
  • Start with known sites
  • Record information for these sites
  • Follow the links from each site
  • Record information found at new sites
  • Repeat
  • Page Visit Order
  • Breadth-first search (sketched below)
  • Depth-first search
  • Best-first search (e.g., using PageRank)
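A breadth-first crawler sketch of the main idea; here fetch_links is a hypothetical placeholder for downloading a page and extracting its links, not a real library call:

```python
# Breadth-first crawl: start with known sites, record them, follow their
# links to new sites, and repeat until a page budget is exhausted.
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    visited = set()
    frontier = deque(seed_urls)          # FIFO queue -> breadth-first order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)                 # record information for this site
        for link in fetch_links(url):    # follow the links found on the page
            if link not in visited:
                frontier.append(link)
    return visited
```

Swapping the FIFO queue for a stack gives depth-first order; a priority queue keyed by an importance estimate such as PageRank gives best-first order.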

24
Web Crawling Issues
  • "Keep out" signs
  • A file called robots.txt tells the crawler which
    directories are off limits (see the sketch after
    this list)
  • Freshness
  • Figure out which pages change often, then crawl
    these often
  • Duplicates, virtual hosts, etc.
  • Hash page contents to detect duplicate pages
  • Lots of problems
  • Server unavailable
  • Incorrect HTML
  • Missing links
  • Infinite loops
  • Web crawling is difficult to do robustly
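A small sketch of honoring the "keep out" signs in robots.txt using Python's standard urllib.robotparser; the site URL and user-agent name are examples only:

```python
# Check robots.txt before fetching a URL.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                     # fetch and parse the file

if rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
    print("Allowed to crawl this URL")
else:
    print("robots.txt marks this directory off limits")
```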