WebBase: Building a Web Warehouse

1
WebBase: Building a Web Warehouse
  • Hector Garcia-Molina
  • Stanford University

Work with Sergey Brin, Junghoo Cho, Taher
Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar,
Sep Kamvar, Wang Lam, Larry Page, Andreas
Paepcke, Sriram Raghavan, Gary Wesley
2
The Web
  • A universal information resource
  • Model: weak, strong agreement
  • How to exploit it?

3
WebBase
WEB PAGE
4
WebBase Goals
  • Manage very large collections of Web pages
  • Today: 1500 GB HTML, 200 M pages
  • Enable large-scale Web-related research
  • Locally provide a significant portion of the Web
  • Efficient wide-area Web data distribution

5
WebBase Architecture
6
WebBase Remote Users
  • Berkeley
  • Columbia
  • U. Washington
  • Harvey Mudd
  • Università degli Studi di Milano
  • U. of Arizona
  • California Digital Library
  • Cornell
  • U. of Houston
  • Learning Lab Lower Saxony (L3S)
  • France Telecom
  • U. Texas

7
Outline
  • Technical Challenges
  • WebBase Use
  • The Future

8
Challenges
  • Archiving
    • units
    • coordination
  • IP Management
    • copy access
    • link access
    • access control
  • Hidden Web
  • Topic-Specific Collection Building
  • Scalability
    • crawling
    • archive distribution
    • index construction
    • storage
  • Consistency
    • freshness
    • versions
  • Dissemination

9
What is a Crawler?
[Figure: crawler loop. Initial URLs initialize the list of to-visit URLs; the crawler gets the next URL, gets the page from the web, extracts its URLs, and records visited URLs and web pages.]
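The loop in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the WebBase crawler: the function name `crawl`, the `fetch` and `extract_urls` callbacks, the `max_pages` limit, and the toy in-memory web are all assumptions made so the example runs without network access.

```python
from collections import deque

def crawl(initial_urls, fetch, extract_urls, max_pages=1000):
    """Basic crawl loop: a to-visit queue, a visited set, and a page store."""
    to_visit = deque(initial_urls)        # "to visit urls", seeded by "init"
    visited = set()                       # "visited urls"
    pages = {}                            # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()          # "get next url"
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                 # "get page"
        if page is None:                  # unreachable or unknown URL
            continue
        pages[url] = page
        for link in extract_urls(page):   # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages

# Hypothetical toy web: each "page" is simply the list of URLs it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

pages = crawl(["a"], fetch=TOY_WEB.get, extract_urls=lambda links: links)
print(sorted(pages))
```

A real crawler would add politeness delays, robots.txt handling, and persistent queues; none of that is shown here.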
10
Parallel Crawling
web
...
11
Independent Crawlers
12
Partition Firewall
partition
  • URL hash
  • Site hash
  • Hierarchical
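The three partitioning options can be sketched as functions that map a URL to a crawler process. This is only an illustration under assumptions: the function names, the use of MD5 as the hash, and the top-level-domain table used for the hierarchical case are hypothetical choices, not details given on the slide.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(key, n_procs):
    """Stable hash of a string into one of n_procs buckets."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_procs

def url_hash_partition(url, n_procs):
    """URL hash: pages of one site may be spread over many processes."""
    return _bucket(url, n_procs)

def site_hash_partition(url, n_procs):
    """Site hash: every page of a site goes to the same process."""
    return _bucket(urlsplit(url).hostname or "", n_procs)

def hierarchical_partition(url, tld_to_proc, default=0):
    """Hierarchical: assign by a level of the domain name (here, the TLD)."""
    host = urlsplit(url).hostname or ""
    return tld_to_proc.get(host.rsplit(".", 1)[-1], default)

# Hypothetical assignment of top-level domains to processes.
TLD_TABLE = {"edu": 0, "com": 1, "org": 2}
print(site_hash_partition("http://www.stanford.edu/a", 4),
      hierarchical_partition("http://www.stanford.edu/a", TLD_TABLE))
```

With a site hash, links inside one site never cross a partition boundary, which is one reason to prefer it over a plain URL hash.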

13
Partition Cross-Over
partition
15
Partition Exchange
partition
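The slides show the firewall, cross-over, and exchange modes only as diagrams. The sketch below follows how these modes are usually described for parallel crawlers: what a crawling process does with a link that falls outside its own partition. The function name `should_crawl`, its parameters, and the two-process setup are hypothetical.

```python
def should_crawl(link, my_id, owner_of, mode, outbox, own_queue_empty=False):
    """Decide whether process my_id downloads link itself."""
    owner = owner_of(link)
    if owner == my_id:
        return True                       # links in our own partition are always ours
    if mode == "firewall":
        return False                      # inter-partition links are dropped
    if mode == "cross-over":
        return own_queue_empty            # follow them only when local work runs out
    if mode == "exchange":
        outbox.setdefault(owner, []).append(link)  # hand the URL to its owner
        return False
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical setup: process 0 owns URLs starting with "x", process 1 the rest.
owner_of = lambda url: 0 if url.startswith("x") else 1
outbox = {}
print(should_crawl("y1", 0, owner_of, "exchange", outbox), outbox)
```

Firewall mode needs no communication but can miss pages, cross-over mode can download the same page twice, and exchange mode avoids both at the cost of messages between processes.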
17
Coverage vs Overlap
[Plot: coverage vs. overlap for a cross-over crawler with 5 random seeds per C-proc]
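The slide does not define the two metrics. One common definition, assumed here, takes overlap as (N - I) / I and coverage as I / U, where N is the total number of downloads across all processes, I the number of unique pages downloaded, and U the number of pages the crawler was supposed to download. The function names and sample data are hypothetical.

```python
def overlap(downloads_per_proc):
    """(N - I) / I: redundant downloads relative to unique pages."""
    total = sum(len(d) for d in downloads_per_proc)
    unique = len(set().union(*downloads_per_proc))
    return (total - unique) / unique if unique else 0.0

def coverage(downloads_per_proc, target_pages):
    """I / U: fraction of the target pages that were actually downloaded."""
    target = set(target_pages)
    unique = set().union(*downloads_per_proc) & target
    return len(unique) / len(target) if target else 0.0

# Hypothetical run with two crawling processes and five target pages.
runs = [["a", "b", "c"], ["c", "d"]]
print(overlap(runs), coverage(runs, ["a", "b", "c", "d", "e"]))
```

A cross-over crawler trades the two against each other: following more inter-partition links raises coverage but also raises overlap.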
18
WebBase Parallel Crawling
[Figure: a computer running a coordinator, connected to the web and to other computers]
19
WebBase Parallel Crawling
[Plot: CPU utilization versus number of processes]
20
Challenges
  • Archiving
    • units
    • coordination
  • IP Management
    • copy access
    • link access
    • access control
  • Hidden Web
  • Topic-Specific Collection Building
  • Scalability
    • crawling
    • archive distribution
    • index construction
    • storage
  • Consistency
    • freshness
    • versions
  • Dissemination

done
next
21
How to Refresh?
Page a changes daily; page b changes once a week.
The crawler can visit one page per week.
[Figure: pages a and b on the web, with their copies in the repository]
  • How should we visit pages?
  • a a a a a a a ...
  • b b b b b b b ...
  • a b a b a b a b ... (uniform)
  • a a a a a a b a a a ... (proportional)
  • ?
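One way to compare these schedules is to estimate the expected freshness of the repository under each. The sketch below assumes that pages change as a Poisson process and are revisited at regular intervals, which the slide does not state; under that assumption a page with change rate λ visited at rate f is fresh a fraction (1 - e^(-λ/f)) / (λ/f) of the time. The function names and the policy table are hypothetical.

```python
import math

def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, periodic visits)."""
    if visit_rate == 0:
        return 0.0                 # never refreshed: stale in the long run
    if change_rate == 0:
        return 1.0                 # never changes: always fresh
    r = change_rate / visit_rate
    return (1 - math.exp(-r)) / r

def average_freshness(change_rates, visit_rates):
    """Mean freshness over all pages in the repository."""
    return sum(expected_freshness(change_rates[p], visit_rates[p])
               for p in change_rates) / len(change_rates)

# Rates per week: a changes daily, b weekly; the budget is one visit per week.
CHANGES = {"a": 7.0, "b": 1.0}
POLICIES = {
    "only a":       {"a": 1.0, "b": 0.0},
    "only b":       {"a": 0.0, "b": 1.0},
    "uniform":      {"a": 0.5, "b": 0.5},
    "proportional": {"a": 7 / 8, "b": 1 / 8},
}
for name, visits in POLICIES.items():
    print(f"{name:13s} {average_freshness(CHANGES, visits):.3f}")
```

Under these assumptions the uniform schedule scores higher than the proportional one, because visits to a page that changes far faster than it can be refreshed are largely wasted.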

22
Using WebBase
  • Fast Page Rank
  • Complex Queries

23
Structure of the Web
Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
24
Structure of the Web
berkeley.edu
stanford.edu
mit.edu
25
Nested Block Structure of the Web
[Figure: link matrix with "from" and "to" axes, showing blocks for Stanford and Berkeley]
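The block structure can be made concrete by counting links between domains: if most links stay inside a domain, the diagonal blocks of the from/to matrix dominate. The following is a small sketch on made-up data; the helper names and the rule of taking the last two host labels as the domain are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_of(url):
    """Last two labels of the host name, e.g. cs.stanford.edu -> stanford.edu."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def block_counts(links):
    """Count links per (from-domain, to-domain) block of the link matrix."""
    return Counter((domain_of(src), domain_of(dst)) for src, dst in links)

# Hypothetical links, written as (from, to) pairs.
LINKS = [
    ("http://cs.stanford.edu/a", "http://www.stanford.edu/b"),
    ("http://www.stanford.edu/b", "http://cs.stanford.edu/a"),
    ("http://cs.stanford.edu/a", "http://www.berkeley.edu/x"),
    ("http://www.berkeley.edu/x", "http://eecs.berkeley.edu/y"),
]
for block, count in sorted(block_counts(LINKS).items()):
    print(block, count)
```

Repeating the same grouping at the host and directory level would give the nesting named in the slide title.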
26
Personalized Page Rank
a
b
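Personalized PageRank replaces the uniform random jump of ordinary PageRank with a jump distribution biased toward pages a user cares about. Below is a minimal power-iteration sketch, not the fast algorithm the talk refers to; the function name, the damping factor of 0.85, the iteration count, and the toy graph are assumptions, and every link target is assumed to appear as a key of the graph.

```python
def pagerank(graph, teleport=None, damping=0.85, iterations=100):
    """Power iteration; teleport is the (personalized) jump distribution."""
    nodes = sorted(graph)
    if teleport is None:                                # ordinary PageRank
        teleport = {n: 1.0 / len(nodes) for n in nodes}
    rank = {n: teleport.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) * teleport.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:                                     # spread rank over out-links
                share = damping * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:                                       # dangling page: jump instead
                for m in nodes:
                    new[m] += damping * rank[n] * teleport.get(m, 0.0)
        rank = new
    return rank

# Hypothetical four-page graph; page "d" has no in-links.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = pagerank(GRAPH)
personalized = pagerank(GRAPH, teleport={"d": 1.0})
print(uniform["d"], personalized["d"])
```

Page d receives rank only through random jumps, so biasing the jump toward d raises its score relative to the uniform case.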
27
Complex Queries
Text search: e.g., search for "SARS Symptoms"
Stanford WebBase Repository
Complex queries: declarative analysis interface
28
Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking
  • Compute S: stanford.edu pages containing the phrase "Mobile networking"
  • Compute R: the set of all .edu domains pointed to by pages in S
[Figure: within the entire Web, the stanford.edu mobile networking pages (S) point to the .edu domains in R]
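The two steps of this query can be sketched over a small in-memory collection. This is only an illustration of the logic, not WebBase's declarative interface: the page records, field names, and helper functions are hypothetical.

```python
from urllib.parse import urlsplit

def host(url):
    return urlsplit(url).hostname or ""

def compute_s(pages, site="stanford.edu", phrase="mobile networking"):
    """S: pages on the given site whose text contains the phrase."""
    return {url for url, page in pages.items()
            if (host(url) == site or host(url).endswith("." + site))
            and phrase in page["text"].lower()}

def compute_r(pages, s):
    """R: all .edu domains pointed to by pages in S."""
    r = set()
    for url in s:
        for link in pages[url]["links"]:
            h = host(link)
            if h.endswith(".edu"):
                r.add(".".join(h.split(".")[-2:]))
    return r

# Hypothetical pages: url -> text and out-links.
PAGES = {
    "http://cs.stanford.edu/mobile": {
        "text": "Mobile networking research group",
        "links": ["http://www.berkeley.edu/wireless", "http://example.com/"],
    },
    "http://www.stanford.edu/news": {
        "text": "Campus news",
        "links": ["http://www.cornell.edu/"],
    },
    "http://cs.mit.edu/mobile": {
        "text": "Mobile networking at MIT",
        "links": ["http://www.columbia.edu/"],
    },
}
S = compute_s(PAGES)
R = compute_r(PAGES, S)
print(S, R)
```

As written on the slide, R is not filtered further, so stanford.edu itself would appear in R whenever pages in S link to other Stanford pages.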
29
Supernodes
[Figure: Web graph of pages P1 to P5 grouped into supernodes N1, N2, N3]
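The idea of collapsing a page graph into a much smaller supernode graph can be sketched as follows. This is a simplified illustration rather than the S-Node representation itself, and the assignment of pages P1 to P5 to supernodes N1 to N3, as well as the links between pages, are made up because the slide text does not give them.

```python
def supernode_graph(page_links, supernode_of):
    """Collapse page-level links into edges between supernodes."""
    edges = set()
    for src, targets in page_links.items():
        for dst in targets:
            a, b = supernode_of[src], supernode_of[dst]
            if a != b:                 # links inside one supernode are not superedges
                edges.add((a, b))
    return edges

# Hypothetical partition of five pages into three supernodes.
SUPERNODE_OF = {"P1": "N1", "P2": "N1", "P3": "N2", "P4": "N3", "P5": "N3"}
PAGE_LINKS = {
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": ["P4"],
    "P4": ["P5"],
    "P5": ["P1"],
}
print(sorted(supernode_graph(PAGE_LINKS, SUPERNODE_OF)))
```

Here five page-level edges into other supernodes shrink to three superedges, which suggests why the supernode graph can stay small enough to hold in memory.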
30
Growth of Supernode Graph
[Plot: size of supernode graph (MB) versus number of pages (millions)]
Labeled data point: 82 MB for 115 M pages (830 GB of raw HTML)
31
Query Execution Times
[Plot: time for navigation operation (secs) for Query 1 through Query 6]
Representations compared: S-Node representation, Relational DB, Connectivity Server, Files of adjacency lists
32
Query Optimization
33
Impact of cluster-based optimization
  • 35-million-page dataset: 600 million links, 300 GB of HTML
  • 40-45% reduction in query execution times
34
Conclusion (So Far)
  • Web is universal information resource
  • WebBase exploits this resource
  • WebBase Challenges
  • scalability, consistency, complex queries...

35
Will WebBase Scale?
[Plot: web content (indexable) over time, starting from today, against WebBase capacity under an optimistic and a pessimistic projection]
36
Pessimistic Scenario
  • Specialized WebBases
    • sports
    • shopping
    • ...
[Plot: web content (indexable) and WebBase capacity (pessimistic) over time, starting from today]
37
Optimistic Scenario
  • Web in a Box
    • Web delivered on CD monthly
    • search engine handles updates
[Plot: WebBase capacity (optimistic) and web content (indexable) over time, starting from today]
38
Legal Issues?
  • Is WebBase legal?
  • copies
  • links, deep linking
  • International regulations

39
Biasing Results
  • How long will Google, AltaVista, etc. resist temptations?
  • Biasing Crawler
  • Link and Content Spam

40
Access Data
  • WebBase does not capture access patterns

WebBase
?
41
Semantic Web?
semantic tags
WebBase
?
  • Will tags be generated?
  • By whom?
  • Agreement?

42
Future Technical Challenges
  • Incremental Updates
  • Query Optimization
  • Crawling Deep Web

43
Final Conclusion
  • Many challenges ahead...
  • Additional information: Google "Stanford WebBase"