Title: Internet Archive
1Internet Archive: Web Datamining
- Raymie Stata
- UC Santa Cruz / Internet Archive
2Agenda
- State of the Archive
- Collections
- Infrastructure (freecache)
- Internet Analytics
- Information carnivores
3Archive Overview
- Started in 1996
- Transitioned from Archive of the Internet to
Archive on the Internet
- Transitioning to Digital Library of the Future
- Funding from private foundations, plus lots of
volunteers
4Digital Library of the Future
Universal Access to Human Knowledge
- Information is accessible to anyone from anywhere
- The best and broadest information is available
- We imagine a small network of very large,
regional, mega digital libraries
5Web collection
- Over 10B pages, 200TB, 50M sites
- Broad crawls (20TB snapshot/2 months)
- Narrow crawls (elections, 9/11)
- Heritage crawls
- Writing new crawler :-(
- Wayback machine
- Success! 4M hits/day
- Have search engine, but hidden!
- Policy has been tested, remains same
6Moving images
- 2500 Movies
- Open source movies
- Upload your movie to the Archive
- Build a movie at the Archive!
7Texts
- Have > 20K books
- Actively involved in 1M Book and ICDL
- Bookmobile
- Protest of Eldred
- Real interest turned out to be overseas
- India (30!), Egypt, Uganda
- Spun into separate non-profit
8Audio - eTree
- Around 5,000 concerts from 250 bands
- Growing 30 concerts, 1 band/day
- Largest consumer of bandwidth
- Consistent 85Mbps (downloads)
- Same policy as Wayback
- We respect requests
9Infrastructure
- Infinite bandwidth and storage
- Core competency of the Archive
- Vision, not reality
- But striving for it makes us better
- Recent challenges
- Moving from 250TB to 1PB
- Supporting eTree bandwidth
10The Petabyte challenge
- Finally having problems predicted
- Power, cooling, disk failures dominating
- Need larger staff, real software engineering
- BUT
- Took much longer than anticipated
- Sticking to our philosophies
- Commodity hardware
- Widely used software, simple scripts
11The Petabyte architecture
- New datacenter
- To solve our power and cooling problems
- Better procurement process
- File-level mirroring
- Use basic FS, simple scripts
- Preparing for geoplexing (vs. file-level RAID)
- Elimination of inter-crawl copies
- This is currently our backup
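The file-level mirroring idea above (a basic file system plus simple scripts) could look roughly like the sketch below. The slides do not show the Archive's actual scripts, so the function names `sha1_of` and `mirror_tree`, the use of SHA-1 checksums, and the directory layout are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_tree(primary: Path, mirror: Path) -> list:
    """Copy every file under `primary` that is missing or different in `mirror`.

    Hypothetical helper: each file is mirrored whole, with no block-level
    striping or parity, so either copy stays readable with ordinary tools.
    """
    copied = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        dst = mirror / src.relative_to(primary)
        if dst.is_file() and sha1_of(dst) == sha1_of(src):
            continue  # mirror copy is already identical
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Because each file exists as a complete copy on two plain file systems, the mirror side could later sit in a different datacenter, which is one way to read the "preparing for geoplexing" bullet.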
12The (eTree) bandwidth challenge
- Can we do better than simply buying more
bandwidth?
- Yes! Find other people willing to help
- Cooperative/open-source CDN
13Freecache.org
- It shouldn't cost you to give away content
- To distribute using freecache, simply
- Replace href="http://X/Y"
- With href="http://freecache.org/http://X/Y"
- To be a distribution node, simply install a 1K
perl-script on your Apache server
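The link rewrite described above amounts to prefixing the original URL with the freecache.org gateway. A minimal sketch follows; the helper name `to_freecache_url` is hypothetical, and restricting it to plain `http://` URLs is an assumption based on the example on the slide.

```python
FREECACHE_PREFIX = "http://freecache.org/"


def to_freecache_url(url: str) -> str:
    """Prefix an http URL with the freecache.org gateway."""
    if not url.startswith("http://"):
        # Assumption: the slide only shows plain http:// links.
        raise ValueError("only plain http:// URLs are handled in this sketch")
    if url.startswith(FREECACHE_PREFIX):
        return url  # already routed through freecache
    return FREECACHE_PREFIX + url


print(to_freecache_url("http://X/Y"))  # prints http://freecache.org/http://X/Y
```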
14Freecache design
- Content routing done centrally
- Right now, routing is random
- Working on closeness-driven routing
- LRU eviction policy
- Throttles cheaters
- Broken browsers have been a problem
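Two of the design points above, random routing and LRU eviction, can be illustrated with a short sketch. This is not the freecache code (the slides describe the node as a Perl script on Apache); the class `LRUCache`, the byte-based capacity, and `pick_node` are assumptions chosen to show the general technique.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Byte-budgeted cache that evicts the least recently used object first."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # url -> content, least recently used first

    def get(self, url):
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url, content: bytes):
        if url in self._items:
            self.used_bytes -= len(self._items.pop(url))
        self._items[url] = content
        self.used_bytes += len(content)
        # Evict from the least recently used end until we fit the budget.
        while self.used_bytes > self.capacity_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)


def pick_node(nodes, rng=random):
    """Random routing: every distribution node is equally likely."""
    return rng.choice(nodes)
```

A closeness-driven router would replace `pick_node` with a choice based on some distance estimate between the client and each node; the slides do not say which estimate is planned.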
15Web scale datamining
Apps
- Use data
- Wayback, Wayback search
- Web characterization
- Story lifecycle analyzer
Access
Feature Datamarts
- Access subsets of data fast
- Full-text index, shingleprints
- Connectivity, Term vectors
Warehouse
- Store and access pages
- Page cache
- Feature extractor
Data collection
- Download web pages
- Donations, crawling
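The datamart layer above mentions shingleprints. As a rough illustration of the general shingling technique (not the Archive's feature extractor, whose details the slides do not give), the sketch below builds k-word shingles, keeps the smallest few hashes as a compact print, and measures resemblance as Jaccard overlap. The parameters `k` and `keep` and all function names are assumptions.

```python
import hashlib


def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingleprint(text: str, k: int = 4, keep: int = 8) -> set:
    """Keep the `keep` smallest 64-bit hashes of the shingles as a compact print."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    )
    return set(hashes[:keep])


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard overlap of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Storing only the small print per page is what lets a datamart compare pages without touching the full text in the warehouse.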
16Tools for Web mining
- Very similar to the Astronomy project
- Need indexes, parallelism
- Need to move computation to the data
- Strategies to deal with different result-set
sizes
- Current focus is on the warehouse
17Web datamining using Web Carnivores
18The Carnivore Analogy [Etzioni96]
Web pages
19The Carnivore Analogy
Search engines
Web pages
20The Carnivore Analogy
Carnivore apps
Search engines
Web pages
21Carnivores
- Search engines have what you want
- Google has 3B pages: it's in there
- No need to crawl anymore
- However, their general-purpose interfaces do not
always yield good results for specific
information needs
22Googlisms: a fun carnivore
Googlism for: scott kirkpatrick
- scott kirkpatrick is an associate for ross
- scott kirkpatrick is an awesome drummer with many fine credits to his name
- scott kirkpatrick is 17 but certified as an adult
- scott kirkpatrick is listed as one of the executors in the will of george hankins dated 1 october 1838 in jackson county
- scott kirkpatrick is the new chairperson
- scott kirkpatrick is joining the flett chiropractic clinic
Googlism for: john kubiatowicz
- john kubiatowicz is a professor in computer science at uc berkeley
- john kubiatowicz is currently an assistant professor at the university of california at berkeley
- john kubiatowicz is designing a
- john kubiatowicz is working on oceanstore
- john kubiatowicz is a researcher at berkeley exploring the space of introspective computing
- john kubiatowicz is a doctoral candidate in the department of electrical engineering and computer science at mit
23A carnivore for genre search
- Genre classifies documents by their intent
- Why was the document written?
- Search engines search by topic, not genre
- Idea: build a carnivore for genre search
24Genre search engine
Query Generation
Topic (from user)
Filter
Results
Term-vector generation
Genre (static)
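The components named in the diagram above (query generation from a topic and a static genre, term-vector generation, and a filter) might fit together as in the following sketch. The slides do not specify the wiring, the similarity measure, or any threshold, so cosine similarity over raw term counts, the `threshold` value, and every function name here are assumptions. The search-engine call is left out: `pages` stands in for the text of whatever results come back.

```python
import math
from collections import Counter


def generate_queries(topic: str, genre_terms: list) -> list:
    """Combine the user's topic with each static genre term (hypothetical template)."""
    return [f'{topic} "{term}"' for term in genre_terms]


def term_vector(text: str) -> Counter:
    """Bag-of-words term vector with raw counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def filter_by_genre(pages: list, genre_vector: Counter, threshold: float = 0.2) -> list:
    """Keep pages whose term vector is close to the genre vector, best first."""
    scored = [(cosine(term_vector(p), genre_vector), p) for p in pages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for score, page in scored if score >= threshold]
```

For example, `generate_queries("digital camera", ["buying guide"])` yields the single query `digital camera "buying guide"`, and the filter then reorders and prunes the pages returned for it.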
25Making it work
- Query templates
- Details of the query matter
- PMI-IR for genre terms
- Discrimination as well as genre vector
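PMI-IR scores how strongly two phrases are associated by using search-engine hit counts as probability estimates. The sketch below shows the basic pointwise mutual information calculation that PMI-IR builds on; how the talk applies it to genre terms is not spelled out on the slide, so ranking candidate terms against a genre phrase is an assumption, as are the function names and the hit counts in the usage note.

```python
import math


def pmi_ir(hits_both: int, hits_term: int, hits_genre: int, total_pages: int) -> float:
    """Pointwise mutual information (in bits) estimated from hit counts.

    hits_both:  pages matching both the candidate term and the genre phrase
    hits_term:  pages matching the candidate term alone
    hits_genre: pages matching the genre phrase alone
    """
    if min(hits_both, hits_term, hits_genre) <= 0:
        return float("-inf")  # no evidence of association
    return math.log2(hits_both * total_pages / (hits_term * hits_genre))


def rank_terms(counts: dict, hits_genre: int, total_pages: int) -> list:
    """Sort candidate terms by PMI with the genre phrase, strongest first.

    `counts` maps term -> (hits_both, hits_term).
    """
    scored = {
        term: pmi_ir(both, alone, hits_genre, total_pages)
        for term, (both, alone) in counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

With made-up counts, `rank_terms({"compare": (200, 1000), "the": (100, 1000)}, 1000, 10000)` puts "compare" first: it co-occurs with the genre phrase twice as often as chance would predict, while "the" co-occurs at exactly chance level.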
26User study
- Genre: Buying guides
- Education for product selection
- Lots on the Web, but hard to find
- (Agreement on what they are)
- Results
- Topic by itself: 0% P@10 (i.e., none in top 10)
- Topic + "buying guide": 33%
- Our carnivore: 51%
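The P@10 figures above are precision at rank 10: the fraction of the top ten results judged to be in the target genre. A minimal sketch of that metric follows; the function name and the boolean-judgment input format are assumptions, and no data from the study is reproduced.

```python
def precision_at_k(relevant_flags, k: int = 10) -> float:
    """Fraction of the top k results judged relevant.

    `relevant_flags` is an ordered sequence of booleans, best-ranked first.
    Missing positions (fewer than k results) count as non-relevant.
    """
    top = list(relevant_flags)[:k]
    return sum(1 for flag in top if flag) / k


# Hypothetical judgments for one query: 5 of the top 10 are buying guides.
print(precision_at_k([True, False, True, True, False,
                      False, True, False, True, False]))  # prints 0.5
```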