Internet Archive - PowerPoint PPT Presentation

About This Presentation

Title:

Internet Archive

Slides: 27

Provided by: raym153

Learn more at: http://www.fdis.org

Transcript and Presenter's Notes

Title: Internet Archive

1
Internet Archive Web Datamining

Raymie Stata
UC Santa Cruz Internet Archive

2
Agenda

State of the Archive
Collections
Infrastructure (freecache)
Internet Analytics
Information carnivores

3
Archive Overview

Started in 1996
Transitioned from 'Archive of the Internet' to
'Archive on the Internet'
Transitioning to Digital Library of the Future
Funding from private foundations, plus lots of
volunteers

4
Digital Library of the Future
Universal Access to Human Knowledge

Information is accessible to anyone from anywhere
The best and broadest information is available
We imagine a small network of very large,
regional, mega digital libraries

5
Web collection

Over 10B pages, 200TB, 50M sites
Broad crawls (20TB snapshot/2 months)
Narrow crawls (elections, 9/11)
Heritage crawls
Writing new crawler :-(
Wayback machine
Success! 4M hits/day
Have search engine, but hidden!
Policy has been tested, remains same

6
Moving images

2500 Movies
Open source movies
Upload your movie to the Archive
Build a movie at the Archive!

7
Texts

Have > 20K books
Actively involved in 1M Book and ICDL
Bookmobile
Protest of Eldred
Real interest turned out to be overseas
India (30!), Egypt, Uganda
Spun into separate non-profit

8
Audio - eTree

Around 5,000 concerts from 250 bands
Growing 30 concerts, 1 band/day
Largest consumer of bandwidth
Consistent 85Mbps (downloads)
Same policy as Wayback
We respect requests

9
Infrastructure

Infinite bandwidth and storage
Core competency of the Archive
Vision, not reality
But striving for it makes us better
Recent challenges
Moving from 250TB to 1PB
Supporting eTree bandwidth

10
The Petabyte challenge

Finally having problems predicted
Power, cooling, disk failures dominating
Need larger staff, real software engineering
BUT
Took much longer than anticipated
Sticking to our philosophies
Commodity hardware
Widely used software, simple scripts

11
The Petabyte architecture

New datacenter
To solve our power and cooling problems
Better procurement process
File-level mirroring
Use basic FS, simple scripts
Preparing for geoplexing (vs. file-level RAID)
Elimination of inter-crawl copies
This is currently our backup
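The file-level mirroring idea on this slide (basic file system plus simple scripts) could be sketched roughly as below. This is an illustration only, not the Archive's actual script: the function names `sha1_of` and `mirror_tree`, the use of SHA-1 to detect stale copies, and the directory layout are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of(path):
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_tree(primary, mirror):
    """Copy every file under `primary` to the same relative path under `mirror`.

    Files whose mirror copy already has the same checksum are skipped.
    Returns the relative paths (POSIX style) that were copied.
    """
    primary, mirror = Path(primary), Path(mirror)
    copied = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = mirror / rel
        if dst.is_file() and sha1_of(dst) == sha1_of(src):
            continue  # mirror copy is already up to date
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return copied
```

Because each file is mirrored as an ordinary file, either copy can be read with standard tools, which fits the "commodity hardware, simple scripts" philosophy stated earlier in the deck.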

12
The (eTree) bandwidth challenge

Can we do better than simply buying more
bandwidth?
Yes! Find other people willing to help
Cooperative/open-source CDN

13
Freecache.org

It shouldn't cost you to give away content
To distribute using freecache, simply
Replace href="http://X/Y"
With href="http://freecache.org/http://X/Y"
To be a distribution node, simply install a 1K
Perl script on your Apache server
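The link rewrite described on this slide amounts to prefixing the original URL with the freecache host. A minimal sketch, assuming only plain http:// links as shown on the slide; the helper name `to_freecache` is hypothetical:

```python
FREECACHE_PREFIX = "http://freecache.org/"


def to_freecache(url: str) -> str:
    """Rewrite http://X/Y into http://freecache.org/http://X/Y."""
    if not url.startswith("http://"):
        # The slide only shows plain http:// links; anything else is out of scope here.
        raise ValueError("expected an http:// URL")
    if url.startswith(FREECACHE_PREFIX):
        return url  # already routed through freecache, leave unchanged
    return FREECACHE_PREFIX + url
```

For example, `to_freecache("http://X/Y")` yields `"http://freecache.org/http://X/Y"`, matching the replacement shown above.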

14
Freecache design

Content routing done centrally
Right now, routing is random
Working on closeness-driven routing
LRU eviction policy
Throttles cheaters
Broken browsers have been a problem
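Two of the design points above, random routing and LRU eviction, can be illustrated with a short sketch. This is not freecache's code: the class `LRUCache`, the byte-based capacity, and the function `pick_node` are assumptions made for illustration.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Byte-capped cache that evicts the least recently used entry first."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # url -> content, least recently used first

    def get(self, url):
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url, content):
        if len(content) > self.capacity_bytes:
            return  # object can never fit, so do not cache it
        if url in self._items:
            self.used_bytes -= len(self._items.pop(url))
        self._items[url] = content
        self.used_bytes += len(content)
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)  # drop the oldest entry
            self.used_bytes -= len(evicted)


def pick_node(nodes, rng=random):
    """Central router choosing a distribution node uniformly at random."""
    return rng.choice(nodes)
```

A closeness-driven router, which the slide lists as work in progress, would replace `pick_node` with a choice based on some distance measure between client and node; the slide does not say which measure.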

15
Web scale datamining

Apps
Use data
Wayback, Wayback search
Web characterization
Story lifecycle analyzer

Access
Feature Datamarts

Access subsets of data fast
Full-text index, shingleprints
Connectivity, Term vectors

Warehouse

Store and access pages
Page cache
Feature extractor

Data collection

Download web pages
Donations, crawling
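The slide lists shingleprints among the feature datamarts without defining them. Assuming the usual word-level shingling technique for near-duplicate detection, a minimal sketch might look like the following; the window size `w`, the sketch size `keep`, and all function names are illustrative choices, not details from the talk.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of overlapping w-word windows (shingles) in the text."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingleprint(text, w=4, keep=8):
    """Compact fingerprint: the `keep` smallest SHA-1 hashes of the shingles."""
    hashes = sorted(
        int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)
        for s in shingles(text, w)
    )
    return frozenset(hashes[:keep])


def resemblance(text_a, text_b, w=4):
    """Jaccard similarity of the two documents' full shingle sets."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Storing only the small fingerprint per page is what would make such a feature cheap enough to keep in a datamart, while the full shingle sets give the exact resemblance when two pages need a closer comparison.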

16
Tools for Web mining

Very similar to the Astronomy project
Need indexes, parallelism
Need to move computation to the data
Strategies to deal with different result-set
sizes
Current focus is on the warehouse

17
Web datamining using Web Carnivores
18
The Carnivore Analogy [Etzioni96]
Web pages
19
The Carnivore Analogy
Search engines
Web pages
20
The Carnivore Analogy
Carnivore apps
Search engines
Web pages
21
Carnivores

Search engines have what you want
Google has 3B pages: it's in there
No need to crawl anymore
However, their general-purpose interfaces do not
always yield good results for specific
information needs

22
Googlisms: a fun carnivore

Googlism for scott kirkpatrick
scott kirkpatrick is an associate for ross
scott kirkpatrick is an awesome drummer with many fine credits to his name
scott kirkpatrick is 17 but certified as an adult
scott kirkpatrick is listed as one of the executors in the will of george hankins dated 1 october 1838 in jackson county
scott kirkpatrick is the new chairperson
scott kirkpatrick is joining the flett chiropractic clinic

Googlism for john kubiatowicz
john kubiatowicz is a professor in computer science at uc berkeley
john kubiatowicz is currently an assistant professor at the university of california at berkeley
john kubiatowicz is designing a
john kubiatowicz is working on oceanstore
john kubiatowicz is a researcher at berkeley exploring the space of introspective computing
john kubiatowicz is a doctoral candidate in the department of electrical engineering and computer science at mit
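The output above suggests a simple pattern-extraction step layered on top of search results. A rough sketch of that step, assuming the snippets have already been fetched from a search engine (no real search API is called here, and the function name `googlisms` is hypothetical):

```python
import re


def googlisms(name, snippets):
    """Collect distinct '<name> is ...' phrases from search-result snippets."""
    name = name.lower()
    # Capture everything after "<name> is " up to the next sentence terminator.
    pattern = re.compile(re.escape(name) + r" is ([^.!?]+)")
    phrases = []
    for snippet in snippets:
        for match in pattern.finditer(snippet.lower()):
            rest = match.group(1).strip()
            if not rest:
                continue
            phrase = f"{name} is {rest}"
            if phrase not in phrases:  # keep first occurrence, drop repeats
                phrases.append(phrase)
    return phrases
```

The carnivore does no crawling of its own: it relies on the search engine to find pages mentioning the name and only post-processes what comes back, which is the point of the analogy.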
23
A carnivore for genre search