Archive-It Architecture Introduction

About This Presentation

Slides: 17

Provided by: netEdu

Learn more at: http://net.educause.edu

Transcript and Presenter's Notes

1
Archive-It Architecture Introduction

2
Archive-It Components

3
Component Integration
4
Crawling

5
Crawling - Distributed Crawling

Heritrix Cluster Controller
Java component, open source, developed by IA
http://crawler.archive.org/hcc
Provides proxy access to a pool of Heritrix instances through a JMX interface
Provides crawler control and status
Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
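
The proxy idea on this slide can be sketched in a few lines. This is a conceptual illustration in Python only, not the Heritrix Cluster Controller's Java/JMX API: the class names `CrawlerInstance` and `ClusterController`, the method names, and the job name are all hypothetical, and the "first idle instance" assignment policy is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CrawlerInstance:
    """Hypothetical stand-in for one Heritrix instance (not the real JMX bean)."""
    name: str
    current_job: Optional[str] = None

    def status(self) -> str:
        return "idle" if self.current_job is None else f"crawling {self.current_job}"


@dataclass
class ClusterController:
    """Hypothetical single entry point in front of a pool of crawler instances."""
    pool: List[CrawlerInstance] = field(default_factory=list)

    def start_job(self, job: str) -> CrawlerInstance:
        # Assumed policy: hand the job to the first idle instance,
        # so callers never have to pick a crawler themselves.
        for crawler in self.pool:
            if crawler.current_job is None:
                crawler.current_job = job
                return crawler
        raise RuntimeError("no idle crawler instance available")

    def status(self) -> Dict[str, str]:
        # Aggregate status for the whole pool in one call.
        return {c.name: c.status() for c in self.pool}


# 33 instances, matching the count quoted on the slide.
controller = ClusterController([CrawlerInstance(f"crawler-{i:02d}") for i in range(33)])
assigned = controller.start_job("partner-a-weekly")
print(assigned.name, controller.status()[assigned.name])
```

The point of the design is that the web application talks to one controller rather than to 33 separate crawlers, so instances can be added to the pool without changing the caller.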

6
Archive-It Web Application

7
Storage

8
Access

Internet Archive Wayback Machine
Replaying archived web pages since 2001
Current IA version written in Perl and C, with components distributed across various machines
Not open source, but an open source beta (in Java) is available now

9
Full-Text Indexing

Nutch (http://nutch.org)
NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files
Standard text search plus link analysis
Can search by date instead of relevance, useful for individual archives
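
The difference between relevance ordering and date ordering can be shown with a small sketch. It is not NutchWAX code: the `Hit` record, the `rank` function, the URLs, and the scores are hypothetical, and "newest capture first" is an assumed reading of searching by date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    url: str
    captured: date   # capture date of the archived document
    score: float     # relevance score (illustrative values only)


def rank(hits, by="relevance"):
    """Order hits by relevance (highest score first) or by capture date (newest first)."""
    if by == "relevance":
        return sorted(hits, key=lambda h: h.score, reverse=True)
    if by == "date":
        return sorted(hits, key=lambda h: h.captured, reverse=True)
    raise ValueError(f"unknown ordering: {by!r}")


hits = [
    Hit("http://example.org/news", date(2006, 1, 10), 0.91),
    Hit("http://example.org/news", date(2006, 3, 2), 0.40),
    Hit("http://example.org/about", date(2006, 2, 14), 0.65),
]
by_relevance = rank(hits)
by_date = rank(hits, by="date")
```

In a single-site archive the same page is captured many times, so date ordering lets a user walk through captures chronologically instead of seeing near-identical pages ranked by score.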

10
Text Indexing Challenges

11
Integration

12
The Big Picture
13
Future Challenges

Crawler trap detection
Scalability
Current setup can accommodate 300 partners at current crawling rates
During the pilot we crawled, indexed, and stored just over 100,000,000 documents (4 TB) in eight weeks
More machines can be easily added to storage and crawling clusters
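
The pilot figures imply some rough per-document and per-second rates, worked out below. The only inputs are the numbers on the slide; treating 4 TB as decimal terabytes and "just over 100,000,000" as exactly 100,000,000 are simplifying assumptions, so the results are approximate.

```python
DOCUMENTS = 100_000_000          # "just over 100,000,000 documents"
BYTES_STORED = 4 * 10**12        # 4 TB, assuming decimal terabytes
SECONDS = 8 * 7 * 24 * 60 * 60   # eight weeks

avg_doc_bytes = BYTES_STORED / DOCUMENTS
docs_per_second = DOCUMENTS / SECONDS
bytes_per_second = BYTES_STORED / SECONDS

print(f"average document size: {avg_doc_bytes / 1000:.0f} KB")
print(f"documents per second:  {docs_per_second:.1f}")
print(f"storage growth:        {bytes_per_second / 10**6:.2f} MB/s")
```

Under these assumptions the pilot averaged roughly 40 KB per document and about 20 documents per second across the whole crawl, index, and store pipeline.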

14
Scalability

Current Nutch is between versions
Old version has some non-distributable pieces
New version is much more distributable and scalable (map/reduce on Hadoop), but not ready for incremental indexing
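
Why map/reduce makes indexing more distributable can be seen in a toy inverted-index build. This is a single-process sketch of the pattern, not Nutch or Hadoop code: `map_phase`, `reduce_phase`, and the document IDs are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Reduce step: group pairs by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    "arc-1": "web archive crawl",
    "arc-2": "archive index",
    "arc-3": "web index search",
}

# Each map call depends only on its own document, so in a real cluster
# the calls could run on different machines; here they run in one loop.
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["archive"])
```

The sketch also hints at the incremental-indexing problem noted on the slide: the reduce step builds posting lists from the full set of pairs, so adding a few new documents is not naturally a small update.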

15
Looking ahead