ArchiveIt Architecture Introduction - PowerPoint PPT Presentation

Provided by: netEdu
Learn more at: http://net.educause.edu
Transcript and Presenter's Notes

1
Archive-It Architecture Introduction
  • April 18, 2006
  • Dan Avery
  • Internet Archive

2
Archive-It Components
  • Crawling
  • User Interface
  • Storage
  • Playback
  • Text Indexing
  • Integration

3
Component Integration

4
Crawling
  • Heritrix (http://crawler.archive.org/)
  • Java application
  • Open source (LGPL)
  • Crawls for completeness/depth
  • Highly configurable

5
Crawling - Distributed Crawling
  • Heritrix Cluster Controller
  • Java component - open source - developed by IA
  • http://crawler.archive.org/hcc
  • Provides proxy access to pool of Heritrix
    instances through JMX interface
  • Provides crawler control and status
  • Currently controlling 33 crawler instances on
    three commodity dual-Opteron machines; upper bound unknown
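
The proxy role described on this slide can be pictured with a small in-memory model. This is only a sketch: the class and method names (`CrawlerInstance`, `ClusterController`, `start_job`) are hypothetical, and the real HCC reaches Heritrix over JMX rather than through Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CrawlerInstance:
    """Stand-in for one Heritrix instance (hypothetical; the real HCC uses JMX)."""
    name: str
    current_job: Optional[str] = None

    def status(self) -> str:
        return "idle" if self.current_job is None else f"crawling:{self.current_job}"


@dataclass
class ClusterController:
    """Single entry point in front of a pool of crawler instances."""
    pool: List[CrawlerInstance] = field(default_factory=list)

    def start_job(self, job_name: str) -> str:
        """Hand the job to the first idle crawler and return that crawler's name."""
        for crawler in self.pool:
            if crawler.current_job is None:
                crawler.current_job = job_name
                return crawler.name
        raise RuntimeError("no idle crawler available")

    def status(self) -> Dict[str, str]:
        """Collect the status of every crawler in the pool."""
        return {c.name: c.status() for c in self.pool}
```

Callers only ever talk to the controller, which is what lets the pool grow without changes on the caller's side.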

6
Archive-It Web Application
  • User Interface and Crawl Scheduling
  • Gets seed URLs and crawl parameters from users
  • Schedules new periodic crawls
  • Talks to crawler pool through HCC
  • Provides access, search, and crawl history UI
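
As a rough illustration of periodic crawl scheduling, the sketch below decides which collections are due for a new crawl. The frequency names, the dictionary layout, and the function names are assumptions made for the example; the slides do not describe the actual scheduler.

```python
from datetime import datetime, timedelta

# Hypothetical frequency names; the slides do not list the real options.
FREQUENCIES = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_crawl(last_crawl: datetime, frequency: str) -> datetime:
    """Return when the next crawl should start for a given frequency."""
    return last_crawl + FREQUENCIES[frequency]


def due_crawls(collections: list, now: datetime) -> list:
    """Return the names of collections whose next crawl time has passed."""
    return [
        c["name"]
        for c in collections
        if next_crawl(c["last_crawl"], c["frequency"]) <= now
    ]
```

A web application along these lines would run `due_crawls` on a timer and pass each due collection's seeds and parameters on to the crawler pool.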

7
Storage
  • archive.org ARC repository
  • custom Perl system
  • simple storage on primary/backup pairs
  • monthly MD5 digest verification
  • robust, non-proprietary file format
  • Alexandria (Egypt)/Amsterdam
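
The production system is described as custom Perl. Purely as an illustration of digest verification on a primary/backup pair, here is a minimal Python sketch; the function names and the idea of a separately recorded digest are assumptions, not details from the slides.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary: Path, backup: Path, recorded_md5: str) -> dict:
    """Check both copies of a file against the digest recorded at ingest."""
    return {
        "primary_ok": md5_of_file(primary) == recorded_md5,
        "backup_ok": md5_of_file(backup) == recorded_md5,
    }
```

When only one copy fails the check, the other copy can serve as the source for repair, which is the point of keeping pairs.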

8
Access
  • Internet Archive Wayback Machine
  • Replaying archived web pages since 2001
  • Current IA version written in Perl and C, with
    components distributed across various machines
  • Not open source, but open source beta (in Java)
    available now

9
Full-Text Indexing
  • Nutch (http://nutch.org)
  • NutchWAX (http://archive-access.sf.net) additions
    create and search indexes of stored ARC files
  • Standard text search plus link analysis
  • Can search by date instead of relevance, useful
    for individual archives
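
The difference between relevance ordering and date ordering can be shown with a toy result list. This sketch does not use the Nutch or NutchWAX APIs; the `Hit` type and `rank` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Hit:
    """One search result: a captured URL, its text score, and its capture date."""
    url: str
    score: float
    captured: date


def rank(hits: List[Hit], by: str = "relevance") -> List[Hit]:
    """Order hits by score (default) or by capture date, newest first."""
    if by == "date":
        return sorted(hits, key=lambda h: h.captured, reverse=True)
    return sorted(hits, key=lambda h: h.score, reverse=True)
```

Within a single archive, where most hits come from the same few sites, date ordering tends to be the more useful view.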

10
Text Indexing Challenges
  • Some parts are distributable, some are not
  • Incremental indexing - goal of new crawls in
    index within 72 hours
  • Working on an Archive-It-usable map/reduce version,
    targeted for July
  • In the meantime, a lot of workarounds

11
Integration
  • Group of Perl and bash scripts - planning more
    complex than the execution
  • Most components available individually
  • Decentralized control, centralized monitoring
  • Each component operates almost entirely
    independently

12
The Big Picture

13
Future Challenges
  • Crawler trap detection
  • Scalability
  • Current setup can accommodate 300 partners at
    current crawling rates
  • During pilot we crawled/indexed/stored just over
    100,000,000 documents (4TB) in eight weeks
  • More machines can be easily added to storage and
    crawling clusters
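
The pilot figures imply some average rates that are easy to work out. The sketch below assumes exactly 100,000,000 documents, decimal terabytes, and a full eight weeks of continuous operation, so the results are approximations at best.

```python
DOCUMENTS = 100_000_000          # "just over" this many, per the slide
TOTAL_BYTES = 4 * 10**12         # 4 TB, assuming decimal terabytes
SECONDS = 8 * 7 * 24 * 60 * 60   # eight weeks, assuming continuous operation


def pilot_rates() -> dict:
    """Derive average document size and throughput from the pilot totals."""
    return {
        "avg_doc_bytes": TOTAL_BYTES / DOCUMENTS,
        "docs_per_second": DOCUMENTS / SECONDS,
        "bytes_per_second": TOTAL_BYTES / SECONDS,
    }


rates = pilot_rates()
print(f"average document size: {rates['avg_doc_bytes'] / 1000:.0f} KB")      # 40 KB
print(f"documents per second:  {rates['docs_per_second']:.1f}")               # 20.7
print(f"megabytes per second:  {rates['bytes_per_second'] / 1e6:.2f}")        # 0.83
```

These are whole-pipeline averages (crawl, index, store), not peak rates for any single component.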

14
Scalability
  • Current Nutch is between versions
  • Old version has some non-distributable pieces
  • New version is much more distributable and
    scalable (map/reduce - Hadoop), but not ready for
    incremental indexing
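
To show why a map/reduce formulation distributes well, here is a toy single-process version that builds an inverted index. It is not Nutch or Hadoop code; all function names are made up for the example, and a real job would run the map and reduce steps on separate machines.

```python
from collections import defaultdict


def map_phase(doc_id: str, text: str) -> list:
    """Emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def shuffle(pairs: list) -> dict:
    """Group emitted pairs by term, as the framework would between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: list) -> tuple:
    """Collapse the doc ids for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs: dict) -> dict:
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())
```

Each `map_phase` call touches one document and each `reduce_phase` call touches one term, so both steps can be split across machines independently.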

15
Looking ahead
  • After basic UI/archiving/indexing...
  • Time-based search UI
  • Analyzing archives for research and ongoing
    collection improvement
  • Content classification
  • Rate of change
  • New site suggestions

16
http://www.archive-it.org