IIPC Web Archiving Toolset

1
IIPC Web Archiving Toolset
2

Why?
  • Specific Requirements
  • Coverage quality of content
  • Time dimension
  • Ensure compatibility and 'pluggability' of the
    resulting collection
  • (global cross-access to collections)
  • Save money

3
How?
  • Common specifications
  • IIPC joint project (at least 2 partners)
  • Open source (GPL)

4
Architecture of tools for Web Archives
5
Our situation in the internet environment
6
  • The Web is an information organization that just
    works, so be faithful to it
  • Size problem

7
  • The Web is an information organization that just
    works, so be faithful to it
  • Server side complexity

8
Web server's use of DB
9
Web server's use of DB
10
To sum up
  • Size at which you operate
  • Time dimension you introduce
  • Limited faithfulness you can achieve (sorry for
    that Tim Robert)

11
Acquisition Chain (1)
  • Large-scale, archive-quality crawler: Heritrix
  • Specification in early 2003
  • Jointly developed by IA and the Nordic libraries
  • Strengths
      • Good at finding paths to content
      • Site priority implemented
      • Very configurable and modular
  • Next steps
      • Incremental crawls
      • Multi-machine

12
Acquisition Chain (2)
  • Smart Archiving Crawler Project
  • Specification in early 2003
  • Joint call for tender by BL and BnF
  • Goal: implement large-scale, automatically
    focused crawls
  • Priority based on citation linking and thematic
    assessment
  • Call in October, first prototype mid-2005
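
A crawl frontier ordered by a combined score is one way to picture the priority idea on this slide. The sketch below is only an illustration: the `Frontier` class, the two scores and their equal weights are hypothetical and are not taken from the Smart Archiving Crawler specification.

```python
import heapq


class Frontier:
    """Toy crawl frontier: URIs are popped in order of a combined score."""

    def __init__(self, citation_weight=0.5, theme_weight=0.5):
        # Hypothetical weights; a real crawler would tune or learn them.
        self.citation_weight = citation_weight
        self.theme_weight = theme_weight
        self._heap = []
        self._seen = set()

    def add(self, uri, citation_score, theme_score):
        # Ignore URIs that were already queued once.
        if uri in self._seen:
            return
        self._seen.add(uri)
        priority = (self.citation_weight * citation_score
                    + self.theme_weight * theme_score)
        # heapq is a min-heap, so the priority is negated to pop the best URI first.
        heapq.heappush(self._heap, (-priority, uri))

    def pop(self):
        """Return the queued URI with the highest combined score."""
        return heapq.heappop(self._heap)[1]
```

Both scores are assumed to be numbers in a comparable range (for example 0 to 1); how they would be computed from link structure and page content is outside this sketch.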

13
Acquisition Chain (3)
  • Deep Arc
  • Specification in early 2002
  • Developed by BnF
  • Goal: allow site producers to easily extract a DB
    to XML flat files
  • Portable, GUI- and wizard-based extractor
  • Available end of 2004
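
To make the "DB to XML flat files" goal concrete, here is a minimal sketch that dumps one table of a SQLite database to an XML string. It is not how Deep Arc works internally: the function name `table_to_xml` and the `table`/`row`/`field` element names are assumptions chosen for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> str:
    """Serialize every row of `table` as <row><field name="...">...</field></row>."""
    # The table name is interpolated into the SQL, so it must come from a trusted list.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]

    root = ET.Element("table", name=table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(record, "field", name=column)
            # NULL values become empty elements in this sketch.
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

A real extractor would also need to map relations between tables and preserve column types, which this flat dump ignores.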

14
ARC file management tools
  • Several tools already exist
  • Unify and release an official IIPC toolset to
    generate, parse, search and access ARC files
  • Early 2005

15
Access tools (1)
  • URI-based access
  • Display correctly in a controlled environment
  • Make it browsable in extension and time
  • URI canonicalization
  • Give you appropriate information on the fly
  • Start from achievements of NWA tools and
    experience of IA in this domain
  • 2005
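
URI-based access depends on mapping the many spellings of one address to a single lookup key. The sketch below shows that idea with a few common normalization rules. The rule set is an assumption made for illustration, not the canonicalization used by the NWA or IA tools.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(uri: str) -> str:
    """Reduce equivalent spellings of a URI to one lookup key (illustrative rules only)."""
    parts = urlsplit(uri.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # An empty path and "/" address the same resource.
    path = parts.path or "/"

    # Sort query parameters so that ordering differences do not create new keys.
    query = "&".join(sorted(p for p in parts.query.split("&") if p))

    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With these rules, `HTTP://WWW.Example.ORG:80/a?b=2&a=1#top` and `http://www.example.org/a?a=1&b=2` resolve to the same archive key. Real archives typically add further rules, such as stripping session identifiers.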

16
Access tools (2)
  • Large scale indexer
  • Basic search (boolean, proximity)
  • Time dimension
  • Distributed indexing
  • 100 M documents and more
  • Start from existing open source developments
    (Lucene, Nutch)
  • 2005
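
The search features listed here (boolean, proximity, time dimension) can be pictured with a tiny in-memory positional index. This is a teaching sketch only: `TimedIndex` is a hypothetical class, it is neither distributed nor related to the Lucene or Nutch APIs, and capture dates are assumed to be ISO strings such as `2004-09-16` so that they compare correctly as text.

```python
from collections import defaultdict


class TimedIndex:
    """Positional inverted index keyed by (uri, capture_date)."""

    def __init__(self):
        # term -> {(uri, capture_date): [token positions]}
        self.postings = defaultdict(dict)

    def add(self, uri, capture_date, text):
        for position, term in enumerate(text.lower().split()):
            self.postings[term].setdefault((uri, capture_date), []).append(position)

    def search_and(self, *terms, start=None, end=None):
        """Boolean AND, optionally restricted to captures dated within [start, end]."""
        doc_sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        hits = set.intersection(*doc_sets) if doc_sets else set()
        return sorted(
            (uri, date) for uri, date in hits
            if (start is None or date >= start) and (end is None or date <= end)
        )

    def search_near(self, a, b, distance):
        """Proximity: both terms occur within `distance` tokens of each other."""
        pa = self.postings.get(a.lower(), {})
        pb = self.postings.get(b.lower(), {})
        return sorted(
            doc for doc in pa.keys() & pb.keys()
            if any(abs(i - j) <= distance for i in pa[doc] for j in pb[doc])
        )
```

Keying postings by both URI and capture date is what lets the same page be found separately in each of its archived versions.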

17
Access tools (3)
  • DB query interface generator
  • Developed by NLA with partial IIPC funding
  • 2005

18
IIPC toolkit ready before mid-2006
  • Robust and scalable up to the global web
  • Implement IIPC standards (ARC 3.0, metadata,
    API)
  • Easy to install and use for advanced users (web
    archiving engineers)
  • Open source and available to the whole community
    of web archives

19
Enabling the web archives grid!
  • IIPC site www.netpreserve.org
  • Web Archive information list webarchive_at_cru.fr
  • Paper on Heritrix for IWAW 04 www.iwaw.net/04
  • julien_at_netpreserve.org