Title: The Web Archiving Service
1. The Web Archiving Service and the Web-at-Risk NDIIPP Project
Tracy Seneca, California Digital Library
National Digital Information Infrastructure and Preservation Program, Library of Congress
California Digital Library
New York University
University of North Texas
2. Overview
- Web archiving: what and why
- Web-at-Risk grant: scope and purpose
- Web Archiving Service: sample screens
3. Web archiving: what and why
4. Web Archiving Assumptions
- Using automated methods to gather web content
- Building some kind of collection composed of more than one site
- Intent on preserving captured content
- Results are searchable
- Public access may not be available
5. How is the material at risk?
- Vulnerability of:
  - Digital publications
  - Web publications
  - Government web publications
  - Local government web publications
6. The Ephemeral Web
7. Issues Unique to Government and Political Web Documents
- Publication notification streams
- Elections, political change
- Security vs. freedom of information
- Local agencies often don't have the resources to archive their own publications
8. Web-at-Risk grant: scope and purpose
9. Grant Scope: Jan 2005 - Jun 2009
- Build tools to allow librarians to capture, curate and preserve web-based government and political information.
- Create topical and event-based archives
- Capture individual sites and documents
- Assess the impact of these tools on traditional collection development practices.
- Explore web archiving service sustainability.
10. Project Partners
11. Web-at-Risk Collections
12. Beyond the Grant
- Support web archiving for the University of California
- Enable collaboration across campuses
- Enable collaboration between librarians and researchers/faculty
13. Web Archiving Service (WAS)
- Tangible outcome of grant work
- Being developed and released over a series of pilot tests
- Pilot test 5 underway until May 23
- 2008-2009: develop rights management and public access features
14. WAS Production
- Early summer 2008, Web Archiving Service goes into limited production.
- Available 24/7 to the curators who have taken part in the pilot tests so far
- Expand user community within UC as CDL confirms that WAS infrastructure, user support and training are sufficient.
15. Web Archiving Service: Workflow and Sample Screens
16. WAS workflow: Project > Site > Capture > Collection
- Set up a project (usually a topic or event)
- Define the sites to capture
- Run single or multiple captures of each site
- Choose which results to add to a single, searchable collection
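The four workflow steps above can be pictured as a small data model. The following is only a sketch for illustration: the class names, fields and methods are hypothetical and are not taken from WAS itself, the crawl is replaced by a file list passed in by hand, and the search is a simple substring match on file names rather than a real full-text search.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Capture:
    """One capture of a site on a given day (hypothetical fields)."""
    site_url: str
    captured_on: date
    files: list[str] = field(default_factory=list)


@dataclass
class Site:
    """A site the curator has defined for capture."""
    url: str
    captures: list[Capture] = field(default_factory=list)

    def run_capture(self, captured_on: date, files: list[str]) -> Capture:
        # A real service would crawl the site here; this sketch only
        # records the file list it is given.
        capture = Capture(self.url, captured_on, list(files))
        self.captures.append(capture)
        return capture


@dataclass
class Project:
    """A topic or event that groups the sites to capture."""
    name: str
    sites: list[Site] = field(default_factory=list)

    def add_site(self, url: str) -> Site:
        site = Site(url)
        self.sites.append(site)
        return site


@dataclass
class Collection:
    """A single searchable set of captures chosen by the curator."""
    name: str
    captures: list[Capture] = field(default_factory=list)

    def add(self, capture: Capture) -> None:
        self.captures.append(capture)

    def search(self, term: str) -> list[str]:
        # Substring match on file names, standing in for full-text search.
        return [f for c in self.captures for f in c.files if term in f]


# Project > Site > Capture > Collection, using made-up example data.
project = Project("Example election project")
site = project.add_site("http://example.org/")
first = site.run_capture(date(2008, 5, 1), ["index.html", "report-2008.pdf"])
second = site.run_capture(date(2008, 5, 8), ["index.html", "budget-2008.pdf"])

collection = Collection("Example collection")
collection.add(second)  # the curator chooses which results to include
print(collection.search("2008"))
```

Keeping captures separate from collections mirrors the last step of the workflow: a site can be captured many times, and only the results the curator selects end up in the searchable collection.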
18. Capture sites individually
19. Set Frequency
20. Add metadata (or not)
22. Sites can be captured in batches
23. When Capture Finishes
25. Display Results (QA capture effectiveness)
26. Display Results: Overview Reports
27. Display Results: Full Text Search
28. Display Results
29. Display Results (metadata)
31. Create Collection
32. Build Collection (add entire captures)
33. Build Collection
34. WAS features for analysis
- It's impossible to know what a web site contains until after you capture it!
- Tools for understanding where the data comes from and how it has changed.
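As a rough illustration of what "how it has changed" can mean between two captures of the same site, the sketch below compares them by URL and content hash. This is an assumed approach for illustration only; the slides do not describe how WAS performs its comparison, and the function name `compare_captures` and the example URLs are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a SHA-256 fingerprint of a captured file's content."""
    return hashlib.sha256(content).hexdigest()


def compare_captures(
    earlier: dict[str, bytes], later: dict[str, bytes]
) -> dict[str, list[str]]:
    """Classify URLs as new, removed, or changed between two captures.

    Each capture is modelled as a mapping from URL to file content.
    """
    earlier_hashes = {url: digest(body) for url, body in earlier.items()}
    later_hashes = {url: digest(body) for url, body in later.items()}
    return {
        "new": sorted(later_hashes.keys() - earlier_hashes.keys()),
        "removed": sorted(earlier_hashes.keys() - later_hashes.keys()),
        "changed": sorted(
            url
            for url in earlier_hashes.keys() & later_hashes.keys()
            if earlier_hashes[url] != later_hashes[url]
        ),
    }


# Made-up example data for two captures of the same site.
earlier = {
    "http://example.org/index.html": b"home page, version 1",
    "http://example.org/old.pdf": b"old report",
}
later = {
    "http://example.org/index.html": b"home page, version 2",
    "http://example.org/new.pdf": b"new report",
}
result = compare_captures(earlier, later)
print(result["new"])
print(result["removed"])
print(result["changed"])
```

The "new" list corresponds to the question of which publications first appear in a later capture, and the share of "changed" files gives one crude signal of how volatile a site is.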
35. What's the nature of this content?
36. What new publications are in this capture?
37. Build Collection (Select files from Compare screen)
38. How volatile is this site? (Not yet available)
39. Potential
- We can now capture the chit-chat, the popular reaction to historic events, in ways never before possible.
- How will researchers interact with captured content once it is in an archive?
- Visualization
- Text analysis
- What is the potential, beyond simple search and display?
40. Web Archive Visualization: Doantam Phan, Stanford University
41. Questions?
Web-at-Risk Wiki: http://wiki.cdlib.org/WebAtRisk
YouTube Video
Web-at-Risk Collections
tracy.seneca_at_ucop.edu