Title: The Web is a Mess: or How I Learned to Stop Worrying and Love Web Archiving
1 The Web is a Mess: or How I Learned to Stop Worrying and Love Web Archiving
Lori Donovan, Internet Archive
2 About Internet Archive
- We are a digital library
- Mission statement: universal access to all knowledge
- Founded by Brewster Kahle in San Francisco, California in 1996
- Largest publicly available web archive in existence
- Officially designated a library by the State of California in 2007
3 What is Web Archiving?
- The goal of web archiving is to document changes to web resources over time, archive them, and make them accessible.
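One way to picture "documenting changes over time" is to keep a timestamped record of each capture of a URL and note when the content actually differs from the previous capture. The sketch below is only an illustration under stated assumptions: the names `Capture` and `ResourceHistory` are hypothetical, and it is not a description of how any particular archiving tool stores its data.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Capture:
    timestamp: str  # assumed format: YYYYMMDD
    digest: str     # SHA-256 of the captured bytes


@dataclass
class ResourceHistory:
    """Hypothetical record of every capture made of one URL."""
    url: str
    captures: list = field(default_factory=list)

    def record(self, timestamp: str, content: bytes) -> bool:
        """Store a capture; return True if the content changed since the last one."""
        digest = hashlib.sha256(content).hexdigest()
        changed = not self.captures or self.captures[-1].digest != digest
        self.captures.append(Capture(timestamp, digest))
        return changed

    def change_points(self) -> list:
        """Timestamps at which the captured content differed from the prior capture."""
        points = []
        previous = None
        for capture in self.captures:
            if capture.digest != previous:
                points.append(capture.timestamp)
            previous = capture.digest
        return points
```

Every capture is kept, even an unchanged one, so the history shows both when a page was visited and when it actually changed.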
4 What is a Web Archive?
- A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address.
- A web archive contains as much as possible from the original resources. It is a priority to recreate the same experience a user would have had if they had visited the live site.
5 Why Web Archiving?
- Billions of people around the world have grown accustomed to using the web as their primary resource for acquiring information.
- The web is a crucial part of our culture and our social fabric, and we don't want to lose any of it, so it is essential that we collect and preserve these digital resources and make them accessible in creative ways.
- The availability of this digital information is taken for granted, and it is a fallacy that if something is on the web it will be there forever.
6 Limited lifespan of a webpage
- It is a fairly common misconception that content that exists on the web will remain there forever.
- A report in Scientific American puts the average lifespan of a webpage at 44 days.
- A subsequent academic study in IEEE suggests 75 days.
- A Washington Post article indicates the number is 100 days.
- Over 95% of government information today is born-digital, but less than 50% is being maintained with an active preservation plan. (State of the Federal Web Report)
7 Historically important events for researchers and scholars
- "Much of the record of any historic event in today's world is born digital. And many items born in print are also available in digital form, or soon will be. To understand major world events, not only disasters but political upheavals, and to keep a record and a memory of them for survivors, for scholars, for policy-makers, and for a wider public, it is simply essential that we collect and preserve these digital resources and make them accessible in creative ways."
- Andrew Gordon, Harvard University
8 It's a requirement.
- Records retention policy: several state and federal laws or policies require universities to maintain various statistics and reports.
- Responsibility to preserve things like course information, course roster information, and policy documents, which now show up only as digital content
9 The Role of Libraries
- Libraries and archives have long collected information that serves scholars and the general public in understanding history, culture, and society.
- So much of today's information is easily (and only) found on the world wide web: web pages have replaced hard copy records and documents, blogs are today's diaries, and newspapers and socio-political commentary exist solely online.
- As part of an effort to appropriately document and capture today's information for tomorrow's use, institutions must adopt a web archiving strategy.
- However, for many institutions, the prospect of capturing and storing web pages, websites, or entire web domains is daunting.
10 About Archive-It
- First deployed in February 2006
- Web-based application allowing users to create, manage, and preserve collections of digital content
- Includes tools for selection and scoping, harvesting, cataloging with metadata, full-text search, and QA
- Ability to capture content using 10 different crawl frequencies
- Archived content includes HTML, videos, audio, PDF, images, social networking sites, online newspapers
- View archived content within 24 hours after a capture is complete
- Annual subscription service; includes hosting, access, and storage (primary and back-up)
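To make the idea of a crawl frequency concrete, the sketch below computes when a seed would next be captured given its last capture time and a chosen frequency. The slides do not list the ten frequencies, so the names in `FREQUENCIES` are hypothetical placeholders, and `next_crawl` is an illustrative helper rather than part of Archive-It.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical frequency names; the actual Archive-It options are not given here.
FREQUENCIES = {
    "once": None,                  # capture a single time, never repeat
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_crawl(last_crawl: datetime, frequency: str) -> Optional[datetime]:
    """Return the next scheduled capture time, or None for a one-time crawl."""
    interval = FREQUENCIES[frequency]
    if interval is None:
        return None
    return last_crawl + interval
```

For example, `next_crawl(datetime(2011, 3, 1), "weekly")` yields March 8, 2011, while a seed set to `"once"` is never rescheduled.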
11 Who Uses Archive-It?
205 partners around the world in 43 U.S. states and 15 countries
12 How Partners Use Archive-It
13 Archive-It Use Cases
- Essential part of a mandate to capture and preserve institutional memory and history. Construct a historical record of an institution's web presence over time.
- Capture state/local agency publications that aren't being deposited in print form. Collect and aggregate state/local government websites and presence.
- Capture websites that relate to historical/traditional collections and link them with existing collections around the same thematic focus.
- Create a thematic/topical web archive on a specific subject or event, including different perspectives and social commentary (tweets, blogs, comments). Gather thematically related resources of value to researchers and scholars.
- Support an electronic records system to meet records retention requirements.
- Closure crawls
14 Stanford University/New York University Islamic and Middle Eastern Collection
- Purpose: harvest and preserve Iranian blogs
- Archiving 300 blogs written by and for Iran and the Iranian people
- Includes coverage of the 2009 Iranian elections and the current Middle East unrest
17 University of Texas at Austin LAGDA
- Purpose: archive documents from 18 different countries and 300 government ministries/presidencies.
- Content includes:
- Full-text versions of official documents
- Original video and audio recordings of key regional leaders
- Thousands of annual and "state of the nation" reports
- Specific collections for Latin American elections and political parties
18 University of Texas at Austin LANIC Honduras Presidential site 2008 (before the Coup)
19 University of Texas at Austin LANIC Honduras Presidential site 2009 (during the Coup)
20 University of Texas at Austin LANIC Honduras Presidential site (after the Coup)
21 Electronic Literature Organization
- Purpose: archive born-digital literature, that is, works created explicitly for the computer.
- ELO seeks to foster and promote the reading, writing, teaching, and understanding of literature as it develops in a digital environment
- Content includes individual works, collections and journals, poems and stories
23 Indiana University
- Purpose: archive all university records to maintain strong electronic records systems
- Main university website, 8 different campus websites, and other organizations on campus: university culture, teacher blogs, student groups, and online publications
24 Indiana University: Main University website
25 Columbia University
- Purposes:
- Archive copies of its university web presence in order to meet required mandates
- Archive websites on thematic/topical subjects.
26 Columbia University Human Rights Collection
27 Columbia University Avery Architectural & Fine Arts Library
28 Columbia University Archives Collection
29 North Carolina State Archives and State Library of North Carolina
- Purpose: archive state agency websites and publications
- Includes pages in a variety of formats: text, images, audio, video, and social networking sites
31 Access to Collections
- Partners
- Can view through private web application with login/password
- General Public
- Can view from Archive-It website http://www.archive-it.org/
- Can view from organization's website from a landing page that links back to Archive-It hosted data
- Host from organization's own servers
- Restricted and private access options are available
32 What's next for Archive-It
- Collaboration and Partnerships
- Web application development
- Continue to develop features and functionalities requested by partners
- Enhance our preservation policy/access model
- Integrate our data with partners' external services, systems, and catalogs
33 Thank you!
Lori Donovan
Partner Specialist
lori_at_archive.org
Questions?