Title: Kulturarw
1Kulturarw³
- Capturing the web
- The Swedish experience
- www.kb.se/kw3
2Content
- Background
- Kulturarw3
- goals
- strategy
- Sweden on the net?
- Harvesting
- Software
- Finding links
- problem
- Statistics
- What have we got?
- The Archive
- priorities
- storage
- what we save
- Development
- IIPC
- Tools, format
- conclusion
3Background
- Legal deposit, 1661
- Latest revision 1993
- Only electronic documents in fixed form
- CD-ROM, diskettes
- New law
- July 1st, 2002: exception from personal privacy law
- First Swedish web newspaper lost
- Printed newspapers since 1645
- Kulturarw3 started 1996
- Still waiting for new legal deposit law
4Goals
- All web pages in Sweden
- pictures, video etc.
- .se and other top-level domains
- Electronic journals
5Strategy: two choices
- Select what is important
- How to know what will be considered important in the future?
- Labour-intensive
- Everything, using automatic software
- Gets everything (well, not really)
- Less labour-intensive
6Strategy
- Take snapshots of the Swedish web a few times each year
- Gets all
- Needs less labour
- Computer memory is cheap
- However, large volumes make quality control difficult
- Selective harvesting: about 150 newspapers every day
- In the future: events, e.g. elections
- With as little human intervention as possible.
7Sweden on the web?
- http://www.kb.se/kbstart.htm
- Only the domain part relevant
- .se
- .nu (Niue): popular in Sweden, since "nu" means "now" in Swedish
- Others, if the server is geographically located in Sweden
- Language?
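The scoping rules on this slide can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the project's actual filter: the name `is_swedish` and the `server_in_sweden` callback (standing in for some geolocation lookup) are assumptions made for the example.

```python
from urllib.parse import urlsplit

# TLDs the slide treats as Swedish regardless of server location.
SWEDISH_TLDS = {"se", "nu"}


def is_swedish(url, server_in_sweden=lambda host: False):
    """Return True if the URL's host counts as Swedish under the slide's rules.

    Only the domain part of the URL is inspected. `server_in_sweden` is a
    hypothetical callback (for example backed by an IP geolocation lookup);
    the default treats every host outside .se and .nu as foreign.
    """
    host = urlsplit(url).hostname
    if host is None:  # no scheme or no host, e.g. "www.kb.se/kw3"
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in SWEDISH_TLDS:
        return True
    return server_in_sweden(host)
```

For example, `is_swedish("http://www.kb.se/kbstart.htm")` is true because of the `.se` domain, while a `.com` host is only accepted if the callback says its server is in Sweden.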
8Harvesting software
- A harvester (crawler, spider) collects web pages
by automatically following links and saving pages
- Open-source harvester: Heritrix
- Main developer: Internet Archive (IA)
- Written in Java. Active community.
- Designed for archiving, not indexing.
- Earlier: modified version of Combine
- From NetLab, Lund University.
- Important! Indexing isn't archiving and archiving isn't indexing!
- Also collects pictures, sound etc.
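The basic harvesting loop (follow links, save pages) can be sketched as below. This is a toy illustration, not how Heritrix or Combine is implemented: `harvest`, `fetch` and `in_scope` are hypothetical names, and `fetch` is passed in so the sketch can run against an in-memory set of pages instead of the live web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect link targets from href and src attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def harvest(seed, fetch, in_scope, max_pages=1000):
    """Breadth-first crawl from `seed`; returns {url: saved bytes}.

    `fetch(url)` is a hypothetical function returning (content_type, body)
    or None on failure; `in_scope(url)` decides which links are followed.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue
        content_type, body = result
        archive[url] = body  # save everything, including pictures and sound
        if not content_type.startswith("text/html"):
            continue  # only HTML is parsed for further links
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            target = urldefrag(urljoin(url, link)).url  # absolute, no #fragment
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append(target)
    return archive
```

Note that links embedded via `src` (inline images and similar) are queued and saved like any other page, which is the difference from an indexer that only cares about text.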
9Problems?
- or challenges if you are an optimist
- Scripts
- Interactive pages
- Password protected
- Video/streaming material
- Social sites
10Statistics: what did we get?
- Bulk crawls (everything Swedish)
- First sweep 1997, only .se
- 6.8 million files
- 160 GB data
- A sweep 2007-2008, .se and other TLDs
- 270 million files
- 11500 GB data
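As a rough reading of these figures, the average file size and the growth between the two sweeps can be worked out as follows. The only inputs are the numbers on the slide; treating GB as 10**9 bytes is an assumption.

```python
# Figures from the slide; GB is assumed to mean 10**9 bytes.
sweeps = {
    "1997": {"files": 6.8e6, "bytes": 160e9},
    "2007-2008": {"files": 270e6, "bytes": 11500e9},
}


def average_file_size_kb(sweep):
    """Average size per file in kB (1 kB = 1000 bytes)."""
    return sweep["bytes"] / sweep["files"] / 1000


for name, sweep in sweeps.items():
    print(f"{name}: about {average_file_size_kb(sweep):.1f} kB per file")

growth_files = sweeps["2007-2008"]["files"] / sweeps["1997"]["files"]
growth_bytes = sweeps["2007-2008"]["bytes"] / sweeps["1997"]["bytes"]
print(f"files grew about {growth_files:.0f}x, data about {growth_bytes:.0f}x")
```

Under that assumption the average file grows from roughly 24 kB to roughly 43 kB, so the data volume grew faster (about 72 times) than the number of files (about 40 times).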
11Statistics: what did we get?
- Periodika (newspapers)
- Started June 2002
- 88 million URLs
- 4.0 TB
- About 40 000 URLs every day
12More statistics
- Bulk (everything Swedish)
- 823 100 web servers (including inlines)
- 651 700 Swedish
- - .se 50%
- - .nu 21%
- - others 29%
- 1549 different MIME types found.
- HTML about 50%
- text/html, image/gif, image/jpeg, application/pdf and text/plain: about 97% of the documents.
- A lot of garbage, misspellings etc.
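A tally like the one above could be produced along these lines. This is a minimal sketch with hypothetical function names, assuming the input is the raw `Content-Type` value recorded for each document; it does not try to repair misspelled types, so garbage values simply show up as their own entries.

```python
from collections import Counter


def normalise_mime(content_type):
    """Reduce a raw Content-Type value to a bare lower-case type/subtype."""
    return content_type.split(";", 1)[0].strip().lower()


def mime_shares(content_types):
    """Return {mime_type: share of documents in percent}, most common first."""
    counts = Counter(normalise_mime(value) for value in content_types)
    total = sum(counts.values())
    return {mime: 100 * n / total for mime, n in counts.most_common()}
```

With the input `["text/html; charset=utf-8", "TEXT/HTML", "image/gif", "text/htm"]`, the two HTML variants are merged into one entry, while the misspelled `text/htm` is counted as a separate type.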
13Trends
- HTML stable, 50-60%. Increasing lately
- JPEG increasing: 11% (-97), 27% (-05)
- GIF decreasing: 23% (-97), 11% (-05)
- PDF increasing, from 9th to 4th position
14Accessing the archive
First priority is to access the archive using traditional web technologies.
- Surf, in space and time
- Free text search
- NB: not using traditional library methods (cataloguing etc.)
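One way to read "surf in time" is: given a URL and a wanted date, show the capture closest to that date. The sketch below illustrates that lookup only; `nearest_capture` is a hypothetical helper and is not taken from any of the access tools.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_capture(captures, wanted):
    """Return the capture time closest to `wanted`.

    `captures` is assumed to be a sorted, non-empty list of datetimes
    recording when one URL was harvested.
    """
    i = bisect_left(captures, wanted)
    if i == 0:
        return captures[0]
    if i == len(captures):
        return captures[-1]
    before, after = captures[i - 1], captures[i]
    # On a tie, prefer the earlier capture.
    return before if wanted - before <= after - wanted else after
```

For a URL captured in May 1997, June 2002 and November 2007, asking for January 2003 would return the June 2002 capture.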
15The Archive, what we save
Everything associated with an object, incl. metadata, is saved in one file.
One unit (file) in the archive:
- Metadata from the harvesting process
- Metadata about the object (from the server)
- The object (in its original form)
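The three-part unit described above can be sketched as a multipart MIME message using Python's standard `email` package. The exact layout here (two text parts followed by one binary part) and the names `pack_unit` and `unpack_unit` are assumptions for illustration, not the archive's actual file layout.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack_unit(harvest_metadata, server_metadata, obj):
    """Bundle two metadata texts and the raw object bytes into one file body."""
    unit = MIMEMultipart("mixed")
    unit.attach(MIMEText(harvest_metadata, "plain", "utf-8"))
    unit.attach(MIMEText(server_metadata, "plain", "utf-8"))
    unit.attach(MIMEApplication(obj))  # base64-encoded, so the bytes survive intact
    return unit.as_bytes()


def unpack_unit(data):
    """Return (harvest_metadata, server_metadata, obj) from a packed unit."""
    parts = message_from_bytes(data).get_payload()
    harvest, server, obj = (part.get_payload(decode=True) for part in parts)
    return harvest.decode("utf-8"), server.decode("utf-8"), obj
```

Keeping the object and both kinds of metadata in one file means a unit can be understood on its own, without a separate database.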
16Development
- International Internet Preservation Consortium (IIPC)
- Started by Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, UK, France, Italy, Canada, Australia and USA (LoC)
- Now many more
- Develop common standards, tools and methods for web archiving.
- Raise awareness
17Development, standards
- Archiving formats
- Earlier formats
- MIME multipart (Multipurpose Internet Mail Extensions)
- ARC
- NedLib
- WARC (Web ARChive file format)
- File format for saving web material
- Each web page is one record in a WARC file
- A record contains metadata and content
- ISO 28500.
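To make "a record contains metadata and content" concrete, here is a simplified sketch of writing one WARC `response` record: a block of named header fields, a blank line, the content, and a terminating blank-line pair. It covers only a handful of fields and is not a complete or validated implementation of ISO 28500; the function name is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_bytes, when=None):
    """Serialise one simplified WARC 'response' record as bytes.

    `http_bytes` is the raw HTTP response (status line, headers, body).
    `when` should be a UTC datetime; it defaults to the current time.
    """
    when = when or datetime.now(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The content block is followed by two CRLFs that end the record.
    return head + http_bytes + b"\r\n\r\n"
```

A WARC file is then a concatenation of such records, one per harvested page or resource.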
18Development, Tools
- Tools
- Harvesting: Heritrix
- Designed for archiving (NOT a modified indexer)
- Open source: Java, Linux etc.
- Supported by IIPC
- Mainly developed by Internet Archive, with contributions
- Will support (or already supports) WARC. Supports ARC and MIME
- Surfing tools
- New Wayback Machine
- WERA: surf with time line
- WAXToolbar: support when using the new Wayback Machine (WM)
- NutchWAX
- Free text search (with time line)
- Curator tool
- Possible for a non-technician to do collection and quality control
19Advice
- Use open standards, open source → IIPC
- Get users of the archive
- Think big. Hundreds of terabytes, billions of files
- Accept that what you do is a best effort
20Conclusion
- The web is constantly changing → continuous development.
- Possible to get a reasonable picture of the web. But never complete!
- Do something now
21Questions? Comments?
22Links
- IIPC: www.netpreserve.org
- Kulturarw3: www.kb.se/kw3
- Internet Archive: www.archive.org