Kulturarw - PowerPoint PPT Presentation

Slides: 22
Provided by: AllanAr2

1
Kulturarw³
  • Capturing the web
  • The Swedish experience
  • www.kb.se/kw3

2
Content
  • Background
  • Kulturarw3
  • goals
  • strategy
  • Sweden on the net?
  • Harvesting
  • Software
  • Finding links
  • problem
  • Statistics
  • What have we got?
  • The Archive
  • priorities
  • storage
  • what we save
  • Development
  • IIPC
  • Tools, format
  • conclusion

3
Background
  • Legal deposit, 1661
  • Latest revision 1993
  • Only electronic documents in fixed form
  • CD-ROM, diskettes
  • New law
  • July 1, 2002: exception from the personal privacy
    law.
  • First Swedish web newspaper lost
  • Printed newspapers since 1645
  • Kulturarw3 started 1996
  • Still waiting for new legal deposit law

4
Goals
  • All web pages in Sweden
  • pictures, video etc.
  • .se and other top-level domains
  • Electronic journals

5
Strategy: two choices
  • Select what is important. How to know what will
    be considered important in the future?
    Labour-intensive.
  • Everything, using automatic software. Gets
    everything (well, not really). Less
    labour-intensive.

6
Strategy
  • Take snapshots of the Swedish web a few times
    each year
  • Gets all
  • Needs less labour
  • Computer memory is cheap
  • However, large volumes make quality control
    difficult
  • Selective harvesting: about 150 newspapers every
    day
  • In the future: events, e.g. elections. With as
    little human intervention as possible.

7
Sweden on the web?
  • http://www.kb.se/kbstart.htm
  • Only the domain part is relevant
  • .se
  • .nu (Niue), popular in Sweden: "nu" means "now"
    in Swedish
  • Others if the server is geographically located
    in Sweden
  • Language?
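
The scoping rule on this slide can be sketched as a small function. This is only an illustration under stated assumptions, not the project's actual code: the name `is_in_scope` and the `server_in_sweden` callback are hypothetical, and the geographic lookup is left as a stub because the slides do not say how server location is determined.

```python
from urllib.parse import urlsplit

# TLDs the slide treats as Swedish regardless of where the server is.
SWEDISH_TLDS = {"se", "nu"}


def is_in_scope(url, server_in_sweden=lambda host: False):
    """Return True if the URL counts as part of the Swedish web.

    Only the domain part is inspected; the path is ignored.
    `server_in_sweden` is a hypothetical hook for a geographic lookup
    (assumption: the real lookup method is not described in the slides).
    """
    host = urlsplit(url).hostname
    if host is None:  # no scheme or no host, e.g. "www.kb.se/kw3"
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in SWEDISH_TLDS:
        return True
    # Other TLDs count only if the server is located in Sweden.
    return server_in_sweden(host)


print(is_in_scope("http://www.kb.se/kbstart.htm"))
print(is_in_scope("http://example.com/"))
```

Language is left out of the sketch, since the slide only raises it as an open question.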

8
Harvesting software
  • A harvester (crawler, spider) collects web pages
    by automatically following links and saving pages
  • Open-source harvester: Heritrix
  • Main developer: Internet Archive (IA)
  • Written in Java. Active community.
  • Designed for archiving, not indexing.
  • Earlier: modified version of Combine
  • From NetLab, Lund University.
  • Important! Indexing isn't archiving and archiving
    isn't indexing!
  • Also collects pictures, sound etc.
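
What "following links and saving pages" means can be shown with a toy harvester. This is a minimal sketch, not how Heritrix or Combine work internally: `harvest`, `LinkExtractor` and `TOY_WEB` are names invented for the example, and the "web" is an in-memory dictionary so the code runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects href and src values, so pictures and other inlines are found too."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def harvest(seed, fetch, max_pages=100):
    """Breadth-first harvest: save each page, then queue the links it contains.

    `fetch(url)` is supplied by the caller and returns the page text, or None
    if the URL cannot be retrieved.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        archive[url] = page  # archiving: keep the page itself, not an index of it
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# Toy in-memory web (made-up pages) standing in for real HTTP requests.
TOY_WEB = {
    "http://example.se/": '<a href="/a.html">A</a> <img src="logo.gif">',
    "http://example.se/a.html": '<a href="/">home</a> <a href="missing.html">x</a>',
    "http://example.se/logo.gif": "",
}
archive = harvest("http://example.se/", TOY_WEB.get)
print(sorted(archive))
```

A production harvester adds much more (politeness delays, robots.txt handling, scoping rules, binary content), which is why a purpose-built tool is used in practice.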

9
Problems?
  • Or challenges, if you are an optimist
  • Scripts
  • Interactive pages
  • Password protected
  • Video/streaming material
  • Social sites

10
Statistics: what did we get?
  • Bulk crawls (everything Swedish)
  • First sweep, 1997, only .se: 6.8 million files,
    160 GB of data
  • A sweep in 2007-2008, .se and other TLDs: 270
    million files, 11 500 GB of data
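
The two sweeps can be compared with a little arithmetic. The sketch below only derives averages and growth factors from the figures on the slide; it assumes decimal units (1 GB = 10^9 bytes), which the slide does not state.

```python
GB = 10**9  # assumption: decimal gigabytes

# Figures taken from the slide.
sweeps = {
    "1997": {"files": 6.8e6, "bytes": 160 * GB},
    "2007-2008": {"files": 270e6, "bytes": 11_500 * GB},
}


def average_file_size_kb(sweep):
    """Average size per harvested file, in kilobytes (1 kB = 1000 bytes)."""
    return sweep["bytes"] / sweep["files"] / 1000


averages = {name: average_file_size_kb(s) for name, s in sweeps.items()}
file_growth = sweeps["2007-2008"]["files"] / sweeps["1997"]["files"]
data_growth = sweeps["2007-2008"]["bytes"] / sweeps["1997"]["bytes"]

for name, kb in averages.items():
    print(f"{name}: about {kb:.1f} kB per file")
print(f"files grew about {file_growth:.0f}x, data about {data_growth:.0f}x")
```

Under that assumption the average file grows from roughly 24 kB to roughly 43 kB, so the data volume grew faster than the file count. The two sweeps cover different sets of domains, so the comparison is indicative only.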

11
Statistics: what did we get?
  • Periodicals (newspapers)
  • Started June 2002
  • 88 million URLs
  • 4.0 TB
  • About 40 000 URLs every day

12
More statistics
  • Bulk (everything Swedish)
  • 823 100 web servers (including inlines)
  • 651 700 Swedish, of which:
  • .se: 50%
  • .nu: 21%
  • others: 29%
  • 1549 different MIME types found.
  • HTML about 50%
  • text/html, image/gif, image/jpeg, application/pdf
    and text/plain: about 97% of the documents.
  • A lot of garbage, misspellings etc.
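
A tally like the one on this slide can be produced by counting the Content-Type value reported for each harvested file. The sketch below is illustrative only: `normalise_mime`, `mime_shares` and the sample values are made up, and it does not reflect how the Kulturarw3 statistics were actually computed.

```python
from collections import Counter


def normalise_mime(content_type):
    """Lower-case a Content-Type value and drop parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def mime_shares(content_types):
    """Return each MIME type's share of all documents, most common first."""
    counts = Counter(normalise_mime(ct) for ct in content_types)
    total = sum(counts.values())
    return {mime: count / total for mime, count in counts.most_common()}


# Made-up sample; "text/hmtl" stands in for the misspellings servers send.
sample = [
    "text/html; charset=utf-8",
    "TEXT/HTML",
    "image/gif",
    "image/jpeg",
    "text/hmtl",
]
shares = mime_shares(sample)
for mime, share in shares.items():
    print(f"{mime}: {share:.0%}")
```

Misspelled types are counted as they arrive rather than corrected, which is one way a long tail of distinct MIME types can build up.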

13
Trends
  • HTML stable, 50-60%. Increasing lately
  • JPEG increasing: 11% (-97), 27% (-05)
  • GIF decreasing: 23% (-97), 11% (-05)
  • PDF increasing, from 9th to 4th position

14
Accessing the archive
  • First priority is to access the archive using
    traditional web technologies.
  • Surf, in space and time
  • Free text search
  • NB: not using traditional library methods
    (cataloguing etc.)
15
The archive: what we save
  • Everything associated with an object, incl.
    metadata, is saved in one file
  • One unit (file) in the archive contains:
  • Metadata from the harvesting process
  • Metadata about the object (from the server)
  • The object (in its original form)
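
The three-part unit described on this slide can be sketched as a multipart MIME file, since MIME is listed among the earlier archiving formats later in this presentation. The exact layout Kulturarw3 used is not given in the slides, so the part order, the content types and the names `build_unit` and `read_unit` are assumptions made for illustration.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_unit(harvest_meta, server_meta, payload):
    """Pack one archive unit: two metadata parts followed by the object itself."""
    unit = MIMEMultipart("mixed")
    unit.attach(MIMEText(harvest_meta, "plain", "utf-8"))  # from the harvesting process
    unit.attach(MIMEText(server_meta, "plain", "utf-8"))   # what the server reported
    unit.attach(MIMEApplication(payload))                  # the object, bytes unchanged
    return unit.as_bytes()


def read_unit(raw):
    """Unpack a unit written by build_unit into its three parts."""
    harvest_part, server_part, object_part = message_from_bytes(raw).get_payload()
    return (
        harvest_part.get_payload(decode=True).decode("utf-8"),
        server_part.get_payload(decode=True).decode("utf-8"),
        object_part.get_payload(decode=True),
    )


# Made-up example values.
raw = build_unit(
    "url: http://www.kb.se/kbstart.htm\nharvested: 2008-01-01",
    "Content-Type: text/html\nLast-Modified: unknown",
    b"<html><body>Hej</body></html>",
)
harvest_meta, server_meta, original = read_unit(raw)
print(harvest_meta)
print(original)
```

Keeping all three parts in one file means an object never gets separated from the information about how and when it was collected.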
16
Development
  • International Internet Preservation Consortium
    (IIPC)
  • Started by the Internet Archive and the national
    libraries of Sweden, Norway, Finland, Denmark,
    Iceland, the UK, France, Italy, Canada, Australia
    and the USA (LoC). Now many more.
  • Develop common standards, tools and methods for
    web archiving.
  • Raise awareness

17
Development, standards
  • Archiving formats
  • Earlier formats:
  • MIME (Multipurpose Internet Mail Extensions)
  • ARC
  • NedLib
  • WARC (Web ARChive file format)
  • File format for saving web material. Each web
    page is one record in a WARC file. A record
    contains metadata and content.
  • ISO 28500.
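
The idea of "one record = metadata plus content" can be sketched by serialising a single WARC record by hand. This is a minimal illustration using only the standard library, with a handful of header fields; the function name `warc_record` is made up, and real archives would normally rely on an established WARC library and on the full set of fields defined in ISO 28500.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri, content, record_type="resource", content_type="text/html"):
    """Serialise one WARC/1.0 record.

    Layout: version line, header fields, blank line, content block,
    then two CRLFs that terminate the record.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(content)}",  # length of the content block in bytes
    ]
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + content + b"\r\n\r\n"


record = warc_record("http://www.kb.se/kw3", b"<html></html>")
print(record.decode("utf-8"))
```

A WARC file is then simply a sequence of such records, one after another, usually compressed.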

18
Development, Tools
  • Tools
  • Harvesting: Heritrix
  • Designed for archiving (NOT a modified indexer)
  • Open source: Java, Linux etc.
  • Supported by IIPC
  • Mainly developed by Internet Archive with
    contributions
  • Will support (or already supports) WARC.
    Supports ARC and MIME
  • Surfing tools
  • New Wayback Machine
  • WERA: surf with a timeline
  • WAXToolbar: support when using the new Wayback
    Machine
  • NutchWax
  • Free text search (with a timeline)
  • Curator tool
  • Possible for a non-technician to do collection
    and quality control

19
Advice
  • Use open standards and open source: IIPC
  • Get users of the archive
  • Think big. Hundreds of terabytes, billions of
    files
  • Accept that what you do is a best effort

20
Conclusion
  • The web is constantly changing: continuous
    development.
  • Possible to get a reasonable picture of the web.
    But never complete!
  • Do something now

21
Questions? Comments?
22
Links
  • IIPC: www.netpreserve.org
  • Kulturarw3: www.kb.se/kw3
  • Internet Archive: www.archive.org