No slide title (Geen diatitel) - PowerPoint PPT Presentation

1
Archiving/Infrastructure Costs in the Humanities
Peter Wittenburg, Max Planck Institute for Psycholinguistics, Nijmegen, NL
2
the digital trap
45 TB, increase 12 TB
  • since 1995: technologists believe in an all-digital domain
  • since 2000: researchers share this belief
  • since then: a rapid increase in storage capacity
    requirements
  • since 2002: an awareness of (long-term) digital
    archiving
  • large heterogeneity (video, audio, texts, EEG,
    fMRI, eye tracking, motion tracking, etc.)

3
DOBES Impulse
Of the 6,500 languages, one dies every two weeks!
  • 2000: decision of the Volkswagen Foundation to fund
    the Endangered Languages Programme as a distributed
    international enterprise
  • 45 teams working completely independently with
    one digital archive
  • increased awareness of the need for long-term
    preservation
  • relevant for linguistic theory building, but
    also part of cultural heritage

4
efforts for long-term preservation
  • 6 copies at large computer centers with a
    technology migration strategy
  • replacement of technology at regular time
    intervals
  • data exchange with an increasing number of
    regional repositories set up by the MPI team
    (about 10 new requests)
  • unique effort but necessary
  • 80% of our recordings are endangered (UNESCO)

5
living archives 1
  • digital archives in the research world are
    living entities
  • Digital Dilemma Report of the Academy of Motion
    Picture Arts and Sciences
  • until now a "file and forget system" in an old
    mine, with a 50% survival rate of films after
    50 years
  • costs for conventional master copy: 1,059
  • costs for digital master copy: 12,514
  • reason for the cost factor: digital archiving
    is a dynamic process
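
The cost factor mentioned above follows directly from the two figures quoted on the slide. A minimal sketch of the arithmetic, assuming both figures are in the same unit (the slide does not state a currency or period; the function name is illustrative only):

```python
# Figures as quoted on the slide from the Digital Dilemma report.
# Assumption: both are expressed in the same unit and period.
conventional_master_cost = 1_059
digital_master_cost = 12_514


def cost_factor(digital: float, conventional: float) -> float:
    """Return how many times more expensive the digital copy is."""
    return digital / conventional


factor = cost_factor(digital_master_cost, conventional_master_cost)
print(f"digital / conventional = {factor:.1f}")
```

With the slide's figures the digital master copy comes out roughly twelve times as expensive as the conventional one.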

6
living archives 2
  • changes/migration at different levels
  • storage technology migration: old technology
    exhibits errors
  • encoding formats change:
    MPEG1 -> MPEG2 -> H.264 (MPEG4) -> mJPEG2000
  • representation formats/schemas change:
    xx -> XML-specific -> XML generic ->
    RDF/OWL/SKOS
  • software technology changes
  • etc.
  • research archives are even worse
  • new versions and continuous extensions
  • commentary, enrichments, relations
  • virtual collection building
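
The migration chains listed above can be treated as ordered lists, so that an archive can ask which migrations are still pending for a given object. The following is only an illustrative sketch: the dictionary layout and function names are assumptions, not part of any MPI software, and the unspecified "xx" starting point of the representation chain is left out.

```python
from typing import Dict, List, Optional

# Chains as listed on the slide; the data structure is hypothetical.
MIGRATION_CHAINS: Dict[str, List[str]] = {
    "encoding": ["MPEG1", "MPEG2", "H.264 (MPEG4)", "mJPEG2000"],
    "representation": ["XML-specific", "XML generic", "RDF/OWL/SKOS"],
}


def next_format(kind: str, current: str) -> Optional[str]:
    """Return the next format in the chain, or None at the end of it."""
    chain = MIGRATION_CHAINS[kind]
    idx = chain.index(current)  # raises ValueError for unknown formats
    return chain[idx + 1] if idx + 1 < len(chain) else None


def pending_migrations(kind: str, current: str) -> List[str]:
    """Return every format the object still has to pass through."""
    chain = MIGRATION_CHAINS[kind]
    return chain[chain.index(current) + 1:]


print(next_format("encoding", "MPEG2"))
print(pending_migrations("encoding", "MPEG1"))
```

Each pending step is a separate migration that has to be planned and paid for, which is one reason a living archive is a recurring cost rather than a one-off one.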

7
need software support for
  • efficient resource creation
  • resource organization and categorization, and
    archive management (consistency, coherence)
  • accessing, searching and enriching the archive,
    and analyzing and visualizing its content, etc.
8
MPI cost estimates
Maintaining a large and complex living archive
costs at least 400 k/year (linguistic support,
software development, head etc. not included).
9
software maintenance costs
  • at MPI: start in 2000 as a bottom-up process
  • repository system (about 100,000 lines of code),
    incl. metadata, access management etc.
  • dedicated system tailored to language
    resources
  • utilization software (about 230,000 lines of
    code), much more heterogeneous
  • repository system now used by about 13 institutes
  • creation costs of repository system: 1 M
  • software maintenance costs for repository system:
    about 60 k/year
  • at Max Planck Digital Library: start in 2004 as
    a top-down process
  • repository system with about 900,000 lines of
    code
  • much bigger claim of generality etc.
  • as yet only a small code base of utilization
    software
  • creation costs of repository system (without
    FEDORA): 7 M
  • maintenance costs: probably more than 400 k/year
  • repository system is core: need maximal
    independence (MPG, EU)
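
The two repository systems above can be compared on cost per line of code and on yearly maintenance as a share of creation cost. A minimal sketch using only the slide's figures; the class and field names are illustrative, and the MPDL maintenance figure is treated as a lower bound because the slide says "probably more than":

```python
from dataclasses import dataclass


@dataclass
class RepositorySystem:
    name: str
    lines_of_code: int
    creation_cost: float        # same (unstated) currency unit as the slide
    yearly_maintenance: float   # same unit, per year

    @property
    def creation_cost_per_line(self) -> float:
        return self.creation_cost / self.lines_of_code

    @property
    def maintenance_ratio(self) -> float:
        """Yearly maintenance as a fraction of the creation cost."""
        return self.yearly_maintenance / self.creation_cost


mpi = RepositorySystem("MPI", 100_000, 1_000_000, 60_000)
# Assumption: 400 k/year is used as a lower bound for the MPDL system.
mpdl = RepositorySystem("MPDL", 900_000, 7_000_000, 400_000)

for system in (mpi, mpdl):
    print(
        f"{system.name}: {system.creation_cost_per_line:.2f} per line, "
        f"maintenance {system.maintenance_ratio:.1%} of creation cost per year"
    )
```

On these figures both systems end up at roughly 6% of creation cost per year for maintenance, even though the absolute amounts differ by a factor of about seven.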

10
CLARIN basic goal
  • CLARIN is a distributed and heterogeneous
    facility (research infrastructure)
  • thus different from the LHC facility in Geneva
  • need to move from a domain of accidental
    collaborations to a structured domain with
    clear responsibilities
  • basis of such a domain is visibility and
    interoperability -> registries, etc.
  • virtual observatory of language resources and
    technology

11
CLARIN backbone by centers
  • various types of centers in CLARIN
  • A: infrastructure centers with high availability
    and persistence
    (AA infrastructure, PID, center registry, metadata
    portals, concept registry, etc.)
  • B: resource and technology service providers
    with a certain commitment to persistent services
    (texts, lexica, multimedia recordings,
    parsers, translators, etc.)
  • C: metadata service providers without access to
    the content
    (enrichment of the visibility of LRT)
  • R: centers having resources and tools, but
    without machine-readable access level
  • E: external centers offering services of various
    types
    (libraries, national IDFs, national grid
    centers, TERENA; MPG will offer PID service to
    the research world, etc.)

12
CLARIN cost efficiency
  • cost efficiency is crucial for the operating
    phase of infrastructures
  • example: PID service (persistent and unique
    identifiers)
  • one main center for the whole EU would be
    sufficient
  • for high availability probably two other centers
    with mirrors
  • but a few other criteria need to be respected
    for LRT centers
  • countries want to take care of their languages
  • countries have "political" situations where
    several centers are wanted
  • service is often optimal where the involved
    experts are, and knowledge is heterogeneous
    (texts, speech, multimodality, etc.)
  • therefore country responsibility and funding
    must be the basic rule
  • there are already institutes that come close to
    what we expect in CLARIN
  • this will reduce the real costs, but cannot be
    calculated yet
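
The PID example above (one main center plus mirrors for high availability) can be sketched as a resolver that tries each center in turn. This is a hypothetical illustration only: the function names, the sample identifier and the URL are invented for the example and do not describe the actual CLARIN or MPG PID service.

```python
from typing import Callable, Dict, List, Optional

# A resolver maps a persistent identifier to a location, or None if unknown.
Resolver = Callable[[str], Optional[str]]


def make_resolver(table: Dict[str, str], available: bool = True) -> Resolver:
    """Build a stand-in for one PID center holding a copy of the table."""
    def resolve(pid: str) -> Optional[str]:
        if not available:
            raise ConnectionError("center unavailable")
        return table.get(pid)
    return resolve


def resolve_with_failover(pid: str, centers: List[Resolver]) -> Optional[str]:
    """Ask the main center first, then each mirror, until one answers."""
    for center in centers:
        try:
            return center(pid)
        except ConnectionError:
            continue
    raise ConnectionError("no PID center reachable")


# Hypothetical identifier and location, for illustration only.
table = {"hdl:0000/example-1": "https://archive.example.org/resource/1"}
main_center = make_resolver(table, available=False)  # simulate an outage
mirrors = [make_resolver(dict(table)), make_resolver(dict(table))]

location = resolve_with_failover("hdl:0000/example-1", [main_center] + mirrors)
print(location)
```

The sketch shows why a few mirrored centers are enough for this kind of service: every center holds the same small table, so any reachable one can answer.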

13
CLARIN cost estimates - general
  • already in 2006 ESFRI wanted to have a cost
    calculation
  • an impossible task, but based on 10 years and
    20 centers we estimated 146 M incl.
    preparation and construction phase
  • let's see how far we are
  • let's assume that we have 24 participating
    countries and that each country has 1 resource
    center and 1 technology center
    (it can already be seen that several countries
    will have more; LT centers will cost less, but
    there will be more smaller ones)
  • let's also assume that each country has on
    average as much data and complexity as the MPI
  • let's also assume that maintaining a technology
    center is as expensive
    (which will turn out to be too low an
    assumption)
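
Taken together with the MPI figure of at least 400 k/year for maintaining a living archive, the assumptions above give a lower bound for yearly maintenance. A minimal sketch of that arithmetic; the variable names are illustrative, and the result is a floor that excludes the items the MPI figure itself leaves out:

```python
# Assumptions taken from the slides.
countries = 24
centers_per_country = 2           # 1 resource center + 1 technology center
maintenance_per_center_k = 400    # "at least 400 k/year" (MPI estimate)

total_centers = countries * centers_per_country
lower_bound_k = total_centers * maintenance_per_center_k

print(f"{total_centers} centers")
print(f"at least {lower_bound_k / 1000:.1f} M per year for maintenance")
```

This yields 48 centers and a floor of 19.2 M per year before any shared infrastructure or other costs are added.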

14
CLARIN cost - just maintenance
15
other CLARIN costs
  • in addition the following costs will appear in
    CLARIN in all phases
  • they are outside the focus of the APA workshop
    on persistence
  • the ESFRI estimate was made for 10 years
    (preparation 3, construction 4, operation 3)

16
CLARIN compare estimates
  • based on 48 centers we now estimate for the
    operation phase 23.2 M / year or
    0.48 M / year / center
  • but this ignores other costs (seen before)
  • share infra costs across a larger group of centers
  • no final idea yet how many centers we will have
    at the end
  • personal guess: we will have a structuring
    effect (explicit costs)
  • based on 10 years and 20 centers we first
    estimated 146 M incl. preparation (3),
    construction phase (4) and other costs,
    i.e. 14.6 M / year or about 0.7 M / year / center
  • CLARIN will offer a number of preservation
    centers such as MPI
  • the service is already offered with some
    restrictions
  • by having some explicit curation and archiving
    centers CLARIN will take care of the
    preservation of language resources
  • we don't know yet how much DANS (NL) / AHDS (UK)
    etc. cost
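
The per-center figures in the two estimates compared on this slide can be re-derived from the totals. A minimal sketch, assuming the 23.2 figure is in millions per year like the other amounts; the function name is illustrative:

```python
def per_center_per_year(total_m: float, years: float, centers: int) -> float:
    """Spread a total (in millions) evenly over years and centers."""
    return total_m / years / centers


# New estimate: 23.2 M per year for 48 centers (operation phase only).
new_estimate = per_center_per_year(23.2, 1, 48)

# First ESFRI estimate: 146 M over 10 years for 20 centers (all phases).
first_estimate = per_center_per_year(146, 10, 20)

print(f"new:   {new_estimate:.2f} M / year / center")
print(f"first: {first_estimate:.2f} M / year / center")
```

The two numbers are not directly comparable, since the first one includes preparation, construction and other costs while the new one covers maintenance in the operation phase only.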

17
If we are not to end in a Babylonian scenario, we
still have some time to improve the estimates.
Thanks for your attention!