Slide 1
Archiving/Infrastructure Costs in the Humanities
Peter Wittenburg
Max Planck Institute for Psycholinguistics
Nijmegen, NL
Slide 2
the digital trap
45 Terabyte
increase 12 TB
- since 1995 a belief of technologists in an all-digital domain
- since 2000 researchers share this belief
- since then a rapid increase of storage capacity requirements
- since 2002 an awareness of digital archiving (long-term)
- large heterogeneity (video, audio, texts, EEG, fMRI, eye tracking, motion tracking, etc.)
Slide 3
DOBES Impulse
Of the 6500 languages, one dies every two weeks!
- 2000: decision of the Volkswagen Foundation to fund the Endangered Languages Programme as a distributed international enterprise
- 45 teams working completely independently with one digital archive
- increased awareness of the need for long-term preservation
- relevant for linguistic theory building, but also part of cultural heritage
Slide 4
efforts for long-term preservation
- 6 copies at large computer centers with technology migration strategy
- replacement of technology at regular time intervals
- data exchange with an increasing number of regional repositories set up by the MPI team (about 10 new requests)
- unique effort but necessary
- 80% of our recordings are endangered (UNESCO)
Slide 5
living archives 1
- digital archives in the research world are living entities
- Digital Dilemma Report of the Academy of Motion Picture Arts and Sciences
- until now "file and forget system" in an old mine, with 50% survival rate of films after 50 years
- costs for conventional master copy: $1,059
- costs for digital master copy: $12,514
- reason for cost factor: digital archiving is a dynamic process
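
The cost factor mentioned on this slide follows directly from the two quoted figures. A minimal sketch of that arithmetic, assuming the slide's values 1.059 and 12.514 use a thousands separator (that is, 1059 and 12514 in the same currency); the function and constant names are illustrative only.

```python
# Figures as quoted on the slide from the Digital Dilemma report.
# Assumption: "1.059" and "12.514" are thousands-separated amounts.
CONVENTIONAL_MASTER_COST = 1059
DIGITAL_MASTER_COST = 12514


def cost_factor(digital: float, conventional: float) -> float:
    """Return how many times more expensive the digital copy is."""
    return digital / conventional


factor = cost_factor(DIGITAL_MASTER_COST, CONVENTIONAL_MASTER_COST)
print(f"digital / conventional = {factor:.1f}")
```

Under that reading the digital master copy is roughly twelve times as expensive as the conventional one.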
Slide 6
living archives 2
- changes/migration at different levels
- storage technology migration: old technology exhibits errors
- encoding formats change
- MPEG1 -> MPEG2 -> H.264 (MPEG4) -> mJPEG2000
- representation format/schemas change
- xx -> XML-specific -> XML generic -> RDF/OWL/SKOS
- software technology changes
- etc.
- research archives are even worse
- new versions and continuous extensions
- commentary, enrichments, relations
- virtual collection building
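
The encoding chain on this slide can be read as an ordered list of format generations. The sketch below is only an illustration of that reading: given the format an item is currently stored in, it lists the format changes the item has not yet gone through. The helper name `pending_migrations` is hypothetical and no real transcoding is performed.

```python
from typing import List, Tuple

# Encoding generations as listed on the slide, oldest first.
ENCODING_CHAIN: List[str] = ["MPEG1", "MPEG2", "H.264 (MPEG4)", "mJPEG2000"]


def pending_migrations(current: str, chain: List[str]) -> List[Tuple[str, str]]:
    """Return the (from, to) format changes an item has not yet gone through.

    Hypothetical helper: it only walks the ordered chain and does not
    convert any data.
    """
    start = chain.index(current)  # raises ValueError for an unknown format
    return list(zip(chain[start:], chain[start + 1:]))


steps = pending_migrations("MPEG2", ENCODING_CHAIN)
for source, target in steps:
    print(f"{source} -> {target}")
```

In practice an archive might convert directly to the newest format rather than step by step; the list only shows how many generations an item has fallen behind.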
Slide 7
need software support for
- efficient resource creation
- resource organization, categorization and archive management (consistency, coherence)
- accessing, searching and enriching the archive
- analyzing and visualizing its content, etc.
Slide 8
MPI cost estimates
Maintaining a large and complex living archive costs at least 400 k€/year (linguistic support, software development, head, etc. not included).
Slide 9
software maintenance costs
- at MPI: start in 2000 as a bottom-up process
- repository system (about 100,000 lines of code), incl. metadata, access management, etc.
- dedicated system tailored to language resources
- utilization software (about 230,000 lines of code), much more heterogeneous
- repository system now used by about 13 institutes
- creation costs of repository system: 1 M€
- software maintenance costs for repository system: about 60 k€/year
- at Max Planck Digital Library: start in 2004 as a top-down process
- repository system with about 900,000 lines of code
- much bigger claim of generality, etc.
- as yet only a small code base of utilization software
- creation costs of repository system (without FEDORA): 7 M€
- maintenance costs probably more than 400 k€/year
- repository system is core: need maximal independence (MPG, EU)
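
The two sets of figures on this slide can be put side by side. A minimal sketch, assuming all amounts are in the same currency and taking "more than 400 k/y" as a lower bound; the class and method names are hypothetical and only serve the comparison.

```python
from dataclasses import dataclass


@dataclass
class RepositorySystem:
    name: str
    lines_of_code: int           # approximate, as on the slide
    creation_cost: float         # currency units as on the slide
    maintenance_per_year: float  # currency units per year

    def maintenance_ratio(self) -> float:
        """Yearly maintenance cost as a fraction of the creation cost."""
        return self.maintenance_per_year / self.creation_cost

    def creation_cost_per_line(self) -> float:
        """Creation cost divided by the approximate lines of code."""
        return self.creation_cost / self.lines_of_code


mpi = RepositorySystem("MPI", 100_000, 1_000_000, 60_000)
# "probably more than 400 k/y": 400_000 is used here as a lower bound.
mpdl = RepositorySystem("MPDL", 900_000, 7_000_000, 400_000)

for system in (mpi, mpdl):
    print(
        f"{system.name}: maintenance {system.maintenance_ratio():.1%} of creation cost per year, "
        f"{system.creation_cost_per_line():.2f} per line of code"
    )
```

With these inputs both systems land at roughly 6% of their creation cost per year for maintenance, even though the absolute amounts differ by almost a factor of seven.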
Slide 10
CLARIN basic goal
- CLARIN is a distributed and heterogeneous facility (research infrastructure)
- thus different from the LHC facility in Geneva
- need to move from a domain of accidental collaborations to a structured domain with clear responsibilities
- basis of such a domain is visibility and interoperability -> registries, etc.
- virtual observatory of language resources and technology
Slide 11
CLARIN backbone by centers
- various types of centers in CLARIN
- A: infrastructure centers with high availability and persistence (AA infra, PID, center registry, metadata portals, concept registry, etc.)
- B: resource and technology service providers with a certain commitment to persistent services (texts, lexica, multimedia recordings, parsers, translators, etc.)
- C: metadata service providers without access to the content (enrichment of the visibility of LRT)
- R: centers having resources and tools, but without machine-readable access level
- E: external centers offering services of various types (libraries, national IDFs, national grid centers, TERENA, MPG will offer PID service to research world, etc.)
Slide 12
CLARIN cost efficiency
- cost efficiency is crucial for the operating phase of infrastructures
- example: PID service (persistent and unique identifiers)
- one main center for the whole EU would be sufficient
- for high availability probably two other centers with mirrors
- but a few other criteria to be respected for LRT centers
- countries want to take care of their languages
- countries have "political" situations where several centers are wanted
- service is often optimal where the involved experts are, and knowledge is heterogeneous (texts, speech, multimodality, etc.)
- therefore country responsibility and funding must be the basic rule
- there are already institutes that come close to what we expect in CLARIN
- this will reduce the real costs, but cannot be calculated yet
Slide 13
CLARIN cost estimates - general
- already in 2006 ESFRI wanted to have a cost calculation
- impossible task, but based on 10 years and 20 centers we estimated 146 M€ incl. preparation and construction phase
- let's see how far we are
- let's assume that we have 24 participating countries and that each country has 1 resource center and 1 technology center (it can be seen already now that several countries will have more; LT centers will cost less, but there will be more smaller ones)
- let's also assume that each country has on average as much data and complexity as the MPI
- let's also assume that maintaining a technology center is as expensive (which will turn out to be too low an assumption)
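
The assumptions above amount to a simple multiplication. The sketch below spells it out, using the MPI archive figure of "at least 400 k/year" from the earlier slide as the per-center cost. The resulting total is an illustration derived here, not a number from the slides, and the function names are hypothetical.

```python
COUNTRIES = 24
CENTERS_PER_COUNTRY = 2    # 1 resource center + 1 technology center
MPI_COST_PER_YEAR_M = 0.4  # "at least 400 k/year", expressed in millions


def total_centers(countries: int, centers_per_country: int) -> int:
    """Number of centers under the one-resource, one-technology assumption."""
    return countries * centers_per_country


def maintenance_lower_bound_m(centers: int, cost_per_center_m: float) -> float:
    """Yearly maintenance in millions if every center costs as much as the MPI archive."""
    return centers * cost_per_center_m


centers = total_centers(COUNTRIES, CENTERS_PER_COUNTRY)
lower_bound = maintenance_lower_bound_m(centers, MPI_COST_PER_YEAR_M)
print(f"{centers} centers, at least {lower_bound:.1f} M per year")
```

Because the 400 k figure excludes linguistic support, software development and management, the product can only serve as a lower bound.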
Slide 14
CLARIN cost - just maintenance
Slide 15
other CLARIN costs
- in addition the following costs will appear in CLARIN in all phases
- they are outside the focus of the APA workshop on persistence
- the ESFRI estimation was made for 10 years (preparation 3, construction 4, operation 3)
Slide 16
CLARIN compare estimates
- based on 48 centers we now estimate for the operation phase 23.2 M€ / year or 0.48 M€ / year / center
- but ignored other costs (seen before)
- share infra costs by larger group of centers
- no final idea yet how many centers we will have at the end
- personally guess that we will have a structuring effect (explicit costs)
- based on 10 years and 20 centers we first estimated 146 M€ incl. preparation (3), construction phase (4) and other costs
- i.e. 14.6 M€ / year or 0.7 M€ / year / center
- CLARIN will offer a number of preservation centers such as MPI
- the service is already offered with some restrictions
- by having some explicit curation and archiving centers CLARIN will take care of the preservation of language resources
- don't know yet how much DANS (NL) / AHDS (UK) etc. cost
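
The per-center figures on this slide can be checked with two divisions. A minimal sketch, assuming all amounts are in millions of the same currency; the function name is illustrative.

```python
def per_center_per_year(total_per_year_m: float, centers: int) -> float:
    """Yearly cost per center, in millions."""
    return total_per_year_m / centers


# Current estimate: operation phase only, 48 centers.
current = per_center_per_year(23.2, 48)

# First estimate: 146 M over 10 years, 20 centers,
# including preparation, construction and other costs.
first_total_per_year = 146 / 10
first = per_center_per_year(first_total_per_year, 20)

print(f"current estimate: {current:.2f} M / year / center")
print(f"first estimate:   {first_total_per_year:.1f} M / year, {first:.2f} M / year / center")
```

The two results are not directly comparable: the first estimate spreads preparation, construction and other costs over ten years, while the current one covers maintenance in the operation phase only.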
Slide 17
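Falls nicht to end in Babylonish scenario nous avons still etwas time om schattingen te improve.
(In English: if we do not want to end in a Babylonian scenario, we still have some time to improve the estimates.)
Thanks for your attention!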