Title: Trusted Datagrids: Library of Congress Projects with UCSD
1 Trusted Datagrids: Library of Congress Projects with UCSD
Ardys Kozbial - UCSD Libraries; David Minor - SDSC
2 Building Trust in a 3rd Party Repository: A Pilot Project
David Minor - San Diego Supercomputer Center
6 How can the LC trust someone they can't control?
8 Moving forward in the right direction requires more than fuzzy promises ...
9 ... it takes a combination of experts and tools.
10 Cyberinfrastructure is the collection of ...
- Resources: computers, data storage, networks, scientific instruments, experts, etc.
- Glue: integrating software, systems, and organizations
11 "Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them."
- ACLS Commission on Cyberinfrastructure for the Humanities and Social Sciences
12 The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of cyberinfrastructure.
13 SDSC ...
- Is one of the original NSF supercomputer centers
- Supports high performance computing systems
- Supports data applications for science, engineering, social sciences, cultural heritage institutions
- Has LARGE data capabilities
  - 3 PB disk storage
  - 25 PB tape storage
14 UCSD Libraries
- 3.5 million volumes
- Digital Asset Management System (in development)
  - 250,000 objects
  - 15 TB
- Shared collections with UC
- California Digital Library
- Digital Preservation Repository
- eScholarship repository
15 Partnerships and Collaborations
- LC Pilot Project: Building Trust in a 3rd Party Repository
  - Using test image collections/web crawls, ingest content to SDSC repository
  - Allow access for content audit
  - Track usage of content over time
  - Deliver content back to LC at end of project
- Library of Congress NDIIPP Chronopolis Program
  - Build production-capable Chronopolis grid (50 TB x 3)
  - Further define transmission packaging for archival communities
  - Investigate best network transfer models for I2 and TeraGrid networks
- California Digital Library (CDL) Mass Transit Program
  - Enable UC System Libraries to transfer high-speed mass digitization collections across CENIC/I2
  - Develop transmission packaging for CDL content
- UCSD Libraries Digital Asset Management System
  - RDF system with data managed in SRB at SDSC
16 SDSC DPI Group
- Digital Preservation Initiatives Group
- Charged with developing and supporting digital preservation services within the Production Systems Division of SDSC
- http://dpi.sdsc.edu
- Cross-Organizational Group
- SDSC Personnel/UCSD Libraries Personnel
- Libraries
- Archives
- Technology
- Information Science
17 Cyberinfrastructure
Trust
18 For Example
19 We worked together to set up high-speed data replication services
- Achieved 200 Mb/s
- 2 TB/day
- Highly reliable
[Diagram labels: Checksums - Internet2 - Checksums]
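As a rough check on the two figures on this slide, the sketch below converts a sustained line rate into a daily volume. It assumes decimal units (1 Mb = 10^6 bits, 1 TB = 10^12 bytes) and a link that stays busy for the full 24 hours; the function name is illustrative and not taken from the project.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(megabits_per_second: float) -> float:
    """Daily volume in terabytes for a link that is busy around the clock."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


if __name__ == "__main__":
    # 200 Mb/s is 25 MB/s, which adds up to a little over 2 TB in a day.
    print(f"{tb_per_day(200):.2f} TB/day")
```

Under these assumptions 200 Mb/s works out to roughly 2.16 TB per day, which is consistent with the 2 TB/day quoted on the slide.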
20 Network setup involved
- LC and SDSC staff working together
- Configurations on networks and computers
- Resolving different security environments
- Network monitoring
21 Lessons Learned
- Networking is hard!
- It's not magic - there's always a reason
- It highlights the collaborative nature of the work
- Can't forget it once it's set up
22 Trust Elements
- Have multi-institutional issues been solved?
- Does new infrastructure improve process?
- Has a long-term solution been found?
- Is solution useful for other organizations?
24 SDSC created a robust storage environment for this data
Multiple replications at SDSC and geographically diverse locations
25 (a process with several characteristics)
- Needed to replicate structure exactly
- This had to be done for 5 replications
- Complex environment had to be transparent
- Data had to be available for manipulation
26 The Storage Resource Broker provided replication services ...
27 ... and extensive monitoring, logging and reporting functions
(which led to many conversations)
28 Logging and monitoring procedures
- Scripts compared the files within the system with a master list and checked changes on either side: fairly straightforward
- But:
  - What is the master list and who maintains it?
  - Who decides what is a legitimate change?
  - Do you want a dark archive or an active remote data center?
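The comparison scripts themselves are not shown in the slides. As a minimal sketch of the idea, assuming the master list is a mapping from relative path to SHA-256 digest (the format, the digest choice and both function names are assumptions made for illustration), a check against one storage directory could look like this:

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_to_master(root: Path, master: dict[str, str]) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or changed relative to the master list."""
    actual = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    return {
        "missing": sorted(set(master) - set(actual)),
        "unexpected": sorted(set(actual) - set(master)),
        "changed": sorted(
            k for k in master.keys() & actual.keys() if master[k] != actual[k]
        ),
    }
```

The code only detects differences. Whether an entry under "changed" or "unexpected" is an error or a legitimate update is exactly the policy question the slide raises, and it has to be settled between the partners rather than in the script.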
29 We tested a new Front-End
30 ... and explored an important issue
- Reliability
- Versus
- Accessibility
31 Lessons Learned
- Always keep expectations aligned
- Duplication of structure is complicated
- Don't confuse accessibility and reliability
- Communication highlights communication
32 Trust Elements
- Can remote data be accessed?
- Can remote data be verified?
- Can remote data be retrieved and re-used?
- Can ownership be clearly defined?
33 SDSC and LC explored a new approach to working with web archives
- Parallel indexing and display system
- Looked like the default to the user
- 50,000 ARC files
- 6 terabytes of data
- Short processing time
34 Using default tools, our initial indexing rate was 1000 files per day
More than 6 weeks of constant computing to index the entire collection.
This was over our time budget.
35 We ran 18 parallel indexing instances and reduced processing to a week
We modified the Wayback source code to create a new access infrastructure
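The arithmetic behind these two slides can be sketched as follows. The code assumes every instance indexes at the same 1000 files per day and that the 50,000 ARC files can be split evenly with no coordination cost; both helper names are illustrative, and the Wayback indexing step itself is not modeled.

```python
import math


def shard(items: list, n: int) -> list[list]:
    """Split items round-robin into n work lists, one per indexing instance."""
    return [items[i::n] for i in range(n)]


def ideal_days(total_files: int, files_per_day: int, instances: int) -> int:
    """Whole days needed if every instance sustains the same indexing rate."""
    return math.ceil(total_files / (files_per_day * instances))


if __name__ == "__main__":
    arc_files = [f"file-{i:05d}.arc" for i in range(50_000)]  # placeholder names
    work = shard(arc_files, 18)
    print(max(len(w) for w in work), "files in the largest shard")
    print(ideal_days(50_000, 1000, 1), "days with one instance")
    print(ideal_days(50_000, 1000, 18), "days with 18 instances (ideal)")
```

One instance needs 50 days, which matches "more than 6 weeks". The ideal figure for 18 instances is about 3 days, so the week reported on the slide suggests the real speed-up was lower than the ideal case.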
36 Lessons Learned
- Default setup isn't always easiest
- Time is a wonderful motivator
- Sometimes you need to start over
- Experts are often interested in your work
37 Trust Elements
- Are the final results the same?
- Can the results be reached in a better way?
- Can a new organization bring new expertise?
- Can a new organization work with your partners?
38 Next steps ...
39 Chronopolis: A Partnership
- Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries.
- Initial Chronopolis provider sites include:
  - SDSC and UCSD Libraries at UC San Diego
  - University of Maryland
  - National Center for Atmospheric Research (NCAR) in Boulder, CO
40 Institutions and Roles - UCSD
- SDSC
  - Storage and networking services
  - SRB support
  - Transmission Packaging Modules
- UCSD Libraries
  - Metadata services (PREMIS)
  - DIPs (Dissemination Information Packages)
  - Other advanced data services as needed
41 Institutions and Roles - NCAR
- National Center for Atmospheric Research
  - Archives: complete copy of all data
  - Storage and network support
  - Network testing
42 Institutions and Roles - UMIACS
- University of Maryland Institute for Advanced Computer Studies
  - Archives: complete copy of all data
  - Advanced data services
    - PAWN: Producer Archive Workflow Network in Support of Digital Preservation
    - ACE: Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives
    - Other advanced data services as needed
43 SDSC Chronopolis Program
44 Chronopolis Vocabulary
- Partners: UCSD Libraries, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies; all provide grid-enabled storage nodes for Chronopolis services.
- Clients: ICPSR, CDL; contribute content to the Chronopolis preservation network.
- SRB: Storage Resource Broker, datagrid software.
- iRODS: integrated Rule Oriented Data System, datagrid software.
- ACE: Auditing Control Environment, part of the ADAPT project at UMD.
- PAWN: Producer Archive Workflow Network, part of the ADAPT project at UMD.
- INCA: user-level grid monitoring; executes periodic, automated, user-level testing of Grid software and services (grid middleware).
- BagIt: transfer specification developed by CDL and the Library of Congress.
- GridFTP: parallel transfer technology; moves large collections within a grid wide-area network.
45 Chronopolis Inside
- Linked by main staging grid where data is verified for integrity, and quarantined for security purposes.
- Collections are independently pulled into each system.
- Manifest layer provides added security for database management and data integrity validation.
- Benefits
  - 3 independently managed copies of the collection
  - High availability
  - High reliability
[Diagram label: Grid Brick Disks]
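One practical use of three independently managed copies is cross-checking them against each other. The sketch below is not how ACE or the Chronopolis manifest layer works; it is a hypothetical illustration that assumes each site can export a mapping from file path to digest, and it flags any site whose entry differs from the majority.

```python
from collections import Counter


def audit_replicas(manifests: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each problem path to the sites that disagree with the majority digest.

    `manifests` maps a site name to its {path: digest} listing. A site that
    lacks a file counts as disagreeing. If no digest is held by more than
    half of the sites, every site is listed for that path.
    """
    problems: dict[str, list[str]] = {}
    all_paths = set().union(*manifests.values())
    for path in sorted(all_paths):
        digests = {site: listing.get(path) for site, listing in manifests.items()}
        counts = Counter(d for d in digests.values() if d is not None)
        majority, votes = counts.most_common(1)[0]
        if votes * 2 <= len(manifests):
            suspects = sorted(manifests)  # no clear majority: flag everyone
        else:
            suspects = sorted(s for s, d in digests.items() if d != majority)
        if suspects:
            problems[path] = suspects
    return problems
```

With three sites, a single damaged or missing copy is outvoted by the other two, which is one reason independently managed copies support the reliability benefit listed on the slide.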
46 SDSC Leveraged Infrastructure
- Serves both HPC and digital preservation
- Archive
  - 25 PB capacity
  - Both HPSS and SAM-QFS
- Online disk
  - 3 PB total
  - HPC parallel file systems
  - Collections
  - Databases
  - Access tools
Adapted from Richard Moore (SDSC)
47 Chronopolis Demonstration Project
- Demonstration Project 2006-2007
- Demonstration collections ingested within Chronopolis
  - National Virtual Observatory (NVO)
    - 3 TB Hyperatlas Images (partial collection)
  - Library of Congress PG Image Collection
    - 600 GB Prokudin-Gorskii Image Collection
  - Interuniversity Consortium for Political and Social Research (ICPSR)
    - 2 TB Web Accessible Data
  - NCAR Observational Data
    - 3 TB Observational Re-Analysis Data
48 NDIIPP Chronopolis Project
- Creating a 3-node federated data grid at SDSC, NCAR and UMD: up to 50 TB of data from CDL and ICPSR
- Installing and testing a suite of monitoring tools using ACE, PAWN, INCA
- Creating appropriate Transmission Information Packages
- Generating PREMIS definitions for data
- Writing Best Practices documents for clients and partners
49 Chronopolis Grid Framework
[Diagram labels: CDL Server; ICPSR Server; UC Berkeley Network; ICPSR Network; SDSC Network; NCAR Network; UMD (Maryland) Network; SRB MCAT; SRB D-Broker; Chronopolis Data 12-25 TB; Chronopolis Data 12 TB; Sun 6140 62 TB; Apple Xsan; Sun SAM-QFS; Tape Silos]
Adapted from Bryan Banister (SDSC)
50 NDIIPP Chronopolis Clients - CDL
- California Digital Library
  - A part of UCOP, supports the University of California libraries
  - Providing up to 25 TB of data: Web-At-Risk project
  - Five years of political and governmental websites
  - ARC files created from web crawls
  - Using BagIt transfer structure
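For readers unfamiliar with the BagIt layout mentioned here, the sketch below writes and re-checks a minimal bag: a `data/` payload directory, a `bagit.txt` declaration and a checksum manifest. The version string, the choice of SHA-256 and the function names are assumptions made for illustration; the slides do not say which BagIt version or checksum algorithm the CDL transfers used.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir: Path, payload: dict[str, bytes]) -> list[str]:
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest."""
    lines = []
    for name in sorted(payload):
        target = bag_dir / "data" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload[name])
        lines.append(f"{hashlib.sha256(payload[name]).hexdigest()}  data/{name}")
    bag_dir.mkdir(parents=True, exist_ok=True)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",  # assumed version
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return lines


def verify_bag(bag_dir: Path) -> list[str]:
    """Return manifest entries whose file is missing or fails its checksum."""
    bad = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, rel = line.split(maxsplit=1)
        path = bag_dir / rel
        if not path.is_file() or hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            bad.append(rel)
    return bad
```

A receiving site can run `verify_bag` after the transfer completes; an empty list means every payload file listed in the manifest arrived intact. This version reads each file fully into memory, so a real transfer of multi-gigabyte ARC files would hash in chunks instead.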
51 Diagram of CDL Data Transfer
[Diagram labels: CDL Virtual Machine at UCB; Wget; BagIt; BagIt Manifest; File 1 ... File n; Wget files 1-10, 11-20; Parallel Wget Xfer; Possible SRB/BagIt Module; SDSC Network; Chron Staging; Chron Repository; UMIACS Network; UMIACS; NCAR Network; NCAR]
Adapted from Bryan Banister (SDSC)
52 NDIIPP Chronopolis Clients - ICPSR
- Inter-University Consortium for Political and Social Research, University of Michigan
  - Providing 12 TB of data: wide variety of types
  - Already working with SDSC using SRB
53 Diagram of ICPSR Transfer
[Diagram labels: ICPSR SRB Repository UMich; Sput/Srsync Files; Sput tar files; Parallel Sput/Srsync Xfer; File 1 ... File n; SDSC Network; Chron SRB MCAT; EMC SAN; Chron Staging; Chron Repository; UMIACS Network; UMIACS; NCAR Network; NCAR]
Adapted from Bryan Banister (SDSC)
54 Ongoing and Future Initiatives
- Migration of Chronopolis from SRB to iRODS
- Develop interoperability with community-based archival systems/standards
- TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium
55 Looking for Partnerships
- Repositories interested in moving large digital collections among heterogeneous repository systems.
- Fedora, DSpace or E-Prints sites interested in managed datagrid storage.
- Institutions interested in personnel swaps to conduct TRAC audit assessment compliance.
- Community needs for mass-scale data transmission and storage.
56 Chronopolis Credits
- SDSC
  - Fran Berman
  - Richard Moore
  - David Minor
  - Chris Jordan
  - Jim D'Aoust
  - Robert McDonald
  - Don Sutton
  - Bryan Banister
  - Phong Dinh
  - Jay Dombrowski
  - Emilio Valente
- UCSD Libraries
  - Brian Schottlaender
  - Luc Declerck
  - Ardys Kozbial
  - Brad Westbrook
  - Arwen Hutt
- NCAR
  - Don Middleton
  - Michael Burek
  - Linda McGinley
- UMIACS
  - Joseph JaJa
  - Mike Smorul
  - Mike McGann
- Library of Congress
  - Martha Anderson
  - Lisa Hoppis
- CACI
  - Mike Ivey
57 http://chronopolis.sdsc.edu
61
- a geographically distributed preservation environment that supports long-term management and stewardship of digital collections
- implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure.
- technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment.
62 Chronopolis focuses on ...
- Assessment of the needs of potential user communities and development of appropriate service models
- Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations
- Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
- Development of cost and risk models for long-term preservation
- Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
63 The people of Chronopolis are ...
64 In conclusion
Organizations need ways to validate trust in 3rd parties
66 SDSC and the Library of Congress explored one way to do this by working with Cyberinfrastructure and demonstrating trust.
67 With a trusted relationship, many journeys become possible