Title: Peter Buneman
1Curated Databases Can libraries be of help?
Peter Buneman Laboratory for Foundations of
Computer Science School of Informatics,
University of Edinburgh and Digital Curation
Centre
With help from Rajendra Bose, James Cheney,
Carwyn Edwards, Irini Fundulaki, Wenfei Fan,
Floris Geerts, Xibei Jia, Anastasios
Kementsietsidis
2Traditional Repositories
Are they good as repositories?
What about Alexandria, Baghdad, Seville, Cotton,
Louvain?
3Traditional Repositories
Are they good for access?
4Claim
- In building centralised repositories we may
- make our data less accessible
- we cannot dictate how people use it
- threaten its survival
- what guarantees do we have that the repositories
will continue to be supported?
5A change for the better?
- Storage
- Redundant
- Persistent
- Distributed
- Readable by people
- Clear standards for citation
- Historical record (old data is useful)
- Well understood ownership/IP
- Storage
- Single-source
- Volatile
- Centralised
- Internal DBMS format
- No standards for citation
- No historical record
- Mind-boggling legal issues
20th century libraries did some things better!
6What is Digital Curation?
- Preserving
- the process of preserving digital data for future
use, once it has been created - or
- Publishing
- the process of maintaining and updating a body of
knowledge for current use
7- Examples of preserving
- On-line books, journals
- Digitised artifacts (manuscripts, paintings, music)
- Experimental data, results of analyses and simulations
- Examples of publishing
- Reference material (dictionaries, encyclopedias, gazetteers, ...)
- On-line scientific databases
- Metadata for preservation catalogs, digital libraries, ontologies
8The role of libraries
- Libraries have focussed recently on the preservation aspects of digital data
- Less so on the publishing aspects, even though they traditionally played an important role in dissemination. Why?
- Publishing now takes place through curated databases
9Curated databases are about organisation and
annotation
ID 11SB_CUCMA STANDARD PRT 480 AA.
AC P13744
DT 01-JAN-1990 (REL. 13, CREATED)
DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC EUKARYOTA PLANTA EMBRYOPHYTA ANGIOSPERMAE DICOTYLEDONEAE
OC VIOLALES CUCURBITACEAE.
RN 1
RP SEQUENCE FROM N.A.
RC STRAIN=CV. KUROKAWA AMAKURI NANKIN
RX MEDLINE 88166744.
RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.
RL EUR. J. BIOCHEM. 172:627-632(1988).
RN 2
RP SEQUENCE OF 22-30 AND 297-302.
RA OHMIYA M., HARA I., MASTUBARA H.
RL PLANT CELL PHYSIOL. 21:157-167(1980).
CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC DISULFIDE BOND.
CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR EMBL M36407 G167492 -.
DR PIR S00366 FWPU1B.
DR PROSITE PS00305 11S_SEED_STORAGE 1.
KW SEED STORAGE PROTEIN SIGNAL.
FT SIGNAL 1 21
FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.
FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).
FT CHAIN 297 480 DELTA CHAIN (BASIC).
FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.
FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT CONFLICT 27 27 S -> E (IN REF. 2).
FT CONFLICT 30 30 E -> S (IN REF. 2).
SQ SEQUENCE 480 AA 54625 MW D515DD6E CRC32
MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH
RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ
CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV
PAGVSHWMYN RGQSDLVLIV FADTRNVANQ IDPYLRKFYL
AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE
ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF
NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI
PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM
LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE //
CIA World Factbook
Swissprot
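The organisation in an entry like the one above is carried by the two-letter code that starts each line. As a minimal sketch (not Swissprot's own tooling), the following groups the lines of one entry by that code. The function name `parse_flat_entry` is hypothetical, and the sample is a shortened copy of the entry shown on this slide.

```python
from collections import defaultdict

def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        code, _, value = line.partition(" ")
        # Sequence lines have no line code, so they are skipped here.
        if len(code) == 2 and code.isalpha() and code.isupper():
            fields[code].append(value.strip())
    return dict(fields)

# Shortened copy of the entry on the slide.
sample = """ID 11SB_CUCMA STANDARD PRT 480 AA.
AC P13744
DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
DR EMBL M36407 G167492 -.
DR PIR S00366 FWPU1B.
MARSSLFTFL CLAVFINGCL SQIEQQSPWE
//"""

entry = parse_flat_entry(sample)
print(entry["AC"])
print(entry["DR"])
```

Annotation lines (CC, FT) and cross-references (DR) can then be treated as separate objects, which is what makes them separately citable and archivable.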
10Why curated databases matter
- Nearly all our reference works are now curated databases
- Upwards of 800 databases in molecular biology alone
- Databases are now
- a form of publication
- a means of communication (via annotation)
11The cost of data
In // per byte
12Some of the CS issues in Curated DBs
- Publishing data, data exchange, integration and transformation (interoperability)
- Annotation
- Provenance
- Archiving/preservation
- Citation
- Security
- Data Quality
Other (related) issues
- Legal, appraisal, economic (open access?),
organisational, education
13Example IUPHAR database
IUPHAR
- Standard curated database (one of the 800 on the nucleic acid list)
- Labour-intensive (100 contributors)
- Valuable (supported by drug companies)
- Simple, clean structure
DCC
50m
14How do you cite something in a database?
- Many scientific databases ask you to cite them, but...
- they don't tell you how, or
- they tell you to give the URL, or
- they tell you to cite a paper written about the database.
Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov 28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available from: http://www.aces.uiuc.edu/necd/inter2_search.cgi?ind854148396
NLM Recommended Formats for Bibliographic
Citation. Internet Supplement. NLM Technical
report Bethesda, MD 20894, July 2001.
15The population of Monaco?
2004  35,000  http://www.cybevasion.fr/tourisme/monaco.html
2004  33,300  http://www.internetworldstats.com/europa2.htm
2004  32,270 (July 2004 est.)  http://www.cia.gov/cia/publications/factbook/geos/mn.html
2004  32,000  http://www.studentsoftheworld.info/pageinfo_pays.php3?PaysMCO
2004  29,972  http://worldatlas.com/webimage/countrys/europe/mc.htm
2003  32,130 (July 2003 est.)  http://www.greenfacts.org/studies/climate_change/index.htm
2003  32,130 (mid 2003)  http://www.infoplease.com/ipa/A0004379.html
2003  32,000 (July 2003 estimate)  http://www.gesource.ac.uk/worldguide/html/962_people.html
2003  30,000  http://www.tlfq.ulaval.ca/axl/europe/monaco.htm
2002  31,987 (July 2002 est.)  A  http://www.greekorthodoxchurch.org/wfb2002/monaco/monaco_people.html
2001  31,842 (July 2001 est.)  http://wonderclub.com/Atlas/mccia.htm
2001  31,842 (July 2001 est.)  http://www.worldfactsandfigures.com/countries/monaco.php
2001  31,842 (July 2001 est.)  A  http://www.workmall.com/wfb2001/monaco/monaco_people.html
2000  32,500 (est 2000)  http://www.atlapedia.com/online/countries/monaco.htm
2000  32,020 (C 2000-05-03)  http://www.citypopulation.de/Monaco.html
2000  32,020 (2000).  http://www.worldtravelguide.net/data/mco/mco.asp
2000  32,020 (2000 census!)  http://www.state.gov/r/pa/ei/bgn/3397.htm
2000  31,693 (July 2000 est.)  http://geography.about.com/library/cia/blcmonaco.htm
2000  31,693 (July 2000 est.)  http://www.abacci.com/atlas/demography.asp?countryID269
2000  31,693 (July 2000 est.)  ?  http://www.mapquest.com/atlas/?regionmonaco
2000  31,842  http://www.fact-index.com/m/mo/monaco.html
2000  31,842  http://en.wikipedia.org/wiki/Monaco
2000  31,700 (e2000m)  http://www.library.uu.nl/wesp/populstat/Europe/monacoc.htm
1999  32,149 (July 1999 est.)  http://www.photius.com/wfb1999/monaco/monaco_people.html
1999  32,000  http://geography.about.com/library/weekly/aa012599.htm
1990  29,972 (1990 census)  http://www.monte-carlo.mc/us/presentation/keyfigur/
16What is a citation?
- Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov; 17(11):999-1001.
- Location and descriptive information
Ann. Phys., Lpz 18 639-641
Nature, 171, 737-738
(We often want more than location)
17Citations and Persistent Identifiers
- Persistent identifiers provide a long-term retrieval mechanism
- Citations carry
- descriptive information
- location information (useful once you have retrieved the object)
- No canonical citation
- Is the thing retrieved by a PI supposed to be static?
- PIs should be included in citations
18The IUPHAR database
1. The IUPHAR database (C1) contains no information about Ginandtonicin.
2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.
3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for receptor MT1.
19Citation Desiderata
- Defn. If C is a citation, <C> is the thing being cited.
- Desiderata
- For any citation C, <C> should remain fixed
- Any citable thing T should contain a citation C such that <C> = T
- There are a few more
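One way to read the first desideratum is that a citation must name a version as well as an object, and that every published version must be kept. The sketch below illustrates that reading only. `Citation`, `VersionedStore` and the second ligand name are hypothetical, and a real database would need far more than an in-memory list.

```python
import copy
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    database: str
    version: int   # version in which the cited object was read
    key: str       # identifies the cited object within that version

class VersionedStore:
    """Keeps every published version so that old citations stay resolvable."""

    def __init__(self, name):
        self.name = name
        self._versions = []          # _versions[i] holds version i + 1

    def publish(self, snapshot):
        # Deep copy, so later edits to the caller's data cannot alter history.
        self._versions.append(copy.deepcopy(snapshot))
        return len(self._versions)

    def cite(self, key):
        if not self._versions or key not in self._versions[-1]:
            raise KeyError(key)
        return Citation(self.name, len(self._versions), key)

    def resolve(self, citation):
        # Resolves against the cited version, not the latest one.
        return copy.deepcopy(self._versions[citation.version - 1][citation.key])

store = VersionedStore("IUPHAR")
store.publish({"MT1": ["luzindole"]})
c3 = store.cite("MT1")
store.publish({"MT1": ["luzindole", "hypothetical-ligand"]})
print(store.resolve(c3))                 # content as it was when cited
print(store.resolve(store.cite("MT1")))  # content of the current version
```

Keeping full copies of every version is the naive approach; the archiving slides later in the talk are about doing this without the storage cost.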
20First, we need to make the DB citable
- What are we citing?
- the web page?
- the underlying database?
- (neither is satisfactory)
- How do we make sure the citation is stable?
- we need to archive all versions of the database
21The understood database should be an XML
document
- Human readable: simple and clean XML
- Descriptive tags: no clever encodings
- Corresponds to the structure of the web pages.
- Claim: Will not require a Champollion or Ventris to decipher it 100 years from now.
22What we need
- Database archiving (Carwyn Edwards and Irini Fundulaki)
- How to archive successive versions of an XML document (stable citation)
- Naïve archiving of relational data
- Database publishing (Wenfei Fan and Xibei Jia)
- How to transform relational data to conform to a given schema (making it citable)
- How to do this efficiently
23There's more to citation
- The need to cite data at different levels
- Automatic generation of citations from the database content (rule-based system)
- Issues in data versioning
- Other benefits. Tony Harmar could
- export his database
- give proper credit to his contributors, and
- publish an (old-fashioned) book!
24A citation generating rule
A rule
- DB=IUPHAR, Version=v, Family=f, Receptor=r, Contributors=a, Editor=e, Date=d, DOI=i
- ?
- /Root /VersionNumber0v,Editor?e,
DOI.i, Date.d /Data /FamilyFamilyNamef
/Contributor-list/Contributora
/ReceptorReceptorName0r
What gets generated (example)
DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors=Debbie Hay, David R. Poyner, Editor=Tony Harmar, Date=Jan 2006, DOI=10.1234
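A rule of this kind can be read as a query that binds variables along paths in the XML document and then fills in a citation template. The following is a minimal sketch of that idea, not the rule engine itself. The element names are taken from the rule above, but their nesting, the `key=value` output format and the function name `cite_receptor` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document layout: tag names follow the rule on the slide,
# the nesting is an assumption of this sketch.
DOC = """<Root>
  <Version>
    <Number>11</Number><Editor>Tony Harmar</Editor>
    <DOI>10.1234</DOI><Date>Jan 2006</Date>
  </Version>
  <Data>
    <Family>
      <FamilyName>Calcitonin</FamilyName>
      <Contributor-list>
        <Contributor>Debbie Hay</Contributor>
        <Contributor>David R. Poyner</Contributor>
      </Contributor-list>
      <Receptor><ReceptorName>CALCR</ReceptorName></Receptor>
    </Family>
  </Data>
</Root>"""

def cite_receptor(root, family_name, receptor_name):
    """Build a citation for one receptor, or return None if it is not present."""
    version = root.find("Version")
    for family in root.findall("Data/Family"):
        if family.findtext("FamilyName") != family_name:
            continue
        names = [r.findtext("ReceptorName") for r in family.findall("Receptor")]
        if receptor_name not in names:
            continue
        contributors = [c.text for c in family.findall("Contributor-list/Contributor")]
        fields = [
            ("DB", "IUPHAR"),
            ("Version", version.findtext("Number")),
            ("Family", family_name),
            ("Receptor", receptor_name),
            ("Contributors", ", ".join(contributors)),
            ("Editor", version.findtext("Editor")),
            ("Date", version.findtext("Date")),
            ("DOI", version.findtext("DOI")),
        ]
        return ", ".join(f"{name}={value}" for name, value in fields)
    return None

citation = cite_receptor(ET.fromstring(DOC), "Calcitonin", "CALCR")
print(citation)
```

Because the contributors are attached to the family, the same document can yield citations at different levels (family or receptor) that credit the right people.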
25How do we Build Archival Databases? (Khanna, Tajima, Tan)
- Many scientific databases keep archives. It's important to preserve the state of knowledge as it was in the past
- Archive frequently: space consuming
- Archive infrequently: delay in getting recent information published.
26Swissprot
- 6000 entries added annually
- relatively little overwriting
- entries grow 10% annually
- release every 4 months
Total size of all versions 20 x size of most
recent version
27Online Mendelian Inheritance in Man
- Printed editions stopped in 1998
- Updated daily!
28OMIM vs. Swissprot
- Both valuable curated databases
- Similar gross structure -- sequence of entries, each with internal structure
- Swissprot
- All past versions available
- Slow release -- every 3-4 months
- OMIM
- Past versions unavailable
- Rapid release -- every day (or more often)
29Why not use diff?
- Diff currently used for archival part of CVS
- Tree diffs have not yet come to market
- Line diffs sometimes work well on formatted XML
- Diffs do not preserve object-hood
- Expensive to unwind 365 diffs
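The last point can be made concrete with a toy delta store. This is a sketch under simplifying assumptions: versions are flat dictionaries of keyed entries rather than text files, and the deltas are key-level changes standing in for line diffs. All names here are hypothetical.

```python
DELETED = object()  # marker for a key removed in the newer version

def make_delta(old, new):
    """Forward delta: changed or added keys map to new values, removed keys to DELETED."""
    delta = {k: v for k, v in new.items() if k not in old or old[k] != v}
    delta.update({k: DELETED for k in old if k not in new})
    return delta

def apply_delta(version, delta):
    result = dict(version)
    for key, value in delta.items():
        if value is DELETED:
            result.pop(key, None)
        else:
            result[key] = value
    return result

def checkout(base, deltas, n):
    """Rebuild version n by replaying the first n deltas; also report how many were applied."""
    version, applied = base, 0
    for delta in deltas[:n]:
        version = apply_delta(version, delta)
        applied += 1
    return version, applied

# Hypothetical data: day 0 plus 365 daily updates to one of two entries.
versions = [{"entry-1": f"text as of day {d}", "entry-2": "unchanged"} for d in range(366)]
deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]

latest, applied = checkout(versions[0], deltas, 365)
print(applied, latest["entry-1"])
```

Note also that the delta chain says nothing directly about the history of a single entry: to learn when `entry-1` changed, every delta has to be inspected, which is the sense in which diffs do not preserve object-hood.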
30A Sequence of Versions
31Pushing time down
This relies on a deterministic / keyed model
Driscoll, Sarnak, Sleator, Tarjan: Making Data
Structures Persistent.
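A minimal sketch of the idea, assuming a keyed model in which each version is a nested dictionary with string leaves: all versions are merged into one tree, and each node records the versions in which it was present. The names `merge` and `snapshot` are hypothetical. Storing a version set on every node is a simplification; a space-conscious archive could omit a child's set whenever it equals its parent's.

```python
def merge(archive, version, t):
    """Merge one keyed version (nested dicts, string leaves) into the archive as version t."""
    for key, value in version.items():
        node = archive.setdefault(key, {"times": set(), "values": {}, "children": {}})
        node["times"].add(t)                     # this key exists in version t
        if isinstance(value, dict):
            merge(node["children"], value, t)    # push the timestamp down the tree
        else:
            node["values"].setdefault(value, set()).add(t)
    return archive

def snapshot(archive, t):
    """Recover version t from the merged archive."""
    out = {}
    for key, node in archive.items():
        if t not in node["times"]:
            continue
        leaf = [v for v, times in node["values"].items() if t in times]
        out[key] = leaf[0] if leaf else snapshot(node["children"], t)
    return out

# Three hypothetical versions of a small keyed database.
versions = [
    {"entry-1": {"title": "first title"}},
    {"entry-1": {"title": "first title"}, "entry-2": {"title": "another"}},
    {"entry-1": {"title": "revised title"}, "entry-2": {"title": "another"}},
]

archive = {}
for t, version in enumerate(versions, start=1):
    merge(archive, version, t)

print(snapshot(archive, 2))
```

Unchanged entries are stored once however many versions they appear in, and each entry keeps its identity across versions because it is matched by key rather than by position.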
32An initial experiment
- Recorded all OMIM versions for about 14 weeks (100 of them)
- XML-ized all of them
- Combined into archive XML format file by pushing time down.
- Also recorded diffs between versions
- Did the same thing for the last 20 available versions of Swissprot
33 100 days of OMIM
- Uncompressed
- Archive size is
- ? 1.01 times diff repository size
- ? 1.04 times size of largest version
- Compressed
- archive size between 0.94 and 1 times compressed diff repository size
- gzip - unix compression tool
- XMill - XML compression tool
34 5 years of Swissprot
- Uncompressed
- Archive size is
- ? 1.08 times diff repository size
- ? 1.92 times size of largest version
- Compressed
- archive between 0.59 and 1 times compressed diff
repository size
35The Bottom Line
- Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
- Retrieval is a linear scan
- Works well with compression: to less than 30% of current file.
- Archive as often as you like! (Almost)
- Permits temporal queries on objects
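A temporal query asks how one object changed over time, rather than what the whole database looked like at one time. The sketch below assumes a hypothetical archive fragment in which each field value is tagged with the set of versions in which it held. `to_intervals`, `history` and the sample data are illustrative only.

```python
def to_intervals(versions):
    """Collapse a set of version numbers into sorted (first, last) runs."""
    runs = []
    for v in sorted(versions):
        if runs and v == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], v)   # extend the current run
        else:
            runs.append((v, v))           # start a new run
    return runs

# Hypothetical archive fragment: entry -> field -> value -> versions in which it held.
archive = {
    "entry-1": {
        "title": {"old title": {1, 2, 3}, "new title": {4, 5}},
        "status": {"draft": {1}, "reviewed": {2, 3, 4, 5}},
    }
}

def history(archive, entry, field):
    """List (first_version, last_version, value) for one field of one entry, oldest first."""
    rows = []
    for value, versions in archive[entry][field].items():
        for first, last in to_intervals(versions):
            rows.append((first, last, value))
    return sorted(rows)

print(history(archive, "entry-1", "title"))
```

With a chain of diffs the same question would require replaying every version; here it is answered by reading one node of the archive.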
36How could libraries help?
- Storage
- Redundant
- Persistent
- Distributed
- Readable by people
- Clear standards for citation
- Historical record (old data is useful)
- Well understood ownership/IP
- Storage
- Single-source
- Volatile
- Centralised
- Internal DBMS format
- No standards for citation
- No historical record
- Mind-boggling legal issues
37Some tentative suggestions
- Economics: is open access the right model for curated databases? (Broaden subscriptions?)
- Service
- Legal/IP consulting
- Dissemination
- Archiving
- Technical
- Archival software (LOCKSS for databases?)
- Publishing/dissemination software
- Domain specific publishing (Edina, so why not
IUPHAR?)
38Thank you
- for allowing me to use the library for
scholarly(?) purposes!