Title: Peter Buneman


1
Curated Databases: Can libraries be of help?
Peter Buneman
Laboratory for Foundations of Computer Science, School of Informatics,
University of Edinburgh, and Digital Curation Centre
With help from: Rajendra Bose, James Cheney, Carwyn Edwards, Irini Fundulaki,
Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis
2
Traditional Repositories
Are they good as repositories?
What about Alexandria, Baghdad, Seville, Cotton,
Louvain?
3
Traditional Repositories
Are they good for access?
4
Claim
  • In building centralised repositories we may
    • make our data less accessible
      • we cannot dictate how people use it
    • threaten its survival
      • what guarantees do we have that the repositories
        will continue to be supported?

5
A change for the better?
  • Storage in 20th century libraries
    • Redundant
    • Persistent
    • Distributed
    • Readable by people
    • Clear standards for citation
    • Historical record (old data is useful)
    • Well understood ownership/IP
  • Storage in today's databases
    • Single-source
    • Volatile
    • Centralised
    • Internal DBMS format
    • No standards for citation
    • No historical record
    • Mind-boggling legal issues

20th century libraries did some things better!
6
What is Digital Curation?
  • Preserving
    • the process of preserving digital data for future
      use, once it has been created
  • or
  • Publishing
    • the process of maintaining and updating a body of
      knowledge for current use

7
  • Examples of preserving
    • On-line books, journals, ...
    • Digitised artifacts (manuscripts, paintings,
      music)
    • Experimental data, results of analyses and
      simulations
  • Examples of publishing
    • Reference material (dictionaries, encyclopedias,
      gazetteers, ...)
    • On-line scientific databases
    • Metadata for preservation: catalogs, digital
      libraries, ontologies

8
The role of libraries
  • Libraries have focussed recently on the
    preservation aspects of digital data
  • Less so on the publishing aspects, even though they
    traditionally played an important role in
    dissemination. Why?
  • Publishing now takes place through curated
    databases

9
Curated databases are about organisation and
annotation
ID   11SB_CUCMA     STANDARD;      PRT;   480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 AND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC       DISULFIDE BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR   EMBL; M36407; G167492; -.
DR   PIR; S00366; FWPU1B.
DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW   SEED STORAGE PROTEIN; SIGNAL.
FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
FT   MOD_RES      22     22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124    303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT     27     27       S -> E (IN REF. 2).
FT   CONFLICT     30     30       E -> S (IN REF. 2).
SQ   SEQUENCE   480 AA;  54625 MW;  D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH
     RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ
     CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV
     PAGVSHWMYN RGQSDLVLIV FADTRNVANQ IDPYLRKFYL
     AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
     EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE
     ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF
     NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
     TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI
     PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM
     LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//
CIA World Factbook
Swissprot
10
Why curated databases matter
  • Nearly all our reference works are now curated
    databases
  • Upwards of 800 databases in molecular biology
    alone
  • Databases are now
    • a form of publication
    • a means of communication (via annotation)

11
The cost of data
[Figure: the cost of data, per byte]
12
Some of the CS issues in Curated DBs
  • Publishing data, data exchange, integration and
    transformation (interoperability)
  • Annotation
  • Provenance
  • Archiving/preservation
  • Citation
  • Security
  • Data Quality

Other (related) issues
  • Legal, appraisal, economic (open access?),
    organisational, education

13
Example: IUPHAR database
IUPHAR
  • Standard curated database (one of the 800 on the
    nucleic acid list)
  • Labour-intensive (100 contributors)
  • Valuable (supported by drug companies)
  • Simple, clean structure

DCC
50m
14
How do you cite something in a database?
  • Many scientific databases ask you to cite them,
    but...
    • they don't tell you how, or
    • they tell you to give the URL, or
    • they tell you to cite a paper written about the
      database.

Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of Illinois
Cooperative Extension Service, Illinet Department; [updated 2000 Nov 28; cited 2001 Apr 25].
Diabetes mellitus lesson; [about 1 screen]. Available from:
http://www.aces.uiuc.edu/necd/inter2_search.cgi?ind=854148396

NLM Recommended Formats for Bibliographic Citation. Internet Supplement. NLM Technical
report, Bethesda, MD 20894, July 2001.
15
The population of Monaco?
Year  Population  Note                  Mark  Source
2004  35,000                                  http://www.cybevasion.fr/tourisme/monaco.html
2004  33,300                                  http://www.internetworldstats.com/europa2.htm
2004  32,270      (July 2004 est.)            http://www.cia.gov/cia/publications/factbook/geos/mn.html
2004  32,000                                  http://www.studentsoftheworld.info/pageinfo_pays.php3?Pays=MCO
2004  29,972                                  http://worldatlas.com/webimage/countrys/europe/mc.htm
2003  32,130      (July 2003 est.)            http://www.greenfacts.org/studies/climate_change/index.htm
2003  32,130      (mid 2003)                  http://www.infoplease.com/ipa/A0004379.html
2003  32,000      (July 2003 estimate)        http://www.gesource.ac.uk/worldguide/html/962_people.html
2003  30,000                                  http://www.tlfq.ulaval.ca/axl/europe/monaco.htm
2002  31,987      (July 2002 est.)      A     http://www.greekorthodoxchurch.org/wfb2002/monaco/monaco_people.html
2001  31,842      (July 2001 est.)            http://wonderclub.com/Atlas/mccia.htm
2001  31,842      (July 2001 est.)            http://www.worldfactsandfigures.com/countries/monaco.php
2001  31,842      (July 2001 est.)      A     http://www.workmall.com/wfb2001/monaco/monaco_people.html
2000  32,500      (est 2000)                  http://www.atlapedia.com/online/countries/monaco.htm
2000  32,020      (C 2000-05-03)              http://www.citypopulation.de/Monaco.html
2000  32,020      (2000)                      http://www.worldtravelguide.net/data/mco/mco.asp
2000  32,020      (2000 census!)              http://www.state.gov/r/pa/ei/bgn/3397.htm
2000  31,693      (July 2000 est.)            http://geography.about.com/library/cia/blcmonaco.htm
2000  31,693      (July 2000 est.)            http://www.abacci.com/atlas/demography.asp?countryID=269
2000  31,693      (July 2000 est.)      ?     http://www.mapquest.com/atlas/?region=monaco
2000  31,842                                  http://www.fact-index.com/m/mo/monaco.html
2000  31,842                                  http://en.wikipedia.org/wiki/Monaco
2000  31,700      (e2000m)                    http://www.library.uu.nl/wesp/populstat/Europe/monacoc.htm
1999  32,149      (July 1999 est.)            http://www.photius.com/wfb1999/monaco/monaco_people.html
1999  32,000                                  http://geography.about.com/library/weekly/aa012599.htm
1990  29,972      (1990 census)               http://www.monte-carlo.mc/us/presentation/keyfigur/
16
What is a citation?
  • Bard JB and Davies JA. Development, Databases and
    the Internet. Bioessays. 1995 Nov; 17(11):999-1001.
  • Location and descriptive information

Ann. Phys., Lpz 18 639-641
Nature, 171, 737-738
(We often want more than location)
17
Citations and Persistent Identifiers
  • Persistent identifiers (PIs) provide a long-term
    retrieval mechanism
  • Citations carry
    • descriptive information
    • location information (useful once you have
      retrieved the object)
  • No canonical citation
  • Is the thing retrieved by a PI supposed to be
    static?
  • PIs should be included in citations

18
The IUPHAR database
1. The IUPHAR database (C1) contains no information about Ginandtonicin.
2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.
3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for
   receptor MT1.
19
Citation Desiderata
  • Defn. If C is a citation, <C> is the thing being
    cited.
  • Desiderata
    • For any citation C, <C> should remain fixed
    • Any citable thing T should contain a citation C
      such that <C> = T
    • There are a few more
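
The two desiderata listed on this slide can be illustrated with a minimal Python sketch. It assumes that a citation names a database, a version and a key, and that published versions are never overwritten. The class names, the `publish`/`resolve` methods and the "six ligands" record are hypothetical and are not part of the IUPHAR system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation: database name, version number, record key."""
    db: str
    version: int
    key: str


class VersionedStore:
    """Keeps every published version so that old citations stay resolvable."""

    def __init__(self, db):
        self.db = db
        self._versions = {}  # version number -> {key: record}

    def publish(self, version, records):
        if version in self._versions:
            raise ValueError("published versions are immutable")
        # Each citable record carries its own citation (second desideratum).
        self._versions[version] = {
            key: {"citation": Citation(self.db, version, key), "content": content}
            for key, content in records.items()
        }

    def resolve(self, citation):
        """Return the thing being cited, i.e. <C> for citation C."""
        return self._versions[citation.version][citation.key]


store = VersionedStore("IUPHAR")
store.publish(1, {"MT1": "five ligands"})
c = Citation("IUPHAR", 1, "MT1")

before = store.resolve(c)
store.publish(2, {"MT1": "six ligands"})  # hypothetical later release
after = store.resolve(c)
```

Here `before` and `after` are the same record, because the citation names a version and that version is archived. This is the reason the following slides argue for archiving all versions of the database.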

20
First, we need to make the DB citable
  • What are we citing?
    • the web page?
    • the underlying database?
    • (neither is satisfactory)
  • How do we make sure the citation is stable?
    • we need to archive all versions of the database

21
The understood database should be an XML
document
  • Human readable: simple and clean XML
  • Descriptive tags: no clever encodings
  • Corresponds to the structure of the web pages.
  • Claim: will not require a Champollion or Ventris
    to decipher it 100 years from now.

22
What we need
  • Database archiving (Carwyn Edwards and Irini
    Fundulaki)
    • How to archive successive versions of an XML
      document (stable citation)
    • Naïve archiving of relational data
  • Database publishing (Wenfei Fan and Xibei Jia)
    • How to transform relational data to conform to a
      given schema (making it citable)
    • How to do this efficiently

23
There's more to citation
  • The need to cite data at different levels
  • Automatic generation of citations from the
    database content (rule-based system)
  • Issues in data versioning
  • Other benefits. Tony Harmar could
    • export his database
    • give proper credit to his contributors, and
    • publish an (old-fashioned) book!

24
A citation generating rule
A rule
  • DB=IUPHAR, Version=v, Family=f, Receptor=r,
    Contributors=a, Editor=e, Date=d, DOI=i
  • ←
  • /Root /Version[Number=v, Editor=e,
    DOI=i, Date=d] /Data /Family[FamilyName=f]
    /Contributor-list/Contributor=a
    /Receptor[ReceptorName=r]

What gets generated (example)
DB=IUPHAR, Version=11, Family=Calcitonin,
Receptor=CALCR, Contributors=Debbie Hay, David
R. Poyner, Editor=Tony Harmar, Date=Jan
2006, DOI=10.1234
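
As a rough illustration of how such a rule might be evaluated, the following Python sketch walks an XML document and fills in the citation fields. The element names follow the paths in the rule, but the exact document layout (for example, Receptor nested inside Family) is an assumption, and `cite_receptor` and `format_citation` are hypothetical helpers rather than the rule engine described in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the layout is assumed from the paths in the rule.
XML = """
<Root>
  <Version>
    <Number>11</Number>
    <Editor>Tony Harmar</Editor>
    <DOI>10.1234</DOI>
    <Date>Jan 2006</Date>
  </Version>
  <Data>
    <Family>
      <FamilyName>Calcitonin</FamilyName>
      <Contributor-list>
        <Contributor>Debbie Hay</Contributor>
        <Contributor>David R. Poyner</Contributor>
      </Contributor-list>
      <Receptor><ReceptorName>CALCR</ReceptorName></Receptor>
    </Family>
  </Data>
</Root>
""".strip()


def cite_receptor(root, family_name, receptor_name):
    """Bind the rule's variables by following its paths from the root."""
    version = root.find("Version")
    for family in root.findall("Data/Family"):
        if family.findtext("FamilyName") != family_name:
            continue
        for receptor in family.findall("Receptor"):
            if receptor.findtext("ReceptorName") == receptor_name:
                return {
                    "DB": "IUPHAR",
                    "Version": version.findtext("Number"),
                    "Family": family_name,
                    "Receptor": receptor_name,
                    "Contributors": [
                        c.text for c in family.findall("Contributor-list/Contributor")
                    ],
                    "Editor": version.findtext("Editor"),
                    "Date": version.findtext("Date"),
                    "DOI": version.findtext("DOI"),
                }
    raise KeyError((family_name, receptor_name))


def format_citation(fields):
    """Render the fields in the order used on the slide."""
    parts = []
    for name, value in fields.items():
        if isinstance(value, list):
            value = ", ".join(value)
        parts.append(f"{name}={value}")
    return ", ".join(parts)


citation = cite_receptor(ET.fromstring(XML), "Calcitonin", "CALCR")
text = format_citation(citation)
```

Because the fields come from the data itself, the same rule could produce a citation for any receptor in any family, which is one way to read the slide's point about citing data at different levels.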
25
How do we Build Archival Databases? (Khanna,
Tajima, Tan)
  • Many scientific databases keep archives. It's
    important to preserve the state of knowledge as
    it was in the past
  • Archive frequently: space consuming
  • Archive infrequently: delay in getting recent
    information published.

26
Swissprot
  • 6000 entries added annually
  • relatively little overwriting
  • entries grow 10% annually
  • release every 4 months

Total size of all versions: 20 x size of most
recent version
27
Online Mendelian Inheritance in Man
  • Printed editions stopped in 1998
  • Updated daily!

28
OMIM vs. Swissprot
  • Both valuable curated databases
  • Similar gross structure: a sequence of entries,
    each with internal structure
  • Swissprot
    • All past versions available
    • Slow release: every 3-4 months
  • OMIM
    • Past versions unavailable
    • Rapid release: every day (or more often)

29
Why not use diff?
  • Diff currently used for archival part of CVS
  • Tree diffs have not yet come to market
  • Line diffs sometimes work well on formatted XML
  • Diffs do not preserve object-hood
  • Expensive to unwind 365 diffs
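
The cost of unwinding a chain of diffs can be seen in a small Python sketch. It assumes forward line deltas from the first version, built with the standard `difflib.SequenceMatcher`; the helper names and the sample entries are hypothetical, and CVS itself stores its deltas differently.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the changed regions needed to turn `old` into `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old, delta):
    """Replay one delta: copy unchanged lines, splice in the replacements."""
    out, pos = [], 0
    for i1, i2, replacement in delta:
        out.extend(old[pos:i1])
        out.extend(replacement)
        pos = i2
    out.extend(old[pos:])
    return out


def retrieve(first, deltas, n):
    """Rebuild version n by applying the first n deltas in order."""
    lines = first
    for delta in deltas[:n]:
        lines = apply_delta(lines, delta)
    return lines


# Hypothetical versions of a small line-formatted XML file.
versions = [
    ["<entry id='1'>A</entry>", "<entry id='2'>B</entry>"],
    ["<entry id='1'>A</entry>", "<entry id='2'>B2</entry>"],
    ["<entry id='1'>A</entry>", "<entry id='2'>B2</entry>", "<entry id='3'>C</entry>"],
]
deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]
latest = retrieve(versions[0], deltas, len(deltas))
```

Retrieving version n takes n delta applications, so a daily archive needs hundreds of steps to reach a version a year away. The deltas also know nothing about which entry a changed line belongs to, which is the object-hood problem noted on the slide.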

30
A Sequence of Versions
31
Pushing time down
This relies on a deterministic / keyed model.
Driscoll, Sarnak, Sleator, Tarjan: Making Data
Structures Persistent.
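
A minimal Python sketch of the idea, under simplifying assumptions: each version is a nested dictionary whose keys identify elements, all versions are merged into one tree, and each node records the versions in which it occurs. For simplicity the sketch stores a version set on every node; `stored_timestamps` then lists the only places where a timestamp differs from the parent's, which is what an archive with time pushed down would actually need to store. All function names and sample data are hypothetical.

```python
def new_node():
    return {"versions": set(), "children": {}, "values": {}}


def merge(node, data, version):
    """Merge one version of keyed data into the archive."""
    node["versions"].add(version)
    if isinstance(data, dict):
        for key, child in data.items():
            merge(node["children"].setdefault(key, new_node()), child, version)
    else:
        node["values"].setdefault(data, set()).add(version)


def snapshot(node, version):
    """Rebuild the data as it was at `version` by scanning the archive."""
    for value, versions in node["values"].items():
        if version in versions:
            return value
    return {
        key: snapshot(child, version)
        for key, child in node["children"].items()
        if version in child["versions"]
    }


def stored_timestamps(node, inherited=None, path=()):
    """List timestamps only where they differ from the enclosing node's."""
    out = []
    if node["versions"] != inherited:
        out.append((path, sorted(node["versions"])))
    for value, versions in node["values"].items():
        if versions != node["versions"]:
            out.append((path + (f"={value}",), sorted(versions)))
    for key, child in node["children"].items():
        out.extend(stored_timestamps(child, node["versions"], path + (key,)))
    return out


def history(node, path):
    """Temporal query: every value an element has had, with its versions."""
    for key in path:
        node = node["children"][key]
    return {value: sorted(v) for value, v in node["values"].items()}


# Hypothetical versions of a tiny keyed database.
versions = {
    1: {"CALCR": {"family": "Calcitonin"}, "MT1": {"family": "Melatonin", "ligands": 4}},
    2: {"CALCR": {"family": "Calcitonin"}, "MT1": {"family": "Melatonin", "ligands": 5}},
    3: {"MT1": {"family": "Melatonin", "ligands": 5}},
}
archive = new_node()
for number, data in sorted(versions.items()):
    merge(archive, data, number)

explicit = stored_timestamps(archive)
ligand_history = history(archive, ("MT1", "ligands"))
```

Unchanged elements appear once in the archive however many versions contain them, and any version comes back from a single scan. Because elements are matched by key rather than by line position, the history of one object can be read directly from its node.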
32
An initial experiment
  • Recorded all OMIM versions for about 14 weeks
    (100 of them)
  • XML-ized all of them
  • Combined them into a single archive file in XML
    format by pushing time down
  • Also recorded diffs between versions
  • Did the same thing for the last 20
    available versions of Swissprot

33
100 days of OMIM
  • Uncompressed, archive size is
    • about 1.01 times diff repository size
    • about 1.04 times size of largest version
  • Compressed
    • archive size is between 0.94 and 1 times compressed
      diff repository size
    • gzip: Unix compression tool
    • XMill: XML compression tool

34
5 years of Swissprot
  • Uncompressed, archive size is
    • about 1.08 times diff repository size
    • about 1.92 times size of largest version
  • Compressed
    • archive size is between 0.59 and 1 times compressed
      diff repository size

35
The Bottom Line
  • Can archive a whole year of Swissprot or OMIM
    with < 15% overhead (relative to the size of the current file)
  • Retrieval is a linear scan
  • Works well with compression: to less than 30% of
    current file.
  • Archive as often as you like! (Almost)
  • Permits temporal queries on objects

36
How could libraries help?
  • Storage in 20th century libraries
    • Redundant
    • Persistent
    • Distributed
    • Readable by people
    • Clear standards for citation
    • Historical record (old data is useful)
    • Well understood ownership/IP
  • Storage in today's databases
    • Single-source
    • Volatile
    • Centralised
    • Internal DBMS format
    • No standards for citation
    • No historical record
    • Mind-boggling legal issues

37
Some tentative suggestions
  • Economics: is open access the right model for
    curated databases? (Broaden subscriptions?)
  • Service
    • Legal/IP consulting
    • Dissemination
    • Archiving
  • Technical
    • Archival software (LOCKSS for databases?)
    • Publishing/dissemination software
    • Domain specific publishing (Edina, so why not
      IUPHAR?)

38
Thank you
  • for allowing me to use the library for
    scholarly(?) purposes!