Biological Database Systems - PowerPoint PPT Presentation

Provided by: denissh
Transcript and Presenter's Notes

Title: Biological Database Systems


1
Biological Database Systems
  • 8.1. Design issues specific to biological
    database systems (cont.)

2
Scientific requirements for design
  • Scientific culture requirements
  • Need to track the source of data
  • Need to accommodate the fuzziness of biology
  • Database should support these aspects
  • (in the case of a relational DBMS) most tables store
    information about your domain of interest, while
    several supplemental tables, plus some attributes
    of the informative tables, contain technical
    information (related to tracking, versioning,
    etc.)

3
Tracking the data sources
  • It is recommended to track the source of
    features
  • E.g., features may be entered by users (rather
    than obtained from source databases)
  • Different source databases may provide
    contradictory data/metadata
  • Lack of tracking the sources may create problems
  • Seq A is annotated as a kinase because of
    sequence similarity with Seq B
  • Seq B turns out not to be a kinase
  • Most probably Seq A has same basic structure as
    Seq B, but lacks kinase function
  • Seq C is annotated as a kinase because of
    similarity to Seq A
  • Function annotations cannot be trusted unless
    function transfers are traceable
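A minimal sketch of traceable function transfers, assuming a relational store. The table name annotation, its columns, and the helper dependents are hypothetical and not taken from the slides. Each annotation records whether it came from an experiment or from similarity to another sequence, so when Seq B turns out not to be a kinase, every annotation derived from it can be found.

```python
import sqlite3

# Hypothetical schema: every annotation records where it came from.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    seq_id      TEXT PRIMARY KEY,
    function    TEXT NOT NULL,
    source_type TEXT NOT NULL,                       -- 'experiment' or 'similarity'
    source_seq  TEXT REFERENCES annotation(seq_id)   -- set only for transfers
);
INSERT INTO annotation VALUES ('SeqB', 'kinase', 'experiment', NULL);
INSERT INTO annotation VALUES ('SeqA', 'kinase', 'similarity', 'SeqB');
INSERT INTO annotation VALUES ('SeqC', 'kinase', 'similarity', 'SeqA');
""")


def dependents(conn, seq_id):
    """Return sequences annotated, directly or indirectly, by transfer from seq_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(seq_id) AS (
            SELECT seq_id FROM annotation WHERE source_seq = ?
            UNION
            SELECT a.seq_id
            FROM annotation a JOIN chain c ON a.source_seq = c.seq_id
        )
        SELECT seq_id FROM chain ORDER BY seq_id
        """,
        (seq_id,),
    ).fetchall()
    return [r[0] for r in rows]


# Seq B is found not to be a kinase: list everything that must be re-checked.
to_review = dependents(conn, "SeqB")
```

Without the source_seq column, Seq C could not be linked back to the faulty annotation on Seq B.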

4
Tracking the data sources
  • It is crucial to be able to look up and evaluate
    the source reference
  • Science is incomplete
  • Your research may contradict the data in the
    database
  • Where is the error? Often both are right, since
    there is no full picture yet
  • Researchers need to return to original source and
    evaluate experimental steps
  • Gold standard is a peer-reviewed publication
  • As a rule, indexed in the PubMed database
  • But not always: there are other sources (e.g.,
    publications in other domains like chemistry,
    technical reports, webpages, company reports,
    etc.)

5
Tracking the data sources
  • Reference data is quite complex
  • For many tasks, it is enough to link to
    PubMed
  • The reference itself has to be stored if there is
    a need to run queries over references
  • E.g., find all features supported by works
    written by John D.
  • NCBI provides a reference in XML format (can be
    parsed and stored in your database)
  • Bibliographic Query Service (access to
    heterogeneous bibliographic databases,
    http://www.ebi.ac.uk/senger/openbqs)
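A minimal sketch of why the reference itself may need to be stored, assuming a relational layout. All table names, column names, and rows below are hypothetical and made up for illustration. With authors stored locally, the query "find all features supported by works written by John D." becomes a plain join, which a bare link to PubMed would not allow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reference (ref_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE reference_author (
    ref_id INTEGER NOT NULL REFERENCES reference(ref_id),
    author TEXT NOT NULL
);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature_reference (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    ref_id     INTEGER NOT NULL REFERENCES reference(ref_id)
);
-- Made-up rows for illustration only.
INSERT INTO reference VALUES (1, 'Example paper one'), (2, 'Example paper two');
INSERT INTO reference_author VALUES (1, 'John D.'), (1, 'Jane R.'), (2, 'Jane R.');
INSERT INTO feature VALUES (10, 'active site'), (11, 'binding site');
INSERT INTO feature_reference VALUES (10, 1), (11, 2);
""")


def features_by_author(conn, author):
    """Return names of features supported by at least one work of the given author."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.name
        FROM feature f
        JOIN feature_reference fr ON fr.feature_id = f.feature_id
        JOIN reference_author ra ON ra.ref_id = fr.ref_id
        WHERE ra.author = ?
        ORDER BY f.name
        """,
        (author,),
    ).fetchall()
    return [r[0] for r in rows]


john_features = features_by_author(conn, "John D.")
```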

6
Versioning
  • Very important concept in software development
  • Some public databases version their sequences
  • For example, RefSeq
  • Sequence is identified by an accession number and
    a version (e.g., NM_005842.2)
  • As a rule, only the latest version is available
  • You have to decide whether you need versioning
    and, if so, how to handle it in the database
  • Keep latest version only or all versions
  • In case of keeping all versions, do you associate
    different versions of the same sequence with each
    other?
  • What should happen to any metadata added to a
    sequence when its new version comes out?
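A minimal sketch of handling versioned identifiers of the accession.version form shown above (e.g., NM_005842.2). The function names split_accession and latest_versions are hypothetical. Splitting the identifier lets the database associate different versions of the same sequence through the shared accession, whichever retention policy is chosen.

```python
def split_accession(versioned_id):
    """Split an identifier such as 'NM_005842.2' into ('NM_005842', 2)."""
    accession, _, version = versioned_id.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {versioned_id!r}")
    return accession, int(version)


def latest_versions(versioned_ids):
    """Map each accession to the highest version number seen."""
    latest = {}
    for versioned_id in versioned_ids:
        accession, version = split_accession(versioned_id)
        if version > latest.get(accession, 0):
            latest[accession] = version
    return latest


parts = split_accession("NM_005842.2")
newest = latest_versions(["NM_005842.1", "NM_005842.2"])
```

Storing the accession and the version in separate columns also makes it easier to decide, per table, whether metadata attaches to one specific version or to the accession as a whole.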

7
Handling uncertainties
  • Fuzziness in biology: there will be an exception in
    almost any biological classification scheme
  • Make schemes user-extensible using lookup
    tables
  • Provide a comment field so that users can
    document the exceptions
  • Accommodate uncertainty in biological data
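A minimal sketch of a user-extensible classification, assuming a relational store. The tables feature_type and feature and the helper add_feature are hypothetical. The classification lives in a lookup table rather than in a fixed list inside the schema, and each feature carries a free-text comment for documenting exceptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_type (            -- lookup table: users may add rows
    type_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    type_id    INTEGER NOT NULL REFERENCES feature_type(type_id),
    comment    TEXT                    -- free text to document exceptions
);
INSERT INTO feature_type (name) VALUES ('exon'), ('intron');
""")


def add_feature(conn, type_name, comment=None):
    """Insert a feature, extending the lookup table if the type is new."""
    conn.execute(
        "INSERT OR IGNORE INTO feature_type (name) VALUES (?)", (type_name,)
    )
    type_id = conn.execute(
        "SELECT type_id FROM feature_type WHERE name = ?", (type_name,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO feature (type_id, comment) VALUES (?, ?)", (type_id, comment)
    )
    return cur.lastrowid


# A type that the original scheme did not anticipate (made-up example).
feature_id = add_feature(conn, "pseudo-exon", comment="made-up exception for illustration")
```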

8
Handling uncertainties
  • Uncertainty is associated with all scientific
    data
  • Imperfections in measurement techniques
  • Incomplete knowledge
  • Handling uncertainty depends on
  • Type of uncertainty
  • Requirements of the scientists who use the data
  • Ignoring uncertainty may corrupt your database
  • Scientific conclusions may be derived from data
    in database
  • Uncertainty in data will influence conclusions
  • If users cannot assess uncertainty of data,
    database becomes less valuable

9
Handling uncertainties: types
  • Uncertainty in quantitative data can be
    calculated
  • Store raw data and calculate on the fly
  • Store data and calculated error (e.g., average
    and standard deviation)
  • Uncertainty in qualitative data is more difficult
    to handle
  • Some types of experiments are inherently less
    certain than others
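A minimal sketch of the first option for quantitative data: store the raw replicate measurements and calculate the average and standard deviation on the fly. The function name summarize and the numbers are made up for illustration.

```python
import statistics


def summarize(measurements):
    """Return (mean, sample standard deviation) for replicate measurements."""
    if len(measurements) < 2:
        raise ValueError("need at least two replicates to estimate a standard deviation")
    return statistics.mean(measurements), statistics.stdev(measurements)


# Made-up replicate values for illustration only.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sd = summarize(raw)
```

Keeping the raw values allows a different error estimate to be computed later. Storing only the average and standard deviation saves space but fixes that choice.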

10
Handling uncertainties
  • Examples of bio-data with uncertainty
  • Protein-protein interactions
  • Large scale studies and individual studies have
    differing uncertainties
  • Biophysical measurements
  • Often include quantitative uncertainties
  • Protein function annotation
  • Large difference in uncertainty between
    experimental and computational annotations

11
Handling uncertainties: example
12
Handling uncertainties
  • The problem illustrated on the previous slide is
    not caused by the first annotation transfer
  • The problem is caused by the fact that a
    researcher using the data does not know that the
    annotation on protein 2 is less certain than the
    annotation on protein 1
  • Solution is to include this uncertainty in the
    data presented to a user

13
Handling uncertainties
  • Include it in the annotation text: annotate
    protein 2 as sugar kinase (by similarity)
  • GenBank and other sequence databases do it this way
  • Include information about the source of the
    annotation: classify the annotation on protein 1 as
    experimental and the annotation on protein 2 as
    computational (derived)
  • Gene Ontology includes evidence classifications
  • Link annotation directly to the supporting data
  • May be appropriate for database for lab/company
    that concentrates on protein annotations
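A minimal sketch combining the first two options above. The class names Evidence and Annotation and the method display are hypothetical. The annotation stores an evidence class, and the text shown to the user is derived from it, so a computational annotation is always marked as such.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Evidence(Enum):
    EXPERIMENTAL = "experimental"
    COMPUTATIONAL = "computational"


@dataclass
class Annotation:
    protein: str
    function: str
    evidence: Evidence
    derived_from: Optional[str] = None  # protein the annotation was transferred from

    def display(self):
        """Return the annotation text, flagging computational transfers."""
        if self.evidence is Evidence.COMPUTATIONAL:
            return f"{self.function} (by similarity)"
        return self.function


a1 = Annotation("protein 1", "sugar kinase", Evidence.EXPERIMENTAL)
a2 = Annotation("protein 2", "sugar kinase", Evidence.COMPUTATIONAL, derived_from="protein 1")
```

The derived_from field corresponds to the third option in a reduced form: it points at the supporting record rather than at the full supporting data.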

14
Handling uncertainties
With classification method
15
Handling uncertainties
With supporting evidence
16
Biological database design and implementation
  • Experience of building biological databases from
    the people behind the Ensembl project (producing
    and maintaining automatic annotation on selected
    eukaryotic genomes), http://www.ensembl.org
  • See "Biological database design and
    implementation" by Birney & Clamp, Briefings in
    Bioinformatics, 5(1):31-38, 2004
  • Best practices (very practical) to help people
    design/build bio-databases
  • Note: Birney & Clamp didn't really distinguish
    between database systems and software development

17
Biological database design and implementation
  • Unique problems for biological databases
  • There is no true biological interpretation of
    data stored in database
  • Interpretation can change over time
  • Discovering new relationships between some
    aspects of the data is an important part of
    motivation to store information in database
  • Technically, experiment-based data can be
    considered invariable (e.g., read-outs from
    microarray chips), but in general there is
    always some agreement about where data ends and
    data interpretation starts (e.g., microarray
    data are manipulated before storing)

18
Biological database design and implementation
  • Unique problems for biological databases
  • Lack of people with good understanding of both
    biology and programming/database systems
  • Hopefully, there will be more such people
  • At the least, people who are involved in database
    building should have an appreciation of the other
    field

19
Biological database design and implementation
general points
  • Use Source Code Control
  • Concurrent Versions System (CVS) keeps track of
    all work and all changes in a set of files,
    typically the implementation of a software
    project, and allows several developers to
    collaborate
  • Very useful even for only one person, essential
    for group of two and more
  • Use relational databases (MySQL, PostgreSQL,
    SQLite)
  • Do not store your data in text files (exceptions:
    images, which often require specialized tools, and
    sequence files, since many tools are tied to
    specific formats)
  • Just study them

20
Biological database design and implementation
general points
  • Be wary of cutting-edge technologies
  • Use only if you really have to
  • More buggy because it is new and, hence, was not
    tested/used in different scenarios
  • (!) there will be only a small number of people
    (and generally only one in your group) who
    understand this technology well
  • Advice
  • Programming languages: C, C++, Java, Perl,
    Python, Lisp (?)
  • Database systems: relational databases, SQL

21
Biological database design and implementation
general points
  • Avoid mixing database development with
    CS/bioinformatics research
  • In particular, CS-oriented people may be eager to
    make an advance in CS using biological data as an
    exemplar
  • Concentrate on a particular domain
  • Do not try to integrate as much biological data
    as possible
  • Make sure that no one or only a few groups
    provide a similar resource
  • In particular, people with a biological background
    tend to sketch out an overly detailed schema that
    incorporates almost everything in a topic of
    interest (see slides for Lec. 7: scope
    boundaries)
  • Better to find pieces of information that are
    missing in other places and focus on them

22
Biological database design and implementation
detailed advice
  • Database project should have
  • CVS repository
  • Mailing lists for all group members, for
    programming developers, for public
  • Well-backed-up production database instance
  • Relatively isolated web-viewed database instance
  • Development areas (place where developers can
    create and destroy development databases and
    develop new code)

23
Biological database design and implementation
detailed advice
  • People behind database (it is rare for a single
    person to do three or more roles)
  • Bioinformatician (understands what the data is
    and its biological meaning)
  • Web developer
  • Software developer (understands the precise
    database/software implementation and writes the
    required software)
  • System administrator (provides support during
    system specific problems)

24
Biological database design and implementation
code and schema design
  • Decide on a principal data model description
  • SQL DDL (i.e., CREATE TABLE statements)
  • XML DTD or XML Schema
  • UML
  • Try to focus on the core aspects of data or
    analysis
  • Invest time into data model design
  • Don't overdo it (one month is likely to be enough)
  • Almost impossible to understand consequences of
    the design until one tries it out
  • Design requirements are likely to be changed
  • Use two types of tracking identifiers
  • Internal (used within the database and related
    software)
  • Published (external)
  • When writing software
  • Use object-oriented approach (aim for simplicity
    and code reuse)
  • Test the code intensively
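A minimal sketch of the two types of tracking identifiers, assuming a relational schema. The tables gene and transcript, the column names, and the identifier DEMO000001 are hypothetical. The internal integer key is used for joins inside the database and software, while the published identifier is the only one exposed to users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,     -- internal: used for joins, may change on reload
    stable_id TEXT NOT NULL UNIQUE,    -- published: shown to users, kept stable
    name      TEXT
);
CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    gene_id       INTEGER NOT NULL REFERENCES gene(gene_id)
);
""")
# Made-up rows for illustration only.
conn.execute("INSERT INTO gene (stable_id, name) VALUES ('DEMO000001', 'example gene')")
conn.execute(
    "INSERT INTO transcript (gene_id) "
    "SELECT gene_id FROM gene WHERE stable_id = 'DEMO000001'"
)


def transcript_count(conn, stable_id):
    """Count transcripts of a gene, addressed by its published identifier."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM transcript t
        JOIN gene g ON g.gene_id = t.gene_id
        WHERE g.stable_id = ?
        """,
        (stable_id,),
    ).fetchone()
    return row[0]


n_transcripts = transcript_count(conn, "DEMO000001")
```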

25
What makes a good database?
  • Annual Nucleic Acids Research Database Issues
  • A few pieces of advice from the editor of NAR
    database issue
  • See "What Makes a Good Database?" by Bateman, NAR,
    35(Database issue):D1-D2, 2007
  • Note: very broad suggestions on how to get your
    resource included in the database issue (i.e.,
    it automatically means the database should be
    original in some way)

26
What makes a good database?
  • Data considerations
  • When thinking of a name for your database, check
    if anyone else is using that name already (check
    search engines with your database name; you may
    be surprised at what it means in other languages)
  • Make your data as comprehensive as possible (try
    to avoid making the data collection
    overspecialized; e.g., a database of promoters
    for RNA genes in a single organism is not going
    to have a wide appeal, but a database of
    promoters for RNA genes in all organisms would be
    of wide interest and utility)
  • Attribute the original sources of derived data
  • Make sure that you are not breaching any license
    terms by redistributing data
  • Include estimates of confidence in the data items
    if applicable
  • Make data available for bulk download as flat
    files or relational database tables with
    associated documentation
  • Web services are becoming popular ways to make
    databases programmatically available (i.e.,
    provide users with both user-friendly (via
    website) and program-friendly interfaces)
  • Allow users to provide feedback on your data and
    submit new data
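A minimal sketch of the bulk-download advice above: dumping one relational table as a tab-separated flat file with a header row. The function name dump_table_tsv and the demo table are hypothetical, and the table name is assumed to come from trusted code, not from user input.

```python
import csv
import io
import sqlite3


def dump_table_tsv(conn, table, out):
    """Write every row of one table to `out` as tab-separated text with a header."""
    # The table name is interpolated directly, so it must be a trusted value.
    cur = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)


# Made-up table and row for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE promoter (promoter_id INTEGER PRIMARY KEY, organism TEXT)")
conn.execute("INSERT INTO promoter VALUES (1, 'yeast')")

buffer = io.StringIO()
dump_table_tsv(conn, "promoter", buffer)
```

In practice the output would go to a file placed next to the documentation that describes each column.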

27
What makes a good database?
  • Web interface considerations
  • Document your web interface and data (include a
    short tutorial if possible)
  • Include a brief statement of what your database
    does on the front page
  • Make sure that users can always link back to the
    home page
  • Do not make all your links pop up in new windows
  • Include example sequence/identifiers/keywords for
    every entry box on a query form
  • Keep search forms as simple as possible (most
    users will not want to do complex queries of your
    data; keep advanced searches on a separate linked
    page)
  • Allow users to browse the data without searching
    for a specific entry (e.g., provide alphabetical
    lists of entries, or entries sorted by function)
  • Do not use a jargon term when a well-known term
    already exists

28
What makes a good database?
  • Web interface considerations
  • Make it obvious what information will be on a
    linked page and make clickable icons convey their
    function pictorially
  • Get a domain name for your website (URLs to
    specific IP addresses/ports are unlikely to stand
    the test of time)
  • Test your website on a range of browsers and on a
    range of operating systems and make sure that
    external users can access all the content
  • Get feedback from your user community before
    submission
  • Slow server response times will make your
    database unusable (you can help this by printing
    out at least some message to let users know when
    they may expect to obtain results)
  • Look at websites devoted to creating good web
    pages (also have a look at the web pages
    describing web page faults to avoid, such as
    http://www.webpagesthatsuck.com/)

29
Concluding remarks
  • Successful design of biological databases
    requires understanding of biology and database
    principles
  • Will not necessarily have both in same person
  • Work in teams
  • Respect the complexity of the field that is not
    your own
  • For most research-scale databases, performance
    will be adequate without any tricks
  • Use relational DBMSs
  • Denormalizing is unlikely to be needed
    (denormalize only as a last resort)
  • There is no right (or best) answer for many of
    the design issues
  • Decisions often depend on database requirements
  • Field is too young so there is no consensus on
    best way to handle data
  • Don't get stuck in analysis paralysis: take your
    best shot, and learn from how it works (or
    doesn't work)

30
  • References
  • "Biological database design and implementation" by
    Birney & Clamp, Briefings in Bioinformatics,
    5(1):31-38, 2004
  • "What Makes a Good Database?" by Bateman, NAR,
    35(Database issue):D1-D2, 2007