Biological Database Systems


Provided by: denissh


1
Biological Database Systems

8.1. Design issues specific to biological
database systems (cont.)

2
Scientific requirements to design

Scientific culture requirements
Need to track the source of data
Need to accommodate the fuzziness of biology
Database should support these aspects
(In the case of a relational DBMS) most tables
store information about your domain of interest,
while several supplemental tables, plus some
attributes of the informative tables, contain
technical information (related to tracking,
versioning, etc.)

3
Tracking the data sources

It is recommended to track the source of
features
E.g., features may be entered by users (rather
than obtained from source databases)
Different source databases may provide
contradictory data/metadata
Failing to track the sources may create problems:
Seq A is annotated as a kinase because of
sequence similarity with Seq B
Seq B turns out not to be a kinase
Most probably Seq A has the same basic structure
as Seq B, but lacks kinase function
Seq C is annotated as a kinase because of
similarity to Seq A
Function annotations cannot be trusted unless
function transfers are traceable
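
The Seq A / Seq B / Seq C case above can be made traceable by recording, for every annotation, where it came from. The following is a minimal sketch using SQLite, not a definitive schema: the table name annotation, its columns, and the helper names build_annotation_db and transfer_chain are all assumptions made for illustration.

```python
import sqlite3


def build_annotation_db():
    """Create an in-memory database holding the Seq A/B/C example."""
    conn = sqlite3.connect(":memory:")
    # derived_from points at the annotation this one was transferred from
    # (NULL when the annotation was not transferred from another sequence).
    conn.execute(
        """
        CREATE TABLE annotation (
            seq_id        TEXT PRIMARY KEY,
            function_name TEXT NOT NULL,
            source        TEXT NOT NULL,
            derived_from  TEXT REFERENCES annotation(seq_id)
        )
        """
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)",
        [
            ("SeqB", "kinase", "source database", None),
            ("SeqA", "kinase", "similarity", "SeqB"),
            ("SeqC", "kinase", "similarity", "SeqA"),
        ],
    )
    return conn


def transfer_chain(conn, seq_id):
    """Follow derived_from links back to the original annotation.

    Returns a list of (seq_id, source) pairs, starting with seq_id itself.
    """
    chain = []
    seen = set()  # guards against accidental cycles in the data
    while seq_id is not None and seq_id not in seen:
        seen.add(seq_id)
        row = conn.execute(
            "SELECT source, derived_from FROM annotation WHERE seq_id = ?",
            (seq_id,),
        ).fetchone()
        if row is None:
            break
        chain.append((seq_id, row[0]))
        seq_id = row[1]
    return chain
```

With this layout, transfer_chain(conn, "SeqC") lists Seq C, Seq A and Seq B in turn, so when Seq B turns out not to be a kinase, every annotation that depends on it can be found and re-evaluated.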

4
Tracking the data sources

It is crucial to be able to look up and evaluate
the source reference
Science is incomplete
Your research may contradict the data in the
database
Where is the error? Often, both are right since
there is no full picture yet
Researchers need to return to the original source
and evaluate the experimental steps
Gold standard is a peer-reviewed publication
As a rule, indexed in the PubMed database
But not always: there are other sources (e.g.,
publications in other domains like chemistry,
technical reports, webpages, company reports,
etc.)

5
Tracking the data sources

Reference data is quite complex
For many tasks, it is enough to link to PubMed
The reference itself has to be stored if you need
to run queries against references
E.g., find all features supported by works
written by John D.
NCBI provides references in XML format (can be
parsed and stored in your database)
Bibliographic Query Service (access to
heterogeneous bibliographic databases,
http://www.ebi.ac.uk/senger/openbqs)
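
A query such as "find all features supported by works written by John D." is only possible when references and their authors are stored locally. The sketch below shows one way this could look in SQLite; the tables publication, publication_author and feature_publication, the sample rows, and the function names are hypothetical and serve only as illustration.

```python
import sqlite3


def build_publication_db():
    """Create an in-memory database with made-up references and features."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE publication (
            pub_id INTEGER PRIMARY KEY,
            title  TEXT NOT NULL
        );
        CREATE TABLE publication_author (
            pub_id INTEGER NOT NULL REFERENCES publication(pub_id),
            author TEXT NOT NULL
        );
        CREATE TABLE feature_publication (
            feature TEXT NOT NULL,
            pub_id  INTEGER NOT NULL REFERENCES publication(pub_id)
        );
        """
    )
    # Hypothetical sample data, not real publications.
    conn.executemany(
        "INSERT INTO publication VALUES (?, ?)",
        [(1, "Hypothetical paper one"), (2, "Hypothetical paper two")],
    )
    conn.executemany(
        "INSERT INTO publication_author VALUES (?, ?)",
        [(1, "John D."), (1, "Jane R."), (2, "Jane R.")],
    )
    conn.executemany(
        "INSERT INTO feature_publication VALUES (?, ?)",
        [("feature_1", 1), ("feature_2", 2), ("feature_3", 1), ("feature_1", 2)],
    )
    return conn


def features_by_author(conn, author):
    """Return the sorted, distinct features supported by works of `author`."""
    rows = conn.execute(
        """
        SELECT DISTINCT fp.feature
        FROM feature_publication AS fp
        JOIN publication_author AS pa ON pa.pub_id = fp.pub_id
        WHERE pa.author = ?
        ORDER BY fp.feature
        """,
        (author,),
    ).fetchall()
    return [row[0] for row in rows]
```

If only a PubMed link were stored per feature, the same question would require fetching and parsing every linked record first.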

6
Versioning

Very important concept in software development
Some public databases version their sequences
For example, RefSeq
A sequence is identified by an accession number
and a version (e.g., NM_005842.2)
As a rule, only the latest version is available
You have to decide whether you need versioning
and, if so, how to handle it in the database
Keep the latest version only, or all versions?
In case of keeping all versions, do you associate
different versions of the same sequence with each
other?
What should happen to any metadata added to a
sequence when its new version comes out?
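
One possible answer to the questions above is to keep all versions and associate them through the shared accession number. The sketch below assumes identifiers of the form accession.version, as in NM_005842.2; the table sequence_version, the placeholder residues, and the function names are assumptions for illustration only.

```python
import sqlite3


def split_versioned_accession(identifier):
    """Split 'NM_005842.2' into ('NM_005842', 2)."""
    accession, separator, version = identifier.rpartition(".")
    if not separator or not accession or not version.isdigit():
        raise ValueError(f"not a versioned accession: {identifier!r}")
    return accession, int(version)


def create_version_db():
    """Create an in-memory table keyed by (accession, version)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sequence_version (
            accession TEXT    NOT NULL,
            version   INTEGER NOT NULL,
            residues  TEXT    NOT NULL,
            PRIMARY KEY (accession, version)
        )
        """
    )
    return conn


def store_version(conn, identifier, residues):
    """Store one version; older versions of the accession are kept."""
    accession, version = split_versioned_accession(identifier)
    conn.execute(
        "INSERT INTO sequence_version VALUES (?, ?, ?)",
        (accession, version, residues),
    )


def latest_version(conn, accession):
    """Return the highest stored version number, or None if unknown."""
    row = conn.execute(
        "SELECT MAX(version) FROM sequence_version WHERE accession = ?",
        (accession,),
    ).fetchone()
    return row[0]


def all_versions(conn, accession):
    """Return every stored version number of an accession, ascending."""
    rows = conn.execute(
        "SELECT version FROM sequence_version WHERE accession = ? ORDER BY version",
        (accession,),
    ).fetchall()
    return [row[0] for row in rows]
```

Metadata could then reference either the accession alone (it follows the sequence across versions) or the full (accession, version) pair (it stays with one version); which is right depends on the requirements of the database.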

7
Handling uncertainties

Fuzziness in biology: there will be an exception
in almost any biological classification scheme
Make schemes user-extensible using lookup
tables
Provide a comment field so that users can
document the exceptions
Accommodate uncertainty in biological data
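
A user-extensible classification with a comment field could be sketched as follows. This is only an illustration under assumed names: the tables feature_type and feature, their columns, and the helper functions are not taken from the slides.

```python
import sqlite3


def create_classification_db():
    """Create a lookup table of types plus a feature table with comments."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the lookup reference
    conn.executescript(
        """
        CREATE TABLE feature_type (
            type_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            type_id    INTEGER NOT NULL REFERENCES feature_type(type_id),
            comment    TEXT
        );
        """
    )
    return conn


def get_or_create_type(conn, name):
    """Return the id of a type, adding it to the lookup table if needed."""
    conn.execute("INSERT OR IGNORE INTO feature_type (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT type_id FROM feature_type WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def add_feature(conn, name, type_name, comment=None):
    """Insert a feature; the free-text comment can document an exception."""
    type_id = get_or_create_type(conn, type_name)
    cursor = conn.execute(
        "INSERT INTO feature (name, type_id, comment) VALUES (?, ?, ?)",
        (name, type_id, comment),
    )
    return cursor.lastrowid
```

Because the categories live in a table rather than in a hard-coded list, a user who meets an exception can add a new category without a schema change, and the comment column records why.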

8
Handling uncertainties

Uncertainty is associated with all scientific
data
Imperfections in measurement techniques
Incomplete knowledge
Handling uncertainty depends on
Type of uncertainty
Requirements of the scientists who use the data
Ignoring uncertainty may corrupt your database
Scientific conclusions may be derived from data
in database
Uncertainty in data will influence conclusions
If users cannot assess uncertainty of data,
database becomes less valuable

9
Handling uncertainties: types

Uncertainty in quantitative data can be
calculated
Store raw data and calculate on the fly
Store data and calculated error (e.g., average
and standard deviation)
Uncertainty in qualitative data is more difficult
to handle
Some types of experiments are inherently less
certain than others
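
For quantitative data, the two storage strategies listed above (raw values with on-the-fly calculation, or a stored average and standard deviation) can be sketched as follows. The class and function names are assumptions, and the sample standard deviation is used here as one reasonable choice of error estimate.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSummary:
    """What is kept when only the summary is stored, not the raw data."""
    mean: float
    stdev: float
    n: int


def summarize(measurements):
    """Compute mean and sample standard deviation from raw measurements."""
    if len(measurements) < 2:
        raise ValueError("at least two measurements are needed for a stdev")
    return StoredSummary(
        mean=statistics.mean(measurements),
        stdev=statistics.stdev(measurements),
        n=len(measurements),
    )


# Strategy 1: store the raw values and call summarize() whenever needed.
raw_values = [10.0, 12.0, 14.0]
on_the_fly = summarize(raw_values)

# Strategy 2: compute once at load time and store only the summary.
stored = StoredSummary(mean=on_the_fly.mean, stdev=on_the_fly.stdev, n=3)
```

Storing raw data keeps the option of recalculating with a different method later; storing only the summary is more compact but fixes the error model at load time.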

10
Handling uncertainties

Examples of bio-data with uncertainty
Protein-protein interactions
Large scale studies and individual studies have
differing uncertainties
Biophysical measurements
Often include quantitative uncertainties
Protein function annotation
Large difference in uncertainty between
experimental and computational annotations

11
Handling uncertainties: example

12
Handling uncertainties

The problem illustrated on the previous slide is
not caused by the first annotation transfer
The problem is caused by the fact that a
researcher using the data does not know that the
annotation on protein 2 is less certain than the
annotation on protein 1
The solution is to include this uncertainty in
the data presented to a user

13
Handling uncertainties

Include it in the annotation text: annotate
protein 2 as "sugar kinase (by similarity)"
GenBank and other sequence databases do it this
way
Include information about the source of the
annotation: classify the annotation on protein 1
as experimental and the annotation on protein 2
as computational (derived)
Gene Ontology includes evidence classifications
Link the annotation directly to the supporting
data
May be appropriate for a database for a
lab/company that concentrates on protein
annotations
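
The first two options above can be combined: store an evidence class with each annotation and derive the displayed text from it. The sketch below uses only the two classes named on this slide; the names Evidence, Annotation and display_label are assumptions, and real schemes such as the Gene Ontology evidence codes are more fine-grained than this.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Coarse evidence classes taken from the slide text."""
    EXPERIMENTAL = "experimental"
    COMPUTATIONAL = "computational"


@dataclass(frozen=True)
class Annotation:
    protein: str
    function: str
    evidence: Evidence


def display_label(annotation):
    """Return the text shown to users, flagging transferred annotations."""
    if annotation.evidence is Evidence.COMPUTATIONAL:
        return f"{annotation.function} (by similarity)"
    return annotation.function


protein_1 = Annotation("protein 1", "sugar kinase", Evidence.EXPERIMENTAL)
protein_2 = Annotation("protein 2", "sugar kinase", Evidence.COMPUTATIONAL)
```

Keeping the evidence class as a separate field, rather than only inside the annotation text, also lets users filter out computational annotations before transferring a function any further.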

14
Handling uncertainties:
with classification method

15
Handling uncertainties:
with supporting evidence

16
Biological database design and implementation

Experience of building biological databases from
the people behind the Ensembl project (producing
and maintaining automatic annotation on selected
eukaryotic genomes), http://www.ensembl.org
See "Biological database design and
implementation" by Birney & Clamp, Briefings in
Bioinformatics, 5(1):31-38, 2004
Best practices (very practical) to help people
design/build bio-databases
Note: Birney & Clamp didn't really distinguish
between database systems and software development

17
Biological database design and implementation

Unique problems for biological databases
There is no "true" biological interpretation of
the data stored in a database
Interpretation can change over time
Discovering new relationships between some
aspects of the data is an important part of the
motivation to store information in a database
Technically, experiment-based data can be
considered invariable (e.g., read-outs from
microarray chips), but, generally, it is always a
matter of agreement where data ends and data
interpretation starts (e.g., microarray data are
manipulated before storing)

18
Biological database design and implementation

Unique problems for biological databases
Lack of people with a good understanding of both
biology and programming/database systems
Hopefully, there will be more such people
At the least, people who are involved in database
building should have an appreciation of the other
field

19
Biological database design and implementation:
general points

Use source code control
Concurrent Versions System (CVS) keeps track of
all work and all changes in a set of files,
typically the implementation of a software
project, and allows several developers to
collaborate
Very useful even for only one person, essential
for a group of two or more
Use relational databases (MySQL, PostgreSQL,
SQLite)
Do not store your data in text files (exceptions:
images, which often require specialized tools,
and sequence files, since many tools are tied to
specific formats)
Just study them

20
Biological database design and implementation:
general points

Be wary of cutting-edge technologies
Use them only if you really have to
More buggy because they are new and, hence, have
not been tested/used in different scenarios
(!) There will be only a small number of people
(and generally only one in your group) who
understand the technology well
Advice
Programming languages: C, C++, Java, Perl,
Python, Lisp (?)
Database systems: relational databases, SQL

21
Biological database design and implementation:
general points

Avoid mixing database development with
CS/bioinformatics research
Particularly, CS-oriented people may be eager to
make an advance in CS using biological data as an
exemplar
Concentrate on a particular domain
Do not try to integrate as much biological data
as possible
Make sure that no one, or only a few groups,
provide a similar resource
Particularly, people with a biological background
like to sketch out an overly detailed scheme that
incorporates almost everything in a topic of
interest (see slides for Lec. 7: scope
boundaries)
Better to find pieces of information that are
missing in other places and focus on them

22
Biological database design and implementation:
detailed advice

Database project should have
CVS repository
Mailing lists for all group members, for
programming developers, for public
Well-backed-up production database instance
Relatively isolated web-viewed database instance
Development areas (place where developers can
create and destroy development databases and
develop new code)

23
Biological database design and implementation:
detailed advice

People behind database (it is rare for a single
person to do three or more roles)
Bioinformatician (understands what the data is
and its biological meaning)
Web developer
Software developer (understands the precise
database/software implementation and writes the
required software)
System administrator (provides support during
system specific problems)

24
Biological database design and implementation:
code and schema design

Decide on a principal data model description
SQL DDL (i.e., CREATE TABLE statements)
XML DTD or XML Schema
UML
Try to focus on the core aspects of data or
analysis
Invest time into data model design
Don't overdo it (one month is likely to be enough)
Almost impossible to understand consequences of
the design until one tries it out
Design requirements are likely to be changed
Use two types of tracking identifiers
Internal (used within the database and related
software)
Published (external)
When writing software
Use object-oriented approach (aim for simplicity
and code reuse)
Test the code intensively
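
The two kinds of tracking identifiers mentioned on this slide can be expressed directly in SQL DDL. In the hedged sketch below, internal_id is used only for joins inside the database, while stable_id is the published identifier; the table names, the sample stable ID format, and the function names are hypothetical.

```python
import sqlite3


def create_identifier_db():
    """Create tables that separate internal keys from published IDs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            internal_id INTEGER PRIMARY KEY,
            stable_id   TEXT NOT NULL UNIQUE,
            name        TEXT NOT NULL
        );
        CREATE TABLE gene_note (
            note_id          INTEGER PRIMARY KEY,
            gene_internal_id INTEGER NOT NULL REFERENCES gene(internal_id),
            note             TEXT NOT NULL
        );
        """
    )
    return conn


def add_gene(conn, stable_id, name):
    """Insert a gene and return its internal key (never shown to users)."""
    cursor = conn.execute(
        "INSERT INTO gene (stable_id, name) VALUES (?, ?)", (stable_id, name)
    )
    return cursor.lastrowid


def notes_for_stable_id(conn, stable_id):
    """Resolve a published ID to its notes via the internal key."""
    rows = conn.execute(
        """
        SELECT n.note
        FROM gene AS g
        JOIN gene_note AS n ON n.gene_internal_id = g.internal_id
        WHERE g.stable_id = ?
        ORDER BY n.note_id
        """,
        (stable_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Internal keys are free to change when the database is rebuilt, whereas the published identifier is what users cite, so it has to stay stable across releases.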

25
What makes a good database?

Annual Nucleic Acids Research (NAR) Database
Issues
A few pieces of advice from the editor of the NAR
database issue
See "What Makes a Good Database?" by Bateman,
NAR, 35(Database issue):D1-D2, 2007
Note: very broad suggestions on how to get your
resource included in the database issue (which
automatically means the database should be
original in some way)

26
What makes a good database?

Data considerations
When thinking of a name for your database, check
whether anyone else is already using that name
(check search engines with your database name;
you may be surprised at what it means in other
languages)
Make your data as comprehensive as possible (try
to avoid making the data collection
overspecialized; e.g., a database of promoters
for RNA genes in a single organism is not going
to have wide appeal, but a database of
promoters for RNA genes in all organisms would be
of wide interest and utility)
Attribute the original sources of derived data
Make sure that you are not breaching any license
terms by redistributing data
Include estimates of confidence in the data items
if applicable
Make data available for bulk download as flat
files or relational database tables with
associated documentation
Web services are becoming popular ways to make
databases programmatically available (i.e.,
provide users with both user-friendly (via
website) and program-friendly interfaces)
Allow users to provide feedback on your data and
submit new data
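
Bulk download as flat files can be as simple as dumping query results to tab-separated text with a header line. The sketch below is one way to do that with the Python standard library; the function names, the demo table, and its rows (including the confidence values) are invented for illustration.

```python
import csv
import io
import sqlite3


def export_query_tsv(conn, query, handle):
    """Write the result of `query` to `handle` as tab-separated text.

    The first line is a header with the column names.
    Returns the number of data rows written.
    """
    cursor = conn.execute(query)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    row_count = 0
    for row in cursor:
        writer.writerow(row)
        row_count += 1
    return row_count


def build_export_demo_db():
    """Create a small in-memory table with made-up rows to export."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE annotation (
            seq_id        TEXT PRIMARY KEY,
            function_name TEXT NOT NULL,
            confidence    REAL NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        [("SeqA", "kinase", 0.9), ("SeqB", "transporter", 0.5)],
    )
    return conn


# Example: export into an in-memory buffer instead of a file on disk.
demo_conn = build_export_demo_db()
buffer = io.StringIO()
exported_rows = export_query_tsv(
    demo_conn,
    "SELECT seq_id, function_name, confidence FROM annotation ORDER BY seq_id",
    buffer,
)
```

Including the confidence column in the export follows the advice above to ship confidence estimates with the data items; the accompanying documentation should explain each column.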

27
What makes a good database?

Web interface considerations
Document your web interface and data (include a
short tutorial if possible)
Include a brief statement of what your database
does on the front page
Make sure that users can always link back to the
home page
Do not make all your links pop up in new windows
Include example sequence/identifiers/keywords for
every entry box on a query form
Keep search forms as simple as possible (most
users will not want to do complex queries of your
data; keep advanced searches on a separate linked
page)
Allow users to browse the data without searching
for a specific entry (e.g., provide alphabetical
lists of entries, or entries sorted by function)
Do not use a jargon term when a well-known term
already exists

28
What makes a good database?

Web interface considerations
Make it obvious what information will be on a
linked page and make clickable icons convey their
function pictorially
Get a domain name for your website (URLs to
specific IP addresses/ports are unlikely to stand
the test of time)
Test your website on a range of browsers and on a
range of operating systems and make sure that
external users can access all the content
Get feedback from your user community before
submission
Slow server response times will make your
database unusable (you can help this by printing
out at least some message to let users know when
they may expect to obtain results)
Look at websites devoted to creating good web
pages (also have a look at web pages describing
web page faults to avoid, such as
http://www.webpagesthatsuck.com/)

29
Concluding remarks

Successful design of biological databases
requires understanding of biology and database
principles
You will not necessarily have both in the same
person
Work in teams
Respect the complexity of the field that is not
your own
For most research-scale databases, performance
will be adequate without any tricks
Use relational DBMSs
Denormalizing is unlikely to be needed
(denormalize only as a last resort)
There is no right (or best) answer for many of
the design issues
Decisions often depend on database requirements
The field is too young, so there is no consensus
on the best way to handle data
Don't get stuck in analysis paralysis: take your
best shot, and learn from how it works (or
doesn't work)

References
"Biological database design and implementation"
by Birney & Clamp, Briefings in Bioinformatics,
5(1):31-38, 2004
"What Makes a Good Database?" by Bateman, NAR,
35(Database issue):D1-D2, 2007
