Title: Biological Database Systems
1. Biological Database Systems
- 8.1. Design issues specific to biological database systems (cont.)
2. Scientific requirements for design
- Scientific culture requirements
  - Need to track the source of data
  - Need to accommodate the fuzziness of biology
- The database should support these aspects
  - In a relational DBMS, most tables store information about the domain of interest, while several supplemental tables, plus some attributes of the domain tables, hold technical information (tracking, versioning, etc.)
3. Tracking the data sources
- It is recommended to track the source of features
  - E.g., features may be entered by users (rather than obtained from source databases)
  - Different source databases may provide contradictory data/metadata
- Failing to track sources may create problems
  - Seq A is annotated as a kinase because of sequence similarity with Seq B
  - Seq B turns out not to be a kinase
  - Most probably, Seq A has the same basic structure as Seq B but lacks kinase function
  - Seq C is annotated as a kinase because of similarity to Seq A
- Function annotations cannot be trusted unless function transfers are traceable
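The Seq A / Seq B / Seq C scenario can be sketched in a few lines of Python. This is only an illustration: the record layout and the helper names `transfer_chain` and `dependents` are assumptions, not part of any particular database. Each annotation records the sequence it was transferred from, so a retracted annotation on Seq B can be followed to everything that depends on it.

```python
# Hypothetical annotation records: "source" names the sequence the
# function was transferred from (None means it was not transferred).
annotations = {
    "SeqB": {"function": "kinase", "source": None},
    "SeqA": {"function": "kinase", "source": "SeqB"},
    "SeqC": {"function": "kinase", "source": "SeqA"},
}


def transfer_chain(seq_id, records):
    """Return the sequences an annotation was transferred through, nearest first."""
    chain = [seq_id]
    seen = {seq_id}
    source = records[seq_id]["source"]
    # Follow the source links; the "seen" set guards against accidental cycles.
    while source is not None and source not in seen:
        chain.append(source)
        seen.add(source)
        source = records[source]["source"]
    return chain


def dependents(seq_id, records):
    """Return every sequence whose annotation traces back to seq_id."""
    return sorted(
        other
        for other in records
        if other != seq_id and seq_id in transfer_chain(other, records)
    )


# If Seq B turns out not to be a kinase, these annotations need review.
to_review = dependents("SeqB", annotations)
```

Without the `source` field, nothing in the data would indicate that the annotation on Seq C rests on Seq B.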
4. Tracking the data sources
- It is crucial to be able to look up and evaluate the source reference
- Science is incomplete
  - Your research may contradict the data in the database
  - Where is the error? Often both are right, since there is no full picture yet
  - Researchers need to return to the original source and evaluate the experimental steps
- The gold standard is a peer-reviewed publication
  - As a rule, indexed in the PubMed database
  - But not always: there are other sources (e.g., publications in other domains like chemistry, technical reports, webpages, company reports, etc.)
5. Tracking the data sources
- Reference data is quite complex
  - For many tasks, it is enough to link to PubMed
  - The reference itself has to be stored if the references need to be queried
    - E.g., find all features supported by works written by John D.
- NCBI provides references in XML format (can be parsed and stored in your database)
- Bibliographic Query Service (access to heterogeneous bibliographic databases, http://www.ebi.ac.uk/senger/openbqs)
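A minimal sketch of what "storing the reference itself" could look like, using Python's built-in `sqlite3` module. The table and column names (`citation`, `citation_author`, `feature_citation`), the author "Jane R." and all rows are made up for illustration. The PubMed ID column is left empty rather than filled with invented identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citation (
    citation_id INTEGER PRIMARY KEY,
    pubmed_id   INTEGER UNIQUE,   -- NULL for sources not indexed in PubMed
    title       TEXT NOT NULL
);
CREATE TABLE citation_author (
    citation_id INTEGER NOT NULL REFERENCES citation(citation_id),
    author_name TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE feature_citation (
    feature_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    citation_id INTEGER NOT NULL REFERENCES citation(citation_id)
);
""")

# Placeholder rows, invented purely for the example.
conn.executemany(
    "INSERT INTO citation (citation_id, pubmed_id, title) VALUES (?, ?, ?)",
    [(1, None, "Placeholder title one"), (2, None, "Placeholder title two")],
)
conn.executemany(
    "INSERT INTO citation_author (citation_id, author_name) VALUES (?, ?)",
    [(1, "John D."), (1, "Jane R."), (2, "Jane R.")],
)
conn.executemany(
    "INSERT INTO feature (feature_id, name) VALUES (?, ?)",
    [(1, "feature_1"), (2, "feature_2")],
)
conn.executemany(
    "INSERT INTO feature_citation (feature_id, citation_id) VALUES (?, ?)",
    [(1, 1), (2, 2)],
)


def features_by_author(connection, author_name):
    """Return names of features supported by at least one work of the author."""
    rows = connection.execute(
        """
        SELECT DISTINCT f.name
        FROM feature AS f
        JOIN feature_citation AS fc ON fc.feature_id = f.feature_id
        JOIN citation_author AS ca ON ca.citation_id = fc.citation_id
        WHERE ca.author_name = ?
        ORDER BY f.name
        """,
        (author_name,),
    ).fetchall()
    return [name for (name,) in rows]


supported = features_by_author(conn, "John D.")
```

If only a PubMed link were stored, the author query would require a round trip to PubMed for every reference.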
6. Versioning
- Very important concept in software development
- Some public databases version their sequences
  - For example, RefSeq
  - A sequence is identified by an accession number and a version (e.g., NM_005842.2)
  - As a rule, only the latest version is available
- You have to decide whether you need versioning and, if so, how to handle it in the database
  - Keep the latest version only, or all versions?
  - If you keep all versions, do you associate different versions of the same sequence with each other?
  - What should happen to any metadata added to a sequence when its new version comes out?
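One way to associate versions of the same sequence is to store the accession and the version separately. The sketch below assumes identifiers of the form `accession.version`, as in the NM_005842.2 example. The function names are hypothetical, and `NM_000001` is a made-up accession used only to exercise the code.

```python
def split_versioned_accession(identifier):
    """Split 'NM_005842.2' into ('NM_005842', 2)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not a versioned accession: {identifier!r}")
    return accession, int(version)


def latest_versions(identifiers):
    """Map each accession to the highest version seen among the identifiers."""
    latest = {}
    for identifier in identifiers:
        accession, version = split_versioned_accession(identifier)
        if version > latest.get(accession, 0):
            latest[accession] = version
    return latest


stored = ["NM_005842.1", "NM_005842.2", "NM_000001.3"]
current = latest_versions(stored)
```

With the accession and version in separate columns, "all versions of this sequence" and "latest version only" are both simple queries. What happens to metadata attached to an older version is still a policy decision.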
7. Handling uncertainties
- Fuzziness in biology: there will be an exception in almost any biological classification scheme
  - Make schemes user-extensible using lookup tables
  - Provide a comment field so that users can document the exceptions
- Accommodate uncertainty in biological data
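The lookup-table and comment-field advice can be sketched with `sqlite3`. The schema is an assumption made for illustration: `feature_type` is the lookup table, and `feature.comment` is the free-text field for documenting exceptions. Extending the classification is then an ordinary INSERT, not a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE feature_type (
    feature_type_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL UNIQUE
);
CREATE TABLE feature (
    feature_id      INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    feature_type_id INTEGER NOT NULL REFERENCES feature_type(feature_type_id),
    comment         TEXT              -- free text for documenting exceptions
);
""")


def add_feature_type(connection, name):
    """Extend the classification scheme; returns the new lookup-table key."""
    cursor = connection.execute(
        "INSERT INTO feature_type (name) VALUES (?)", (name,)
    )
    return cursor.lastrowid


gene_type = add_feature_type(conn, "gene")
# A user later needs a category the original designers did not foresee.
ribozyme_type = add_feature_type(conn, "ribozyme")

conn.execute(
    "INSERT INTO feature (name, feature_type_id, comment) VALUES (?, ?, ?)",
    ("example_feature", ribozyme_type, "Does not fit the original scheme; see lab notes."),
)
```

The foreign key still rejects types that nobody has defined, so the scheme stays controlled while remaining extensible.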
8. Handling uncertainties
- Uncertainty is associated with all scientific data
  - Imperfections in measurement techniques
  - Incomplete knowledge
- Handling uncertainty depends on
  - Type of uncertainty
  - Requirements of the scientists who use the data
- Ignoring uncertainty may corrupt your database
  - Scientific conclusions may be derived from data in the database
  - Uncertainty in data will influence conclusions
  - If users cannot assess the uncertainty of data, the database becomes less valuable
9. Handling uncertainties: types
- Uncertainty in quantitative data can be calculated
  - Store raw data and calculate on the fly
  - Store data and calculated error (e.g., average and standard deviation)
- Uncertainty in qualitative data is more difficult to handle
  - Some types of experiments are inherently less certain than others
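For quantitative data, the two storage options differ only in when the summary is computed. A minimal sketch using Python's standard `statistics` module follows; the measurement values are invented, and the sample standard deviation is just one possible error measure.

```python
import statistics


def summarize(measurements):
    """Return (mean, sample standard deviation) for repeated measurements."""
    if len(measurements) < 2:
        raise ValueError("need at least two measurements to estimate spread")
    return statistics.mean(measurements), statistics.stdev(measurements)


# Option 1: keep the raw replicates and compute the summary on the fly.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sd = summarize(raw)

# Option 2: store only the summary alongside the number of replicates.
stored_row = {"mean": mean, "sd": sd, "n": len(raw)}
```

Keeping the raw values allows a different error model to be applied later. Storing only the summary is more compact but fixes that choice at load time.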
10. Handling uncertainties
- Examples of bio-data with uncertainty
  - Protein-protein interactions
    - Large-scale studies and individual studies have differing uncertainties
  - Biophysical measurements
    - Often include quantitative uncertainties
  - Protein function annotation
    - Large difference in uncertainty between experimental and computational annotations
11. Handling uncertainties: example
12. Handling uncertainties
- The problem illustrated on the previous slide is not caused by the first annotation transfer
- The problem is caused by the fact that a researcher using the data does not know that the annotation on protein 2 is less certain than the annotation on protein 1
- The solution is to include this uncertainty in the data presented to the user
13. Handling uncertainties
- Include it in the annotation text: annotate protein 2 as "sugar kinase (by similarity)"
  - GenBank and other sequence databases do it this way
- Include information about the source of the annotation: classify the annotation on protein 1 as experimental and the annotation on protein 2 as computational (derived)
  - Gene Ontology includes evidence classifications
- Link the annotation directly to the supporting data
  - May be appropriate for a database for a lab/company that concentrates on protein annotations
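The first two options can be combined: store the evidence class as data and derive the qualified annotation text from it. The sketch below is hypothetical. The `Annotation` class, its field names and the two evidence labels are assumptions based on the experimental/computational split described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    protein: str
    function: str
    evidence: str                       # "experimental" or "computational"
    derived_from: Optional[str] = None  # protein the annotation was transferred from


def display_text(annotation):
    """Render the annotation so that the user can see how certain it is."""
    if annotation.evidence == "computational":
        return f"{annotation.function} (by similarity)"
    return annotation.function


protein_1 = Annotation("protein 1", "sugar kinase", "experimental")
protein_2 = Annotation("protein 2", "sugar kinase", "computational", derived_from="protein 1")

shown = display_text(protein_2)
```

Storing the evidence class as its own field, and not only inside the text, lets users filter on it, for example to exclude computational annotations before transferring a function further.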
14. Handling uncertainties: with classification method
15. Handling uncertainties: with supporting evidence
16. Biological database design and implementation
- Experience of building biological databases from the people behind the Ensembl project (producing and maintaining automatic annotation on selected eukaryotic genomes), http://www.ensembl.org
- See "Biological database design and implementation" by Birney & Clamp, Briefings in Bioinformatics, 5(1):31-38, 2004
- Best practices (very practical) to help people design and build bio-databases
- Note: Birney & Clamp did not really distinguish between database systems and software development
17. Biological database design and implementation
- Unique problems for biological databases
  - There is no "true" biological interpretation of data stored in a database
    - Interpretation can change over time
    - Discovering new relationships between aspects of the data is an important part of the motivation to store information in a database
  - Technically, experiment-based data can be considered invariable (e.g., read-outs from microarray chips), but in general it is always a matter of agreement where data ends and data interpretation starts (e.g., microarray data are manipulated before storing)
18. Biological database design and implementation
- Unique problems for biological databases
  - Lack of people with a good understanding of both biology and programming/database systems
    - Hopefully, there will be more such people
    - At the least, people involved in database building should have an appreciation of the other field
19. Biological database design and implementation: general points
- Use source code control
  - Concurrent Versions System (CVS) keeps track of all work and all changes in a set of files, typically the implementation of a software project, and allows several developers to collaborate
  - Very useful even for one person, essential for a group of two or more
- Use relational databases (MySQL, PostgreSQL, SQLite)
  - Do not store your data in text files (exceptions: images, which often require specialized tools; sequence files, since many tools are tied to specific formats)
  - Just study them
20. Biological database design and implementation: general points
- Be aware of the risks of cutting-edge technologies
  - Use them only if you really have to
  - They are more buggy because they are new and hence have not been tested/used in different scenarios
  - (!) There will be only a small number of people (and generally only one in your group) who understand the technology well
- Advice
  - Programming languages: C, C++, Java, Perl, Python, Lisp (?)
  - Database systems: relational databases, SQL
21. Biological database design and implementation: general points
- Avoid mixing database development with CS/bioinformatics research
  - In particular, CS-oriented people may be eager to make advances in CS using biological data as an exemplar
- Concentrate on a particular domain
  - Do not try to integrate as much biological data as possible
  - Make sure that no one, or only a few groups, provide a similar resource
  - In particular, people with a biological background like to sketch out an overly detailed schema that incorporates almost everything in a topic of interest (see slides for Lec. 7, scope boundaries)
  - Better to find pieces of information that are missing elsewhere and focus on them
22. Biological database design and implementation: detailed advice
- A database project should have
  - CVS repository
  - Mailing lists: for all group members, for developers, for the public
  - Well-backed-up production database instance
  - Relatively isolated web-viewed database instance
  - Development areas (places where developers can create and destroy development databases and develop new code)
23. Biological database design and implementation: detailed advice
- People behind the database (it is rare for a single person to fill three or more roles)
  - Bioinformatician (understands what the data is and its biological meaning)
  - Web developer
  - Software developer (understands the precise database/software implementation and writes the required software)
  - System administrator (provides support for system-specific problems)
24. Biological database design and implementation: code and schema design
- Decide on a principal data model description
  - SQL DDL (i.e., CREATE TABLE statements)
  - XML DTD or XML Schema
  - UML
- Try to focus on the core aspects of the data or analysis
- Invest time in data model design
  - Don't overdo it (one month is likely to be enough)
  - It is almost impossible to understand the consequences of a design until one tries it out
  - Design requirements are likely to change
- Use two types of tracking identifiers
  - Internal (used within the database and related software)
  - Published (external)
- When writing software
  - Use an object-oriented approach (aim for simplicity and code reuse)
  - Test the code intensively
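The two kinds of tracking identifiers can live side by side in one table. In this sketch, SQL DDL is the data model description and `sqlite3` runs it. The table name, the column names and the identifier `EXAMPLE0000001` are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id   INTEGER PRIMARY KEY,     -- internal: used for joins and by the software
    stable_id TEXT NOT NULL UNIQUE,    -- published: shown to users and cited externally
    name      TEXT
);
""")

conn.execute(
    "INSERT INTO gene (stable_id, name) VALUES (?, ?)",
    ("EXAMPLE0000001", "example gene"),
)


def internal_id_for(connection, stable_id):
    """Translate a published identifier into the internal key (None if unknown)."""
    row = connection.execute(
        "SELECT gene_id FROM gene WHERE stable_id = ?", (stable_id,)
    ).fetchone()
    return None if row is None else row[0]


internal = internal_id_for(conn, "EXAMPLE0000001")
```

Keeping the two apart means a database rebuild can renumber the internal keys without changing anything that external users have cited.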
25. What makes a good database?
- Annual Nucleic Acids Research (NAR) Database Issues
- A few pieces of advice from the editor of the NAR database issue
- See "What Makes a Good Database" by Bateman, NAR, 35(Database issue):D1-D2, 2007
- Note: very broad suggestions on how to get your resource included in the database issue (which automatically means the database should be original in some way)
26. What makes a good database?
- Data considerations
  - When thinking of a name for your database, check whether anyone else is already using it (run the name through search engines; you may be surprised at what it means in other languages)
  - Make your data as comprehensive as possible (try to avoid making the data collection overspecialized; e.g., a database of promoters for RNA genes in a single organism is not going to have wide appeal, but a database of promoters for RNA genes in all organisms would be of wide interest and utility)
  - Attribute the original sources of derived data
  - Make sure that you are not breaching any license terms by redistributing data
  - Include estimates of confidence in the data items if applicable
  - Make data available for bulk download as flat files or relational database tables with associated documentation
  - Web services are becoming popular ways to make databases programmatically available (i.e., provide users with both a user-friendly interface, via the website, and a program-friendly one)
  - Allow users to provide feedback on your data and submit new data
27. What makes a good database?
- Web interface considerations
  - Document your web interface and data (include a short tutorial if possible)
  - Include a brief statement of what your database does on the front page
  - Make sure that users can always link back to the home page
  - Do not make all your links pop up in new windows
  - Include example sequences/identifiers/keywords for every entry box on a query form
  - Keep search forms as simple as possible (most users will not want to run complex queries on your data; keep advanced searches on a separate linked page)
  - Allow users to browse the data without searching for a specific entry (e.g., provide alphabetical lists of entries, or entries sorted by function)
  - Do not use a jargon term when a well-known term already exists
28. What makes a good database?
- Web interface considerations
  - Make it obvious what information will be on a linked page, and make clickable icons convey their function pictorially
  - Get a domain name for your website (URLs to specific IP addresses/ports are unlikely to stand the test of time)
  - Test your website on a range of browsers and operating systems, and make sure that external users can access all the content
  - Get feedback from your user community before submission
  - Slow server response times will make your database unusable (you can mitigate this by displaying at least some message to let users know when they may expect results)
  - Look at websites devoted to creating good web pages (also have a look at pages describing web page faults to avoid, such as http://www.webpagesthatsuck.com/)
29. Concluding remarks
- Successful design of biological databases requires understanding of both biology and database principles
  - You will not necessarily have both in the same person
  - Work in teams
  - Respect the complexity of the field that is not your own
- For most research-scale databases, performance will be adequate without any tricks
  - Use relational DBMSs
  - Denormalization is unlikely to be needed (denormalize only as a last resort)
- There is no right (or best) answer for many of the design issues
  - Decisions often depend on database requirements
  - The field is too young, so there is no consensus on the best way to handle data
  - Don't get stuck in analysis paralysis: take your best shot, and learn from how it works (or doesn't work)
30. References
- "Biological database design and implementation" by Birney & Clamp, Briefings in Bioinformatics, 5(1):31-38, 2004
- "What Makes a Good Database" by Bateman, NAR, 35(Database issue):D1-D2, 2007