Title: Organization of Biological Data
1Organization of Biological Data and Databases
Pramod Wangikar, Dept. of Chemical Engineering, IIT Bombay
2ORGANIZATION OF BIOLOGICAL DATA
Gene i
Genomics
m-RNA i
Transcriptomics
Protein Sequence / Proteomics
Protein i
Function (Enzyme, hormone etc.)
3-D Structural Database
3Primary Structure of Deoxyribonucleic Acid (DNA)
OR
pApCpGpTpTpG
OR
ACGTTG
4The Basic Principle of Transcription
Figure labels: RNA polymerase; 5' end; double-stranded DNA; RNA; nucleotides
5The Code
- 64 ways of writing the codon
- 20 amino acids
Figure labels: amino acids M (Met) and F (Phe) carried by tRNAs with anticodons uac and gaa, paired with the adjacent mRNA codons 5'... aug uuu ...
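The two counts above can be checked by enumeration. The following is a minimal sketch, assuming the standard genetic code; the names `BASES`, `codons` and `codon_table` are illustrative and not from the slides.

```python
from itertools import product

BASES = "ucag"  # RNA alphabet, lowercase as on the slide

# All ordered triplets over a 4-letter alphabet: 4**3 = 64 codons.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Standard genetic code, one letter per codon, in u, c, a, g order.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = dict(zip(codons, AMINO_ACIDS))

# Distinct amino acids, ignoring the stop signal.
amino_acids = set(codon_table.values()) - {"*"}

print(len(codons), len(amino_acids))              # 64 20
print(codon_table["aug"], codon_table["uuu"])     # M F
```

Since 64 codons map onto 20 amino acids plus stop, most amino acids are written by more than one codon.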
6The Flow of Genetic Information
DNA
5' ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG 3'   (sequence same as RNA)
3' TGACGTGGTACCCCGAGTCGCTGCCCCTTACCGTGAACCAC 5'   (sequence complementary to RNA)
mRNA
5' ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG 3'
Initiation signal: AUG, followed by the codons
Protein
Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
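The DNA to mRNA to protein flow for the sequence on this slide can be reproduced in a few lines. This is a sketch assuming the standard genetic code and translation from the first AUG; the function names `transcribe` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code in U, C, A, G order; '*' is a stop codon.
CODONS = ["".join(c) for c in product("UCAG", repeat=3)]
CODON_TABLE = dict(zip(
    CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

dna = "ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG"
mrna = transcribe(dna)
print(mrna)             # ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG
print(translate(mrna))  # MGLSDGEWHLV
```

In one-letter code the seventh residue is E (Glu), because the codon GAA encodes glutamate.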
7Memory Requirements for Storing Genomes
2 bits per base: 00 = a, 01 = c, 10 = g, 11 = t
Prokaryotic genomes: 0.5-7.0 Mbp
Eukaryotic genomes: 10 Mbp - 1000 Gbp
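The 2-bit code above translates directly into a storage estimate. The sketch below packs four bases per byte and computes the bytes needed for the genome sizes quoted on the slide. The helper names (`pack`, `unpack`, `genome_bytes`) are illustrative, and the bit order within a byte is an arbitrary choice.

```python
ENCODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}
DECODE = "acgt"  # index = 2-bit code

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (lowest bits first)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.lower()):
        out[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )

def genome_bytes(base_pairs):
    """Bytes needed at 2 bits per base pair (one strand; the other is implied)."""
    return (2 * base_pairs + 7) // 8

packed = pack("acgttg")
print(len(packed), unpack(packed, 6))        # 2 acgttg
print(genome_bytes(500_000))                 # 125000   (0.5 Mbp)
print(genome_bytes(7_000_000))               # 1750000  (7 Mbp)
print(genome_bytes(10**12))                  # 250000000000 (1000 Gbp)
```

At this density the raw sequence of a prokaryote fits in about 0.1 to 2 MB. Ambiguity codes such as N and any annotation would need extra space.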
9How Much Data Does a Bacterium (E. coli) Contain?
10E. coli and Data size
Numbers are approximate. The data size increases by roughly three orders of magnitude for the human system.
11Minimal Life: Self-assembly, Catalysis, Replication, Mutation, Selection
Environment
Cell Boundary
Monomers
RNA
Growth rate
12Maximal Life: Self-assembly, Catalysis, Replication, Mutation, Selection, Regulatory and Metabolic Networks
Environment
Metabolites
Interactions
RNA
DNA
Protein
Growth rate
Expression
stem cells cancer cells microbes
13Regulation: More biological data
What is regulation? A catalogue of possible scenarios and the respective courses of action.
- The information for regulation can be stored in the form of:
- Protein-protein interactions
- Protein-DNA interactions
- Protein-metabolite interactions
- Molecular switches, controls, set-points, etc.
Genome + Environment: Input file
Biological Machinery: Executable program
Observations: Output file
Can we crack the executable program?
14Some useful regulatory signals on Genes
Figure labels (DNA): upstream activating sequences (UAS); TATA box; mRNA expression start and end
Figure labels (mRNA to protein): ribosomal binding site; protein synthesis starts; protein synthesis stops
15Minimal Gene Complement of Mycoplasma genitalium
16DESCRIPTION OF A LIVING CELL / VIRUS
Genome / Genomics
General Capability of the Cell
Readiness of the Cell
Transcriptomics
Proteomics / Protein Map
Physiological state of the cell
17Paradigm Shift in the Bioinformatics Age
Conventional Path
Structure
Gene
Function
Functional Genomics
Gene sequence
Structure of Protein
Function
Protein Map 2D-PAGE, pI, mol. wt.
Proteomics
18Possible Relationships Between Databases
Genome Sequence
Protein Sequence
Proteomics
Transcriptomics Expression Profile
Protein Structure
Protein Profile
Protein-DNA interactions
Protein-Protein Interaction
Protein Function
Metabolome
Phenotype
19Combinatorial Problems in Biology
- Prediction of ORFs (gene finding)
- Prediction of DNA regulatory sites
- DNA regulatory Proteins
- Protein-Protein interactions
- Protein Function
- Prediction of Metabolic capability
- Prediction of Genetic Regulatory Circuits
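As an illustration of the first problem in the list, a naive open reading frame (ORF) scan can be sketched as below. It simply looks for ATG ... stop stretches in all six reading frames. This is not how production gene finders work (they also use statistical signals), and `find_orfs` and its `min_codons` parameter are illustrative names.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for ATG...stop stretches.

    Coordinates are 0-based, end-exclusive, on the strand that was scanned.
    min_codons counts the start codon but not the stop codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the stop codon
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
# [('+', 2, 17, 'ATGAAATTTGGGTAA')]
```

The combinatorial flavour shows up quickly: every sequence has six frames, and short ORFs occur by chance, so a length threshold alone gives many false positives on real genomes.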
20 Biological Databases
- Raw databases
- Processed databases
- Querying in databases.
21Raw Databases
Conventional Ones
- DNA / Gene / Genome Sequence Databases: EMBL, GenBank, GSDB, etc. > 10^6 genes; doubles every 18 months.
- Genome Projects: E. coli, plants, human, mouse, etc.
- Protein Sequence Databases: PIR, SwissProt, GenBank, etc. > 10^5 protein sequences; doubles every 21 months.
- Three-Dimensional Structure Database: Brookhaven Protein Data Bank (PDB). > 20,000 structures; doubles every 24 months.
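The doubling times quoted above imply exponential growth, N(t) = N0 * 2^(t/T). A small sketch of what that means in practice follows. The starting counts are the lower bounds from the slide, the 60-month horizon is an arbitrary choice for illustration, and the function names are illustrative.

```python
import math

def projected_size(n0, months_elapsed, doubling_months):
    """Entries after `months_elapsed` if the count doubles every `doubling_months`."""
    return n0 * 2 ** (months_elapsed / doubling_months)

def months_to_reach(n0, target, doubling_months):
    """Months needed to grow from n0 to target at the given doubling time."""
    return doubling_months * math.log2(target / n0)

# (starting size, doubling time in months), as quoted on the slide
databases = {
    "DNA sequences": (1e6, 18),
    "protein sequences": (1e5, 21),
    "3-D structures": (2e4, 24),
}

for name, (n0, t_double) in databases.items():
    # Projection 60 months ahead, assuming the doubling time stays constant.
    print(f"{name}: {projected_size(n0, 60, t_double):,.0f}")
```

The projection holds only while the doubling time stays constant, which is an assumption rather than something the slide claims.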
22Proteomics Database (SwissProt)
- Each protein identified by pI, mol. wt., mass spectra, microsequencing, peptide mass fingerprint, etc.
- Entries for E. coli, yeast, human, etc.
Hoogland et al., Nucl. Acids Res. (2000) 28, 286
23Cluster of Orthologous Groups (COG) of Proteins
A Processed Database
- Compares genes from different genomes.
- Forms clusters with similar sequences.
- Each COG contains genes connected through vertical evolutionary descent.
- 30 genomes (68,571 genes), 2,791 COGs with 45,350 genes.
- Assignment of function for genes based on known functions for some members of the cluster.
- Highly useful for functional assignments for newly sequenced genomes.
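The idea of assigning function from annotated cluster members can be sketched as a majority vote within each cluster. This only illustrates the principle and is not the COG construction or annotation procedure itself. The cluster and gene identifiers are hypothetical.

```python
from collections import Counter

def transfer_annotations(clusters, known):
    """Propose a function for each unannotated gene from its cluster's majority annotation."""
    proposed = {}
    for cog_id, members in clusters.items():
        votes = Counter(known[g] for g in members if g in known)
        if not votes:
            continue  # no annotated member: nothing to transfer
        function, _ = votes.most_common(1)[0]
        for g in members:
            if g not in known:
                proposed[g] = (function, cog_id)
    return proposed

# Hypothetical cluster and gene identifiers, for illustration only.
clusters = {
    "COG_A": ["eco_g1", "hin_g7", "mge_g3"],
    "COG_B": ["eco_g2", "hin_g9"],
}
known = {
    "eco_g1": "DNA polymerase subunit",
    "hin_g7": "DNA polymerase subunit",
}

print(transfer_annotations(clusters, known))

# Coverage from the numbers on the slide: genes placed in a COG / all genes.
coverage = 45350 / 68571
print(f"{coverage:.0%} of genes fall in a COG")
```

With the figures on the slide, roughly two thirds of the genes in the 30 genomes belong to some COG, so a newly sequenced gene has a fair chance of landing in an annotated cluster.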
24EcoCyc Database: Encyclopedia of E. coli Genes and Metabolism
4300 genes, 695 enzymes, 595 reactions, 123 pathways.
Color key: blue = E. coli only; green = both E. coli and H. influenzae.
Karp et al., Nucl. Acids Res. (1998) 26, 50
25Querying in Databases
- Based on sequence similarity: gives similar sequences and the similarity score or expectation value.
- Normally a BLAST or FASTA search (local alignment). Can look for a sequence motif.
- Gene names, biological source, functional category, cellular location / role.
- Structural features (for known 3-D structures).
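A similarity query can be illustrated with the exact local alignment score (Smith-Waterman dynamic programming). BLAST and FASTA are faster heuristics that approximate this kind of search and additionally report expectation values, which this sketch does not compute. The scoring parameters and the toy database entries are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def rank_hits(query, database):
    """Score the query against every entry and sort by decreasing score."""
    scored = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)

# Toy database with made-up entry names.
database = {"seq1": "TTACGTTGCC", "seq2": "GGGGGG"}
for score, name in rank_hits("ACGTTG", database):
    print(name, score)
```

The query ACGTTG occurs exactly inside seq1, so seq1 ranks first with the maximum possible score for a six-base query under these parameters.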
26Bioinformatics: A multidisciplinary effort is required
- Generation of biological data
- Storage and Retrieval of Data
- Conversion of known biological hypotheses into mathematical/statistical models
- Building models from data
- Fitting new data to existing models.
- Searching for patterns in data
- Derive new biological knowledge from Data