Title: Essential Bioinformatics and Biocomputing LSM2104: Section I Biological Databases and Bioinformatics
1 Essential Bioinformatics and Biocomputing (LSM2104 Section I): Biological Databases and Bioinformatics Software
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz_at_nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
January 2003
2 Lecture 5: Bioinformatics software
- Outline
- Types of bioinformatics software
- Sequence, pattern and domain
- Evolutionary analysis
- Visualization
- Modeling and prediction (sequence, structure and function)
- Data mining (bibliographic and text searches)
- Examples
3 Types of bioinformatics software
- Analysis of biological data/systems and characterization of molecules and sequences
- Analysis and interpretation of experimental results
- Simulation of laboratory experiments, important for tackling large-scale problems
- Predictions that lead to the design of experiments
- Bioinformatics software can be accessed via the WWW, or through integrated software packages (such as Emboss, GCG, Staden, DNAstar). It may be coupled with databases, or may stand alone.
4 Bioinformatics software
- Major sources
- Software package at ExPASy Molecular Biology Server: http://www.expasy.org, http://au.expasy.org
- Software at PBIL (Pôle Bio-Informatique Lyonnais): http://pbil.univ-lyon1.fr/
- Toolbox at EBI (European Bioinformatics Institute): http://www.ebi.ac.uk/Tools/index.html
5 Bioinformatics software
- Major types of bioinformatics tools
- Sequence analysis tools
- Sequence comparison
- Pattern and domain search
- Evolutionary analysis
- Prediction of sequence structure and function
- Visualization of molecular structures
- Structure modeling
- Bibliographic and text searches
- Specialized and other tools
6 Bioinformatics software
- Sequence analysis tools
- This kind of software focuses on extraction and comparison of properties in DNA and protein sequences
- Sequence analysis supports the identification of domains, structure, function, and other properties
- The analysis of individual sequences helps with sequence comparison
- Textbook chapter 5, pages 81-93
7 Bioinformatics software
- Sequence analysis tools
- This kind of software focuses on extraction and comparison of DNA and protein sequence properties such as:
- composition of nucleotide or protein sequences
- codon usage in DNA
- translation and backtranslation
- Textbook chapter 5, pages 81-93
8 Bioinformatics software
- Composition of nucleotide or protein sequences
- Composition (the frequency of occurrence of each nucleotide or amino acid) is the most basic analysis. It can give us important functional and structural clues.
- For example, CG-rich regions called CpG islands are often found in promoters. A short region just before the splice site at the end of introns often has high CT content.
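As a minimal sketch of the composition analysis described on this slide, the Python below counts symbol frequencies and the G+C fraction of a DNA string. The function names and the example sequence are illustrative assumptions; they are not taken from any of the tools named in the lecture.

```python
from collections import Counter

def composition(seq):
    """Return the relative frequency of each symbol in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {symbol: n / len(seq) for symbol, n in counts.items()}

def gc_content(dna):
    """Return the fraction of G and C nucleotides in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return (dna.count("G") + dna.count("C")) / len(dna)

example = "ATGCGCGCTA"  # hypothetical sequence, for illustration only
print(composition(example))
print(gc_content(example))
```

The same `composition` function works for protein sequences, since it only counts characters. Scanning `gc_content` over a sliding window would be one simple way to look for CG-rich regions.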
9 Bioinformatics software
- Composition of protein and DNA sequences
- Web
- NPS@ (Network Protein Sequence @nalysis): http://npsa-pbil.ibcp.fr/ (Amino-acid composition)
- AA Composition: http://molbiol.soton.ac.uk/compute/aacomp.html
- JEMBOSS (in our own laboratory)
- http://srs1.bic.nus.edu.sg/jnlp/ (nucleic, composition, compseq)
12 Bioinformatics software
- Codon usage in DNA
- Web
- Count-codon program in Codon Usage Database: http://www.kazusa.or.jp/codon/countcodon.html (needs start and stop codons at the start and the end of the sequence)
- Tool for Gene to Codon Usage Table: http://www.entelechon.com/eng/genetocut.html (does not care about start and stop codons)
- JEMBOSS (in the laboratory)
- http://srs1.bic.nus.edu.sg/jnlp/ (nucleic, codon usage, cusp)
- A DNA coding region should contain only one in-frame stop codon, at its end
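The following is a small sketch of what a codon usage count involves, together with a check of the single-stop-codon rule above. It assumes the standard genetic code (stop codons TAA, TAG, TGA); the function names and the example sequence are hypothetical and do not reproduce the behaviour of the web tools or of cusp.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def split_codons(cds):
    """Split a coding sequence into in-frame codons, ignoring trailing bases."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def codon_usage(cds):
    """Count how often each in-frame codon occurs."""
    return Counter(split_codons(cds))

def has_single_terminal_stop(cds):
    """True if the only in-frame stop codon is the last codon."""
    codons = split_codons(cds)
    stops = [i for i, codon in enumerate(codons) if codon in STOP_CODONS]
    return stops == [len(codons) - 1]

example = "ATGGCTGCTTAA"  # hypothetical coding sequence: ATG GCT GCT TAA
print(codon_usage(example))
print(has_single_terminal_stop(example))
```

A sequence with an internal in-frame stop, such as `ATGTAAGCTTAA`, fails the check, which is a quick hint that the reading frame or the region boundaries are wrong.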
15 Bioinformatics software
- Translation (DNA to protein) and back translation (protein to DNA)
- Web
- Translate tool at ExPASy: http://au.expasy.org/tools/dna.html (DNA to protein)
- JEMBOSS (in the laboratory)
- http://srs1.bic.nus.edu.sg/jnlp/ (DNA to protein and reverse)
- (nucleic, translation, transeq; nucleic, translation, backtranseq)
- If we translate a DNA sequence and then back translate the result, we will typically not recover the starting sequence, because most amino acids are encoded by more than one codon.
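The round-trip effect can be sketched in a few lines of Python. This example assumes the standard genetic code and, for back translation, arbitrarily picks the first codon found for each amino acid; real tools such as backtranseq choose codons differently (for instance from a codon usage table), so treat this only as an illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Arbitrary choice (assumption): keep the first codon seen for each amino acid.
BACK_TABLE = {}
for codon, aa in CODON_TABLE.items():
    BACK_TABLE.setdefault(aa, codon)

def translate(dna):
    """Translate DNA in frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def back_translate(protein):
    """Return one of the many DNA sequences that encode the protein."""
    return "".join(BACK_TABLE[aa] for aa in protein.upper())

dna = "ATGCTGAGCTAA"                  # hypothetical input
protein = translate(dna)              # "MLS*"
round_trip = back_translate(protein)  # differs from dna at the Leu and Ser codons
print(protein, round_trip, round_trip == dna)
```

The back-translated DNA differs from the input, yet translating it again gives the same protein: information about codon choice is lost in translation.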
16 Bioinformatics software
- Sequence comparison (the most important software)
- This will be taught next month by A/P Tan Tin Wee.
- Web
- Local alignment (BLAST, FASTA)
- http://www.ebi.ac.uk/fasta33/
- http://www.ncbi.nlm.nih.gov/BLAST/
- http://www.ebi.ac.uk/blast2/
- Multiple alignment (Clustal W)
- http://www.ebi.ac.uk/clustalw/index.html
- JEMBOSS (in the laboratory)
- http://srs1.bic.nus.edu.sg/jnlp/
- Local alignment: Smith-Waterman (alignment, local, water)
- Global alignment: Needleman-Wunsch (alignment, global, needle)
17 Bioinformatics software
- Evolutionary analysis
- Multiple sequence alignments can be used to estimate evolutionary distances between proteins. Phylogeny programs represent these distances between sequences, typically as trees.
- WebPhylip
- http://sdmc.krdl.org.sg:8080/lxzhang/phylip/
- GeneBee
- http://www.genebee.msu.su/services/phtree_reduced.html
- Read textbook, page 83.
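One simple way to turn an alignment into distances is the fraction of differing positions between each pair of aligned sequences (often called the p-distance). The sketch below assumes this measure and a toy alignment with made-up sequence names; the programs listed above offer more elaborate distance models, which this example does not attempt to reproduce.

```python
def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

# Hypothetical toy alignment, for illustration only.
alignment = {
    "seqA": "MKT-LLV",
    "seqB": "MKTALLV",
    "seqC": "MRTALIV",
}

names = sorted(alignment)
for m in names:
    row = [f"{p_distance(alignment[m], alignment[n]):.3f}" for n in names]
    print(m, " ".join(row))
```

A pairwise matrix like the one printed here is the kind of input that distance-based tree-building methods start from.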
19 Bioinformatics software
- Prediction of sequence structure and function
- Sequences that have similar structure often have similar function. For many sequences we can extract secondary and tertiary structure from the PDB database.
- What if our sequence is not in the PDB? We can predict the structure of a biological sequence using appropriate software.
- There are several programs for prediction of secondary structure. For prediction of tertiary structure we can do modelling.
- http://npsa-pbil.ibcp.fr (PHD method for secondary structure prediction)
20 Bioinformatics software
- Secondary structure prediction
21 Bioinformatics software
- Secondary structure prediction
- The PHD program predicted four alpha helices in human IL-2 (red). The number of helices is correct, but their lengths and boundaries are not (purple).
- When we make a prediction in bioinformatics, we must have an idea of the accuracy of the prediction programs.
- To assess the accuracy of a program, we can test it with known data. Our test must contain enough examples for us to draw reasonable conclusions.
22 Secondary structure prediction (Bioinformatics software)
- alpha-Lactalbumin (PDB 1A4V)
- http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html
23 Bioinformatics software
- We used nine different programs to predict the secondary structure of alpha-Lactalbumin (PDB 1A4V).
- The results show that the best predictions for this molecule came from Predator, while DSC gave the poorest.
- This test does not mean that Predator is the best of the tested programs, nor that DSC is the worst. To draw such conclusions we must first build a test set. The test set should contain examples from the family of proteins that our query protein belongs to.
- The learning point: none of the prediction programs is 100% accurate, and this applies to all bioinformatics software, not only secondary structure prediction. Users must be cautious when interpreting results from predictive software.
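A common per-residue score for secondary structure prediction is Q3, the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the observed one. The sketch below assumes this measure and averages it over a small test set; the strings are invented for illustration and are not the alpha-Lactalbumin results from the slide.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

def mean_q3(test_set):
    """Average Q3 over (predicted, observed) pairs from a test set."""
    scores = [q3(pred, obs) for pred, obs in test_set]
    return sum(scores) / len(scores)

# Hypothetical test set: (predicted, observed) secondary structure strings.
test_set = [
    ("CCHHHHCC", "CHHHHHCC"),
    ("CEEEECCC", "CEEEECCC"),
]
print([q3(p, o) for p, o in test_set])
print(mean_q3(test_set))
```

Ranking programs by a mean over many proteins, rather than by one protein, is what makes the comparison meaningful; a single molecule can easily favour one method by chance.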
24 Bioinformatics software
- Common measures (other measures also exist)
- Sensitivity: SE = TP/(TP + FN)
- Specificity: SP = TN/(TN + FP)
- For example, prediction of binding peptides to a particular receptor:
             Experimental   Predicted    Class
- Example 1  Binder         Binder       True positive (TP)
- Example 2  Non-binder     Non-binder   True negative (TN)
- Example 3  Binder         Non-binder   False negative (FN)
- Example 4  Non-binder     Binder       False positive (FP)
- A prediction system that has SE = 0.8 and SP = 0.9 will correctly predict 8 of 10 experimental positives, and for every 10 experimental negatives it will make one false prediction. This accuracy may be very good for prediction of peptide binding, but is not very good for some other predictions, for example gene prediction.
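The two measures follow directly from the four counts. The sketch below uses hypothetical function names and encodes binders as `True` and non-binders as `False`; it first tallies the four example rows from the slide and then reproduces the SE = 0.8, SP = 0.9 case.

```python
def confusion_counts(experimental, predicted):
    """Count (TP, TN, FP, FN) from parallel lists of booleans (True = binder)."""
    tp = tn = fp = fn = 0
    for exp, pred in zip(experimental, predicted):
        if exp and pred:
            tp += 1
        elif not exp and not pred:
            tn += 1
        elif exp and not pred:
            fn += 1
        else:
            fp += 1
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    """SE = TP / (TP + FN): fraction of experimental positives found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SP = TN / (TN + FP): fraction of experimental negatives rejected."""
    return tn / (tn + fp)

# Examples 1 to 4 from the slide.
experimental = [True, False, True, False]
predicted = [True, False, False, True]
print(confusion_counts(experimental, predicted))

# 10 experimental positives with 8 found; 10 negatives with 1 false alarm.
print(sensitivity(tp=8, fn=2), specificity(tn=9, fp=1))
```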
25 Bioinformatics software
- Prediction of 3-D structure
- Various modelling programs:
- comparative modelling, using known structures as templates
- ab initio modelling, using atomic simulation, residue statistics, etc.
- These methods will be covered later in the course
- An example of comparative modelling software is SWISS-MODEL: http://www.expasy.org/swissmod/SWISS-MODEL.html
- The resulting model is returned by email.
- The tool has a facility for assessing the quality of predictions
29 Bioinformatics software
- Software for visualisation of 3-D structures. It provides different views of 3-D molecular structure, and will be taught by A/P Shoba.
- Chime, Rasmol (they use files in PDB format)
- The Scorpion database uses Chime. Chime can be downloaded from http://www.mdli.com/downloads/downloads.html?uidkeyid1
32 Bioinformatics software
- Text searches
- Text searching software is used in association with databases. Most commonly we search by keywords or combinations of keywords.
- Examples of PubMed searches (query: number of matches)
- Diabetes: 181,672
- Diabetes AND IDDM: 35,841
- Diabetes AND IDDM AND autoimmunity: 1,109
- Diabetes OR autoimmunity: 190,674
- Diabetes[Title/Abstract]: 114,624
- The last example uses a more advanced PubMed option: Preview/Index
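The match counts above behave like set operations: AND intersects the result sets and so can only shrink them, while OR unites them and so can only grow them. The sketch below illustrates this with a toy keyword index over three invented records; it is an assumption-laden illustration and says nothing about how PubMed itself is implemented.

```python
def build_index(records):
    """Map each lowercase word to the set of record ids containing it."""
    index = {}
    for rec_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(rec_id)
    return index

def search_and(index, *terms):
    """Records containing every term (set intersection)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Records containing at least one term (set union)."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))

# Hypothetical records, for illustration only.
records = {
    1: "diabetes and autoimmunity in IDDM",
    2: "diabetes treatment",
    3: "autoimmunity review",
}
index = build_index(records)
print(search_and(index, "diabetes"))
print(search_and(index, "diabetes", "IDDM"))
print(search_or(index, "diabetes", "autoimmunity"))
```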
33 Bioinformatics software: Summary of today's lecture
- Why bioinformatics software?
- Types of software: sequence, motif, evolution, visualization, structural modeling, simulation, text search
- Examples of selected software
- Sequence composition
- DNA-protein sequence translation
- Evolutionary analysis
- Protein secondary structure prediction
- Comparative modeling
- Text search
- To be taught later: sequence comparison, visualization, etc.
34 Summary of the Section: Biological databases and bioinformatics software
- We first focused on biological databases. Topics covered:
- types of biological databases
- brief descriptions of popular databases
- structure of the GenBank and SWISS-PROT entries
- searching biological databases
- types of questions that can be answered by searching databases
- completeness and errors in the databases
35 Summary of the Section: Biological databases and bioinformatics software
- The second topic was bioinformatics software. Topics covered:
- why we need bioinformatics software
- brief descriptions of the major types of bioinformatics software
- software for sequence composition, codon usage, translation and backtranslation
- the concepts of sequence alignment and evolutionary analysis
- secondary and tertiary structure prediction, molecular visualization
- accuracy of prediction software
- text searching