Title: Basic Overview of Bioinformatics Tools and Biocomputing Applications II
1Basic Overview of Bioinformatics Tools and
Biocomputing Applications II
- Dr Tan Tin Wee
- Director
- Bioinformatics Centre
2Common Computational Analyses
- Sequence Assembly
- Simple sequence analysis
- Translation and reverse complement, ORF
- Composition statistics (protein, DNA)
- Molecular mass
- Total charge and pI, local hydropathy
- Simple determination of secondary structures
- Restriction site analysis
- Internal repeat analysis
- Detection of active sites, functional residues,
characteristic structures, substrates, and
processing signals
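A few of the simple analyses listed above can be sketched in Python. This is a minimal illustration only, assuming the standard genetic code and plain A/C/G/T input; the function names are made up for this example and do not come from any particular package.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def gc_content(dna):
    """Fraction of G and C residues, a basic composition statistic."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0
```

Translating the reverse complement as well, in all three frames of each strand, gives the six-frame view commonly used when looking for ORFs.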
3Common Computational Analyses
- Database sequence search
- Multiple alignment
- Secondary (2°) and tertiary (3°) structure prediction,
transmembrane helix detection
- Structure modeling
- Docking prediction and design
- Hidden Markov model searches
4Database Searching
- Text-based Database Searching - using a text
string to match an annotation in a sequence
database record, i.e. keyword search
- Sequence-based Database Searching - using a
biological sequence to match the whole or parts
of its sequence against the sequences of every
sequence database record
5Text-Based Database Searching
- Examples: Entrez, SRS, DBGET, AceDB - common
integrated database systems
- Search Concepts
- Boolean Search - AND, OR, NOT
- Broadening Search
- Narrowing the Search
- Proximity searching, soundex
- Wild card, stemming, e.g. Thala for thalasemia,
thalassemia, thalassemic
- Use standard string search algorithms and Boolean
operations, vocabulary matches
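The search concepts above (Boolean AND, OR, NOT plus a wild card) can be sketched as a small keyword matcher. This is only an illustration: the trailing `*` syntax, the function names, and the three annotation strings are all hypothetical and are not taken from Entrez, SRS, or any other real system.

```python
import re


def tokenize(text):
    """Split an annotation into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_term(words, term):
    """Exact word match, or prefix match when the term ends in '*'."""
    term = term.lower()
    if term.endswith("*"):
        return any(w.startswith(term[:-1]) for w in words)
    return term in words


def matches(text, all_of=(), any_of=(), none_of=()):
    """all_of = AND terms, any_of = OR terms, none_of = NOT terms."""
    words = tokenize(text)
    return (
        all(has_term(words, t) for t in all_of)
        and (not any_of or any(has_term(words, t) for t in any_of))
        and not any(has_term(words, t) for t in none_of)
    )


def search(records, **query):
    """Return the records whose text satisfies the Boolean query."""
    return [r for r in records if matches(r, **query)]


# Hypothetical annotation lines, invented for this example.
RECORDS = [
    "Homo sapiens period homolog (RIGUI) mRNA",
    "Drosophila melanogaster per gene",
    "Human beta-thalassemia globin variant",
]
```

With these records, requiring both "human" and "per" returns nothing, while broadening to "per" OR "period" and excluding "drosophila" returns the first record. Adding any_of terms broadens a search; adding all_of or none_of terms narrows it.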
6Text-based Database Searching
- Example: To find the human homolog of the
Drosophila per gene
- Procedure
- Web to Entrez
- All Fields: enter "human" "per"
- Hits returned, irrelevant - broaden search
- "human" "period" - more hits
- check every one, find the human RIGUI gene
- Hit and miss, clever guesswork; free-form or
controlled vocabulary (MeSH terms)? Use Boolean
searches?
7Sequence-based Database Searching
- Homology Search
- Global or Local Sequence Alignment
- Needleman-Wunsch Algorithm
- Smith-Waterman Algorithm
- Lipman-Pearson FASTA
- Altschul's BLAST
- Take a sequence, pairwise comparison with each
sequence in the database
8Sequence-based Database Searching
- Basic Assumptions
- Sequences of homologous genes/proteins diverge
over time even though structure and/or function
change little
- Significant sequence similarity is inferred as
potential structural/functional similarity or
common evolutionary origin
- Based on a well-characterised protein, infer the
function of an unknown sequence at gene or
protein sequence level.
9Sequence-based Database Searching
- Global Alignment: forces complete alignment of the
two input sequences in a pairwise comparison
- Local Alignment: looks for local stretches of
similarity and tries to align the most similar
segments
- Algorithms used may be similar, but the output is
different; statistics are needed to assess results
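The contrast between global and local alignment can be seen in a short dynamic programming sketch. Both functions below fill the same kind of score table; the local version simply never lets a cell drop below zero and reports the best cell anywhere. The match, mismatch, and linear gap values are arbitrary assumptions for illustration, not the parameters of any published program.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(
                max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            )
        best = max(best, max(curr))
        prev = curr
    return best
```

For two sequences that share only a short common stretch, such as "TTTACGTTTT" and "GGACGGG", the local score stays positive while the global score is pulled negative by the forced end-to-end alignment. Neither raw score says whether a hit is significant, which is why statistics are needed.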
10Sequence-based Database Searching
- Alignment Scoring
- Substitution score and substitution matrix: PAM,
BLOSUM
- Affine gap costs/gap penalty and gap scores
- Optimal alignments, dynamic programming:
Needleman-Wunsch algorithm, Smith-Waterman algorithm
(SSEARCH)
- Additional heuristics to speed up the search -
FASTA, BLAST
11Some definitions
- Affine gap costs - scoring system for gaps within
alignments which charges a penalty for gap
formation and an additional per-residue penalty
proportional to the size of the gap
- Alignment score - numerical value indicating the
overall quality of an alignment; the higher the
score, the better the alignment.
- Algorithm - fixed procedure embodied in a
computer program
- Heuristics - a computer science term referring to
guesses made by the program to approximate
results, usually based on arbitrary or predefined
rules.
- Gapped Alignment - alignment of sequences where
gaps are permitted
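The affine gap cost and alignment score definitions above can be made concrete with a small scoring function. This sketch assumes the convention cost = gap_open + gap_extend * length; some programs count the first gapped residue inside the opening penalty instead. The numeric values and function names are illustrative assumptions, and a flat match/mismatch score stands in for a PAM or BLOSUM matrix.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for one gap: a formation charge plus a per-residue charge."""
    return gap_open + gap_extend * length if length > 0 else 0


def score_alignment(row_a, row_b, match=5, mismatch=-4,
                    gap_open=10, gap_extend=1):
    """Score an existing gapped alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row the current run of '-' belongs to, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != gap_row:      # a new gap is being formed
                score -= gap_open
                gap_row = row
            score -= gap_extend     # charge for each gapped residue
        else:
            gap_row = None
            score += match if x == y else mismatch
    return score
```

Under this scheme one gap of two residues costs 12, whereas two separate one-residue gaps cost 22, so alignments with a few long gaps are preferred over many scattered ones.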
12Computational Genefinding
- Major challenge in genome project
- Given a DNA sequence, where does a gene begin and
stop? - ORF
- Where are the exons and introns?
- Where are the transcription elements?
- Gene structure and other regulatory elements?
13Genomic Elements
- Intron-exon splice sites
- Start-Stop codons
- Branch Points
- Promoters and terminators of transcription
- Polyadenylation sites
- Ribosomal binding sites
- Topoisomerase II binding sites
- Topoisomerase I cleavage sites
- Transcription factor binding sites
14Detecting Genomic Elements
- Local sites and motifs/patterns for such elements
- signals and signal sensors
- Extended variable-length regions, e.g. exons and
introns - contents and content sensors
- Linguistic technique - gene structure described
in formal grammar - GeneLang genefinding program
15Signal sensors
- Simple consensus sequence: use of pattern matching
algorithms
- Weight matrices: allow the weighted scores of each
weight matrix sensor to be summed
- Use of Artificial Neural Networks (ANN)
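A weight matrix signal sensor can be sketched as follows: build per-position log-odds weights from a set of aligned example sites, then slide the matrix along a sequence and sum the weights in each window. The uniform background of 0.25, the pseudocount, the function names, and the four example sites are all assumptions made for this illustration. Input is expected to contain only A, C, G, and T and to be at least as long as the matrix.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position weights for one candidate site."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score_window(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


# Hypothetical aligned sites, invented for this example.
SITES = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
```

Unlike a simple consensus match, the summed score degrades gradually: a window that differs from the consensus at a weakly conserved position still scores well, and scores from several such sensors can be added together.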
16Content Sensors
- Long ORF for bacteria
- Statistical models, e.g. Markov models -
GeneMark: statistical models of nucleotide
frequencies and dependencies in codon structure
- Neural nets, e.g. Grail: exon detection by neural
network combined with signal sensors for
exon-intron splice sites
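The simplest content sensor above, a long ORF, can be sketched as a scan for an ATG followed in the same frame by a stop codon. This minimal version checks only the three forward frames and assumes ATG as the only start codon; the function name and the default length threshold are arbitrary choices for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from the first ATG in a frame up to and including
    the next in-frame stop codon, and must span at least min_codons
    codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

Running the same scan on the reverse complement covers the other strand. This approach works for bacteria because their genes are not interrupted by introns; for eukaryotic genes the statistical and neural network sensors are needed.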
17Some Definitions
- Artificial Neural Nets - statistical pattern
recognition method - a type of nonlinear
regression
- Markov Models - statistical models for sequences
in which the probability of each residue depends
on the residues preceding it.
- Dynamic Programming - type of algorithm widely
used for constructing sequence alignments and for
evaluating all possible candidate gene structures
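The Markov model definition can be illustrated with a first-order chain, in which each residue depends only on the one immediately before it. The sketch below estimates transition probabilities from training sequences and scores a new sequence by its log-likelihood. The pseudocount and function names are assumptions for this example, and the probability of the first residue is ignored for brevity.

```python
import math


def train_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next | previous) from observed adjacent residue pairs."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }


def log_likelihood(model, seq):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[p][n]) for p, n in zip(seq, seq[1:]))
```

Training one model on coding sequences and another on non-coding sequences, then comparing the two log-likelihoods for a window, is the basic idea behind a Markov content sensor. Higher-order models condition on more preceding residues.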
18Other Genefinding methods
- Use of dynamic programming
- Linguistic rules for functional features
- Parameters of a Markov process on hidden
variables - hidden Markov models (HMM)
- HMM genefinders - EcoParse, Xpound, GeneMark HMM,
Veil, HMMgene, GenScan
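How dynamic programming recovers the hidden variables of an HMM can be sketched with the Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. The two-state model below (AT-rich versus GC-rich regions) is a toy with invented probabilities. It is not the model used by any of the genefinders named above, which use many more states for exons, introns, and signals. The sketch assumes a non-empty sequence and non-zero probabilities.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state path for obs, computed in log space."""
    scores = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states
    }]
    back = []
    for symbol in obs[1:]:
        row, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            best_prev = max(
                states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s])
            )
            row[s] = (scores[-1][best_prev]
                      + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model with invented parameters, for illustration only.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
```

Calling `viterbi("AAAATTTTGGGGCCCC", STATES, START, TRANS, EMIT)` labels the first eight positions AT_rich and the last eight GC_rich. An HMM genefinder applies the same recursion with states that represent gene structure rather than base composition.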