Title: Various Career Options Available
1 Role of Computer and Information Science in Biology
Presented by Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India
Visiting Professor, Pohang Univ. of Science and Technology, Republic of Korea
Email: raghava_at_imtech.res.in
Web: http://www.imtech.res.in/raghava/
2 Major Applications and Challenges
- Introduction to Biology
- Genome Annotation Gene Prediction
- Analysis and Comparison of Sequences
- Protein Structure Prediction
- DNA Chip (Microarray) technology
- Proteomics Analysis of 2D gel
- Fingerprinting Technique
- Drug development
- Computer-Aided Vaccine Design
3 Hierarchy in Biology
Atoms -> Molecules -> Macromolecules -> Organelles -> Cells -> Tissues -> Organs -> Organ Systems -> Individual Organisms -> Populations -> Communities -> Ecosystems -> Biosphere
4 Animal cell
5 Human Chromosomes
6 Genes are linearly arranged along chromosomes
7 Chromosomes and DNA
8 DNA can be simplified to a string of four letters
GATTACA
9 (RT)
10 Sequence to Structure: It's a matter of dimensions!
- 1D Nucleic acid sequence
- AGT-TTC-CCA-GGG
- 1D Protein sequence
- Met-Ala-Gly-Lys-His
- M A G K H
- 3D Spatial arrangement of atoms
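The two 1D protein notations above (three-letter and one-letter codes) map onto each other directly. A minimal sketch, assuming hyphen-separated three-letter input; the function name `to_one_letter` is made up for this illustration:

```python
# Standard three-letter -> one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(three_letter_seq):
    """Convert 'Met-Ala-Gly' style notation to 'MAG' (hypothetical helper)."""
    return "".join(THREE_TO_ONE[res] for res in three_letter_seq.split("-"))

print(to_one_letter("Met-Ala-Gly-Lys-His"))  # MAGKH
```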
11 Genome Annotation
- The process of adding biological information and predictions to a sequenced genome framework
12 Importance of Sequence Comparison
- Protein Structure Prediction
- Similar sequences have similar structure and function
- Phylogenetic Tree
- Homology based protein structure prediction
- Genome Annotation
- Homology based gene prediction
- Function assignment and evolutionary studies
- Searching drug targets
- Searching sequences present or absent across genomes
13 Protein Sequence Alignment and Database Searching
- Alignment of Two Sequences (Pair-wise Alignment)
- The Scoring Schemes or Weight Matrices
- Techniques of Alignments
- DOTPLOT
- Multiple Sequence Alignment (Alignment of > 2 Sequences)
- Extending Dynamic Programming to more sequences
- Progressive Alignment (Tree or Hierarchical Methods)
- Iterative Techniques
- Stochastic Algorithms (SA, GA, HMM)
- Non Stochastic Algorithms
- Database Scanning
- FASTA, BLAST, PSIBLAST, ISS
- Alignment of Whole Genomes
- MUMmer (Maximal Unique Match)
14 Alignment of Two Sequences
- Dealing with Gaps in Pair-wise Alignment
- Sequence Comparison without Gaps
- Sliding window method to get the maximum score
- ALGAWDE
- ALATWDE
- Total score: 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5 x 100)/7
- Sequences of variable length should use dynamic programming
- Sequence Comparison with Gaps
- Insertion and deletion are common
- Sliding window method fails
- Generate all possible alignments
- Aligning 100 residues requires > 10^75 alignments
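The ungapped comparison on this slide can be sketched in a few lines. This is an illustration only, assuming a score of 1 per identical position and 0 otherwise, as in the ALGAWDE/ALATWDE example; the function names are made up for this sketch.

```python
def identity_score(a, b):
    """Count positions where two equal-length sequences carry the same residue."""
    return sum(1 for x, y in zip(a, b) if x == y)

def percent_identity(a, b):
    """PID = (identical positions x 100) / length, for equal-length sequences."""
    return 100.0 * identity_score(a, b) / len(a)

def best_ungapped_offset(short, long):
    """Slide `short` along `long` without gaps; return (best_score, offset)."""
    best = (-1, 0)
    for offset in range(len(long) - len(short) + 1):
        score = identity_score(short, long[offset:offset + len(short)])
        if score > best[0]:
            best = (score, offset)
    return best

print(identity_score("ALGAWDE", "ALATWDE"))               # 5
print(round(percent_identity("ALGAWDE", "ALATWDE"), 1))   # 71.4
print(best_ungapped_offset("AWDE", "ALGAWDE"))            # (4, 3)
```

Because no gaps are allowed, a single insertion or deletion shifts every later residue off its partner, which is why the slide turns to dynamic programming next.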
15 Alternate Dot Matrix Plot: Diagonal shows aligned/identical regions
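A text-only dot matrix is enough to see the diagonal described here. A minimal sketch, assuming exact residue identity as the only criterion (real dot plot tools usually add window and threshold filtering); `dotplot` is a hypothetical helper name:

```python
def dotplot(seq_a, seq_b):
    """Return one text row per residue of seq_a; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]

for row in dotplot("ALGAWDE", "ALATWDE"):
    print(row)
```

Runs of `*` along a diagonal correspond to regions that align without gaps; a jump from one diagonal to another indicates an insertion or deletion.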
16 Dynamic Programming
- Dynamic Programming allows Optimal Alignment between two sequences
- Allows Insertion and Deletion, or Alignment with gaps
- Needleman and Wunsch Algorithm (1970) for global alignment
- Smith Waterman Algorithm (1981) for local alignment
- Important Steps
- Create DOTPLOT between two sequences
- Compute SUM matrix
- Trace Optimal Path
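The steps above can be sketched as a small global aligner in the style of Needleman and Wunsch. This is a simplified illustration, not the original 1970 formulation: the match/mismatch/gap values are assumptions chosen for readability, and the match scores are computed on the fly rather than stored as a separate dot matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # SUM matrix: score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace the optimal path back from the bottom-right corner
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

total, top, bottom = needleman_wunsch("ALGAWDE", "ALATWDE")
print(total)
print(top)
print(bottom)
```

With these assumed scores the gapped alignment of ALGAWDE and ALATWDE scores higher than the ungapped one, because two gaps recover a sixth match. A local (Smith-Waterman style) variant would additionally clamp each cell at zero and start the traceback from the highest-scoring cell.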
17 Alignment of Multiple Sequences
- Extending Dynamic Programming to more sequences
- Dynamic programming can be extended to more than two sequences
- In practice it requires large amounts of CPU and memory (Murata et al 1985)
- MSA, limited to only 8-10 sequences (1989)
- DCA (Divide and Conquer, Stoye et al., 1997), 20-25 sequences
- OMA (Optimal Multiple Alignment, Reinert et al., 2000)
- COSA (Althaus et al., 2002)
- Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
- Practical approach for multiple alignment
- Compare all sequences pair-wise
- Perform cluster analysis
- Generate a hierarchy for alignment
- First align the most similar pair of sequences
- Align that alignment with the next most similar alignment or sequence
19 Database scanning
- Basic principles of Database searching
- Search query sequence against all sequences in database
- Calculate score and select top sequences
- Dynamic programming is best
- Approximation Algorithms
- FASTA
- Fast sequence search
- Based on dotplot
- Identify identical words (k-tuples)
- Search significant diagonals
- Use PAM 250 for further refinement
- Dynamic programming for narrow region
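The first FASTA stage (identical k-tuples grouped by diagonal) can be illustrated as follows. This is a sketch of the idea only, not the FASTA program: it skips the PAM 250 rescoring and the banded dynamic programming, and the function name and the word size `k=2` are assumptions.

```python
from collections import Counter, defaultdict

def ktuple_diagonals(query, subject, k=2):
    """Count identical k-tuples per diagonal (diagonal = subject_pos - query_pos)."""
    # Index every k-tuple of the query by its start position
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Look up every k-tuple of the subject and tally the diagonal it falls on
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return diagonals

hits = ktuple_diagonals("ALGAWDE", "XXALGAWDEXX", k=2)
print(hits.most_common(1))  # [(2, 6)]
```

A diagonal that collects many hits marks a candidate region; only that narrow band then needs the more expensive refinement steps.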
20 Principles of FASTA Algorithms
22 Database Scanning or Fold Recognition
- Concept of PSIBLAST
- Improve the sensitivity of BLAST
- Perform the BLAST search (gap handling)
- Generate the position-specific score matrix (PSSM)
- Use PSSM for next round of search
- Intermediate Sequence Search
- Search query against protein database
- Generate multiple alignment or profile
- Use profile to search against PDB
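To make the position-specific score matrix concrete, here is a toy construction from a small ungapped alignment. It is a simplified sketch, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, and the function names are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column with log2-odds scores per residue."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def pssm_score(pssm, peptide):
    """Sum the column scores of the residues in `peptide`."""
    return sum(col[aa] for col, aa in zip(pssm, peptide))

pssm = build_pssm(["ALGAWDE", "ALATWDE", "ALGTWDE"])
print(pssm_score(pssm, "ALGAWDE") > pssm_score(pssm, "WWWWWWW"))  # True
```

In an iterative search, the hits from one round would be realigned, the matrix rebuilt, and the matrix used in place of the single query for the next round.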
23 Comparison of Whole Genomes
- MUMmer (Salzberg group, 1999, 2002)
- Pair-wise sequence alignment of genomes
- Assumes that sequences are closely related
- Allows detection of repeats, inverse repeats, SNPs
- Domain inserted/deleted
- Identify the exact matches
- How it works
- Identify the maximal unique matches (MUMs) in two genomes
- As the two genomes are similar, large MUMs will be present
- Sort the MUMs found and extract the longest set of matches that occur in the same order (ordered MUMs)
- A suffix tree is used to identify MUMs
- Close the gaps: SNPs, large inserts
- Align regions between MUMs by Smith-Waterman
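The "ordered MUM" step is a longest-increasing-subsequence problem. The sketch below assumes the MUMs are already known (finding them with a suffix tree is not shown) and are given as hypothetical `(pos_in_genome_a, pos_in_genome_b, length)` tuples; it uses a simple quadratic-time chain search and ignores overlaps between matches.

```python
def ordered_mums(mums):
    """Pick the chain of MUMs, in the same order in both genomes,
    that maximises the total matched length."""
    if not mums:
        return []
    mums = sorted(mums)                 # sort by position in genome A
    n = len(mums)
    best = [m[2] for m in mums]         # best total length of a chain ending at i
    prev = [-1] * n                     # back-pointer to the previous chain member
    for i in range(n):
        for j in range(i):
            if mums[j][1] < mums[i][1] and best[j] + mums[i][2] > best[i]:
                best[i] = best[j] + mums[i][2]
                prev[i] = j
    # Walk back from the end of the best chain
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(mums[i])
        i = prev[i]
    return chain[::-1]

example = [(0, 0, 10), (20, 50, 5), (30, 15, 8), (45, 30, 12), (70, 60, 6)]
print(ordered_mums(example))
# [(0, 0, 10), (30, 15, 8), (45, 30, 12), (70, 60, 6)]
```

The match at `(20, 50, 5)` is dropped because keeping it would break the shared order; the regions between the retained matches are what the final gap-closing step aligns.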
24 Protein Structure Prediction
- Experimental Techniques
- X-ray Crystallography
- NMR
- Limitations of Current Experimental Techniques
- Protein DataBank (PDB) -> 24000 protein structures
- SwissProt -> 100,000 proteins
- Non-Redundant (NR) -> 1,000,000 proteins
- Importance of Structure Prediction
- Fill the gap between known sequences and structures
- Protein engineering to alter the function of a protein
- Rational Drug Design
25 Protein Structures
26 Techniques of Structure Prediction
- Computer simulation based on energy calculation
- Based on physico-chemical principles
- Thermodynamic equilibrium with a minimum free energy
- Global minimum free energy of protein surface
- Knowledge Based approaches
- Homology Based Approach
- Threading Protein Sequence
- Hierarchical Methods
27 Energy Minimization Techniques
- Energy minimization based methods, in their pure form, make no a priori assumptions and attempt to locate global minima.
- Static Minimization Methods
- Many classical potentials can be constructed
- Assume that the atoms in a protein are static
- Problems (large number of variables and minima, and validity of potentials)
- Dynamical Minimization Methods
- Motions of atoms also considered
- Monte Carlo simulation (stochastic in nature, time is not considered)
- Molecular Dynamics (time, quantum mechanical, classical equ.)
- Limitations
- Large number of degrees of freedom, CPU power not adequate
- Interaction potential is not good enough to model
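The Monte Carlo idea (random moves, no explicit time) can be shown on a toy problem. The sketch below uses the Metropolis acceptance rule on a made-up one-dimensional energy function with several local minima; the function, step size, and temperature are all illustrative assumptions and have nothing to do with a real protein force field.

```python
import math
import random

def energy(x):
    """Toy 1D energy surface with several local minima (illustrative only)."""
    return 0.1 * x * x + math.sin(3.0 * x)

def monte_carlo_minimize(x0, steps=5000, step_size=0.5, temperature=0.5, seed=0):
    """Metropolis Monte Carlo search; returns the best (x, energy) visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-dE / T) so the walk can leave local minima.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

print(monte_carlo_minimize(4.0))
```

Even this one-variable surface has several minima; a protein has thousands of coupled degrees of freedom, which is the limitation the slide points to.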
28 Knowledge Based Approaches
- Homology Modelling
- Need homologues of known protein structure
- Backbone modelling
- Side chain modelling
- Fails in absence of homology
- Threading Based Methods
- New way of fold recognition
- Sequence is fitted to known structures
- Motif recognition
- Loop and side chain modelling
- Fails in absence of known example
29 Hierarchical Methods
- Intermediate structures are predicted, instead of predicting the tertiary structure of a protein from its amino acid sequence
- Prediction of backbone structure
- Secondary structure (helix, sheet, coil)
- Beta Turn Prediction
- Super-secondary structure
- Tertiary structure prediction
- Limitation
- Accuracy is only 75-80%
- Only three state prediction
30 [Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); mRNA (target); hybridise target to microarray; excitation by laser 1 and laser 2; emission; scanning; overlay images and normalise; analysis]
31 Major Applications
- Identification of differentially expressed genes in diseased tissues (in presence of drug)
- Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
- Use the expression profile of a known disease to diagnose and classify unknown genes
32 Terms/Jargons
- Stanford/cDNA chip
- one slide/experiment
- one spot
- 1 gene -> one spot or a few spots (replicates)
- control: control spots
- control: two fluorescent dyes (Cy3/Cy5)
- Affymetrix/oligo chip
- one chip/experiment
- one probe/feature/cell
- 1 gene -> many probes (20-25 mers)
- control: match and mismatch cells
33 Images examples
Spot colour | Signal strength     | Gene expression
yellow      | Control = perturbed | unchanged
red         | Control < perturbed | induced
green       | Control > perturbed | repressed
34 Processing of images
- Addressing or gridding
- Assigning coordinates to each of the spots
- Segmentation
- Classification of pixels as either foreground or background
- Intensity determination for each spot
- Foreground fluorescence intensity pairs (R, G)
- Background intensities
- Quality measures
35 Management of Microarray Data
- Magnitude of Data
- Experiments
- 50 000 genes in human
- 320 cell types
- 2000 compounds
- 3 time points
- 2 concentrations
- 2 replicates
- Data Volume
- 4 x 10^11 data-points
- 10^15 = 1 petaB of Data
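The data-point estimate is simply the product of the experimental factors listed above. A quick check of the arithmetic (variable names are chosen for this illustration):

```python
genes, cell_types, compounds = 50_000, 320, 2_000
time_points, concentrations, replicates = 3, 2, 2

data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)
print(f"{data_points:.2e}")  # 3.84e+11, i.e. roughly 4 x 10^11

# Bytes per data point implied by the slide's 10^15-byte (1 petabyte) total
print(round(1e15 / data_points))  # 2604
```

The slide does not break down how the petabyte figure is reached; the second number only shows what storage per data point it would imply.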
36 Management of Microarray Data
- Major Issues
- Large volume of microarray data in last few years
- Storage and efficient access
- Comparison and integration of data
- Problem of data access and exchange
- Data scattered around Internet
- Supplementary material of publications
- Difficult for user to access relevant data
- Problems with existing databases
- Diverse purpose
- Developed for specific purpose
37 Management of Microarray Data
- Specific Database
- Platform (e.g. Stanford MA Database, SMD)
- Organism (Yeast MA global viewer)
- Project (Life cycle database of Drosophila)
- Problem with Supplement and MA databases
- Lack of direct access
- Quality not checked
- No standard format
- Incomplete data
38 Pre-processed cDNA Gene Expression Data
- On p genes for n slides: p is O(10,000), n is O(10-100), but growing

Genes | slide 1 | slide 2 | slide 3 | slide 4 | slide 5 | ...
1     |  0.46   |  0.30   |  0.80   |  1.51   |  0.90   | ...
2     | -0.10   |  0.49   |  0.24   |  0.06   |  0.46   | ...
3     |  0.15   |  0.74   |  0.04   |  0.10   |  0.20   | ...
4     | -0.45   | -1.03   | -0.79   | -0.56   | -0.32   | ...
5     | -0.06   |  1.06   |  1.35   |  1.09   | -1.09   | ...

Gene expression level of gene 5 in slide 4 (1.09) = Log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
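The log-ratio and its colour convention can be written out directly. A minimal sketch; the `tolerance` parameter (how close to 0 still counts as yellow) is an assumption added for illustration and is not on the slide.

```python
import math

def log_ratio(red, green):
    """Expression value as defined on the slide: log2(red / green)."""
    return math.log2(red / green)

def spot_colour(m, tolerance=0.0):
    """Map a log ratio to the conventional display colour."""
    if m > tolerance:
        return "red"      # induced
    if m < -tolerance:
        return "green"    # repressed
    return "yellow"       # unchanged

print(log_ratio(200, 100), spot_colour(log_ratio(200, 100)))  # 1.0 red
print(spot_colour(log_ratio(100, 100)))                       # yellow
print(spot_colour(-1.09))                                     # green
```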
39 Analysis of Microarray Data
- Analysis of images
- Preprocessing of gene expression data
- Normalization of data
- Subtraction of Background Noise
- Global/local Normalization
- House keeping genes (or same gene)
- Expression as ratio (test/reference) in log
- Differential Gene expression
- Repeats and calculate significance (t-test)
- Significance of fold change using statistical methods
- Clustering
- Supervised/Unsupervised (Hierarchical, K-means, SOM)
- Prediction or Supervised Machine Learning (SVM)
40 Normalization Techniques
- Global normalization
- Divide each channel value by the channel mean
- Control spots
- Common spots in both channels
- House keeping genes
- Ratio of intensity of the same gene in the two channels is used for correction
- Iterative linear regression
- Parametric nonlinear normalization
- log(CY3/CY5) vs log(CY5)
- Fitted log ratio - observed log ratio
- General Non Linear Normalization
- LOESS
- Curve between log(R/G) vs log(sqrt(R.G))
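Two of the simpler pieces above can be sketched without any fitting library: global normalization by the channel mean, and the log(R/G) versus log(sqrt(R.G)) transform that a LOESS curve would then be fitted to. The LOESS fit itself is not shown, and the function names and the choice of base-2 logarithms are assumptions.

```python
import math
from statistics import mean

def global_normalize(red, green):
    """Divide each channel by its own mean so both channels have mean 1."""
    r_mean, g_mean = mean(red), mean(green)
    return [r / r_mean for r in red], [g / g_mean for g in green]

def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = log2(sqrt(R*G))."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(math.sqrt(r * g)) for r, g in zip(red, green)]
    return m, a

# A red channel that is uniformly twice as bright as the green channel
red, green = [200.0, 400.0, 800.0], [100.0, 200.0, 400.0]
m_raw, _ = ma_values(red, green)
m_norm, _ = ma_values(*global_normalize(red, green))
print(m_raw)                                # [1.0, 1.0, 1.0]
print([round(abs(x), 9) for x in m_norm])   # [0.0, 0.0, 0.0]
```

Global normalization removes a constant dye bias like the one in this example; intensity-dependent bias needs the nonlinear methods listed at the end of the slide.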
41 Classification
- Task: assign objects to classes (groups) on the basis of measurements made on the objects
- Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
- Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
42 Issues in Clustering
- Pre-processing (Image analysis and Normalization)
- Which genes (variables) are used
- Which samples are used
- Which distance measure is used
- Which algorithm is applied
- How to decide the number of clusters K
43 Unsupervised Learning
- Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. It does not answer the question of how many gene clusters there are.
- K-means clustering: assumes there are K clusters. What if this assumption is incorrect?
- Model-based clustering: the number of clusters is determined dynamically; could be one of the most promising methods
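K-means is short enough to sketch in full. This is a bare-bones illustration using squared Euclidean distance and caller-supplied starting centroids (so that K must be chosen up front, which is exactly the weakness noted above); the function name and the fixed iteration count are assumptions.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means on expression profiles (lists of floats); K = len(centroids)."""
    for _ in range(iterations):
        # Assignment step: put each profile in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid)
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else cen
            for members, cen in zip(clusters, centroids)
        ]
    return centroids, clusters

# The five gene profiles from the pre-processed expression table (slide 38)
genes = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]
centroids, clusters = kmeans(genes, centroids=[genes[0], genes[3]])
for members in clusters:
    print(members)
```

Starting from different centroids, or with a different K, can give a different grouping of the same genes.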
44 Supervised Analysis
- Fisher's linear discriminant analysis
- Quadratic discriminant analysis
- Logistic regression (a linear discriminant analysis)
- Neural networks
- Support vector machine
45 Traditional Proteomics
- 1D gel electrophoresis (SDS-PAGE)
- 2D gel electrophoresis
- Protein Chips
- Chips coated with proteins/Antibodies
- Large scale version of ELISA
- Mass Spectrometry
- MALDI Mass fingerprinting
- Electrospray and tandem mass spectrometry
- Sequencing of Peptides (N->C)
- Matching in Genome/Proteome Databases
46 Overview of 2D Gel
- SDS-PAGE + Isoelectric focusing (IEF)
- Gene Expression Studies
- Medical Applications
- Sample Experiments
- Capturing and Analyzing Data
- Image Acquisition
- Image Sizing and Orientation
- Spot Identification
- Matching and Analysis
47 Comparison/Matching of Gel Images
- Compare 2 gel images
- Set X and Y axis
- Overlap matching spots
- Compare intensity of spots
- Scan against database
- Compare query gel with all gels
- Calculate similarity score
- Sort based on score
48 Differential Proteomics: Fingerprints of Disease
- Phenotypic Changes
- Differential protein expression
- Protein nitration patterns
- Altered phosphorylation
- Altered glycosylation profiles
- Utility
- Target discovery
- Disease pathways
- Disease biomarkers
49 Fingerprinting Technique
- What is fingerprinting
- It is a technique to create a specific pattern for a given organism/person
- To compare the patterns of query and target objects
- To create a phylogenetic tree/classification based on pattern
- Types of Fingerprinting
- DNA Fingerprinting
- Mass/peptide fingerprinting
- Properties based (Toxicity, classification)
- Domain/conserved pattern fingerprinting
- Common Applications
- Paternity and Maternity
- Criminal Identification and Forensics
- Personal Identification
- Classification/Identification of organisms
- Classification of cells
50 Fingerprinting Techniques: Principles and Applications
- What is fingerprinting
- Type of Fingerprinting
- Common Applications
- Role of Computer in DNA Fingerprinting
- Searching Restriction Enzymes
- Searching VNTRs
- Computation of size of DNA fragments
- Optimization of gels
- Comparison of patterns
- Creation of Phylogenetic tree
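Two of the computational tasks above, searching restriction sites and computing fragment sizes, fit in one small function. The sketch assumes a linear DNA molecule and only considers the top strand; the defaults model EcoRI, which recognises GAATTC and cuts after the first G. The function name and the example sequence are made up.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting linear DNA at every `site`.

    `cut_offset` is the cut position inside the recognition site
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

dna = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
print(digest(dna))  # [5, 11, 7]
```

The list of fragment sizes is the pattern that would be compared between samples; sorting it by size mimics the band order on a gel.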
51 Drug Design
- History of Drug/Vaccine development
- Plants or Natural Products
- Plants and natural products were the source of medical substances
- Example: foxglove used to treat congestive heart failure
- Foxglove contains digitalis (cardiotonic glycosides)
- Identification of active component
- Accidental Observations
- Penicillin is one good example
- Alexander Fleming observed the effect of mold
- Mold (Penicillium) produces the substance penicillin
- Discovery of penicillin led to large scale screening
- Soil microorganisms were grown and tested
- Streptomycin, neomycin, gentamicin, tetracyclines etc.
52 Drug Design
- Chemical Modification of Known Drugs
- Drug improvement by chemical modification
- Penicillin G -> Methicillin; morphine -> nalorphine
- Receptor Based drug design
- Receptor is the target (usually a protein)
- Drug molecule binds to cause biological effects
- It is also called lock and key system
- Structure determination of receptor is important
- Ligand-based drug design
- Search for a lead compound or active ligand
- Structure of the ligand guides the drug design process
53 Drug Design based on Bioinformatics Tools
- Detect the Molecular Bases for Disease
- Detection of drug binding site
- Tailor drug to bind at that site
- Protein modeling techniques
- Traditional Method (brute force testing)
- Rational drug design techniques
- Screen likely compounds built
- Modeling large number of compounds (automated)
- Application of Artificial intelligence
- Limitation of known structures
54 Important Points in Drug Design based on Bioinformatics Tools
- Application of Genome
- 3 billion base pairs
- 30,000 unique genes
- Any gene may be a potential drug target
- 500 unique targets
- There may be 10 to 100 variants at each target gene
- 1.4 million SNPs
- 10^200 potential small molecules
55 Concept of Drug and Vaccine
- Concept of Drug
- Kill invading foreign pathogens
- Inhibit the growth of pathogens
- Concept of Vaccine
- Generate memory cells
- Train the immune system to face various existing disease agents
56 VACCINES
- A. SUCCESS STORY
- COMPLETE ERADICATION OF SMALLPOX
- WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
- SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES
- DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
- B. NEED OF THE HOUR
- 1) SEARCH FOR EFFECTIVE VACCINES NOT YET AVAILABLE FOR DISEASES LIKE MALARIA, TUBERCULOSIS AND AIDS
- 2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
- 3) LOW COST
- 4) EFFICIENT DELIVERY TO NEEDY
- 5) REDUCTION OF ADVERSE SIDE EFFECTS
57 Computer Aided Vaccine Design
- Whole Organism of Pathogen
- Consists of more than 4000 genes and proteins
- Genomes have millions of base pairs
- Target antigen to recognise pathogen
- Search vaccine target (essential and non-self)
- Consists of amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T)
- Search antigenic region (peptide of length 9 amino acids)
58 Major steps of endogenous antigen processing
59 Computer Aided Vaccine Design
- Problem of Pattern Recognition
- ATGGTRDAR Epitope
- LMRGTCAAY Non-epitope
- RTTGTRAWR Epitope
- EMGGTCAAY Non-epitope
- ATGGTRKAR Epitope
- GTCVGYATT Epitope
- Commonly used techniques
- Statistical (Motif and Matrix)
- AI Techniques
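The "matrix" flavour of the statistical approach can be illustrated with the peptides listed on this slide. The sketch builds a position-by-residue frequency matrix from the four peptides labelled Epitope and scores any 9-mer against it. It is a toy: four training peptides are far too few for a usable predictor, scoring a peptide that was also used for training says nothing about accuracy, and the function names are made up.

```python
from collections import Counter

EPITOPES = ["ATGGTRDAR", "RTTGTRAWR", "ATGGTRKAR", "GTCVGYATT"]
NON_EPITOPES = ["LMRGTCAAY", "EMGGTCAAY"]

def frequency_matrix(peptides):
    """One Counter per position: fraction of training peptides with each residue."""
    n = len(peptides)
    return [Counter({aa: count / n for aa, count in Counter(col).items()})
            for col in zip(*peptides)]

def matrix_score(matrix, peptide):
    """Sum of per-position frequencies; unseen residues contribute 0."""
    return sum(col[aa] for col, aa in zip(matrix, peptide))

matrix = frequency_matrix(EPITOPES)
for pep in EPITOPES + NON_EPITOPES:
    print(pep, matrix_score(matrix, pep))
```

On this tiny set the two non-epitopes score lower than the epitope ATGGTRDAR, but a real method would be trained and tested on separate, much larger sets.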
60 Why computational tools are required for prediction
200 aa protein
Chopped into overlapping peptides of 9 amino acids
192 peptides
Bioinformatics Tools
10-20 predicted peptides
In vitro or in vivo experiments to detect which snippets of protein will spark an immune response
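The peptide count on this slide follows from the window arithmetic: a protein of length L yields L - 9 + 1 overlapping 9-mers. A short check, using a placeholder sequence (the poly-alanine string stands in for any 200-residue protein):

```python
def overlapping_peptides(protein, length=9):
    """Return every window of `length` residues, shifting by one each time."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]

protein = "A" * 200  # placeholder 200-residue sequence
print(len(overlapping_peptides(protein)))  # 192
```

Testing all 192 peptides experimentally is the cost that prediction tools reduce by ranking them and passing on only the top 10-20.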
61 Thanks