Various Career Options Available

Slides: 62

Provided by: imtec5

Transcript and Presenter's Notes

1
Role of Computer and Information Science in Biology
Presented by: Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India
Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: raghava_at_imtech.res.in
Web: http://www.imtech.res.in/raghava/
2
Major Applications & Challenges

Introduction to Biology
Genome Annotation & Gene Prediction
Analysis and Comparison of Sequences
Protein Structure Prediction
DNA Chip (Microarray) technology
Proteomics: Analysis of 2D gels
Fingerprinting Technique
Drug development
Computer-Aided Vaccine Design

3
Hierarchy in Biology
Atoms, Molecules, Macromolecules, Organelles, Cells, Tissues, Organs, Organ Systems, Individual Organisms, Populations, Communities, Ecosystems, Biosphere
4
Animal cell
5
Human Chromosomes
6
Genes are linearly arranged along chromosomes
7
Chromosomes and DNA
8
DNA can be simplified to a string of four letters
GATTACA
9
(RT)
10
Sequence to Structure: It's a matter of dimensions!

1D Nucleic acid sequence
AGT-TTC-CCA-GGG
1D Protein sequence
Met-Ala-Gly-Lys-His
M A G K H
3D Spatial arrangement of atoms

11
Genome Annotation

The process of adding biological information and predictions to a sequenced genome framework

12
Importance of Sequence Comparison

Protein Structure Prediction
Similar sequences have similar structure and function
Phylogenetic Tree
Homology based protein structure prediction
Genome Annotation
Homology based gene prediction
Function assignment & evolutionary studies
Searching drug targets
Searching for sequences present or absent across genomes

13
Protein Sequence Alignment and Database Searching

Alignment of Two Sequences (Pair-wise Alignment)
The Scoring Schemes or Weight Matrices
Techniques of Alignments
DOTPLOT
Multiple Sequence Alignment (Alignment of > 2 Sequences)
Extending Dynamic Programming to more sequences
Progressive Alignment (Tree or Hierarchical
Methods)
Iterative Techniques
Stochastic Algorithms (SA, GA, HMM)
Non Stochastic Algorithms
Database Scanning
FASTA, BLAST, PSIBLAST, ISS
Alignment of Whole Genomes
MUMmer (Maximal Unique Match)

14
Alignment of Two Sequences

Dealing with Gaps in Pair-wise Alignment
Sequence Comparison without Gaps
Sliding-window method to get the maximum score
ALGAWDE
ALATWDE
Total score: 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5 x 100)/7
Sequences of variable length should use dynamic programming
Sequence Comparison with Gaps
Insertions and deletions are common
Sliding-window method fails
Generate all possible alignments
A 100-residue alignment requires > 10^75 alternative alignments
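
The ungapped comparison on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and identity scoring (1 for a match, 0 otherwise) is assumed because that is what the slide's example uses.

```python
def ungapped_score(a: str, b: str) -> int:
    """Count identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x == y)


def percent_identity(a: str, b: str) -> float:
    """Percent identity (PID) = identical positions * 100 / length."""
    return 100.0 * ungapped_score(a, b) / len(a)


def best_window(short: str, long_seq: str) -> tuple:
    """Slide `short` along `long_seq`; return (best score, offset)."""
    best = (-1, 0)
    for offset in range(len(long_seq) - len(short) + 1):
        window = long_seq[offset:offset + len(short)]
        score = ungapped_score(short, window)
        if score > best[0]:
            best = (score, offset)
    return best


score = ungapped_score("ALGAWDE", "ALATWDE")   # the slide's two sequences
pid = percent_identity("ALGAWDE", "ALATWDE")
```

For the slide's pair the score is 5 and the PID is (5 x 100)/7, about 71.4. The sliding window never opens a gap, which is why it fails once insertions or deletions are present.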

15
Alternate: Dot Matrix Plot
Diagonal shows aligned/identical regions
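
A dot matrix is simple to build by hand. The sketch below is a minimal illustration with hypothetical function names; it marks a cell whenever the two residues are identical and reuses the sequences from the previous slide.

```python
def dot_matrix(a: str, b: str) -> list:
    """Rows follow sequence a, columns follow sequence b; 1 marks identity."""
    return [[1 if x == y else 0 for y in b] for x in a]


def render(a: str, b: str) -> str:
    """Text rendering: '*' for identical residues, '.' otherwise."""
    lines = ["  " + " ".join(b)]
    for residue, row in zip(a, dot_matrix(a, b)):
        cells = " ".join("*" if cell else "." for cell in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


matrix = dot_matrix("ALGAWDE", "ALATWDE")
print(render("ALGAWDE", "ALATWDE"))
```

Runs of '*' along a diagonal correspond to aligned, identical regions; off-diagonal marks are chance identities or repeats.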
16
Dynamic Programming

Dynamic programming allows optimal alignment between two sequences
Allows insertions and deletions, i.e. alignment with gaps
Needleman and Wunsch algorithm (1970) for global alignment
Smith-Waterman algorithm (1981) for local alignment
Important Steps
Create DOTPLOT between two sequences
Compute SUM matrix
Trace Optimal Path
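
The "compute SUM matrix, then trace the optimal path" steps can be sketched as a small global (Needleman-Wunsch style) alignment. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not values from the slides, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace the optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ALGAWDE", "ALATWDE")
```

With these assumed scores the gapped alignment of ALGAWDE and ALATWDE reaches 4 (six matches, two gaps), better than the ungapped score of 3 (five matches, two mismatches). A local (Smith-Waterman style) variant would additionally clamp each cell at zero and start the traceback from the highest-scoring cell.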

17
Alignment of Multiple Sequences

Extending Dynamic Programming to more sequences
Dynamic programming can be extended to more than two sequences
In practice it requires large amounts of CPU time and memory (Murata et al., 1985)
MSA: limited to only 8-10 sequences (1989)
DCA (Divide and Conquer; Stoye et al., 1997): 20-25 sequences
OMA (Optimal Multiple Alignment; Reinert et al., 2000)
COSA (Althaus et al., 2002)
Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
Practical approach for multiple alignment
Compare all sequences pair-wise
Perform cluster analysis
Generate a hierarchy for alignment
First align the most similar pair of sequences
Align that alignment with the next most similar alignment or sequence

19
Database scanning

Basic principles of Database searching
Search the query sequence against all sequences in the database
Calculate score and select top sequences
Dynamic programming is best
Approximation Algorithms
FASTA
Fast sequence search
Based on dotplot
Identify identical words (k-tuples)
Search significant diagonals
Use PAM 250 for further refinement
Dynamic programming for narrow region
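
The k-tuple and diagonal steps of FASTA can be illustrated with a short sketch. It covers only the word lookup and diagonal counting; the PAM250 rescoring and banded dynamic programming are left out. The function name, the default k = 2, and the example target sequence are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query: str, target: str, k: int = 2) -> Counter:
    """Count identical k-tuples per diagonal (target offset minus query offset)."""
    # Index every k-tuple (word) of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Each shared word adds one hit to the diagonal it lies on.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


hits = ktuple_diagonals("ALGAWDE", "XXALGAWDEXX")  # hypothetical target
best_diagonal, word_count = hits.most_common(1)[0]
```

The diagonal with the most word hits marks the region worth refining; restricting dynamic programming to a narrow band around it is what makes the search fast.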

20
Principles of FASTA Algorithms
22
Database Scanning or Fold Recognition

Concept of PSIBLAST
Improve the sensitivity of BLAST
Perform the BLAST search (gap handling)
Generate the position-specific score matrix (PSSM)
Use PSSM for next round of search
Intermediate Sequence Search
Search query against protein database
Generate multiple alignment or profile
Use profile to search against PDB
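
To make the position-specific score matrix concrete, here is a minimal sketch that builds log-odds scores from an ungapped set of aligned hits. It is not the PSI-BLAST implementation: the uniform background frequency, the pseudocount of 1, the function names, and the third example sequence are all assumptions for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict per column: residue -> log2(observed freq / background freq)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def pssm_score(pssm, peptide):
    """Sum the column scores for an ungapped peptide of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


aligned_hits = ["ALGAWDE", "ALATWDE", "ALGTWDE"]  # third sequence is made up
pssm = build_pssm(aligned_hits)
```

In an iterative search the matrix replaces the general substitution matrix in the next round, so residues conserved among the hits are rewarded more strongly than in the first pass.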

23
Comparison of Whole Genomes

MUMmer (Salzberg group, 1999, 2002)
Pair-wise sequence alignment of genomes
Assumes that the sequences are closely related
Allows detection of repeats, inverted repeats, SNPs
Domains inserted/deleted
Identifies the exact matches
How it works
Identify the maximal unique matches (MUMs) in the two genomes
Because the two genomes are similar, long MUMs will be present
Sort the MUMs found and extract the longest set of matches that occur in the same order in both genomes (Ordered MUMs)
A suffix tree is used to identify MUMs
Close the gaps (SNPs, large inserts)
Align regions between MUMs by Smith-Waterman
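
The "ordered MUM" step amounts to finding the longest run of matches whose positions increase in both genomes. The sketch below assumes the MUMs have already been found (no suffix tree is built here) and uses a simple quadratic dynamic programme; the function name and the example coordinates are hypothetical.

```python
def ordered_matches(matches):
    """matches: (pos_in_genome_a, pos_in_genome_b, length) tuples.

    Returns the largest subset, by count, that appears in the same
    order in both genomes.
    """
    matches = sorted(matches)            # order by position in genome A
    n = len(matches)
    if n == 0:
        return []
    best_len = [1] * n                   # longest chain ending at match i
    prev = [-1] * n                      # predecessor in that chain
    for i in range(n):
        for j in range(i):
            in_order = (matches[j][0] < matches[i][0]
                        and matches[j][1] < matches[i][1])
            if in_order and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(n), key=best_len.__getitem__)
    chain = []
    while i != -1:
        chain.append(matches[i])
        i = prev[i]
    return chain[::-1]


mums = [(0, 0, 5), (10, 40, 6), (20, 12, 4), (30, 25, 7), (50, 60, 5)]
chain = ordered_matches(mums)
```

Here the match at (10, 40) is out of order relative to its neighbours and is dropped; the regions between consecutive matches in the chain are the gaps that are then closed or aligned.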

24
Protein Structure Prediction

Experimental Techniques
X-ray Crystallography
NMR
Limitations of Current Experimental Techniques
Protein Data Bank (PDB) -> 24,000 protein structures
SwissProt -> 100,000 proteins
Non-Redundant (NR) -> 1,000,000 proteins
Importance of Structure Prediction
Fill the gap between known sequences and known structures
Protein engineering: to alter the function of a protein
Rational Drug Design

25
Protein Structures

26
Techniques of Structure Prediction

Computer simulation based on energy calculation
Based on physico-chemical principles
Thermodynamic equilibrium with a minimum free energy
Global minimum of the protein's free-energy surface
Knowledge Based approaches
Homology Based Approach
Threading Protein Sequence
Hierarchical Methods

27
Energy Minimization Techniques

Energy minimization based methods, in their pure form, make no a priori assumptions and attempt to locate the global minimum.
Static Minimization Methods
Classical many-body potentials can be constructed
Assume that the atoms in the protein are static
Problems (large number of variables and minima, and validity of potentials)
Dynamical Minimization Methods
Motions of atoms are also considered
Monte Carlo simulation (stochastic in nature, time is not considered)
Molecular Dynamics (time, quantum mechanical, classical equations)
Limitations
Large number of degrees of freedom; CPU power not adequate
Interaction potentials are not good enough to model

28
Knowledge Based Approaches

Homology Modelling
Needs homologues of known protein structure
Backbone modelling
Side chain modelling
Fails in the absence of homology
Threading Based Methods
New way of fold recognition
The sequence is fitted (threaded) onto known structures
Motif recognition
Loop & side chain modelling
Fails in the absence of a known example

29
Hierarchical Methods

Intermediate structures are predicted, instead of predicting the tertiary structure of a protein directly from its amino acid sequence
Prediction of backbone structure
Secondary structure (helix, sheet, coil)
Beta Turn Prediction
Super-secondary structure
Tertiary structure prediction
Limitation
Accuracy is only 75-80%
Only three-state prediction

30
[Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); microarray; mRNA (target); hybridise target to microarray; excitation by laser 1 and laser 2; emission; scanning; overlay images and normalise; analysis]
31

Major Applications
Identification of differentially expressed genes in diseased tissues (in the presence of a drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use expression profiles of known diseases to diagnose and classify unknown genes

32
Terms/Jargons

Stanford/cDNA chip
one slide/experiment
one spot
1 gene -> one spot or a few spots (replicates)
control: control spots
control: two fluorescent dyes (Cy3/Cy5)

Affymetrix/oligo chip
one chip/experiment
one probe/feature/cell
1 gene -> many probes (20-25 mers)
control: match and mismatch cells

33
Image examples
Spot colour   Signal strength       Gene expression
yellow        Control = perturbed   unchanged
red           Control < perturbed   induced
green         Control > perturbed   repressed
34
Processing of images

Addressing or gridding
Assigning coordinates to each of the spots
Segmentation
Classification of pixels either as foreground or
as background
Intensity determination for each spot
Foreground fluorescence intensity pairs (R, G)
Background intensities
Quality measures

35
Management of Microarray Data

Magnitude of Data
Experiments
50,000 genes in human
320 cell types
2000 compounds
3 time points
2 concentrations
2 replicates
Data Volume
4 x 10^11 data points
10^15 bytes = 1 petabyte of data
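
The data-point estimate follows directly from multiplying the experimental factors listed on the slide. A minimal check (the variable names are only illustrative):

```python
import math

# Experimental factors as listed on the slide.
factors = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

data_points = math.prod(factors.values())
print(f"{data_points:.2e} data points")
```

The product is 3.84 x 10^11, which the slide rounds to 4 x 10^11.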

36
Management of Microarray Data

Major Issues
Large volume of microarray data in last few years
Storage and efficient access
Comparison and integration of data
Problem of data access and exchange
Data scattered around Internet
Supplementary material of publications
Difficult for users to access relevant data
Problems with existing databases
Diverse purpose
Developed for specific purpose

37
Management of Microarray Data

Specific Database
Platform (e.g. Stanford Microarray Database, SMD)
Organism (Yeast MA global viewer)
Project (Life cycle database of Drosophila)
Problem with Supplement and MA databases
Lack of direct access
Quality not checked
No standard format
Incomplete data

38
Pre-processed cDNA Gene Expression Data

On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

                 Slides
Genes   slide 1  slide 2  slide 3  slide 4  slide 5  ...
1         0.46     0.30     0.80     1.51     0.90   ...
2        -0.10     0.49     0.24     0.06     0.46   ...
3         0.15     0.74     0.04     0.10     0.20   ...
4        -0.45    -1.03    -0.79    -0.56    -0.32   ...
5        -0.06     1.06     1.35     1.09    -1.09   ...

Gene expression level of gene 5 in slide 4 (1.09) = Log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (> 0), yellow (0), green (< 0) scale.
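
The log-ratio and its colour coding can be sketched as below. The function names and the raw intensities in the usage lines are hypothetical; the matrix holds the five gene rows from the slide.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Expression value as log2(red intensity / green intensity)."""
    return math.log2(red / green)


def display_colour(value: float) -> str:
    """Conventional display: red above 0, green below 0, yellow at 0."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "yellow"


# Rows are genes 1-5, columns are slides 1-5, as on the slide.
expression = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]

gene5_slide4 = expression[5 - 1][4 - 1]
colour = display_colour(gene5_slide4)
```

A value of 1.09 means the red channel is roughly twice as intense as the green one (2^1.09 is about 2.1), so the spot would be shown as red, i.e. induced.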
39
Analysis of Microarray Data

Analysis of images
Preprocessing of gene expression data
Normalization of data
Subtraction of Background Noise
Global/local Normalization
House keeping genes (or same gene)
Expression as a ratio (test/reference), in log scale
Differential Gene expression
Repeat experiments and calculate significance (t-test)
Significance of fold change using a statistical method
Clustering
Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or Supervised Machine Learning (SVM)
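
One simple way to judge differential expression from repeated experiments is a one-sample t statistic on the replicate log ratios, testing whether their mean differs from 0. The slides do not say which form of the t-test was used, so this choice is an assumption, as are the function name and the replicate values.

```python
import math
import statistics


def one_sample_t(log_ratios):
    """t = mean / (sd / sqrt(n)), with n - 1 degrees of freedom.

    Raises ZeroDivisionError if all replicates are identical (sd = 0).
    """
    n = len(log_ratios)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)     # sample standard deviation
    return mean / (sd / math.sqrt(n))


replicates = [1.0, 1.2, 0.8, 1.0]         # hypothetical log2 ratios for one gene
t_stat = one_sample_t(replicates)
```

The statistic is then compared with the t distribution for n - 1 degrees of freedom. With thousands of genes tested at once, some correction for multiple testing is normally needed as well.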

40
Normalization Techniques

Global normalization
Divide channel values by their means
Control spots
Common spots in both channels
House keeping genes
Ratio of intensities of the same gene in the two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
log(CY3/CY5) vs log(CY5)
Fitted log ratio - observed log ratio
General Non Linear Normalization
LOESS
Curve of log(R/G) vs log(sqrt(R.G))
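
Global normalization and the two quantities plotted before a LOESS fit can be sketched as follows. The LOESS fit itself is not implemented here; the function names and the intensity values are hypothetical.

```python
import math


def global_normalise(values):
    """Divide every intensity in a channel by the channel mean."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]


def ma_values(red, green):
    """Per spot: (M, A) with M = log2(R/G) and A = log2(sqrt(R*G))."""
    return [(math.log2(r / g), math.log2(math.sqrt(r * g)))
            for r, g in zip(red, green)]


red = [100.0, 200.0, 400.0]     # hypothetical red-channel intensities
green = [50.0, 100.0, 200.0]    # hypothetical green-channel intensities

raw = ma_values(red, green)
normalised = ma_values(global_normalise(red), global_normalise(green))
```

In this made-up example the red channel is uniformly twice as bright, so every raw log ratio is 1 and global normalization brings them all back to 0. An intensity-dependent bias would instead show up as a curve in the plot of M against A, which is what the LOESS fit is meant to remove.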

41
Classification

Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

42
Issues in Clustering

Pre-processing (Image analysis and Normalization)
Which genes (variables) are used
Which samples are used
Which distance measure is used
Which algorithm is applied
How to decide the number of clusters K

43
Unsupervised Learning

Hierarchical clustering: merges two branches at a time until all variables (genes) are in one tree. It does not answer the question of how many gene clusters there are.
K-means clustering: assumes there are K clusters. What if this assumption is incorrect?
Model-based clustering: the number of clusters is determined dynamically; could be one of the most promising methods.
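
K-means is short enough to sketch directly. This version is deliberately simplified: it starts from the first K points as centroids (real implementations usually start randomly and repeat), runs a fixed number of iterations, and uses squared Euclidean distance. The function name and the toy expression profiles are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    centroids = [list(p) for p in points[:k]]   # naive initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


profiles = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]  # toy gene profiles
labels, centroids = kmeans(profiles, k=2)
```

K has to be supplied by the user, which is exactly the limitation raised on the slide: the algorithm returns K clusters whether or not the data contain that many groups.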

44
Supervised Analysis

Fisher's linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant
analysis)
Neural networks
Support vector machine

45
Traditional Proteomics

1D gel electrophoresis (SDS-PAGE)
2D gel electrophoresis
Protein Chips
Chips coated with proteins/Antibodies
large scale version of ELISA
Mass Spectrometry
MALDI Mass fingerprinting
Electrospray and tandem mass spectrometry
Sequencing of Peptides (N->C)
Matching in Genome/Proteome Databases

46
Overview of 2D Gel

SDS-PAGE Isoelectric focusing (IEF)
Gene Expression Studies
Medical Applications
Sample Experiments
Capturing and Analyzing Data
Image Acquisition
Image Sizing & Orientation
Spot Identification
Matching and Analysis

47
Comparison/Matching of Gel Images

Compare 2 gel images
Set X and Y axes
Overlap matching spots
Compare intensity of spots
Scan against database
Compare query gel with all gels
Calculate similarity score
Sort based on score

48
Differential Proteomics: Fingerprints of Disease Phenotypic Changes

Differential protein expression
Protein nitration patterns
Altered phosphorylation
Altered glycosylation profiles

Utility
Target discovery
Disease pathways
Disease biomarkers

49
Fingerprinting Technique

What is fingerprinting
It is a technique to create a specific pattern for a given organism/person
To compare the patterns of query and target objects
To create a phylogenetic tree/classification based on patterns
Types of Fingerprinting
DNA Fingerprinting
Mass/peptide fingerprinting
Properties based (Toxicity, classification)
Domain/conserved pattern fingerprinting
Common Applications
Paternity and Maternity
Criminal Identification and Forensics
Personal Identification
Classification/Identification of organisms
Classification of cells

50
Fingerprinting Techniques: Principles & Applications

What is fingerprinting
Type of Fingerprinting
Common Applications
Role of Computer in DNA Fingerprinting
Searching Restriction Enzymes
Searching VNTRs
Computation of size of DNA fragments
Optimization of gels
Comparison of patterns
Creation of Phylogenetic tree

51
Drug Design

History of Drug/Vaccine development
Plants or Natural Product
Plants and natural products were the source of medical substances
Example: foxglove, used to treat congestive heart failure
Foxglove contains digitalis, a cardiotonic glycoside
Identification of active component
Accidental Observations
Penicillin is one good example
Alexander Fleming observed the effect of mold
Mold (Penicillium) produces the substance penicillin
Discovery of penicillin led to large-scale screening
Soil microorganisms were grown and tested
Streptomycin, neomycin, gentamicin, tetracyclines etc.

52
Drug Design

Chemical Modification of Known Drugs
Drug improvement by chemical modification
Penicillin G -> Methicillin; morphine -> nalorphine
Receptor Based drug design
Receptor is the target (usually a protein)
Drug molecule binds to cause biological effects
It is also called lock and key system
Structure determination of receptor is important
Ligand-based drug design
Search for a lead compound or active ligand
Structure of the ligand guides the drug design process

53
Drug Design based on Bioinformatics Tools

Detect the Molecular Bases for Disease
Detection of drug binding site
Tailor drug to bind at that site
Protein modeling techniques
Traditional Method (brute force testing)
Rational drug design techniques
Screen likely compounds built
Modeling large number of compounds (automated)
Application of Artificial intelligence
Limitation of known structures

54
Important Points in Drug Design based on
Bioinformatics Tools

Application of Genome
3 billion base pairs
30,000 unique genes
Any gene may be a potential drug target
500 unique targets
There may be 10 to 100 variants at each target gene
1.4 million SNPs
10^200 potential small molecules

55
Concept of Drug and Vaccine

Concept of Drug
Kill invading foreign pathogens
Inhibit the growth of pathogens
Concept of Vaccine
Generate memory cells
Train the immune system to face various existing disease agents

56
VACCINES

A. SUCCESS STORY
COMPLETE ERADICATION OF SMALLPOX
WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE, FOR DISEASES LIKE MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS

57
Computer Aided Vaccine Design

Whole Organism of Pathogen
Consists of more than 4000 genes and proteins
Genomes have millions of base pairs
Target antigen to recognise pathogen
Search vaccine target (essential and non-self)
Consists of amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T)
Search antigenic region (peptide of length 9 amino acids)

58
Major steps of endogenous antigen processing
59
Computer Aided Vaccine Design

Problem of Pattern Recognition
ATGGTRDAR Epitope
LMRGTCAAY Non-epitope
RTTGTRAWR Epitope
EMGGTCAAY Non-epitope
ATGGTRKAR Epitope
GTCVGYATT Epitope
Commonly used techniques
Statistical (Motif and Matrix)
AI Techniques

60
Why computational tools are required for prediction
200 aa protein
Chopped into overlapping peptides of 9 amino acids: 192 peptides
Bioinformatics tools: 10-20 predicted peptides
In vitro or in vivo experiments to detect which snippets of the protein will spark an immune response
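
The count of 192 peptides follows from sliding a window of 9 residues along a 200-residue protein one position at a time: 200 - 9 + 1 = 192. A minimal sketch, with a hypothetical function name and a made-up 200-residue sequence:

```python
def overlapping_peptides(protein: str, length: int = 9) -> list:
    """All overlapping peptides of the given length, shifted by one residue."""
    return [protein[i:i + length]
            for i in range(len(protein) - length + 1)]


protein = "ACDEFGHIKLMNPQRSTVWY" * 10    # made-up 200-residue sequence
peptides = overlapping_peptides(protein)
```

Each of these 9-mers would then be scored by a prediction method, so that only the 10-20 top candidates need to go forward to experiments.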
61
Thanks
