Various Career Options Available - PowerPoint PPT Presentation

About This Presentation
Title:

Various Career Options Available

Slides: 62
Provided by: imtec5

Transcript and Presenter's Notes



1
Role of Computer and Information Science in Biology
Presented By Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India
Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: raghava_at_imtech.res.in
Web: http://www.imtech.res.in/raghava/
2
Major Applications & Challenges
  • Introduction to Biology
  • Genome Annotation & Gene Prediction
  • Analysis and Comparison of Sequences
  • Protein Structure Prediction
  • DNA Chip (Microarray) technology
  • Proteomics: Analysis of 2D gel
  • Fingerprinting Technique
  • Drug development
  • Computer-Aided Vaccine Design

3
Hierarchy in Biology
Atoms -> Molecules -> Macromolecules -> Organelles -> Cells -> Tissues -> Organs -> Organ Systems -> Individual Organisms -> Populations -> Communities -> Ecosystems -> Biosphere
4
Animal cell
5
Human Chromosomes
6
Genes are linearly arranged along chromosomes
7
Chromosomes and DNA
8
DNA can be simplified to a string of four letters
GATTACA
9
(RT)
10
Sequence to Structure: It's a matter of dimensions!
  • 1D Nucleic acid sequence
  • AGT-TTC-CCA-GGG
  • 1D Protein sequence
  • Met-Ala-Gly-Lys-His
  • M A G K H
  • 3D Spatial arrangement of atoms

11
Genome Annotation
  • The process of adding biological information and
    predictions to a sequenced genome framework

12
Importance of Sequence Comparison
  • Protein Structure Prediction
  • Similar sequences have similar structure &
    function
  • Phylogenetic Tree
  • Homology based protein structure prediction
  • Genome Annotation
  • Homology based gene prediction
  • Function assignment & evolutionary studies
  • Searching drug targets
  • Searching for sequences present or absent across
    genomes

13
Protein Sequence Alignment and Database Searching
  • Alignment of Two Sequences (Pair-wise Alignment)
  • The Scoring Schemes or Weight Matrices
  • Techniques of Alignments
  • DOTPLOT
  • Multiple Sequence Alignment (Alignment of > 2
    Sequences)
  • Extending Dynamic Programming to more sequences
  • Progressive Alignment (Tree or Hierarchical
    Methods)
  • Iterative Techniques
  • Stochastic Algorithms (SA, GA, HMM)
  • Non Stochastic Algorithms
  • Database Scanning
  • FASTA, BLAST, PSIBLAST, ISS
  • Alignment of Whole Genomes
  • MUMmer (Maximal Unique Match)

14
Alignment of Two Sequences
  • Dealing with Gaps in Pair-wise Alignment
  • Sequence Comparison without Gaps
  • Sliding window method to get the maximum score
  • ALGAWDE
  • ALATWDE
  • Total score: 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5×100)/7
  • Sequences of variable length should use dynamic
    programming
  • Sequence Comparison with Gaps
  • Insertions and deletions are common
  • Sliding window method fails
  • Generate all possible alignments
  • 100-residue alignment requires > 10^75 possible alignments
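
The ungapped comparison on this slide can be sketched in a few lines of Python. This is only an illustration, assuming a score of 1 for an identical pair and 0 otherwise; the function names `identity_scores`, `percent_identity` and `best_ungapped_offset` are hypothetical and not from the presentation.

```python
def identity_scores(seq_a, seq_b):
    """Score each aligned position: 1 if the residues are identical, else 0."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return [1 if a == b else 0 for a, b in zip(seq_a, seq_b)]


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that are identical (PID)."""
    scores = identity_scores(seq_a, seq_b)
    return sum(scores) * 100 / len(scores)


def best_ungapped_offset(seq_a, seq_b):
    """Slide seq_b along seq_a (no gaps) and return (best_score, offset)."""
    best = (0, 0)
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        score = 0
        for j, residue in enumerate(seq_b):
            i = offset + j
            if 0 <= i < len(seq_a) and seq_a[i] == residue:
                score += 1
        if score > best[0]:
            best = (score, offset)
    return best


scores = identity_scores("ALGAWDE", "ALATWDE")
print(scores, sum(scores))                        # per-position scores and total
print(round(percent_identity("ALGAWDE", "ALATWDE"), 1))
print(best_ungapped_offset("ALGAWDE", "ALATWDE"))
```

Because the window only shifts one sequence against the other, it cannot place an insertion or deletion inside a sequence, which is why the slide moves on to dynamic programming.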

15
Alternative: Dot Matrix Plot
Diagonal shows aligned/identical regions
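
A dot matrix plot is easy to sketch as text. The snippet below is a minimal illustration, assuming a cell is marked whenever the two residues are identical (no window or threshold filtering); `dot_matrix` is a hypothetical helper name.

```python
def dot_matrix(seq_a, seq_b):
    """Return a grid with '*' where seq_a[i] == seq_b[j] and '.' elsewhere."""
    return [["*" if a == b else "." for b in seq_b] for a in seq_a]


# Rows follow the first sequence, columns follow the second.
for residue, row in zip("ALGAWDE", dot_matrix("ALGAWDE", "ALATWDE")):
    print(residue, "".join(row))
```

Runs of `*` along a diagonal correspond to aligned identical regions; isolated marks off the diagonal are chance matches.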
16
Dynamic Programming
  • Dynamic Programming allows Optimal Alignment
    between two sequences
  • Allows Insertion and Deletion or Alignment with
    gaps
  • Needleman and Wunsch Algorithm (1970) for global
    alignment
  • Smith-Waterman Algorithm (1981) for local
    alignment
  • Important Steps
  • Create DOTPLOT between two sequences
  • Compute SUM matrix
  • Trace Optimal Path
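
The three steps (compare residues, compute the sum matrix, trace the optimal path) can be sketched as a small global alignment in the Needleman-Wunsch style. The scoring values here (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; real protein alignments would use a weight matrix such as PAM or BLOSUM.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )

    # Trace the optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

A local (Smith-Waterman style) variant would additionally clamp each cell at zero and start the traceback from the highest-scoring cell.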

17
Alignment of Multiple Sequences
  • Extending Dynamic Programming to more sequences
  • Dynamic programming can be extended to more than
    two sequences
  • In practice it requires too much CPU time and memory
    (Murata et al., 1985)
  • MSA, Limited only up to 8-10 sequences (1989)
  • DCA (Divide and Conquer Stoye et al., 1997),
    20-25 sequences
  • OMA (Optimal Multiple Alignment Reinert et al.,
    2000)
  • COSA (Althaus et al., 2002)
  • Progressive or Tree or Hierarchical Methods
    (CLUSTAL-W)
  • Practical approach for multiple alignment
  • Compare all sequences pair wise
  • Perform cluster analysis
  • Generate a hierarchy for alignment
  • First align the most similar pair of sequences
  • Align that alignment with the next most similar
    alignment or sequence

18
(No Transcript)
19
Database scanning
  • Basic principles of Database searching
  • Search query sequence against all sequences in
    the database
  • Calculate score and select top sequences
  • Dynamic programming is best
  • Approximation Algorithms
  • FASTA
  • Fast sequence search
  • Based on dotplot
  • Identify identical words (k-tuples)
  • Search significant diagonals
  • Use PAM 250 for further refinement
  • Dynamic programming for narrow region
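
The k-tuple and diagonal steps can be illustrated with a short sketch. It covers only the word lookup and diagonal counting, not the PAM 250 rescoring or the banded dynamic programming; the function name `ktuple_diagonals` and the default `k=2` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count shared k-tuples per diagonal (subject position - query position)."""
    # Index every k-tuple of the query by its start positions.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Each shared word votes for the diagonal it lies on.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return diagonals


hits = ktuple_diagonals("ALGAWDE", "ALATWDE", k=2)
print(hits.most_common(1))  # the diagonal with the most shared words
```

Diagonals with many shared words are the candidates that would be refined further; everything else is discarded without running a full alignment, which is where the speed comes from.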

20
Principles of FASTA Algorithms
21
(No Transcript)
22
Database Scanning or Fold Recognition
  • Concept of PSIBLAST
  • Perform the BLAST search (gap handling)
  • Improve the sensitivity of BLAST
  • Generate the position-specific score matrix (PSSM)
  • Use PSSM for next round of search
  • Intermediate Sequence Search
  • Search query against protein database
  • Generate multiple alignment or profile
  • Use profile to search against PDB

23
Comparison of Whole Genomes
  • MUMmer (Salzberg group, 1999, 2002)
  • Pair-wise sequence alignment of genomes
  • Assume that sequences are closely related
  • Allows detection of repeats, inverted repeats, SNPs
  • Domain inserted/deleted
  • Identify the exact matches
  • How it works
  • Identify the maximal unique match (MUM) in two
    genomes
  • As the two genomes are similar, large MUMs will be
    present
  • Sort the matches found in MUM and extract the longest
    set of possible matches that occur in the same order
    (Ordered MUM)
  • Suffix tree is used to identify MUMs
  • Close the gaps by SNPs, large inserts
  • Align regions between MUMs by Smith-Waterman
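
To make the idea of a maximal unique match concrete, here is a deliberately naive sketch. It uses brute-force extension rather than the suffix tree mentioned on the slide, so it is only suitable for toy sequences; the function names and the `min_len` threshold are assumptions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, match) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the sequences agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            match = a[i:i + length]
            unique = (count_occurrences(a, match) == 1
                      and count_occurrences(b, match) == 1)
            if length >= min_len and unique:
                mums.append((i, j, match))
    return mums


print(naive_mums("AAACGTCC", "GGACGTTT"))
```

The regions between consecutive MUMs are then the parts that need closing, for example by a local alignment.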

24
Protein Structure Prediction
  • Experimental Techniques
  • X-ray Crystallography
  • NMR
  • Limitations of Current Experimental Techniques
  • Protein Data Bank (PDB) -> 24,000 protein
    structures
  • SwissProt -> 100,000 proteins
  • Non-Redundant (NR) -> 1,000,000 proteins
  • Importance of Structure Prediction
  • Fill gap between known sequence and structures
  • Protein Engg. To alter function of a protein
  • Rational Drug Design

25
Protein Structures

26
Techniques of Structure Prediction
  • Computer simulation based on energy calculation
  • Based on physico-chemical principles
  • Thermodynamic equilibrium with a minimum free
    energy
  • Global minimum free energy of protein surface
  • Knowledge Based approaches
  • Homology Based Approach
  • Threading Protein Sequence
  • Hierarchical Methods

27
Energy Minimization Techniques
  • Energy minimization based methods, in their pure
    form, make no a priori assumptions and attempt to
    locate the global minimum.
  • Static Minimization Methods
  • Classical many-body potentials can be
    constructed
  • Assume that atoms in the protein are static
  • Problems (large number of variables and minima, and
    validity of potentials)
  • Dynamical Minimization Methods
  • Motions of atoms are also considered
  • Monte Carlo simulation (stochastic in nature,
    time is not considered)
  • Molecular Dynamics (time, quantum mechanical,
    classical equations)
  • Limitations
  • Large number of degrees of freedom; CPU power not
    adequate
  • Interaction potentials are not good enough to model

28
Knowledge Based Approaches
  • Homology Modelling
  • Need homologues of known protein structure
  • Backbone modelling
  • Side chain modelling
  • Fail in absence of homology
  • Threading Based Methods
  • New way of fold recognition
  • Sequence is tried to fit in known structures
  • Motif recognition
  • Loop Side chain modelling
  • Fail in absence of known example

29
Hierarchical Methods
  • Intermediate structures are predicted, instead of
    predicting the tertiary structure of a protein directly
    from its amino acid sequence
  • Prediction of backbone structure
  • Secondary structure (helix, sheet,coil)
  • Beta Turn Prediction
  • Super-secondary structure
  • Tertiary structure prediction
  • Limitation
  • Accuracy is only 75-80%
  • Only three state prediction

30
[Figure: cDNA microarray workflow. cDNA clones (probes) -> PCR product amplification and purification -> printing (0.1 nl/spot) -> hybridise target (mRNA) to microarray -> excitation (laser 1, laser 2), emission, scanning -> overlay images and normalise -> analysis]
31
  • Major Applications
  • Identification of differentially expressed genes
    in diseased tissues (in presence of drug)
  • Classification of differentially expressed
    (genes) or clustering/ grouping of genes having
    similar behaviour in different conditions
  • Use expression profiles of known diseases to
    diagnose and classify unknown genes

32
Terms/Jargon
  • Stanford/cDNA chip
  • one slide/experiment
  • one spot
  • 1 gene -> one spot or few spots (replicas)
  • control: control spots
  • control: two fluorescent dyes (Cy3/Cy5)
  • Affymetrix/oligo chip
  • one chip/experiment
  • one probe/feature/cell
  • 1 gene -> many probes (20-25 mers)
  • control: match and mismatch cells

33
Images examples
Spot colour | Signal strength     | Gene expression
yellow      | Control = perturbed | unchanged
red         | Control < perturbed | induced
green       | Control > perturbed | repressed
34
Processing of images
  • Addressing or gridding
  • Assigning coordinates to each of the spots
  • Segmentation
  • Classification of pixels either as foreground or
    as background
  • Intensity determination for each spot
  • Foreground fluorescence intensity pairs (R, G)
  • Background intensities
  • Quality measures

35
Management of Microarray Data
  • Magnitude of Data
  • Experiments
  • 50 000 genes in human
  • 320 cell types
  • 2000 compounds
  • 3 time points
  • 2 concentrations
  • 2 replicates
  • Data Volume
  • 4×10^11 data points
  • 10^15 bytes = 1 petabyte of data
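
The data-point estimate follows directly from multiplying the experimental factors listed on the slide. A quick check, with the dictionary keys chosen here only as labels:

```python
# Experimental factors as listed on the slide.
factors = {
    "genes": 50_000,
    "cell types": 320,
    "compounds": 2_000,
    "time points": 3,
    "concentrations": 2,
    "replicates": 2,
}

data_points = 1
for count in factors.values():
    data_points *= count

print(f"{data_points:.2e} data points")
```

The product is 3.84×10^11, which the slide rounds to 4×10^11. The slide does not state the bytes-per-data-point figure behind the petabyte estimate, so that step is not reproduced here.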

36
Management of Microarray Data
  • Major Issues
  • Large volume of microarray data in last few years
  • Storage and efficient access
  • Comparison and integration of data
  • Problem of data access and exchange
  • Data scattered around Internet
  • Supplementary material of publications
  • Difficult for user to access relevant data
  • Problems with existing databases
  • Diverse purpose
  • Developed for specific purpose

37
Management of Microarray Data
  • Specific Database
  • Platform (eg.Stanford MA Database SMD)
  • Organism (Yeast MA global viewer)
  • Project (Life cycle database of Drosophila)
  • Problem with Supplement and MA databases
  • Lack of direct access
  • Quality not checked
  • No standard format
  • Incomplete data

38
Pre-processed cDNA Gene Expression Data
  • On p genes for n slides: p is O(10,000), n is
    O(10-100), but growing.

Genes (rows) by Slides (columns):

Gene | slide 1 | slide 2 | slide 3 | slide 4 | slide 5 | ...
1    |  0.46   |  0.30   |  0.80   |  1.51   |  0.90   | ...
2    | -0.10   |  0.49   |  0.24   |  0.06   |  0.46   | ...
3    |  0.15   |  0.74   |  0.04   |  0.10   |  0.20   | ...
4    | -0.45   | -1.03   | -0.79   | -0.56   | -0.32   | ...
5    | -0.06   |  1.06   |  1.35   |  1.09   | -1.09   | ...

Each entry (e.g. the gene expression level of gene 5 in slide 4) is
Log2(Red intensity / Green intensity).
These values are conventionally displayed on a
red (> 0), yellow (0), green (< 0) scale.
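
A small sketch of how one such value and its display colour could be computed from raw channel intensities. The intensities used are made-up illustration values, and `log_ratio` and `display_colour` are hypothetical helper names.

```python
import math


def log_ratio(red, green):
    """Log2 expression ratio of the red channel over the green channel."""
    return math.log2(red / green)


def display_colour(value):
    """Map a log ratio to the conventional red / yellow / green display."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "yellow"


# Illustrative (red, green) intensity pairs, not data from the slides.
for red, green in [(400, 100), (100, 100), (100, 400)]:
    ratio = log_ratio(red, green)
    print(red, green, ratio, display_colour(ratio))
```

Using the log of the ratio makes a four-fold induction (+2) and a four-fold repression (-2) symmetric around zero.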
39
Analysis of Microarray Data
  • Analysis of images
  • Preprocessing of gene expression data
  • Normalization of data
  • Subtraction of Background Noise
  • Global/local Normalization
  • House keeping genes (or same gene)
  • Expression as log ratio (test/reference)
  • Differential Gene expression
  • Repeats and calculate significance (t-test)
  • Significance of fold change using statistical methods
  • Clustering
  • Supervised/Unsupervised (Hierarchical, K-means,
    SOM)
  • Prediction or Supervised Machine Learning (SVM)
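
The "repeats and t-test" step can be sketched as a one-sample t statistic on the replicate log ratios of a single gene, testing whether the mean differs from zero (no change). This is an assumed formulation for illustration; the slides do not say which t-test variant was used, and the replicate values are made up.

```python
import math
import statistics


def one_sample_t(log_ratios):
    """t statistic for testing whether the mean log ratio differs from 0."""
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n))


# Illustrative replicate log2 ratios for one gene.
replicates = [1.0, 1.2, 0.8]
print(round(one_sample_t(replicates), 2))
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom, which is left out of this sketch.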

40
Normalization Techniques
  • Global normalization
  • Divide channel values by their means
  • Control spots
  • Common spots in both channels
  • House keeping genes
  • Ratio of intensity of the same gene in the two channels is
    used for correction
  • Iterative linear regression
  • Parametric nonlinear normalization
  • log(Cy3/Cy5) vs log(Cy5)
  • Fitted log ratio - observed log ratio
  • General Non Linear Normalization
  • LOESS
  • Curve between log(R/G) and log(sqrt(R*G))
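
Global normalization is the simplest of these techniques to illustrate. The sketch below divides each channel by its own mean, as the first bullet describes; the intensities are made-up values and `global_normalize` is a hypothetical helper name.

```python
def global_normalize(red, green):
    """Divide each channel by its own mean so both channels have mean 1."""
    mean_red = sum(red) / len(red)
    mean_green = sum(green) / len(green)
    return [r / mean_red for r in red], [g / mean_green for g in green]


# Illustrative intensities: the red channel is uniformly twice as bright.
red = [200.0, 400.0, 600.0]
green = [100.0, 200.0, 300.0]
norm_red, norm_green = global_normalize(red, green)
print(norm_red)
print(norm_green)
```

After scaling, a uniform brightness difference between the dyes no longer shows up as apparent differential expression. Intensity-dependent bias is what the nonlinear methods such as LOESS address.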

41
Classification
  • Task: assign objects to classes (groups) on the
    basis of measurements made on the objects
  • Unsupervised: classes unknown, want to discover
    them from the data (cluster analysis)
  • Supervised: classes are predefined, want to use
    a (training or learning) set of labeled objects
    to form a classifier for classification of future
    observations

42
Issues in Clustering
  • Pre-processing (Image analysis and Normalization)
  • Which genes (variables) are used
  • Which samples are used
  • Which distance measure is used
  • Which algorithm is applied
  • How to decide the number of clusters K

43
Unsupervised Learning
  • Hierarchical clustering: merging two branches at
    a time until all variables (genes) are in one tree.
    It does not answer the question of how many gene
    clusters there are.
  • K-means clustering: assumes there are K clusters.
    What if this assumption is incorrect?
  • Model-based clustering: the number of clusters is
    determined dynamically; could be one of the most
    promising methods.
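
A minimal K-means sketch makes the "K is assumed in advance" point concrete: the number of clusters is fixed by the number of starting centroids. The starting centroids, the fixed iteration count and the toy one-dimensional points are all assumptions for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on tuples of numbers; K is the number of centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((x - c) ** 2 for x, c in zip(point, centre))
                for centre in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centre
            for cluster, centre in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0.0,), (0.2,), (5.0,), (5.2,)]
centres, groups = kmeans(points, centroids=[(0.0,), (5.0,)])
print(centres)
print(groups)
```

Running the same data with a different number of starting centroids always returns that many clusters, whether or not the data support it, which is the weakness the slide points out.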

44
Supervised Analysis
  • Fisher's linear discriminant analysis
  • Quadratic discriminant analysis
  • Logistic regression (a linear discriminant
    analysis)
  • Neural networks
  • Support vector machine

45
Traditional Proteomics
  • 1D gel electrophoresis (SDS-PAGE)
  • 2D gel electrophoresis
  • Protein Chips
  • Chips coated with proteins/Antibodies
  • large scale version of ELISA
  • Mass Spectrometry
  • MALDI Mass fingerprinting
  • Electrospray and tandem mass spectrometry
  • Sequencing of Peptides (N->C)
  • Matching in Genome/Proteome Databases

46
Overview of 2D Gel
  • SDS-PAGE + Isoelectric focusing (IEF)
  • Gene Expression Studies
  • Medical Applications
  • Sample Experiments
  • Capturing and Analyzing Data
  • Image Acquisition
  • Image Sizing & Orientation
  • Spot Identification
  • Matching and Analysis

47
Comparison/Matching of Gel Images
  • Compare 2 gel images
  • Set X and Y axes
  • Overlap matching spots
  • Compare intensity of spots
  • Scan against database
  • Compare query gel with all gels
  • Calculate similarity score
  • Sort based on score

48
Differential Proteomics: Fingerprints of Disease
Phenotypic Changes
  • Differential protein expression
  • Protein nitration patterns
  • Altered phosphorylation
  • Altered glycosylation profiles
  • Utility
  • Target discovery
  • Disease pathways
  • Disease biomarkers

49
Fingerprinting Technique
  • What is fingerprinting
  • It is a technique to create a specific pattern for a
    given organism/person
  • To compare the patterns of query and target objects
  • To create a phylogenetic tree/classification based
    on pattern
  • Types of Fingerprinting
  • DNA Fingerprinting
  • Mass/peptide fingerprinting
  • Properties based (Toxicity, classification)
  • Domain/conserved pattern fingerprinting
  • Common Applications
  • Paternity and Maternity
  • Criminal Identification and Forensics
  • Personal Identification
  • Classification/Identification of organisms
  • Classification of cells

50
Fingerprinting Techniques: Principles & Applications
  • What is fingerprinting
  • Types of Fingerprinting
  • Common Applications
  • Role of Computer in DNA Fingerprinting
  • Searching Restriction Enzymes
  • Searching VNTRs
  • Computation of size of DNA fragments
  • Optimization of gels
  • Comparison of patterns
  • Creation of Phylogenetic tree

51
Drug Design
  • History of Drug/Vaccine development
  • Plants or Natural Product
  • Plants and natural products were the source of
    medicinal substances
  • Example: foxglove used to treat congestive heart
    failure
  • Foxglove contains digitalis, a cardiotonic
    glycoside
  • Identification of active component
  • Accidental Observations
  • Penicillin is one good example
  • Alexander Fleming observed the effect of mold
  • Mold (Penicillium) produces the substance penicillin
  • Discovery of penicillin led to large scale
    screening
  • Soil microorganisms were grown and tested
  • Streptomycin, neomycin, gentamicin, tetracyclines
    etc.

52
Drug Design
  • Chemical Modification of Known Drugs
  • Drug improvement by chemical modification
  • Penicillin G -> Methicillin; morphine -> nalorphine
  • Receptor Based drug design
  • Receptor is the target (usually a protein)
  • Drug molecule binds to cause biological effects
  • It is also called lock and key system
  • Structure determination of receptor is important
  • Ligand-based drug design
  • Search for a lead compound or active ligand
  • Structure of the ligand guides the drug design process

53
Drug Design based on Bioinformatics Tools
  • Detect the Molecular Basis for Disease
  • Detection of drug binding site
  • Tailor drug to bind at that site
  • Protein modeling techniques
  • Traditional Method (brute force testing)
  • Rational drug design techniques
  • Screen likely compounds built
  • Modeling large number of compounds (automated)
  • Application of Artificial intelligence
  • Limitation of known structures

54
Important Points in Drug Design based on
Bioinformatics Tools
  • Application of Genome
  • 3 billion base pairs
  • 30,000 unique genes
  • Any gene may be a potential drug target
  • 500 unique targets
  • There may be 10 to 100 variants at each target
    gene
  • 1.4 million SNPs
  • 10^200 potential small molecules

55
Concept of Drug and Vaccine
  • Concept of Drug
  • Kill invading foreign pathogens
  • Inhibit the growth of pathogens
  • Concept of Vaccine
  • Generate memory cells
  • Train the immune system to face various existing
    disease agents

56
VACCINES
  • A. SUCCESS STORY
  • COMPLETE ERADICATION OF SMALLPOX
  • WHO PREDICTION: ERADICATION OF PARALYTIC
    POLIO THROUGHOUT THE WORLD BY YEAR 2003
  • SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES:
    DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA,
    POLIOMYELITIS, TETANUS
  • B. NEED OF THE HOUR
  • 1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE,
    FOR DISEASES LIKE MALARIA, TUBERCULOSIS AND AIDS
  • 2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT
    VACCINES
  • 3) LOW COST
  • 4) EFFICIENT DELIVERY TO NEEDY
  • 5) REDUCTION OF ADVERSE SIDE EFFECTS

57
Computer Aided Vaccine Design
  • Whole Organism of Pathogen
  • Consists of more than 4000 genes and proteins
  • Genomes have millions of base pairs
  • Target antigen to recognise pathogen
  • Search vaccine target (essential and non-self)
  • Consists of amino acid sequence (e.g.
    A-V-L-G-Y-R-G-C-T )
  • Search antigenic region (peptide of length 9
    amino acids)

58
Major steps of endogenous antigen processing
59
Computer Aided Vaccine Design
  • Problem of Pattern Recognition
  • ATGGTRDAR Epitope
  • LMRGTCAAY Non-epitope
  • RTTGTRAWR Epitope
  • EMGGTCAAY Non-epitope
  • ATGGTRKAR Epitope
  • GTCVGYATT Epitope
  • Commonly used techniques
  • Statistical (Motif and Matrix)
  • AI Techniques
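
The statistical (matrix) approach can be illustrated with the peptides listed on this slide. The sketch builds a per-position residue frequency matrix from the four epitopes and scores a peptide by summing the frequencies of its residues. It is a toy under strong assumptions: it is trained and evaluated on the same handful of peptides, uses no background frequencies or pseudocounts, and the names `build_matrix` and `score` are hypothetical.

```python
from collections import Counter

# Peptides and labels as listed on the slide.
EPITOPES = ["ATGGTRDAR", "RTTGTRAWR", "ATGGTRKAR", "GTCVGYATT"]
NON_EPITOPES = ["LMRGTCAAY", "EMGGTCAAY"]


def build_matrix(peptides):
    """Per-position residue frequencies from equal-length peptides."""
    length = len(peptides[0])
    matrix = []
    for i in range(length):
        counts = Counter(peptide[i] for peptide in peptides)
        matrix.append({res: n / len(peptides) for res, n in counts.items()})
    return matrix


def score(peptide, matrix):
    """Sum of the position-specific frequencies of the peptide's residues."""
    return sum(column.get(res, 0.0) for res, column in zip(peptide, matrix))


matrix = build_matrix(EPITOPES)
for peptide in EPITOPES + NON_EPITOPES:
    print(peptide, score(peptide, matrix))
```

On this tiny set every epitope scores above both non-epitopes, but that is expected when scoring the training peptides themselves; a real predictor would be validated on peptides it has not seen.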

60
Why computational tools are required for prediction
200 aa protein -> chopped to overlapping peptides of 9 amino acids -> 192 peptides -> Bioinformatics tools -> 10-20 predicted peptides -> in vitro or in vivo experiments for detecting which snippets of protein will spark an immune response.
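
The peptide count follows from sliding a 9-residue window along the protein one position at a time: a protein of length n gives n - 9 + 1 peptides. A short check, using a placeholder sequence and a hypothetical helper name:

```python
def overlapping_peptides(protein, length=9):
    """All overlapping peptides of the given length, one per start position."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# Placeholder 200-residue sequence; only its length matters here.
protein = "A" * 200
peptides = overlapping_peptides(protein)
print(len(peptides))  # 200 - 9 + 1
```

Testing all 192 candidates experimentally is costly, which is why a predictor is used to narrow them to the 10-20 most promising peptides first.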
61
Thanks