Hasan H. Otu - PowerPoint PPT Presentation

About This Presentation
Title:

Hasan H. Otu

Description:

From Sequence to Function to Network: Analysis Issues in Bioinformatics Hasan H. Otu hotu_at_bidmc.harvard.edu BIDMC Genomics Center Harvard Medical School – PowerPoint PPT presentation

Slides: 54

Transcript and Presenter's Notes

Title: Hasan H. Otu


1
From Sequence to Function to Network: Analysis
Issues in Bioinformatics
  • Hasan H. Otu
  • hotu_at_bidmc.harvard.edu

2
  • Bioinformatics is an information system for the
    management and analysis of life-science data.

Data Storage and Management
Interpretation of Results
Data Analysis
  • Molecular Sequence Analysis
  • Homology Search
  • Phylogeny Construction
  • Whole Genome Sequencing
  • Gene Finding
  • Protein Structure Prediction
  • Protein/RNA tertiary structure
  • Docking
  • Drug Design
  • Functional Genomics and Proteomics
  • Microarrays
  • Biomarker Discovery
  • Systems Biology
  • Pathways
  • Network-based holistic approach

3
(No Transcript)
4
A and G: Purines
T and C: Pyrimidines
5
Central Dogma of Molecular Biology
6
Prokaryotes
Eukaryotes
7
Translation
  • Amino Acid Translation Table
  • There are 64 possible codons
  • Only 20 standard amino acids in nature
  • One start codon, three stop codons
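
As a minimal sketch of the translation table, the 64 codons can be enumerated and mapped to amino acids in Python. The amino-acid string is the standard genetic code listed in T, C, A, G order; the names CODON_TABLE and translate are illustrative assumptions, not part of the original slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

The table has 64 entries, 20 distinct amino acids and three stop codons (TAA, TAG, TGA); ATG, the usual start codon, encodes methionine (M).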

Protein AA Chain
8
(No Transcript)
9
Sequence Comparison
  • Finding similarity between sequences is important
    for many biological questions
  • For example:
  • Find similar proteins
  • Allows prediction of function and structure
  • Locate similar subsequences in DNA
  • Allows identification of, e.g., regulatory elements
  • Locate DNA sequences that might overlap
  • Helps in sequence assembly

10
Sequence Alignment
  • Input: two sequences over the same alphabet
  • Output: an alignment (inserting gaps into the
    sequences so that their lengths become the same)
    of the two sequences
  • Example:
  • GCGCATGGATTGAGCGA
  • TGCGCCATTGATGACCA
  • A possible alignment:
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A

Need a scoring function. Goal: find the alignment
with the best score (and assign significance).
Alternative view: edit distance, the cost of the
cheapest set of edit operations needed to
transform one sequence into the other.
Current approaches: dynamic programming, HMMs
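
The edit-distance view can be sketched with a small dynamic program. This is a minimal illustration that assumes unit cost for every insertion, deletion and substitution; the slides do not state a scoring scheme, and the function name edit_distance is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

The alignment shown on the slide uses four gaps and two mismatches, so under unit costs the edit distance between GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA is at most 6.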
11
Average Mutual Information (AMI) Profile
Shotgun Sequencing
What is the information a given base carries
about a base k positions apart?
I(k) = \sum_{i,j} p_{ij}(k) \log_2 \frac{p_{ij}(k)}{p_i p_j}
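
A minimal sketch of how I(k) could be estimated from a single sequence follows. It assumes that p_ij(k) is estimated from the observed pairs of bases k positions apart and that p_i and p_j are the base frequencies of the whole sequence; the function name ami is hypothetical, and the cited paper may estimate these quantities differently.

```python
import math
from collections import Counter

def ami(seq, k):
    """Estimate I(k) = sum_ij p_ij(k) * log2(p_ij(k) / (p_i * p_j)) in bits."""
    pair_counts = Counter(zip(seq, seq[k:]))   # bases exactly k positions apart
    n_pairs = sum(pair_counts.values())
    base_counts = Counter(seq)
    n = len(seq)
    total = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = base_counts[a] / n
        p_b = base_counts[b] / n
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total

def ami_profile(seq, max_k):
    """AMI profile: I(k) for k = 1 .. max_k."""
    return [ami(seq, k) for k in range(1, max_k + 1)]
```

For a perfectly periodic sequence such as "ACGT" repeated, a base fully determines the base four positions away, so I(4) reaches the maximum of 2 bits.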
Overall System
Fragments
Vector Quantization
AMI Profiles
Process Clusters
MA Clusters (Consensus)
Output
Otu and Sayood, Bioinformatics, 2003; 19:22-9
12
A phylogeny (also called a phylogenetic tree) is a
tree that describes the sequence of speciation
events that led to a set of present-day species.
Also used to understand the functional relatedness
of a group of genes or proteins
  • Classical phylogenetic analysis: morphological
    features (number of legs, lengths of legs, etc.)
  • Modern biological methods allow the use of
    molecular features:
  • Gene sequences
  • Protein sequences

13
Morphological topology
Archonta
Ungulata
14
From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use
(e.g. Mitochondrial vs Nuclear proteins).
15
Mitochondrial topology
16
Nuclear topology
17
Three Methods of Tree Construction
  • Distance: (i) compute the distances between
    molecular sequences; (ii) find a tree that realizes
    the distances between the objects (UPGMA, NJ, etc.).
  • Parsimony: a tree with the minimum total number of
    character changes between nodes.
  • Likelihood: a tree with the highest likelihood of
    explaining the relation between the underlying
    sequences, given an evolutionary model.

18
Computing Distances Between Sequences
Given a multiple alignment of N sequences, assume:
- Each position in a DNA sequence is independent
- Each position can mutate with the same
  probability to any other base
D = k / n, where n is the length of the sequence
and k is the number of nucleotides that differ
A simple model
Jukes-Cantor distance: D_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)
Kimura 2-parameter distance: D_K = \frac{1}{2} \ln\frac{1}{1 - 2P - Q} + \frac{1}{4} \ln\frac{1}{1 - 2Q}
P = fraction of transitions (changes within purines
or within pyrimidines)
Q = fraction of transversions (changes between
purines and pyrimidines)
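
The distances on this slide can be sketched in a few lines of Python. The sketch assumes two aligned, gap-free sequences of equal length over A, C, G, T and uses the standard forms of the Jukes-Cantor and Kimura 2-parameter corrections; all function names are illustrative.

```python
import math

PURINES = {"A", "G"}

def substitution_fractions(seq1, seq2):
    """Return (P, Q): fractions of transitions and transversions between aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n

def jukes_cantor(d):
    """D_JC = -(3/4) ln(1 - (4/3) D), where D = k / n is the fraction of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

def kimura_2p(p, q):
    """D_K = (1/2) ln(1 / (1 - 2P - Q)) + (1/4) ln(1 / (1 - 2Q))."""
    return 0.5 * math.log(1.0 / (1.0 - 2.0 * p - q)) + 0.25 * math.log(1.0 / (1.0 - 2.0 * q))
```

When transversions are exactly twice as frequent as transitions (Q = 2P, as expected if every base change is equally likely), the Kimura distance reduces to the Jukes-Cantor distance with D = P + Q.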
19
Non-Distance-Based Method: Maximum Likelihood
Given a probabilistic model for nucleotide
(or protein) substitution (e.g., Jukes-Cantor),
pick the tree that has the highest
probability of generating the observed data, i.e.,
given data D and model M, find the tree T such that
Pr(D | T, M) is maximized. The model gives values
p_{ij}(t), the probability of going from
nucleotide i to j in time t
20
Maximum Likelihood
  • Requires a multiple alignment (MA) of sequences
  • Makes 2 independence assumptions
  • Different sites evolve independently
  • Diverged sequences (or species) evolve
    independently after diverging
  • If D_i is the data for the i-th site, then
    Pr(D | T, M) = \prod_i Pr(D_i | T, M)

21
Maximum Likelihood: How to calculate Pr(D_i | T, M)?

p_{xy}(t): probability of going from x to y in time t
22
ML: Possible Trees
Sequence W: A C G C G T T G G G
Sequence X: A C G C G T T G G G
Sequence Y: A C G C A A T G A A
Sequence Z: A C A C A G G G A A
23
Likelihood for One Path
L(path) = L(root) \times \prod L(branches)
        = P(G) P(G \to T) P(G \to G) P(G \to A) P(G \to G) P(T \to T) P(T \to T)

24
Sum over all paths
L(column) = \sum L(all possible evolutionary paths)
          = L(path_1) + L(path_2) + L(path_3) + \dots + L(path_{64})
25
Whole Sequence Likelihood
L(sequence) = \prod_i L(position i)
Choose the tree with the maximum likelihood.
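
The path, column and whole-sequence likelihoods can be sketched for the four sequences W, X, Y, Z. This illustration makes several assumptions that the slides do not state: a rooted tree ((first, second), (third, fourth)) with three internal nodes, which gives the 4^3 = 64 paths; the Jukes-Cantor model for p_xy(t); uniform base frequencies at the root; and the same hypothetical branch length t = 0.1 on every branch.

```python
import itertools
import math

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor p_xy(t); t is measured in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def column_likelihood(w, x, y, z, t=0.1):
    """Sum L(path) over all 64 assignments of the root and the two internal nodes."""
    total = 0.0
    for r, u, v in itertools.product(BASES, repeat=3):
        path = 0.25                                    # L(root)
        path *= jc_prob(r, u, t) * jc_prob(r, v, t)    # root to internal nodes
        path *= jc_prob(u, w, t) * jc_prob(u, x, t)    # left internal node to its two leaves
        path *= jc_prob(v, y, t) * jc_prob(v, z, t)    # right internal node to its two leaves
        total += path
    return total

def log_likelihood(seqs, t=0.1):
    """Log of the product of column likelihoods for the tree ((seqs[0], seqs[1]), (seqs[2], seqs[3]))."""
    return sum(math.log(column_likelihood(*column, t=t)) for column in zip(*seqs))

W = "ACGCGTTGGG"
X = "ACGCGTTGGG"
Y = "ACGCAATGAA"
Z = "ACACAGGGAA"
```

Passing the sequences in a different order evaluates a different topology: log_likelihood([W, X, Y, Z]) scores the tree ((W, X), (Y, Z)), while log_likelihood([W, Y, X, Z]) scores ((W, Y), (X, Z)). Under these assumptions the first tree, which groups the identical sequences W and X, has the higher likelihood.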
26
Non-Distance-Based Method: Parsimony
Minimal Evolution Principle
Most parsimonious tree: the tree with the minimal
parsimony score
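
One common way to compute a parsimony score is Fitch's algorithm, sketched below. The slides do not name an algorithm, so this choice is an assumption. Trees are written as nested pairs of leaf names, and the sequences W, X, Y, Z are the ones used in the maximum likelihood example earlier in the transcript.

```python
def fitch(tree, column):
    """Return (candidate state set, number of changes) for one alignment column.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    column: dict mapping leaf name to the base observed at this site.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Total number of changes over all columns of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )

SEQS = {
    "W": "ACGCGTTGGG",
    "X": "ACGCGTTGGG",
    "Y": "ACGCAATGAA",
    "Z": "ACACAGGGAA",
}
```

For these four sequences the tree (("W", "X"), ("Y", "Z")) needs 7 changes, while the two alternative groupings need 10 each, so the first tree is the most parsimonious.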
27
Proposed Distance
Complexity of DNA Sequences Through a Production
Process
Produce Q from S
S = AACGTACCATTG, T = CTAGGGACTTAT, Q = ACGGTCACCAA
Example: S = AACGTACCGCGTCCGCA
H(S) = A · AC · G · T · ACC · GC · GTC · CGCA
c(S) = 8
H(SQ): AACGTACCATTGACGGTCACCAA
H(TQ): CTAGGGACTTATACGGTCACCAA
c(SQ) - c(S) = 3
c(TQ) - c(T) = 5
Q is closer to S than it is to T
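
The production process can be sketched as a Lempel-Ziv style parse in which each new component is the shortest substring that cannot be copied from the text preceding its last symbol. This is a minimal illustration; the function names are hypothetical, and the published distance built on these counts may include a normalization that is not shown here.

```python
def lz_history(seq):
    """Parse seq into its exhaustive history; returns the list of components."""
    components = []
    i = 0
    n = len(seq)
    while i < n:
        j = i + 1
        # Extend the component while it can still be copied from the text before its last symbol.
        while j <= n and seq[i:j] in seq[:j - 1]:
            j += 1
        components.append(seq[i:min(j, n)])
        i = j
    return components

def lz_complexity(seq):
    """c(seq): number of components in the exhaustive history."""
    return len(lz_history(seq))

def extra_steps(reference, query):
    """c(reference + query) - c(reference): steps needed to produce query given reference."""
    return lz_complexity(reference + query) - lz_complexity(reference)
```

With the slide's sequences, lz_history("AACGTACCGCGTCCGCA") gives the eight components listed for H(S), extra_steps("AACGTACCATTG", "ACGGTCACCAA") is 3 and extra_steps("CTAGGGACTTAT", "ACGGTCACCAA") is 5.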
Application to phylogeny reconstruction
Whole mtDNA Genome
Fungi Phylogeny
Bastola, Otu et al. Mycological Research 2004;
108(2):117-125
Otu and Sayood. Bioinformatics 2003; 19:2122-2130
28
Microarray Principle
ATTCCTACTTA
Affymetrix GeneChip Approach
Gene Sequence
25 mer
29
Affy Signal Values
30
Stemness Genes: do they really exist?
(Nanog)
  • Cell Cycle
  • DNA processing (replication, repair,
    helicase, binding protein)
  • RNA processing
  • Chromatin modifiers
  • Transcription factors

Otu et al. Science. 2003; 302:393
31
Renal Cell Cancer
Jones, Otu et al. 2005 Clinical Cancer Research
32
Regenerating liver and developing liver are
distinct
Selected overrepresented Gene Ontology (GO)
categories
Otu et al. J. Biol. Chem. 2007; 282:11197-11204
33
Transcriptome of Human Oocyte
Proc. Natl. Acad. Sci. USA (2006) 103, 14027-14032
Genes and gene products responsible for
dedifferentiation of somatic cells. US Patent by
MSU, GIS, and HMS
34
Type 2 Diabetes → Diabetic Nephropathy
Predictive Signature
  • 62 Pima Indians
  • All with T2D at baseline
  • 31 go on to develop DN within 10 years
  • Training: 14 case, 14 control
  • Validation: 17 case, 17 control

12-peak signature: 89% accuracy on training (93%
sensitivity, 86% specificity); 74% accuracy on
validation (71% sensitivity, 76% specificity)
Otu et al. Diabetes Care, 2007; 30:638-643
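
The reported percentages can be reproduced from a confusion matrix. The counts used below are not given on the slide; they are an assumption, inferred as the only whole-number counts consistent with 14 + 14 training and 17 + 17 validation subjects and the rounded percentages.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of cases correctly called
    specificity = tn / (tn + fp)                 # fraction of controls correctly called
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Inferred (assumed) counts: cases detected / missed, controls correct / wrong.
training = classification_metrics(tp=13, fn=1, tn=12, fp=2)
validation = classification_metrics(tp=12, fn=5, tn=13, fp=4)
```

Rounded to whole percentages, training gives 93, 86 and 89, and validation gives 71, 76 and 74, matching the figures on the slide.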
35
Completed Projects
  • Transcriptional Effects of PTH and Estrogen
    Combination Treatment During Anabolic Bone
    Formation
  • J. Cell. Biochem. 2004; 93:476-490
  • Differences in Gene Expression Profiles of
    Diabetic and Non-diabetic Patients Undergoing
    Cardiopulmonary Bypass and Cardioplegic Arrest
  • Circulation 2004; 110:II-280-286
  • High-Throughput Generation of Reliable Serum and
    Plasma Protein Profiles with SELDI-TOF MS
  • Clinical Chemistry and Laboratory Medicine 2005;
    43(2):133-140
  • Preconditioning Of Primary Human Endothelial
    Cells With Inflammatory Mediators Alters The Set
    Point Of The Cell
  • FASEB Journal 2005; 19(13):1914-1916
  • Unique Gene Expression Profile based upon
    Pathologic Response in Epithelial Ovarian Cancer
  • Journal of Clinical Oncology 2005;
    23(31):7911-7918
  • A Novel Role For Gadd45β As A Mediator Of MMP-13
    Gene Expression During Chondrocyte Terminal
    Differentiation
  • Journal of Biological Chemistry 2005;
    280(46):38544-38555

36
Completed Projects
  • Proteomic Analysis Of The Allograft Response
  • Transplantation 2006; 82(2):267-274
  • Differential Gene Expression Analysis Reveals
    Generation Of An Autocrine Loop By A Mutant EGFR
    In Glioma Cells
  • Cancer Research 2006; 66(2):867-874
  • A Novel Class of VEGF-Responsive Genes That
    Require Forkhead Activity for Expression
  • Journal of Biological Chemistry 2006;
    281(46):35544-53
  • Essential Role Of Jun Family Transcription
    Factors In PU.1-induced Leukemic Stem Cell
    Transformation
  • Nature Genetics, 2006; 38(11):1269-77
  • A Novel Pathway Involving Melanoma
    Differentiation Associated Gene-7/Interleukin-24
    Mediates Nonsteroidal Anti-inflammatory
    Drug-Induced Apoptosis and Growth Arrest of
    Cancer Cells
  • Cancer Res. 2006; 66(24):11922-31
  • Serum Proteome Profiling Detects Myelodysplastic
    Syndromes and Identifies CXC Chemokine Ligands 4
    and 7 As Markers For Advanced Disease
  • PNAS, 2007; 104(4):1307-12

37
Completed Projects
  • Reduced PDEF Expression Enhances Prostate Cancer
    Cell Motility and Invasiveness Due To A Switch
    From Epithelial To Mesenchymal Gene Expression
  • Cancer Research, 2007; 67(9):4219-26
  • A High Fat, Ketogenic Diet, Induces a Unique
    Metabolic State in Mice.
  • American Journal of Physiology-Endocrinology and
    Metabolism, 2007; 292(6):E1724-39
  • c-Fos as a Pro-Apoptotic Agent in TRAIL-induced
    Apoptosis in Prostate Cancer Cells.
  • Cancer Research, 2007; 67(19):9425-9434
  • Oxidative Stress and Atrial Fibrillation After
    Cardiac Surgery: A Case-Control Study.
  • Ann. Thoracic Surgery, 2007; 84(4):1166-1173.
  • Genomic Expression Pathways Associated to Brain
    Injury after Cardiopulmonary Bypass.
  • Journal of Thoracic and Cardiovascular Surgery,
    2007; 134(4):996-1005.
  • Serum Proteomics and Biomarkers in
    Hepatocellular Carcinoma and Chronic Liver
    Disease
  • Clinical Cancer Research, 2008; 14(2):470-7
  • Proteomic Identification of Interleukin-2
    Therapy Response in Metastatic Renal Cell Cancer.

38
Completed Projects
  • Gene expression of purified beta cell tissue
    obtained from human pancreas with laser capture
    microdissection.
  • The Journal of Clinical Endocrinology &
    Metabolism, 2008; 93(3):1046-1053.
  • Genomic Counter-Stress Changes Induced by the
    Relaxation Response.
  • PLoS One, 2008; 3(7):e2576.
  • A Role for GADD45β as a Survival Factor in
    Articular Chondrocytes in Normal and
    Osteoarthritic Cartilage
  • Arthritis & Rheumatism, 2008; 58(7):2075-87.
  • Gene expression profile of mouse prostate tumors
    reveals dysregulations in major biological
    processes and identifies potential murine targets
    for preclinical development of human prostate
    cancer therapy.
  • The Prostate, 2008 Oct 1; 68(14):1517-30.
  • Gene expression analysis of embryonic stem cells
    expressing VE-cadherin (CD144) during endothelial
    differentiation.
  • BMC Genomics, 2008; 9:240.
  • Differential gene expression of bone
    marrow-derived CD34+ cells is associated with
    survival of patients suffering from
    myelodysplastic syndrome.
  • Int J Hematol., 2009; 89(2):173-87

39
Ongoing Projects
  • Identifying Reprogramming Genes / iPS cell
    characterization
  • Jose Cibelli, Michigan State University
  • Stem Cell-Differentiation Mechanism
  • Bing Lim, Genome Institute of Singapore
  • Genomic Changes Induced by Relaxation Response
  • Herbert Benson, Harvard Medical School
  • Effects of Soy Phytochemicals on Bladder Cancer
  • JR Zhou, Harvard Medical School
  • Genome Wide Association for Brain Aneurysms
  • Murat Gunel, Yale School of Medicine
  • Effect of Nicotine on Palate Development.
  • Ali Nawshad, University of Nebraska

40
The Challenge: Create Order Out of Chaos
Integration of Disparate Clinical, Genomic and
Proteomic Data Into Biological Pathways
41
Infrastructure
42
Research Portal
www.bidmcgenomics.org
43
Assigning Significance to Subclusters of
Experimental Samples (ASSESS)
  • Inspired by Felsenstein's Bootstrap Method for
    Phylogeny
  • Sub-sample expression data
  • Build a hierarchical clustering (HC) tree for each
    sub-sample
  • Generate the consensus tree from above
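
The bootstrap idea can be sketched in a deliberately simplified form. This is not the ASSESS implementation: instead of building a full HC tree and a consensus tree for every replicate, the sketch only checks how often a given pair of samples is the closest pair (the first merge of an agglomerative clustering) when genes are resampled with replacement. All names and the toy data are hypothetical.

```python
import math
import random

def closest_pair(samples, genes):
    """Names of the two samples with the smallest Euclidean distance on the chosen genes."""
    names = sorted(samples)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist = math.sqrt(sum((samples[a][g] - samples[b][g]) ** 2 for g in genes))
            if best is None or dist < best[0]:
                best = (dist, (a, b))
    return best[1]

def bootstrap_support(samples, pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the first pair to be merged."""
    rng = random.Random(seed)
    n_genes = len(next(iter(samples.values())))
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        # Resample gene indices with replacement, as in Felsenstein's bootstrap.
        genes = [rng.randrange(n_genes) for _ in range(n_genes)]
        hits += closest_pair(samples, genes) == target
    return hits / replicates

# Toy expression data: samples A and B are nearly identical, C and D are far away.
TOY = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.1, 2.1, 3.1, 4.1, 5.1],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
    "D": [-10.0, -20.0, -30.0, -40.0, -50.0],
}
```

A support value near 1 suggests that the subcluster is stable under resampling, while a value near 0 suggests that it is not.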

44
ASSESS
Otu et al. EIT, 2005 pp1-6
45
Bayesian Networks
  • Consists of two components
  • A directed acyclic graph G whose nodes are the
    random variables Xi
  • θ describes the conditional distribution for each
    variable
  • Each variable Xi is independent of its
    non-descendants, given its parents in G
  • BNs are
  • Inherently Stochastic
  • Resistant to Noise
  • Able to Capture Gene Interactions
  • Suitable for Gene Regulation Network Analysis
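
The two components of a Bayesian network can be sketched with a toy example: a graph given as parent lists and a set of conditional probability tables (CPTs). The three-gene network and all probabilities below are made up for illustration; only the factorization of the joint distribution into per-variable conditionals follows from the definition.

```python
from itertools import product

# Hypothetical network: gene A regulates genes B and C (binary states: 0 = off, 1 = on).
PARENTS = {"A": (), "B": ("A",), "C": ("A",)}

# CPT[node][parent states] = P(node = 1 | parents); the values are illustrative only.
CPT = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}

def joint_probability(assignment):
    """P(A, B, C) as the product over nodes of P(X_i | parents(X_i))."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p_on = CPT[node][tuple(assignment[p] for p in parents)]
        prob *= p_on if assignment[node] == 1 else 1.0 - p_on
    return prob

def all_assignments():
    """Every combination of on/off states for the three genes."""
    for values in product((0, 1), repeat=len(PARENTS)):
        yield dict(zip(PARENTS, values))
```

For example, joint_probability({"A": 1, "B": 1, "C": 0}) multiplies P(A=1) = 0.3, P(B=1 | A=1) = 0.8 and P(C=0 | A=1) = 0.8; summing the joint probability over all eight assignments gives 1.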

46
Bayesian Pathway Analysis
47
Simulation Results
                      Data following CPT     Data inconsistent with CPT
BN name  # of nodes   Score     p-value      Score     p-value
Alarm    37           -9,955    0            -22,600   0.56
Asia     8            -2,221    0            -2,926    0.54
BN1      19           -9,344    0            -10,213   0.62
BN2      8            -3,569    0            -3,874    0.54
BN3      21           -10,844   0            -12,763   0.55
BN4      36           -20,074   0            -21,746   0.59
BN5      18           -9,607    0            -10,245   0.50
BN6      29           -15,859   0            -17,122   0.64
BN7      19           -9,804    0            -10,996   0.65
BN8      53           -29,937   0            -32,262   0.67
48
RCC Dataset
Pathway Definition | Score | # nodes | P-value
Glycolysis / Gluconeogenesis - Homo sapiens (human) -802.593 27 0
Cell cycle - Homo sapiens (human) -2508.29 74 0
Citrate cycle (TCA cycle) - Homo sapiens (human) -557.116 17 0
Fatty acid metabolism - Homo sapiens (human) -684.362 22 0
Purine metabolism - Homo sapiens (human) -1586.13 65 0
Glutamate metabolism - Homo sapiens (human) -840.076 21 0
Alanine and aspartate metabolism - Homo sapiens (human) -922.135 22 0
Valine, leucine and isoleucine degradation - Homo sapiens (human) -888.348 26 0
Fluorobenzoate degradation - Homo sapiens (human) 0 0 0
beta-Alanine metabolism - Homo sapiens (human) -439.498 12 0
Glutathione metabolism - Homo sapiens (human) -399.509 11 0
Pyruvate metabolism - Homo sapiens (human) -710.303 19 0
1,4-Dichlorobenzene degradation - Homo sapiens (human) 0 0 0
One carbon pool by folate - Homo sapiens (human) -554.229 22 0
PPAR signaling pathway - Homo sapiens (human) -1724.8 42 0
ErbB signaling pathway - Homo sapiens (human) -1592.59 44 0
TGF-beta signaling pathway - Homo sapiens (human) -1415.7 42 0
Toll-like receptor signaling pathway - Homo sapiens (human) -2060.7 60 0
Synthesis and degradation of ketone bodies - Homo sapiens (human) -207.897 5 0
Fc epsilon RI signaling pathway - Homo sapiens (human) -1135.56 35 0
Natural killer cell mediated cytotoxicity - Homo sapiens (human) -2123.25 59 0.001
B cell receptor signaling pathway - Homo sapiens (human) -1187.39 35 0.001
MAPK signaling pathway - Homo sapiens (human) -3889.26 115 0.002
Insulin signaling pathway - Homo sapiens (human) -1953.19 60 0.002
3-Chloroacrylic acid degradation - Homo sapiens (human) -73.9607 2 0.003
Nicotinate and nicotinamide metabolism - Homo sapiens (human) -508.979 18 0.003
VEGF signaling pathway - Homo sapiens (human) -772.384 27 0.003
Arginine and proline metabolism - Homo sapiens (human) -798.486 23 0.004
T cell receptor signaling pathway - Homo sapiens (human) -1887.94 53 0.004
Wnt signaling pathway - Homo sapiens (human) -1847.35 60 0.005

Pathway Definition
Glycolysis / Gluconeogenesis
Pyruvate metabolism
Citrate cycle
Arginine and proline metabolism
Urea cycle
Propanoate metabolism
Butanoate metabolism
Lysine degradation
Valine, leucine and isoleucine degradation
p53-mediated pathway
Purine Metabolism
Experimental proteomics study of RCC
BPA Analysis Results of External Microarray Data
49
Camel EST Sequencing
Color \ Age   Young (1-2 yrs. old)   Adult (5-6 yrs. old)   Aged (9-10 yrs. old)
Black         11 tissues             11 tissues             11 tissues
Brown         11 tissues             11 tissues             11 tissues
White         11 tissues             11 tissues             11 tissues
Camel 1, Camel 2, Camel 3
384 x 61 = 23,424 reads each; 70,272 total reads
Tissues: Br, Liv, Kid, Hrt, Bld, Stm, Lng, Spl, Pan,
Gen, Msc
50
EST processing flowchart (tools and steps):
  • Raw chromatogram
  • PHRED: Sequence and Quality Files
  • LUCY2: Vector and Low Quality Base Trimming
  • Check for Chimeras
  • REPEAT MASKER: Mask for Repeats
  • RBR
  • Merge Repeats
  • TGICL: Clustering
  • CAP3: Assembly
  • Contig/Singlet Processing
  • BLASTX, BLASTN: Homology Search
  • Organism Based Annotation
  • Functional Annotation
  • Full Length cDNA Analysis
  • ORF Analysis
51
Read Statistics
Untrimmed:
  # of reads                           70,272
  Average read length                  1,447 ± 411 bp
  Average # of high quality bp/read    614 ± 283
After trimming:
  # of reads after trimming            58,842
  Average read length                  755 ± 171 bp
  Average # of high quality bp/read    670 ± 181
# of chimeric sequences                1,241
# of reads after chimera analysis      59,534
# of reads with repeat region          18,340 (30.8%)
Total # of bp masked due to repeats    2.5 x 10^6 (5.5%)

Sequence Statistics
# of contigs                           8,319
# of singletons                        15,283
Total # of sequences                   23,602
Average # of reads per contig          5.2
Average contig length                  1,247 bp
Average singleton length               696 bp
Average ORF length (contig)            673 bp
Average ORF length (singleton)         390 bp
# of sequences with hit                contigs 7,490; singletons 11,480
# of sequences with no hit             contigs 829; singletons 3,803
52
Other Microarray Applications
  • Tiling Arrays (covers whole genomic region)
  • Exon Arrays (alternative splicing)
  • microRNA Arrays
  • Methylation Arrays
  • SNP Arrays (Genome Wide Association)
  • Cytogenetics (chromosomal aberrations)
  • Promoter Arrays

53
Future Directions
  • Identification of species in a mixed culture
    (AMI)
  • Motif finding, classification, multiple alignment
    (LZ)
  • Network based analysis of high throughput data
    using external knowledge
  • Unified analysis of multiple measurements of same
    biological sample