Micro Array Analysis
Provided by: Sam5170

1
Micro Array Analysis
  • Lisi's and Ivan's data sets

2
Biological Background
  • Mechanisms protect from DNA damage
  • Prevention of DNA damage
  • Aiding in DNA repair
  • Elimination of damaged cells through apoptosis
  • Lack of DNA repair pathways -> high risk for
    cancer
  • DNA damage -> cancer, aging, toxicity
  • Alkylating agents: diet, atmosphere,
    chemotherapy, smoking
  • UV radiation, polymerase errors, spontaneous DNA
    decomposition

3
Biological Background 2
  • 7MeG damages N7 position of guanine -> 7MeG not
    mutagenic or lethal
  • Lesions O6MeG and 3MeA
  • O6MeG lesions caused by cancer chemotherapy,
    MNNG, and BCNU
  • MMR can fix single base mismatches and
    insertion/deletion loops -> low mutation rate
  • E. coli MMR: excision of hundreds of bases near
    the mismatch on the daughter strand, followed by
    their replacement

4
Biological Background 3
  • Mammalian MMR: more complex mechanism,
    strand-specific, lack of strand-discrimination
    mechanism
  • MSH2, MSH3, MSH6 recognize heteroduplex DNA
  • MSH4 and MSH5: recombination
  • EXO1 does exonucleolytic reaction
  • PCNA: mismatch recognition

5
Biological Background 4
  • Mutations in MMR -> increase in spontaneous
    mutation rate -> microsatellite instability ->
    cancer (nonpolyposis colon cancer)
  • Explain normal mismatch repair

6
Microarray Technology
  • Measures simultaneously relative expression level
    of thousands of genes within a specific tissue
  • The mRNA transcripts of a cell are isolated and
    reverse transcribed to cDNA; this is the cDNA
    library of a cell
  • The cDNA representation of a cell is hybridized to
    labeled cDNA or to synthetic oligonucleotides
    (short sequences of single-stranded cDNA)
  • The cDNA or oligos on the chip are called probes,
    while the cDNA of the cell is the target

7
cDNA Arrays
  • Selection of probes
  • Amplification of cellular mRNA -> cDNA through
    PCR
  • Each spot in cDNA array is a gene or EST
  • Extract total mRNA from two cell types, label
    with green and red -> relative abundance for each
    gene in the two cells

8
Oligo arrays
  • Cellular mRNA -> cDNA -> cRNA
  • Photolithographic, Short cDNAs
  • Represent genes by fixed length independent
    segments
  • Each probe is 25 bp, each gene represented by a
    number of probe pairs
  • Well chosen probes to specify gene uniquely and
    reduce chances of cross hybridization
  • Probe pair consists of perfect match and mismatch
  • Mismatch probe is the same as the perfect match
    except for a base inversion in a central position

9
Pros and Cons of cDNA arrays vs Oligo arrays
10
Analysis Overview
  • Background subtraction (Affymetrix)
  • Normalization (RMA express, the effect of the
    number of chips?)
  • Filtering
  • P/A (all Ps, 3 of 4, 2 of 4, 0 of 4, etc)
  • Same/Same filters for duplications
  • Find correlation between duplicates and
    triplicates
  • Find Log2 ratios
  • Find Up and Down regulated genes (use cutoff of
    +/- 1, arbitrary)
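
The filtering, log2 ratio, and cutoff steps in this overview can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, replicates are assumed to be averaged before the ratio is taken, and the cutoff is assumed to be inclusive. The slides do not specify any of these details.

```python
import math

def passes_pa_filter(calls, min_present):
    """True if at least min_present of the P/A calls are 'P' (e.g. 3 of 4)."""
    return sum(1 for call in calls if call == "P") >= min_present

def log2_ratio(treated, baseline):
    """Log2 of (mean treated intensity / mean baseline intensity).

    Assumption: replicates are averaged first, then the ratio is taken.
    """
    mean_treated = sum(treated) / len(treated)
    mean_baseline = sum(baseline) / len(baseline)
    return math.log2(mean_treated / mean_baseline)

def regulation(ratio, cutoff=1.0):
    """Label a log2 ratio with the +/- cutoff (1.0 is a twofold change)."""
    if ratio >= cutoff:
        return "up"
    if ratio <= -cutoff:
        return "down"
    return "unchanged"
```

For example, `regulation(log2_ratio([400.0, 400.0], [100.0, 100.0]))` labels a fourfold increase as up regulated.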

11
Analysis 2
  • Find Up and Down regulated by T-test
  • Make GO maps using annotation database
  • Clustering
  • Protein-Protein databases correlated with log
    ratios (redo)
  • SIPnomes and log ratios

12
Future Analysis Plans
  • Find Up and Down regulated by LPE test and
    Generalized Likelihood ratio test
  • Predictions of complexes and new pathways
  • Classification of unknown sample by class
    discovery
  • Time series analysis (regulatory networks,
    periodic expression, coregulation)
  • 9 region graphs
  • PCA
  • Using e-value cutoffs for subnetworks

13
Future Analysis Plans 2
  • Long term outcome prediction
  • Prediction of viability using network
    interactions
  • Phenotype data, Phosphorylation data,
    localization data, protein abundances
  • Mathematical properties of the network
  • Clustering methods comparison
  • Robustness of network
  • mRNA degradation effect
  • Genetic diagnostic test
  • Promoter Mapping

14
Annotation Database
  • Probe Set ID, Title, Gene Symbol, Location
  • Unigene ID, LocusLink ID, Swissprot ID, Ensembl
    ID
  • GO
  • Biological Process
  • Cellular Component
  • Molecular Function
  • Pathways
  • Etc
  • Verified annotation cross-referencing

15
My program
  • Inputs
  • P/A calls from Affymetrix analysis
  • Annotation database
  • Normalized Cell files
  • Names of Cell files and their replicates
  • Names of one or more baseline Cell files
  • Fold cutoff

16
Names of Cell Files
  • TK6 Untreated,6-27-021.CEL,6-28-022.CEL
  • TK6 24 hr,6-28-024.CEL,6-28-025.CEL
  • TK6 48 hr,6-29-028.CEL,6-28-029.CEL
  • TK6 48 hr 2,6-27-028.CEL,6-28-029.CEL
  • ----------------------------------------
  • TK6-MGMT Untreated,6-27-0210.CEL,6-28-0211.CEL
  • TK6-MGMT 24 hr,6-27-0214.CEL,6-28-0215.CEL
  • TK6-MGMT 48 hr,6-27-0217.CEL,6-28-0218.CEL
  • -----------------------------------------------
  • MT1 Untreated,6-27-0219.CEL,6-28-0220.CEL,6-28-0221.CEL
  • MT1 24 hr,6-27-0222.CEL,6-28-0223.CEL,6-28-0224.CEL
  • MT1 48 hr,6-27-0225.CEL,6-28-0226.CEL,6-28-0227.CEL

17
Program Outputs
  • Makes new database of average over replicates,
    log2 ratios
  • Upregulated and Downregulated Lists for any
    combination of P/A filters and Same/Same filters
  • Number up or down regulated
  • List of probeset ID, gene symbol, title, GO
    (Biological Process), and log 2 ratio
  • List of probesets up or downregulated ready for
    GO analysis
  • Set Operations
  • Intersection
  • In A not B
  • In A not B,C,D,
  • 9 regions (explain)
  • Cross-referencing of protein-protein data with
    expression data
  • Subnetworks of up or down regulation
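
The set operations listed above map directly onto Python sets. The sketch below is a hypothetical illustration, not the program described on this slide; the function names and the probeset IDs in the usage note are made up.

```python
def in_all(*gene_lists):
    """Intersection: probesets present in every list."""
    return set.intersection(*(set(genes) for genes in gene_lists))

def in_first_only(first, *others):
    """'In A not B, C, D, ...': probesets in the first list and in no other."""
    return set(first).difference(*others)
```

With `a = ["p1", "p2", "p3"]`, `b = ["p2", "p3"]`, and `c = ["p3", "p4"]`, `in_all(a, b, c)` keeps only `p3` and `in_first_only(a, b, c)` keeps only `p1`.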

18
Correlation of Ivan's Duplicates
Example of correlation between duplicate
experiments. Similar correlation for all duplicates.
19
Ivan's Counts for TK6 24 hours with Filtering
Shows that no filtering is best
Used probeset IDs
20
TK6 24 hours no filtering
21
Ivan's Counts no Filtering
22
Example Genes Up Regulated (TK6 at 24 hours)
23
Example Genes Downregulated (TK6 at 24 hours)
24
Correlation of Lisi's Triplicates
25
Lisi's Liver Counts no Filtering
26
Lisi's Counts 2 no Filtering
27
Finding Up and Down Regulation by T-test
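  • T test statistic for each gene:
  • $t = (\bar{x}_1 - \bar{x}_2) / s$, where $\bar{x}_1$
    is the average over replicates for a single gene in
    condition one, $\bar{x}_2$ is the average over
    replicates in condition two, and $s$ is the standard
    deviation of the gene over both conditions
  • where standard deviation over both conditions is
  • $s = \left[ \frac{1}{n_1} s_1^2 + \frac{1}{n_2} s_2^2 \right]^{1/2}$,
    with $s_1$ and $s_2$ the standard deviations of the
    gene in condition one and condition two, and $n_1$
    and $n_2$ the numbers of replicates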
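
As a minimal sketch, the statistic described on this slide can be computed with the Python standard library. The function name is hypothetical, and the use of the sample variance (n - 1 in the denominator) is an assumption, since the slide does not say which estimator was used.

```python
import math
from statistics import mean, variance

def t_statistic(cond1, cond2):
    """Difference of replicate means divided by sqrt(s1^2/n1 + s2^2/n2)."""
    n1, n2 = len(cond1), len(cond2)
    # variance() is the sample variance, i.e. the squared standard deviation
    s = math.sqrt(variance(cond1) / n1 + variance(cond2) / n2)
    return (mean(cond1) - mean(cond2)) / s
```

Each condition needs at least two replicates, and the replicates must not all be identical in both conditions, otherwise the denominator is zero.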

28
More on T-test
  • 2-sample T-test, determines if two population
    means are equal
  • Paired or unpaired
  • Paired when samples are dependent
  • Degrees of freedom:
    $\frac{(s_1^2/m + s_2^2/n)^2}{(s_1^2/m)^2/(m-1) + (s_2^2/n)^2/(n-1)}$,
    round down to nearest integer
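
The degrees-of-freedom formula for the unpaired, unequal-variance case can be checked numerically with a short sketch. The function name is hypothetical; `m` and `n` are the two sample sizes, as on the slide.

```python
import math
from statistics import variance

def degrees_of_freedom(x, y):
    """Degrees of freedom for the 2-sample t-test, rounded down."""
    m, n = len(x), len(y)
    a = variance(x) / m  # s_1^2 / m
    b = variance(y) / n  # s_2^2 / n
    df = (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))
    return math.floor(df)
```

For samples `[1, 2, 3, 4]` and `[2, 4, 6]` the unrounded value is about 3.23, so the function returns 3.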

29
Correlation between Log2 ratio and T-test
  • High correlation between Log2 ratio and T-value
    for TK6, TK6-MGMT, and MT1 for Ivan's data sets
  • Example of T-value on y-axis, Log2 ratio on
    x-axis for TK6-24 hours

30
Correlation between Log2 cutoff and T-value
0.25 < p < 0.45
Need to find p-values corresponding to these
cutoff t-values
31
T-test for Liver MGMT
T-test versus Log2 ratio for Liver MGMT Untreated
32
Correlation between Log2 cutoff and T-value
0.25 < p < 0.45
33
T-test cutoff procedure
  • Normality method
  • Empirical method - Shuffle labels of conditions
    and find empirical t-distribution
  • Find the areas of the tails in order to find the
    t-cutoff for a specific p-value threshold
  • Find the p-values for log 2 ratios of +/- 1
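
The empirical method can be sketched as a label-shuffling (permutation) procedure. Everything here is an illustrative assumption: the function names, the number of shuffles, the two-sided treatment of the tails, and the unequal-variance t statistic are not taken from the slides.

```python
import math
import random
from statistics import mean, variance

def t_stat(a, b):
    """Unequal-variance t statistic; 0.0 if both groups are constant."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def empirical_null(cond1, cond2, n_shuffles=1000, seed=0):
    """Sorted |t| values obtained by shuffling the condition labels."""
    rng = random.Random(seed)
    pooled = list(cond1) + list(cond2)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        null.append(abs(t_stat(pooled[:len(cond1)], pooled[len(cond1):])))
    return sorted(null)

def t_cutoff(null, p):
    """Approximate |t| at the (1 - p) quantile of the shuffled distribution."""
    index = min(len(null) - 1, int((1 - p) * len(null)))
    return null[index]

def empirical_p(null, t_observed):
    """Fraction of shuffled |t| values at least as large as the observed |t|."""
    return sum(1 for t in null if t >= abs(t_observed)) / len(null)
```

With only two or three replicates per condition the number of distinct label shuffles is small, so in practice the null distribution would probably have to be pooled over many genes.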

34
PCA background
  • Transforms a number of correlated variables into
    a smaller number of uncorrelated variables called
    principal components
  • Reduces dimensionality of the data set
  • Identification of underlying variables
  • First component accounts for as much variability
    as possible
  • Each succeeding accounts for as much as possible
    of the remaining variability

35
PCA algorithm and Example
  • Start with random vector x
  • Find its expectation
  • Form its covariance matrix
  • Find its eigenvalues and eigenvectors
  • Let A be the matrix of eigenvectors
  • Let A_k be the first K eigenvectors as rows
  • Then use the 2 transformations
  • $y = A_k (x - E[x])$
  • $x = A_k^T y + E[x]$
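
The steps above can be illustrated for the simplest case, k = 1, using only the standard library. This sketch swaps the full eigendecomposition for power iteration, which finds only the leading eigenvector; all function names are hypothetical.

```python
import math

def mean_vector(rows):
    """Sample estimate of E[x], one entry per variable."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance_matrix(rows, mu):
    """Sample covariance matrix of the rows."""
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    d = len(mu)
    return [[sum(c[i] * c[j] for c in centered) / (len(rows) - 1)
             for j in range(d)] for i in range(d)]

def top_eigenvector(cov, iterations=200):
    """Leading eigenvector by power iteration.

    Caveat: fails if the start vector is orthogonal to the leading
    eigenvector or if the covariance matrix is all zeros.
    """
    d = len(cov)
    v = [1.0 + 0.1 * i for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(x, mu, a1):
    """y = A_k (x - E[x]) for k = 1."""
    return sum(a * (xi - m) for a, xi, m in zip(a1, x, mu))

def reconstruct(y, mu, a1):
    """x is approximately A_k^T y + E[x] for k = 1."""
    return [a * y + m for a, m in zip(a1, mu)]
```

For data lying exactly on a line, such as `[[1, 2], [2, 4], [3, 6]]`, one component is enough and the reconstruction returns the original points.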

36
GO analysis Upregulated Aag Brain
37
GO analysis Downregulated Aag Brain
38
Database of Interacting Proteins (DIP) and
SIPnomes
  • Cross-referenced DIP database with Log2 ratios
  • Red below -1.0, Green greater than 1.0
  • Cross-referenced SIPnome with Log2 ratio
  • Need to filter by e-value (Later)
  • Too little data

39
Background on Clustering
  • Options
  • Hierarchical clustering
  • Ward's method
  • Single linkage
  • Complete linkage
  • UPGMA
  • WPGMA
  • Self-organizing maps
  • K-means clustering
  • Choose a clustering metric
  • Euclidean, Manhattan

40
Cluster Validation
  • Are the clusters real?
  • Internal (sub-sampling)
  • External validation (match to known categories)
  • Internal methods
  • Measure the similarity between two sets of
    clusters
  • Use label matrices: $L_{ij} = 1$ if i and j are in
    the same cluster, 0 otherwise
  • Compare the label matrices of the clusters found
    using all of the data with the clusters found
    using 80% of the data
  • High confidence in original clustering if there
    is high similarity between the label matrices
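
One way to make the label-matrix comparison concrete is to count the fraction of item pairs on which two clusterings agree. The slide does not name a similarity measure, so this choice (it is the Rand index) and the function names are assumptions.

```python
def label_matrix(labels):
    """L[i][j] = 1 if items i and j share a cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def label_matrix_similarity(labels_a, labels_b):
    """Fraction of pairs i < j whose label-matrix entries agree."""
    la, lb = label_matrix(labels_a), label_matrix(labels_b)
    n = len(labels_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 for i, j in pairs if la[i][j] == lb[i][j]) / len(pairs)
```

Both label lists must cover the same items in the same order, so the clustering of all the data has to be restricted to the items kept in the 80% sub-sample before comparing. The measure ignores how clusters are numbered: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` score 1.0.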

41
K-means algorithm and Example
  • Ask for the number of clusters, k
  • Randomly guess k centers of clusters
  • Each data point finds the closest center
  • Each center finds the centroid of the points it
    owns
  • Center jumps to the centroid
  • Repeat
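
The loop above can be written as a short, self-contained sketch. Assumptions not in the slides: the random guess picks k of the data points as starting centers, distance is squared Euclidean, and iteration stops when the centers no longer move. The function name and parameters are hypothetical.

```python
import random

def kmeans(points, k, iterations=100, seed=0, centers=None):
    """Return (centers, owned), where owned[i] lists the points of center i."""
    rng = random.Random(seed)
    centers = [list(c) for c in (centers or rng.sample(points, k))]
    owned = [[] for _ in range(k)]
    for _ in range(iterations):
        # Each data point finds the closest center
        owned = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            owned[nearest].append(p)
        # Each center jumps to the centroid of the points it owns
        new_centers = [
            [sum(col) / len(pts) for col in zip(*pts)] if pts else centers[i]
            for i, pts in enumerate(owned)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, owned
```

The result depends on the starting centers, so several runs with different seeds are usually compared.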

42
Hierarchical clustering and Example
  • Let each point be a cluster
  • Find the most similar pair of clusters through a
    cluster distance
  • Merge into a parent cluster
  • Repeat until all data merged into one cluster

43
Cluster similarity
  • Complete linkage: maximum distance between points
    in clusters
  • Single linkage: minimum distance between points
    in clusters
  • Average linkage: average distance between points
    in clusters
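
The merge loop of hierarchical clustering and the three linkage rules above fit in one small sketch. It assumes Euclidean distance between points and uses a brute-force search over cluster pairs, which is fine for illustration but slow for thousands of genes. The function names are hypothetical.

```python
import math

def cluster_distance(a, b, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, linkage="average"):
    """Merge the closest pair of clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[tuple(p)] for p in points]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cluster_distance(
            clusters[ij[0]], clusters[ij[1]], linkage))
        distance = cluster_distance(clusters[i], clusters[j], linkage)
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges
```

On the one-dimensional points 0, 1, and 5, the first merge joins 0 and 1 at distance 1 under every rule; the second merge happens at distance 4 (single), 5 (complete), or 4.5 (average).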

44
UPGMA
  • Unweighted Pair Group Method with Arithmetic
    Mean
  • Start with distance matrix for each pair of data
  • Find the smallest distance, and cluster these,
    the branching point is half the distance
  • Find a new distance matrix
  • Repeat the last two steps
  • UPGMA vs WPGMA
  • WPGMA: Weighted Pair Group Method with Arithmetic Mean
  • Difference is in the calculation of a new
    distance matrix
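
The difference in the distance-matrix update can be shown with a single hypothetical helper. When clusters A and B are merged, the distance from the merged cluster to any other cluster C is recomputed as follows.

```python
def merged_distance(d_ac, d_bc, size_a, size_b, method="upgma"):
    """Distance from the merged cluster (A, B) to another cluster C."""
    if method == "upgma":
        # Weighted by cluster size, so every original item counts equally
        return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
    if method == "wpgma":
        # Simple average, so the two merged clusters count equally
        return (d_ac + d_bc) / 2
    raise ValueError("unknown method: " + method)
```

For a cluster A of 3 items at distance 4 from C and a single item B at distance 10 from C, UPGMA gives 5.5 and WPGMA gives 7.0. The two methods agree whenever the merged clusters have the same size.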

45
Self-organizing Maps
46
Advantages and Disadvantages of each clustering
method
47
Clustering results Ivan
Used Ward's method. Note that similar experiments
cluster together. Green: down, red: up.
48
Clustering Lisi
49
Aag and MMS liver counts
Probeset IDs
50
Finding Hubs in pp networks
Mouse data hubs
Ivan's data: the hubs
51
Hub examples
52
Number of proteins (y) having x neighbors
Human
Mouse
53
SIPnome protein/protein, and protein/splice
variant connections
54
SIP and splice variants
  • For each up or down regulated splice variant
  • Find its parent protein, A
  • Find which proteins that protein A is connected
    to, call this set B
  • Find the splice variants of set B
  • If both splice variants are regulated, then
    success
  • Results: none found

55
LPE test
  • Two sample t-test
  • Large p-values
  • Large false positive rate
  • Assumes many replicates but we have only 3
  • Therefore use LPE test
  • LPE test
  • Local pooled error test
  • Add more
  • Independence? Problematic for time course data

56
Time Course Analysis
  • Correlation method
  • Edge detection method
  • Bayesian networks
  • Event method

57
Event method
  • Smooth the data
  • Events for each instant (falling, rising,
    constant)
  • Sequence alignment
  • Best match of two event strings taking time into
    account
  • Use global sequence alignment
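
A minimal sketch of the event method: convert each profile to a string of rising (R), falling (F), and constant (C) events, then score two strings with global sequence alignment (Needleman-Wunsch). The tolerance and the match, mismatch, and gap scores are assumptions, smoothing is left out, and this simple scoring does not take the time spacing between samples into account.

```python
def events(series, tolerance=0.0):
    """One letter per step: R (rising), F (falling), or C (constant)."""
    letters = []
    for previous, current in zip(series, series[1:]):
        change = current - previous
        if change > tolerance:
            letters.append("R")
        elif change < -tolerance:
            letters.append("F")
        else:
            letters.append("C")
    return "".join(letters)

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]
```

For example, `events([1, 2, 3, 3, 2])` gives `"RRCF"`, and aligning `"RRCF"` with `"RCF"` scores 2 (three matches and one gap).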

58
Sample Correlation Analysis (no time lag)
Top Gene Pair Correlation
Top Gene Pairs AntiCorrelation
59
Sample Time Series (No time lag)
60
Sample Time Series with Significant Fold Change
(no time lag)
61
Correlation method
  • Correlate two profiles with 6 hr time lag
  • Check all 144 million probeset pairs for
    correlation
  • Require 98% correlation
  • Require one time point for both series to be fold
    change > 1.7
  • 1006 matches out of 144 million pairs
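
The lagged correlation test for a single probeset pair might look like the sketch below. Assumptions: each series is given as fold change relative to a baseline, the lag is expressed in sampling steps (so a 6 hr lag is one step if samples are 6 hr apart), a decrease by the same factor also counts as a fold change, and the fold-change requirement may be met at a different time point in each series. The function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(leader, follower, lag_steps):
    """Correlate leader at time t with follower at time t + lag_steps."""
    if lag_steps > 0:
        leader, follower = leader[:-lag_steps], follower[lag_steps:]
    return pearson(leader, follower)

def has_fold_change(fold_series, min_fold=1.7):
    """True if any time point is up or down by more than min_fold."""
    return any(f > min_fold or f < 1 / min_fold for f in fold_series)

def is_match(leader, follower, lag_steps, min_corr=0.98, min_fold=1.7):
    """Apply the fold-change requirement, then the correlation threshold."""
    return (has_fold_change(leader, min_fold)
            and has_fold_change(follower, min_fold)
            and lagged_correlation(leader, follower, lag_steps) > min_corr)
```

Checking the cheap fold-change condition first avoids computing a correlation for most of the 144 million pairs.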

62
Sample Time Series Significant Fold change (with
Time lag)
63
(No Transcript)
64
Combining SIP and activators
  • For each gene pair in the SIPnome
  • Loop through each splice variant of geneA
  • Loop through each splice variant of geneB
  • Find the time lagged correlation of these two
    splice variants
  • If the correlation > 0.98 and both splice
    variants show 1.7 fold change then keep the
    SIPnome interaction
  • Results: none found

65
Pathway Mapping
  • Take GenMAPP pathways
  • Combine with microarray data
  • Color by up or down regulation

66
Combining pathways and Chip Data for 6 Aag liver
67
TGF Beta Signaling
68
Promoter Mapping
  • For each accession number, find 2500 bases
    upstream, and 50 downstream, use PromoSer
  • For each of these 4000 sequences of 2550 bp,
    identify possible TF binding sites using TFsearch
    in the forward strand
  • Find how many times Oct-1 (for example) occurs in
    random vs Aag promoters
  • Random samples use 100 samples of size 100
  • If Oct-1 occurs more than once in the 2500 bp
    upstream, count it as one

69
Promoter results
  • Oct-1, YY1, S8, C/EBpb, Oct-x, TATA, Sox-5,
    HNF-3a, C/EBpa are overrepresented for up
    regulated AagNull Brain, 2 std above random
  • Aag6MMSWTULIV has Oct-1 overrepresented by > 2
    std
  • AagBM has GR, NGFI-C, CRE-BP overrepresented > 2
    std
  • WT6MMSWTULIV has Oct-1 overrepresented > 2 std
  • All have p-values < 0.02
  • Found overrepresented by computing average and
    standard deviation of random sample and compared
    to observed sample
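
The comparison against random samples can be sketched as follows. Assumptions: each background promoter is reduced to a 0/1 flag (1 if the motif occurs at least once in it), the observed count comes from a promoter set of the same size as each random sample, and the function name and inputs are hypothetical.

```python
import random
from statistics import mean, stdev

def motif_overrepresentation(observed_count, background_flags,
                             sample_size=100, n_samples=100, seed=0):
    """Return (mean, std, z) of the observed count versus random samples."""
    rng = random.Random(seed)
    counts = [sum(rng.sample(background_flags, sample_size))
              for _ in range(n_samples)]
    average, spread = mean(counts), stdev(counts)
    if spread == 0:
        raise ValueError("random samples show no variation")
    return average, spread, (observed_count - average) / spread
```

A motif would be called overrepresented when the returned z value exceeds 2, matching the 2 std threshold used in these results.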

70
Promoter pair results
71
Map of DNA binding sites for one Aag gene
72
Distribution of TF binding sites for Aag
YY1
S8
Positions are distributed randomly?
73
Promoter Alignment
74
Promoter Part 2
75
Works read
  • Statistical Challenges in Functional Genomics,
    P. Sebastiani, E. Gussoni, I. Kohane, M. Ramoni,
    Statistical Science, Vol. 18, No. 1, 33-70
  • Mark Hickman's thesis, pgs. 1-41
  • Probability and Statistics for Engineering and
    the Sciences, J. Devore, Thomson Learning, 2004
  • A generalized likelihood ratio test to identify
    differentially expressed genes from microarray
    data, S. Wang, S. Ethier, Bioinformatics, Vol.
    20, no. 1, 2004, pgs. 100-104
  • Global snapshot of a protein interaction
    network: a percolation based approach, C. Chin,
    M. Samanta, Vol. 19, no. 18, 2003, pgs. 2413-2419
  • Essentiality and damage in metabolic networks,
    N. Lemke, F. Heredia, et al. Vol. 20, no. 1,
    2004, pgs. 115-119.
  • Self-Organizing Maps,
    http://davis.wpi.edu/matt/courses/soms

76
Works Read
  • Using Structured Self-Organizing Maps in
    News Integration Websites, I. Perelomov, A.
    Azcarraga, J. Tan, et al.
  • Inference of transcriptional regulation
    relationships from gene expression data, A.
    Kwon, H. Hoos, R. Ng, Bioinformatics, Vol. 19, no.
    8, 2003, pgs. 905-912
  • Modified nonparametric approaches to detecting
    differentially expressed genes in replicated
    microarray experiments, Y. Zhao and W. Pan,
    Bioinformatics, Vol. 19, no. 9, 2003, pgs.
    1046-1054
  • Metabolic pathways in three dimensions, I.
    Rojdestvenski, Bioinformatics, Vol. 19, no. 18,
    2003, pgs. 2436-2441
  • Local pooled error test for identifying
    differentially expressed genes with a small
    number of replicates
  • Linking gene expression data with patient
    survival times using partial least squares, P.
    Park, L. Tian, I. Kohane, Bioinformatics, Vol. 18,
    Suppl. 1, 2002, pgs. S120-S127

77
Acknowledgements
  • Professor Samson
  • Coworkers
  • Katherine
  • Lisi
  • Tom
  • Rebecca
  • Mark