Title: Microarray and ChIP-seq Data Analysis
1 Microarray and ChIP-seq Data Analysis
- Victor Jin
- Department of Biomedical Informatics
- Ohio State University
2 Microarray and microarray normalization; ChIP-seq
and motif analysis
3 What is a microarray?
- Affymetrix-like arrays: single channel
(background green, foreground red)
- cDNA arrays: two channel (red, green, yellow)
- Protein array
- Tissue microarray
4 How does a microarray work?
5 Example: Affymetrix Data Files
- Image file (.dat file)
- Probe results file (.cel file)
- Library file (.cdf, .gin files)
- Results file (.chp file)
6 Microarray Software
- dChip
- Open source R
- Bioconductor
- BRB-ArrayTools (NCI Biometric Research Branch)
- Matlab
- GeneSpring
- Affymetrix
7 Microarray Databases
- Gene Expression Omnibus (GEO) database, NCBI
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
- EMBL-EBI microarray database (ArrayExpress)
- http://www.ebi.ac.uk/Databases/microarray.html
- Stanford Microarray Database (SMD)
- http://genome-www5.stanford.edu/
- caArray sites
- Other specialized, regional and aggregated
databases
- http://psi081.ba.ars.usda.gov/SGMD/
- http://www.oncomine.org/main/index.jsp
- http://ihome.cuhk.edu.hk/b400559/arraysoft_public.html
8 RI plots of the Microarrays
- RI (ratio-intensity) plot or MA plot
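The two axes of an RI (MA) plot can be computed as in the minimal sketch below. It assumes two-channel spot intensities; the function name `ma_values` and the example numbers are illustrative and not taken from the slides. M is the log2 ratio of the channels and A is the mean log2 intensity.

```python
import math

def ma_values(red, green):
    """Return (M, A) pairs for paired two-channel spot intensities."""
    pairs = []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue  # log is undefined for non-positive intensities
        m = math.log2(r) - math.log2(g)          # M: log ratio
        a = 0.5 * (math.log2(r) + math.log2(g))  # A: mean log intensity
        pairs.append((m, a))
    return pairs

# Made-up intensities for three spots
red = [1024.0, 512.0, 64.0]
green = [256.0, 512.0, 256.0]
ma = ma_values(red, green)
```

Plotting M against A for every spot gives the RI plot; a cloud centered on M = 0 with no trend along A suggests little intensity-dependent bias.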
9 Box plot
- Upper quartile
- Median
- Lower quartile
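The quantities marked on a box plot can be computed with the Python standard library, as in this small sketch. The function name `box_plot_summary` and the values are made up for illustration, and quartile conventions differ between tools (this uses the default "exclusive" method of `statistics.quantiles`).

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the quantities a box plot draws for one array."""
    q1, med, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return {
        "min": min(values),
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "max": max(values),
        "iqr": q3 - q1,  # height of the box
    }

# Made-up log intensities for one array
log_intensities = [7.1, 8.0, 8.4, 9.0, 9.3, 10.2, 12.5]
summary = box_plot_summary(log_intensities)
```

Comparing these summaries side by side across arrays is one quick way to see whether the arrays need normalization.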
10 Why normalization?
- Microarray data is highly noisy
- Experimental design
- Replication
- Comparison
(McShane, NCI)
11 Normalization
- Which genes to use for normalization
- Housekeeping genes
- Genes involved in essential activities of cell
maintenance and survival, but not in cell
function and proliferation
- These genes will be similarly expressed in all
samples.
- Difficult to identify; need to be confirmed
- Affymetrix GeneChip provides a set of
housekeeping genes based on a large set of tests on
different tissues; they were found to have low
variability in these samples (but still no
guarantee).
12 Normalization
- Which genes to use for normalization
- Using all genes
- Simplest approach: use all adequately expressed
genes for normalization
- The assumption is that the majority of genes on
the array are housekeeping genes and the
proportion of overexpressed genes is similar to
that of the underexpressed genes.
- If the genes on the chip are specially selected,
then this method will not work.
13 Normalization
- Linear (global) normalization
- Simplest but most consistent
- Move the median to zero (slope 1 in scatter plot;
this only changes the intercept)
- No clear nonlinearity or slope in MA plot
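A minimal sketch of global normalization, assuming the data are already log ratios (M values): one constant, the median, is subtracted from every value, so the slope is untouched and only the intercept moves. The function name `median_center` and the numbers are illustrative.

```python
from statistics import median

def median_center(log_ratios):
    """Shift all log ratios by one constant so that their median is zero."""
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]

# Made-up log ratios from one array
log_ratios = [0.5, 1.0, 1.5, 2.0, 3.0]
centered = median_center(log_ratios)
```

Because every gene is shifted by the same amount, differences between genes on the array are preserved.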
14 Normalization
- Intensity-based (Lowess) normalization
- Lowess fit
- Overall magnitude of the spot intensity has an
impact on the relative intensity between the
channels.
(McShane, NCI)
15 Normalization
- Intensity-based (Lowess) normalization
- Straighten the Lowess fit line in MA plot to
horizontal line and move it to zero
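The idea of straightening the fit line can be sketched as follows. This is a simplified, non-robust lowess (local weighted linear fit with tricube weights) written with the standard library only; production tools add robustness iterations and faster neighbor search. The function names, the `frac` default, and the data are assumptions made for illustration.

```python
def lowess_fit(a, m, frac=0.5):
    """Local weighted linear fit of M on A, evaluated at every A value."""
    n = len(a)
    k = max(2, int(round(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in a:
        # distance to the k-th nearest point sets the window half-width
        h = sorted(abs(x - x0) for x in a)[k - 1] or 1e-12
        # tricube weights: 1 at x0, falling to 0 at distance h
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(a, m, frac=0.5):
    """Subtract the fitted curve so the trend in the MA plot lies on M = 0."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]

# Made-up data with an intensity-dependent bias M = 0.5 * A - 3
a_vals = [6 + 0.5 * i for i in range(20)]
m_vals = [0.5 * x - 3 for x in a_vals]
normalized = lowess_normalize(a_vals, m_vals)
```

In this toy case the bias is exactly linear, so the normalized M values are zero up to rounding error; with real data the fitted curve follows the bulk of the spots and the residuals carry the differential expression signal.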
16 Normalization
- Intensity-based (Lowess) normalization
- Nonlinear
- Gene-by-gene, could introduce bias
- Use only when there is a compelling reason
(McShane, NCI)
17 Normalization
- Location-based normalization
- Background subtracted ratios on the array may
vary in a predictable manner.
- Sample uniformly across the chip
- Nonlinear
- Gene-by-gene, could introduce bias
- Use only when there is a compelling reason
18 Normalization
- Quantile normalization
- Nonlinear
- Same intensity distribution
(Figure panels: after Lowess normalization; after quantile normalization)
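A minimal sketch of quantile normalization, assuming every chip measures the same probes in the same order. Each chip's values are replaced by the mean of the values with the same rank across chips, so all chips end up with the same intensity distribution. The function name and the numbers are illustrative, and ties are simply broken by input order here.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length intensity lists, one list per chip."""
    n = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # reference distribution: mean of the k-th smallest value across chips
    reference = [sum(col[k] for col in sorted_cols) / len(arrays)
                 for k in range(n)]
    normalized = []
    for col in arrays:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices, low to high
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # probe keeps its rank, takes the reference value
        normalized.append(out)
    return normalized

# Made-up intensities: three chips, four probes each
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized_chips = quantile_normalize(chips)
```

After this step the sorted values of every chip are identical, which is why the method can change individual expression values more than the linear or Lowess approaches.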
19 Normalization
- Linear (global): the chips have equal median (or
mean) intensity
- Intensity-based (Lowess): the chips have equal
medians (means) at all intensity values
- Quantile: the chips have identical intensity
distribution
- Quantile is the best in terms of normalizing the
data to a desired distribution; however, it also
changes the gene expression levels individually
- Avoid overfitting
- Avoid bias
20 What is ChIP-Seq?
- ChIP-Seq is a new frontier technology to analyze
protein interactions with DNA.
- ChIP-Seq
- Combination of chromatin immunoprecipitation
(ChIP) with ultra high-throughput massively
parallel sequencing
- Allows mapping of protein-DNA interactions in vivo
on a genome scale
21 Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
22 Workflow of ChIP-Seq
23 Why ChIP-Seq?
- Current microarray and ChIP-chip designs require
knowing the sequence of interest, such as a promoter,
enhancer, or RNA-coding domain.
- Lower cost
- Less work in ChIP-Seq
- Higher accuracy
- Alterations in transcription-factor binding in
response to environmental stimuli can be
evaluated for the entire genome in a single
experiment.
25 Sequencers
- Solexa (Illumina)
- 1 GB of sequences in a single run
- 35 bases in length
- 454 Life Sciences (Roche Diagnostics)
- 25-50 MB of sequences in a single run
- Up to 500 bases in length
- SOLiD (Applied Biosystems)
- 6 GB of sequences in a single run
- 35 bases in length
26 Illumina Genome Analysis System
27 Sequencing
28 Sequencer Output
29 Sequence Files
- 10 million sequences per lane
- 500 MB files
30 Bioinformatics Challenges
- Rapid mapping of these short sequence reads to
the reference genome
- Visualize mapping results
- Thousands of enriched regions
- Peak analysis
- Peak detection
- Finding exact binding sites
- Compare results of different experiments
- Normalization
- Statistical tests
31 Mapping of Short Oligonucleotides to the
Reference Genome
- Mapping Methods
- Need to allow mismatches and gaps
- SNP locations
- Sequencing errors
- Reading errors
- Indexing and hashing
- genome
- oligonucleotide reads
- Use of quality scores
- Use of SNP knowledge
- Performance
- Partitioning the genome or sequence reads
32 Mapping Methods: Indexing the Genome
- Fast sequence similarity search algorithms (like
BLAST)
- Not specifically designed for mapping millions of
query sequences
- Take very long time
- e.g. 2 days to map half a million sequences to a 70 MB
reference genome (using BLAST)
- Indexing the genome is memory expensive
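The indexing-and-hashing idea can be sketched as a toy seed-and-verify mapper. This is not the algorithm of any tool named on these slides: it hashes every k-mer of the genome, looks up non-overlapping seeds from the read, and verifies each candidate position by counting mismatches (no gaps). All names and sequences are made up for illustration. Every hit within the mismatch limit is found only when the read contains more seeds than allowed mismatches.

```python
from collections import defaultdict

def build_index(genome, k):
    """Hash table from each k-mer to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def map_read(read, genome, index, k, max_mismatches=1):
    """Return sorted (start, mismatches) hits for one read."""
    hits = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)

# Made-up reference and reads
genome = "ACGTTGCATGACCGTAGGCTAACGT"
index = build_index(genome, 4)
exact_hits = map_read("TGACCGTA", genome, index, 4)
one_error_hits = map_read("TGACCGAA", genome, index, 4)  # one substituted base
```

The index holds one entry per genome position, which illustrates why indexing a large genome is memory expensive and why some tools index the reads instead.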
33 Mapping Methods: Indexing the Oligonucleotide
Reads
- ELAND (Cox, unpublished)
- Efficient Large-Scale Alignment of Nucleotide
Databases (Solexa Ltd.)
- SeqMap (Jiang, 2008)
- Mapping massive amount of oligonucleotides to
the genome
- RMAP (Smith, 2008)
- Using quality scores and longer reads improves
accuracy of Solexa read mapping
- MAQ (Li, 2008)
- Mapping short DNA sequencing reads and calling
variants using mapping quality scores
34 Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
35 Huang, 2008 (unpublished)
36 Peak Analysis
- Finding Exact Binding Site
- Determining the exact binding sites from short
reads generated from ChIP-Seq experiments
- SISSRs (Site Identification from Short Sequence
Reads) (Jothi 2008)
- MACS (Model-based Analysis of ChIP-Seq) (Zhang et
al, 2008)
39 Transcription in higher eukaryotes
- Gene Expression
- Chromatin structure
- Initiation of transcription
- Processing of the transcript
- Transport to the cytoplasm
- mRNA translation
- mRNA stability
- Protein activity stability
40 Some common approaches
- Cluster-first motif discovery
- Cluster genes by expression profile, annotation,
etc. to find potentially coregulated genes
- Find overrepresented motifs in promoter
sequences of similar genes (algorithms: MEME,
Consensus, Gibbs sampler, AlignACE, etc.)
(Spellman et al. 1998)
41 Training data: Features
regulator expression
promoter sequence
label
feature vector
42 What is PWM?
- Transcription factor binding sites (TFBSs) are
usually slightly variable in their sequences.
- A positional weight matrix (PWM) specifies the
probability that you will see a given base at
each index position of the motif.
43 PWM for ERE
Position frequency matrix (PFM) (also known as
raw count matrix)
- acggcagggTGACCc
- aGGGCAtcgTGACCc
- cGGTCGccaGGACCt
- tGGTCAggcTGGTCt
- aGGTGGcccTGACCc
- cTGTCCctcTGACCc
- aGGCTAcgaTGACGt
- .
- .
- .
- cagggagtgTGACCc
- gagcatgggTGACCa
- aGGTCAtaacgattt
- gGAACAgttTGACCc
- cGGTGAcctTGACCc
- gGGGCAaagTGACTg
Given N sequence fragments of fixed length, one
can assemble a position frequency matrix (number
of times a particular nucleotide appears at a
given position). A normalized PFM, in which each
column adds up to a total of one, is a matrix of
probabilities for observing each nucleotide at
each position.
Position weight matrix (PWM) (also known as
position-specific scoring matrix)
PFM should be converted to log-scale for
efficient computational analysis. To eliminate
null values before log-conversion, and to correct
for small samples of binding sites, a sampling
correction, known as pseudocounts, is added to
each cell of the PFM.
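The PFM-to-PWM conversion described above can be sketched as follows. The seven 5-mers are the uppercase 3' blocks of the first seven sequences listed on this slide; the function names, the pseudocount of 1, and the uniform background of 0.25 per base are assumptions made for illustration. The log-odds form (log2 of probability over background) is one common way to put the matrix on a log scale.

```python
import math

BASES = "ACGT"

def position_frequency_matrix(sites):
    """Count how often each base occurs at each position."""
    length = len(sites[0])
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm

def position_weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Add pseudocounts, normalize each column, convert to log2 odds."""
    n = sum(pfm[b][0] for b in BASES)  # number of sites
    return {
        b: [math.log2(((c + pseudocount) / (n + 4 * pseudocount)) / background)
            for c in pfm[b]]
        for b in BASES
    }

def score(pwm, seq):
    """Sum of per-position weights for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(seq.upper()))

# Half-site blocks taken from the first seven sequences on the slide
sites = ["TGACC", "TGACC", "GGACC", "TGGTC", "TGACC", "TGACC", "TGACG"]
pfm = position_frequency_matrix(sites)
pwm = position_weight_matrix(pfm)
consensus_score = score(pwm, "TGACC")
random_score = score(pwm, "AAAAA")
```

The pseudocount keeps bases that were never observed at a position from producing log(0), so every candidate sequence receives a finite score.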
44 Examples of Motifs and PWMs in Yeast
Universal stress repressor motif
STRE element
45 Signaling networks in a cell
46 Example: oxygen sensing and regulatory network
47 Inferring regulatory networks from the expression
data and binding data
48 An ERα regulatory network in MCF7 cells
CCNL1
BRF1
49 http://motif.bmi.ohio-state.edu/ChIPMotifs/
50 http://motif.bmi-ohio-state.edu/HRTBLDb