Microarray and ChIP-seq Data Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Microarray and ChIP-seq Data Analysis

1
Microarray and ChIP-seq Data Analysis

Victor Jin
Department of Biomedical Informatics
Ohio State University

2
Microarray and microarray normalization
ChIP-seq and motif analysis
3

What is a microarray?
Affymetrix-like arrays: single channel
(background-green, foreground-red)
cDNA arrays: two channel (red, green, yellow)
Protein array
Tissue microarray

4
How does a microarray work?
5

Example: Affymetrix Data Files
Image file (.dat file)
Probe results file (.cel file)
Library file (.cdf, .gin files)
Results file (.chp file)

6
Microarray Software

dChip
Open source R
Bioconductor
BRB-ArrayTools (NCI Biometric Research Branch)
Matlab
GeneSpring
Affymetrix

7
Microarray Databases

Gene Expression Omnibus (GEO) database, NCBI
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
EMBL-EBI microarray database (ArrayExpress)
http://www.ebi.ac.uk/Databases/microarray.html
Stanford Microarray Database (SMD)
http://genome-www5.stanford.edu/
caArray sites
Other specialized, regional and aggregated
databases
http://psi081.ba.ars.usda.gov/SGMD/
http://www.oncomine.org/main/index.jsp
http://ihome.cuhk.edu.hk/b400559/arraysoft_public.html

8
RI plots of the Microarrays

RI (ratio-intensity) plot or MA plot
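As a small illustration of what an MA plot displays, the sketch below computes the two coordinates for each spot of a two-channel array: M, the log ratio of the channels, and A, their average log intensity. The function name and the toy intensities are made up for this example; they do not come from the slides.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired two-channel spot intensities."""
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            continue  # log is undefined for non-positive intensities
        m_vals.append(math.log2(r / g))        # M: log ratio (y axis)
        a_vals.append(0.5 * math.log2(r * g))  # A: mean log intensity (x axis)
    return m_vals, a_vals

# Hypothetical spot intensities, for illustration only
red = [1024.0, 512.0, 256.0, 2048.0]
green = [512.0, 512.0, 1024.0, 2048.0]
m, a = ma_values(red, green)
```

Plotting M against A gives the MA (RI) plot; spots with equal signal in both channels lie on M = 0.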

9
Box plot
Upper quartile
Median
Lower quartile
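The quantities a box plot draws can be computed directly. The following is a minimal sketch using Python's standard library; the helper name and the log-intensity values are hypothetical, and the "inclusive" quantile method is only one of several common conventions.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles and extremes that a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "lower_quartile": q1, "median": median,
            "upper_quartile": q3, "max": max(values)}

# Hypothetical log2 intensities from one array
log_intensities = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
summary = box_plot_summary(log_intensities)
```

Comparing these summaries across arrays is a quick way to see whether the chips need normalization.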
10

Why normalization?
Microarray data are highly noisy
Experimental design
Replication
Comparison

(McShane, NCI)
11

Normalization
Which genes to use for normalization
Housekeeping genes
Genes involved in essential activities of cell
maintenance and survival, but not in cell
function and proliferation
These genes will be similarly expressed in all
samples.
Difficult to identify; need to be confirmed
Affymetrix GeneChip provides a set of housekeeping
genes, selected from a large set of tests on
different tissues, that were found to have low
variability in those samples (but still no
guarantee).

Normalization
Which genes to use for normalization
Using all genes
Simplest approach: use all adequately expressed
genes for normalization
The assumption is that the majority of genes on
the array are housekeeping genes and the
proportion of overexpressed genes is similar to
that of the underexpressed genes.
If the genes on the chip are specially selected,
then this method will not work.

Normalization
Linear (global) normalization
Simplest but most consistent
Move the median to zero (slope 1 in scatter plot;
this only changes the intercept)
Use when the MA plot shows no clear nonlinearity
or slope
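A minimal sketch of global normalization, assuming the data are already log ratios (M values): subtract the array-wide median so that the median becomes zero. The function name and the example values are illustrative only.

```python
import statistics

def global_median_normalize(log_ratios):
    """Shift all log ratios by a single constant so their median is zero."""
    shift = statistics.median(log_ratios)
    return [m - shift for m in log_ratios]

# Hypothetical log ratios with an overall dye bias of about +1
m_values = [0.8, 1.1, 0.9, 1.0, 3.0, -0.5, 1.2]
normalized = global_median_normalize(m_values)
```

Because every spot is shifted by the same constant, differences between genes are unchanged; only the intercept moves.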

Normalization
Intensity-based (Lowess) normalization
Lowess fit
Overall magnitude of the spot intensity has an
impact on the relative intensity between the
channels.

(McShane, NCI)
15

Normalization
Intensity-based (Lowess) normalization
Straighten the Lowess fit line in the MA plot to a
horizontal line and move it to zero
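The idea of straightening the fit can be sketched as: estimate the intensity-dependent trend of M as a function of A, then subtract it. The code below uses a simplified tricube-weighted local linear fit without the robustness iterations of a full Lowess implementation, so it is a stand-in for illustration, not the procedure used in the slides. The function name, the `frac` parameter, and the data are assumptions.

```python
def local_linear_fit(x, y, frac=0.5):
    """Tricube-weighted local linear fit evaluated at every x (no robustness iterations)."""
    n = len(x)
    k = min(n, max(2, int(frac * n)))  # number of neighbours used in each local fit
    fitted = []
    for x0 in x:
        # k nearest neighbours of x0 and the bandwidth they span
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        h = max(abs(x[i] - x0) for i in nearest) or 1.0
        w = [(1 - (abs(x[i] - x0) / h) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical data: M depends only on intensity A (pure dye bias)
a_vals = [6 + 0.5 * i for i in range(20)]
m_vals = [0.2 * a - 1.5 for a in a_vals]
trend = local_linear_fit(a_vals, m_vals)
m_normalized = [m - t for m, t in zip(m_vals, trend)]  # flattened around zero
```

After subtraction the fitted curve becomes the horizontal line M = 0; any remaining deviation of a spot from zero is what is interpreted as differential expression.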

Normalization
Intensity-based (Lowess) normalization
Nonlinear
Gene-by-gene, could introduce bias
Use only when there is a compelling reason

(McShane, NCI)
17

Normalization
Location-based normalization
Background-subtracted ratios on the array may
vary in a predictable manner.
Sample uniformly across the chip
Nonlinear
Gene-by-gene, could introduce bias
Use only when there is a compelling reason

Normalization
Quantile normalization
Nonlinear
Same intensity distribution

After Lowess normalization
After quantile normalization
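A minimal sketch of quantile normalization, assuming each chip is a list of intensities for the same probes in the same order: the reference distribution is the mean of the sorted values across chips, and each value is replaced by the reference value of its rank. Ties are broken by position rather than averaged, which is a simplification; the names and data are illustrative.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one per array."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips
    reference = [sum(sc[k] for sc in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices, lowest to highest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by the reference value of its rank
        normalized.append(out)
    return normalized

# Hypothetical intensities for three chips and four probes
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

Every normalized chip now contains exactly the same set of values; only the assignment of values to probes differs, which is why each gene's expression level is changed individually.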
19

Normalization
Linear (global): the chips have equal median (or
mean) intensity
Intensity-based (Lowess): the chips have equal
medians (means) at all intensity values
Quantile: the chips have identical intensity
distributions
Quantile is the best in terms of normalizing the
data to a desired distribution; however, it also
changes the expression level of each gene individually
Avoid overfitting
Avoid bias

20
What is ChIP-Seq?

ChIP-Seq is a new frontier technology for analyzing
protein interactions with DNA.
ChIP-Seq
Combination of chromatin immunoprecipitation
(ChIP) with ultra-high-throughput massively
parallel sequencing
Allows mapping of protein-DNA interactions in vivo
on a genome scale

21
Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
22
Workflow of ChIP-Seq
23
Why ChIP-Seq?

Current microarray and ChIP-chip designs require
knowing the sequence of interest, such as a promoter,
enhancer, or RNA-coding domain.
Lower cost
Less work in ChIP-Seq
Higher accuracy
Alterations in transcription-factor binding in
response to environmental stimuli can be
evaluated for the entire genome in a single
experiment.

25
Sequencers

Solexa (Illumina)
1 GB of sequences in a single run
35 bases in length
454 Life Sciences (Roche Diagnostics)
25-50 MB of sequences in a single run
Up to 500 bases in length
SOLiD (Applied Biosystems)
6 GB of sequences in a single run
35 bases in length

26
Illumina Genome Analysis System
27
Sequencing
28
Sequencer Output
29
Sequence Files

10 million sequences per lane
500 MB files

30
Bioinformatics Challenges

Rapid mapping of these short sequence reads to
the reference genome
Visualize mapping results
Thousands of enriched regions
Peak analysis
Peak detection
Finding exact binding sites
Compare results of different experiments
Normalization
Statistical tests

31
Mapping of Short Oligonucleotides to the Reference Genome

Mapping Methods
Need to allow mismatches and gaps
SNP locations
Sequencing errors
Reading errors
Indexing and hashing
genome
oligonucleotide reads
Use of quality scores
Use of SNP knowledge
Performance
Partitioning the genome or sequence reads

32
Mapping Methods: Indexing the Genome

Fast sequence similarity search algorithms (like
BLAST)
Not specifically designed for mapping millions of
query sequences
Take a very long time
e.g. 2 days to map half a million sequences to a 70 MB
reference genome (using BLAST)
Indexing the genome is memory-expensive
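To make the genome-indexing idea concrete, here is a toy sketch: hash every k-mer of the genome to its positions, look up non-overlapping seeds from each read, and verify candidates by counting mismatches. It handles mismatches only (no gaps, no quality scores, forward strand only) and is not the algorithm of any tool named in these slides; the function names, the toy genome, and the parameters are assumptions.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def map_read(read, genome, index, k, max_mismatches=1):
    """Return sorted start positions where the read aligns with few mismatches (no gaps)."""
    hits = set()
    # Non-overlapping seeds: if len(read) >= (max_mismatches + 1) * k,
    # at least one seed must match exactly (pigeonhole principle).
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)

# Toy reference and a read copied from position 10 with one substitution
genome = "ACGTTGCAAGGCTTACGATCGGATCCATGCA"
index = build_kmer_index(genome, k=4)
positions = map_read("GCATACGATCGG", genome, index, k=4, max_mismatches=1)
```

The index holds one entry per genome position, which illustrates why indexing a large genome is memory-expensive.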

33
Mapping Methods: Indexing the Oligonucleotide Reads

ELAND (Cox, unpublished)
Efficient Large-Scale Alignment of Nucleotide
Databases (Solexa Ltd.)
SeqMap (Jiang, 2008)
Mapping massive amount of oligonucleotides to
the genome
RMAP (Smith, 2008)
Using quality scores and longer reads improves
accuracy of Solexa read mapping
MAQ (Li, 2008)
Mapping short DNA sequencing reads and calling
variants using mapping quality scores

34
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657 (2007)
35
Huang, 2008 (unpublished)
36
Peak Analysis

Finding Exact Binding Sites
Determining the exact binding sites from short
reads generated from ChIP-Seq experiments
SISSRs (Site Identification from Short Sequence
Reads) (Jothi, 2008)
MACS (Model-based Analysis of ChIP-Seq) (Zhang et
al., 2008)
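As a rough illustration of peak detection, the sketch below counts mapped read starts in fixed windows and flags windows whose count is unlikely under a uniform Poisson background. This is a deliberately simple stand-in and is neither the SISSRs nor the MACS algorithm; the window size, cutoff, function names, and simulated read positions are all assumptions.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def call_enriched_windows(read_starts, genome_length, window=100, p_cutoff=1e-3):
    """Return (start, end, count, p_value) for windows enriched over a uniform background."""
    counts = Counter(pos // window for pos in read_starts)
    n_windows = math.ceil(genome_length / window)
    lam = len(read_starts) / n_windows  # expected reads per window under uniformity
    peaks = []
    for w in sorted(counts):
        p = poisson_sf(counts[w], lam)
        if p < p_cutoff:
            peaks.append((w * window, (w + 1) * window, counts[w], p))
    return peaks

# Simulated data: one background read per 100 bp plus 20 extra reads near 4200
read_starts = list(range(50, 10000, 100)) + [4200 + 5 * i for i in range(20)]
peaks = call_enriched_windows(read_starts, genome_length=10000)
```

Real peak callers refine this idea, for example by modelling local background and by using the strand-specific shift of reads to narrow down the binding site.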

39
Transcription in higher eukaryotes

Gene Expression
Chromatin structure
Initiation of transcription
Processing of the transcript
Transport to the cytoplasm
mRNA translation
mRNA stability
Protein activity stability

40
Some common approaches

Cluster-first motif discovery
Cluster genes (by expression profile, annotation, ...)
to find potentially coregulated genes
Find overrepresented motifs in promoter
sequences of similar genes (algorithms: MEME,
Consensus, Gibbs sampler, AlignACE, ...)

(Spellman et al. 1998)
41
Training data: Features
[Figure labels: regulator expression, promoter sequence, label, feature vector]
42
What is PWM?

Transcription factor binding sites (TFBSs) are
usually slightly variable in their sequences.
A position weight matrix (PWM) specifies how
likely you are to see a given base at each
position of the motif.

43
PWM for ERE
Position frequency matrix (PFM) (also known as
raw count matrix)

acggcagggTGACCc
aGGGCAtcgTGACCc
cGGTCGccaGGACCt
tGGTCAggcTGGTCt
aGGTGGcccTGACCc
cTGTCCctcTGACCc
aGGCTAcgaTGACGt
.
.
.
cagggagtgTGACCc
gagcatgggTGACCa
aGGTCAtaacgattt
gGAACAgttTGACCc
cGGTGAcctTGACCc
gGGGCAaagTGACTg

Given N sequence fragments of fixed length, one
can assemble a position frequency matrix (number
of times a particular nucleotide appears at a
given position). A normalized PFM, in which each
column adds up to a total of one, is a matrix of
probabilities for observing each nucleotide at
each position.
Position weight matrix (PWM) (also known as
position-specific scoring matrix)
The PFM should be converted to log scale for
efficient computational analysis. To eliminate
zero values before log conversion, and to correct
for small samples of binding sites, a sampling
correction, known as pseudocounts, is added to
each cell of the PFM.
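The PFM-to-PWM conversion described above can be sketched as follows. The input uses positions 10 to 14 of the thirteen ERE sequences shown on this slide (the slide elides further sequences, so the counts are illustrative only). The pseudocount of 1, the uniform background of 0.25, and all function names are assumptions made for this example.

```python
import math

BASES = "ACGT"

def position_frequency_matrix(sites):
    """Count how often each base occurs at each position of equal-length sites."""
    pfm = {b: [0] * len(sites[0]) for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm

def position_weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Add pseudocounts, normalize each column, and take log2 odds versus background."""
    n_sites = sum(pfm[b][0] for b in BASES)
    total = n_sites + pseudocount * len(BASES)
    return {b: [math.log2(((c + pseudocount) / total) / background) for c in pfm[b]]
            for b in BASES}

def score(pwm, sequence):
    """Sum the per-position log-odds weights for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(sequence.upper()))

# Positions 10-14 of the sequences listed on the slide (illustrative subset)
sites = ["TGACC", "TGACC", "GGACC", "TGGTC", "TGACC", "TGACC", "TGACG",
         "TGACC", "TGACC", "cgatt", "TGACC", "TGACC", "TGACT"]
pfm = position_frequency_matrix(sites)
pwm = position_weight_matrix(pfm)
best = score(pwm, "TGACC")
```

Without the pseudocount, any base never observed at a position (such as A at the first position here) would give log2(0); with it, every cell of the PWM is finite and unseen bases simply receive a strongly negative weight.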
44
Examples for Motifs and PWMs in Yeast
Universal stress repressor motif
STRE element
45
Signaling networks in a cell
46
Example: oxygen sensing and regulatory network
47
Inferring regulatory networks from expression data and binding data
48
An ERα regulatory network in MCF7 cells
CCNL1
BRF1
49
http://motif.bmi.ohio-state.edu/ChIPMotifs/
50
http://motif.bmi.ohio-state.edu/HRTBLDb
