Microarray and ChIP-seq Data Analysis

1
Microarray and ChIP-seq Data Analysis
  • Victor Jin
  • Department of Biomedical Informatics
  • Ohio State University

2
  • Microarray and microarray normalization
  • ChIP-seq and motif analysis
3
  • What is a microarray?
  • Affymetrix-like arrays: single channel
    (background green, foreground red)
  • cDNA arrays: two channel (red, green, yellow)
  • Protein array
  • Tissue microarray

4
How does a microarray work?
5
  • Example Affymetrix Data Files
  • Image file (.dat file)
  • Probe results file (.cel file)
  • Library file (.cdf, .gin files)
  • Results file (.chp file)

6
Microarray Software
  • dChip
  • R (open source)
  • Bioconductor
  • BRB-ArrayTools (NCI Biometric Research Branch)
  • MATLAB
  • GeneSpring
  • Affymetrix

7
Microarray Databases
  • Gene Expression Omnibus (GEO) database, NCBI
  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
  • EMBL-EBI microarray database (ArrayExpress)
  • http://www.ebi.ac.uk/Databases/microarray.html
  • Stanford Microarray Database (SMD)
  • http://genome-www5.stanford.edu/
  • caARRAY sites
  • Other specialized, regional and aggregated
    databases
  • http://psi081.ba.ars.usda.gov/SGMD/
  • http://www.oncomine.org/main/index.jsp
  • http://ihome.cuhk.edu.hk/b400559/arraysoft_public.html

8
RI plots of the Microarrays
  • RI (ratio-intensity) plot or MA plot
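
As a minimal sketch of what an MA (RI) plot displays, the snippet below computes M (log ratio) and A (mean log intensity) for each spot of a two-channel array. The function name and the intensity values are hypothetical, and the intensities are assumed to be background-corrected and strictly positive.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired two-channel spot intensities."""
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        m_vals.append(math.log2(r / g))        # M: log2 ratio of the channels
        a_vals.append(0.5 * math.log2(r * g))  # A: average log2 intensity
    return m_vals, a_vals

# Hypothetical intensities for four spots (illustration only).
red = [1024.0, 512.0, 256.0, 64.0]
green = [256.0, 512.0, 1024.0, 64.0]
M, A = ma_values(red, green)
```

Plotting M against A (with any plotting tool) gives the RI/MA plot; spots with no differential signal sit near M = 0.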

9
Box plot
Upper quartile
Median
Lower quartile
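
The three values a box plot marks can be computed directly. This is a small sketch using Python's standard library; the helper name and the sample values are hypothetical, and the "inclusive" quartile convention is an assumption (plotting packages differ slightly in how they interpolate).

```python
import statistics

def box_stats(values):
    """Quartiles of one array's values, as drawn by a box plot."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"lower_quartile": q1, "median": med, "upper_quartile": q3}

# Hypothetical log-intensities from one array (illustration only).
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Comparing these summaries across arrays is a quick way to see whether the arrays share a common scale before normalization.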
10
  • Why normalization?
  • Microarray data are highly noisy
  • Experimental design
  • Replication
  • Comparison

(McShane, NCI)
11
  • Normalization
  • Which genes to use for normalization
  • Housekeeping genes
  • Genes involved in essential activities of cell
    maintenance and survival, but not in cell
    function and proliferation
  • These genes will be similarly expressed in all
    samples.
  • Difficult to identify; need to be confirmed
  • Affymetrix GeneChip provides a set of
    housekeeping genes, chosen from a large set of
    tests on different tissues, that were found to
    have low variability in these samples (but still
    no guarantee).

12
  • Normalization
  • Which genes to use for normalization
  • Using all genes
  • Simplest approach: use all adequately expressed
    genes for normalization
  • The assumption is that the majority of genes on
    the array are housekeeping genes and the
    proportion of overexpressed genes is similar to
    that of the underexpressed genes.
  • If the genes on the chip are specially selected,
    then this method will not work.

13
  • Normalization
  • Linear (global) normalization
  • Simplest but most consistent
  • Move the median to zero (slope 1 in scatter plot;
    this only changes the intercept)
  • No clear nonlinearity or slope in MA plot
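
A minimal sketch of global normalization on log ratios: subtract the array's median M value from every spot so the median becomes zero. The function name and the data are hypothetical.

```python
import statistics

def median_center(m_values):
    """Shift all log ratios by a constant so their median is zero."""
    shift = statistics.median(m_values)
    return [m - shift for m in m_values]

# Hypothetical log ratios with a constant dye bias of about +0.6.
m_raw = [0.5, 0.7, 0.4, 2.5, 0.6]
m_norm = median_center(m_raw)
```

Because every spot moves by the same amount, differences between spots are unchanged; only the intercept moves.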

14
  • Normalization
  • Intensity-based (Lowess) normalization
  • Lowess fit
  • Overall magnitude of the spot intensity has an
    impact on the relative intensity between the
    channels.

(McShane, NCI)
15
  • Normalization
  • Intensity-based (Lowess) normalization
  • Straighten the Lowess fit line in MA plot to
    horizontal line and move it to zero
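
The idea of straightening the fit line can be sketched as: fit a smooth trend of M against A, then subtract the fitted value from each spot. The code below is a simplified stand-in for Lowess (local linear fits with tricube weights, no robustness iterations), written from scratch for illustration; the function names, the `frac` default, and the data are assumptions, and real analyses would normally use an established implementation.

```python
def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit (tricube weights, no robustness steps)."""
    n = len(x)
    k = max(2, int(round(frac * n)))          # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(a_vals, m_vals, frac=0.5):
    """Subtract the intensity-dependent trend so the fit line lies on M = 0."""
    trend = lowess_fit(a_vals, m_vals, frac)
    return [m - t for m, t in zip(m_vals, trend)]

# Hypothetical array whose log ratio drifts linearly with intensity.
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
m_vals = [0.2 * a - 1.0 for a in a_vals]
m_norm = lowess_normalize(a_vals, m_vals)   # residuals are numerically zero
```

With real data the trend is curved and noisy, so the residuals keep the gene-specific signal while the intensity-dependent bias is removed.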

16
  • Normalization
  • Intensity-based (Lowess) normalization
  • Nonlinear
  • Gene-by-gene, could introduce bias
  • Use only when there is a compelling reason

(McShane, NCI)
17
  • Normalization
  • Location-based normalization
  • Background-subtracted ratios on the array may
    vary in a predictable manner.
  • Sample uniformly across the chip
  • Nonlinear
  • Gene-by-gene, could introduce bias
  • Use only when there is a compelling reason

18
  • Normalization
  • Quantile normalization
  • Nonlinear
  • Same intensity distribution

After Lowess normalization
After quantile normalization
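
A minimal sketch of quantile normalization: sort each chip, average the values at each rank across chips to get a reference distribution, then give every spot the reference value for its rank. The function name and the chip values are hypothetical, and ties are simply broken by position, which is a simplification.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length intensity lists, one per chip."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(col[k] for col in sorted_cols) / len(arrays)
                 for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, low to high
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]                # same rank, shared value
        normalized.append(out)
    return normalized

# Three hypothetical chips measuring the same four genes.
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After this step every chip contains exactly the same set of values, so only the rank order of genes within a chip is preserved.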
19
  • Normalization
  • Linear (global): the chips have equal median (or
    mean) intensity
  • Intensity-based (Lowess): the chips have equal
    medians (means) at all intensity values
  • Quantile: the chips have identical intensity
    distributions
  • Quantile is the best in terms of normalizing the
    data to a desired distribution; however, it also
    changes the gene expression levels individually
  • Avoid overfitting
  • Avoid bias

20
What is ChIP-Seq?
  • ChIP-Seq is a new frontier technology to analyze
    protein interactions with DNA.
  • ChIP-Seq
  • Combination of chromatin immunoprecipitation
    (ChIP) with ultra high-throughput massively
    parallel sequencing
  • Allows mapping of protein-DNA interactions in vivo
    on a genome scale

21
Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
22
Workflow of ChIP-Seq
23
Why ChIP-Seq?
  • Current microarray and ChIP-chip designs require
    knowing the sequence of interest, such as a
    promoter, enhancer, or RNA-coding domain.
  • Lower cost
  • Less work in ChIP-Seq
  • Higher accuracy
  • Alterations in transcription-factor binding in
    response to environmental stimuli can be
    evaluated for the entire genome in a single
    experiment.

24
(No Transcript)
25
Sequencers
  • Solexa (Illumina)
  • 1 GB of sequences in a single run
  • 35 bases in length
  • 454 Life Sciences (Roche Diagnostics)
  • 25-50 MB of sequences in a single run
  • Up to 500 bases in length
  • SOLiD (Applied Biosystems)
  • 6 GB of sequences in a single run
  • 35 bases in length

26
Illumina Genome Analysis System
27
Sequencing
28
Sequencer Output
29
Sequence Files
  • 10 million sequences per lane
  • 500 MB files

30
Bioinformatics Challenges
  • Rapid mapping of these short sequence reads to
    the reference genome
  • Visualize mapping results
  • Thousands of enriched regions
  • Peak analysis
  • Peak detection
  • Finding exact binding sites
  • Compare results of different experiments
  • Normalization
  • Statistical tests

31
Mapping of Short Oligonucleotides to the
Reference Genome
  • Mapping Methods
  • Need to allow mismatches and gaps
  • SNP locations
  • Sequencing errors
  • Reading errors
  • Indexing and hashing
  • genome
  • oligonucleotide reads
  • Use of quality scores
  • Use of SNP knowledge
  • Performance
  • Partitioning the genome or sequence reads
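
The indexing-and-hashing idea with a mismatch allowance can be sketched as a seed-and-verify mapper: hash every k-mer of the genome, look up non-overlapping seeds from each read, and verify candidates by counting mismatches. This is an illustrative toy, not the algorithm of any particular tool; all names and sequences are hypothetical, it handles only the forward strand, and it ignores gaps, quality scores, and SNP knowledge. It assumes the read is at least `k * (max_mismatches + 1)` bases long.

```python
from collections import defaultdict

def build_index(genome, k):
    """Hash table from each k-mer to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def map_read(read, genome, index, k, max_mismatches=1):
    """Return sorted (position, mismatches) hits on the forward strand."""
    hits = {}
    # With at most max_mismatches errors, at least one of the
    # max_mismatches + 1 non-overlapping seeds matches exactly.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            d = hamming(read, genome[start:start + len(read)])
            if d <= max_mismatches:
                hits[start] = d
    return sorted(hits.items())

# Hypothetical 29-base "genome" and 12-base reads.
genome = "ACGTTGCATGCCGATAGCTTAGGCATCGA"
index = build_index(genome, 6)
exact_hit = map_read("TGCCGATAGCTT", genome, index, 6)
one_mismatch_hit = map_read("TGACGATAGCTT", genome, index, 6)
```

Indexing the genome costs memory proportional to genome size, which is why the choice between indexing the genome and indexing the reads matters at scale.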

32
Mapping Methods: Indexing the Genome
  • Fast sequence similarity search algorithms (like
    BLAST)
  • Not specifically designed for mapping millions of
    query sequences
  • Take a very long time
  • e.g., 2 days to map half a million sequences to a
    70 MB reference genome (using BLAST)
  • Indexing the genome is memory expensive

33
Mapping Methods: Indexing the Oligonucleotide Reads
  • ELAND (Cox, unpublished)
  • Efficient Large-Scale Alignment of Nucleotide
    Databases (Solexa Ltd.)
  • SeqMap (Jiang, 2008)
  • Mapping massive amount of oligonucleotides to
    the genome
  • RMAP (Smith, 2008)
  • Using quality scores and longer reads improves
    accuracy of Solexa read mapping
  • MAQ (Li, 2008)
  • Mapping short DNA sequencing reads and calling
    variants using mapping quality scores

34
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657 (2007)
35
Huang, 2008 (unpublished)
36
Peak Analysis
  • Finding Exact Binding Site
  • Determining the exact binding sites from short
    reads generated from ChIP-Seq experiments
  • SISSRs (Site Identification from Short Sequence
    Reads) (Jothi 2008)
  • MACS (Model-based Analysis of ChIP-Seq) (Zhang et
    al, 2008)
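
To make the notion of an enriched region concrete, here is a deliberately naive sketch that counts mapped read starts in fixed-width windows and reports windows above a threshold. It is not how SISSRs or MACS work (both model read direction and background far more carefully); the function name, window size, threshold, and read positions are all hypothetical.

```python
from collections import Counter

def call_enriched_windows(read_starts, window=200, min_reads=5):
    """Return (start, end, count) for windows with at least min_reads reads."""
    counts = Counter(pos // window for pos in read_starts)
    return sorted((w * window, (w + 1) * window, c)
                  for w, c in counts.items() if c >= min_reads)

# Hypothetical mapped read start positions on one chromosome.
reads = [1010, 1025, 1050, 1100, 1130, 1180, 5200, 9800, 9850]
peaks = call_enriched_windows(reads)
```

A real peak caller would also compare against a control sample and assign a statistical significance to each region.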

37
(No Transcript)
38
(No Transcript)
39
Transcription in higher eukaryotes
  • Gene Expression
  • Chromatin structure
  • Initiation of transcription
  • Processing of the transcript
  • Transport to the cytoplasm
  • mRNA translation
  • mRNA stability
  • Protein activity and stability

40
Some common approaches
  • Cluster-first motif discovery
  • Cluster genes by expression profile or annotation
    to find potentially coregulated genes
  • Find overrepresented motifs in promoter
    sequences of similar genes (algorithms: MEME,
    Consensus, Gibbs sampler, AlignACE, ...)

(Spellman et al. 1998)
41
Training data: Features
[Figure labels: regulator expression, promoter sequence, label, feature vector]
42
What is PWM?
  • Transcription factor binding sites (TFBSs) are
    usually slightly variable in their sequences.
  • A position weight matrix (PWM) specifies the
    probability that you will see a given base at
    each index position of the motif.

43
PWM for ERE
Position frequency matrix (PFM) (also known as
raw count matrix)
  • acggcagggTGACCc
  • aGGGCAtcgTGACCc
  • cGGTCGccaGGACCt
  • tGGTCAggcTGGTCt
  • aGGTGGcccTGACCc
  • cTGTCCctcTGACCc
  • aGGCTAcgaTGACGt
  • ...
  • cagggagtgTGACCc
  • gagcatgggTGACCa
  • aGGTCAtaacgattt
  • gGAACAgttTGACCc
  • cGGTGAcctTGACCc
  • gGGGCAaagTGACTg

Given N sequence fragments of fixed length, one
can assemble a position frequency matrix (the number
of times a particular nucleotide appears at a
given position). A normalized PFM, in which each
column adds up to a total of one, is a matrix of
probabilities for observing each nucleotide at
each position.

Position weight matrix (PWM) (also known as
position-specific scoring matrix)

The PFM should be converted to log scale for
efficient computational analysis. To eliminate
null values before log conversion, and to correct
for small samples of binding sites, a sampling
correction, known as pseudocounts, is added to
each cell of the PFM.
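
The PFM-to-PWM conversion described above can be sketched as follows. The seven sites are the 5-base half-sites at positions 10 to 14 of the first seven sequences listed on this slide; the function names, the pseudocount of 1 per cell, the uniform background of 0.25, and the use of log2 odds are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def position_frequency_matrix(sites):
    """Count how often each base occurs at each position."""
    length = len(sites[0])
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm

def position_weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Log2 odds of pseudocount-corrected frequencies vs. background."""
    n_sites = sum(pfm[b][0] for b in BASES)
    total = n_sites + 4 * pseudocount       # each column gains 4 pseudocounts
    return {b: [math.log2((c + pseudocount) / total / background)
                for c in pfm[b]]
            for b in BASES}

def score(pwm, seq):
    """Sum of per-position weights for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(seq.upper()))

sites = ["TGACC", "TGACC", "GGACC", "TGGTC", "TGACC", "TGACC", "TGACG"]
pfm = position_frequency_matrix(sites)
pwm = position_weight_matrix(pfm)
consensus_score = score(pwm, "TGACC")
background_like_score = score(pwm, "AAAAA")
```

Without the pseudocount, any base never seen at a position would give log2(0); with it, such bases get a finite negative weight, and sequences close to the consensus score highest.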
44
Examples for Motifs and PWMs in Yeast
Universal stress repressor motif
STRE element
45
Signaling networks in a cell
46
Example: oxygen sensing and regulatory network
47
Inferring regulatory networks from the expression
data and binding data
48
An ERα regulatory network in MCF7 cells
CCNL1
BRF1
49
http://motif.bmi.ohio-state.edu/ChIPMotifs/
50
http://motif.bmi.ohio-state.edu/HRTBLDb