Title: Notes and statistics on base level expression
1. Notes and statistics on base level expression
May 2009
Don Gilbert
Biology Dept., Indiana University, gilbertd_at_indiana.edu
2. 2007 Tile expression
DrosMel tiled by Affymetrix finds new genes (blue) and known genes (orange).
3. Precision improves, 2006-2009
Measuring expression over gene structures: Nimblegen (2008) has higher precision than Affy (2006/07), and RNA-Seq (2009) has higher precision than Nimblegen.
4. Microarray statistics for base-level expression?
5. Gene or Base expression?
- Base-level expression (tiles, RNA-seq): calculate like gene differential expression (DE)
- Per tile, per RNA-seq contig, or per base: treatment - control
- Combine for tiles over gene
- Independent (technically) observations, but biologically related
- Increase DF, power with longer gene
- How to combine?
- As independent replicates: gene > (tiles, technical, bio replicates)?
- As nested block: gene > tiles > replicates?
- As gene average: gene = mean(tiles) > replicates?
- Compare with gene-level stats
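A minimal sketch of two of the combining options above, assuming a plain pooled-variance two-sample t statistic (the slides do not say which test was used) and made-up log2 signal values. The function names and data are hypothetical; the nested-block option is left out because it needs a mixed model.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample pooled-variance t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical log2 signal for one gene: rows are replicates, columns are tiles.
treatment = [[8.1, 8.4, 7.9, 8.3], [8.0, 8.6, 8.1, 8.2], [8.3, 8.5, 8.0, 8.4]]
control = [[7.2, 7.6, 7.0, 7.5], [7.4, 7.5, 7.1, 7.3], [7.1, 7.7, 7.2, 7.4]]


def tiles_as_replicates(trt, con):
    """Treat every tile in every replicate as an independent observation."""
    return pooled_t([x for rep in trt for x in rep],
                    [x for rep in con for x in rep])


def gene_average(trt, con):
    """Average the tiles within each replicate, then test across replicates."""
    return pooled_t([mean(rep) for rep in trt], [mean(rep) for rep in con])


t_tiles, df_tiles = tiles_as_replicates(treatment, control)
t_gene, df_gene = gene_average(treatment, control)
print(f"tiles as replicates: t={t_tiles:.2f}, df={df_tiles}")
print(f"gene average:        t={t_gene:.2f}, df={df_gene}")
```

With 4 tiles and 3 replicates per group, the tiles-as-replicates test has 22 degrees of freedom and the gene-average test has 4, which is the DF gain the slide refers to. Tiles within a gene are biologically related, so the larger DF overstates the independent information.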
6. Gene or Base expression?
Base-level tests find expression better than the gene average.
Base-level sensitivity 42, gene-level sensitivity 38. Both have specificity 37.
Sensitivity = 1 - false rejection; Specificity = 1 - false discovery
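The two definitions above can be written out as a short sketch. The counts are invented for illustration and are not the values behind the slide. Note that "1 - false discovery" is taken here as TP / (TP + FP), which is more often called precision; that reading is an assumption.

```python
def sensitivity(true_pos, false_neg):
    """1 - false rejection rate: share of real signals that are detected."""
    return true_pos / (true_pos + false_neg)


def specificity_slide(true_pos, false_pos):
    """1 - false discovery rate, following the slide's definition."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts from scoring calls against a set of known genes.
tp, fn, fp = 30, 70, 10
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity_slide(tp, fp):.2f}")
```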
7. Gene or Base expression?
DE is consistent over the gene span even though average expression changes; a gene-level measure can miss this.
Expression over gene span, treatment (red) vs control (green), with 3 replicates.
8. Gene structures expression
9. Sequence normalizing?
Idea is to remove sequence (GC) effects on probe hybridization score.
TileScope: Royce TE, Rozowsky JS, and Gerstein MB (2007). Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics, 23, 988-997.
10. Sequence normalizing?
Sequence-normalizing also removes Exon/Intron signal!
Don't use it (TileScope's quantilenorm), or other sequence adjustments of expression, unless gene structure signals are included.
11. Intron-Exon Detection
Nimblegen and Solexa tile/base expression detect gene structure, on average, fairly well.
12. Intron-Exon Update
Newest RNA-Seq finds intron/exon very well (stranded RNA-Seq, modENCODE Gingeras lab, March 2009).
13. Differential expression
Gene end (3') has more expression, but constant differential over gene span, on average.
[Figure panels: example genes; exons; introns]
Green is treatment, red control. Line style shows 3 replicates of Daphnia tiled expression.
14. Diff. Expr. distributions
Introns show a null DE distribution; genes and TAR regions are wider. Use introns as baseline for statistics?
[Figure panels: Genes, Introns, TARs; experiments: Pred, Metal, Sex]
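One way the intron-baseline idea could be used, as a sketch only: treat intron DE values as an empirical null and score a gene's DE against it. The function name, the add-one correction, and the data are assumptions for illustration, not the method behind these slides.

```python
from bisect import bisect_left


def empirical_p(stat, null_abs_sorted):
    """Two-sided empirical p-value: share of null |DE| values >= |stat|.

    Uses an add-one correction so the p-value is never exactly zero.
    """
    n = len(null_abs_sorted)
    first_ge = bisect_left(null_abs_sorted, abs(stat))
    exceed = n - first_ge
    return (exceed + 1) / (n + 1)


# Hypothetical intron DE values (treatment - control), used as the null.
intron_de = [-0.3, 0.1, 0.2, -0.1, 0.05, -0.25, 0.15, 0.0, -0.05]
null_abs = sorted(abs(x) for x in intron_de)

for gene_de in (1.2, 0.2):
    print(gene_de, empirical_p(gene_de, null_abs))
```

The smallest reachable p-value is 1 / (n + 1), so a genome-wide intron set would be needed before this gives useful resolution.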
15. Multiple testing corrections
16. Multiple statistic tests
- Problem: perform 20,000 tests and p-values hit laws of chance. Pr = 0.05 can happen 1,000 times by chance (false discovery, FDR).
- DrosMel Affy line t-tests: 2,284,383 / 5,395,023 = 0.42 Sig
- Bonferroni (conservative): 0.03 Sig
- Benjamini Hochberg, p.adjust(p, "BH"): 0.35 Sig
- qvalue(p), distribution based: 0.41 Sig
- Storey JD and Tibshirani R, 2003. Statistical significance for genomewide studies. PNAS 100:9440-9445
- SAM permutation qvalue
- However, p.adjust is meant for 100s of tests, not millions
- DrosMel modENCODE case: 1900 pairwise Affy cell line (62 cells) DE comparisons x 14,000 genes = 26,600,000 t-tests
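A pure-Python sketch of the two simplest corrections named above, plus the test-count arithmetic. It follows the textbook definitions of Bonferroni and Benjamini-Hochberg adjusted p-values and is not a copy of the R p.adjust source; the p-values are invented. The pair count for 62 cell lines comes out at 1,891, which the slide presumably rounds to 1900.

```python
from math import comb


def bonferroni(pvals):
    """Bonferroni adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected chance hits at p = 0.05 over 20,000 tests.
expected_chance_hits = round(20_000 * 0.05)

# Pairwise comparisons among 62 cell lines, and the slide's total test count.
pairs_62 = comb(62, 2)
total_tests = 1900 * 14_000

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical
print(expected_chance_hits, pairs_62, total_tests)
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```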
17. Multiple DE tests: Daphnia
- Much different corrections for experiments on same genes
- Daphnia DE: 3 expts. (trt - con), 25,000 genes, 3 replicates
- Predator and Metal genes have low expression, important to detect
18. Multiple statistic tests
- Statisticians have turned p-value corrections into an industry, but they are really more of a band-aid than a solution
- What about false rejection (FRR, type II error)?
- Balance errors; false rejection may be more important
- Solution 1: test fewer, directed hypotheses
- Solution 2: measure error rate on knowns, e.g. prediction of known genes
- Solution 3: known null hypothesis, e.g. introns
http://www.bioconductor.org/workshops/2009/SeattleApr09/DiffExpr/
19. 1900 pairwise Affy cell line DE comparisons x 14,000 genes = 26,600,000 t-tests
20. Hypotheses of interest are fewer: 100s of cells x 14,000 genes = ~2 million tests
21. Summary
- Base-level expression (tiles, RNA-seq) measures gene expression better
- Balances sensitivity (false rejection) with specificity (false discovery)
- Base-level expression measures gene structures well
- On average, and precision is improving for individual genes.
- Multiple test corrections are needed but problematic
- False discovery corrections for millions of tests lead to false rejections.
- Determine empirical error rates where possible
22. End note
- Summary pages
- wfleabase.org/genome-summaries/tile-expression/
- insects.eugenes.org/species/data/dmel5/modencode/
- Genome expression maps
- insects.eugenes.org:8091/gbrowse/cgi-bin/gbrowse/drosmelme/
- expression in 52 cell lines (Affy) and more precise Solexa, Nimblegen for a few cell lines
- insects.eugenes.org:8091/gbrowse/cgi-bin/gbrowse/daphnia_pulex8/
- expression among 4 treatment groups (sex, metal stress, biotic predator), Nimblegen
23. Differential expression
Gene models miss much expression.
Known sex genes capture DE, but unknown regions capture environmental stress expression, in Daphnia.