Differentially Expressed Genes, Class Discovery

Slides: 35
Provided by: csHu
1
Differentially Expressed Genes, Class Discovery & Classification
2
Finding Differentially Expressed Genes
  • Two types of motivation:
  • Direct:
  • Relate the genes to known biological functions,
    pathways, etc. Infer their role, the
    mechanisms governing the process, etc.
  • Indirect: use as a pruning stage for tools
    that perform learning tasks:
  • Infer regulatory mechanisms and relations
  • Classification (disease vs. normal, disease
    subtypes)

3
Example: Tumor vs. Normal Tissues
Normal samples
Tumor samples
  • Identify differentially expressed genes
  • Diagnostic Markers
  • Therapeutic targets
  • Understanding the disease process

Figure label: under-expressed
Data: non-small cell lung carcinomas (Sheba Medical Center, U. of Colorado Medical Center)
4
What We Need
  • Score the genes, hopefully in a meaningful way.
  • Attach a measure of statistical significance to
    the score so we can:
  • Choose a subset of genes wisely
  • Have a measure of how strong our signal is

5
Simplest Score: Fold Change
6
Fold Change: Problems
  • Not reliable at the low end of the scale
  • (0/0 effects, large variance)
  • Sensitive to outliers
  • Variant: pairwise fold change
  • Compute fold change over all possible sample
    pairs
  • If in, e.g., 75% of the pairs the change > D =>
    significant
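
A minimal sketch of the two fold-change scores, assuming "sample pairs" means one tumor sample paired with one normal sample and that all expression values are strictly positive. The function names and the numbers are hypothetical, chosen only to show how a single outlier inflates the mean-based score while the pairwise variant is less affected.

```python
from itertools import product

def fold_change(tumor, normal):
    """Ratio of mean expression: tumor over normal."""
    return (sum(tumor) / len(tumor)) / (sum(normal) / len(normal))

def pairwise_fold_change_fraction(tumor, normal, d):
    """Fraction of (tumor, normal) sample pairs whose ratio exceeds d."""
    pairs = list(product(tumor, normal))
    hits = sum(1 for t, n in pairs if t / n > d)
    return hits / len(pairs)

# Hypothetical expression values for one gene.
tumor = [8.0, 10.0, 12.0, 200.0]   # the last value is an outlier
normal = [4.0, 5.0, 5.0, 6.0]

print(fold_change(tumor, normal))                       # 11.5, driven by the outlier
print(pairwise_fold_change_fraction(tumor, normal, 2))  # 0.5: half of the pairs exceed 2
```

With a rule such as "significant if at least 75% of the pairs exceed D", this hypothetical gene would not pass at D = 2, even though its mean-based fold change is 11.5.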

7
Relevance Scores - TNoM
  • Beyond fold change
  • Both genes have >15-fold change
  • TNoM (Total Number of Misclassifications) score
  • Find the threshold that best separates tumors
    from normals,
  • count the number of errors committed there.

tumor
normal
8
Scoring Informative Genes
Expression pattern of a gene: a
Pathological diagnosis information (annotation): L
v(a,L): a vector of +s and -s, ordered by the a values
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
9
TNoM Score
Find the threshold that best separates tumors
from normals, count the number of errors
committed there.
Ex. 1 (label vector figure)
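
The threshold search can be sketched as follows. This is an illustrative implementation, not the one behind the slides: the function name `tnom` is hypothetical, it assumes distinct expression values (no ties), and it allows either side of the threshold to be the tumor side.

```python
def tnom(values, labels):
    """Fewest misclassifications over all thresholds and both orientations.

    values: expression levels of one gene, one per sample.
    labels: '+' or '-' for each sample, in the same order as values.
    """
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    n_pos = ordered.count('+')
    n_neg = len(ordered) - n_pos
    best = min(n_pos, n_neg)        # threshold placed below every sample
    pos_left = neg_left = 0
    for lab in ordered:             # move the threshold past one sample at a time
        if lab == '+':
            pos_left += 1
        else:
            neg_left += 1
        errors_a = pos_left + (n_neg - neg_left)   # left called '-', right called '+'
        errors_b = neg_left + (n_pos - pos_left)   # left called '+', right called '-'
        best = min(best, errors_a, errors_b)
    return best

# Hypothetical gene measured on six samples.
print(tnom([1, 2, 3, 4, 5, 6], ['-', '-', '+', '-', '+', '+']))  # 1
print(tnom([1, 2, 3, 4], ['-', '-', '+', '+']))                  # 0, perfect separation
```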
10
TNoM vs. Fold Change
11
TNoM
  • Cons
  • One-sided vs. two-sided errors
  • Absolute values ignored
  • For any given level s, we can efficiently
    compute p-Val(s) = Prob( TNoM(V) ≤ s ), where V
    is uniformly drawn over the appropriate space.
  • (H0: the gene expression values are independent
    of the labels)
  • Computed using dynamic programming (DP)
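
For small sample counts the p-value can be checked by brute force. The sketch below is not the dynamic program the slide refers to: it simply enumerates every label vector, assuming "the appropriate space" is the set of all vectors with the same number of + and - labels. The function names are hypothetical.

```python
from itertools import combinations
from math import comb

def tnom_ordered(labels):
    """TNoM of a label vector that is already ordered by expression value."""
    n_pos = labels.count('+')
    n_neg = len(labels) - n_pos
    best = min(n_pos, n_neg)
    pos_left = neg_left = 0
    for lab in labels:
        pos_left += lab == '+'
        neg_left += lab == '-'
        best = min(best,
                   pos_left + (n_neg - neg_left),   # left called '-', right called '+'
                   neg_left + (n_pos - pos_left))   # left called '+', right called '-'
    return best

def tnom_p_value(n_pos, n_neg, s):
    """Prob( TNoM(V) <= s ) for V uniform over vectors with n_pos '+' and n_neg '-'."""
    n = n_pos + n_neg
    hits = 0
    for pos_idx in combinations(range(n), n_pos):
        pos = set(pos_idx)
        labels = ['+' if i in pos else '-' for i in range(n)]
        if tnom_ordered(labels) <= s:
            hits += 1
    return hits / comb(n, n_pos)

# 15 samples, 8 labeled '+' and 7 labeled '-': 6435 label vectors to enumerate.
print(tnom_p_value(8, 7, 2))
```

Enumeration grows combinatorially with the number of samples, which is why a dynamic program is the practical choice for larger data sets.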

12
Wilcoxon Rank Test
  • Another gene score which, similarly to TNoM:
  • Ignores absolute values
  • Takes into account only the order of measurements
  • Sort the expression values of both groups
  • +  +  -  -  +  +  +  -  -  +   -   -   +   +   -
  • a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
  • W(g) = sum of ranks of the positive examples
  • W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58

13
Wilcoxon Rank Test
  • A common test in statistics
  • Again, we can compute p-values given the null
    hypothesis H0
  • P(W(g) > s | n, k): the probability of getting a
    score > s given a total of n samples, out of
    which k are labeled as (+).
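
A small sketch of the rank-sum score and an exact p-value by enumeration. The label vector is chosen so that the + samples occupy ranks 1, 2, 5, 6, 7, 10, 13 and 14, the ranks summed on the previous slide. The function names are hypothetical, ties are assumed absent, and H0 is taken to mean that every choice of k positive ranks out of n is equally likely.

```python
from itertools import combinations
from math import comb

def rank_sum(labels):
    """Sum of the 1-based ranks of '+' samples; labels are ordered by expression."""
    return sum(rank for rank, lab in enumerate(labels, start=1) if lab == '+')

def rank_sum_p_value(s, n, k):
    """P(W > s | n, k) when every set of k positive ranks is equally likely."""
    hits = sum(1 for ranks in combinations(range(1, n + 1), k) if sum(ranks) > s)
    return hits / comb(n, k)

labels = list("++--+++--+--++-")
w = rank_sum(labels)
print(w)                              # 58
print(rank_sum_p_value(w, 15, 8))     # exact one-sided p-value for this vector
```

A low rank sum can be just as informative as a high one, so in practice both tails are usually examined.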

14
SAM (Tusher et al., PNAS 01)
  • Where a = (1/n1 + 1/n2)/(n1 + n2 - 2)
  • d(i) is exactly the paired t-statistic
  • Tests the assumption: are the means of the two
    processes the same?
  • Underlying assumption: two normal distributions
  • A known p-value: the t-distribution
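
The formula for d(i) did not survive in this transcript, so the following is only a sketch of how the factor a is commonly used. It assumes the equal-variance two-sample form, d = (mean1 - mean2) / (s + s0), with s = sqrt(a * pooled sum of squares). The optional s0 is the small positive constant described by Tusher et al.; with s0 = 0 the result is the ordinary pooled t-statistic. The function name is hypothetical.

```python
from math import sqrt

def relative_difference(x1, x2, s0=0.0):
    """Two-sample statistic for one gene; x1 and x2 are the two groups' values."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    # Pooled sum of squared deviations from each group's own mean.
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = sqrt(a * ss)
    return (m1 - m2) / (s + s0)

# Hypothetical expression values for one gene in two groups.
print(relative_difference([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))          # about -3.674
print(relative_difference([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=0.5))  # shrunk toward 0
```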

15
SAM Alternative to P-Value
  • P-value relies on t-test assumptions, which is
    problematic
  • Can we assess the significance of d(i) without
    parametric assumptions?
  • Define a balanced permutation: a division of the
    samples into 2 groups, where in each group the
    number of + and - is balanced
  • Perform all possible balanced permutations p on
    the data and compute d(i) for each
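
One way to enumerate balanced permutations, assuming equal and even numbers of + and - samples so that each new group receives half of the original + samples and half of the original - samples. The function name is hypothetical, and the exact definition used in SAM may differ in detail.

```python
from itertools import combinations

def balanced_permutations(labels):
    """Yield (group1, group2) index lists, each holding half of the '+' and half of the '-'."""
    pos = [i for i, lab in enumerate(labels) if lab == '+']
    neg = [i for i, lab in enumerate(labels) if lab == '-']
    everyone = set(range(len(labels)))
    for p in combinations(pos, len(pos) // 2):
        for q in combinations(neg, len(neg) // 2):
            group1 = sorted(p + q)
            group2 = sorted(everyone - set(group1))
            yield group1, group2

labels = list("++++----")
splits = list(balanced_permutations(labels))
print(len(splits))   # 36 = C(4,2) * C(4,2)
print(splits[0])     # ([0, 1, 4, 5], [2, 3, 6, 7])
```

Each split appears twice with the two groups swapped, which only flips the sign of a difference-based statistic.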

16
False Discovery Rate for SAM
  • Genes with D above a given threshold:
    significant
  • FDR (false discovery rate): the % of genes
    passing as significant which are expected to be
    false positives
  • Each threshold on D(i) can be given an FDR value:
  • compute the avg. number of FP crossing this
    threshold in the permuted sets
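
A simplified sketch of the estimate described above: the average number of genes crossing the threshold in the permuted sets, divided by the number crossing it in the real data. This is a plain-threshold version under stated assumptions, not the full SAM procedure, and the function name and all numbers are hypothetical.

```python
def estimate_fdr(observed, permuted_sets, threshold):
    """Estimated fraction of called genes that are false positives.

    observed: statistic D(i) for each gene on the real labels.
    permuted_sets: one list of statistics per permutation of the labels.
    """
    called = sum(1 for d in observed if d > threshold)
    if called == 0:
        return 0.0
    false_counts = [sum(1 for d in perm if d > threshold) for perm in permuted_sets]
    avg_false = sum(false_counts) / len(false_counts)
    return min(1.0, avg_false / called)

observed = [5.1, 4.2, 3.9, 1.0, 0.5, 0.2]
permuted_sets = [
    [3.2, 1.0, 0.4, 0.9, 2.0, 0.1],
    [0.5, 1.5, 2.5, 0.3, 0.2, 1.1],
]
# 3 genes are called at threshold 3.0; the permuted sets average 0.5 crossings.
print(estimate_fdr(observed, permuted_sets, 3.0))
```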

17
Different Scores
  • TNoM
  • Info
  • Wilcoxon
  • t Test
  • Fold Change

Different scores and null hypotheses (parametric,
non-parametric, etc.). All can be found in the
ScoreGene package: http://www.cs.huji.ac.il/labs/compbio/scoregenes/
Can we assess which scoring method is the best
for our case?
18
Overabundance Analysis
  • Data on 30 samples from normal and tumor lung
    tissues.
  • 7000 genes.
  • Naftali Kaminski's lab, Sheba Medical Center

19
Why Test Overabundance?
  • Tests how informative a set of genes is w.r.t. a
    given classification of the data and a scoring
    method.
  • Can be used to compare different:
  • gene scoring methods
  • normalization methods

20
Comparing Normalization Methods
21
Why Test Overabundance?
  • But also a method to discover new classes in the
    data
  • Intuition: biologically meaningful partitions
    will have a high overabundance of informative
    genes

22
Overabundance Analysis in Class Discovery
AML/ALL
  • Score Genes
  • Count
  • Compare to random

BRCA1/2
Melanoma
23
Class Discovery Approach
Seek partitions with statistically significant
overabundance of informative genes
  • Use local search techniques, e.g.:
  • Steepest ascent
  • Simulated annealing
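
A generic steepest-ascent sketch over partitions, under several assumptions that are not in the slides: a partition is encoded as a 0/1 label per sample, a neighbor differs by flipping one sample, and `score` is any function that rates a partition (in this setting it would be an overabundance score). The toy score below is a hypothetical stand-in so that the example runs on its own.

```python
def steepest_ascent(labels, score):
    """Repeatedly apply the single-sample flip that improves the score the most."""
    current = list(labels)
    current_score = score(current)
    while True:
        best_flip, best_score = None, current_score
        for i in range(len(current)):
            current[i] = 1 - current[i]          # try flipping sample i
            candidate = score(current)
            if candidate > best_score:
                best_flip, best_score = i, candidate
            current[i] = 1 - current[i]          # undo the flip
        if best_flip is None:                    # local optimum reached
            return current, current_score
        current[best_flip] = 1 - current[best_flip]
        current_score = best_score

# Hypothetical toy score: agreement with a hidden "true" partition.
hidden = [1, 1, 1, 0, 0, 0]
toy_score = lambda labels: sum(a == b for a, b in zip(labels, hidden))

print(steepest_ascent([1, 0, 1, 0, 1, 0], toy_score))  # ([1, 1, 1, 0, 0, 0], 6)
```

Steepest ascent stops at the first local optimum. Simulated annealing differs by occasionally accepting a worse neighbor, which lets the search escape such optima.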

24
Scoring a Partition
  • At a given score level s, set p = p-Val(s).
  • Suppose that in the data we observe n(s) genes
    with score ≤ s.
  • The number of genes with score ≤ s we observe for
    uniformly and independently drawn labeling
    vectors is a random variable N(s), with N(s) ~
    Binom(n, p), where n is the total number of genes.
  • The surprise rate at s is defined as σ(s) =
    Prob( N(s) ≥ n(s) ) = Σ_{k=n(s)}^{n} (n
    choose k) p^k (1-p)^(n-k).
  • Finally, the max surprise score for the suggested
    partition is max_s σ(s)
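
The binomial tail can be computed directly for small n. In the sketch below the function names and all numbers are hypothetical. It treats the level with the smallest tail probability as the most surprising one; reading "max surprise" as the maximum of -log of the tail is an assumption, since the slide text shows no logarithm.

```python
from math import comb

def binomial_tail(n, p, k_min):
    """Prob( N >= k_min ) for N ~ Binom(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def most_surprising_level(n, levels):
    """levels: list of (s, p_val_of_s, observed_count). Returns (tail, s) with the smallest tail."""
    scored = [(binomial_tail(n, p, n_s), s) for s, p, n_s in levels]
    return min(scored)

# Hypothetical data: 100 genes, three score levels.
levels = [
    (0, 0.01, 4),    # p-Val(0) = 0.01, 4 genes observed with score <= 0
    (1, 0.05, 9),
    (2, 0.20, 22),
]
tail, s = most_surprising_level(100, levels)
print(s, tail)
```

For thousands of genes the direct sum overflows floating-point conversion of the binomial coefficients, so a log-space computation or a statistics library would be needed.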

25
Overabundance Max-Surprise
26
Example: Survival Prediction
Good Prognosis Patients
All Patients
27
Class 2
Good Prognosis Patients
All Patients
28
Class 3
Good Prognosis Patients
All Patients
29
Tissue Classification
  • Given a set of labeled samples, we can try to
    classify a new sample
  • Supervised methods: SVM, AdaBoost, Naïve Bayes
  • Semi-supervised methods: clustering
  • Issues:
  • Evaluating the methods
  • Feature Selection
  • Sample contamination/composition

30
Evaluating Classification
  • LOOCV: leave-one-out cross-validation
  • For all samples i = 1..M:
  • Take sample i out
  • Learn from M-1 remaining samples
  • Test on sample i
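
The loop above can be sketched as follows. The `train` and `predict` arguments stand for any classifier. The nearest-centroid learner on a single gene is a hypothetical toy, used only so that the example is self-contained.

```python
def loocv_accuracy(samples, labels, train, predict):
    """Fraction of samples classified correctly when each one is held out in turn."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # take sample i out
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)           # learn from the M-1 remaining samples
        if predict(model, samples[i]) == labels[i]:
            correct += 1                          # test on sample i
    return correct / len(samples)

def train_centroids(xs, ys):
    """Toy learner: mean expression per class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict_centroid(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

# Hypothetical expression of one gene in three normal and three tumor samples.
samples = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
labels = ['-', '-', '-', '+', '+', '+']
print(loocv_accuracy(samples, labels, train_centroids, predict_centroid))  # 1.0
```

Any gene selection step belongs inside the loop as part of training. Selecting genes on all M samples first lets the held-out sample influence the classifier and makes the estimate optimistic.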

31
Feature Selection
  • How many of the informative genes do we choose
    for our classifier?
  • A question of choosing a cutoff

32
Tissue Composition
Small cell lung carcinoma
Lung adenocarcinoma
Serous carcinoma
Lung metastasis
33
Tissue Composition
  • The tissue is composed of many cell types (tumor,
    blood, muscle, etc.)
  • The arrayed samples are not always pure!
  • Major difference: differentially expressed genes
    which are:
  • Causes of the disease state
  • Outcome of the disease state

34
Summary
  • Many methods for choosing differentially
    expressed genes
  • These can be compared, e.g. using overabundance
    tests
  • Overabundance can also be used for new class
    discovery
  • Expression patterns can be used to classify a
    tissue