Identifying genes with high confidence from small samples

1
Identifying genes with high confidence from small samples
  • Veronica Vinciotti
  • Brunel University, London, UK
  • Joint work with Allan Tucker, Xiaohui Liu, Eleftherios Panteris, Paul Kellam

2
Microarray Experiment
DATA
3
Microarray Data
  • High-dimensional: number of genes on a chip
  • Small number of samples: number of experiments
  • B-cell dataset: 1987 genes, 26 arrays
  • Prostate dataset: 1410 genes, 112 arrays

4
Classification Problem
  • x = (x1, ..., xN): vector of gene-expression levels for one sample
  • c = 0 or 1: class, e.g. presence of virus or cancer
  • Estimate p(1 | x)
  • Assign x to class 1 if p(1 | x) exceeds a threshold, and 0 otherwise

5
Classifier
  • Too few observations -> choose simple model to
    reduce risk of overfitting
  • Naïve Bayes Classifier

Bayes Rule
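
The formula on this slide did not survive in the transcript, so the following is only a minimal sketch of a naïve Bayes classifier for two classes. It assumes Gaussian class-conditional densities for each gene and a decision threshold of 0.5; neither assumption is stated in the slides, and the function names and toy data are made up for illustration.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-gene Gaussian parameters for classes 0 and 1.

    Assumes both classes occur in y (an empty class would divide by zero).
    """
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(y)
        params = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            # Small variance floor (assumption) to avoid division by zero.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            params.append((mean, var))
        model[c] = (prior, params)
    return model

def log_gauss(v, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def posterior_class1(model, x):
    """Bayes rule with the naive independence assumption: returns p(1 | x)."""
    log_joint = {}
    for c, (prior, params) in model.items():
        log_joint[c] = math.log(prior) + sum(
            log_gauss(v, m, s2) for v, (m, s2) in zip(x, params))
    top = max(log_joint.values())  # subtract the max for numerical stability
    w = {c: math.exp(lj - top) for c, lj in log_joint.items()}
    return w[1] / (w[0] + w[1])

# Toy data (invented): gene 0 separates the classes, gene 1 does not.
toy_X = [[1.0, 5.0], [1.2, 4.8], [3.0, 5.1], [3.2, 4.9]]
toy_y = [0, 0, 1, 1]
nb_model = fit_gaussian_nb(toy_X, toy_y)
p_high = posterior_class1(nb_model, [3.1, 5.0])
p_low = posterior_class1(nb_model, [1.1, 5.0])
predicted = 1 if p_high > 0.5 else 0  # 0.5 threshold is an assumption
```

With so few arrays per dataset, the appeal of this model is that it needs only a mean and a variance per gene and class, which is in line with the slide's point about choosing a simple model.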
6
Feature Selection
  • Too many variables -> feature selection
  • Identify predictive genes to improve
    classification
  • Give useful insight to biologists
  • Choose subset of k genes that maximises a score
  • Wrapper approach: classification accuracy
  • Filter approach: score independent of the model
  • Our score: likelihood of the model, p(Data | Model)
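
A likelihood-based score for a gene subset could be sketched as follows. The slides do not give the exact form of the likelihood, so this sketch assumes Gaussian expression levels, with selected genes modelled separately per class and unselected genes modelled independently of the class. Under that assumption the log-likelihood of the data differs between subsets only through the per-gene gain computed below. The names `gauss_loglik` and `subset_score` are hypothetical.

```python
import math

def gauss_loglik(values):
    """Log-likelihood of values under a Gaussian fitted to them (variance floored)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n + 1e-6  # floor is an assumption
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
               for v in values)

def subset_score(X, y, genes):
    """Log-likelihood gain from linking the genes in `genes` to the class.

    Equals log p(Data | Model) up to a constant that does not depend on the
    subset, under the Gaussian assumptions stated above.
    """
    score = 0.0
    for j in genes:
        col = [row[j] for row in X]
        per_class = sum(
            gauss_loglik([v for v, c in zip(col, y) if c == label])
            for label in (0, 1))
        score += per_class - gauss_loglik(col)
    return score

# Toy data (invented): gene 0 is differentially expressed, gene 1 is not.
score_X = [[0.0, 1.0], [0.2, 3.0], [0.1, 2.0],
           [5.0, 1.1], [5.2, 2.9], [5.1, 2.1]]
score_y = [0, 0, 0, 1, 1, 1]
score_informative = subset_score(score_X, score_y, [0])
score_noise = subset_score(score_X, score_y, [1])
```

Because a fitted likelihood never decreases when more genes are linked to the class, a score of this kind is only meaningful for comparing subsets of the same size k.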

7
Optimization Method
  • We use Simulated Annealing (SA) to find the set
    of k genes that maximises p(D | M)
  • Global optimization method
  • Operators: randomly add, delete and swap links
  • Cooling temperature parameter to overcome local
    maxima
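
A minimal sketch of simulated annealing over gene subsets is given below. It is a simplification rather than the authors' implementation: only a swap move is used so that the subset size stays at k (the slide also lists add and delete operators), and the geometric cooling schedule, iteration count and starting temperature are assumed values. The `score` argument is any function that maps a set of gene indices to a number.

```python
import math
import random

def anneal_subset(score, n_genes, k, n_iter=2000, t_start=1.0,
                  cooling=0.995, seed=0):
    """Search for a k-gene subset with a high score. Requires k < n_genes."""
    rng = random.Random(seed)
    current = set(rng.sample(range(n_genes), k))
    current_score = score(current)
    best, best_score = set(current), current_score
    temp = t_start
    for _ in range(n_iter):
        # Swap move: replace one selected gene with one unselected gene.
        out_gene = rng.choice(sorted(current))
        in_gene = rng.choice([g for g in range(n_genes) if g not in current])
        candidate = (current - {out_gene}) | {in_gene}
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept a worse subset with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temp *= cooling  # geometric cooling (assumed schedule)
    return best, best_score

# Toy score (invented): number of selected genes that belong to a known target set.
target = {2, 7, 11}
best_set, best_value = anneal_subset(lambda s: len(s & target), n_genes=20, k=3)
```

Early on, the high temperature lets the search accept worse subsets and so move away from local maxima; as the temperature drops the search becomes effectively greedy. Different seeds can return different subsets on real data.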

8
Confidence on Predictive Genes
  • Different samples and different runs of SA might
    lead to different sets of genes
  • Need to identify genes robustly
  • Data perturbed using cross-validation (CV)
  • Stochastic SA repeated a number of times
  • Assign confidence to genes based upon the
    frequencies of genes being selected

9
Everything together: RSN(k) method
  • Input: number of features k
  • Randomly split the data into m samples
  • Take one sample out
  • Repeat SA feature selection 10 times on the
    remaining data
  • At each repeat, update the frequency of each feature
    and test classifier accuracy on the sample originally
    taken out
  • Repeat across the m samples
  • Output: m-fold CV error and confidence measures
    on each feature
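
The steps above can be sketched as a single loop. This is an illustrative outline, not the authors' code: `select_genes` and `classify` are hypothetical callables standing in for the SA feature selection and the classifier, and averaging the test error over all repeats is an assumption about how the CV error is aggregated. The toy stand-ins at the bottom (largest class-mean gap, nearest class mean) exist only to make the example runnable.

```python
import random
from collections import Counter

def rsn(X, y, k, m, select_genes, classify, n_repeats=10, seed=0):
    """Return (CV error, {gene: selection frequency}) for an RSN(k)-style run."""
    rng = random.Random(seed)
    order = list(range(len(y)))
    rng.shuffle(order)
    folds = [order[i::m] for i in range(m)]        # random split into m samples
    counts, errors, tested = Counter(), 0, 0
    for held_out in folds:                         # take one sample out
        train = [i for i in order if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for _ in range(n_repeats):                 # repeated stochastic selection
            genes = select_genes(X_tr, y_tr, k, rng)
            counts.update(genes)                   # update selection frequencies
            for i in held_out:                     # test on the held-out sample
                errors += int(classify(X_tr, y_tr, genes, X[i]) != y[i])
                tested += 1
    runs = m * n_repeats
    return errors / tested, {g: c / runs for g, c in counts.items()}

# Hypothetical stand-ins, used only to exercise the loop.
def class_means(X_tr, y_tr, j):
    return [sum(x[j] for x, c in zip(X_tr, y_tr) if c == label) /
            sum(1 for c in y_tr if c == label) for label in (0, 1)]

def top_k_by_mean_gap(X_tr, y_tr, k, rng):
    def gap(j):
        m0, m1 = class_means(X_tr, y_tr, j)
        return abs(m1 - m0)
    return sorted(range(len(X_tr[0])), key=gap, reverse=True)[:k]

def nearest_mean(X_tr, y_tr, genes, x):
    dist = [sum((x[j] - class_means(X_tr, y_tr, j)[label]) ** 2 for j in genes)
            for label in (0, 1)]
    return 0 if dist[0] <= dist[1] else 1

# Toy data (invented): only gene 0 differs between the classes.
rsn_X = [[0.0, 1.0, 2.0], [0.1, 2.0, 1.0], [0.2, 1.5, 2.5], [0.3, 2.5, 1.5],
         [5.0, 1.1, 2.1], [5.1, 2.1, 1.1], [5.2, 1.6, 2.4], [5.3, 2.4, 1.6]]
rsn_y = [0, 0, 0, 0, 1, 1, 1, 1]
cv_error, confidence = rsn(rsn_X, rsn_y, k=1, m=4,
                           select_genes=top_k_by_mean_gap, classify=nearest_mean)
```

The confidence of a gene is the fraction of the m × 10 selection runs in which it was chosen, so a gene selected in every run gets 1.0 and a gene never selected does not appear in the result.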

10
Effect of Model Complexity
11
Classification Accuracy
  • Leave-one-out on B-cell dataset
  • 10-fold CV on prostate dataset
  • k = 5

Method         B-cell   Prostate   Prostate 25
RSN(5)         68       96         91
SW-RSN(5)      73       95         83
SW-LinReg(5)   62       93         74
Naïve          58       56         49

12
Confidence Measure: Simulation
  • 1000 genes, 30 differentially expressed between
    two classes, 100 samples
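
A dataset with these dimensions could be generated along the following lines. The slide gives only the sizes, so everything else here is an assumption made for illustration: Gaussian noise with unit variance, a mean shift of 1.0 for the differentially expressed genes in class 1, and balanced classes.

```python
import random

def simulate(n_genes=1000, n_diff=30, n_samples=100, shift=1.0, seed=0):
    """Simulate expression data where n_diff genes differ between two classes."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]               # balanced classes (assumed)
    diff_genes = set(rng.sample(range(n_genes), n_diff))
    X = [[rng.gauss(shift if (c == 1 and j in diff_genes) else 0.0, 1.0)
          for j in range(n_genes)]
         for c in y]
    return X, y, diff_genes

sim_X, sim_y, sim_diff = simulate()
```

Since the truly differentially expressed genes are known in such a simulation, the selection frequencies produced by a feature selection method can be checked directly against `sim_diff`.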

13
Confidence Measure: Real Data
[Figure panels: PROSTATE, B-CELL]
14
Identified Genes
B-CELL                     PROSTATE
GenBank     Proportion     GenBank     Proportion
AK023995    0.862          AA055368    0.5
U15173      0.796          N64741      0.34
L21936      0.488          AA487560    0.33
D83785      0.454          W47179      0.27
BC014433    0.442          AA486727    0.26
U59309      0.277          AA455925    0.25
(47202)     0.25           H29252      0.25
Z14982      0.169          AA010110    0.24
BC016182    0.162          AA180237    0.23
U82130      0.146          AA443302    0.2
Z80783      0.131
BC009914    0.127
U77949      0.112

15
Conclusions
  • Microarray data are characterized by few
    observations and many variables
  • Simple models with few parameters perform best
  • Need to select predictive genes robustly
  • Proposed RSN successfully identifies genes of
    interest, paving the way for further biological
    analysis
  • Current work: explore a different score function
    for feature selection that incorporates the
    parameter k