1
Merck ACSM Intern Presentation
  • Xiangdong Wen

2
Projects
  • Quality control on high throughput screening
  • Active learning approach to this type of problem

3
Quality control on high throughput screening (HTS)
  • HTS quantitatively and qualitatively examines
    the interaction of synthetic molecules with
    proteins to identify target compounds for a
    potential drug.
  • HTS plays an important role in pharmaceutical
    research and drug discovery.
  • Technologies: HTRF, SPA, FLIPR, BLA.

4
FLIPR system: Fluorescent Imaging Plate Reader
  • Plate formats: 96, 384, 1536, 3456, or more wells.

5
384-well FLIPR Assays
  • 16 (rows) x 24 (columns) = 384 wells.
  • The first 2 and last 2 columns are controls.
  • 16 x (24 - 2 x 2) = 320 wells.
  • Two channels of data, top and bottom (or max and min),
    for the intensity value.
  • We use the difference top - bottom.

6
Data
  • Min: VGCCA1B_FLIPR_ANT_Block1_2904.txt
  • Max: VGCCA1B_FLIPR_ANT_Block1_2904.txt
  • 2904 plates in total.
  • Experts fail 406 plates and pass 2498 plates.

7
Quality control for the FLIPR system
  • Global quality control: the Median Polish method.
  • Local quality control: a set of rules,
    such as the patch finder.

8
Median Polish
  • µij = µ + µi + µj + eij
  • µij: the value at well (i, j).
  • µ: analog of the global mean; the fixed overall
    effect.
  • µi: analog of the ith row mean; the row effect.
  • µj: analog of the jth column mean; the column
    effect.
  • eij: the random error term.

9
Median Polish, cont.
  • Median polish yields 16 + 20 = 36 features for a
    384-well FLIPR assay (16 row effects and 20 column effects).
  • Do QC by computing these 36 features and testing
    whether they fall in some interval; if not, we
    fail the plate.
  • For now, global quality control uses these median
    polish features (a minimal R sketch follows below).
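A minimal R sketch of this feature extraction, assuming the 320 well
values of one plate are arranged in a 16 x 20 matrix named plate_values,
with accepted interval bounds lower and upper also assumed given (all
three names are hypothetical):

    # plate_values: hypothetical 16 x 20 matrix of top - bottom
    # intensities for one plate (control columns already removed).
    mp <- medpolish(plate_values, trace.iter = FALSE)

    # 16 row effects + 20 column effects = 36 features for the plate.
    features <- c(mp$row, mp$col)

    # Hypothetical QC rule: fail the plate if any feature falls outside
    # its accepted interval (lower and upper are length-36 vectors).
    plate_fails <- any(features < lower | features > upper)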

10
My Approach
  • There are 320 well values for each assay.
  • Sort these 320 values.
  • Take the smallest N (N = 2 to 35) of them.
  • Compute the mean of these smallest N numbers.
  • Use this mean as a feature.
  • Use rpart in R to get threshold values.
  • Found the optimum Ns (N = 13, 14, 15).
  • Explored N = 13, 14, 15 in more detail to get the
    corresponding optimum thresholds (see the sketch below).
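A minimal R sketch of this procedure, assuming well_values is a
2904 x 320 matrix of well values (one row per plate) and labels is
the expert pass/fail factor for each plate; both names are hypothetical:

    library(rpart)

    # Feature: mean of the N smallest well values on a plate.
    mean_of_smallest <- function(v, N) mean(sort(v)[1:N])

    for (N in 2:35) {
      d <- data.frame(label   = labels,
                      feature = apply(well_values, 1, mean_of_smallest, N = N))
      fit <- rpart(label ~ feature, data = d, method = "class")
      print(fit)   # the first split gives the candidate threshold for this N
    }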

11
Find the optimum N
12
Find the optimum threshold
13
New rule
  • If Mean_Of_15_Smallest > 5246.992, pass the plate.
  • Otherwise, fail the plate.

14
Results
  • Define Cost = PF + 0.5 x FP.
  • Global QC (median polish): cost = 292 + 0.5 x 65 = 324.5.
  • New rule: cost = 123 + 0.5 x 218 = 232.
  • Cost decreased by (324.5 - 232) / 324.5 = 28.5%.
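The same cost comparison, written out in R:

    cost <- function(pf, fp) pf + 0.5 * fp    # Cost = PF + 0.5 x FP

    cost_global <- cost(292, 65)    # 324.5 for the median polish QC
    cost_new    <- cost(123, 218)   # 232.0 for the new rule
    (cost_global - cost_new) / cost_global    # 0.285, i.e. 28.5%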
15
Other approaches
  • Compute the mean and variance of each m x n
    block, m = 4 to 10, n = 4 to 10, and use the results
    as features (a block-feature sketch follows below).
  • Apply LDA in R.
  • Use the randomForest package in R.
  • These methods did not give us better results.
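A minimal sketch of the block-feature idea in R, assuming plate_values
is a hypothetical 16 x 20 matrix for one plate and using m = n = 4 as
one of the candidate block sizes:

    # Mean and variance of each non-overlapping m x n block of one plate.
    block_features <- function(plate_values, m = 4, n = 4) {
      feats <- c()
      for (i in seq(1, nrow(plate_values) - m + 1, by = m)) {
        for (j in seq(1, ncol(plate_values) - n + 1, by = n)) {
          block <- plate_values[i:(i + m - 1), j:(j + n - 1)]
          feats <- c(feats, mean(block), var(as.vector(block)))
        }
      }
      feats
    }
    # The resulting feature vectors can then be passed to MASS::lda()
    # or to randomForest() for classification.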

16
Active Learning
  • Many manual tasks are tedious, repetitive, boring,
    or even dangerous.
  • We try to automate such tasks.
  • Learning: ask questions and learn from the answers.
  • Active learning: if questions are cheap but
    answers are expensive, we would like to ask the
    most informative series of questions.

17
Classification
  • Input:
  • samples (x1, y1), (x2, y2), ..., (xn, yn)
  • Output:
  • a classification rule c such that the error
    rate err(c) is as small as possible.

18
Sequential sampling
  • When labels are expensive, we would like to
    economize.
  • Select examples which will result in the best
    classifier.
  • We can hope for extra efficiency if we permit
    sampling to depend on previous samples and labels.

19
Query by Committee: How to do sequential sampling?
  • Let Sn be the labeled sample at stage n.
  • Let Vn = C(Sn) be the version space at stage n,
    relative to the current sample.
  • (Version space: the set of all classifiers
    which agree with the samples.)
  • Define the information in Vn as -log p(Vn).
  • Given an unlabelled example x, measure its
    potential information by the expected
    instantaneous information gain.
  • Select examples with high information gain.

20
Monte Carlo approach
  • Choose k = 2 concepts (the committee) at random from
    the version space. If the two concepts agree on
    an example x, we estimate the entropy of c(x) to
    be low; if they disagree, high.
  • Apply this to a stream of examples, selecting
    those with high entropy.

21
Likelihood for the forest
  • P(y|x)^y (1 - P(y|x))^(1 - y)
  • (x, y) is a sample point.
  • For a given sample:
  • high likelihood means a good forest.
  • For a fixed forest:
  • low likelihood means an informative sample
    (see the small function below).
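The per-sample likelihood above, written as a small R function (the
names are illustrative):

    # Bernoulli likelihood of one labeled sample (x, y), where
    # p = P(y = 1 | x) is the predicted probability and y is 0 or 1.
    sample_likelihood <- function(p, y) p^y * (1 - p)^(1 - y)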

22
Our approach
  • Use a weighted random forest as the committee.
  • Use the likelihood of each tree as the weight for
    the corresponding tree.
  • Use the likelihood of the forest as the
    information measure.
  • Select those examples with low forest
    likelihood (high information gain), as sketched below.
  • Update the forest.
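A minimal sketch of one selection step, under simplifying assumptions:
predict_forest() returns the weighted committee estimate of P(y = 1 | x)
and update_forest() reweights (or refits) the committee with the new
labels. Both are hypothetical helpers, not functions of the randomForest
package, and the confidence pmax(p, 1 - p) is used here as a stand-in
for the forest likelihood of the still-unknown label:

    active_learning_step <- function(forest, pool_x, oracle, batch = 1) {
      p    <- predict_forest(forest, pool_x)   # P(y = 1 | x) over the pool
      conf <- pmax(p, 1 - p)                   # low confidence = trees disagree
      pick <- order(conf)[1:batch]             # most informative examples
      y <- oracle(pool_x[pick, , drop = FALSE])            # expensive label query
      forest <- update_forest(forest, pool_x[pick, , drop = FALSE], y)
      list(forest = forest, queried = pick)
    }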

23
Procedures
  • Generate random multidimensional Gaussian
    data.
  • Revise the randomForest package in R into a
    weighted random forest.
  • Compute the likelihood of each tree in the
    forest.
  • Compute the likelihood of the forest.
  • Develop forest-updating functions.

24
Generating data
  • Very flexible: we can simulate almost any kind
    of data.
  • Real data are limited.
  • The generated data already have labels.
  • Easy to use in research work.
  • Easy to control the Bayes error and the decision
    boundary.

25
Random forest
  • Each tree is a classifier.
  • A random forest is an ensemble of such trees.
  • Randomly selected features are used to split each
    node in a tree.
  • The overall error of the forest converges as the
    number of trees becomes large (see the example below).
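A small randomForest example illustrating this convergence via the
out-of-bag error, assuming x is a feature matrix and y a factor of
class labels (a hypothetical training set):

    library(randomForest)

    rf <- randomForest(x, y, ntree = 500)

    # err.rate holds the cumulative out-of-bag error after each tree;
    # plotting it shows the error settling as the forest grows.
    plot(rf$err.rate[, "OOB"], type = "l",
         xlab = "number of trees", ylab = "OOB error")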

26
Why weighted random forest?
  • Some trees in the forest are more important than
    others.
  • As the number of trees in the random forest
    grows, it becomes slower.
  • With a weighted random forest, only a
    relatively small number of trees is needed.
  • Theoretically, we want the forest to approximate
    the posterior distribution, which requires weights.
  • When adding new training data, we can simply
    update the weights of the forest.

27
Simple example
  • Suppose we have a two-class problem:
  • Class 0: N((-1, 2), I), p = 1/2;
  • N((-1, -2), I), p = 1/2.
  • Class 1: N((1, 4), I), p = 1/3;
  • N((1, 0), I), p = 1/3;
  • N((1, -4), I), p = 1/3.
  • I is the 2-dimensional identity matrix, used here
    as the covariance matrix of the random data
    (generated in the sketch below).
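A minimal R sketch generating data from this mixture with MASS::mvrnorm;
the per-class sample size n is an arbitrary choice:

    library(MASS)

    n  <- 3000        # points per class (arbitrary)
    I2 <- diag(2)     # 2-dimensional identity covariance

    # Class 0: equal mixture of N((-1, 2), I) and N((-1, -2), I).
    class0 <- rbind(mvrnorm(n / 2, c(-1,  2), I2),
                    mvrnorm(n / 2, c(-1, -2), I2))

    # Class 1: equal mixture of N((1, 4), I), N((1, 0), I), N((1, -4), I).
    class1 <- rbind(mvrnorm(n / 3, c(1,  4), I2),
                    mvrnorm(n / 3, c(1,  0), I2),
                    mvrnorm(n / 3, c(1, -4), I2))

    x <- rbind(class0, class1)
    y <- factor(c(rep(0, nrow(class0)), rep(1, nrow(class1))))
    plot(x, col = as.integer(y) + 1, pch = 20)   # visualize the two classes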

28
Simple example, cont.
  • Generate the data and plot it.
  • How to find a good classifier?

[Plot: 6000 data points, 500 trees]
[Plot: 3000 data points, 500 trees]
29
Acknowledgement
  • Reid
  • Ansu
  • Scott
  • Irene
  • Nancy
  • ACSM interns
  • ACSM family