Title: Merck ACSM Intern Presentation
1. Merck ACSM Intern Presentation
2. Projects
- Quality control on high-throughput screening
- An active learning approach to this type of problem
3. Quality control on high-throughput screening (HTS)
- HTS quantitatively and qualitatively examines the interaction of synthetic molecules with proteins to identify target compounds for a potential drug.
- HTS plays an important role in pharmaceutical research and drug discovery.
- Technologies: HTRF, SPA, FLIPR, BLA
4. FLIPR system (Fluorescent Imaging Plate Reader)
- Plate formats: 96, 384, 1536, 3456 wells
- Trend toward more wells per plate
5. 384-well FLIPR assays
- 16 (rows) x 24 (columns) = 384 wells
- The first 2 and last 2 columns are controls.
- 16 x (24 - 2x2) = 320 wells.
- Two channels of data, top and bottom (or max and min intensity values).
- We use the difference, top - bottom (see the sketch below).
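A minimal R sketch of this step, assuming top and bottom are already loaded as 16 x 24 matrices for one plate (the matrix names and control-column positions are assumptions):

    # top, bottom: assumed 16 x 24 matrices (one plate, two channels)
    diff_mat <- top - bottom                 # per-well top - bottom difference
    wells <- as.vector(diff_mat[, 3:22])     # drop the 2 control columns on each side
    length(wells)                            # 320 non-control wells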
6. Data
- Min: VGCCA1B_FLIPR_ANT_Block1_2904.txt
- Max: VGCCA1B_FLIPR_ANT_Block1_2904.txt
- 2904 plates; experts fail 406 plates and pass 2498 plates.
7. Quality control for the FLIPR system
- Global quality control: the median polish method
- Local quality control: a set of rules, such as patch finder
8. Median Polish
- µ_ij = µ + µ_i + µ_j + e_ij
- µ_ij: the value at well (i, j)
- µ: analog of the global mean; fixed overall effect
- µ_i: analog of the ith row mean; row effect
- µ_j: analog of the jth column mean; column effect
- e_ij: random error term
9. Median Polish (cont.)
- Median polish yields 16 + 20 = 36 features for a 384-well FLIPR assay.
- Do QC by computing these 36 features and testing whether each falls in some interval; if not, we fail the plate.
- For now, global quality control uses the median polish features (a sketch follows).
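A minimal R sketch of extracting such features with the built-in medpolish function (the plate matrix name and the 16 x 20 layout of non-control wells are assumptions):

    # plate: assumed 16 x 20 matrix of top - bottom values (non-control wells)
    fit <- medpolish(plate, trace.iter = FALSE)
    features <- c(fit$row, fit$col)   # 16 row effects + 20 column effects = 36 features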
10. My Approach
- There are 320 well values for each assay.
- Sort these 320 values.
- Take the smallest N (2-35) of them.
- Compute the mean of these smallest N values.
- Use this mean as a feature.
- Use rpart in R to get threshold values.
- Found the optimum Ns (N = 13, 14, 15).
- Explored N = 13, 14, 15 in more detail to get the corresponding optimum threshold (see the sketch after this list).
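A minimal R sketch of this feature-and-threshold step, assuming plates is a list of 320-value vectors and labels holds the experts' pass/fail calls (both names are assumptions):

    library(rpart)
    mean_smallest <- function(wells, N = 15) mean(sort(wells)[1:N])  # mean of the N smallest wells
    dat <- data.frame(feature = sapply(plates, mean_smallest),
                      label   = factor(labels))
    fit <- rpart(label ~ feature, data = dat)
    print(fit)   # the first split gives the threshold value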
11. Find the optimum N
12. Find the optimum threshold
13. New rule
- If Mean_Of_15_Smallest > 5246.992, pass the plate (see the sketch below).
- Otherwise, fail the plate.
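The rule written out as a one-line R function (a sketch; wells is an assumed 320-value vector for one plate):

    pass_plate <- function(wells) mean(sort(wells)[1:15]) > 5246.992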
14Results
Define Cost PF 0.5 FP Global QC (median
polish) cost292 0.5 65324.5 New rule
cost123 0.5 218232 Cost decreased
by (324.5-232)/324.528.5
15. Other approaches
- Compute the mean and variance for each m x n block, m in 4-10, n in 4-10, and use the results as features (see the sketch below).
- Apply LDA in R.
- Use the randomForest package in R.
- These methods did not get us better results.
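A minimal R sketch of the block-feature idea with 4 x 4 blocks (the block stepping, data frame names, and the LDA call are assumptions):

    library(MASS)
    block_feats <- function(mat, m = 4, n = 4) {
      feats <- c()
      for (i in seq(1, nrow(mat) - m + 1, by = m))
        for (j in seq(1, ncol(mat) - n + 1, by = n)) {
          b <- mat[i:(i + m - 1), j:(j + n - 1)]
          feats <- c(feats, mean(b), var(as.vector(b)))   # block mean and variance
        }
      feats
    }
    # dat: assumed data frame with one row of block features per plate plus the expert label
    fit <- lda(label ~ ., data = dat)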
16. Active Learning
- Many manual tasks are tedious, repetitive, boring, or even dangerous.
- We try to automate such tasks.
- Learning: ask questions and learn from the answers.
- Active learning: if questions are cheap but answers are expensive, we would like to ask the most informative series of questions.
17. Classification
- Input: samples (x1, y1), (x2, y2), ..., (xn, yn)
- Output: a classification rule c such that the error rate err(c) is as small as possible.
18. Sequential sampling
- When labels are expensive, we would like to economize.
- Select the examples that will result in the best classifier.
- We can hope for extra efficiency if we permit sampling to depend on previous samples and labels.
19. Query by Committee: how to do sequential sampling?
- Let Sn be the labeled sample at stage n.
- Let Vn = C(Sn) be the version space at stage n, relative to the current sample.
- (Version space: the set of all classifiers that agree with the samples.)
- Define the information in Vn as log P(Vn).
- Given an unlabelled example x, measure its potential information by the expected instantaneous information gain.
- Select examples with high information gain.
20. Monte Carlo approach
- Choose k = 2 concepts (a committee) at random from the version space. If the two concepts agree on an example x, we estimate the entropy of c(x) to be low; if they disagree, high.
- Apply this to a stream of examples, selecting those with high entropy (see the sketch below).
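A minimal R sketch of this idea, using two rpart trees fit on bootstrap resamples as a stand-in for two concepts drawn from the version space (the labeled / stream data frames and the bootstrap stand-in are assumptions, not the exact sampling scheme):

    library(rpart)
    committee <- lapply(1:2, function(i) {
      idx <- sample(nrow(labeled), replace = TRUE)      # bootstrap resample of the labeled set
      rpart(y ~ ., data = labeled[idx, ])
    })
    pred <- sapply(committee, function(m) as.character(predict(m, stream, type = "class")))
    informative <- stream[pred[, 1] != pred[, 2], ]     # committee disagrees: query these labels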
21. Likelihood for the forest
- P(y|x)^y (1 - P(y|x))^(1 - y)
- (x, y) is a sample point.
- For a given sample: high likelihood means a good forest.
- For a fixed forest: low likelihood means an informative sample (written out below).
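The same likelihood written out as a tiny R function (the function name and the example probability are illustrative):

    bern_lik <- function(p, y) p^y * (1 - p)^(1 - y)   # P(y|x)^y (1 - P(y|x))^(1 - y)
    bern_lik(0.9, 1)   # 0.9: high likelihood, the forest fits this sample well
    bern_lik(0.9, 0)   # 0.1: low likelihood, a surprising / informative sample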
22. Our approach
- Use a weighted random forest as the committee.
- Use the likelihood of each tree as the weight for that tree.
- Use the likelihood of the forest as an information measure.
- Select the examples with low forest likelihood (high information gain).
- Update the forest (a rough sketch follows).
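One way this could look in R; a sketch under assumptions, not the actual modified randomForest code. Here p_tree is an assumed vector of per-tree P(y = 1 | x) values for one example, w the current tree weights, and bern_lik comes from the sketch after slide 21:

    # weighted vote of the forest for one example
    weighted_vote <- function(p_tree, w) as.integer(sum(w * p_tree) / sum(w) > 0.5)

    # when a new labeled point (x, y) arrives, reweight trees by their likelihood on it
    update_weights <- function(w, p_tree, y) {
      w_new <- w * bern_lik(p_tree, y)
      w_new / sum(w_new)     # keep the weights normalized
    }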
23. Procedures
- Generate random multidimensional Gaussian distribution data.
- Revise the randomForest package in R into a weighted random forest.
- Compute the likelihood for each tree in the forest.
- Compute the likelihood for the forest.
- Develop some forest-updating functions.
24. Generating data
- Very flexible: we can simulate almost any kind of data.
- Real data are limited.
- The generated data already have labels.
- Easy to use for research work.
- Easy to control the Bayes error and the decision boundary.
25. Random forest
- Each tree is a classifier.
- A random forest is an ensemble of such trees.
- Randomly selected features are used to split each node in a tree.
- The overall error of the forest converges as the number of trees becomes large.
26. Why a weighted random forest?
- Some trees in the forest are more important than others.
- As the number of trees in the random forest grows, the forest becomes slow.
- A weighted random forest needs only a relatively small number of trees.
- Theoretically, we want the forest to approximate the posterior distribution, which requires weights.
- When adding new training data, we can just update the weights of the forest.
27. Simple example
- Suppose we have a two-class problem:
- Class 0: N((-1, 2), I) with p = 1/2; N((-1, -2), I) with p = 1/2
- Class 1: N((1, 4), I) with p = 1/3; N((1, 0), I) with p = 1/3; N((1, -4), I) with p = 1/3
- Here I is the 2-dimensional identity matrix, used as the covariance matrix of the random data (a generation sketch follows).
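A minimal R sketch of generating this mixture (the sample size of 3000 per class and the function name are assumptions):

    gen_class <- function(n, means) {
      comp <- sample(nrow(means), n, replace = TRUE)   # pick mixture components with equal weight
      cbind(rnorm(n, means[comp, 1]), rnorm(n, means[comp, 2]))   # identity covariance
    }
    x0 <- gen_class(3000, rbind(c(-1,  2), c(-1, -2)))              # class 0
    x1 <- gen_class(3000, rbind(c( 1,  4), c( 1,  0), c( 1, -4)))   # class 1
    dat <- data.frame(rbind(x0, x1), y = factor(rep(0:1, each = 3000)))
    plot(dat[, 1:2], col = dat$y)   # scatter plot colored by class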
28. Simple example (cont.)
- Generate the data and plot it.
- How do we find a good classifier? (See the sketch below.)
- (Plots: 6000 data points, 500 trees; 3000 data points, 500 trees.)
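Continuing the sketch above, one plausible baseline is an ordinary random forest with 500 trees (dat comes from the previous sketch; this call illustrates the unweighted forest, not the weighted forest of slide 22):

    library(randomForest)
    rf <- randomForest(y ~ ., data = dat, ntree = 500)
    rf$err.rate[500, "OOB"]   # out-of-bag error estimate after 500 trees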
29. Acknowledgements
- Reid
- Ansu
- Scott
- Irene
- Nancy
- ACSM interns
- ACSM family