Title: Merck ACSM Intern Presentation
1. Merck ACSM Intern Presentation
2. Projects
- Quality control on high-throughput screening
- An active learning approach to this type of problem
3. Quality control on high-throughput screening (HTS)
- HTS quantitatively and qualitatively examines the interaction of synthetic molecules with proteins to identify target compounds for a potential drug.
- HTS plays an important role in pharmaceutical research and drug discovery.
- Technologies: HTRF, SPA, FLIPR, BLA
4. FLIPR system (Fluorescent Imaging Plate Reader)
- Plate formats: 96, 384, 1536, 3456 wells
- Trend toward more wells per plate
5. 384-well FLIPR assays
- 16 (rows) x 24 (columns) = 384 wells
- The first 2 and last 2 columns are controls.
- 16 x (24 - 2x2) = 320 wells.
- Two channels of data, top and bottom (or max and min intensity values).
- We use the difference, top - bottom (see the sketch below).
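A minimal R sketch of this step, assuming top and bottom are already loaded as 16 x 24 matrices for one plate (the matrix names and control-column positions are assumptions):

    # top, bottom: assumed 16 x 24 matrices (one plate, two channels)
    diff_mat <- top - bottom                 # per-well top - bottom difference
    wells <- as.vector(diff_mat[, 3:22])     # drop the 2 control columns on each side
    length(wells)                            # 320 non-control wells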
6. Data
- Min: VGCCA1B_FLIPR_ANT_Block1_2904.txt
- Max: VGCCA1B_FLIPR_ANT_Block1_2904.txt
- 2904 plates; experts fail 406 plates and pass 2498 plates.
7. Quality control for the FLIPR system
- Global quality control: the median polish method
- Local quality control: a set of rules, such as patch finder
8. Median Polish
- µ_ij = µ + µ_i + µ_j + e_ij
- µ_ij: the value at well (i, j)
- µ: analog of the global mean; fixed overall effect
- µ_i: analog of the ith row mean; row effect
- µ_j: analog of the jth column mean; column effect
- e_ij: random error term
9. Median Polish (cont.)
- Median polish yields 16 + 20 = 36 features for a 384-well FLIPR assay.
- Do QC by computing these 36 features and testing whether each falls in some interval; if not, we fail the plate.
- For now, global quality control uses the median polish features (a sketch follows).
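A minimal R sketch of extracting such features with the built-in medpolish function (the plate matrix name and the 16 x 20 layout of non-control wells are assumptions):

    # plate: assumed 16 x 20 matrix of top - bottom values (non-control wells)
    fit <- medpolish(plate, trace.iter = FALSE)
    features <- c(fit$row, fit$col)   # 16 row effects + 20 column effects = 36 features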
10. My Approach
- There are 320 well values for each assay.
- Sort these 320 values.
- Take the smallest N (2-35) of them.
- Compute the mean of these smallest N values.
- Use this mean as a feature.
- Use rpart in R to get threshold values.
- Found the optimum Ns (N = 13, 14, 15).
- Explored N = 13, 14, 15 in more detail to get the corresponding optimum threshold (see the sketch after this list).
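A minimal R sketch of this feature-and-threshold step, assuming plates is a list of 320-value vectors and labels holds the experts' pass/fail calls (both names are assumptions):

    library(rpart)
    mean_smallest <- function(wells, N = 15) mean(sort(wells)[1:N])  # mean of the N smallest wells
    dat <- data.frame(feature = sapply(plates, mean_smallest),
                      label   = factor(labels))
    fit <- rpart(label ~ feature, data = dat)
    print(fit)   # the first split gives the threshold value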
11. Find the optimum N
12. Find the optimum threshold
13. New rule
- If Mean_Of_15_Smallest > 5246.992, pass the plate (see the sketch below).
- Otherwise, fail the plate.
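The rule written out as a one-line R function (a sketch; wells is an assumed 320-value vector for one plate):

    pass_plate <- function(wells) mean(sort(wells)[1:15]) > 5246.992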
14Results
Define Cost PF 0.5 FP Global QC (median
polish) cost292 0.5 65324.5 New rule
cost123 0.5 218232 Cost decreased
by (324.5-232)/324.528.5
15. Other approaches
- Compute the mean and variance for each m x n block, m in 4-10, n in 4-10, and use the results as features (see the sketch below).
- Apply LDA in R.
- Use the randomForest package in R.
- These methods did not get us better results.
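A minimal R sketch of the block-feature idea with 4 x 4 blocks (the block stepping, data frame names, and the LDA call are assumptions):

    library(MASS)
    block_feats <- function(mat, m = 4, n = 4) {
      feats <- c()
      for (i in seq(1, nrow(mat) - m + 1, by = m))
        for (j in seq(1, ncol(mat) - n + 1, by = n)) {
          b <- mat[i:(i + m - 1), j:(j + n - 1)]
          feats <- c(feats, mean(b), var(as.vector(b)))   # block mean and variance
        }
      feats
    }
    # dat: assumed data frame with one row of block features per plate plus the expert label
    fit <- lda(label ~ ., data = dat)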
16. Active Learning
- Many manual tasks are tedious, repetitive, boring, or even dangerous.
- We try to automate such tasks.
- Learning: ask questions and learn from the answers.
- Active learning: if questions are cheap but answers are expensive, we would like to ask the most informative series of questions.
17. Classification
- Input: samples (x1, y1), (x2, y2), ..., (xn, yn)
- Output: a classification rule c such that the error rate err(c) is as small as possible.
18. Sequential sampling
- When labels are expensive, we would like to economize.
- Select the examples that will result in the best classifier.
- We can hope for extra efficiency if we permit sampling to depend on previous samples and labels.
19. Query by Committee: how to do sequential sampling?
- Let Sn be the labeled sample at stage n.
- Let Vn = C(Sn) be the version space at stage n, relative to the current sample.
- (Version space: the set of all classifiers that agree with the samples.)
- Define the information in Vn as log P(Vn).
- Given an unlabelled example x, measure its potential information by the expected instantaneous information gain.
- Select examples with high information gain.
20. Monte Carlo approach
- Choose k = 2 concepts (a committee) at random from the version space. If the two concepts agree on an example x, we estimate the entropy of c(x) to be low; if they disagree, high.
- Apply this to a stream of examples, selecting those with high entropy (see the sketch below).
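A minimal R sketch of this idea, using two rpart trees fit on bootstrap resamples as a stand-in for two concepts drawn from the version space (the labeled / stream data frames and the bootstrap stand-in are assumptions, not the exact sampling scheme):

    library(rpart)
    committee <- lapply(1:2, function(i) {
      idx <- sample(nrow(labeled), replace = TRUE)      # bootstrap resample of the labeled set
      rpart(y ~ ., data = labeled[idx, ])
    })
    pred <- sapply(committee, function(m) as.character(predict(m, stream, type = "class")))
    informative <- stream[pred[, 1] != pred[, 2], ]     # committee disagrees: query these labels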
21. Likelihood for the forest
- P(y|x)^y (1 - P(y|x))^(1 - y)
- (x, y) is a sample point.
- For a given sample: high likelihood means a good forest.
- For a fixed forest: low likelihood means an informative sample (written out below).
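The same likelihood written out as a tiny R function (the function name and the example probability are illustrative):

    bern_lik <- function(p, y) p^y * (1 - p)^(1 - y)   # P(y|x)^y (1 - P(y|x))^(1 - y)
    bern_lik(0.9, 1)   # 0.9: high likelihood, the forest fits this sample well
    bern_lik(0.9, 0)   # 0.1: low likelihood, a surprising / informative sample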
22. Our approach
- Use a weighted random forest as the committee.
- Use the likelihood of each tree as the weight for that tree.
- Use the likelihood of the forest as an information measure.
- Select the examples with low forest likelihood (high information gain).
- Update the forest (a rough sketch follows).
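One way this could look in R; a sketch under assumptions, not the actual modified randomForest code. Here p_tree is an assumed vector of per-tree P(y = 1 | x) values for one example, w the current tree weights, and bern_lik comes from the sketch after slide 21:

    # weighted vote of the forest for one example
    weighted_vote <- function(p_tree, w) as.integer(sum(w * p_tree) / sum(w) > 0.5)

    # when a new labeled point (x, y) arrives, reweight trees by their likelihood on it
    update_weights <- function(w, p_tree, y) {
      w_new <- w * bern_lik(p_tree, y)
      w_new / sum(w_new)     # keep the weights normalized
    }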
23. Procedures
- Generate random multidimensional Gaussian distribution data.
- Revise the randomForest package in R into a weighted random forest.
- Compute the likelihood for each tree in the forest.
- Compute the likelihood for the forest.
- Develop some forest-updating functions.
24. Generating data
- Very flexible: we can simulate almost any kind of data.
- Real data are limited.
- The generated data already have labels.
- Easy to use for research work.
- Easy to control the Bayes error and the decision boundary.
25. Random forest
- Each tree is a classifier.
- A random forest is an ensemble of such trees.
- Randomly selected features are used to split each node in a tree.
- The overall error of the forest converges as the number of trees becomes large.
26. Why a weighted random forest?
- Some trees in the forest are more important than others.
- As the number of trees in the random forest grows, the forest becomes slow.
- A weighted random forest needs only a relatively small number of trees.
- Theoretically, we want the forest to approximate the posterior distribution, which requires weights.
- When adding new training data, we can just update the weights of the forest.
27. Simple example
- Suppose we have a two-class problem:
- Class 0: N((-1, 2), I) with p = 1/2; N((-1, -2), I) with p = 1/2
- Class 1: N((1, 4), I) with p = 1/3; N((1, 0), I) with p = 1/3; N((1, -4), I) with p = 1/3
- Here I is the 2-dimensional identity matrix, used as the covariance matrix of the random data (a generation sketch follows).
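A minimal R sketch of generating this mixture (the sample size of 3000 per class and the function name are assumptions):

    gen_class <- function(n, means) {
      comp <- sample(nrow(means), n, replace = TRUE)   # pick mixture components with equal weight
      cbind(rnorm(n, means[comp, 1]), rnorm(n, means[comp, 2]))   # identity covariance
    }
    x0 <- gen_class(3000, rbind(c(-1,  2), c(-1, -2)))              # class 0
    x1 <- gen_class(3000, rbind(c( 1,  4), c( 1,  0), c( 1, -4)))   # class 1
    dat <- data.frame(rbind(x0, x1), y = factor(rep(0:1, each = 3000)))
    plot(dat[, 1:2], col = dat$y)   # scatter plot colored by class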
28. Simple example (cont.)
- Generate the data and plot it.
- How do we find a good classifier? (See the sketch below.)
- (Plots: 6000 data points, 500 trees; 3000 data points, 500 trees.)
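Continuing the sketch above, one plausible baseline is an ordinary random forest with 500 trees (dat comes from the previous sketch; this call illustrates the unweighted forest, not the weighted forest of slide 22):

    library(randomForest)
    rf <- randomForest(y ~ ., data = dat, ntree = 500)
    rf$err.rate[500, "OOB"]   # out-of-bag error estimate after 500 trees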
29. Acknowledgements
- Reid
- Ansu
- Scott
- Irene
- Nancy
- ACSM interns
- ACSM family