Title: A Confidence Interval for the Misclassification Rate
1 A Confidence Interval for the Misclassification Rate
2 Outline
- Review
- Three challenges in constructing CIs
- Combining a statistical approach with a learning theory approach to constructing CIs
- Relevance to confidence measures for the value of a policy
3 Review
- X is the vector of features, Y is the binary classification
- Misclassification rate
- Data: N iid observations of (Y, X)
- Given a space of classifiers and the data, use some method to construct a classifier
- The goal is to provide a CI for the misclassification rate of the constructed classifier
4 Review
- Since the loss function is not smooth, one commonly uses a surrogate loss to estimate the classifier
- Surrogate loss: L(Y, f(X))
5 Review
- General approach to providing a CI:
- Estimate the misclassification rate using the data
- Derive an approximate distribution for the estimation error
- Use this approximate distribution to construct a confidence interval for the misclassification rate
6 Three challenges
- The space of classifiers is too large, leading to over-fitting (negative bias)
- The misclassification rate is a non-smooth function.
- may behave like an extreme quantity
- No assumption that is close to optimal.
7 Three challenges
- The misclassification rate is non-smooth.
- Example: The unknown Bayes classifier has a quadratic decision boundary. We fit, by least squares, a linear decision boundary
- f(x) = sign(β0 + β1 x)
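The example above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation: the generative model (a logistic score in x squared), the sample size, and the function names are all assumptions made here for concreteness.

```python
import numpy as np

def fit_least_squares_classifier(x, y):
    """Fit f(x) = sign(b0 + b1 * x) by least squares on labels y in {-1, +1}."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def classify(beta, x):
    """Apply the linear rule; ties at exactly zero go to +1."""
    return np.where(beta[0] + beta[1] * x >= 0, 1, -1)

def misclassification_rate(beta, x, y):
    """Fraction of observations on which the rule disagrees with the label."""
    return float(np.mean(classify(beta, x) != y))

# Hypothetical data: P(Y = 1 | x) depends on x**2, so the Bayes boundary
# is quadratic while the fitted boundary is linear.
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(-1.0, 2.0, size=n)
p_pos = 1.0 / (1.0 + np.exp(-4.0 * (x**2 - 0.5)))
y = np.where(rng.uniform(size=n) < p_pos, 1, -1)

beta_hat = fit_least_squares_classifier(x, y)
train_error = misclassification_rate(beta_hat, x, y)
```

Because the error is a step function of (β0, β1), a small change in the fitted coefficients can flip several predictions at once, which is the non-smoothness the slide refers to.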
8 Density of
9 Bootstrap CI for
10 Misclassification Rate is Non-smooth
Sample Size   Bootstrap Percentile   Yang CV   CUD-Bound
30            .85                    .29       .91
50            .88                    .24       .92
100           .83                    .20       .94
200           .85                    .22       .95
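For reference, a percentile bootstrap interval can be sketched as follows. The slides do not say exactly which statistic was resampled, so this sketch assumes the training error of the least squares linear classifier; the function names, the demo data, and the default of 500 resamples are illustrative choices.

```python
import numpy as np

def training_error(x, y):
    """Refit the least squares linear classifier and return its training error."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    pred = np.where(design @ beta >= 0, 1, -1)
    return float(np.mean(pred != y))

def bootstrap_percentile_ci(x, y, n_boot=500, level=0.95, seed=0):
    """Percentile interval from the bootstrap distribution of the training error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        stats[b] = training_error(x[idx], y[idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical data with a quadratic true boundary.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 2.0, size=50)
y = np.where(x**2 + rng.normal(scale=0.3, size=50) > 0.5, 1, -1)
ci_low, ci_high = bootstrap_percentile_ci(x, y)
```

The interval endpoints are quantiles of a discrete, non-smooth statistic, which is one reason such an interval can behave poorly in small samples.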
11 CIs for Extreme Quantities
- may behave like an extreme quantity
- Should this be problematic?
- Highly skewed distribution of
- Fast convergence of to zero
12 CIs from Learning Theory
- Given a result of the form
- where is known to belong to and
- forms a 1-δ CI as
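As a hedged illustration of this kind of argument, a uniform deviation bound over a class of classifiers converts directly into a confidence interval. The symbols below (the class $\mathcal{F}$, the true rate $\tau$, its empirical version $\hat\tau_N$, the fitted classifier $\hat f$, and the bound $b_\delta$) are assumed notation for this sketch, not necessarily the slide's.

```latex
% A uniform bound over the class ...
P\Big( \sup_{f \in \mathcal{F}} \big| \hat\tau_N(f) - \tau(f) \big| \le b_\delta \Big) \ge 1 - \delta
% ... together with \hat f \in \mathcal{F} gives a 1 - \delta interval:
\quad\Longrightarrow\quad
P\Big( \tau(\hat f) \in \big[\, \hat\tau_N(\hat f) - b_\delta,\; \hat\tau_N(\hat f) + b_\delta \,\big] \Big) \ge 1 - \delta .

% Example for a finite class and a loss bounded in [0, 1]
% (Hoeffding's inequality plus a union bound):
b_\delta = \sqrt{ \frac{ \log\!\big( 2 |\mathcal{F}| / \delta \big) }{ 2N } } .
```

The interval is valid whichever element of $\mathcal{F}$ the fitting method returns, and it is typically conservative for the same reason.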
13 Combine statistical ideas with learning theory ideas
- Construct a confidence interval for
- where is chosen to be small yet contain
- --- from this CI deduce a conservative CI for
- --- use the surrogate loss to smooth the maximization and to construct
14
- Construct a confidence interval for
- --- should contain all that are close to
- --- all f for which
- --- is the limiting value of
15 Confidence Interval
- Construct a confidence interval for
- --- is a rate I would like
16 Confidence Interval
17 Bootstrap
- We use the bootstrap to obtain an estimate of an upper percentile of the distribution of
- to obtain b. The CI is then
18 Implementation
- Approximation space for the classifier is linear
- Surrogate loss is least squares
- is the .632 estimator
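A minimal sketch of a .632 error estimate for a linear, least squares classifier follows. It uses the standard weighting of 0.368 times the resubstitution error plus 0.632 times the leave-one-out bootstrap error; the function names, the number of resamples, and the demo data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def _design(x):
    """Add an intercept column to a 1-D or 2-D feature array."""
    return np.column_stack([np.ones(len(x)), x])

def _fit(x, y):
    beta, *_ = np.linalg.lstsq(_design(x), y, rcond=None)
    return beta

def _predict(beta, x):
    return np.where(_design(x) @ beta >= 0, 1, -1)

def err632(x, y, n_boot=200, seed=0):
    """0.368 * resubstitution error + 0.632 * leave-one-out bootstrap error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    resub = np.mean(_predict(_fit(x, y), x) != y)
    err_sum = np.zeros(n)  # errors on points left out of a resample
    count = np.zeros(n)    # number of resamples that left each point out
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        out = np.setdiff1d(np.arange(n), idx)
        if out.size == 0:
            continue
        beta = _fit(x[idx], y[idx])
        err_sum[out] += _predict(beta, x[out]) != y[out]
        count[out] += 1
    seen = count > 0
    loo_boot = np.mean(err_sum[seen] / count[seen])
    return float(0.368 * resub + 0.632 * loo_boot)

# Hypothetical data: two features, labels from a noisy linear score.
rng = np.random.default_rng(2)
x = rng.normal(size=(50, 2))
y = np.where(x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.5, size=50) > 0, 1, -1)
estimate = err632(x, y)
```

The weighting pulls the optimistic resubstitution error toward the pessimistic out-of-sample bootstrap error.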
19 Implementation
21 Computational Issues
- Partition R^p into equivalence classes defined by the pattern of signs
- Each equivalence class can be written as a set of β satisfying linear constraints.
- The term in absolute values is constant on each equivalence class
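The partition can be illustrated with a small sketch. It assumes the signs in question are those of the fitted scores x_i'β at the observed points, which the slide does not state explicitly; the function names and the tiny design matrix are hypothetical.

```python
import numpy as np

def sign_pattern(design, beta):
    """Signs of the scores x_i' beta at every observation (ties count as +1)."""
    return tuple(int(s) for s in np.where(design @ beta >= 0, 1, -1))

def in_class(design, pattern, beta):
    """Check the linear constraints that define one equivalence class:
    x_i' beta >= 0 where the pattern is +1, and x_i' beta < 0 where it is -1."""
    scores = design @ beta
    pattern = np.asarray(pattern)
    return bool(np.all(np.where(pattern == 1, scores >= 0, scores < 0)))

# Three observations of one feature, plus an intercept column.
design = np.column_stack([np.ones(3), np.array([-1.0, 0.5, 2.0])])

p1 = sign_pattern(design, np.array([0.0, 1.0]))   # (-1, 1, 1)
p2 = sign_pattern(design, np.array([0.1, 2.0]))   # same class as p1
p3 = sign_pattern(design, np.array([0.0, -1.0]))  # (1, -1, -1)

# Random draws of beta land in only a handful of classes.
rng = np.random.default_rng(0)
patterns = {sign_pattern(design, b) for b in rng.normal(size=(2000, 2))}
```

Every β in the same class makes identical predictions on the data, so any quantity that depends on β only through those predictions is constant on the class, and the class itself is a polyhedral cone that convex solvers can handle.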
22 Computational Issues
- can be written as
- since g is non-decreasing.
23 Computational Issues
- The problem reduces to the computation of a number of convex optimization problems. The number of convex optimizations is reduced via use of g with a branch and bound algorithm.
- With a sample size of N = 150 and 11 features, calculation of the percentiles of the CUD bound using 500 bootstrap samples can be accomplished in a few minutes on a standard desktop (2.4 GHz processor, 2 GB RAM).
24 Comparisons, 95% CI
Data       CUD   BS    M     Y
Spam       1.0   .99   .63   1.0
Ion        .96   .96   .80   .99
Heart      1.0   .99   .95   1.0
Diabetes   1.0   .91   .98   .99
Donut      .98   .90   .62   .88
Outlier    .99   .80   .93   .93
Sample size = 50
25 Comparisons, length of CI
Data       CUD   BS    M     Y
Spam       .56   .38   .25   .33
Ion        .34   .36   .24   .32
Heart      .47   .47   .40   .44
Diabetes   .39   .31   .31   .36
Donut      .49   .53   .26   .33
Outlier    .50   .39   .29   .33
Sample size = 50
26 Intuition
- If then we are approximating the distribution of
- where
27 Intuition
- If and
- then the distribution is approximately that of the absolute value of a normal random variable
- (limiting distribution for binomial, as expected).
28 Intuition
- If and
- the distribution is approximately the distribution of
29 Intuition
- Consider
- if in place of we put where is close to
- then due to the non-smoothness in
- at we will get jittering.
30 Discussion
- Further reduce the conservatism of the CUD-bound.
- Eliminate symmetry of CUD-bound CI
- ?
- Trade off computational burden versus bias by use of a surrogate for the indicator in the misclassification rate
- The real goal is to produce CIs for the value of a policy.
31 The simplest dynamic treatment regime (e.g. policy) is a decision rule if there is only one stage of treatment
1 Stage for each individual
Observation available at jth stage
Action at jth stage (usually a treatment)
Primary Outcome
32 Goal: Construct decision rules that input patient information and output a recommended action; these decision rules should lead to a maximal mean Y. In future one selects action
33 Single Stage (k = 1)
- Find a confidence interval for the mean outcome if a particular estimated policy (here one decision rule) is employed.
- Action A is randomized in {-1, 1}.
- Suppose the decision rule is of form
- We do not assume the optimal decision boundary is linear.
34 Single Stage (k = 1)
- Mean outcome following this policy is
- is the randomization probability
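One standard way to estimate the mean outcome of a policy from randomized data is inverse probability weighting: average Y over the individuals whose randomized action agrees with the rule, weighting by one over the randomization probability. The sketch below assumes that form; the function name `ipw_value`, the linear rule, and the simulated data are illustrative and are not taken from the slides.

```python
import numpy as np

def ipw_value(y, a, x, rule, prob):
    """Inverse-probability-weighted estimate of the mean outcome under `rule`.

    y    : observed outcomes
    a    : randomized actions in {-1, +1}
    x    : patient information
    rule : function mapping x to recommended actions in {-1, +1}
    prob : probability with which each observed action was assigned
    """
    recommended = rule(x)
    weights = (a == recommended) / prob
    return float(np.mean(weights * y))

# Hypothetical trial: actions randomized with probability 1/2 each.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
a = rng.choice([-1, 1], size=n)
y = 1.0 + a * x + rng.normal(scale=0.5, size=n)  # treatment helps when x > 0

def linear_rule(x, beta0=0.0, beta1=1.0):
    """Decision rule of linear form: sign(beta0 + beta1 * x)."""
    return np.where(beta0 + beta1 * x >= 0, 1, -1)

value_hat = ipw_value(y, a, x, linear_rule, prob=0.5)
```

The indicator that the action matches the rule plays the same role here as the misclassification indicator, so the estimated value inherits the same non-smoothness in the rule's parameters.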
36 Oslin ExTENd
[Figure: trial design flow chart. Random assignment to an Early Trigger or a Late Trigger for Nonresponse on Naltrexone. In each arm, after 8 wks: Response leads to random assignment between Naltrexone and TDM + Naltrexone; Nonresponse leads to random assignment between CBI and CBI + Naltrexone.]
37
- This seminar can be found at http://www.stat.lsa.umich.edu/samurphy/seminars/Stanford04.01.08.ppt
- Email Eric or me with questions or if you would like a copy of the associated paper: laber_at_umich.edu or samurphy_at_umich.edu