Local one class optimization - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Local one class optimization
  • Gal Chechik, Stanford
  • joint work with Koby Crammer, Hebrew University of Jerusalem

2
The one-class problem
  • Find a subset of similar/typical samples.
  • Formally: find a ball of a given radius (under some metric) that covers as many data points as possible (related to the set-covering problem); a minimal coverage sketch follows below.

3
Motivation I
  • Unsupervised setting: sometimes we wish to model small parts of the data and ignore the rest. This happens when many data points are irrelevant.
  • Examples:
  • Finding sets of co-expressed genes in a genome-wide experiment: identify the relevant genes among thousands of irrelevant ones.
  • Finding a set of documents on the same topic in a heterogeneous corpus.

4
Motivation II
  • Supervised setting: learning from positive samples only.
  • Examples:
  • Protein interactions
  • Intrusion detection applications
  • We care about a low false-positive rate.

5
Current approaches
  • The problem is often treated as outlier/novelty detection: most samples are relevant.
  • Current approaches use:
  • a convex cost function (Schölkopf 95, Tax and Duin 99, Ben-Hur et al 2001),
  • a parameter that affects the size or weight of the ball.
  • Bias towards the center of mass: when searching for a small ball, the center of the optimal ball lies at the global center of mass, w* = argmin_w Σ_x (x − w)², missing the interesting structures (see the numerical sketch below).
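A quick numerical illustration of this bias (a sketch under the assumption of a squared-L2 cost, not taken from the slides): the unconstrained minimizer of Σ_x (x − w)² is the sample mean, which lands between the clusters rather than on either of them.

```python
import numpy as np

# Two tight clusters plus uniform background noise, as in the synthetic example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([3, 3], 0.3, (50, 2)),
               rng.normal([-3, -3], 0.3, (50, 2)),
               rng.uniform(-6, 6, (200, 2))])

# argmin_w sum_x ||x - w||^2 is the sample mean: it sits near the global
# center of mass, between the two clusters, missing both structures.
w_star = X.mean(axis=0)
print(w_star)
```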

6
Current approaches
  • Example with synthetic data
  • 2 Gaussians + a uniform background

(Figure: convex one-class (OSU-SVM) vs. local one-class on the synthetic data)
7
How do we do it?
  • A cost function designed for small sets
  • A probabilistic approach that allows soft assignment to the set
  • Regularized optimization

8
1. A cost function for small sets
  • Handles the case where only a few samples are relevant.
  • Use a cost function that is flat for samples not in the set.
  • Two parameters:
  • a divergence measure D_BF
  • a flat cost K
  • Indifferent to the position of irrelevant samples.
  • Solutions converge to the center of mass when the ball is large.

9
2. A probabilistic formulation
  • We are given m samples in a d-dimensional space or simplex, indexed by x.
  • p(x) is the prior distribution over samples.
  • c ∈ {TRUE, FALSE} is a random variable that characterizes assignment to the interesting set (the ball).
  • p(c|x) reflects our belief that sample x is interesting.
  • The cost function is D = p(c|x) D_BF(w || v_x) + (1 − p(c|x)) K, where D_BF is a divergence measure, to be discussed later (a small sketch follows below).

10
3. Regularized optimization
  • The goal: minimize the mean cost + regularization
  • min_{p(c|x), w}  β ⟨D_BF,K(w_c, v_x)⟩_{p(c,x)} + I(C;X)
  • The first term measures the mean distortion:
  • ⟨D_BF,K(p(c|x), w, v_x)⟩ = Σ_x p(x) [ p(c|x) D_BF(w || v_x) + (1 − p(c|x)) K ]
  • The second term regularizes the compression of the data (it removes information about X):
  • I(C;X) = H(X) − H(X|C)
  • It pushes for putting many points in the set.
  • This target function is not convex (a sketch of the objective follows below).
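A sketch of the full objective under these definitions, assuming a uniform prior p(x) = 1/m and natural logarithms (illustrative code, not the authors' implementation):

```python
import numpy as np

def one_class_objective(p_c_given_x, d_bf, K, beta):
    """beta * <D_BF,K>_{p(c,x)} + I(C;X), assuming uniform p(x) and binary C."""
    m = len(p_c_given_x)
    p_x = np.full(m, 1.0 / m)

    # Mean distortion: sum_x p(x) [ p(c|x) D_BF(w||v_x) + (1 - p(c|x)) K ]
    distortion = np.sum(p_x * (p_c_given_x * d_bf + (1 - p_c_given_x) * K))

    # Mutual information I(C;X) for the binary assignment variable c
    eps = 1e-12
    p_c = np.sum(p_x * p_c_given_x)                      # marginal p(c = TRUE)
    p_in, p_out = p_c_given_x, 1 - p_c_given_x
    mi = np.sum(p_x * (p_in * np.log((p_in + eps) / (p_c + eps))
                       + p_out * np.log((p_out + eps) / (1 - p_c + eps))))

    return beta * distortion + mi
```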

11
To solve the problem
  • It turns out that for a family of divergence
    functions, called Bregman divergences, we can
    analytically describe properties of the optimal
    solution.
  • The proof follows the analysis of the Information Bottleneck method (Tishby, Pereira & Bialek, 1999).

12
Relation to Information Bottleneck
  • IB aims to compress one variable X into T while preserving information about Y, combining the two goals into a single trade-off optimization:
  • min I(T;X) − β I(T;Y)
  • A mathematically equivalent formulation:
  • min β ⟨D_KL(w_T || v_x)⟩ + I(T;X)
  • where ⟨D_KL⟩ measures the mean distortion between the cluster centroids w_t = p(y|t) and the samples v_x = p(y|x).

13
Bregman divergences
  • A Bregman divergence is defined by a convex function F (in our case F(v) = Σ_i f(v_i)).
  • Common examples (sketched in code below):
  • L2 norm: f(x) = ½ x²
  • Itakura-Saito: f(x) = −log(x)
  • D_KL: f(x) = x log(x)
  • Unnormalized relative entropy: f(x) = x log x − x
  • Lemma (convexity of the Bregman ball): the set of points v such that D_BF(v || w) < R is convex.
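For concreteness, a small sketch of the separable Bregman divergence D_F(v || w) = F(v) − F(w) − ⟨∇F(w), v − w⟩ for the generators listed above (illustrative code; the generator/gradient pairs are standard, but the helper names are made up):

```python
import numpy as np

def bregman(v, w, f, f_grad):
    """Separable Bregman divergence D_F(v||w) with F(v) = sum_i f(v_i)."""
    return np.sum(f(v) - f(w) - f_grad(w) * (v - w))

# Generators from the slide and their derivatives
l2        = (lambda x: 0.5 * x**2,        lambda x: x)              # squared L2
itakura   = (lambda x: -np.log(x),        lambda x: -1.0 / x)       # Itakura-Saito
kl        = (lambda x: x * np.log(x),     lambda x: np.log(x) + 1)  # D_KL on the simplex
unnorm_re = (lambda x: x * np.log(x) - x, lambda x: np.log(x))      # unnormalized rel. entropy

v, w = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(bregman(v, w, *l2), bregman(v, w, *kl))
```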

14
Relation to Information Bottleneck
  • The extended IB problem
  • min ß ltD BF (wTvx)gt p(T,x) I(TX)
  • p(tx),w
  • The one class problem
  • min ß ltDBF,K(wCvx)gtp(c,x) I(CX)
    p(cx),w

15
Properties of the solution
  • One-class solutions obey three fixed-point equations.
  • When β → ∞, the best assignment for x is the one that minimizes its contribution to the cost (a small sketch follows below).
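Under the cost defined earlier, the hard (β → ∞) assignment keeps a sample in the ball exactly when its divergence from the centroid is below the flat cost K; this is an inference from the stated cost function, not a transcription of the slide's fixed-point equations:

```python
import numpy as np

def hard_assignment(d_bf, K):
    """beta -> infinity: keep sample x in the ball iff D_BF(w || v_x) < K."""
    return np.asarray(d_bf) < K   # boolean membership per sample
```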

16
The effect of K
  • K controls the nature of the solution.
  • It is the cost of leaving a point out of the ball.
  • Large K ⇒ large radius and many points in the set.
  • For the L2 norm, K is formally related to the prior of a single Gaussian fit to the subset.
  • A full description of the data may require solving for the complete spectrum of K values.

17
Algorithm: One-Class IB
  • Adapting the sequential-IB algorithm.
  • One-Class IB
  • Input: a set of m points v_x, a divergence D_BF, a cost K
  • Output: centroid w, assignments p(c|x)
  • Optimization method:
  • Iterate sample by sample, trying to modify the status of a single sample.
  • One-step look-ahead: re-fit the model and decide whether to change the assignment of the sample.
  • This uses a simple formula thanks to the nice properties of Bregman divergences.
  • The search is over the dual space of samples, rather than over the parameters w (a rough sketch follows below).
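A rough sketch of this sequential, look-ahead optimization for the squared-L2 divergence in the hard-assignment (β → ∞) regime; it illustrates the strategy only and is not the authors' implementation (names and defaults are made up):

```python
import numpy as np

def set_cost(V, w, member, K):
    """In-set samples pay 0.5*||v - w||^2 (L2 Bregman); all others pay the flat K."""
    d = 0.5 * np.sum((V - w) ** 2, axis=1)
    return np.sum(np.where(member, d, K))

def one_class_ib_l2(V, K, n_sweeps=20, seed=0):
    """Sequential one-class sketch: flip one sample at a time with look-ahead."""
    rng = np.random.default_rng(seed)
    m = V.shape[0]
    member = np.zeros(m, dtype=bool)
    member[rng.integers(m)] = True            # seed the set with a random sample

    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(m):
            trial = member.copy()
            trial[i] = ~trial[i]               # look ahead: flip sample i ...
            if not trial.any():
                continue
            w_trial = V[trial].mean(axis=0)    # ... and re-fit the centroid
            w_cur = V[member].mean(axis=0)
            if set_cost(V, w_trial, trial, K) < set_cost(V, w_cur, member, K):
                member = trial
                changed = True
        if not changed:                        # stop when a full sweep changes nothing
            break
    return V[member].mean(axis=0), member
```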

18
Experiments 1: information retrieval
  • Five most frequent categories of Reuters-21578.
  • Each document is represented as a multinomial distribution over 2000 terms (see the small sketch below).
  • The experimental setup, for each category:
  • train with half of the positive documents,
  • test with all remaining documents.
  • Compared one-class IB with One-Class Convex, which uses a convex loss function (Crammer & Singer, 2003) controlled by a single parameter µ that determines the weight of the class.
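A small sketch of the document representation described above (toy term counts; the real setup uses a 2000-term vocabulary):

```python
import numpy as np

def to_multinomial(term_counts):
    """Normalize raw term counts into a distribution over the vocabulary."""
    counts = np.asarray(term_counts, dtype=float)
    return counts / counts.sum()

v_x = to_multinomial([3, 0, 1, 2])   # toy 4-term vocabulary
print(v_x)                           # [0.5, 0.0, 0.167, 0.333]
```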

19
Experiments 1: information retrieval
  • Compare precision-recall performance for a range of K/µ values.

(Figure: precision vs. recall curves)
20
Experiments 1: information retrieval
  • Centroids of the clusters, and their distances from the center of mass.

21
Experiments 2: gene expression
  • A typical application: searching for small but interesting sets of genes.

Genes are represented by their expression profiles across tissues from different patients (Alizadeh et al., 2000; B-cell lymphoma tissues). The dataset includes mortality data, which can be used as an objective way to validate the quality of the selected genes.
22
Experiments 2: gene expression
  • One-class IB compared with one-class SVM (L2).
  • For a series of K values, the gene set with the lowest loss was found (10 restarts).
  • The selected genes were used for regression against the mortality data.

(Figure: significance of the regression prediction, p-values, ranging from good to bad)
23
Future work: finding ALL relevant subsets
  • Complete characterization of all interesting subsets in the data.
  • Assume we have a function that assigns an interest value to each subset. We search the space of subsets for all its local maxima.
  • This requires defining locality. A natural measure of locality in subset space is the Hamming distance (see the small sketch below).
  • A complete characterization of the data requires a description using a range of local neighborhoods.
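A tiny sketch of this notion of locality, with subsets encoded as boolean indicator vectors (illustrative helpers):

```python
import numpy as np

def hamming(s, t):
    """Hamming distance between two subsets given as boolean indicator vectors."""
    return int(np.sum(s != t))

def hamming_neighbors(s):
    """All subsets at Hamming distance 1: flip one sample in or out of the set."""
    for i in range(len(s)):
        t = s.copy()
        t[i] = ~t[i]
        yield t
```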

24
Future work: multiple one-class
  • Synthetic example: two overlapping Gaussians and uniform background noise.

25
Conclusions
  • We focus on one-class learning in cases where a small ball is sought.
  • We formalize the problem using the Information Bottleneck and derive its formal solutions.
  • One-class IB performs well in the regime of small subsets.