Local one class optimization - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Local one class optimization
  • Gal Chechik, Stanford
  • joint work with Koby Crammer, Hebrew University of Jerusalem

2
The one-class problem
  • Find a subset of similar/typical samples.
  • Formally: find a ball of a given radius (under some metric) that covers as many data points as possible (related to the set-covering problem); a minimal coverage sketch follows below.

3
Motivation I
  • Unsupervised setting: sometimes we wish to model small parts of the data and ignore the rest. This happens when many data points are irrelevant.
  • Examples:
  • Finding sets of co-expressed genes in a genome-wide experiment: identify the relevant genes among thousands of irrelevant ones.
  • Finding a set of documents on the same topic in a heterogeneous corpus.

4
Motivation II
  • Supervised setting: learning from positive samples only.
  • Examples:
  • Protein interactions
  • Intrusion detection applications
  • We care about a low false-positive rate.

5
Current approaches
  • The problem is often treated as outlier/novelty detection: most samples are relevant.
  • Current approaches use:
  • a convex cost function (Schölkopf 95, Tax and Duin 99, Ben-Hur et al 2001),
  • a parameter that affects the size or weight of the ball.
  • Bias towards the center of mass: when searching for a small ball, the center of the optimal ball lies at the global center of mass, w* = argmin_w Σ_x (x − w)², missing the interesting structures (see the numerical sketch below).
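A quick numerical illustration of this bias (a sketch under the assumption of a squared-L2 cost, not taken from the slides): the unconstrained minimizer of Σ_x (x − w)² is the sample mean, which lands between the clusters rather than on either of them.

```python
import numpy as np

# Two tight clusters plus uniform background noise, as in the synthetic example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([3, 3], 0.3, (50, 2)),
               rng.normal([-3, -3], 0.3, (50, 2)),
               rng.uniform(-6, 6, (200, 2))])

# argmin_w sum_x ||x - w||^2 is the sample mean: it sits near the global
# center of mass, between the two clusters, missing both structures.
w_star = X.mean(axis=0)
print(w_star)
```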

6
Current approaches
  • Example with synthetic data
  • 2 Gaussians + a uniform background

(Figure: convex one-class (OSU-SVM) vs. local one-class on the synthetic data)
7
How do we do it?
  • A cost function designed for small sets
  • A probabilistic approach that allows soft assignment to the set
  • Regularized optimization

8
1. A cost function for small sets
  • Handles the case where only a few samples are relevant.
  • Use a cost function that is flat for samples not in the set.
  • Two parameters:
  • a divergence measure D_BF
  • a flat cost K
  • Indifferent to the position of irrelevant samples.
  • Solutions converge to the center of mass when the ball is large.

9
2. A probabilistic formulation
  • We are given m samples in a d-dimensional space or simplex, indexed by x.
  • p(x) is the prior distribution over samples.
  • c ∈ {TRUE, FALSE} is a random variable that characterizes assignment to the interesting set (the ball).
  • p(c|x) reflects our belief that sample x is interesting.
  • The cost function is D = p(c|x) D_BF(w || v_x) + (1 − p(c|x)) K, where D_BF is a divergence measure, to be discussed later (a small sketch follows below).

10
3. Regularized optimization
  • The goal: minimize the mean cost + regularization
  • min_{p(c|x), w}  β ⟨D_BF,K(w_c, v_x)⟩_{p(c,x)} + I(C;X)
  • The first term measures the mean distortion:
  • ⟨D_BF,K(p(c|x), w, v_x)⟩ = Σ_x p(x) [ p(c|x) D_BF(w || v_x) + (1 − p(c|x)) K ]
  • The second term regularizes the compression of the data (it removes information about X):
  • I(C;X) = H(X) − H(X|C)
  • It pushes for putting many points in the set.
  • This target function is not convex (a sketch of the objective follows below).
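A sketch of the full objective under these definitions, assuming a uniform prior p(x) = 1/m and natural logarithms (illustrative code, not the authors' implementation):

```python
import numpy as np

def one_class_objective(p_c_given_x, d_bf, K, beta):
    """beta * <D_BF,K>_{p(c,x)} + I(C;X), assuming uniform p(x) and binary C."""
    m = len(p_c_given_x)
    p_x = np.full(m, 1.0 / m)

    # Mean distortion: sum_x p(x) [ p(c|x) D_BF(w||v_x) + (1 - p(c|x)) K ]
    distortion = np.sum(p_x * (p_c_given_x * d_bf + (1 - p_c_given_x) * K))

    # Mutual information I(C;X) for the binary assignment variable c
    eps = 1e-12
    p_c = np.sum(p_x * p_c_given_x)                      # marginal p(c = TRUE)
    p_in, p_out = p_c_given_x, 1 - p_c_given_x
    mi = np.sum(p_x * (p_in * np.log((p_in + eps) / (p_c + eps))
                       + p_out * np.log((p_out + eps) / (1 - p_c + eps))))

    return beta * distortion + mi
```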

11
To solve the problem
  • It turns out that for a family of divergence
    functions, called Bregman divergences, we can
    analytically describe properties of the optimal
    solution.
  • The proof follows the analysis of the Information Bottleneck method (Tishby, Pereira & Bialek, 1999).

12
Relation to Information Bottleneck
  • IB aims to compress one variable X into T while preserving information about Y, combining the two goals into a single trade-off optimization:
  • min I(T;X) − β I(T;Y)
  • A mathematically equivalent formulation:
  • min β ⟨D_KL(w_T || v_x)⟩ + I(T;X)
  • where ⟨D_KL⟩ measures the mean distortion between the cluster centroids w_t = p(y|t) and the samples v_x = p(y|x).

13
Bregman divergences
  • A Bregman divergence is defined by a convex function F (in our case F(v) = Σ_i f(v_i)).
  • Common examples (sketched in code below):
  • L2 norm: f(x) = ½ x²
  • Itakura-Saito: f(x) = −log(x)
  • D_KL: f(x) = x log(x)
  • Unnormalized relative entropy: f(x) = x log x − x
  • Lemma (convexity of the Bregman ball): the set of points v such that D_BF(v || w) < R is convex.
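For concreteness, a small sketch of the separable Bregman divergence D_F(v || w) = F(v) − F(w) − ⟨∇F(w), v − w⟩ for the generators listed above (illustrative code; the generator/gradient pairs are standard, but the helper names are made up):

```python
import numpy as np

def bregman(v, w, f, f_grad):
    """Separable Bregman divergence D_F(v||w) with F(v) = sum_i f(v_i)."""
    return np.sum(f(v) - f(w) - f_grad(w) * (v - w))

# Generators from the slide and their derivatives
l2        = (lambda x: 0.5 * x**2,        lambda x: x)              # squared L2
itakura   = (lambda x: -np.log(x),        lambda x: -1.0 / x)       # Itakura-Saito
kl        = (lambda x: x * np.log(x),     lambda x: np.log(x) + 1)  # D_KL on the simplex
unnorm_re = (lambda x: x * np.log(x) - x, lambda x: np.log(x))      # unnormalized rel. entropy

v, w = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(bregman(v, w, *l2), bregman(v, w, *kl))
```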

14
Relation to Information Bottleneck
  • The extended IB problem
  • min ß ltD BF (wTvx)gt p(T,x) I(TX)
  • p(tx),w
  • The one class problem
  • min ß ltDBF,K(wCvx)gtp(c,x) I(CX)
    p(cx),w

15
Properties of the solution
  • One-class solutions obey three fixed-point equations.
  • When β → ∞, the best assignment for x is the one that minimizes its contribution to the cost (a small sketch follows below).
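Under the cost defined earlier, the hard (β → ∞) assignment keeps a sample in the ball exactly when its divergence from the centroid is below the flat cost K; this is an inference from the stated cost function, not a transcription of the slide's fixed-point equations:

```python
import numpy as np

def hard_assignment(d_bf, K):
    """beta -> infinity: keep sample x in the ball iff D_BF(w || v_x) < K."""
    return np.asarray(d_bf) < K   # boolean membership per sample
```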

16
The effect of K
  • K controls the nature of the solution.
  • It is the cost of leaving a point out of the ball.
  • Large K ⇒ large radius and many points in the set.
  • For the L2 norm, K is formally related to the prior of a single Gaussian fit to the subset.
  • A full description of the data may require solving for the complete spectrum of K values.

17
Algorithm: One-Class IB
  • Adapting the sequential-IB algorithm.
  • One-Class IB
  • Input: a set of m points v_x, a divergence D_BF, a cost K
  • Output: centroid w, assignments p(c|x)
  • Optimization method:
  • Iterate sample by sample, trying to modify the status of a single sample.
  • One-step look-ahead: re-fit the model and decide whether to change the assignment of the sample.
  • This uses a simple formula thanks to the nice properties of Bregman divergences.
  • The search is over the dual space of samples, rather than over the parameters w (a rough sketch follows below).
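A rough sketch of this sequential, look-ahead optimization for the squared-L2 divergence in the hard-assignment (β → ∞) regime; it illustrates the strategy only and is not the authors' implementation (names and defaults are made up):

```python
import numpy as np

def set_cost(V, w, member, K):
    """In-set samples pay 0.5*||v - w||^2 (L2 Bregman); all others pay the flat K."""
    d = 0.5 * np.sum((V - w) ** 2, axis=1)
    return np.sum(np.where(member, d, K))

def one_class_ib_l2(V, K, n_sweeps=20, seed=0):
    """Sequential one-class sketch: flip one sample at a time with look-ahead."""
    rng = np.random.default_rng(seed)
    m = V.shape[0]
    member = np.zeros(m, dtype=bool)
    member[rng.integers(m)] = True            # seed the set with a random sample

    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(m):
            trial = member.copy()
            trial[i] = ~trial[i]               # look ahead: flip sample i ...
            if not trial.any():
                continue
            w_trial = V[trial].mean(axis=0)    # ... and re-fit the centroid
            w_cur = V[member].mean(axis=0)
            if set_cost(V, w_trial, trial, K) < set_cost(V, w_cur, member, K):
                member = trial
                changed = True
        if not changed:                        # stop when a full sweep changes nothing
            break
    return V[member].mean(axis=0), member
```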

18
Experiments 1: information retrieval
  • Five most frequent categories of Reuters-21578.
  • Each document is represented as a multinomial distribution over 2000 terms (see the small sketch below).
  • The experimental setup, for each category:
  • train with half of the positive documents,
  • test with all remaining documents.
  • Compared one-class IB with One-Class Convex, which uses a convex loss function (Crammer & Singer, 2003) controlled by a single parameter µ that determines the weight of the class.
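A small sketch of the document representation described above (toy term counts; the real setup uses a 2000-term vocabulary):

```python
import numpy as np

def to_multinomial(term_counts):
    """Normalize raw term counts into a distribution over the vocabulary."""
    counts = np.asarray(term_counts, dtype=float)
    return counts / counts.sum()

v_x = to_multinomial([3, 0, 1, 2])   # toy 4-term vocabulary
print(v_x)                           # [0.5, 0.0, 0.167, 0.333]
```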

19
Experiments 1: information retrieval
  • Compare precision-recall performance for a range of K/µ values.

(Figure: precision vs. recall curves)
20
Experiments 1: information retrieval
  • Centroids of the clusters, and their distances from the center of mass.

21
Experiments 2: gene expression
  • A typical application: searching for small but interesting sets of genes.

Genes are represented by their expression profiles across tissues from different patients (Alizadeh et al., 2000; B-cell lymphoma tissues). The dataset includes mortality data, which can be used as an objective way to validate the quality of the selected genes.
22
Experiments 2: gene expression
  • One-class IB compared with one-class SVM (L2).
  • For a series of K values, the gene set with the lowest loss was found (10 restarts).
  • The selected genes were used for regression against the mortality data.

(Figure: significance of the regression prediction, p-values, ranging from good to bad)
23
Future work: finding ALL relevant subsets
  • Complete characterization of all interesting subsets in the data.
  • Assume we have a function that assigns an interest value to each subset. We search the space of subsets for all its local maxima.
  • This requires defining locality. A natural measure of locality in subset space is the Hamming distance (see the small sketch below).
  • A complete characterization of the data requires a description using a range of local neighborhoods.
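A tiny sketch of this notion of locality, with subsets encoded as boolean indicator vectors (illustrative helpers):

```python
import numpy as np

def hamming(s, t):
    """Hamming distance between two subsets given as boolean indicator vectors."""
    return int(np.sum(s != t))

def hamming_neighbors(s):
    """All subsets at Hamming distance 1: flip one sample in or out of the set."""
    for i in range(len(s)):
        t = s.copy()
        t[i] = ~t[i]
        yield t
```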

24
Future work: multiple one-class
  • Synthetic example: two overlapping Gaussians and uniform background noise.

25
Conclusions
  • We focus on one-class learning in cases where a small ball is sought.
  • We formalize the problem using the Information Bottleneck and derive its formal solutions.
  • One-class IB performs well in the regime of small subsets.