Title: Efficient Learning in High Dimensions with Trees and Mixtures
1. Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila, Carnegie Mellon University
2. Multidimensional data
- Multidimensional (noisy) data
- Learning tasks - intelligent data analysis
- categorization (clustering)
- classification
- novelty detection
- probabilistic reasoning
- Data is changing, growing
- Tasks change
- need to make learning automatic, efficient
3. Combining probability and algorithms
- Automatic: probability and statistics
- Efficient: algorithms
- This talk
- the tree statistical model
4. Talk overview
- Perspective: generative models and decision tasks
- Introduction: statistical models
- The tree model
- Mixtures of trees
- Learning: accelerated learning, Bayesian learning
- Experiments
5. A multivariate domain
- Data
- Patient1
- Patient2
- ...
- Queries
- Diagnose new patient
- Is smoking related to lung cancer?
- Understand the laws of the domain
[Figure: graphical models over the variables Smoker, Bronchitis, Lung cancer, Cough, X ray; for a new patient the value of Lung cancer is unknown]
6. Probabilistic approach
- Smoker, Bronchitis, ... are (discrete) random variables
- Statistical model (joint distribution)
- P( Smoker, Bronchitis, Lung cancer, Cough, X ray ) summarizes knowledge about the domain
- Queries
- inference
- e.g. P( Lung cancer = true | Smoker = true, Cough = false )
- structure of the model
- discovering relationships
- categorization
7. Probability table representation
- Query
- P( v1 = 0 | v2 = 1 ) = 0.23
- Curse of dimensionality
- if v1, v2, ..., vn are binary variables, P(V1, V2, ..., Vn) is a table with 2^n entries!
- How to represent?
- How to query?
- How to learn from data?
- Structure?
8. Graphical models
- Structure
- vertices = variables
- edges = direct dependencies
- Parametrization
- by local probability tables
[Figure: graphical model for an astronomy domain: Galaxy type, size, spectrum, Z (red-shift), distance, dust, observed size, observed spectrum, photometric measurement]
- compact parametric representation
- efficient computation
- learning parameters by a simple formula
- learning structure is NP-hard
9. The tree statistical model
- Structure
- tree (graph with no cycles)
- Parameters
- probability tables T_uv
- T_v = marginal distribution of v
- T_uv = marginal distribution of (u, v)
T(x) = \frac{T_{13}(x_1,x_3)\, T_{23}(x_2,x_3)\, T_{34}(x_3,x_4)\, T_{45}(x_4,x_5)}{T_3(x_3)^2\, T_4(x_4)}
10. The tree statistical model
- Structure
- tree (graph with no cycles)
- Parameters
- probability tables associated to edges
- T(x) factors over tree edges
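To make the factorization concrete, here is a minimal Python sketch (not part of the original slides) that evaluates T(x) from the edge and vertex marginals; the container names `edges`, `edge_marginals`, and `vertex_marginals` are hypothetical choices for this example.

```python
def tree_likelihood(x, edges, edge_marginals, vertex_marginals):
    """Evaluate T(x) = prod_{uv in E} T_uv(x_u, x_v) / prod_v T_v(x_v)^(deg(v)-1).

    x                : dict {vertex: observed value}
    edges            : list of (u, v) pairs forming a tree
    edge_marginals   : dict {(u, v): 2-D array T_uv}
    vertex_marginals : dict {v: 1-D array T_v}
    """
    lik = 1.0
    degree = {v: 0 for v in vertex_marginals}
    # Numerator: product of the pairwise marginals over the tree edges
    for (u, v) in edges:
        lik *= edge_marginals[(u, v)][x[u], x[v]]
        degree[u] += 1
        degree[v] += 1
    # Denominator: each vertex marginal raised to (degree - 1)
    for v, d in degree.items():
        if d > 1:
            lik /= vertex_marginals[v][x[v]] ** (d - 1)
    return lik
```

Evaluating the product takes one pass over the n-1 edges, consistent with the O(n) likelihood cost quoted on the next slides.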
11. Examples
- Splice junction domain
- Premature babies: Broncho-Pulmonary Disease (BPD)
[Figures: learned tree over the splice junction variables (junction type, sequence positions -7 ... 8) and learned tree over the BPD variables (BPD, Gestation, Weight, Temperature, Acidosis, Hypertension, HyperNa, Thrombocyt, Coag, PulmHemorrh, Neutropenia, Lipid, Suspect)]
12. Trees - basic operations
- |V| = n
- Querying the model
- computing likelihood T(x): O(n)
- conditioning T_{V-A|A} (junction tree algorithm): O(n)
- marginalization T_uv for arbitrary u, v: O(n)
- sampling: O(n)
- Estimating the model
- fitting to a given distribution: O(n^2)
- learning from data: O(n^2 N_data)
- the tree is a simple model
13. The mixture of trees
(Meila 97)
Q(x) = \sum_{k=1}^{m} \lambda_k T^k(x)
- h = hidden variable
- P( h = k ) = \lambda_k, k = 1, 2, ..., m
- NOT a graphical model
- computational efficiency preserved
14. Learning - problem formulation
- Maximum Likelihood learning
- given a data set D = { x^1, ..., x^N }
- find the model that best predicts the data
- T^{opt} = argmax_T T(D)
- Fitting a tree to a distribution
- given a data set D = { x^1, ..., x^N }
- and a distribution P that weights each data point,
- find
- T^{opt} = argmin_T KL( P || T )
- KL is the Kullback-Leibler divergence
- includes Maximum Likelihood learning as a special case
15. Fitting a tree to a distribution
(Chow & Liu 68)
- T^{opt} = argmin_T KL( P || T )
- optimization over structure + parameters
- sufficient statistics
- probability tables P_{uv} = N_{uv}/N, u, v \in V
- mutual informations I_{uv}
- I_{uv} = \sum_{x_u, x_v} P_{uv}(x_u, x_v) \log \frac{P_{uv}(x_u, x_v)}{P_u(x_u) P_v(x_v)}
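A minimal sketch of how these sufficient statistics could be computed from a dense, integer-coded data matrix; the function name `pairwise_mutual_information` and its arguments are mine, not from the original work. Its O(n^2 N) cost matches the "learning from data" entry on slide 12.

```python
import numpy as np

def pairwise_mutual_information(data, n_values):
    """Compute P_uv = N_uv / N and the mutual informations I_uv for all pairs.

    data     : (N, n) integer array, one row per data point
    n_values : number of values each variable can take
    Returns an (n, n) symmetric array I with I[u, v] = I_uv.
    """
    N, n = data.shape
    I = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            # Joint counts N_uv, then the marginals P_u, P_v
            P_uv = np.zeros((n_values, n_values))
            for x in data:
                P_uv[x[u], x[v]] += 1
            P_uv /= N
            P_u = P_uv.sum(axis=1)
            P_v = P_uv.sum(axis=0)
            # I_uv = sum P_uv log( P_uv / (P_u P_v) ), over the nonzero cells
            nz = P_uv > 0
            ratio = P_uv[nz] / np.outer(P_u, P_v)[nz]
            I[u, v] = I[v, u] = np.sum(P_uv[nz] * np.log(ratio))
    return I
```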
16. Fitting a tree to a distribution - solution
- Structure
- E^{opt} = argmax_E \sum_{uv \in E} I_{uv}
- found by the Maximum Weight Spanning Tree algorithm with edge weights I_{uv}
- Parameters
- copy marginals of P: T_{uv} = P_{uv} for uv \in E
17. Finding the optimal tree structure
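A sketch of the construction named on this slide: a maximum weight spanning tree over the mutual informations, here via Kruskal's algorithm with a small union-find. The code and names are illustrative, not the author's implementation; it can consume the matrix I returned by the earlier sketch.

```python
def chow_liu_tree(I):
    """Return the edge set E maximizing sum_{uv in E} I[u, v]
    (maximum weight spanning tree over mutual-information weights).

    I : (n, n) symmetric array of mutual informations.
    """
    n = I.shape[0]
    # All candidate edges, sorted by decreasing mutual information
    edges = sorted(((I[u, v], u, v) for u in range(n) for v in range(u + 1, n)),
                   reverse=True)
    parent = list(range(n))          # union-find forest for cycle detection

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    tree = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                 # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == n - 1:
                break
    return tree
```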
18. Learning mixtures by the EM algorithm
(Meila & Jordan 97)
- E step: which x^i come from T^k? -> distribution P^k(x)
- M step: fit T^k to its set of points: min KL( P^k || T^k )
- Initialize randomly
- converges to a local maximum of the likelihood
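A schematic Python rendering of one EM iteration as described above, assuming helper callbacks: `tree_density` evaluates T^k(x), and `fit_tree_weighted` performs the weighted Chow-Liu fit of the M step (min KL(P^k || T^k)). Both callback names are hypothetical.

```python
import numpy as np

def em_step(data, lambdas, trees, tree_density, fit_tree_weighted):
    """One EM iteration for a mixture of trees Q(x) = sum_k lambda_k T^k(x).

    data              : (N, n) array of data points
    lambdas           : (m,) mixture weights
    trees             : list of m tree models (any representation)
    tree_density      : function (tree, x) -> T^k(x)
    fit_tree_weighted : function (data, weights) -> tree minimizing KL(P^k || T^k)
    """
    N, m = data.shape[0], len(trees)

    # E step: posterior responsibility of tree k for data point i
    gamma = np.array([[lambdas[k] * tree_density(trees[k], x) for k in range(m)]
                      for x in data])
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M step: new mixture weights, then one weighted Chow-Liu fit per component,
    # using the responsibilities as the data-point weights P^k
    new_lambdas = gamma.mean(axis=0)
    new_trees = [fit_tree_weighted(data, gamma[:, k]) for k in range(m)]
    return new_lambdas, new_trees
```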
19. Remarks
- Learning a tree
- solution is globally optimal over structures and parameters
- tractable running time: O(n^2 N)
- Learning a mixture by the EM algorithm
- both E and M steps are exact, tractable
- running time
- E step: O(mnN)
- M step: O(mn^2 N)
- assumes m known
- converges to a local optimum
20. Finding structure - the bars problem
[Figures: data (n = 25) and learned structure]
- Structure recovery: 19 out of 20 trials
- Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
- Data likelihood (bits/data point): true model 8.58, learned model 9.82 +/- 0.95
21. The bars problem
- True structure
- Approximate tree structure
22. Experiments - density estimation
- Digits and digit pairs
- N_train = 6000, N_valid = 2000, N_test = 5000
- n = 64 variables (m = 16 trees); n = 128 variables (m = 32 trees)
[Results figures: mixtures of trees]
23. Classification with trees
- Class variable c
- Predictor variables V
- learn a tree over V \cup {c} from a labeled data set
- class of a new example x_V is c = argmax_j T(x_V, j)
24. Classification with mixtures of trees
- learn a mixture of trees over V \cup {c}
- class of a new example x_V is c = argmax_j Q(x_V, j)
- Tree augmented naive Bayes (TANB) (Friedman et al. 96)
- constructs a tree for each class
- c = argmax_k T^k(x_V)
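The decision rules above all reduce to evaluating a joint model at each candidate class value and taking the argmax; a minimal hedged sketch, with names of my choosing:

```python
import numpy as np

def classify(x_V, class_values, joint_density):
    """Decision rule c = argmax_j Q(x_V, j): plug each candidate class value j
    into the joint model together with the observed predictors x_V.

    x_V           : dict {variable: value} for the predictor variables V
    class_values  : list of possible values j of the class variable c
    joint_density : function (x_V, j) -> Q(x_V, j), e.g. a tree or mixture of trees
    """
    scores = [joint_density(x_V, j) for j in class_values]
    return class_values[int(np.argmax(scores))]
```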
25. DNA splice junction classification
- n = 61 variables
- class: Intron/Exon, Exon/Intron, Neither
[Results figures: N_train = 2000 and N_train = 100]
26. DNA splice junction classification
- n = 61 variables
- class: Intron/Exon, Exon/Intron, Neither
27. Discovering structure
[Figure: tree adjacency matrix; the class variable is marked]
28. Feature selection
[Figure: tree in which the class variable c neighbors only some of the variables]
- Only the neighbors of c are relevant for classification
- Implicit feature selection mechanism
- selection based on mutual information with c
- avoids double counting of features
29. Irrelevant variables
- 61 original variables + 60 noise variables
[Results figures: original data vs. data augmented with irrelevant variables]
30. Accelerated tree learning
(Meila 99)
- Running time of the tree learning algorithm: O(n^2 N)
- Quadratic running time may be too slow
- Example: document classification
- document = data point --> N ~ 10^3 - 10^4
- word = variable --> n ~ 10^3 - 10^4
- sparse data --> # words in a document <= s and s << n, N
- Can sparsity be exploited to create faster algorithms?
31. Sparsity
- Assume a special value 0 that occurs frequently
- sparsity s: #{ v : x_v != 0 } <= s for all x in D
- s << n, N
- Additional assumption about the data
- allows a faster learning algorithm
- is met in practice
- document representation as a vector of words
- diagnostics problems
- Define
- v occurs in x: x_v != 0
- u, v co-occur in x: x_u != 0, x_v != 0
32. Sparsity
- assume a special value 0 that occurs frequently
- sparsity s: # non-zero variables in each data point <= s
- s << n, N
- Idea: do not represent / count zeros
[Figure: sparse data stored as linked lists of length <= s]
33. How can sparsity help?
- Idea: do not represent / count zeros
- Assumptions (can be relaxed)
- binary data
- learning 1 tree
- Sufficient statistics
- N_v = # occurrences of v in D
- N_uv = # co-occurrences of u, v in D
- { N_v, N_uv : u, v in V } are sufficient to reconstruct P_uv, I_uv
34. First idea: sparse data representation (see the sketch below)
- data point x = list( variables that occur in x )
- storage: O(sN)
- computing all N_v: O(sN)
- computing all N_uv: O(s^2 N)
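A small sketch of this sparse representation and of counting N_v and N_uv directly from it, with the stated O(sN) and O(s^2 N) costs; the function name and data layout are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def sparse_counts(sparse_data):
    """Occurrence and co-occurrence counts from a sparse data set.

    sparse_data : list of data points, each a list (without repeats) of the
                  variables v with x_v != 0
    Returns (N_v, N_uv): N_v[v] is the number of occurrences of v,
    N_uv[(u, v)] the number of co-occurrences of the pair u < v.
    Cost: O(sN) for N_v and O(s^2 N) for N_uv, where s bounds the list lengths.
    """
    N_v, N_uv = Counter(), Counter()
    for point in sparse_data:
        N_v.update(point)                         # each occurring variable
        for u, v in combinations(sorted(point), 2):
            N_uv[(u, v)] += 1                     # each co-occurring pair
    return N_v, N_uv
```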
35. Presort mutual informations
- Theorem (Meila 99): If v, v' are variables that do not co-occur with u (i.e. N_uv = N_uv' = 0), then N_v > N_v' ==> I_uv > I_uv'
- Consequences
- sort the N_v ==> all edges uv with N_uv = 0 are implicitly sorted by I_uv
- these edges need not be represented explicitly
- construct a "black box" that outputs the next largest edge
36The black box data structure
v1
Nv
v2
list of u , Nuv gt 0, sorted by Iuv
v
F-heap of size n
list of u, Nuv 0, sorted by Nv (virtual)
vn
next edge uv
Total running time n log n s2N nK
log n (standard alg. running time n2N )
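A possible Python sketch of such a black box as a lazy generator: for each vertex, the explicit edges (N_uv > 0, presorted by I_uv) are merged with the virtual zero-co-occurrence edges, which by the theorem on the previous slide are already ordered by N_v; here `heapq.merge` stands in for the F-heap. For binary data a zero-co-occurrence pair's I_uv depends only on N_u, N_v and N, so the `mutual_info` callback is cheap. All names are mine; this is not the author's data structure.

```python
import heapq

def black_box_edges(Nv, cooc, mutual_info):
    """Lazily yield edges (I_uv, u, v) in decreasing order of I_uv, without
    enumerating the O(n^2) pairs with N_uv = 0 up front.

    Nv          : dict v -> N_v (variables are integers)
    cooc        : dict u -> list of (I_uv, v) with N_uv > 0, sorted by
                  decreasing I_uv; each such edge listed under both endpoints
    mutual_info : function (u, v) -> I_uv, used only for pairs with N_uv = 0
    """
    # Vertices by decreasing N_v: for a fixed u, the zero-co-occurrence edges
    # taken in this order are already sorted by decreasing I_uv (theorem above).
    order = sorted(Nv, key=Nv.get, reverse=True)

    def stream(u):
        seen = {v for _, v in cooc.get(u, [])}
        explicit = ((-I, u, v) for I, v in cooc.get(u, []))
        virtual = ((-mutual_info(u, v), u, v) for v in order
                   if v != u and v not in seen)
        # Both streams are non-increasing in I_uv, so merging preserves order.
        yield from heapq.merge(explicit, virtual)

    emitted = set()
    for neg_I, u, v in heapq.merge(*(stream(u) for u in Nv)):
        key = (u, v) if u <= v else (v, u)
        if key not in emitted:            # each edge is produced by both endpoints
            emitted.add(key)
            yield -neg_I, u, v
```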
37. Accelerated algorithm outline
- Compute and sort N_v: O(sN + n log n)
- Compute N_uv > 0 and I_uv: O(s^2 N)
- Construct the black box
- Construct the tree (Kruskal's algorithm, revisited)
- repeat
- extract the edge uv with largest I_uv from the black box
- check if it forms a cycle
- if not, add it to the tree
- until n-1 edges are added
- Compute the tree parameters from N_v, N_uv
- Total running time: O(n log n + s^2 N + nK log n)
38. Experiments - sparse binary data
- N = 10,000
- s = 5, 10, 15, 100
39. Experiments - sparse binary data (continued)
[Figure: n_Kruskal, the number of edges examined by Kruskal's algorithm]
40. Generalizations
- Multi-valued sparse data
- x = list( (v, j) : x_v = j != 0 )
- N_uv = contingency table
- Non-integer counts N_v, N_uv
- ==> can learn mixtures of trees by the EM algorithm
- Prior information
- incompatible with general priors
- priors on the tree structure
- constant penalty for adding an edge
- priors on the tree parameters
- non-informative Dirichlet priors
41. Remarks
- Realistic assumption
- Exact algorithm, provably efficient time bounds
- Degrades slowly to the standard algorithm if the data is not sparse
- General
- non-integer counts
- multi-valued discrete variables
42. Bayesian learning of trees
(Meila & Jaakkola 00)
- Problem
- given a prior distribution over trees P_0(T)
- and data D = { x^1, ..., x^N }
- find the posterior distribution P(T|D)
- Advantages
- incorporates prior knowledge
- regularization
- Solution
- Bayes formula: P(T|D) \propto P_0(T) \prod_{i=1}^{N} T(x^i)
- practically hard
- distribution over structure E and parameters \theta_E
- hard to represent
- computing the normalization constant Z is intractable in general
- exception: conjugate priors
43Decomposable priors
T P f( u, v, quv) uv E
- want priors that factor over tree edges
- prior for structure E
- P0(E) a P buv
-
uv E - prior for tree parameters
- P0(qE) P D( quv Nuv )
-
uv E - (hyper) Dirichlet with hyper-parameters
Nuv(xuxv), u,v V - posterior is also Dirichlet with hyper-parameters
- Nuv(xuxv) Nuv(xuxv), u,v V
-
44. Decomposable posterior
- Posterior distribution
- P(T|D) \propto \prod_{uv \in E} W_{uv}
- factored over edges
- same form as the prior
- W_{uv} = \beta_{uv} D( \theta_{uv} ; N'_{uv} + N_{uv} )
- Remains to compute the normalization constant
45. The Matrix Tree theorem
(discrete graph theory -> continuous) (Meila & Jaakkola 99)
- Matrix Tree theorem
- If
- P_0(E) \propto \prod_{uv \in E} \beta_{uv}, with \beta_{uv} \geq 0
- then
- Z = \det M( \beta ), with M( \beta ) as given below
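For reference, a LaTeX rendering of the standard weighted form of the Matrix Tree theorem, which I believe is the M(β) intended on the slide (the matrix itself did not survive extraction): M(β) is the graph Laplacian of the edge weights with the row and column of one vertex removed.

```latex
% Weighted Matrix Tree theorem: the normalization constant over all spanning
% trees equals the determinant of a reduced Laplacian built from the weights.
\[
  Z \;=\; \sum_{E\ \mathrm{spanning\ tree}}\ \prod_{uv \in E} \beta_{uv}
    \;=\; \det M(\beta),
  \qquad
  M(\beta)_{uv} \;=\;
  \begin{cases}
    \displaystyle\sum_{w \neq u} \beta_{uw}, & u = v,\\[4pt]
    -\,\beta_{uv}, & u \neq v,
  \end{cases}
  \qquad u, v \in \{2, \dots, n\}.
\]
```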
47. Remarks on the decomposable prior
- Is a conjugate prior for the tree distribution
- Is tractable
- defined by n^2 parameters
- computed exactly in O(n^3) operations
- posterior obtained in O(n^2 N + n^3) operations
- derivatives w.r.t. parameters, averaging, ...: O(n^3)
- Mixtures of trees with decomposable priors
- MAP estimation with the EM algorithm is tractable
- Other applications
- ensembles of trees
- maximum entropy distributions on trees
48. So far . . .
- Trees and mixtures of trees are structured statistical models
- Algorithmic techniques enable efficient learning
- mixtures of trees
- accelerated algorithm
- matrix tree theorem + Bayesian learning
- Examples of usage
- Structure learning
- Compression
- Classification
49. Generative models and discrimination
- Trees are generative models
- descriptive
- can perform many tasks, suboptimally
- Maximum Entropy discrimination (Jaakkola, Meila, Jebara 99)
- optimize for specific tasks
- use generative models
- combine simple models into ensembles
- complexity control by an information theoretic principle
- Discrimination tasks
- detecting novelty
- diagnosis
- classification
50. Bridging the gap
Tasks
Descriptive learning
Discriminative learning
51. Future . . .
- Tasks have structure
- multi-way classification
- multiple indexing of documents
- gene expression data
- hierarchical, sequential decisions
- Learn structured decision tasks
- sharing information between tasks (transfer)
- modeling dependencies between decisions
52. Combine and conquer
statistics + computer science