Title: Efficient Learning in High Dimensions with Trees and Mixtures
1. Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila, Carnegie Mellon University
2. Multidimensional data
- Multidimensional (noisy) data
- Learning tasks - intelligent data analysis
- categorization (clustering)
- classification
- novelty detection
- probabilistic reasoning
- Data is changing, growing
- Tasks change
- need to make learning automatic, efficient
3. Combining probability and algorithms
- Automatic: probability and statistics
- Efficient: algorithms
- This talk
- the tree statistical model
4. Talk overview
- Perspective: generative models and decision tasks
- Introduction: statistical models
- The tree model
- Mixtures of trees
- Learning: accelerated learning, Bayesian learning
- Experiments
5. A multivariate domain
- Data
- Patient1
- Patient2
- ...
- Queries
- Diagnose new patient
- Is smoking related to lung cancer?
- Understand the laws of the domain
[Figure: graphical models over the variables Smoker, Bronchitis, Lung cancer, Cough, X ray; for a new patient the value of Lung cancer is unknown]
6. Probabilistic approach
- Smoker, Bronchitis, ... are (discrete) random variables
- Statistical model (joint distribution)
- P( Smoker, Bronchitis, Lung cancer, Cough, X ray ) summarizes knowledge about the domain
- Queries
- inference
- e.g. P( Lung cancer = true | Smoker = true, Cough = false )
- structure of the model
- discovering relationships
- categorization
7. Probability table representation
- Query
- P( v1 = 0 | v2 = 1 ) = 0.23
- Curse of dimensionality
- if v1, v2, ..., vn are binary variables, P(V1, V2, ..., Vn) is a table with 2^n entries!
- How to represent?
- How to query?
- How to learn from data?
- Structure?
8. Graphical models
- Structure
- vertices = variables
- edges = direct dependencies
- Parametrization
- by local probability tables
[Figure: graphical model for an astronomy domain: Galaxy type, size, spectrum, Z (red-shift), distance, dust, observed size, observed spectrum, photometric measurement]
- compact parametric representation
- efficient computation
- learning parameters by a simple formula
- learning structure is NP-hard
9. The tree statistical model
- Structure
- tree (graph with no cycles)
- Parameters
- probability tables T_uv
- T_v = marginal distribution of v
- T_uv = marginal distribution of (u, v)
T(x) = \frac{T_{13}(x_1,x_3)\, T_{23}(x_2,x_3)\, T_{34}(x_3,x_4)\, T_{45}(x_4,x_5)}{T_3(x_3)^2\, T_4(x_4)}
10. The tree statistical model
- Structure
- tree (graph with no cycles)
- Parameters
- probability tables associated to edges
- T(x) factors over tree edges
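To make the factorization concrete, here is a minimal Python sketch (not part of the original slides) that evaluates T(x) from the edge and vertex marginals; the container names `edges`, `edge_marginals`, and `vertex_marginals` are hypothetical choices for this example.

```python
def tree_likelihood(x, edges, edge_marginals, vertex_marginals):
    """Evaluate T(x) = prod_{uv in E} T_uv(x_u, x_v) / prod_v T_v(x_v)^(deg(v)-1).

    x                : dict {vertex: observed value}
    edges            : list of (u, v) pairs forming a tree
    edge_marginals   : dict {(u, v): 2-D array T_uv}
    vertex_marginals : dict {v: 1-D array T_v}
    """
    lik = 1.0
    degree = {v: 0 for v in vertex_marginals}
    # Numerator: product of the pairwise marginals over the tree edges
    for (u, v) in edges:
        lik *= edge_marginals[(u, v)][x[u], x[v]]
        degree[u] += 1
        degree[v] += 1
    # Denominator: each vertex marginal raised to (degree - 1)
    for v, d in degree.items():
        if d > 1:
            lik /= vertex_marginals[v][x[v]] ** (d - 1)
    return lik
```

Evaluating the product takes one pass over the n-1 edges, consistent with the O(n) likelihood cost quoted on the next slides.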
11. Examples
- Splice junction domain
- Premature babies: Broncho-Pulmonary Disease (BPD)
[Figures: learned tree over the splice junction variables (junction type, sequence positions -7 ... 8) and learned tree over the BPD variables (BPD, Gestation, Weight, Temperature, Acidosis, Hypertension, HyperNa, Thrombocyt, Coag, PulmHemorrh, Neutropenia, Lipid, Suspect)]
12. Trees - basic operations
- |V| = n
- Querying the model
- computing likelihood T(x): O(n)
- conditioning T_{V-A|A} (junction tree algorithm): O(n)
- marginalization T_uv for arbitrary u, v: O(n)
- sampling: O(n)
- Estimating the model
- fitting to a given distribution: O(n^2)
- learning from data: O(n^2 N_data)
- the tree is a simple model
13. The mixture of trees
(Meila 97)
Q(x) = \sum_{k=1}^{m} \lambda_k T^k(x)
- h = hidden variable
- P( h = k ) = \lambda_k, k = 1, 2, ..., m
- NOT a graphical model
- computational efficiency preserved
14. Learning - problem formulation
- Maximum Likelihood learning
- given a data set D = { x^1, ..., x^N }
- find the model that best predicts the data
- T^{opt} = argmax_T T(D)
- Fitting a tree to a distribution
- given a data set D = { x^1, ..., x^N }
- and a distribution P that weights each data point,
- find
- T^{opt} = argmin_T KL( P || T )
- KL is the Kullback-Leibler divergence
- includes Maximum Likelihood learning as a special case
15. Fitting a tree to a distribution
(Chow & Liu 68)
- T^{opt} = argmin_T KL( P || T )
- optimization over structure + parameters
- sufficient statistics
- probability tables P_{uv} = N_{uv}/N, u, v \in V
- mutual informations I_{uv}
- I_{uv} = \sum_{x_u, x_v} P_{uv}(x_u, x_v) \log \frac{P_{uv}(x_u, x_v)}{P_u(x_u) P_v(x_v)}
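A minimal sketch of how these sufficient statistics could be computed from a dense, integer-coded data matrix; the function name `pairwise_mutual_information` and its arguments are mine, not from the original work. Its O(n^2 N) cost matches the "learning from data" entry on slide 12.

```python
import numpy as np

def pairwise_mutual_information(data, n_values):
    """Compute P_uv = N_uv / N and the mutual informations I_uv for all pairs.

    data     : (N, n) integer array, one row per data point
    n_values : number of values each variable can take
    Returns an (n, n) symmetric array I with I[u, v] = I_uv.
    """
    N, n = data.shape
    I = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            # Joint counts N_uv, then the marginals P_u, P_v
            P_uv = np.zeros((n_values, n_values))
            for x in data:
                P_uv[x[u], x[v]] += 1
            P_uv /= N
            P_u = P_uv.sum(axis=1)
            P_v = P_uv.sum(axis=0)
            # I_uv = sum P_uv log( P_uv / (P_u P_v) ), over the nonzero cells
            nz = P_uv > 0
            ratio = P_uv[nz] / np.outer(P_u, P_v)[nz]
            I[u, v] = I[v, u] = np.sum(P_uv[nz] * np.log(ratio))
    return I
```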
16. Fitting a tree to a distribution - solution
- Structure
- E^{opt} = argmax_E \sum_{uv \in E} I_{uv}
- found by the Maximum Weight Spanning Tree algorithm with edge weights I_{uv}
- Parameters
- copy marginals of P: T_{uv} = P_{uv} for uv \in E
17. Finding the optimal tree structure
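A sketch of the construction named on this slide: a maximum weight spanning tree over the mutual informations, here via Kruskal's algorithm with a small union-find. The code and names are illustrative, not the author's implementation; it can consume the matrix I returned by the earlier sketch.

```python
def chow_liu_tree(I):
    """Return the edge set E maximizing sum_{uv in E} I[u, v]
    (maximum weight spanning tree over mutual-information weights).

    I : (n, n) symmetric array of mutual informations.
    """
    n = I.shape[0]
    # All candidate edges, sorted by decreasing mutual information
    edges = sorted(((I[u, v], u, v) for u in range(n) for v in range(u + 1, n)),
                   reverse=True)
    parent = list(range(n))          # union-find forest for cycle detection

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    tree = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                 # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == n - 1:
                break
    return tree
```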
18. Learning mixtures by the EM algorithm
(Meila & Jordan 97)
- E step: which x^i come from T^k? -> distribution P^k(x)
- M step: fit T^k to its set of points: min KL( P^k || T^k )
- Initialize randomly
- converges to a local maximum of the likelihood
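A schematic Python rendering of one EM iteration as described above, assuming helper callbacks: `tree_density` evaluates T^k(x), and `fit_tree_weighted` performs the weighted Chow-Liu fit of the M step (min KL(P^k || T^k)). Both callback names are hypothetical.

```python
import numpy as np

def em_step(data, lambdas, trees, tree_density, fit_tree_weighted):
    """One EM iteration for a mixture of trees Q(x) = sum_k lambda_k T^k(x).

    data              : (N, n) array of data points
    lambdas           : (m,) mixture weights
    trees             : list of m tree models (any representation)
    tree_density      : function (tree, x) -> T^k(x)
    fit_tree_weighted : function (data, weights) -> tree minimizing KL(P^k || T^k)
    """
    N, m = data.shape[0], len(trees)

    # E step: posterior responsibility of tree k for data point i
    gamma = np.array([[lambdas[k] * tree_density(trees[k], x) for k in range(m)]
                      for x in data])
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M step: new mixture weights, then one weighted Chow-Liu fit per component,
    # using the responsibilities as the data-point weights P^k
    new_lambdas = gamma.mean(axis=0)
    new_trees = [fit_tree_weighted(data, gamma[:, k]) for k in range(m)]
    return new_lambdas, new_trees
```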
19. Remarks
- Learning a tree
- solution is globally optimal over structures and parameters
- tractable running time: O(n^2 N)
- Learning a mixture by the EM algorithm
- both E and M steps are exact, tractable
- running time
- E step: O(mnN)
- M step: O(mn^2 N)
- assumes m known
- converges to a local optimum
20. Finding structure - the bars problem
[Figures: data (n = 25) and learned structure]
- Structure recovery: 19 out of 20 trials
- Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
- Data likelihood (bits/data point): true model 8.58, learned model 9.82 +/- 0.95
21. The bars problem
- True structure
- Approximate tree structure
22. Experiments - density estimation
- Digits and digit pairs
- N_train = 6000, N_valid = 2000, N_test = 5000
- n = 64 variables (m = 16 trees); n = 128 variables (m = 32 trees)
[Results figures: mixtures of trees]
23. Classification with trees
- Class variable c
- Predictor variables V
- learn a tree over V \cup {c} from a labeled data set
- class of a new example x_V is c = argmax_j T(x_V, j)
24. Classification with mixtures of trees
- learn a mixture of trees over V \cup {c}
- class of a new example x_V is c = argmax_j Q(x_V, j)
- Tree augmented naive Bayes (TANB) (Friedman et al. 96)
- constructs a tree for each class
- c = argmax_k T^k(x_V)
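The decision rules above all reduce to evaluating a joint model at each candidate class value and taking the argmax; a minimal hedged sketch, with names of my choosing:

```python
import numpy as np

def classify(x_V, class_values, joint_density):
    """Decision rule c = argmax_j Q(x_V, j): plug each candidate class value j
    into the joint model together with the observed predictors x_V.

    x_V           : dict {variable: value} for the predictor variables V
    class_values  : list of possible values j of the class variable c
    joint_density : function (x_V, j) -> Q(x_V, j), e.g. a tree or mixture of trees
    """
    scores = [joint_density(x_V, j) for j in class_values]
    return class_values[int(np.argmax(scores))]
```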
25. DNA splice junction classification
- n = 61 variables
- class: Intron/Exon, Exon/Intron, Neither
[Results figures: N_train = 2000 and N_train = 100]
26. DNA splice junction classification
- n = 61 variables
- class: Intron/Exon, Exon/Intron, Neither
27. Discovering structure
[Figure: tree adjacency matrix; the class variable is marked]
28. Feature selection
[Figure: tree in which the class variable c neighbors only some of the variables]
- Only the neighbors of c are relevant for classification
- Implicit feature selection mechanism
- selection based on mutual information with c
- avoids double counting of features
29. Irrelevant variables
- 61 original variables + 60 noise variables
[Results figures: original data vs. data augmented with irrelevant variables]
30. Accelerated tree learning
(Meila 99)
- Running time of the tree learning algorithm: O(n^2 N)
- Quadratic running time may be too slow
- Example: document classification
- document = data point --> N ~ 10^3 - 10^4
- word = variable --> n ~ 10^3 - 10^4
- sparse data --> # words in a document <= s and s << n, N
- Can sparsity be exploited to create faster algorithms?
31. Sparsity
- Assume a special value 0 that occurs frequently
- sparsity s: #{ v : x_v != 0 } <= s for all x in D
- s << n, N
- Additional assumption about the data
- allows a faster learning algorithm
- is met in practice
- document representation as a vector of words
- diagnostics problems
- Define
- v occurs in x: x_v != 0
- u, v co-occur in x: x_u != 0, x_v != 0
32. Sparsity
- assume a special value 0 that occurs frequently
- sparsity s: # non-zero variables in each data point <= s
- s << n, N
- Idea: do not represent / count zeros
[Figure: sparse data stored as linked lists of length <= s]
33. How can sparsity help?
- Idea: do not represent / count zeros
- Assumptions (can be relaxed)
- binary data
- learning 1 tree
- Sufficient statistics
- N_v = # occurrences of v in D
- N_uv = # co-occurrences of u, v in D
- { N_v, N_uv : u, v in V } are sufficient to reconstruct P_uv, I_uv
34. First idea: sparse data representation (see the sketch below)
- data point x = list( variables that occur in x )
- storage: O(sN)
- computing all N_v: O(sN)
- computing all N_uv: O(s^2 N)
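A small sketch of this sparse representation and of counting N_v and N_uv directly from it, with the stated O(sN) and O(s^2 N) costs; the function name and data layout are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def sparse_counts(sparse_data):
    """Occurrence and co-occurrence counts from a sparse data set.

    sparse_data : list of data points, each a list (without repeats) of the
                  variables v with x_v != 0
    Returns (N_v, N_uv): N_v[v] is the number of occurrences of v,
    N_uv[(u, v)] the number of co-occurrences of the pair u < v.
    Cost: O(sN) for N_v and O(s^2 N) for N_uv, where s bounds the list lengths.
    """
    N_v, N_uv = Counter(), Counter()
    for point in sparse_data:
        N_v.update(point)                         # each occurring variable
        for u, v in combinations(sorted(point), 2):
            N_uv[(u, v)] += 1                     # each co-occurring pair
    return N_v, N_uv
```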
35. Presort mutual informations
- Theorem (Meila 99): If v, v' are variables that do not co-occur with u (i.e. N_uv = N_uv' = 0), then N_v > N_v' ==> I_uv > I_uv'
- Consequences
- sort the N_v ==> all edges uv with N_uv = 0 are implicitly sorted by I_uv
- these edges need not be represented explicitly
- construct a "black box" that outputs the next largest edge
36The black box data structure
v1
Nv
v2
list of u , Nuv gt 0, sorted by Iuv
v
F-heap of size n
list of u, Nuv 0, sorted by Nv (virtual)
vn
next edge uv
Total running time n log n s2N nK
log n (standard alg. running time n2N )
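A possible Python sketch of such a black box as a lazy generator: for each vertex, the explicit edges (N_uv > 0, presorted by I_uv) are merged with the virtual zero-co-occurrence edges, which by the theorem on the previous slide are already ordered by N_v; here `heapq.merge` stands in for the F-heap. For binary data a zero-co-occurrence pair's I_uv depends only on N_u, N_v and N, so the `mutual_info` callback is cheap. All names are mine; this is not the author's data structure.

```python
import heapq

def black_box_edges(Nv, cooc, mutual_info):
    """Lazily yield edges (I_uv, u, v) in decreasing order of I_uv, without
    enumerating the O(n^2) pairs with N_uv = 0 up front.

    Nv          : dict v -> N_v (variables are integers)
    cooc        : dict u -> list of (I_uv, v) with N_uv > 0, sorted by
                  decreasing I_uv; each such edge listed under both endpoints
    mutual_info : function (u, v) -> I_uv, used only for pairs with N_uv = 0
    """
    # Vertices by decreasing N_v: for a fixed u, the zero-co-occurrence edges
    # taken in this order are already sorted by decreasing I_uv (theorem above).
    order = sorted(Nv, key=Nv.get, reverse=True)

    def stream(u):
        seen = {v for _, v in cooc.get(u, [])}
        explicit = ((-I, u, v) for I, v in cooc.get(u, []))
        virtual = ((-mutual_info(u, v), u, v) for v in order
                   if v != u and v not in seen)
        # Both streams are non-increasing in I_uv, so merging preserves order.
        yield from heapq.merge(explicit, virtual)

    emitted = set()
    for neg_I, u, v in heapq.merge(*(stream(u) for u in Nv)):
        key = (u, v) if u <= v else (v, u)
        if key not in emitted:            # each edge is produced by both endpoints
            emitted.add(key)
            yield -neg_I, u, v
```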
37. Accelerated algorithm outline
- Compute and sort N_v: O(sN + n log n)
- Compute N_uv > 0 and I_uv: O(s^2 N)
- Construct the black box
- Construct the tree (Kruskal's algorithm, revisited)
- repeat
- extract the edge uv with largest I_uv from the black box
- check if it forms a cycle
- if not, add it to the tree
- until n-1 edges are added
- Compute the tree parameters from N_v, N_uv
- Total running time: O(n log n + s^2 N + nK log n)
38. Experiments - sparse binary data
- N = 10,000
- s = 5, 10, 15, 100
39. Experiments - sparse binary data (continued)
[Figure: n_Kruskal, the number of edges examined by Kruskal's algorithm]
40. Generalizations
- Multi-valued sparse data
- x = list( (v, j) : x_v = j != 0 )
- N_uv = contingency table
- Non-integer counts N_v, N_uv
- ==> can learn mixtures of trees by the EM algorithm
- Prior information
- incompatible with general priors
- priors on the tree structure
- constant penalty for adding an edge
- priors on the tree parameters
- non-informative Dirichlet priors
41. Remarks
- Realistic assumption
- Exact algorithm, provably efficient time bounds
- Degrades slowly to the standard algorithm if the data is not sparse
- General
- non-integer counts
- multi-valued discrete variables
42. Bayesian learning of trees
(Meila & Jaakkola 00)
- Problem
- given a prior distribution over trees P_0(T)
- and data D = { x^1, ..., x^N }
- find the posterior distribution P(T|D)
- Advantages
- incorporates prior knowledge
- regularization
- Solution
- Bayes formula: P(T|D) \propto P_0(T) \prod_{i=1}^{N} T(x^i)
- practically hard
- distribution over structure E and parameters \theta_E
- hard to represent
- computing the normalization constant Z is intractable in general
- exception: conjugate priors
43Decomposable priors
T P f( u, v, quv) uv E
- want priors that factor over tree edges
- prior for structure E
- P0(E) a P buv
-
uv E - prior for tree parameters
- P0(qE) P D( quv Nuv )
-
uv E - (hyper) Dirichlet with hyper-parameters
Nuv(xuxv), u,v V - posterior is also Dirichlet with hyper-parameters
- Nuv(xuxv) Nuv(xuxv), u,v V
-
44. Decomposable posterior
- Posterior distribution
- P(T|D) \propto \prod_{uv \in E} W_{uv}
- factored over edges
- same form as the prior
- W_{uv} = \beta_{uv} D( \theta_{uv} ; N'_{uv} + N_{uv} )
- Remains to compute the normalization constant
45. The Matrix Tree theorem
(discrete graph theory -> continuous) (Meila & Jaakkola 99)
- Matrix Tree theorem
- If
- P_0(E) \propto \prod_{uv \in E} \beta_{uv}, with \beta_{uv} \geq 0
- then
- Z = \det M( \beta ), with M( \beta ) as given below
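For reference, a LaTeX rendering of the standard weighted form of the Matrix Tree theorem, which I believe is the M(β) intended on the slide (the matrix itself did not survive extraction): M(β) is the graph Laplacian of the edge weights with the row and column of one vertex removed.

```latex
% Weighted Matrix Tree theorem: the normalization constant over all spanning
% trees equals the determinant of a reduced Laplacian built from the weights.
\[
  Z \;=\; \sum_{E\ \mathrm{spanning\ tree}}\ \prod_{uv \in E} \beta_{uv}
    \;=\; \det M(\beta),
  \qquad
  M(\beta)_{uv} \;=\;
  \begin{cases}
    \displaystyle\sum_{w \neq u} \beta_{uw}, & u = v,\\[4pt]
    -\,\beta_{uv}, & u \neq v,
  \end{cases}
  \qquad u, v \in \{2, \dots, n\}.
\]
```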
47. Remarks on the decomposable prior
- Is a conjugate prior for the tree distribution
- Is tractable
- defined by n^2 parameters
- computed exactly in O(n^3) operations
- posterior obtained in O(n^2 N + n^3) operations
- derivatives w.r.t. parameters, averaging, ...: O(n^3)
- Mixtures of trees with decomposable priors
- MAP estimation with the EM algorithm is tractable
- Other applications
- ensembles of trees
- maximum entropy distributions on trees
48. So far . . .
- Trees and mixtures of trees are structured statistical models
- Algorithmic techniques enable efficient learning
- mixtures of trees
- accelerated algorithm
- matrix tree theorem + Bayesian learning
- Examples of usage
- Structure learning
- Compression
- Classification
49. Generative models and discrimination
- Trees are generative models
- descriptive
- can perform many tasks, suboptimally
- Maximum Entropy discrimination (Jaakkola, Meila, Jebara 99)
- optimize for specific tasks
- use generative models
- combine simple models into ensembles
- complexity control by an information theoretic principle
- Discrimination tasks
- detecting novelty
- diagnosis
- classification
50. Bridging the gap
Tasks
Descriptive learning
Discriminative learning
51. Future . . .
- Tasks have structure
- multi-way classification
- multiple indexing of documents
- gene expression data
- hierarchical, sequential decisions
- Learn structured decision tasks
- sharing information between tasks (transfer)
- modeling dependencies between decisions
52. Combine and conquer
statistics + computer science