Efficient Learning in High Dimensions with Trees and Mixtures

Transcript and Presenter's Notes
1
Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila, Carnegie Mellon University
2
Multidimensional data
  • Multidimensional (noisy) data
  • Learning tasks - intelligent data analysis
  • categorization (clustering)
  • classification
  • novelty detection
  • probabilistic reasoning
  • Data is changing, growing
  • Tasks change
  • need to make learning automatic, efficient

3
Combining probability and algorithms
  • Automatic probability and statistics
  • Efficient algorithms
  • This talk
  • the tree statistical model

4
Talk overview
Perspective: generative models and decision tasks
Introduction: statistical models
The tree model
Mixtures of trees
Accelerated learning
Bayesian learning
Learning
Experiments
5
A multivariate domain
  • Data
  • Patient1
  • Patient2
  • . . . . . . . . . . . .
  • Queries
  • Diagnose new patient
  • Is smoking related to lung cancer?
  • Understand the laws of the domain

[Figure: Bayesian-network diagrams over the variables Smoker, Bronchitis, Lung cancer, Cough, X ray; for the new patients the value of Lung cancer is unknown.]
6
Probabilistic approach
  • Smoker, Bronchitis, ... are (discrete) random variables
  • Statistical model (joint distribution)
  • P( Smoker, Bronchitis, Lung cancer, Cough, X ray )
  • summarizes knowledge about the domain
  • Queries
  • inference
  • e.g. P( Lung cancer = true | Smoker = true, Cough = false )
  • structure of the model
  • discovering relationships
  • categorization

7
Probability table representation
  • Query
  • P( v1 = 0 | v2 = 1 ) = 0.23
  • Curse of dimensionality
  • if v1, v2, ..., vn are binary variables,
    P(V1, V2, ..., Vn) is a table with 2^n entries!
  • How to represent?
  • How to query?
  • How to learn from data?
  • Structure?

8
Graphical models
  • Structure
  • vertices = variables
  • edges = direct dependencies
  • Parametrization
  • by local probability tables

[Figure: graphical model for an astronomy domain with variables distance, Galaxy type, size, spectrum, Z (red-shift), dust, observed size, observed spectrum, photometric measurement.]
  • compact parametric representation
  • efficient computation
  • learning parameters by simple formula
  • learning structure is NP-hard

9
The tree statistical model
  • Structure
  • tree (graph with no cycles)
  • Parameters
  • probability tables T_uv
  • T_v = marginal distribution of v
  • T_uv = marginal distribution of (u, v)

T(x) = T_13(x1, x3) T_23(x2, x3) T_34(x3, x4) T_45(x4, x5) / ( T_3(x3)^2 T_4(x4) )
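A minimal Python sketch of evaluating this factored form for the 5-variable example above; the numeric tables are hypothetical placeholders (chosen only so the pairwise tables agree on their shared node marginals), not values from the talk:

    from math import prod

    # Example tree over binary variables x1..x5 with edges (1,3), (2,3), (3,4), (4,5).
    edges = [(1, 3), (2, 3), (3, 4), (4, 5)]
    degree = {1: 1, 2: 1, 3: 3, 4: 2, 5: 1}

    # T_uv[(u, v)][xu][xv] = pairwise marginal table; T_v[v][xv] = node marginal.
    T_uv = {
        (1, 3): [[0.30, 0.20], [0.20, 0.30]],
        (2, 3): [[0.25, 0.25], [0.25, 0.25]],
        (3, 4): [[0.35, 0.15], [0.15, 0.35]],
        (4, 5): [[0.30, 0.20], [0.20, 0.30]],
    }
    T_v = {v: [0.5, 0.5] for v in degree}

    def tree_likelihood(x):
        # T(x) = prod_{uv in E} T_uv(xu, xv) / prod_v T_v(xv)^(deg(v) - 1)
        num = prod(T_uv[(u, v)][x[u]][x[v]] for (u, v) in edges)
        den = prod(T_v[v][x[v]] ** (degree[v] - 1) for v in T_v)
        return num / den

    print(tree_likelihood({1: 0, 2: 1, 3: 0, 4: 1, 5: 1}))

Evaluating T(x) touches each edge and node once, which matches the O(n) likelihood computation listed on the basic-operations slide below.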
10
The tree statistical model
  • Structure
  • tree (graph with no cycles)
  • Parameters
  • probability tables associated with the edges

[Figure: probability tables (e.g. T_3, T_34) attached to the nodes and edges of the tree.]
  • T(x) factors over tree edges

11
Examples
  • Splice junction domain
  • Premature babies: Broncho-Pulmonary Disease (BPD)

[Figures: learned tree structures for the splice junction domain (nodes are sequence positions -7 .. 8 plus the junction type) and for the BPD domain (nodes include BPD, Gestation, Weight, Temperature, Acidosis, Hypertension, HyperNa, Thrombocyt, Coag, PulmHemorrh, Neutropenia, Lipid, Suspect).]
12
Trees - basic operations
  • |V| = n
  • computing likelihood T(x): O(n)
  • conditioning T_{V-A | A} (junction tree algorithm): O(n)
  • marginalization T_uv for arbitrary u,v: O(n)
  • sampling: O(n)
  • fitting to a given distribution: O(n^2)
  • learning from data: O(n^2 N_data)
  • is a simple model

Querying the model
Estimating the model
13
The mixture of trees
(Meila 97)
Q(x) = Σ_{k=1..m} λ_k T^k(x)
  • h = hidden variable
  • P( h = k ) = λ_k ,  k = 1, 2, . . . m
  • NOT a graphical model
  • computational efficiency preserved

14
Learning - problem formulation
  • Maximum Likelihood learning
  • given a data set D = { x^1, . . . , x^N }
  • find the model that best predicts the data
  • T^opt = argmax_T T(D)
  • Fitting a tree to a distribution
  • given a data set D = { x^1, . . . , x^N }
  • and a distribution P that weights each data point,
  • find
  • T^opt = argmin_T KL( P || T )
  • KL is the Kullback-Leibler divergence
  • includes Maximum Likelihood learning as a special case

15
Fitting a tree to a distribution
(Chow & Liu 68)
  • T^opt = argmin_T KL( P || T )
  • optimization over structure + parameters
  • sufficient statistics
  • probability tables P_uv = N_uv / N, for u,v ∈ V
  • mutual informations I_uv
  • I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / ( P_u(x_u) P_v(x_v) ) ]
16
Fitting a tree to a distribution - solution
  • Structure
  • E^opt = argmax_E Σ_{uv ∈ E} I_uv
  • found by the Maximum Weight Spanning Tree algorithm
  • with edge weights I_uv
  • Parameters
  • copy the marginals of P:  T_uv = P_uv for uv ∈ E

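A rough Python sketch of this procedure for binary data; the helper names and the hand-rolled Kruskal step are my own choices, not code from the talk:

    import numpy as np
    from itertools import combinations

    def mutual_information(counts, N):
        """I_uv from a 2x2 co-occurrence table of counts (binary variables)."""
        P = counts / N
        Pu, Pv = P.sum(axis=1), P.sum(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = P * np.log(P / np.outer(Pu, Pv))
        return np.nansum(terms)            # 0 * log 0 terms are dropped

    def chow_liu(X):
        """X: (N, n) array of 0/1 data. Returns the max-weight spanning tree edges."""
        N, n = X.shape
        # pairwise mutual informations (the O(n^2 N) part)
        I = {}
        for u, v in combinations(range(n), 2):
            counts = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    counts[a, b] = np.sum((X[:, u] == a) & (X[:, v] == b))
            I[(u, v)] = mutual_information(counts, N)
        # Kruskal: add edges in decreasing I_uv, skipping those that close a cycle
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        edges = []
        for (u, v), _ in sorted(I.items(), key=lambda e: -e[1]):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                edges.append((u, v))
        return edges

The pairwise counting step dominates, which is where the O(n^2 N) running time quoted later in the talk comes from.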
17
Finding the optimal tree structure
18
Learning mixtures by the EM algorithm
(Meila & Jordan 97)
E step: which x^i come from T^k?  →  distribution P^k(x)
M step: fit T^k to its set of points:  min KL( P^k || T^k )
  • Initialize randomly
  • converges to local maximum of the likelihood

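Written out, these are the standard EM updates for this mixture (γ_k(x^i) denotes the responsibility of tree k for data point x^i):

E step:   γ_k(x^i) = λ_k T^k(x^i) / Σ_l λ_l T^l(x^i)
M step:   λ_k ← (1/N) Σ_i γ_k(x^i)
          P^k(x^i) = γ_k(x^i) / Σ_j γ_k(x^j)
          T^k ← argmin_T KL( P^k || T )   (the Chow & Liu fit above, with weighted counts)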
19
Remarks
  • Learning a tree
  • solution is globally optimal over structures and
    parameters
  • tractable running time O(n^2 N)
  • Learning a mixture by the EM algorithm
  • both E and M steps are exact, tractable
  • running time
  • E step O(m n N)
  • M step O(m n^2 N)
  • assumes m known
  • converges to local optimum

20
Finding structure - the bars problem
Data: n = 25; learned structure shown in the figure
Structure recovery: 19 out of 20 trials
Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
Data likelihood (bits/data point): true model 8.58, learned model 9.82 +/- 0.95
21
The bars problem
  • True structure
  • Approximate tree structure

22
Experiments - density estimation
  • Digits and digit pairs
  • N_train = 6000, N_valid = 2000, N_test = 5000
  • n = 64 variables ( m = 16 trees )
    n = 128 variables ( m = 32 trees )

23
Classification with trees
  • Class variable c
  • Predictor variables V
  • learn a tree over V ∪ {c} from the labeled data set
  • class of a new example x_V is c = argmax_j T(x_V, j)

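As a tiny sketch of this decision rule, assuming a tree_likelihood function like the earlier one and data points represented as dicts keyed by variable name (both are assumptions of mine):

    # c = argmax_j T(x_V, j): score every candidate class value with the joint
    # tree over V and the class variable c, and return the best one.
    def classify(x_V, class_values, tree_likelihood):
        return max(class_values, key=lambda j: tree_likelihood({**x_V, "c": j}))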
24
Classification with mixtures of trees
  • learn a mixture of trees over V ∪ {c}
  • class of a new example x_V is c = argmax_j Q(x_V, j)
  • Tree augmented naïve Bayes (TANB) (Friedman et al. 96)
  • constructs a tree for each class
  • c = argmax_k T^k(x_V)

25
DNA splice junction classification
  • n = 61 variables
  • class: Intron/Exon, Exon/Intron, Neither
  • N_train = 2000

    N_train = 100

26
DNA splice junction classification
  • n = 61 variables
  • class Intron/Exon, Exon/Intron, Neither

27
Discovering structure
[Figure: tree adjacency matrix, with the class variable marked.]
28
Feature selection
[Figure: tree with class node C and variable nodes 1, 2, 3, 5.]
  • Only the neighbors of c are relevant for
    classification
  • Implicit feature selection mechanism
  • selection based on mutual information with c
  • avoids double counting of features

29
Irrelevant variables
  • 61 original variables + 60 noise variables
  • [Figures: results on the original data and on the data augmented with irrelevant variables]

30
Accelerated tree learning
(Meila 99)
  • Running time for the tree learning algorithm: O(n^2 N)
  • Quadratic running time may be too slow
  • Example: document classification
  • document = data point  →  N ~ 10^3 - 10^4
  • word = variable  →  n ~ 10^3 - 10^4
  • sparse data  →  # words in a document ≤ s, and s << n, N
  • Can sparsity be exploited to create faster
    algorithms?

31
Sparsity
  • Assume special value 0 that occurs frequently
  • sparsity s: #{ v : x_v ≠ 0 } ≤ s for every x in D
  • s << n, N
  • Additional assumption about the data
  • allows faster learning algorithm
  • is met in practice
  • document representation as vector of words
  • diagnostics problems
  • Define
  • v occurs in x  iff  x_v ≠ 0
  • u,v co-occur in x  iff  x_u ≠ 0 and x_v ≠ 0

32
Sparsity
  • assume special value 0 that occurs frequently
  • sparsity s: number of non-zero variables in each data point ≤ s
  • s << n, N
  • Idea: do not represent / count zeros

[Figure: sparse data stored as linked lists of length ≤ s.]
33
How can sparsity help?
  • Idea: do not represent / count zeros
  • Assumptions - can be relaxed
  • binary data
  • learning 1 tree
  • Sufficient statistics
  • N_v = number of occurrences of v in D
  • N_uv = number of co-occurrences of u,v in D
  • { N_v , N_uv : u,v ∈ V } are sufficient to reconstruct P_uv, I_uv

34
First idea: sparse data representation
  • data point x → list of the variables that occur in x
  • storage: O(sN)
  • computing all N_v: O(sN)
  • computing all N_uv: O(s^2 N)

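A minimal sketch of computing these counts from the sparse representation; the toy data and names are illustrative only:

    from collections import Counter
    from itertools import combinations

    # Sparse representation: each data point is the sorted list of the
    # variables that occur in it (x_v != 0).
    data = [[0, 3, 7], [3, 7], [1, 3], [0, 7]]

    N_v = Counter()                      # occurrence counts, O(sN) total work
    N_uv = Counter()                     # co-occurrence counts, O(s^2 N) total work
    for x in data:
        N_v.update(x)
        N_uv.update(combinations(x, 2))

    print(N_v[3], N_uv[(3, 7)])          # -> 3 and 2 for the toy data above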
35
Presort mutual informations
  • Theorem (Meila 99): If v, v' are variables that do not co-occur with u (i.e. N_uv = N_uv' = 0), then
  • N_v > N_v'  implies  I_uv > I_uv'
  • Consequences
  • sort the N_v  =>  all edges uv with N_uv = 0 are implicitly sorted by I_uv
  • these edges need not be represented explicitly
  • construct a "black box" that outputs the next largest edge

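A quick numeric illustration of the theorem with toy counts (the helper below is my own, not from the paper): among variables that never co-occur with u, the one that occurs more often has the larger mutual information with u.

    import numpy as np

    def mi_from_counts(N_u, N_v, N_uv, N):
        """Mutual information of two binary variables from occurrence counts."""
        P = np.array([[N - N_u - N_v + N_uv, N_v - N_uv],
                      [N_u - N_uv,           N_uv]]) / N
        Pu, Pv = P.sum(axis=1), P.sum(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            t = P * np.log(P / np.outer(Pu, Pv))
        return np.nansum(t)                 # 0 * log 0 terms are dropped

    N, N_u = 100, 30
    print(mi_from_counts(N_u, 10, 0, N))    # smaller N_v -> smaller I_uv
    print(mi_from_counts(N_u, 20, 0, N))    # larger  N_v -> larger  I_uv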
36
The black box data structure
[Figure: the black box. For each variable v (with its count N_v): a list of the u with N_uv > 0, sorted by I_uv, and a virtual list of the u with N_uv = 0, sorted by N_v; an F-heap of size n returns the next edge uv with the largest I_uv.]
Total running time: O( n log n + s^2 N + nK log n )   (standard algorithm: O(n^2 N))
37
Accelerated algorithm outline
  • Compute and sort N_v
  • Compute N_uv > 0 and I_uv
  • Construct black box
  • Construct tree (Kruskal algorithm - revisited)
  • repeat
  • extract uv with largest Iuv from black box
  • check if it forms a cycle
  • if not, add to tree
  • until n-1 edges are added
  • Compute tree parameters from Nv , Nuv
  • Total running time: O( n log n + s^2 N + nK log n )

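A rough sketch of the outer loop, assuming black_box is any iterable that yields candidate edges (u, v) in decreasing order of I_uv as on the previous slide (the black box itself is not implemented here):

    def accelerated_tree(n, black_box):
        """Kruskal's algorithm driven by the black-box edge stream."""
        parent = list(range(n))
        def find(a):                          # union-find with path compression
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        tree = []
        for (u, v) in black_box:              # edges arrive sorted by decreasing I_uv
            ru, rv = find(u), find(v)
            if ru != rv:                      # skip edges that would close a cycle
                parent[ru] = rv
                tree.append((u, v))
                if len(tree) == n - 1:        # stop as soon as the tree is complete
                    break
        return tree

The gain over the standard algorithm is that only the K edges actually pulled from the black box are ever materialized, rather than all ~n^2/2 candidate pairs.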
38
Experiments - sparse binary data
  • N = 10,000
  • s = 5, 10, 15, 100

39
Experiments - sparse binary data (continued)
[Figure: n_Kruskal, the number of edges examined by the Kruskal step.]
40
Generalizations
  • Multi-valued sparse data
  • x → list of ( v, x_v ) with x_v ≠ 0
  • N_uv = a contingency table over (x_u, x_v)
  • Non-integer counts N_v , N_uv
  • => can learn mixtures of trees by the EM algorithm
  • Prior information
  • incompatible with general priors
  • priors on the tree structure
  • constant penalty for adding an edge
  • priors on the tree parameters
  • non-informative Dirichlet priors

41
Remarks
  • Realistic assumption
  • Exact algorithm, provably efficient time bounds
  • Degrades gracefully to the standard algorithm if the data is not sparse
  • General
  • non-integer counts
  • multi-valued discrete variables

42
Bayesian learning of trees
(Meila & Jaakkola 00)
  • Problem
  • given a prior distribution over trees P_0(T)
  • data D = { x^1, . . . , x^N }
  • find the posterior distribution P(T | D)
  • Advantages
  • incorporates prior knowledge
  • regularization
  • Solution
  • Bayes formula:  P(T | D) ∝ P_0(T) Π_{i=1..N} T(x^i)
  • practically hard
  • distribution over structure E and parameters θ_E
  • hard to represent
  • computing the normalization constant Z is intractable in general
  • exception: conjugate priors

43
Decomposable priors
P(T) ∝ Π_{uv ∈ E} f( u, v, θ_uv )
  • want priors that factor over tree edges
  • prior for the structure E
  • P_0(E) ∝ Π_{uv ∈ E} β_uv
  • prior for the tree parameters
  • P_0(θ_E) = Π_{uv ∈ E} D( θ_uv ; N'_uv )
  • (hyper) Dirichlet with hyper-parameters N'_uv(x_u, x_v), u,v ∈ V
  • the posterior is also Dirichlet, with hyper-parameters
  • N'_uv(x_u, x_v) + N_uv(x_u, x_v), u,v ∈ V

44
Decomposable posterior
  • Posterior distribution
  • P(T | D) ∝ Π_{uv ∈ E} W_uv
  • factored over edges
  • same form as the prior
  • W_uv = β_uv D( θ_uv ; N'_uv + N_uv )

  • Remains to compute the normalization constant

45
The Matrix tree theorem
From discrete graph theory to the continuous case (Meila & Jaakkola 99)
  • Matrix tree theorem
  • If P_0(E) ∝ Π_{uv ∈ E} β_uv , with β_uv ≥ 0
  • then
  • Z = det M( β )
  • with M( β ) the (n-1) x (n-1) matrix M_vv = Σ_u β_uv , M_uv = -β_uv for u ≠ v, i.e. the weighted Laplacian of β with one row and column removed

46
The Matrix tree theorem
From discrete graph theory to the continuous case (Meila & Jaakkola 99)
  • Matrix tree theorem
  • If P_0(E) ∝ Π_{uv ∈ E} β_uv , with β_uv ≥ 0
  • and M( β ) is defined as above,
  • then Z = det M( β )
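A small numeric sketch of this computation; beta below is a hypothetical symmetric weight matrix, and M(β) is formed as the weighted Laplacian with one row and column removed:

    import numpy as np

    # Hypothetical edge weights beta_uv >= 0 for n = 4 variables (diagonal unused).
    beta = np.array([[0.0, 2.0, 1.0, 1.0],
                     [2.0, 0.0, 3.0, 1.0],
                     [1.0, 3.0, 0.0, 2.0],
                     [1.0, 1.0, 2.0, 0.0]])

    # Weighted Laplacian: L_vv = sum_u beta_uv, L_uv = -beta_uv for u != v.
    L = np.diag(beta.sum(axis=1)) - beta
    # Matrix tree theorem: Z = sum over spanning trees of the product of edge
    # weights = determinant of L with any one row and column deleted.
    Z = np.linalg.det(L[1:, 1:])
    print(Z)
    # Sanity check: with all beta_uv = 1 this gives 4^(4-2) = 16, the number of
    # spanning trees of the complete graph on 4 nodes (Cayley's formula).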
47
Remarks on the decomposable prior
  • Is a conjugate prior for the tree distribution
  • Is tractable
  • defined by n^2 parameters
  • computed exactly in O(n^3) operations
  • posterior obtained in O(n^2 N + n^3) operations
  • derivatives w.r.t. parameters, averaging, . . . O(n^3)
  • Mixtures of trees with decomposable priors
  • MAP estimation with EM algorithm tractable
  • Other applications
  • ensembles of trees
  • maximum entropy distributions on trees

48
So far . .
  • Trees and mixtures of trees are structured
    statistical models
  • Algorithmic techniques enable efficient learning
  • mixture of trees
  • accelerated algorithm
  • matrix tree theorem → Bayesian learning
  • Examples of usage
  • Structure learning
  • Compression
  • Classification

49
Generative models and discrimination
  • Trees are generative models
  • descriptive
  • can perform many tasks suboptimally
  • Maximum Entropy discrimination (Jaakkola, Meila, Jebara 99)
  • optimize for specific tasks
  • use generative models
  • combine simple models into ensembles
  • complexity control - by information theoretic
    principle
  • Discrimination tasks
  • detecting novelty
  • diagnosis
  • classification

50
Bridging the gap
Tasks
Descriptive learning
Discriminative learning
51
Future . . .
  • Tasks have structure
  • multi-way classification
  • multiple indexing of documents
  • gene expression data
  • hierarchical, sequential decisions
  • Learn structured decision tasks
  • sharing information between tasks (transfer)
  • modeling dependencies between decisions

52
Combine and conquer: statistics + computer science