Title: COMP201 Java Programming
1Latent Tree Analysis of Unlabeled Data
Nevin L. Zhang
Dept. of Computer Science & Engineering
The Hong Kong Univ. of Sci. & Tech.
http://www.cse.ust.hk/lzhang
2Outline
- Latent tree models
- Latent tree analysis algorithms
- What can LTA be used for
- Discovery of co-occurrence/correlation patterns
- Discovery of latent variable/structures
- Multidimensional clustering
- Examples
- Danish beer survey data
- Text data
- TCM survey data
3Latent Tree Models
- Tree-structured probabilistic graphical models
- Leaves observed (manifest variables)
- Discrete or continuous
- Internal nodes latent (latent variables)
- Discrete
- Each edge is associated with a conditional distribution
- One node (the root) has a marginal distribution
- Together these define a joint distribution over all the variables
- (Zhang, JMLR 2004)
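A minimal sketch of how one marginal distribution plus one conditional distribution per edge defines a joint distribution. The toy model (one binary latent root H with two binary observed leaves X1 and X2) and all of its probabilities are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical toy model: latent root H with observed leaves X1 and X2 (all binary).
P_H = [0.6, 0.4]                          # marginal distribution of the root
P_X1_GIVEN_H = [[0.9, 0.1], [0.2, 0.8]]   # row h holds P(X1 | H = h)
P_X2_GIVEN_H = [[0.7, 0.3], [0.1, 0.9]]   # row h holds P(X2 | H = h)


def joint(h, x1, x2):
    """P(H=h, X1=x1, X2=x2): root marginal times one conditional per edge."""
    return P_H[h] * P_X1_GIVEN_H[h][x1] * P_X2_GIVEN_H[h][x2]


def observed_marginal(x1, x2):
    """P(X1=x1, X2=x2), obtained by summing the latent variable out."""
    return sum(joint(h, x1, x2) for h in range(len(P_H)))


# The joint sums to one over all configurations of (H, X1, X2).
total = sum(joint(h, x1, x2) for h, x1, x2 in product(range(2), repeat=3))
print(round(total, 10))
print(round(observed_marginal(0, 0), 3))
```

Only the observed marginal is visible in data; the latent variable is summed out, which is why the structure and the latent states have to be learned.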
4Latent Tree Analysis
From data on observed variables, obtain a latent tree model
- Learning latent tree models: determine
- Number of latent variables
- Numbers of possible states for latent variables
- Connections among nodes
- Probability distributions
- Model Selection Criterion
- Find the model that maximizes the BIC score
- $\mathrm{BIC}(m \mid D) = \log P(D \mid m, \theta^*) - \frac{d}{2} \log N$
- $D$: data, $N$: sample size
- $m$: model, $\theta^*$: MLE of parameters
- $d$: number of free parameters
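The BIC criterion above can be sketched as a small scoring function. The candidate models, their log-likelihoods, and their parameter counts below are hypothetical numbers, used only to show how the penalty term trades fit against complexity.

```python
import math


def bic_score(log_likelihood, num_free_params, sample_size):
    """BIC(m | D) = log P(D | m, theta*) - (d / 2) * log N."""
    return log_likelihood - 0.5 * num_free_params * math.log(sample_size)


def select_model(candidates, sample_size):
    """Return the (name, log_likelihood, d) tuple with the highest BIC score."""
    return max(candidates, key=lambda c: bic_score(c[1], c[2], sample_size))


# Hypothetical candidates: (name, maximized log-likelihood, free parameters).
candidates = [("small", -1200.0, 10), ("large", -1190.0, 25)]
best = select_model(candidates, sample_size=500)
print(best[0])
```

In this made-up case the larger model fits slightly better, but its extra parameters cost more than the gain in log-likelihood, so the smaller model is selected.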
5Algorithms: EAST
- Search-based
- Extension, Adjustment, Simplification until Termination
- Can deal with 100 observed variables
- (Chen, Zhang et al. AIJ 2011)
6- (Liu, Zhang et al. MLJ 2013)
- Unidimensionality test
7- (Liu, Zhang et al. MLJ 2013)
8- (Liu, Zhang et al. MLJ 2013)
- Chow-Liu tree (1968)
9- (Liu, Zhang et al. MLJ 2013)
- Close to EAST in terms of model quality
- Can deal with 1,000 observed variables
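The slides name the Chow-Liu tree (1968) as one ingredient of the approach of Liu et al. The sketch below shows only the standard Chow-Liu construction (a maximum-weight spanning tree over pairwise empirical mutual information), not the full algorithm from the paper. The function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_xy = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def chow_liu_edges(columns):
    """columns: dict mapping variable name -> list of values (equal lengths).

    Returns the edges of a maximum-weight spanning tree (Kruskal's algorithm),
    where each edge weight is the mutual information between its endpoints.
    """
    names = list(columns)
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,
    )
    parent = {name: name for name in names}

    def find(v):
        # Union-find lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for _, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


# Toy data: B mostly agrees with A, while C is unrelated to both.
data = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 1, 1, 1, 1, 0],
    "C": [0, 1, 0, 1, 0, 1, 0, 1],
}
edges = chow_liu_edges(data)
print(edges)
```

The strongest dependency (A with B) is picked first; the remaining variable is then attached by whichever edge keeps the graph a tree.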
11Danish Beer Market Survey
(Mourad et al. JAIR 2013)
- 463 consumers, 11 beer brands
- Questionnaire: for each brand
- Never seen the brand before (s0)
- Seen before, but never tasted (s1)
- Tasted, but do not drink regularly (s2)
- Drink regularly (s3).
12Why are the variables grouped as such?
- GronTuborg and Carlsberg: main mass-market beers
- TuborgClas and CarlSpec: frequent beers, a bit darker than the above
- CeresTop, CeresRoyal, Pokal: minor local beers
- Grouped as such because responses on brands in each group are strongly correlated.
- Intuitively, latent tree analysis partitions the observed variables into groups such that
- Variables in each group are strongly correlated, and
- The correlations within each group can be properly modeled using a single latent variable
13Multidimensional Clustering
- Each latent variable gives a partition of consumers.
- H1:
- Class 1: likely to have tasted TuborgClas, CarlSpec and Heineken, but do not drink them regularly
- Class 2: likely to have seen or tasted the beers, but do not drink them regularly
- Class 3: likely to drink TuborgClas and CarlSpec regularly
- Intuitively, latent tree analysis is a technique for multiple clustering.
- K-means and mixture models give only one partition.
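A minimal sketch of how one latent variable induces a partition: each consumer is assigned to the class with the highest posterior probability given their answers. As a simplifying assumption, the posterior here uses only the latent variable's own child variables; in a full latent tree it would be computed by inference over the whole model. The two classes and all probabilities are hypothetical, not values from the survey.

```python
def posterior(prior, cond, observation):
    """Posterior over classes of one latent variable.

    prior[c]              = P(H = c)
    cond[c][var][state]   = P(var = state | H = c)
    observation           = dict mapping var -> observed state
    """
    weights = []
    for c, p in enumerate(prior):
        w = p
        for var, state in observation.items():
            w *= cond[c][var][state]  # children are independent given H
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def assign(prior, cond, observation):
    """Index of the most probable class for this observation."""
    post = posterior(prior, cond, observation)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical parameters: class 0 "tasted only", class 1 "drinks regularly".
tasted = {"s0": 0.05, "s1": 0.15, "s2": 0.70, "s3": 0.10}
regular = {"s0": 0.02, "s1": 0.08, "s2": 0.20, "s3": 0.70}
prior = [0.5, 0.5]
cond = [
    {"TuborgClas": tasted, "CarlSpec": tasted},
    {"TuborgClas": regular, "CarlSpec": regular},
]
answers = {"TuborgClas": "s3", "CarlSpec": "s3"}
print(assign(prior, cond, answers))
```

Repeating the assignment for every latent variable in the tree yields one partition per latent variable, which is what distinguishes this from a single mixture model.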
14Binary Text Data: WebKB
(Liu et al. PGM 2012, MLJ 2013)
- 1041 web pages collected from 4 CS departments in 1997
- 336 words
15Latent Tree Model for WebKB Data by BI Algorithm
- 89 latent variables
16Latent Tree Models for WebKB Data
19Why are the variables grouped as such?
- Grouped as such because words in each group tend to co-occur.
- On binary data, latent tree analysis partitions the observed word variables into groups such that
- Words in each group tend to co-occur, and
- The correlations can be properly explained using a single latent variable
- LTA is a method for identifying co-occurrence relationships.
20Multidimensional Clustering
- LTA is an approach to topic detection
- Y66 = 4: Object-oriented programming (OOP)
- Y66 = 2: Non-OOP programming
- Y66 = 1: Programming language
- Y66 = 3: Not on programming
22Background of Research
- Common practice in China, increasingly in the Western world
- Patients of a WM (Western medicine) disease are divided into several TCM (traditional Chinese medicine) classes
- Different classes are treated differently using TCM treatments.
- Example
- WM disease: Depression
- TCM classes
- Liver-Qi Stagnation (????). Treatment principle: ????, Prescription: ?????
- Deficiency of Liver Yin and Kidney Yin (????). Treatment principle: ????, Prescription: ?????????
- Vacuity of both heart and spleen (????). Treatment principle: ????, Prescription: ???
- ...
23Key Question
- How should patients of a WM disease be divided into subclasses from the TCM perspective?
- What TCM classes?
- What are the characteristics of each TCM class?
- How to differentiate different TCM classes?
- Important for
- Clinical practice
- Research
- Randomized controlled trials for efficacy
- Modern biomedical understanding of TCM concepts
- No consensus. Different doctors/researchers use
different schemes. Key weakness of TCM.
24Key Idea
- Our objective
- Provide an evidence-based method for TCM patient classification
- Key idea
- Cluster analysis of symptom data => empirical partition of patients
- Check to see whether it corresponds to a TCM class concept
- Key technology: multidimensional clustering
- Motivation for developing latent tree analysis
25Symptoms Data of Depressive Patients
(Zhao et al. JACM 2014)
- Subjects
- 604 depressive patients aged between 19 and 69 from 9 hospitals
- Selected using the Chinese Classification of Mental Disorders clinical guideline (CCMD-3)
- Exclusion
- Subjects who took anti-depression drugs within two weeks prior to the survey, women in the gestational and suckling periods, etc.
- Symptom variables
- From the TCM literature on depression between 1994 and 2004.
- Searched with the phrase ?? and ? on the CNKI (China National Knowledge Infrastructure) data
- Kept only those on studies where patients were selected using the ICD-9, ICD-10, CCMD-2, or CCMD-3 guidelines.
- 143 symptoms reported in those studies altogether.
26The Depression Data
- Data as a table
- 604 rows, each for a patient
- 143 columns, each for a symptom
- Table cells: 0 = symptom not present, 1 = symptom present
- Removed: symptoms occurring fewer than 10 times
- 86 symptom variables entered latent tree analysis.
- Structure of the latent tree model obtained is shown on the next two slides.
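The filtering step described above (dropping symptoms that occur fewer than 10 times) can be sketched as follows. The function name, the list-of-rows representation, and the tiny example table are assumptions for illustration; the example uses a threshold of 2 so that it stays small.

```python
def drop_rare_symptoms(rows, names, min_count=10):
    """Keep only the symptom columns that occur at least `min_count` times.

    rows  : list of patients, each a list of 0/1 symptom indicators
    names : symptom names, one per column
    """
    counts = [sum(column) for column in zip(*rows)]
    keep = [j for j, count in enumerate(counts) if count >= min_count]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Toy table: 3 patients, 3 hypothetical symptoms.
rows = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
]
names = ["symptom_a", "symptom_b", "symptom_c"]
kept_rows, kept_names = drop_rare_symptoms(rows, names, min_count=2)
print(kept_names)
```

Applied to the 604-by-143 table with the default threshold, a filter of this kind would leave the symptom variables that enter the analysis.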
27Model Obtained for the Depression Data (Top)
28Model Obtained for the Depression Data (Bottom)
29The Empirical Partitions
- The first cluster (Y29 = s0) consists of 54% of the patients, while the second cluster (Y29 = s1) consists of 46% of the patients.
- The two symptoms "fear of cold" and "cold limbs" do not occur often in the first cluster
- They both tend to occur with high probabilities (0.8 and 0.85) in the second cluster.
30Probabilistic Symptom co-occurrence pattern
- Probabilistic symptom co-occurrence pattern
- The table indicates that the two symptoms "fear of cold" and "cold limbs" tend to co-occur in the cluster Y29 = s1
- Pattern meaningful from the TCM perspective.
- TCM asserts that YANG DEFICIENCY (??) can lead to, among other symptoms, "fear of cold" and "cold limbs"
- So, the co-occurrence pattern suggests the TCM syndrome type (??) YANG DEFICIENCY (??).
- The partition Y29 suggests that
- Among depressive patients, there is a subclass of patients with YANG DEFICIENCY.
- In this subclass, "fear of cold" and "cold limbs" co-occur with high probabilities (0.8 and 0.85)
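As a worked illustration, assume that both symptoms are children of Y29 in the tree (so they are conditionally independent given Y29) and that 0.8 and 0.85 are their respective occurrence probabilities in the cluster Y29 = s1. Under those assumptions, the probability that a patient in this cluster shows both symptoms would be:

```latex
P(\text{fear of cold}, \text{cold limbs} \mid Y_{29} = s_1)
  = P(\text{fear of cold} \mid Y_{29} = s_1)\,
    P(\text{cold limbs} \mid Y_{29} = s_1)
  = 0.8 \times 0.85 = 0.68
```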
31Probabilistic Symptom co-occurrence pattern
- Y28 = s1 captures the probabilistic co-occurrence of "aching lumbus", "lumbar pain like pressure" and "lumbar pain like warmth".
- This pattern is present in 27% of the patients.
- It suggests that
- Among depressive patients, there is a subclass that corresponds to the TCM concept of KIDNEY DEPRIVED OF NOURISHMENT (????)
- Characteristics of the subclass are given by the distributions for Y28 = s1
32Probabilistic Symptom co-occurrence pattern
- Y27 = s1 captures the probabilistic co-occurrence of "weak lumbus and knees" and "cumbersome limbs".
- This pattern is present in 44% of the patients.
- It suggests that
- Among depressive patients, there is a subclass that corresponds to the TCM concept of KIDNEY DEFICIENCY (??)
- Characteristics of the subclass are given by the distributions for Y27 = s1
- Y27, Y28, Y29 together provide evidence for defining KIDNEY YANG DEFICIENCY
33Probabilistic Symptom co-occurrence pattern
- Pattern Y21 = s1: evidence for defining STAGNANT QI TURNING INTO FIRE (????)
- Y15 = s1: evidence for defining QI DEFICIENCY
- Y17 = s1: evidence for defining HEART QI DEFICIENCY
- Y16 = s1: evidence for defining QI STAGNATION
- Y19 = s1: evidence for defining QI STAGNATION IN HEAD
34Probabilistic Symptom co-occurrence pattern
- Y9 = s1: evidence for defining DEFICIENCY OF BOTH QI AND YIN (????)
- Y10 = s1: evidence for defining YIN DEFICIENCY (??)
- Y11 = s1: evidence for defining DEFICIENCY OF STOMACH/SPLEEN YIN (????)
35Symptom Mutual-Exclusion Patterns
- Some empirical partitions reveal symptom exclusion patterns
- Y1 reveals the mutual exclusion of "white tongue coating", "yellow tongue coating" and "yellow-white tongue coating"
- Y2 reveals the mutual exclusion of "thin tongue coating", "thick tongue coating" and "little tongue coating".
36Summary of TCM Data Analysis
- By analyzing 604 cases of depressive patient data using latent tree models, we have discovered a host of probabilistic symptom co-occurrence patterns and symptom mutual-exclusion patterns.
- Most of the co-occurrence patterns have clear TCM syndrome connotations, while the mutual-exclusion patterns are also reasonable and meaningful.
- The patterns can be used as evidence for the task of defining TCM classes in the context of depressive patients and for differentiating between those classes.
37Another Perspective: Statistical Validation of TCM Postulates
(Zhang et al. JACM 2008)
- Y28 = s1: Kidney deprived of nourishment
- Y29 = s1: Yang Deficiency
- TCM terms such as Yang Deficiency were introduced to explain symptom co-occurrence patterns observed in clinical practice.
38Value of Work in View of Others
- D. Haughton and J. Haughton. Living Standards Analytics: Development through the Lens of Household Survey Data. Springer. 2012
- Zhang et al. provide a very interesting application of latent class (tree) models to diagnoses in traditional Chinese medicine (TCM).
- The results tend to confirm known theories in Chinese traditional medicine.
- This is a significant advance, since the scientific bases for these theories are not known.
- The model proposed by the authors provides at least a statistical justification for them.
39Summary
- Latent tree models
- Tree-structured probabilistic graphical models
- Leaf nodes: observed variables
- Internal nodes: latent variables
- What can LTA be used for
- Discovery of co-occurrence patterns in binary data
- Discovery of correlation patterns in general discrete data
- Discovery of latent variables/structures
- Multidimensional clustering
- Topic detection in text data
- Key role in TCM patient classification
40References
- N. L. Zhang (2004). Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5(6):697-723.
- T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2011). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1):2246-2269.
- T. F. Liu, N. L. Zhang, A. H. Liu, L. K. M. Poon (2012). A novel LTM-based method for multidimensional clustering. European Workshop on Probabilistic Graphical Models (PGM-12), 203-210.
- T. F. Liu, N. L. Zhang, P. X. Chen, A. H. Liu, L. K. M. Poon, Y. Wang (2013). Greedy learning of latent tree models for multidimensional clustering. Machine Learning, doi:10.1007/s10994-013-5393-0.
- R. Mourad, C. Sinoquet, N. L. Zhang, T. F. Liu, P. Leray (2013). A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47:157-203, 13 May 2013. doi:10.1613/jair.3879.
- N. L. Zhang, S. H. Yuan, T. Chen, Y. Wang (2008). Statistical validation of TCM theories. Journal of Alternative and Complementary Medicine, 14(5):583-7.
- N. L. Zhang, S. H. Yuan, T. Chen, Y. Wang (2008). Latent tree models and diagnosis in traditional Chinese medicine. Artificial Intelligence in Medicine, 42:229-245.
- Z. X. Xu, N. L. Zhang, Y. Q. Wang, G. P. Liu, J. Xu, T. F. Liu, A. H. Liu (2013). Statistical validation of traditional Chinese medicine syndrome postulates in the context of patients with cardiovascular disease. The Journal of Alternative and Complementary Medicine.
- Y. Zhao, N. L. Zhang, T. F. Wang, Q. G. Wang (2014). Discovering symptom co-occurrence patterns from 604 cases of depressive patient data using latent tree models. The Journal of Alternative and Complementary Medicine.