Title: Compact and Understandable Descriptions of Mixtures of Bernoulli Distributions
1. Compact and Understandable Descriptions of
Mixtures of Bernoulli Distributions
- Jaakko Hollmén and Jarkko Tikka
- Helsinki Institute of Information Technology
- Helsinki University of Technology
- Espoo, Finland
2. Background on the problem
- Collaboration: Knuutila, Myllykangas at the University of Helsinki
- DNA copy number amplifications are mutations in the DNA structure → cancer
- Bibliomics survey of 838 journal articles during 1992-2002
- Data: chromosomal mutations of 4500 cancer patients
3. Example on the data collection
S. Myllykangas, J. Himberg, T. Böhling, B. Nagy,
J. Hollmén, and S. Knuutila. DNA copy number
amplification profiling of human neoplasms.
Oncogene, 25(55):7324-7332, November 2006
4. Chromosomal region names
- Standardized nomenclature for chromosomal regions (spatial)
- 1p36.2: chromosome 1, the arm p, region 36, subregion 2
- Ranges: 1p36.1-p36.3
- Hierarchical, irregular naming scheme used in literature
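
The naming scheme above can be illustrated with a small parser. This is only a sketch: the `Band` type, the function names, and the regular expression are assumptions made for illustration, and the pattern covers just the simple forms shown on the slide (for example 1p36.2 and 1p36.1-p36.3), not the full irregular nomenclature found in the literature.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Band(NamedTuple):
    chromosome: str            # "1".."22", "X" or "Y"
    arm: str                   # "p" or "q"
    region: str                # e.g. "36"
    subregion: Optional[str]   # e.g. "2", or None when absent


# Hypothetical pattern for simple names such as "1p36.2" or "Xq28".
_BAND = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d+)(?:\.(\d+))?$")


def parse_band(name: str) -> Band:
    """Split a single band name into chromosome, arm, region, subregion."""
    match = _BAND.match(name)
    if match is None:
        raise ValueError(f"not a recognized band name: {name!r}")
    return Band(*match.groups())


def parse_range(name: str) -> Tuple[Band, Band]:
    """Parse 'start-end'; the end may omit the chromosome, as in '1p36.1-p36.3'."""
    start_text, sep, end_text = name.partition("-")
    start = parse_band(start_text)
    if not sep:
        return start, start
    if end_text[:1] in ("p", "q"):
        # Inherit the chromosome from the start of the range.
        end_text = start.chromosome + end_text
    return start, parse_band(end_text)


single = parse_band("1p36.2")
span = parse_range("1p36.1-p36.3")
```

Names that drop the arm at the end of a range are deliberately rejected here rather than guessed.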
5. DNA copy number amplification data as 0-1 data
[Figure: 0-1 data matrix; axes: cancer patients (i) and chromosomal areas, spatial coordinates (j)]
6. Mixture models for 0-1 data
- Cancer is a collection of diseases
- Finite mixture model of multivariate Bernoulli
distributions
- Learn the model with the EM algorithm
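
As a rough illustration of the last two bullets, the following is a minimal EM sketch for a finite mixture of multivariate Bernoulli distributions in plain Python. The function names, the random initialization, and the small additive smoothing constant `eps` (used only to keep parameters away from 0 and 1) are assumptions for this sketch, not details taken from the slides.

```python
import math
import random


def responsibilities(x, pis, thetas):
    """Return posterior component probabilities for x and its log-likelihood."""
    logs = []
    for pi, theta in zip(pis, thetas):
        ll = math.log(pi)
        for xi, t in zip(x, theta):
            ll += math.log(t if xi else 1.0 - t)
        logs.append(ll)
    m = max(logs)                      # log-sum-exp for numerical stability
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    return [w / total for w in weights], m + math.log(total)


def fit_em(data, J, n_iter=50, seed=0, eps=1e-3):
    """Fit mixing proportions pis[j] and Bernoulli parameters thetas[j][i]."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    pis = [1.0 / J] * J
    thetas = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(J)]
    history = []                       # log-likelihood before each M-step
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row.
        post, loglik = [], 0.0
        for x in data:
            r, ll = responsibilities(x, pis, thetas)
            post.append(r)
            loglik += ll
        history.append(loglik)
        # M-step: weighted averages, lightly smoothed (assumption).
        for j in range(J):
            nj = sum(r[j] for r in post)
            pis[j] = (nj + eps) / (n + J * eps)
            for i in range(d):
                num = sum(r[j] * x[i] for r, x in zip(post, data))
                thetas[j][i] = (num + eps) / (nj + 2 * eps)
    return pis, thetas, history


# Toy data with two obvious groups of rows (made up for illustration).
toy = [[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10
pis, thetas, history = fit_em(toy, J=2)
```

EM only finds a local optimum, so in practice several random restarts would normally be compared.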
7. Model selection: how many components in a mixture?
- 5-fold cross validation repeated 10 times
- Try different solutions, based on average likelihood for a validation set → J = 6
[Figure legend: training, validation]
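
The selection procedure above can be sketched as a repeated k-fold loop. The helper names are hypothetical, and `fit` and `avg_loglik` are placeholder callables standing in for whatever model fitting and validation scoring is used; the slides do not specify an implementation.

```python
import random
from statistics import mean


def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train_idx, valid_idx) for `repeats` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]


def select_num_components(data, candidates, fit, avg_loglik,
                          k=5, repeats=10, seed=0):
    """Pick the J with the highest mean validation score over all splits.

    fit(train_rows, J) -> model and avg_loglik(model, valid_rows) -> float
    are supplied by the caller (assumed interfaces).
    """
    scores = {}
    for J in candidates:
        values = []
        for train, valid in repeated_kfold(len(data), k, repeats, seed):
            model = fit([data[i] for i in train], J)
            values.append(avg_loglik(model, [data[i] for i in valid]))
        scores[J] = mean(values)
    best = max(scores, key=scores.get)
    return best, scores
```

Reusing the same seed for every candidate J means all candidates are scored on identical splits, which keeps the comparison fair.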
8. Mixture model: Chromosome 1
[Figure: mixture components j = 1,...,6 against chromosomal areas (spatial coordinates)]
- Model is summarized by J + Jd parameters (about 200 parameters altogether)
9. Mixture model in clustering
[Figure: clustered cancer patients against chromosomal areas (spatial coordinates)]
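
One common way to use a fitted mixture for clustering is to assign each patient to the component with the largest posterior probability. The sketch below assumes that hard-assignment rule; the parameter values and patient rows are made up for illustration.

```python
import math


def posterior(x, pis, thetas):
    """Posterior probability of each mixture component given a 0-1 vector x."""
    logs = []
    for pi, theta in zip(pis, thetas):
        ll = math.log(pi)
        for xi, t in zip(x, theta):
            ll += math.log(t if xi else 1.0 - t)
        logs.append(ll)
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]


def assign_cluster(x, pis, thetas):
    """Index of the component with the highest posterior probability."""
    post = posterior(x, pis, thetas)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical two-component model over three chromosomal areas.
pis = [0.5, 0.5]
thetas = [[0.9, 0.8, 0.1],
          [0.1, 0.2, 0.7]]
patients = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
clusters = [assign_cluster(x, pis, thetas) for x in patients]
```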
10. Solution creates a problem
- We solved the modeling problem, but created a communications problem!
- How do the cancer experts understand and refer to our models? Names?
11. Compact and Understandable Descriptions
- Understandable (language, nomenclature)
- Compact (size of the description)
- Describe the parameters of the model
- Use the model to cluster the data and describe
the data in the clusters
12. Describe the model parameters
- Mode of the component distribution
  - most probable chromosomal area
- Hypothetical mean organism (HMO)
  - quantize the parameters to represent a hypothetical case of data
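
Both parameter-based descriptions are simple to compute from one component's Bernoulli parameters. In this sketch the quantization threshold of 0.5 is an assumption (the slide only says the parameters are quantized), and the area names and parameter values are invented for illustration.

```python
def mode_area(theta, area_names):
    """Name of the chromosomal area with the largest Bernoulli parameter."""
    best = max(range(len(theta)), key=theta.__getitem__)
    return area_names[best]


def hypothetical_mean_organism(theta, threshold=0.5):
    """Quantize parameters to a 0-1 vector (threshold is an assumed choice)."""
    return [1 if t > threshold else 0 for t in theta]


# Made-up parameters of one component over four areas.
areas = ["1q21", "1q22", "1q23", "1q24"]
theta = [0.2, 0.7, 0.6, 0.1]

mode = mode_area(theta, areas)
hmo = hypothetical_mean_organism(theta)
```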
13. Describe the clustered data
- Describe the margins of the clusters with maximal frequent itemsets
- Why maximal? Describe the largest representative commonality in the data; extracting all frequent itemsets is not feasible
- Express the itemsets as ranges of contiguous chromosomal areas
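
The idea of maximal frequent itemsets and their expression as contiguous ranges can be shown on toy data. The sketch below uses a naive level-wise search that enumerates all frequent itemsets before filtering, which is exactly what does not scale to real data; a dedicated maximal-itemset miner would be needed there. All names, the `min_count` threshold, and the example rows are assumptions for illustration.

```python
from itertools import combinations


def frequent_itemsets(rows, min_count):
    """Level-wise search; rows are sets of column indices that equal 1."""
    items = sorted({i for row in rows for i in row})
    level = [frozenset([i]) for i in items]
    frequent = []
    while level:
        kept = [s for s in level if sum(s <= row for row in rows) >= min_count]
        frequent.extend(kept)
        # Join step: unions of two kept sets that are exactly one item larger.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return frequent


def maximal(itemsets):
    """Keep only itemsets that have no frequent proper superset."""
    return [s for s in itemsets if not any(s < t for t in itemsets)]


def as_ranges(itemset, names, chromosome):
    """Write an itemset as ranges of contiguous areas, e.g. '1q21-q23'."""
    idx = sorted(itemset)
    runs, start = [], idx[0]
    for prev, cur in zip(idx, idx[1:] + [None]):
        if cur != prev + 1:            # the current run ends at prev
            if start == prev:
                runs.append(f"{chromosome}{names[start]}")
            else:
                runs.append(f"{chromosome}{names[start]}-{names[prev]}")
            start = cur
    return runs


# Toy cluster: five patients over four adjacent areas of chromosome 1.
names = ["q21", "q22", "q23", "q24"]
rows = [{0, 1, 2}, {0, 1, 2}, {0, 1}, {3}, {1, 2}]
max_sets = maximal(frequent_itemsets(rows, min_count=2))
descriptions = [as_ranges(s, names, "1") for s in max_sets]
```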
14. Descriptions, Chromosome 1
- Maximal frequent itemsets extracted globally: 1q21-q22, 1q22-q23
- Shadowing and spurious mutations
15. Amplification models and patterns
1q32-q44, 1q11-q44, 1q21-q25, 1q21-q23,
1p35-p32, 2p15-p14, 2q32, 2p25-2p24, 2p24-p23,
2p25-2p11.1, 3q26.1-q26.3, 3q11.1-q29, 3p26-q29,
3q25-q29, 3p24, 3q27-q29, 4q12, 4p15.3-p12,
5p13-p12, 5p15.3-p11, 5p15.3-5q35, 6q22,
6p25-q27, 6p25-p22, 6p12, 6p25-p11.1, 6q21-q27,
7q3- q36, 7p21, 7p13-p11.2, 7q21, 7p22-q36,
7p22-p11.1, 8p23-q24.3, 8q24.1-q24.3, 8q23,
8q21.1-q22, 8q21.1-q24.3, 8q11.1-q24.3, 9q11-q34,
9p24-q34, 9q34, 9p24-p21, 10q11.1-q26, 10p15-p12,
11q11-q25, 11p15-q25, 11q23, 11q13, 11q14-q22,
11p12-p11.2, 11q12-q13, 12p13-p11.1, 12q13-q15,
12q11-q21, 12q12-12q23, 12q24.1-q24.3, 12p12,
12q14-q15, 13q32-q34, 13p13-q34, 13q13-q14,
13q22-q34, 13q22-q31, 13q11-q34,
14q12-q21, 14q12-q32, 14q32, 15q11.1-q26,
15q24-q25, 16p13.3-q24, 16p13.3-p11.1, 16q22,
16p13.1-p12, 17q11.1-q25, 17p13-11.1, 17q21-q25,
17q12-q21, 17p13-q25, 17q24-q25, 17q22,
18q11.1-q23, 18q21, 18p11.3-18q23, 18p11.3-11.1,
19q13.1, 19p13.3-p13.2, 19p13.3-q13.4,
19q13.1-q13.4, 20q12, 20p12-p11.2, 20q11.1-q13.3,
20p13-q13.3, 20q13.1-q13.3, 20q11.1-q12,
21p13-q22, 21q11.2-q21, 21q21-q22, 21q11.1-q22,
22q11.1-q13, 22q13, 22p13-q13, Xp22.3-q28,
Xp22.1-p11.2, Xq26-q28, Xq11-q28
16. Summary and Conclusions
- DNA copy number amplifications (mutations) in cancer: database collected from literature
- Mixture modeling of 0-1 data
- Models summarized based on parameters and clustered data with maximal frequent itemsets
- The collection of DNA copy number amplifications forms a new basis for cancer classification