Title: Projected clustering algorithms and their application on genomic data analysis
1Projected clustering algorithms and their
application on genomic data analysis
- Probation Talk
- M.Phil. candidate: Kevin Yip
- Supervisor: Dr. D. Cheung
- 20th Dec 2002
2Presentation Outline
- The research problem: clustering high-dimensional datasets.
- Recent work in the community: projected (subspace) clustering.
- My previous work: HARP and HARP.1.
- My current and future work: studying genomic data, comparing different algorithms, designing similarity measures, etc.
3Curse of Dimensionality
- In a dataset of high dimensionality, some attributes may not help identify the characteristics of data points.
- If the number of such attributes is high, a point can be almost as far from its nearest neighbor as from its farthest neighbor.
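The second point above can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly at random from the unit hypercube so that no attribute carries any structure (an assumption made for illustration; the talk does not specify a data distribution). The function name `nearest_to_farthest_ratio` is hypothetical.

```python
import math
import random


def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of nearest- to farthest-neighbor distance for one query point.

    Assumption: every attribute is uniform noise in [0, 1], i.e. no
    attribute helps identify any structure in the data.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    distances = [math.dist(query, p) for p in others]
    return min(distances) / max(distances)


if __name__ == "__main__":
    for dims in (2, 20, 200, 2000):
        ratio = nearest_to_farthest_ratio(n_points=200, n_dims=dims)
        print(f"{dims:5d} dimensions: nearest/farthest = {ratio:.2f}")
```

A ratio close to 1 means the nearest and farthest neighbors are almost equally far away, so distance alone no longer separates similar points from dissimilar ones.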
4Curse of Dimensionality
      A1   A2
R1     3    8
R2     4    7
R3     9    1

D(R1, R2) = 1.4
D(R2, R3) = 7.8
Difference = 6.4
- Resolution: feature selection
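The distances on this slide are consistent with the Euclidean distance between the records R1 = (3, 8), R2 = (4, 7) and R3 = (9, 1). A short check, assuming that metric:

```python
import math

# Records from the example: (A1, A2) values.
records = {"R1": (3, 8), "R2": (4, 7), "R3": (9, 1)}

d12 = math.dist(records["R1"], records["R2"])  # sqrt(1 + 1)
d23 = math.dist(records["R2"], records["R3"])  # sqrt(25 + 36)

print(f"D(R1, R2) = {d12:.1f}")
print(f"D(R2, R3) = {d23:.1f}")
print(f"Difference = {d23 - d12:.1f}")
```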
5Projected Clustering
- What feature selection cannot do: find different relevant attributes for different clusters.
6Projected Clustering
- Clustering: identify similar objects and form clusters.
- Projected clustering: identify similar objects and the relevant attributes, and form clusters.
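One way to picture the difference is to store, with each cluster, the attributes that are relevant to it, and to measure similarity only on those attributes. The sketch below is illustrative only: the class `ProjectedCluster`, the helper `assign`, and all data values are made up here and do not come from any of the algorithms discussed in the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class ProjectedCluster:
    """A cluster together with the attributes that are relevant to it."""
    name: str
    centroid: tuple   # one value per attribute of the dataset
    relevant: tuple   # indices of the relevant attributes

    def distance(self, point):
        """Root-mean-square difference measured on relevant attributes only."""
        squared = [(point[i] - self.centroid[i]) ** 2 for i in self.relevant]
        return math.sqrt(sum(squared) / len(squared))


# Toy clusters over four attributes; all values are made up for illustration.
clusters = [
    ProjectedCluster("C1", centroid=(1.0, 1.0, 5.0, 5.0), relevant=(0, 1)),
    ProjectedCluster("C2", centroid=(5.0, 5.0, 9.0, 9.0), relevant=(2, 3)),
]


def assign(point):
    """Return the name of the cluster closest to `point` in its own subspace."""
    return min(clusters, key=lambda c: c.distance(point)).name


print(assign((1.2, 0.9, 9.5, 0.1)))  # close to C1 on attributes 0 and 1
print(assign((0.3, 8.7, 9.1, 8.8)))  # close to C2 on attributes 2 and 3
```

Dividing by the number of relevant attributes keeps the distances comparable between clusters whose subspaces have different sizes.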
7Projected Clustering
- Existing approaches:
  - Grid-based dimension selection
  - Association rule hypergraph partitioning
  - Context-specific Bayesian clustering
  - Monte Carlo algorithm
  - Projective clustering
- All produce projected clusters successfully in
different ways.
8Projected Clustering
- Some problems with the approaches:
  - Accuracy depends on hard-to-determine input parameters.
  - Unable to determine the dimensionality of each cluster automatically.
  - Can only use density as similarity measure.
- In addition, the algorithms have been tested on few real datasets.
9HARP and HARP.1
- HARP (A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering)
  - A hierarchical clustering framework with no pre-assumed similarity measure.
  - Requires no hard-to-determine user inputs.
  - Determines the relevant attributes of each cluster automatically.
10HARP and HARP.1
- HARP.1: an implementation of HARP using attribute value density to define the similarity measure
  - Makes use of global statistics in attribute selection.
  - Handles mutual disagreement and information loss of the potentially merging clusters.
  - Accepts mixed types of attributes.
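To make "global statistics in attribute selection" concrete, the sketch below scores an attribute by comparing its variance inside a cluster with its variance over the whole dataset. This index, the 0.9 threshold, the function names and the data are all assumptions chosen for illustration; the slides do not give HARP.1's actual formula, and the sketch handles numeric attributes only.

```python
from statistics import pvariance


def relevance(cluster_rows, all_rows, attribute):
    """1 - (variance inside the cluster) / (variance over the whole dataset).

    Values near 1 mean the cluster is much more compact on this attribute
    than the dataset as a whole; values near or below 0 mean the attribute
    does not distinguish the cluster. Illustrative assumption, not a
    formula quoted from the talk.
    """
    local = pvariance([row[attribute] for row in cluster_rows])
    global_ = pvariance([row[attribute] for row in all_rows])
    return 1.0 - local / global_


def relevant_attributes(cluster_rows, all_rows, threshold=0.9):
    """Indices of attributes whose relevance reaches the threshold."""
    n_attributes = len(all_rows[0])
    return [a for a in range(n_attributes)
            if relevance(cluster_rows, all_rows, a) >= threshold]


# Made-up numeric data: three attributes, two clusters of three records.
cluster_a = [(1.0, 1.1, 2.0), (1.1, 0.9, 9.0), (0.9, 1.0, 5.0)]
cluster_b = [(2.0, 8.0, 7.0), (9.0, 8.1, 7.1), (5.0, 7.9, 6.9)]
dataset = cluster_a + cluster_b

print(relevant_attributes(cluster_a, dataset))
print(relevant_attributes(cluster_b, dataset))
```

With these values the two clusters end up with different attribute sets, which is the situation a single global feature selection step cannot express.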
11HARP and HARP.1
- Some experimental results (synthetic data)
Dataset   d    l    HARP.1   PROCLUS   Traditional   ORCLUS
SynCat1   20   12   0.0 / 5.0    3.6 / 1.4    6.7 / 3.7    2.6 / 26.4    5.8 / 5.3    N/A
SynMix1   20   12   0.4 / 6.8    2.2 / 17.0   6.8 / 10.1   11.6 / 11.2   7.9 / 4.6    N/A
SynNum1   20   12   0.8 / 5.0    1.8 / 21.4   7.2 / 8.3    4.4 / 32.0    5.9 / 9.2    0.4 / 23.8   2.31 / 8.15
d: dimensionality of the dataset
l: average number of relevant attributes
Best score: error / outlier
Average: error / outlier
12HARP and HARP.1
- Some experimental results (real data)
Dataset    d    l    HARP.1   PROCLUS   Traditional   ROCK
Soybean    35   26   0.0 / 0.0    0.0 / 0.0    17.3 / 0.0   2.1 / 0.0     9.2 / 0.0    No published result
Voting     16   11   6.4 / 13.6   2.1 / 55.6   13.8 / 7.9   13.1 / 11.3   13.1 / 1.9   6.2 / 14.5
Mushroom   22   15   1.4 / 0.0    3.2 / 0.0    9.0 / 0.0    6.0 / 0.0     5.2 / 0.0    0.4 / 0.0
d: dimensionality of the dataset
l: average number of relevant attributes (determined by HARP.1)
Best score: error / outlier
Average: error / outlier
13Current and Future Work
- A real application of projected clustering: analyzing genomic data
  - High dimensionality
  - Noisy
  - Correct partition not always available
  - A lot of hidden information
  - New data available with a high growth rate
14Current and Future Work
15Current and Future Work
- Codon usage data
  - Study the relationship between the frequencies of different codons and the functions of the genes.
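As a concrete picture of what codon usage data looks like, the following sketch turns a coding sequence into a vector of codon frequencies. It assumes the sequence starts in frame; the function name `codon_frequencies` and the example sequence are made up for illustration.

```python
from collections import Counter


def codon_frequencies(coding_sequence):
    """Relative frequency of each codon in a coding DNA sequence.

    Assumes the sequence starts in frame; trailing bases that do not
    form a complete codon are ignored.
    """
    sequence = coding_sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: count / total for codon, count in counts.items()}


# Made-up sequence for illustration only (not a real gene).
freqs = codon_frequencies("ATGGCTGCTAAAGCTTAA")
for codon, freq in sorted(freqs.items()):
    print(codon, round(freq, 3))
```

Each gene then becomes one record with up to 64 numeric attributes (one per codon), which is the kind of high-dimensional input the clustering algorithms above operate on.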
16Current and Future Work
- Transcriptome data
  - The full complement of activated genes, mRNAs, or transcripts in a particular tissue at a particular time.
  - Study the expression of the genes:
    - In different samples
    - Under different situations
    - At different times
17Current and Future Work
18Current and Future Work
19Current and Future Work
- A sample clustering result
Operon 12_1: trs5_7 0, yefJ 4, wbbK 4, wbbJ 4, wbbI 4, wbbH 4, glf 4, rfbX 4, rfbC 4, rfbA 4, rfbD 4, rfbB 4
Operon 12_2: hyfA 0, hyfB 0, hyfC 0, hyfD 0, hyfE 6, hyfF 0, hyfG 0, hyfH 0, hyfI 0, b2490 9, hyfR 0, focB 0
Operon 12_3: rpmJ 6, prlA 0, rplO 2, rpmD 2, rpsE 2, rplR 2, rplF 2, rpsH 2, rpsN 2, rplE 2, rplX 2, rplN 2
20Current and Future Work
- Goal: knowledge → clustering method
  - Subspace vs. non-subspace
  - Similarity: density, correlation, pattern, etc.
  - Usability: sensitivity to input parameters
  - Algorithm type: hierarchical, partitional, graph-based, model-based, etc.
  - Preprocessing: logarithm, mean-centering, PCA, FCA, ICA, etc.
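Two of the preprocessing steps listed above (logarithm and mean-centering) and one of the similarity choices (correlation) can be sketched in a few lines. The base-2 logarithm, the function names and the expression values are assumptions made for illustration.

```python
import math


def log_transform(profile):
    """Base-2 logarithm of each (strictly positive) expression value."""
    return [math.log2(value) for value in profile]


def mean_center(profile):
    """Subtract the profile's mean so the values sum to zero."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    cx, cy = mean_center(x), mean_center(y)
    numerator = sum(a * b for a, b in zip(cx, cy))
    denominator = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return numerator / denominator


# Made-up expression levels of two genes across four samples.
gene_a = [100.0, 200.0, 400.0, 800.0]
gene_b = [10.0, 20.0, 40.0, 80.0]

log_a = mean_center(log_transform(gene_a))
log_b = mean_center(log_transform(gene_b))
print(log_a)
print(log_b)
print(pearson(log_a, log_b))
```

The two raw profiles differ tenfold in magnitude, yet after the log transform and mean-centering they essentially coincide. This is why the choice of preprocessing and similarity measure determines which genes a clustering method treats as alike.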
21Current and Future Work
- Other subtasks:
  - Internal validation for projected clustering
  - Designing classifiers based on current results
  - Building an integrated system for genomic data analysis
22Conclusion
- Research space: both theoretical and practical.
- Algorithms
  - Projected clustering algorithms have much room to improve.
  - Many open problems in high-dimensional data analysis.
- Genomic data
  - A lot to discover. CS people can really help.
23References
- References for projected clustering can be found in the slides of the presentations "HARP: A Hierarchical Approach with Automatic Relevant Attribute Selection for Projected Clustering" and "The Subspace Clustering Problem".
- Reference web sites for the bioinformatics materials covered in this presentation:
  - Human Genome Project Information
  - HKU-Pasteur Research Centre
  - Companion Web Site, Concepts of Genetics (6th Ed.)
24Source of Figures
- Projected clusters (p. 5): C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering. In ACM SIGMOD International Conference on Management of Data, 2002.
- Generalized animal cell, genetic information flow (p. 14, 15): William S. Klug and Michael R. Cummings. Concepts of Genetics, Sixth Edition (p. 18, 350).
- DNA with features (p. 14): Human Genome Project Information, http://www.ornl.gov/hgmis/.
25Source of Figures
- GeneChip Arrays for Gene Expression Analysis (p. 17): Affymetrix, http://www.affymetrix.com/.
- An illuminated microarray (p. 17): EMBL-EBI, http://www.ebi.ac.uk/.
26Thank You!