Projected clustering algorithms and their application on genomic data analysis

1
Projected clustering algorithms and their
application on genomic data analysis

Probation Talk
M.Phil. candidate: Kevin Yip
Supervisor: Dr. D. Cheung
20th Dec 2002

2
Presentation Outline

The research problem: clustering high-dimensional
datasets.
Recent work in the community: projected
(subspace) clustering.
My previous work: HARP and HARP.1.
My current and future work: studying genomic
data, comparing different algorithms, designing
similarity measures, etc.

3
Curse of Dimensionality

In a dataset of high dimensionality, some
attributes may not help identify the
characteristics of data points.
If the number of such attributes is high, a point
can be nearly as far from its nearest neighbor as
from its farthest neighbor.

4
Curse of Dimensionality
      A1   A2
R1     3    8
R2     4    7
R3     9    1

D(R1, R2) = 1.4
D(R2, R3) = 7.8
Difference = 6.4

Resolution: feature selection
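
The arithmetic on this slide can be checked with a short sketch. It recomputes the two Euclidean distances from the table and then, as an illustration that is not on the slide, appends uniformly random attributes to the three records to show how the gap between the near pair and the far pair shrinks relative to the distances themselves. The function names, the value range of the added attributes and the random seed are assumptions.

```python
import math
import random

# Records from the slide: attributes A1 and A2.
R1, R2, R3 = (3, 8), (4, 7), (9, 1)

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d12 = euclidean(R1, R2)
d23 = euclidean(R2, R3)
print(round(d12, 1), round(d23, 1), round(d23 - d12, 1))  # 1.4 7.8 6.4

def relative_contrast(points, n_noise, seed=0):
    """(far - near) / near after appending n_noise random attributes.

    The added attributes are drawn uniformly from [0, 10); this range
    is an assumption chosen to match the scale of A1 and A2.
    """
    rng = random.Random(seed)
    padded = [
        tuple(p) + tuple(rng.uniform(0, 10) for _ in range(n_noise))
        for p in points
    ]
    near = euclidean(padded[0], padded[1])
    far = euclidean(padded[1], padded[2])
    return (far - near) / near

points = [R1, R2, R3]
for n in (0, 10, 100):
    print(n, round(relative_contrast(points, n), 3))
```

With no added attributes the far pair is several times farther apart than the near pair; with 100 irrelevant attributes the two distances become almost equal, which is the effect the slide describes.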

5
Projected Clustering

What feature selection cannot do: find different
relevant attributes for different clusters.

6
Projected Clustering

Clustering: identify similar objects and form
clusters.
Projected clustering: identify similar objects
and the relevant attributes, and form clusters.
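
One simple way to picture "relevant attributes" is to call an attribute relevant to a cluster when the cluster's values on it are much less spread out than the values of the whole dataset. The sketch below uses that variance-ratio rule. The rule, the threshold of 0.25 and the toy data are assumptions made for illustration; they are not taken from any of the algorithms discussed in this talk.

```python
from statistics import pvariance

def relevant_attributes(cluster, dataset, threshold=0.25):
    """Indices of attributes whose variance inside the cluster is at most
    `threshold` times their variance over the whole dataset."""
    n_attrs = len(dataset[0])
    selected = []
    for j in range(n_attrs):
        global_var = pvariance([row[j] for row in dataset])
        local_var = pvariance([row[j] for row in cluster])
        if global_var > 0 and local_var <= threshold * global_var:
            selected.append(j)
    return selected

# Toy data: cluster A is tight on attributes 0 and 1, cluster B on 1 and 2.
cluster_a = [(1.0, 5.0, 0.5), (1.2, 5.1, 9.0), (0.9, 4.9, 4.0)]
cluster_b = [(8.0, 0.9, 7.0), (2.0, 1.0, 7.1), (5.0, 1.1, 6.9)]
dataset = cluster_a + cluster_b

print(relevant_attributes(cluster_a, dataset))  # [0, 1]
print(relevant_attributes(cluster_b, dataset))  # [1, 2]
```

The two clusters end up with different attribute sets, which a single global feature selection step could not express.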

7
Projected Clustering

Existing approaches:
Grid-based dimension selection
Association rule hypergraph partitioning
Context-specific Bayesian clustering
Monte Carlo algorithm
Projective clustering
All produce projected clusters successfully in
different ways.

8
Projected Clustering

Some problems with the approaches:
Accuracy depends on hard-to-determine input
parameters.
Unable to determine the dimensionality of each
cluster automatically.
Can only use density as the similarity measure.
In addition, the algorithms have been tested on
few real datasets.

9
HARP and HARP.1

HARP (A Hierarchical Algorithm with Automatic
Relevant Attribute Selection for Projected
Clustering)
A hierarchical clustering framework with no
pre-assumed similarity measure.
Requires no hard-to-determine user inputs.
Determines the relevant attributes of each
cluster automatically.

10
HARP and HARP.1

HARP.1 - an implementation of HARP using
attribute value density to define the similarity
measure
Makes use of global statistics in attribute
selection.
Handles mutual disagreement and information loss
of the potentially merging clusters.
Accepts mixed types of attributes.

11
HARP and HARP.1

Some experimental results (synthetic data)

Dataset  d   l   HARP.1      PROCLUS                  Traditional               ORCLUS
                             best        average      best         average      best         average
SynCat1  20  12  0.0 / 5.0   3.6 / 1.4   6.7 / 3.7    2.6 / 26.4   5.8 / 5.3    N/A          N/A
SynMix1  20  12  0.4 / 6.8   2.2 / 17.0  6.8 / 10.1   11.6 / 11.2  7.9 / 4.6    N/A          N/A
SynNum1  20  12  0.8 / 5.0   1.8 / 21.4  7.2 / 8.3    4.4 / 32.0   5.9 / 9.2    0.4 / 23.8   2.31 / 8.15

d: dimensionality of the dataset
l: average number of relevant attributes
Each entry: error / outlier
best: result with the best score; average: average result
12
HARP and HARP.1

Some experimental results (real data)

Dataset   d   l   HARP.1       PROCLUS                  Traditional               ROCK
                               best        average      best         average
Soybean   35  26  0.0 / 0.0    0.0 / 0.0   17.3 / 0.0   2.1 / 0.0    9.2 / 0.0    No published result
Voting    16  11  6.4 / 13.6   2.1 / 55.6  13.8 / 7.9   13.1 / 11.3  13.1 / 1.9   6.2 / 14.5
Mushroom  22  15  1.4 / 0.0    3.2 / 0.0   9.0 / 0.0    6.0 / 0.0    5.2 / 0.0    0.4 / 0.0

d: dimensionality of the dataset
l: average number of relevant attributes
(determined by HARP.1)
Each entry: error / outlier
best: result with the best score; average: average result
13
Current and Future Work

A real application of projected clustering -
analyzing genomic data
High dimensionality
Noisy
Correct partition not always available
A lot of hidden information
New data available with a high growth rate

14
Current and Future Work
15
Current and Future Work

Codon usage data
Study the relationship between the frequencies of
different codons and the functions of the genes.
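
As a rough sketch of what codon usage data looks like, the function below counts the relative frequency of each codon in a coding sequence. The function name and the short example sequence are made up for illustration, and the sequence is assumed to be read in frame from its first base.

```python
from collections import Counter

def codon_frequencies(sequence):
    """Relative frequency of each codon, reading in frame from position 0."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}

# Hypothetical six-codon sequence: ATG GCT GCT AAA GCT TAA
freqs = codon_frequencies("ATGGCTGCTAAAGCTTAA")
print(freqs["GCT"])  # 0.5
```

Since there are 64 possible codons, each gene becomes a point with up to 64 attributes, which is the kind of high-dimensional input that clustering would then work on.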

16
Current and Future Work

Transcriptome data
The full complement of activated genes, mRNAs, or
transcripts in a particular tissue at a
particular time.
Study the expression of the genes:
In different samples
Under different situations
At different times

17
Current and Future Work

Transcriptome data

18
Current and Future Work

Transcriptome data

19
Current and Future Work

A sample clustering result

Operon 12_1: trs5_7 0, yefJ 4, wbbK 4, wbbJ 4,
wbbI 4, wbbH 4, glf 4, rfbX 4, rfbC 4, rfbA 4,
rfbD 4, rfbB 4
Operon 12_2: hyfA 0, hyfB 0, hyfC 0, hyfD 0,
hyfE 6, hyfF 0, hyfG 0, hyfH 0, hyfI 0, b2490 9,
hyfR 0, focB 0
Operon 12_3: rpmJ 6, prlA 0, rplO 2, rpmD 2,
rpsE 2, rplR 2, rplF 2, rpsH 2, rpsN 2, rplE 2,
rplX 2, rplN 2
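
A quick way to read this result is to ask how many genes in each operon share the same label. The sketch below does that, assuming the number after each gene name is the cluster the gene was assigned to (the transcript does not say so explicitly). The gene and label pairs are copied from the slide; the function name is hypothetical.

```python
from collections import Counter

# (gene, label) pairs copied from the slide; labels are assumed to be cluster IDs.
operons = {
    "12_1": [("trs5_7", 0), ("yefJ", 4), ("wbbK", 4), ("wbbJ", 4),
             ("wbbI", 4), ("wbbH", 4), ("glf", 4), ("rfbX", 4),
             ("rfbC", 4), ("rfbA", 4), ("rfbD", 4), ("rfbB", 4)],
    "12_2": [("hyfA", 0), ("hyfB", 0), ("hyfC", 0), ("hyfD", 0),
             ("hyfE", 6), ("hyfF", 0), ("hyfG", 0), ("hyfH", 0),
             ("hyfI", 0), ("b2490", 9), ("hyfR", 0), ("focB", 0)],
    "12_3": [("rpmJ", 6), ("prlA", 0), ("rplO", 2), ("rpmD", 2),
             ("rpsE", 2), ("rplR", 2), ("rplF", 2), ("rpsH", 2),
             ("rpsN", 2), ("rplE", 2), ("rplX", 2), ("rplN", 2)],
}

def majority_agreement(pairs):
    """Return (most common label, number of genes carrying it, total genes)."""
    labels = [label for _, label in pairs]
    label, count = Counter(labels).most_common(1)[0]
    return label, count, len(labels)

for name, pairs in operons.items():
    label, count, total = majority_agreement(pairs)
    print(f"Operon {name}: label {label} on {count} of {total} genes")
```

Under that reading, most genes of each operon fall under a single label (11 of 12, 10 of 12 and 10 of 12).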
20
Current and Future Work

Goal: knowledge → clustering method
Subspace vs. non-subspace
Similarity: density, correlation, pattern, etc.
Usability: sensitivity to input parameters
Algorithm type: hierarchical, partitional,
graph-based, model-based, etc.
Preprocessing: logarithm, mean-centering, PCA,
FCA, ICA, etc.
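
Two of the choices listed above, preprocessing and the similarity measure, interact with each other. The sketch below takes two made-up expression profiles with the same shape but a tenfold difference in level and compares them by Euclidean distance and by Pearson correlation, before and after a log transform followed by mean-centering. The profiles and helper names are assumptions for illustration only.

```python
import math

def log_transform(profile):
    """Base-2 logarithm of strictly positive expression values."""
    return [math.log2(x) for x in profile]

def mean_center(profile):
    """Subtract the profile's own mean from every value."""
    m = sum(profile) / len(profile)
    return [x - m for x in profile]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    cx, cy = mean_center(x), mean_center(y)
    num = sum(a * b for a, b in zip(cx, cy))
    den = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return num / den

# Hypothetical profiles: same pattern, ten times the expression level.
gene_a = [100.0, 200.0, 400.0, 200.0]
gene_b = [1000.0, 2000.0, 4000.0, 2000.0]

raw_distance = euclidean(gene_a, gene_b)
prepared_a = mean_center(log_transform(gene_a))
prepared_b = mean_center(log_transform(gene_b))
prepared_distance = euclidean(prepared_a, prepared_b)

print(round(raw_distance, 1))       # 4500.0
print(round(prepared_distance, 6))  # 0.0
print(round(pearson(gene_a, gene_b), 6))  # 1.0
```

A distance-based measure on the raw values treats the two genes as very different, while correlation, or distance after the log transform and mean-centering, treats them as the same pattern. Which behavior is wanted depends on the biological question.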

21
Current and Future Work

Other subtasks
Internal validation for projected clustering
Designing classifiers based on current results
Building an integrated system for genomic data
analysis

22
Conclusion

Research space: both theoretical and practical.
Algorithms
Projected clustering algorithms have much room
for improvement.
Many open problems in high-dimensional data
analysis.
Genomic data
A lot to discover. CS people can really help.

23
References

References for projected clustering can be found
in the slides of the presentations "HARP: A
Hierarchical Approach with Automatic Relevant
Attribute Selection for Projected Clustering" and
"The Subspace Clustering Problem".
Reference web sites for the bioinformatics
materials covered in this presentation:
Human Genome Project Information
HKU-Pasteur Research Centre
Companion Web Site, Concepts of Genetics (6th Ed.)

24
Source of Figures

Projected clusters (p. 5): C. M. Procopiuc, M.
Jones, P. K. Agarwal, and T. M. Murali. A Monte
Carlo Algorithm for Fast Projective Clustering.
In ACM SIGMOD International Conference on
Management of Data, 2002.
Generalized animal cell, genetic information flow
(p. 14, 15): William S. Klug and Michael R.
Cummings. Concepts of Genetics, Sixth Edition
(p. 18, 350).
DNA with features (p. 14): Human Genome Project
Information, http://www.ornl.gov/hgmis/.

25
Source of Figures

GeneChip Arrays for Gene Expression Analysis
(p. 17): Affymetrix, http://www.affymetrix.com/.
An illuminated microarray (p. 17): EMBL-EBI,
http://www.ebi.ac.uk/.

26
Thank You!
