Projected clustering algorithms and their application on genomic data analysis

1
Projected clustering algorithms and their
application on genomic data analysis
  • Probation Talk
  • M.Phil. candidate: Kevin Yip
  • Supervisor: Dr. D. Cheung
  • 20th Dec 2002

2
Presentation Outline
  • The research problem: clustering high-dimensional
    datasets.
  • Recent work in the community: projected
    (subspace) clustering.
  • My previous work: HARP and HARP.1.
  • My current and future work: studying genomic
    data, comparing different algorithms, designing
    similarity measures, etc.

3
Curse of Dimensionality
  • In a dataset of high dimensionality, some
    attributes may not help identify the
    characteristics of data points.
  • If the number of such attributes is high, a point
    can be almost as close to its farthest neighbor
    as to its nearest neighbor.
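
The effect described above can be checked with a small simulation. This is only a sketch under assumed conditions that are not from the slides: points drawn uniformly from the unit hypercube, Euclidean distance, and a hypothetical helper named nearest_farthest_ratio. As the number of attributes grows, the ratio of nearest to farthest distance moves toward 1.

```python
import math
import random


def nearest_farthest_ratio(dim, n_points=200, seed=0):
    """Return (nearest distance) / (farthest distance) from one random
    query point to n_points uniformly random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


# A ratio near 0 means the nearest neighbor is much closer than the farthest;
# a ratio near 1 means the two are almost equally far away.
for dim in (2, 20, 200, 2000):
    print(dim, round(nearest_farthest_ratio(dim), 3))
```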

4
Curse of Dimensionality
        A1   A2
  R1     3    8
  R2     4    7
  R3     9    1

  D(R1, R2) = 1.4
  D(R2, R3) = 7.8
  Difference = 6.4
  • Resolution: feature selection
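
The distances in the table are plain Euclidean distances and can be reproduced in a few lines. The second half of the sketch is an assumption added for illustration: it appends 50 irrelevant attributes drawn uniformly from [0, 10] (a range chosen here, not taken from the slides) to show how the relative gap between the two distances shrinks when noise attributes are present.

```python
import math
import random

rows = {"R1": (3, 8), "R2": (4, 7), "R3": (9, 1)}


def gap(points):
    """Return D(R1, R2), D(R2, R3) and their relative difference."""
    d12 = math.dist(points["R1"], points["R2"])
    d23 = math.dist(points["R2"], points["R3"])
    return d12, d23, (d23 - d12) / d23


def add_noise(points, k, seed=0):
    """Append k irrelevant attributes, uniform on [0, 10] (assumed range)."""
    rng = random.Random(seed)
    return {name: vals + tuple(rng.uniform(0, 10) for _ in range(k))
            for name, vals in points.items()}


d12, d23, rel = gap(rows)
print(round(d12, 1), round(d23, 1), round(d23 - d12, 1))  # 1.4 7.8 6.4

noisy = gap(add_noise(rows, 50))
print("relative gap:", round(rel, 2), "->", round(noisy[2], 2))
```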

5
Projected Clustering
  • What feature selection cannot do: find different
    relevant attributes for different clusters.

6
Projected Clustering
  • Clustering: identify similar objects and form
    clusters.
  • Projected clustering: identify similar objects
    and the relevant attributes, and form clusters.
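
A toy example may make the idea of per-cluster relevant attributes concrete. The data values, the function name relevant_attributes, and the selection rule (variance inside the cluster below 10% of the variance over the whole dataset) are all assumptions made for this sketch; it is not the criterion of any algorithm named in this talk.

```python
from statistics import pvariance

# Toy points with attributes (A1, A2, A3); hypothetical values.
cluster_a = [(1.0, 9.1, 2.0), (1.1, 0.5, 2.1), (0.9, 5.0, 1.9), (1.0, 7.3, 2.0)]
cluster_b = [(8.0, 4.0, 7.9), (2.5, 4.1, 0.3), (5.5, 3.9, 4.2), (0.4, 4.0, 9.6)]


def relevant_attributes(cluster, everything, threshold=0.1):
    """Indices of attributes whose variance within the cluster is below
    threshold * variance over the whole dataset (an assumed criterion)."""
    selected = []
    for j in range(len(cluster[0])):
        local = pvariance([p[j] for p in cluster])
        overall = pvariance([p[j] for p in everything])
        if local < threshold * overall:
            selected.append(j)
    return selected


data = cluster_a + cluster_b
print(relevant_attributes(cluster_a, data))  # [0, 2]: tight on A1 and A3
print(relevant_attributes(cluster_b, data))  # [1]: tight on A2 only
```

The two clusters are compact in different subspaces, so dropping any single attribute for the whole dataset (global feature selection) would hurt one of them.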

7
Projected Clustering
  • Existing approaches
  • Grid-based dimension selection
  • Association rule hypergraph partitioning
  • Context-specific Bayesian clustering
  • Monte Carlo algorithm
  • Projective clustering
  • All produce projected clusters successfully in
    different ways.

8
Projected Clustering
  • Some problems with the approaches
  • Accuracy depends on hard-to-determine input
    parameters.
  • Unable to determine the dimensionality of each
    cluster automatically.
  • Can only use density as similarity measure.
  • In addition, the algorithms have been tested on
    few real datasets.

9
HARP and HARP.1
  • HARP (A Hierarchical Algorithm with Automatic
    Relevant Attribute Selection for Projected
    Clustering)
  • A hierarchical clustering framework with no
    pre-assumed similarity measure.
  • Requires no hard-to-determine user inputs.
  • Determines the relevant attributes of each
    cluster automatically.

10
HARP and HARP.1
  • HARP.1 - an implementation of HARP using
    attribute value density to define the similarity
    measure
  • Makes use of global statistics in attribute
    selection.
  • Handles mutual disagreement and information loss
    of the potentially merging clusters.
  • Accepts mixed types of attributes.

11
HARP and HARP.1
  • Some experimental results (synthetic data)

Dataset   d   l   HARP.1      PROCLUS (best; average)    Traditional (best; average)   ORCLUS (best; average)
SynCat1   20  12  0.0 / 5.0   3.6 / 1.4; 6.7 / 3.7       2.6 / 26.4; 5.8 / 5.3         N/A
SynMix1   20  12  0.4 / 6.8   2.2 / 17.0; 6.8 / 10.1     11.6 / 11.2; 7.9 / 4.6        N/A
SynNum1   20  12  0.8 / 5.0   1.8 / 21.4; 7.2 / 8.3      4.4 / 32.0; 5.9 / 9.2         0.4 / 23.8; 2.31 / 8.15

d: dimensionality of the dataset. l: average number of relevant attributes.
Each entry is error / outlier. Where two entries are given, the first is the
best score and the second is the average.

12
HARP and HARP.1
  • Some experimental results (real data)

Dataset    d   l   HARP.1       PROCLUS (best; average)    Traditional (best; average)   ROCK
Soybean    35  26  0.0 / 0.0    0.0 / 0.0; 17.3 / 0.0      2.1 / 0.0; 9.2 / 0.0          No published result
Voting     16  11  6.4 / 13.6   2.1 / 55.6; 13.8 / 7.9     13.1 / 11.3; 13.1 / 1.9       6.2 / 14.5
Mushroom   22  15  1.4 / 0.0    3.2 / 0.0; 9.0 / 0.0       6.0 / 0.0; 5.2 / 0.0          0.4 / 0.0

d: dimensionality of the dataset. l: average number of relevant attributes
(determined by HARP.1).
Each entry is error / outlier. Where two entries are given, the first is the
best score and the second is the average.

13
Current and Future Work
  • A real application of projected clustering -
    analyzing genomic data
  • High dimensionality
  • Noisy
  • Correct partition not always available
  • A lot of hidden information
  • New data available with a high growth rate

14
Current and Future Work
  [Figures: generalized animal cell; DNA with features]

15
Current and Future Work
  • Codon usage data
  • Study the relationship between the frequencies of
    different codons and the functions of the genes.

16
Current and Future Work
  • Transcriptome data
  • The full complement of activated genes, mRNAs, or
    transcripts in a particular tissue at a
    particular time.
  • Study the expression of the genes
  • In different samples
  • Under different situations
  • At different times

17
Current and Future Work
  • Transcriptome data

18
Current and Future Work
  • Transcriptome data

  [Figure: five panels, each with axis ticks 1-7 and 51-57; scale 0 to 2000]

19
Current and Future Work
  • A sample clustering result

Operon 12_1: trs5_7 0, yefJ 4, wbbK 4, wbbJ 4, wbbI 4, wbbH 4,
             glf 4, rfbX 4, rfbC 4, rfbA 4, rfbD 4, rfbB 4
Operon 12_2: hyfA 0, hyfB 0, hyfC 0, hyfD 0, hyfE 6, hyfF 0,
             hyfG 0, hyfH 0, hyfI 0, b2490 9, hyfR 0, focB 0
Operon 12_3: rpmJ 6, prlA 0, rplO 2, rpmD 2, rpsE 2, rplR 2,
             rplF 2, rpsH 2, rpsN 2, rplE 2, rplX 2, rplN 2

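
One way to read the sample result is to ask how consistently the genes of each operon fall into one cluster. The sketch below assumes that the number after each gene name is the cluster label assigned to that gene; the slides do not state this explicitly, and the helper name dominant_cluster is hypothetical.

```python
from collections import Counter

# Gene -> cluster label, transcribed from the sample result.
operons = {
    "12_1": {"trs5_7": 0, "yefJ": 4, "wbbK": 4, "wbbJ": 4, "wbbI": 4, "wbbH": 4,
             "glf": 4, "rfbX": 4, "rfbC": 4, "rfbA": 4, "rfbD": 4, "rfbB": 4},
    "12_2": {"hyfA": 0, "hyfB": 0, "hyfC": 0, "hyfD": 0, "hyfE": 6, "hyfF": 0,
             "hyfG": 0, "hyfH": 0, "hyfI": 0, "b2490": 9, "hyfR": 0, "focB": 0},
    "12_3": {"rpmJ": 6, "prlA": 0, "rplO": 2, "rpmD": 2, "rpsE": 2, "rplR": 2,
             "rplF": 2, "rpsH": 2, "rpsN": 2, "rplE": 2, "rplX": 2, "rplN": 2},
}


def dominant_cluster(genes):
    """Most common label in an operon and the fraction of genes carrying it."""
    label, count = Counter(genes.values()).most_common(1)[0]
    return label, count / len(genes)


for name, genes in operons.items():
    label, share = dominant_cluster(genes)
    print(f"Operon {name}: cluster {label} holds {share:.0%} of {len(genes)} genes")
```

Under that reading, 11 of 12 genes in operon 12_1 and 10 of 12 genes in each of operons 12_2 and 12_3 share a single label.
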
20
Current and Future Work
  • Goal: knowledge → clustering method
  • Subspace vs. non-subspace
  • Similarity: density, correlation, pattern, etc.
  • Usability: sensitivity to input parameters
  • Algorithm type: hierarchical, partitional,
    graph-based, model-based, etc.
  • Preprocessing: logarithm, mean-centering, PCA,
    FCA, ICA, etc.
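
The choice of similarity measure and preprocessing can change which genes look alike. The sketch below uses three hypothetical expression profiles (not from the talk): gene_b follows the same pattern as gene_a at eight times the level, while gene_c is close to gene_a in level but has the opposite pattern. Raw Euclidean distance favors gene_c; Pearson correlation, or Euclidean distance after a log transform and mean-centering, favors gene_b.

```python
import math


def mean_center(profile):
    """Subtract the profile's own mean from every value."""
    m = sum(profile) / len(profile)
    return [x - m for x in profile]


def preprocess(profile):
    """Log2-transform, then mean-center."""
    return mean_center([math.log2(x) for x in profile])


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    cx, cy = mean_center(x), mean_center(y)
    num = sum(a * b for a, b in zip(cx, cy))
    den = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return num / den


# Hypothetical expression profiles over five conditions.
gene_a = [100, 200, 400, 200, 100]
gene_b = [800, 1600, 3200, 1600, 800]   # same pattern, 8x the level
gene_c = [400, 200, 100, 200, 400]      # similar level, opposite pattern

for name, other in (("b", gene_b), ("c", gene_c)):
    print(name,
          "raw Euclidean:", round(math.dist(gene_a, other), 1),
          "preprocessed Euclidean:", round(math.dist(preprocess(gene_a), preprocess(other)), 3),
          "Pearson:", round(pearson(gene_a, other), 3))
```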

21
Current and Future Work
  • Other subtasks
  • Internal validation for projected clustering
  • Designing classifiers based on current results
  • Building an integrated system for genomic data
    analysis

22
Conclusion
  • Research space: both theoretical and practical.
  • Algorithms
  • Projected clustering algorithms have much to
    improve.
  • Many open problems in high-dimensional data
    analysis.
  • Genomic data
  • A lot to discover. CS people can really help.

23
References
  • References for projected clustering can be found
    in the slides of the presentations "HARP: A
    Hierarchical Approach with Automatic Relevant
    Attribute Selection for Projected Clustering" and
    "The Subspace Clustering Problem".
  • Reference web sites for the bioinformatics
    materials covered in this presentation:
  • Human Genome Project Information
  • HKU-Pasteur Research Centre
  • Companion Web Site, Concepts of Genetics (6th Ed.)

24
Source of Figures
  • Projected clusters (p. 5): C. M. Procopiuc, M.
    Jones, P. K. Agarwal, and T. M. Murali. A Monte
    Carlo Algorithm for Fast Projective Clustering.
    In ACM SIGMOD International Conference on
    Management of Data, 2002.
  • Generalized animal cell, genetic information flow
    (p. 14, 15): William S. Klug and Michael R.
    Cummings. Concepts of Genetics, Sixth Edition (p.
    18, 350).
  • DNA with features (p. 14): Human Genome Project
    Information, http://www.ornl.gov/hgmis/.

25
Source of Figures
  • GeneChip Arrays for Gene Expression Analysis (p.
    17): Affymetrix, http://www.affymetrix.com/.
  • An illuminated microarray (p. 17): EMBL-EBI,
    http://www.ebi.ac.uk/.

26
Thank You!