BiClustering - PowerPoint PPT Presentation

About This Presentation
Title:

BiClustering

Description:

11. The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL. Motivation ... found via the state of the art method in computational biology field is 12.54 ... – PowerPoint PPT presentation

Slides: 25
Provided by: anselmo9

Transcript and Presenter's Notes

Title: BiClustering


1
Bi-Clustering
  • COMP 790-90 Seminar
  • Spring 2009

2
Data Mining Clustering
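K-means clustering minimizes
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
Where $\mu_i$ is the centroid (mean) of cluster $C_i$ and $k$ is the number of clusters.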
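
As a minimal sketch of the quantity k-means minimizes, the snippet below computes the within-cluster sum of squared Euclidean distances for a fixed assignment of points to clusters. The function name `kmeans_objective` and the toy data are assumptions made for illustration; they do not come from the slides.

```python
def kmeans_objective(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster's members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Toy data (hypothetical): two tight groups far apart along the first axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
objective = kmeans_objective(points, labels, k=2)
print(objective)
```

Grouping the points by their first coordinate instead of the assignment above would give a much larger value, which is why the objective favors physically close points and misses shifted patterns.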
3
Clustering by Pattern Similarity (p-Clustering)
  • The micro-array raw data shows 3 genes and
    their values in a multi-dimensional space
  • Parallel Coordinates Plots
  • Difficult to find their patterns
  • non-traditional clustering

4
Clusters Are Clear After Projection
5
Motivation
  • E-commerce: collaborative filtering

6
Motivation
7
Motivation
8
Motivation
9
Motivation
  • DNA microarray analysis

10
Motivation
11
Motivation
  • Strong coherence is exhibited by the selected objects
    on the selected attributes.
  • They are not necessarily close to each other but
    rather bear a constant shift.
  • Object/attribute bias
  • bi-cluster

12
Challenges
  • The set of objects and the set of attributes are
    usually unknown.
  • Different objects/attributes may possess
    different biases, and such biases
  • may be local to the set of selected
    objects/attributes
  • are usually unknown in advance
  • The data may have many unspecified entries

13
Previous Work
  • Subspace clustering
  • Identifies a set of objects and a set of
    attributes such that the objects are
    physically close to each other in the subspace
    formed by the set of attributes.
  • Collaborative filtering: Pearson R
  • Only considers the global offset of each
    object/attribute.

14
bi-cluster
  • Consists of a (sub)set of objects and a (sub)set
    of attributes
  • Corresponds to a submatrix
  • Occupancy threshold α
  • Each object/attribute has to be filled by a
    certain percentage.
  • Volume: number of specified entries in the
    submatrix
  • Base: average value of each object/attribute (in
    the bi-cluster)
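
The definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenters' code: unspecified entries are represented by `None`, the names `bicluster_stats`, `meets_occupancy`, and `alpha` are hypothetical, and the occupancy test assumes that every selected row and column must have at least a fraction `alpha` of its entries specified.

```python
def bicluster_stats(data, rows, cols):
    """Volume, row bases, column bases, and overall base of a submatrix.

    data[i][j] is None when the entry is unspecified (assumed representation).
    """
    def mean(values):
        values = list(values)
        return sum(values) / len(values) if values else None

    specified = [(i, j) for i in rows for j in cols if data[i][j] is not None]
    volume = len(specified)  # number of specified entries
    row_base = {i: mean(data[i][j] for j in cols if data[i][j] is not None) for i in rows}
    col_base = {j: mean(data[i][j] for i in rows if data[i][j] is not None) for j in cols}
    cluster_base = mean(data[i][j] for i, j in specified)
    return volume, row_base, col_base, cluster_base


def meets_occupancy(data, rows, cols, alpha):
    """True if every selected row and column is at least a fraction alpha specified."""
    rows_ok = all(sum(data[i][j] is not None for j in cols) >= alpha * len(cols) for i in rows)
    cols_ok = all(sum(data[i][j] is not None for i in rows) >= alpha * len(rows) for j in cols)
    return rows_ok and cols_ok


# Hypothetical 3 x 3 matrix with two unspecified entries.
data = [[1.0, 2.0, None],
        [3.0, 4.0, 5.0],
        [5.0, None, 9.0]]
volume, row_base, col_base, cluster_base = bicluster_stats(data, [0, 1], [0, 1, 2])
```

Here the submatrix formed by rows 0 and 1 has five specified entries, so its volume is 5 even though it spans six cells.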

15
bi-cluster
16
bi-cluster
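  • Perfect δ-cluster: every specified entry satisfies
    $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$
  • Imperfect δ-cluster: some entries deviate from this
    relation
  • Residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$,
    where $d_{ij}$ is the entry, $d_{iJ}$ the base of object $i$,
    $d_{Ij}$ the base of attribute $j$, and $d_{IJ}$ the base of
    the bi-cluster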
17
bi-cluster
  • The smaller the average residue, the stronger the
    coherence.
  • Objective: identify δ-clusters with residue
    smaller than a given threshold
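
A minimal sketch of the average residue follows. It assumes the residue of an entry is d_ij - d_iJ - d_Ij + d_IJ (entry minus row base minus column base plus the base of the whole submatrix), that unspecified entries are stored as `None`, and that the average is the mean absolute residue over specified entries. The averaging choice and the name `average_residue` are assumptions; a squared-residue average is another common choice.

```python
def average_residue(data, rows, cols):
    """Mean absolute residue over the specified entries of a submatrix."""
    def mean(values):
        return sum(values) / len(values)

    cells = [(i, j) for i in rows for j in cols if data[i][j] is not None]
    if not cells:
        return 0.0
    # Bases are averages over specified entries only.
    row_base = {i: mean([data[i][j] for j in cols if data[i][j] is not None])
                for i in rows if any(data[i][j] is not None for j in cols)}
    col_base = {j: mean([data[i][j] for i in rows if data[i][j] is not None])
                for j in cols if any(data[i][j] is not None for i in rows)}
    base = mean([data[i][j] for i, j in cells])
    return mean([abs(data[i][j] - row_base[i] - col_base[j] + base) for i, j in cells])


# Rows that differ only by a constant shift form a perfect cluster (hypothetical data).
shifted = [[1.0, 2.0, 4.0],
           [3.0, 4.0, 6.0],
           [6.0, 7.0, 9.0]]
# Perturbing one entry breaks the shift pattern.
noisy = [[1.0, 2.0, 4.0],
         [3.0, 4.0, 6.0],
         [6.0, 7.0, 12.0]]
perfect_score = average_residue(shifted, [0, 1, 2], [0, 1, 2])
noisy_score = average_residue(noisy, [0, 1, 2], [0, 1, 2])
```

The shifted rows are far apart in value yet score (numerically) zero, which is the sense in which residue measures coherence rather than closeness.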

18
Cheng-Church Algorithm
  • Find one bi-cluster.
  • Replace the data in the first bi-cluster with
    random data
  • Find the second bi-cluster, and go on.
  • The quality of the bi-cluster degrades (smaller
    volume, higher residue) due to the insertion of
    random data.
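
The masking step can be sketched as below. This shows only the replacement of a found bi-cluster with random data, not the search for the bi-cluster itself; the function name `mask_bicluster`, the uniform distribution, and the `low`/`high` range are assumptions chosen for illustration.

```python
import random


def mask_bicluster(data, rows, cols, low, high, seed=None):
    """Return a copy of data with the cells of a found bi-cluster replaced by random values."""
    rng = random.Random(seed)
    masked = [list(row) for row in data]  # copy so the input stays untouched
    for i in rows:
        for j in cols:
            masked[i][j] = rng.uniform(low, high)
    return masked


# Hypothetical data: pretend rows {0, 1} x columns {0, 1} were found as the first bi-cluster.
data = [[1.0, 2.0, 7.0],
        [3.0, 4.0, 8.0],
        [9.0, 9.5, 6.0]]
masked = mask_bicluster(data, rows=[0, 1], cols=[0, 1], low=0.0, high=10.0, seed=0)
```

Every later search then runs on `masked`, so the random cells can leak into later bi-clusters, which is the degradation noted in the last bullet.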

19
The FLOC algorithm
  • Generate initial clusters
  • Determine the best action for each row and each
    column
  • Perform the best action of each row and column
    sequentially
  • Improved? If yes, determine the best actions again;
    if no, stop
20
The FLOC algorithm
  • Action: the change of membership of a row (or
    column) with respect to a cluster

[Figure: example matrix with M = 4 columns and N = 3 rows. M + N actions are performed at each iteration.]
21
The FLOC algorithm
  • Gain of an action: the residue reduction incurred
    by performing the action
  • Order of actions
  • Fixed order
  • Random order
  • Weighted random order
  • Complexity: O((M + N) · M · N · k · p)
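
A minimal sketch of the gain of a row action, taken here as the average residue before the action minus the average residue after it. To stay short it assumes a fully specified matrix, uses mean absolute residue, and considers a single cluster; the names `average_residue`, `toggle`, and `row_action_gain` are hypothetical.

```python
def average_residue(data, rows, cols):
    """Mean absolute residue of the submatrix rows x cols (assumes no missing entries)."""
    if not rows or not cols:
        return 0.0
    row_base = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(abs(data[i][j] - row_base[i] - col_base[j] + base)
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def toggle(members, x):
    """An action flips membership: remove x if present, add it otherwise."""
    return members - {x} if x in members else members | {x}


def row_action_gain(data, rows, cols, i):
    """Residue reduction obtained by toggling row i's membership in the cluster."""
    return average_residue(data, rows, cols) - average_residue(data, toggle(rows, i), cols)


# Hypothetical data: rows 0-2 share a shift pattern, row 3 does not.
data = [[1.0, 2.0, 4.0],
        [3.0, 4.0, 6.0],
        [6.0, 7.0, 9.0],
        [9.0, 1.0, 5.0]]
cols = {0, 1, 2}
gain_remove = row_action_gain(data, {0, 1, 2, 3}, cols, 3)  # drop the outlier row
gain_add = row_action_gain(data, {0, 1, 2}, cols, 3)        # add the outlier row
```

Removing the outlier row yields a positive gain and adding it yields the same amount as a loss. A column action would be computed the same way with the roles of rows and columns swapped.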

22
The FLOC algorithm
  • Additional features
  • Maximum allowed overlap among clusters
  • Minimum coverage of clusters
  • Minimum volume of each cluster
  • These can be enforced by temporarily blocking certain
    actions during the mining process if such an action
    would violate some constraint.

23
Performance
  • Microarray data: 2884 genes, 17 conditions
  • The 100 bi-clusters with the smallest residue were
    returned.
  • Average residue: 10.34
  • The average residue of clusters found via the
    state-of-the-art method in the computational biology
    field is 12.54
  • The average volume is 25% bigger
  • The response time is an order of magnitude faster

24
Concluding Remarks
  • The bi-cluster model is proposed to capture
    coherent objects in an incomplete data set.
  • base
  • residue
  • Many additional features can be accommodated
    (nearly for free).