Title: One-by-one data recovery clustering
1. One-by-one data recovery clustering
- Boris Mirkin
- http://www.dcs.bbk.ac.uk/mirkin
- UKCI05, 5-7 September 05, LKL, http://www.dcs.bbk.ac.uk/ukci05
- School of Computer Science
- Birkbeck University of London
2.
- WHAT IS CLUSTERING; WHAT IS DATA
- K-MEANS CLUSTERING
  - Conventional K-Means
  - Initialization of K-Means
  - Intelligent K-Means
  - Interpretation Aids
- WARD HIERARCHICAL CLUSTERING
  - Agglomeration
  - Divisive Clustering with Ward Criterion
  - Extensions of Ward Clustering
- DATA RECOVERY MODELS
  - Statistics Modelling as Data Recovery
  - Data Recovery Model for K-Means, for Ward
  - Extensions to Other Data Types
  - One-by-One Clustering
- DIFFERENT CLUSTERING APPROACHES
  - Extensions of K-Means
  - Graph-Theoretic Approaches
  - Conceptual Description of Clusters
- GENERAL ISSUES
  - Feature Selection and Extraction
  - Similarity on Subsets and Partitions
  - Validity and Reliability
3. Data Recovery framework for data analysis methods
- Type of Data
  - Similarity
  - Temporal
  - Entity-to-feature
  - Co-occurrence
- Type of Model
  - Regression
  - Principal components
  - Clusters
4. Basic Equation and DSD
- Model
  - Data = Model_Data + Residual
- Pythagoras (Decomposition of Data Scatter)
  - Data² = Model_Data² + Residual²
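The Pythagorean decomposition can be checked numerically on a toy case. The sketch below is an illustration only: it assumes the simplest possible model (a constant equal to the mean, fitted by least squares), and the data values and the helper name `scatter` are made up for the example.

```python
# Illustrative data only; the "model" is a constant fitted by least squares,
# i.e. the mean of the data.
data = [2.0, 4.0, 6.0, 8.0]

mean = sum(data) / len(data)
model = [mean] * len(data)                        # Model_Data
residual = [y - m for y, m in zip(data, model)]   # Residual = Data - Model_Data


def scatter(values):
    """Sum of squared entries (the scatter of a vector)."""
    return sum(v * v for v in values)


data_scatter = scatter(data)      # Data^2
explained = scatter(model)        # Model_Data^2
unexplained = scatter(residual)   # Residual^2

print(data_scatter, explained, unexplained)
```

The identity holds here because the least-squares residual is orthogonal to the fitted model; with an arbitrary (non-optimal) model the cross term would not vanish.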
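5. Cluster model for similarity data
- Given $A = (a_{ik})$, similarity over entities $i, k \in I$
- Find clusters $S_t \subseteq I$ with intensities $\lambda_t$ ($t = 1, \dots, T$)
- $a_{ik} = \lambda_1 s_{i1} s_{k1} + \lambda_2 s_{i2} s_{k2} + \dots + \lambda_T s_{iT} s_{kT} + e_{ik}$
- $\lambda_t > 0$, and $s_{it} = 1$ if $i \in S_t$, $s_{it} = 0$ if $i \notin S_t$ (likewise for $s_{kt}$)
- No restrictions on $s_{it}$ $\Rightarrow$ the spectral decomposition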
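6. Decomposition of Data scatter for similarity data
- Given $A = (a_{ik})$, similarity over entities $i, k \in I$
- Find clusters $S_t \subseteq I$ with intensities $\lambda_t$ ($t = 1, \dots, T$)
- Data scatter = Explained + Unexplained
- $\sum_{i,k} a_{ik}^2 = \lambda_1^2 \sum_i s_{i1} \sum_k s_{k1} + \dots + \lambda_T^2 \sum_i s_{iT} \sum_k s_{kT} + \sum_{i,k} e_{ik}^2$
- $\sum_{i,k} a_{ik}^2 = \lambda_1^2 |S_1|^2 + \dots + \lambda_T^2 |S_T|^2 + \sum_{i,k} e_{ik}^2$
- Minimise unexplained part of data scatter, $\sum_{i,k} e_{ik}^2$
- $\lambda_t = \sum_{i,k \in S_t} a_{ik} / |S_t|^2 = a(S_t)$, the average similarity within $S_t$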
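The decomposition on this slide can be verified on a small example. The following is a minimal sketch, assuming non-overlapping clusters and intensities set to the within-cluster averages; the matrix `A`, the partition, and the function names are hypothetical and serve only as illustration.

```python
def average_within(a, cluster):
    """a(S): sum of a[i][k] over i, k in S, divided by |S|^2."""
    total = sum(a[i][k] for i in cluster for k in cluster)
    return total / len(cluster) ** 2


def scatter_decomposition(a, clusters):
    """Return (data scatter, explained, unexplained) for disjoint clusters."""
    n = len(a)
    lambdas = [average_within(a, s) for s in clusters]
    # Model matrix: lambda_t on the block S_t x S_t, zero elsewhere.
    model = [[0.0] * n for _ in range(n)]
    for lam, s in zip(lambdas, clusters):
        for i in s:
            for k in s:
                model[i][k] = lam
    data = sum(a[i][k] ** 2 for i in range(n) for k in range(n))
    explained = sum(lam ** 2 * len(s) ** 2 for lam, s in zip(lambdas, clusters))
    unexplained = sum(
        (a[i][k] - model[i][k]) ** 2 for i in range(n) for k in range(n)
    )
    return data, explained, unexplained


# Hypothetical symmetric similarity matrix over four entities.
A = [
    [4.0, 3.0, 0.0, 1.0],
    [3.0, 4.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 2.0],
    [1.0, 0.0, 2.0, 2.0],
]
clusters = [[0, 1], [2, 3]]
data, explained, unexplained = scatter_decomposition(A, clusters)
print(data, explained, unexplained)
```

In this example the two terms on the right add up to the data scatter; the equality relies on each $\lambda_t$ being the average $a(S_t)$ and on the clusters not overlapping.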
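7. One cluster model (non-overlapping)
- Start with $t = 1$ and $I_t = I$
- Find binary $s = (s_i)$, $i \in I_t$
- $a_{ik} = \lambda s_i s_k + e_{ik}$
- By minimising
  - $\sum_{i,k} (a_{ik} - \lambda s_i s_k)^2 = \sum_{i,k} a_{ik}^2 - \lambda^2 |S|^2$, or
- By maximising the anomaly measure
  - $(\lambda |S|)^2 = (a(S)|S|)^2$ (**)
- Put $S_t = S$, $I_{t+1} = I_t \setminus S_t$, $t \leftarrow t + 1$ and reiterate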
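The equivalence of the two criteria on this slide follows from a short calculation, sketched here under the assumptions that $s_i \in \{0, 1\}$ (so $s_i^2 = s_i$) and $|S| = \sum_i s_i$; the notation $L(\lambda, S)$ for the least-squares criterion is introduced only for this derivation.

```latex
\begin{align*}
L(\lambda, S) &= \sum_{i,k} (a_{ik} - \lambda s_i s_k)^2
  = \sum_{i,k} a_{ik}^2 - 2\lambda \sum_{i,k \in S} a_{ik} + \lambda^2 |S|^2, \\
\frac{\partial L}{\partial \lambda} = 0 \;&\Rightarrow\;
  \lambda = \frac{\sum_{i,k \in S} a_{ik}}{|S|^2} = a(S), \\
L(a(S), S) &= \sum_{i,k} a_{ik}^2 - a(S)^2 |S|^2 .
\end{align*}
```

Since $\sum_{i,k} a_{ik}^2$ does not depend on $S$, minimising the residual over $S$ amounts to maximising $(a(S)|S|)^2$.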
8. One cluster method ADDI-S
- Maximise $\lambda |S| = a(S)|S|$ (*)
  - This is the density of $S$ if there are no negative similarities
  - $a(S)$: average similarity within $S$
  - (*) squared is a measure of how anomalous $S$ is
- 0. Cycle over $i \in I_t$: put $S = \{i\}$.
- 1. Start. Find maximum $a_{ik}$. Put $k$ into $S$ if $a_{ik} > 0$; otherwise, stop.
- 2. Steepest ascent. Find $i$ maximising
  - $\Delta(i) = s(i)\,(a(i, S) - a(S)/2)$
  - $a(i, S)$: average similarity of $i$ and $S$
  - $s(i) = -1$ if $i \in S$ and $1$ if not.
- 3. If $\Delta(i) \le 0$, end. Else update $S$ by changing the membership of $i$, $s(i) \leftarrow -s(i)$. Go to Step 2.
- 4. Return $S$ (over all $i$) that maximises $(a(S)|S|)^2$.
- Theorem: $S$ is a strict cluster: $a(i, S) < a(S)/2$ for all $i \in I_t \setminus S$.
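The steps above can be sketched in a few lines of Python. This is not the author's implementation but a minimal reading of the slide, under assumptions that are not stated on it: the similarity matrix is symmetric with a zero diagonal, $a(i, S)$ is the sum of $a_{ik}$ over $k \in S$ divided by $|S|$, and the search stops when the best $\Delta(i)$ is not positive. The function names and the example matrix are hypothetical.

```python
def avg_within(a, s):
    """a(S): mean of a[i][k] over all ordered pairs i, k in S."""
    return sum(a[i][k] for i in s for k in s) / len(s) ** 2


def avg_to(a, i, s):
    """a(i, S): sum of a[i][k] over k in S, divided by |S|."""
    return sum(a[i][k] for k in s) / len(s)


def grow_from(a, seed, entities):
    """Steps 1-3 for one seed entity; returns the resulting cluster."""
    others = [k for k in entities if k != seed]
    if not others:
        return {seed}
    k = max(others, key=lambda j: a[seed][j])
    if a[seed][k] <= 0:
        return {seed}                 # Step 1: no positive similarity, stop
    s = {seed, k}
    while True:
        half = avg_within(a, s) / 2

        def delta(i):
            # Delta(i) = s(i) * (a(i, S) - a(S)/2), s(i) = -1 inside S, +1 outside
            sign = -1 if i in s else 1
            return sign * (avg_to(a, i, s) - half)

        i = max(entities, key=delta)  # Step 2: steepest ascent
        if delta(i) <= 1e-12:
            return s                  # Step 3: no change improves a(S)|S|
        s = s ^ {i}                   # add i if outside S, remove it if inside


def addi_s(a, entities=None):
    """Steps 0 and 4: try every seed, keep the S maximising (a(S)|S|)^2."""
    entities = list(range(len(a))) if entities is None else list(entities)
    best_s, best_score = None, -1.0
    for seed in entities:
        s = grow_from(a, seed, entities)
        score = (avg_within(a, s) * len(s)) ** 2
        if score > best_score:
            best_s, best_score = s, score
    return best_s, best_score


# Hypothetical similarity matrix: entities 0-2 are tightly linked, 3-4 weakly.
A = [
    [0, 5, 4, 0, -1],
    [5, 0, 6, 1, 0],
    [4, 6, 0, 0, 1],
    [0, 1, 0, 0, 2],
    [-1, 0, 1, 2, 0],
]
cluster, score = addi_s(A)
print(sorted(cluster), score)
```

With a symmetric, zero-diagonal matrix each accepted move strictly increases $a(S)|S|$, so the loop terminates; for other inputs an iteration cap would be a sensible safeguard.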
9. Catch effect of similarity shift
- Similarities $a_{ik}$ shifted by $a_1$: few clusters
- Similarities $a_{ik}$ shifted by $a_2$: more clusters
- Clusters may get larger, may merge
- Not necessarily, if there is no 1D structure on pairs $i, k$
10. Application to set clustering
- Each $i \in I$ is assigned with $F_i \subseteq F$
- $a_{ik}$ should depend on the overlap of $F_i$ and $F_k$
- Jaccard: $|F_i \cap F_k| / (|F_i| + |F_k| - |F_i \cap F_k|)$
  - underestimates similarity
- Mbc: $(|F_i \cap F_k| / |F_i| + |F_i \cap F_k| / |F_k|) / 2$
  - ok
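A small comparison of the two coefficients may help show where Jaccard is lower. The sketch below assumes non-empty sets; the function names and the two example sets are hypothetical.

```python
def jaccard(f_i, f_k):
    """|Fi & Fk| / (|Fi| + |Fk| - |Fi & Fk|)."""
    inter = len(f_i & f_k)
    return inter / (len(f_i) + len(f_k) - inter)


def mbc(f_i, f_k):
    """(|Fi & Fk| / |Fi| + |Fi & Fk| / |Fk|) / 2."""
    inter = len(f_i & f_k)
    return (inter / len(f_i) + inter / len(f_k)) / 2


# Hypothetical sets: f_small is entirely contained in f_large.
f_small = {"p1", "p2"}
f_large = {"p1", "p2", "p3", "p4", "p5", "p6"}

j = jaccard(f_small, f_large)
m = mbc(f_small, f_large)
print(j, m)
```

Here the smaller set is fully covered by the larger one, yet Jaccard gives only 1/3, while Mbc averages the two coverage proportions (1 and 1/3) and gives 2/3.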
11. Aggregating homologous protein families (HPFs) across 30 herpes virus genomes
- An HPF is defined as a set of proteins with a similar fragment
- Each HPF $h$ is assigned with the set $F_h$ of BLAST (whole sequence) based homologues
- Effects
  - Different assignments $F$ are compatible
  - Different shift values (high level, 0.96, 0.9, 0.8) lead to hierarchically organised clusterings, as if the Figure was true
12. Kinship terms: Six similarity matrices
13. 3-way one cluster model
- Find binary $s = (s_i)$, $i \in I$
- $a_{ik,v} = \lambda_v s_i s_k + e_{ik,v}$
- by minimising
  - $\sum_{i,k,v} (a_{ik,v} - \lambda_v s_i s_k)^2 = \sum_{i,k,v} a_{ik,v}^2 - \sum_v \lambda_v^2 |S|^2$
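The 3-way criterion can be checked in the same way as the 2-way one. The sketch below assumes that each $\lambda_v$ is set to the average similarity within $S$ in matrix $v$; the two matrices, the cluster, and the function name are hypothetical.

```python
def three_way_fit(stack, cluster):
    """stack[v][i][k] holds a_ik,v; returns (lambdas, data scatter, residual)."""
    n = len(stack[0])
    members = set(cluster)
    size = len(members)
    # lambda_v: average of a_ik,v over i, k in S.
    lambdas = [
        sum(m[i][k] for i in members for k in members) / size ** 2
        for m in stack
    ]
    data = sum(m[i][k] ** 2 for m in stack for i in range(n) for k in range(n))
    residual = 0.0
    for lam, m in zip(lambdas, stack):
        for i in range(n):
            for k in range(n):
                fitted = lam if (i in members and k in members) else 0.0
                residual += (m[i][k] - fitted) ** 2
    return lambdas, data, residual


# Two hypothetical similarity matrices over the same three entities.
M1 = [[1, 3, 0], [3, 1, 1], [0, 1, 2]]
M2 = [[2, 4, 1], [4, 2, 0], [1, 0, 1]]
S = [0, 1]

lambdas, data, residual = three_way_fit([M1, M2], S)
explained = sum(lam ** 2 for lam in lambdas) * len(S) ** 2
print(lambdas, data, explained, residual)
```

The residual equals the data scatter minus the explained part, with the explained part summed over the matrices $v$.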
15. Future work
- Research into hierarchical similarities
- Bioinformatics clustering
- Temporal data clustering
- Multi-region data clustering
- Web clustering