Onebyone data recovery clustering - PowerPoint PPT Presentation

1
One-by-one data recovery clustering
  • Boris Mirkin
  • http://www.dcs.bbk.ac.uk/mirkin
  • UKCI05, 5-7 September 05, LKL, http://www.dcs.bbk.ac.uk/ukci05
  • School of Computer Science
  • Birkbeck University of London

2
  • WHAT IS CLUSTERING; WHAT IS DATA
  • K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means;
    Intelligent K-Means; Interpretation Aids
  • WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with
    Ward Criterion; Extensions of Ward Clustering
  • DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data
    Recovery Model for K-Means; for Ward; Extensions to Other Data Types;
    One-by-One Clustering
  • DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic
    Approaches; Conceptual Description of Clusters
  • GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets
    and Partitions; Validity and Reliability

3
Data Recovery framework for data analysis methods
  • Type of Data: similarity, temporal, entity-to-feature, co-occurrence
  • Type of Model: regression, principal components, clusters

4
Basic Equation and DSD
  • Model: Data = Model_Data + Residual
  • Pythagoras (Decomposition of Data Scatter):
    Data^2 = Model_Data^2 + Residual^2
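
As a minimal illustration of the decomposition above, the sketch below uses the simplest possible model, a single constant fitted by least squares (the mean). The data values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical toy data; the "model" here is one constant, the mean,
# which is the least-squares fit of a constant to the data.
data = [2.0, 4.0, 6.0, 8.0]

mean = sum(data) / len(data)
model_data = [mean] * len(data)                        # Model_Data
residual = [x - m for x, m in zip(data, model_data)]   # Residual

data_scatter = sum(x * x for x in data)        # Data^2
explained = sum(m * m for m in model_data)     # Model_Data^2
unexplained = sum(e * e for e in residual)     # Residual^2

print(data_scatter, explained, unexplained)
```

The scatter splits exactly into the explained and unexplained parts because the residuals of a least-squares fit are orthogonal to the fitted values.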

5
Cluster model for similarity data
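  • Given $A = (a_{ik})$, a similarity matrix over entities $i, k \in I$
  • Find clusters $S_t \subseteq I$ with intensities $\lambda_t$
    ($t = 1, \dots, T$)
  • $a_{ik} = \lambda_1 s_{i1} s_{k1} + \lambda_2 s_{i2} s_{k2} + \dots
    + \lambda_T s_{iT} s_{kT} + e_{ik}$
  • $\lambda_t > 0$, and $s_{it} = 1$ if $i \in S_t$, $s_{it} = 0$ if
    $i \notin S_t$ (likewise for $s_{kt}$)
  • With no restrictions on $s_{it}$, this is the spectral decomposition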

6
Decomposition of Data scatter for similarity data
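  • Given $A = (a_{ik})$, a similarity matrix over entities $i, k \in I$
  • Find clusters $S_t \subseteq I$ with intensities $\lambda_t$
    ($t = 1, \dots, T$)
  • Data scatter = Explained + Unexplained
  • $\sum_{i,k} a_{ik}^2 = \lambda_1^2 \sum_i s_{i1} \sum_k s_{k1} + \dots
    + \lambda_T^2 \sum_i s_{iT} \sum_k s_{kT} + \sum_{i,k} e_{ik}^2$
  • $\sum_{i,k} a_{ik}^2 = \lambda_1^2 |S_1|^2 + \dots + \lambda_T^2 |S_T|^2
    + \sum_{i,k} e_{ik}^2$
  • Minimise the unexplained part of the data scatter,
    $\sum_{i,k} e_{ik}^2$
  • $\lambda_t = \sum_{i,k \in S_t} a_{ik} / |S_t|^2 = a(S_t)$, the average
    similarity within $S_t$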
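
The decomposition on this slide can be checked numerically. The sketch below assumes non-overlapping clusters and sets each intensity to the average within-cluster similarity; the matrix, the partition, and the helper names `average_similarity` and `modelled` are hypothetical and not taken from the slides.

```python
def average_similarity(a, cluster):
    """a(S): mean of a[i][k] over all ordered pairs i, k in S (diagonal included)."""
    members = list(cluster)
    total = sum(a[i][k] for i in members for k in members)
    return total / len(members) ** 2

# Hypothetical symmetric similarity matrix over four entities.
A = [
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 2.0],
    [1.0, 0.0, 2.0, 4.0],
]
clusters = [{0, 1}, {2, 3}]          # assumed non-overlapping
intensities = [average_similarity(A, s) for s in clusters]
n = len(A)

def modelled(i, k):
    """Model value for the pair (i, k): sum of intensities of clusters holding both."""
    return sum(lam for lam, s in zip(intensities, clusters) if i in s and k in s)

scatter = sum(A[i][k] ** 2 for i in range(n) for k in range(n))
explained = sum(lam ** 2 * len(s) ** 2 for lam, s in zip(intensities, clusters))
unexplained = sum((A[i][k] - modelled(i, k)) ** 2
                  for i in range(n) for k in range(n))

print(intensities, scatter, explained, unexplained)
```

With these values the scatter of 70 splits into an explained part of 61 and an unexplained part of 9.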

7
One cluster model (non-overlapping)
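  • Start with $t = 1$ and $I_t = I$
  • Find binary $s = (s_i)$, $i \in I_t$
  • $a_{ik} = \lambda s_i s_k + e_{ik}$
  • By minimising
  • $\sum_{i,k} (a_{ik} - \lambda s_i s_k)^2 = \sum_{i,k} a_{ik}^2
    - \lambda^2 |S|^2$ (at the optimal $\lambda = a(S)$), or
  • By maximising the anomaly measure
  • $(\lambda |S|)^2 = (a(S)\,|S|)^2$ (**)
  • Put $S_t = S$, $I_t \leftarrow I_t \setminus S_t$, $t \leftarrow t + 1$
    and reiterate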
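
The link between the least-squares criterion and the anomaly measure on this slide follows from a short calculation for a fixed cluster $S$ with binary membership $s$. The derivation below is a reconstruction under that assumption, not a transcription of the slide.

```latex
\begin{align*}
L(\lambda) &= \sum_{i,k} (a_{ik} - \lambda s_i s_k)^2
  = \sum_{i,k} a_{ik}^2 - 2\lambda \sum_{i,k \in S} a_{ik} + \lambda^2 |S|^2
  && \text{since } (s_i s_k)^2 = s_i s_k \text{ for binary } s \\
\frac{dL}{d\lambda} = 0
  &\;\Longrightarrow\; \lambda = \frac{1}{|S|^2} \sum_{i,k \in S} a_{ik} = a(S) \\
L\bigl(a(S)\bigr) &= \sum_{i,k} a_{ik}^2 - 2\,a(S)\cdot a(S)\,|S|^2 + a(S)^2 |S|^2
  = \sum_{i,k} a_{ik}^2 - \bigl(a(S)\,|S|\bigr)^2
\end{align*}
```

So minimising the residual over $S$ is the same as maximising $(a(S)\,|S|)^2$.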

8
One cluster method: ADDI-S
  • Maximise $\lambda |S| = a(S)\,|S|$ (*)
  • This is the density of $S$ if there are no negative similarities
  • $a(S)$ is the average similarity within $S$
  • (*) squared is a measure of how anomalous $S$ is
  • 0. Cycle over $i \in I_t$: put $S = \{i\}$.
  • 1. Start: find the maximum $a_{ik}$ over $k$. Put $k$ into $S$ if
    $a_{ik} > 0$; otherwise, stop.
  • 2. Steepest ascent: find $i$ maximising
  • $\Delta(i) = s(i)\,(a(i, S) - a(S)/2)$
  • $a(i, S)$ is the average similarity between $i$ and $S$
  • $s(i) = -1$ if $i \in S$ and $s(i) = 1$ if not.
  • 3. If $\Delta(i) \le 0$, end. Else update $S$ by moving $i$ in or out,
    set $s(i) \leftarrow -s(i)$, and go to Step 2.
  • 4. Return the $S$ (over all $i$) that maximises
    $(a(S)\,|S|)^2$.
  • Theorem: $S$ is a strict cluster: $a(i, S) < a(S)/2$ for all
    $i \in I_t \setminus S$.
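
The steps above can be sketched as a greedy local search. This is a simplified illustration rather than the author's implementation: it evaluates the criterion a(S)|S| directly for every single-entity move instead of using the incremental rule from Step 2, it assumes a symmetric matrix with zero diagonal, and all function names and the example matrix are hypothetical.

```python
def criterion(a, s):
    """g(S) = a(S) * |S| = (sum of a[i][k] over i, k in S) / |S|."""
    if not s:
        return 0.0
    return sum(a[i][k] for i in s for k in s) / len(s)

def grow_cluster(a, seed, entities):
    """Steps 1-3: start from a seed and toggle the entity that most increases g(S)."""
    others = [k for k in entities if k != seed]
    if not others:
        return {seed}
    nearest = max(others, key=lambda k: a[seed][k])
    if a[seed][nearest] <= 0:          # no positive similarity: stop at the singleton
        return {seed}
    s = {seed, nearest}
    while True:
        current = criterion(a, s)
        best_gain, best_i = 0.0, None
        for i in entities:
            candidate = s ^ {i}        # add i if outside S, remove it if inside
            if not candidate:
                continue
            gain = criterion(a, candidate) - current
            if gain > best_gain + 1e-12:
                best_gain, best_i = gain, i
        if best_i is None:             # no move improves g(S): local maximum
            return s
        s = s ^ {best_i}

def best_single_cluster(a, entities):
    """Steps 0 and 4: try every seed and keep the cluster with the largest g(S)."""
    return max((grow_cluster(a, i, entities) for i in entities),
               key=lambda s: criterion(a, s))

def one_by_one(a):
    """Extract clusters one by one, removing each from the remaining entities."""
    remaining = list(range(len(a)))
    clusters = []
    while remaining:
        s = best_single_cluster(a, remaining)
        clusters.append(s)
        remaining = [i for i in remaining if i not in s]
    return clusters

# Hypothetical similarity matrix with zero diagonal and some negative entries.
A = [
    [0, 5, 4, 0, -1],
    [5, 0, 3, -1, 0],
    [4, 3, 0, 0, -1],
    [0, -1, 0, 0, 2],
    [-1, 0, -1, 2, 0],
]
print(one_by_one(A))
```

Each accepted move strictly increases the criterion, so the search terminates. On this matrix it extracts the cluster of entities 0, 1, 2 first and then the pair 3, 4.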

9
Catch effect of similarity shift
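  • [Figure: graph illustrating the effect of similarity shift values a1
    and a2]
  • Shift a1: few clusters
  • Shift a2: more clusters, which may get larger and may merge
  • Not necessarily so if there is no 1D structure on the pairs i, k
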
10
Application to set clustering
  • Each $i \in I$ is assigned a set $F_i \subseteq F$
  • $a_{ik}$ should depend on the overlap of $F_i$ and $F_k$
  • Jaccard: $|F_i \cap F_k| / (|F_i| + |F_k| - |F_i \cap F_k|)$
  • underestimates similarity
  • Mbc: $(|F_i \cap F_k| / |F_i| + |F_i \cap F_k| / |F_k|) / 2$
  • ok
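
A small sketch of the two overlap measures on this slide, using Python sets. The feature sets are hypothetical and chosen so that one set is contained in a much larger one, which is the kind of case where the Jaccard value comes out low.

```python
def jaccard(f_i, f_k):
    """|Fi & Fk| / |Fi | Fk|, equal to |Fi & Fk| / (|Fi| + |Fk| - |Fi & Fk|)."""
    union = len(f_i | f_k)
    return len(f_i & f_k) / union if union else 0.0

def mbc(f_i, f_k):
    """Mean of the two overlap proportions |Fi & Fk|/|Fi| and |Fi & Fk|/|Fk|."""
    if not f_i or not f_k:
        return 0.0
    common = len(f_i & f_k)
    return (common / len(f_i) + common / len(f_k)) / 2

# Hypothetical feature sets: F_i is entirely contained in the larger F_k.
F_i = {"a", "b"}
F_k = {"a", "b", "c", "d", "e", "f"}

print(jaccard(F_i, F_k))   # 2 / 6
print(mbc(F_i, F_k))       # (2/2 + 2/6) / 2
```

Here Jaccard gives one third although every feature of F_i also occurs in F_k, while the Mbc value is two thirds.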

11
Aggregating homologous protein families (HPFs)
across 30 herpes virus genomes
  • An HPF is defined as a set of
    proteins with a similar fragment
  • Each HPF h is assigned a set
    $F_h$ of BLAST (whole-sequence) based homologues
  • Effects:
  • Different assignments F are compatible
  • Different shift values (high level, 0.96, 0.9,
    0.8) lead to hierarchically organised clusterings,
    as if the Figure was true

12
Kinship terms: six similarity matrices
  • This

13
3-way one cluster model
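  • Find binary $s = (s_i)$, $i \in I$
  • $a_{ik,v} = \lambda_v s_i s_k + e_{ik,v}$
  • by minimising
  • $\sum_{i,k,v} (a_{ik,v} - \lambda_v s_i s_k)^2 = \sum_{i,k,v} a_{ik,v}^2
    - \lambda^2 |S|^2$, where $\lambda^2 = \sum_v \lambda_v^2$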
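
The 3-way criterion can be checked numerically in the same way as the single-matrix case. The sketch below assumes one shared cluster S and one intensity per matrix, each set to the average within-cluster similarity of that matrix; the two matrices and the helper name are hypothetical.

```python
def average_similarity(a, cluster):
    """Mean of a[i][k] over all ordered pairs i, k in the cluster."""
    members = list(cluster)
    return sum(a[i][k] for i in members for k in members) / len(members) ** 2

# Two hypothetical similarity matrices over the same three entities.
matrices = [
    [[2.0, 4.0, 0.0], [4.0, 2.0, 1.0], [0.0, 1.0, 3.0]],
    [[1.0, 3.0, 1.0], [3.0, 1.0, 0.0], [1.0, 0.0, 2.0]],
]
S = {0, 1}            # the shared cluster
n = 3

lambdas = [average_similarity(a, S) for a in matrices]   # one intensity per matrix

scatter = sum(a[i][k] ** 2 for a in matrices
              for i in range(n) for k in range(n))
explained = len(S) ** 2 * sum(lam ** 2 for lam in lambdas)
residual = sum((a[i][k] - (lam if i in S and k in S else 0.0)) ** 2
               for a, lam in zip(matrices, lambdas)
               for i in range(n) for k in range(n))

print(lambdas, scatter, explained, residual)
```

With these values the total scatter of 77 splits into an explained part of 52 and a residual of 25.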

15
Future work
  • Research into hierarchical similarities
  • Bioinformatics clustering
  • Temporal data clustering
  • Multi-region data clustering
  • Web clustering