Structuring Interactive Cluster Analysis

Subject: Interactive data mining
Author: Wayne Oldford
Keywords: clustering, k-means, visual empirical regions of ...

Slides: 83

1
Structuring Interactive Cluster Analysis

Wayne Oldford
University of Waterloo

2
Structuring Interactive Cluster Analysis
This talk is about interactive cluster analysis,
that is, about interactive tools for finding and
identifying groups in data. But more than
that, it's about stepping back and understanding
the structure of this process so that software
tools can be organized to simplify and to aid the
analysis.


3
Overview
The problem of 'cluster analysis', or of 'finding
groups in data', is ill defined. So there can be
no universal solution and any claimed solution
must necessarily solve some other suitably
constrained problem and not the more general
one. What we need instead are highly interactive
tools which allow us to adapt to the
peculiarities of the data and the problem at
hand. These tools are usefully organized and
integrated if we step back and consider the
problem as one of exploratory data analysis,
except that now, in addition to the data itself,
the exploration is to take place as well on the
space of partitions of the data. Existing
algorithms need to be recast, and new ones
developed, in terms of exploring the space of
partitions. The algorithms can then be easily
integrated with other interactive tools so that
jointly they provide a broadly useful and easily
adapted tool-set for finding and identifying
groups in data.
Argument

ill-defined problem
high-interaction desirable
explore partitions
recast algorithms

4
Overview
Develop by example
Argument

ill-defined problem
high-interaction desirable
explore partitions
recast algorithms

problems
resources
interactive clustering
partition moves
implications
prototype interface

5
Problem
geometric/visual structure

Visual system easily identifies groups
algorithms are often motivated and/or understood
via visual intuition and geometric structure
6
Problem
geometric/visual structure

Visual system easily identifies groups
algorithms are often motivated and/or understood
via visual intuition and geometric structure
7
Problem
Consider visually grouping here

Context matters: each point is a document,
located by the frequency of each word within the
document.
8
Problem

Two similar documents of different lengths
should be closer: one of these simply has more
text than the other.
9
Problem

green closer to orange than to red?
distance measured by angle?
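
The question on this slide can be made concrete with a small sketch. It assumes each document is a vector of word counts; the three count vectors below are hypothetical and are not taken from the slides. Euclidean distance separates two documents that differ only in length, while the angle between them does not.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle(u, v):
    """Angle in radians between two count vectors (0 means same word proportions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine)

# Hypothetical word counts over a three-word vocabulary.
short_doc = [2, 1, 0]
long_doc = [20, 10, 0]   # same proportions as short_doc, ten times the text
other_doc = [1, 2, 2]    # similar length to short_doc, different mix of words

print("Euclidean:", euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
print("Angle:    ", angle(short_doc, long_doc), angle(short_doc, other_doc))
```

Under Euclidean distance the short document sits nearer the unrelated one; under the angle it sits nearer its longer counterpart, which is the behaviour the previous slide asks for.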
10
Problem
structure in context
segmentation in MRI
groups are spatially contiguous in the plane
of the image and nearby in the intensity.
shape is not defined a priori
image source
11
Problem
context specific structure
aneurysm presents as intensity in blood
vessels
groups are spatially contiguous tubes of
similar intensity
shape is restricted a priori to be 3-d tubes
image source
12
Problem
some specific some not
image source
same slice, five different measurements at
each location
spatial grouping as before, additional
grouping possible across measurements
13
Problem
some specific some not
image source
4 dimensional data from connected images
2d spatial with clear biological grouping,
connected to
2d intensity measures with abstract
structure/grouping
14
Problem

Find groups in data
Similar objects are together
Groups are separated

Problem is ill defined

What do you mean similar?

E.g. what is contiguous structure?

When are groups separate?

Can we believe it?

15
Computational resources

1. Processing

2. Memory
3. Display
16
Computational resources (and response)

1. Processing

Gflops, Tflops, multiple processors

computationally intensive methods

problem constrained and optimized

2. Memory
3. Display
17
Computational resources (and response)

1. Processing

2. Memory

GBs, TBs, disk and RAM

try to analyze huge data-sets

data-sets larger than necessary?

3. Display
18
Computational resources (and response)

1. Processing

2. Memory
3. Display

high resolution, large

graphics processors, digital video

more data, more visual detail

19
Computational resources

1. Processing

2. Memory
3. Display
Exploit no one resource exclusively. Balance and
integrate.
20
High interaction (much overlooked by researchers)

assume multiple displays

integrate computational resources

challenge is to design software to be simple,
understandable, integrated and extensible

21
Example: image analysis. Find groups via
intensity (contours and two small unusual
structures revealed).
22
Example: image analysis. Other measurements may
contain interesting structure.
23
Example: image analysis. Identify the location of
new structure in the original image.
24
Example: image analysis. Mark new groups by
colour (hue, preserving lightness in original
image).
25
Example: image analysis. Explore relation
between old and new groups via contours in the
image itself.
26
Example: 8 dimensions from teeth
measurements on species (and sex)
27
Example: apes, hominids, modern humans

multiple and very different views
3-d point clouds (of first 3 discriminant
co-ordinates)
cases identified in a list
each point represented as a smooth curve by
projecting it on a direction vector smoothly
moving around the surface of an 8-d sphere
all linked via colour by cases being displayed

context helps
knowing the species encourages grouping
grouping based on context and the visual
information

grouping is confirmed across different kinds of
display

28
Example: mutual support and shapes
a 3-d projection
Shape from all dimensions
How many groups?
29
Example: mutual support and shapes
Groups found here
Same in all dimensions?
How many groups?
30
Example: mutual support and shapes
Observe effect here
Split black group by shape
How many groups?
31
Example: mutual support and shapes
Get new 3-d projection
Coloured by shape
Five groups corroborated
32
Example: exploratory data analysis
How many groups?
33
Example: exploratory data analysis
Choose data to cut away
Explore the rest
Distinguish groups
34
Example: exploratory data analysis
Bring data back
Explore all together
Some black with red?
Focus on centre
35
Example: exploratory data analysis
Explore separately
Mark group
Discard new view
Explore all together
Two groups
36
Interactive clustering

visual grouping
location, motion, shape, texture, ...
linking across displays
manual
selection
cases, variates, groups, ...
colouring
focus
immediate and incremental
context can be used to form groups
multiple partitions

37
Automated clustering: typical software

resources dedicated to numerical computation
teletype interaction
runs to completion
graphical output
doesn't always work so well (no universal solution)
confirm via exploratory data analysis

Must be integrated with interactive methods
38
Example: K-means clustering
K = 2 groups
Starting groups, as shown, place the central ball
in one group.
K-means moves one point at a time to improve the
2 groups.
39
Example: K-means clustering
K = 2 groups
Final groups shown maximize an F-like statistic
(between/within).
Central ball is lost
K-means poor for this data configuration
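
A minimal sketch of why the central ball is lost, assuming the F-like statistic is the usual ratio of between-group to within-group mean squares. The data generator, its sizes and its radii are hypothetical stand-ins for the configuration on the slide. Because the ball and the surrounding sphere share a centre, the ball-versus-sphere partition has almost no between-group variation, so a cut straight through the middle scores far higher.

```python
import math
import random

def make_ball_in_sphere(n_ball=50, n_shell=150, seed=0):
    """Hypothetical 3-d data: a small central ball inside a unit spherical shell."""
    rng = random.Random(seed)

    def unit_vector():
        v = [rng.gauss(0, 1) for _ in range(3)]
        length = math.sqrt(sum(x * x for x in v))
        return [x / length for x in v]

    points, labels = [], []
    for _ in range(n_ball):
        radius = 0.2 * rng.random()           # somewhere within radius 0.2
        points.append([radius * x for x in unit_vector()])
        labels.append(0)
    for _ in range(n_shell):
        points.append(unit_vector())          # on the unit sphere
        labels.append(1)
    return points, labels

def f_like(points, labels):
    """Between-group mean square divided by within-group mean square."""
    n, dim = len(points), len(points[0])
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    k = len(groups)
    grand = [sum(p[j] for p in points) / n for j in range(dim)]
    between = within = 0.0
    for members in groups.values():
        centre = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        between += len(members) * sum((c - m) ** 2 for c, m in zip(centre, grand))
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in members)
    return (between / (k - 1)) / (within / (n - k))

points, ball_vs_shell = make_ball_in_sphere()
half_split = [0 if p[0] < 0 else 1 for p in points]   # cut through the middle
f_ball_vs_shell = f_like(points, ball_vs_shell)
f_half_split = f_like(points, half_split)
print(f_ball_vs_shell, f_half_split)
```

Any method that climbs this statistic is pulled towards the half split, whatever the starting groups.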
40
Example: VERI (Visual Empirical Regions of
Influence)

join points if no third point falls in this
region
41
Example: VERI (Visual Empirical Regions of
Influence)

join points if no third point falls in this
region
42
Visual Empirical Regions of Influence

psychophysical experiments of human visual
perception to join data points
very special circumstances (two lines of three
equi-spaced points each)
works well on demonstration 2-d cases
extends to higher dimensions
two points are joined or not depending on their
joint configuration with a third point
each third point examined forms a plane with the
candidate pair and so VERI shape applies
works in high-d with published demonstration cases
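
The joining rule can be sketched generically. The transcript does not give the shape of the VERI region, so the code below takes the region as a pluggable predicate and, purely as a stand-in, uses the disc with the candidate pair as diameter (the Gabriel-graph rule). That stand-in is an assumption and is not the VERI region; the function names are hypothetical.

```python
from itertools import combinations

def in_diameter_disc(p, q, r):
    """Stand-in region (NOT the VERI shape): open ball with segment pq as diameter."""
    centre = [(a + b) / 2 for a, b in zip(p, q)]
    radius_sq = sum((a - b) ** 2 for a, b in zip(p, q)) / 4
    return sum((a - c) ** 2 for a, c in zip(r, centre)) < radius_sq

def empty_region_edges(points, in_region=in_diameter_disc):
    """Join points i and j when no third point falls in the region they define."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        blocked = any(
            in_region(points[i], points[j], points[k])
            for k in range(len(points))
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Three points: the third lies almost on the segment joining the first two.
demo = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
print(empty_region_edges(demo))
```

Groups would then be read off as the connected components of the joined points. Swapping in a different `in_region` predicate changes which pairs are joined without changing the surrounding loop.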

43
Example: VERI
Each colour is a different group found by VERI.
Central ball is lost.
VERI fails for this data configuration (also for
small perturbations of demonstration cases).
There is no universal method, nor can there be.
44
Example: VERI (with parameters)
VERI algorithm, but parameterized now to shrink
region size. Becomes minimal spanning tree in the
limit (MST gets 2 groups here).
Again, no universal method is possible, but methods
can be parameterized.
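
For the minimal-spanning-tree limit, one common way to obtain groups is to build the tree and delete its longest edges. The sketch below assumes that reading (the slides do not spell out how the MST yields 2 groups) and uses Prim's algorithm on Euclidean distances; the function names are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])   # nearest point not yet in the tree
        length, i = best.pop(j)
        edges.append((length, i, j))
        for k in best:                             # update distances to the tree
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def mst_groups(points, n_groups=2):
    """Drop the longest n_groups - 1 tree edges and label the connected pieces."""
    kept = sorted(mst_edges(points))[: len(points) - n_groups]
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(points))]

demo = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_groups(demo, n_groups=2))
```

The number of groups is the parameter here, which is in keeping with the slide's point that methods can be parameterized even though none is universal.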
45
Integrating automatic methods

Move about the space of partitions
Pa --> Pb --> Pc --> ...

Which operators f, with f(Pa) --> Pb,
are of interest?
46
Refine
Need not be nested. Nesting produces hierarchy
Reduce
47
Reassign
48
Refinement sequence

Begin with partition containing all points in one
group.
49
Refinement sequence

-> 2
Refine partition to move to a new partition
containing two groups.
This refinement was obtained by projecting all
points onto the eigenvector belonging to the
largest eigenvalue of the sample variance-covariance
matrix and splitting at the largest gap between
projected points.
Blue points are on the outer sphere.
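
The refinement move described on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the prototype's code: the leading eigenvector is found by power iteration (a numerical library routine would normally be used), and the function names are hypothetical.

```python
import math
import random

def principal_direction(points, iterations=200, seed=0):
    """Leading eigenvector of the sample variance-covariance matrix (power iteration)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centred = [[p[j] - mean[j] for j in range(dim)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]      # random start avoids special cases
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def split_at_largest_gap(points):
    """Project onto the principal direction and cut at the widest gap; returns 0/1 labels."""
    direction = principal_direction(points)
    scores = [sum(x * d for x, d in zip(p, direction)) for p in points]
    ordered = sorted(scores)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    cut = (ordered[widest] + ordered[widest + 1]) / 2
    return [0 if s < cut else 1 for s in scores]

demo = [(0, 0), (1, 0.1), (2, -0.1), (10, 0), (11, 0.1), (12, -0.1)]
labels = split_at_largest_gap(demo)
print(labels)
```

Which side receives label 0 depends on the sign of the eigenvector, which is arbitrary; only the split itself is meaningful.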
50
Refinement sequence

-> 2
-> 3
Refine partition (2) to move to a new partition
containing three groups.

Refinement move
select group whose sample var-cov matrix has
largest eigen-value
for that group, project and split as before.

Green points are also on the outer sphere.
51
Refinement sequence

-> 2
-> 3
-> 4
Refine partition (3) to move to a new partition
containing four groups.
Refinement move as before, again splits red group.
New group contains a single (magenta) point on
the outer sphere (middle right, up).
Exploration of the data shows this to be a very
poor partition with that single isolated point.
52
Refinement sequence

-> 2
-> 3
-> 4
-> 5
Refine partition (4) to move to a new partition
containing five groups.
Refinement move as before, again splits red group.
New group contains a single (black) point on the
outer sphere (bottom left).
Again a poor partition; no further refinement
step is taken at this point.
53
Reassign, reduce sequence

-> 5
A reassign move from one partition of five to
another.
Reassignment move: k-means maximizing an F
statistic.
Seems a better partition than before; explore to
confirm.
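
A reassign move of this kind can be sketched as k-means started from the current partition instead of from random centres. For a fixed number of groups the total sum of squares is constant, so lowering the within-group sum of squares raises the between/within F ratio. The sketch uses batch updates (all points reassigned per sweep), which is an assumption; the slides describe moving one point at a time.

```python
import math

def reassign_kmeans(points, labels, max_sweeps=100):
    """Reassign every point to its nearest group centre until nothing changes."""
    labels = list(labels)
    dim = len(points[0])
    for _ in range(max_sweeps):
        members = {}
        for p, g in zip(points, labels):
            members.setdefault(g, []).append(p)
        centres = {g: [sum(p[j] for p in ps) / len(ps) for j in range(dim)]
                   for g, ps in members.items()}
        new_labels = [min(centres, key=lambda g: math.dist(p, centres[g]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

demo = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [0, 0, 1, 1, 1, 1]          # third point starts in the wrong group
print(reassign_kmeans(demo, start))
```

A group can empty out during the sweeps, so the result may have fewer groups than the starting partition.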
54
Explore present partition

Reassignment seems to have isolated central red
ball.
Remaining groups distributed around a spherical
surface.
Consider reduction moves from this partition to
nearby partitions with fewer groups.
55
Partition to be reduced

Same partition - back in the original position to
make subsequent reduction moves visually
comparable with previous refinement and
reassignment moves.
Choice of reduction move can be based on what we
have learned from exploring this partition.
56
Reduce sequence

-> 4
Reduce partition (5) to move to a new partition
containing four groups.
Reduction move: single linkage between groups,
i.e. join the closest two groups as measured by
Euclidean distance between nearest points in
each group.
Seems reasonable choice given structure observed
in previous exploration.
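
The single-linkage reduction move, written as a function from one partition to the next. Partitions are represented here as a list of group labels, one per point; that representation and the function name are assumptions made for the sketch.

```python
import math
from itertools import combinations

def reduce_single_linkage(points, labels):
    """Merge the two groups whose nearest points are closest; returns new labels."""
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)

    def gap(a, b):
        # Euclidean distance between the nearest pair of points, one from each group.
        return min(math.dist(p, q) for p in groups[a] for q in groups[b])

    keep, absorb = min(combinations(groups, 2), key=lambda pair: gap(*pair))
    return [keep if g == absorb else g for g in labels]

demo = [(0, 0), (1, 0), (5, 0), (6, 0), (20, 0)]
partition = [0, 0, 1, 1, 2]
print(reduce_single_linkage(demo, partition))
```

Applying the function repeatedly to its own output gives a reduce sequence such as 5 -> 4 -> 3 -> 2; it needs at least two groups to start from.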
57
Reduce sequence

-> 4
-> 3
Reduce partition (4) to move to a new partition
containing three groups.
Reduction move: as before.
Red ball remains.
Exploration suggests one more reduction move.
58
Reduce sequence

-> 4
-> 3
-> 2
Reduce partition (3) to move to a new partition
containing two groups.
Reduction move: as before.
This partition seems best.
Interactive exploration is important for choosing
the type and details of potentially interesting
moves from one partition to another.
59
Moves (generic functions)
examples

refine (Pold) --> Pnew

break minimal spanning tree

reduce (Pold) --> Pnew

join near centres

reassign (Pold) --> Pnew

k-means maximize F

partition (graphic) --> Pnew

colours from point cloud
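
One way to read "generic functions" is that every move shares a signature, taking the data and a partition and returning a partition, so that moves can be chained freely. The sketch below assumes that reading. The type names, `follow_path` and the two deliberately simple toy moves are all hypothetical and stand in for real refine and reduce methods.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Partition = List[int]                                    # one group label per point
Move = Callable[[Sequence[Point], Partition], Partition]

def refine_by_median(points, labels):
    """Toy refine: split the largest group at the median of its first coordinate."""
    biggest = max(set(labels), key=labels.count)
    xs = sorted(p[0] for p, g in zip(points, labels) if g == biggest)
    median = xs[len(xs) // 2]
    new = max(labels) + 1
    return [new if g == biggest and p[0] >= median else g
            for p, g in zip(points, labels)]

def reduce_last_two(points, labels):
    """Toy reduce: merge the two highest-numbered groups."""
    ids = sorted(set(labels))
    if len(ids) < 2:
        return list(labels)
    return [ids[-2] if g == ids[-1] else g for g in labels]

def follow_path(points, start, moves):
    """Apply moves in turn, keeping every partition visited along the way."""
    path = [list(start)]
    for move in moves:
        path.append(move(points, path[-1]))
    return path

demo = [(0,), (1,), (2,), (3,)]
path = follow_path(demo, [0, 0, 0, 0],
                   [refine_by_median, refine_by_median, reduce_last_two])
print([len(set(p)) for p in path])   # number of groups in each partition visited
```

Keeping the whole path, not just the final partition, is what allows earlier partitions to be saved, revisited and compared.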
60
Challenges

varying focus
subsets (selected manually and at random)
merging new data into partition

exploring multiple partitions
interactive display and comparison
resolving many to one

interface design
control panels, options
interaction

61
A prototype interface

cluster analysis hub
an analysis hub (Oldford, 1997) created on
demand for partition
having all points in one group for named
data-set, or
as defined by colours of all points in topmost
plot, or
as defined by colours of selected points in
topmost plot
new hub can always be created for any subset
maintains list of saved partitions
offers moves from current partition via one of
reduce, refine, or reassign
manually from current colours (so as to capture
interactive modification of existing partition)
Other operations on one or more partitions (e.g.
cluster plot, dendrogram, ...)

62
Interface illustration: details of moves

Each move - refine, reduce, reassign - is an
entire collection of possible moves, each with
many possible choices.
The next few slides illustrate the prototype
implementation where
Buttons for refine, reduce, and reassign are
given at the topmost level.
Once selected, each button pops up its own
control panel where various different kinds of
moves and parameter choices can be made. E.g.
the analyst might choose to reduce by any of:
Join groups with closest centres using Euclidean
distance
Join groups whose farthest points are closest
(i.e. complete linkage)
Choose group with greatest spread and disperse
its points among the remaining groups.
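
The third option, dispersal, could be sketched as below. Two details are assumptions because the slides do not define them: spread is taken to be the mean squared distance to the group centre, and each dispersed point goes to the remaining group with the nearest centre.

```python
import math

def reduce_by_dispersal(points, labels):
    """Dissolve the group with greatest spread, sending its points to nearest centres."""
    dim = len(points[0])
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    centres = {g: [sum(p[j] for p in ps) / len(ps) for j in range(dim)]
               for g, ps in groups.items()}
    # Assumed spread measure: mean squared distance of members to their centre.
    spread = {g: sum(math.dist(p, centres[g]) ** 2 for p in ps) / len(ps)
              for g, ps in groups.items()}
    loosest = max(spread, key=spread.get)
    remaining = [g for g in groups if g != loosest]
    return [g if g != loosest
            else min(remaining, key=lambda r: math.dist(p, centres[r]))
            for p, g in zip(points, labels)]

demo = [(0, 0), (0, 1), (10, 0), (10, 1), (1, 0), (9, 0)]
partition = [0, 0, 1, 1, 2, 2]      # group 2 straddles the other two
print(reduce_by_dispersal(demo, partition))
```

Unlike the linkage-based options, this move produces a partition that is generally not nested with the one it started from.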

63
Interface - reduce
64
Interface - refine
65
Interface - reassign
66
Interface illustration: example of use

The next few slides illustrate the prototype
implementation applied to a ball-in-a-sphere
data-set (a different one from before).
Moves are made about the partition space (refine
and reassign).
Partitions are saved (can be named, deleted,
revisited, etc.)
Nested partitions compared via a dendrogram
Non-nested partition compared with nested ones
N.B. at any time, the analyst could have
interacted with any graphic
to create a new partition by colouring - using
manual button
focus on a subset to examine via a new cluster
analysis hub and subsequently incorporate that
into the partition of the whole data-set.

67
Interaction
Start with partition having all points in a
single group.
Selecting refine pops up the refinement panel.
Choose refinement details.

Refinement move
Choose group with var-cov having largest eigen
value.
Project these points onto corresponding
eigen-vector.
Split this group where the projected gap is
largest.

68
Interaction
New partition appears as "Refine Dataset" in
panel at left.
Refinement details unchanged.
Refine produces new partition having two groups
as shown by different colours in all graphics.
69
name and save partition
Saved partition list.
New partition is named and saved.
Refinement details unchanged.
New partition has three groups.
70
prototype - refine to 4
Refinement details unchanged.
New partition has four groups.
71
prototype - refine to 5
Refinement details unchanged.
No further refinement pursued beyond this one.
New partition has five groups. The fifth group
contains a single point (blue, top right).
72
Select nested partitions and view dendrogram
1. Select nested partitions.
2. Dendrogram button.
3. Dendrogram shows 5 nested partitions.
Each block is a group; a horizontal cut at each
vertical level is a partition.
Size and colour proportions vary with number of
points.
Colouring is as displayed in point cloud (here
showing the current partition).

73
Reassign, dendrogram updated
New partition appears as "Reassign Dataset" in
panel at left.

Reassign move to new partition.
Details
k-means
max F statistic

Colours update in all graphics including the
dendrogram
Reassignment partition can be explored as usual.
This partition can be visually compared with
previous partitions via the updated colours in
the dendrogram.

74
Cluster plot and dendrogram: interaction movie
Cluster plot button operates on selected partition

Cluster plot
groups as boxes
close groups are visually close (via
multi-dimensional scaling)

Nested and non-nested partitions can be visually
compared simultaneously through interaction.
75
Other operators

dissimilarity (Pi, Pj) --> d_ij

display (P1, ..., Pm)

dendrogram if P1 < ... < Pm

mds plot of all clusters in P1, ..., Pm

mds plot of all partitions P1, ..., Pm
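
Two of these operators can be sketched on label lists. The nesting check decides whether a sequence of partitions could be drawn as a dendrogram; it assumes the sequence is ordered finest first. The dissimilarity shown is one minus the Rand index, chosen only as a familiar example since the slides do not name a measure. All function names are hypothetical.

```python
from itertools import combinations

def is_refinement(fine, coarse):
    """True if every group of `fine` lies inside a single group of `coarse`."""
    parent = {}
    for f, c in zip(fine, coarse):
        if parent.setdefault(f, c) != c:
            return False
    return True

def is_nested_sequence(partitions):
    """True if each partition refines the next one (finest first, coarsest last)."""
    return all(is_refinement(a, b) for a, b in zip(partitions, partitions[1:]))

def rand_dissimilarity(p, q):
    """Share of point pairs on which two partitions disagree (1 - Rand index)."""
    pairs = list(combinations(range(len(p)), 2))
    disagree = sum((p[i] == p[j]) != (q[i] == q[j]) for i, j in pairs)
    return disagree / len(pairs)

p1 = [0, 1, 2, 3]      # every point alone
p2 = [0, 0, 1, 1]      # two pairs
p3 = [0, 0, 0, 0]      # one group
q = [0, 1, 0, 1]       # two pairs, but cut the other way

print(is_nested_sequence([p1, p2, p3]), is_nested_sequence([p1, q, p2]))
print(rand_dissimilarity(p2, q))
```

A matrix of such pairwise dissimilarities is the kind of input a multi-dimensional scaling plot of partitions would need.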

76
Creation

partition (Data ...) --> Pnew
manually from colours
k-means, random start, mst, veri, etc.
from existing classifier

partition-path (Data ...) --> P1, P2, ..., Pn

partition-path (Pold ...) --> Pold, P1, P2, ..., Pn

e.g. nested sequence from hierarchical clustering

77
Composition

resolve (P1, ..., Pm) --> Pnew
combine different partitions of the same data

merge (Data, Pold) --> Pnew
classify additional points

merge (Pa, Pb) --> Pnew
combine non-overlapping partitions
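
The slides name these composition operators without fixing how they work, so the sketch below picks one simple rule for each and labels it as an assumption: `resolve` takes the common refinement (two points stay together only if every input partition keeps them together), `merge_new_points` classifies additional points by nearest group centre, and `merge_disjoint` concatenates partitions of non-overlapping subsets. Integer group labels are assumed throughout.

```python
import math

def resolve(*partitions):
    """Assumed rule: common refinement of several partitions of the same data."""
    relabel = {}
    return [relabel.setdefault(combo, len(relabel)) for combo in zip(*partitions)]

def merge_new_points(old_points, old_labels, new_points):
    """Assumed rule: give each additional point the label of the nearest group centre."""
    dim = len(old_points[0])
    groups = {}
    for p, g in zip(old_points, old_labels):
        groups.setdefault(g, []).append(p)
    centres = {g: [sum(p[j] for p in ps) / len(ps) for j in range(dim)]
               for g, ps in groups.items()}
    added = [min(centres, key=lambda g: math.dist(p, centres[g])) for p in new_points]
    return list(old_labels) + added

def merge_disjoint(labels_a, labels_b):
    """Combine partitions of non-overlapping subsets, keeping their groups distinct."""
    offset = max(labels_a) + 1
    return list(labels_a) + [g + offset for g in labels_b]

print(resolve([0, 0, 1, 1], [0, 1, 1, 1]))
print(merge_new_points([(0, 0), (0, 1), (10, 0)], [0, 0, 1], [(1, 0), (9, 1)]))
print(merge_disjoint([0, 0, 1], [0, 1]))
```

Other resolution rules are possible, for example a consensus that tolerates some disagreement; the common refinement is simply the most conservative choice.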

78
Implications

Algorithms (re)cast in terms of moves
refine, reduce
reassign
partition, partition-path
easily understandable (e.g. geometric structures)
specify required data structures
e.g. ms tree, triangulation, var-cov matrix, ...

79
New problems

interface design
multiple partitions
comparison and/or resolution
multiple display
inference

80
Summary

Cluster analysis is naturally exploratory and
needs integration with modern interactive data
analysis.
Enlarging the problem to partitions
simplifies and gives structure
encourages exploratory approach
integrates naturally
introduces new possibilities (analysis and
research)

81
Related references

Interactive clustering: CASI talk, Oldford (2001)
Quail: Overview (Interface 1998), graphics
(Hurley and Oldford, ISI 1999) and code.
Design principles: Oldford (Interface 1999)
Analysis hubs: Oldford (Interface 1997)