Title: Matrix Tile Analysis
1. MATRIX TILE ANALYSIS (MTA)
Inmar Givoni, Vincent Cheung, Brendan Frey
2. Goal: explain patterns in matrix data as a set of non-overlapping generalized tiles
- A tile is a subset of similar matrix elements defined by a subset of rows and columns.
- Under some permutation of the rows and columns, all elements of a particular tile form a contiguous block.
- But there may not exist one permutation for which all tiles appear contiguous.
- Assumptions:
  - Each matrix element belongs to at most one tile.
  - A row/column may contain elements belonging to many tiles.
MTA is applicable to datasets that are represented as matrices, e.g. high-throughput biological data and collaborative filtering.
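
The tile definition and the two assumptions above can be made concrete with a small sketch. It assumes a tile is stored as a pair of row and column index lists (a representation chosen here for illustration, not taken from the slides) and checks that no matrix element is claimed by more than one tile, while rows and columns may be shared freely.

```python
import numpy as np

def tile_mask(rows, cols, shape):
    """Boolean mask of the elements covered by one tile (rows x cols)."""
    mask = np.zeros(shape, dtype=bool)
    mask[np.ix_(rows, cols)] = True
    return mask

def tiles_overlap(tiles, shape):
    """Return True if any matrix element is claimed by more than one tile."""
    coverage = np.zeros(shape, dtype=int)
    for rows, cols in tiles:
        coverage += tile_mask(rows, cols, shape)
    return bool((coverage > 1).any())

shape = (5, 5)
# Both tiles use row 2, but they share no element: allowed.
tiles_ok = [([0, 2], [0, 1]), ([2, 3], [2, 4])]
# Both tiles claim element (2, 1): violates the first assumption.
tiles_bad = [([0, 2], [0, 1]), ([2, 3], [1, 4])]

print(tiles_overlap(tiles_ok, shape), tiles_overlap(tiles_bad, shape))
```

Note that the rows of a tile need not be adjacent in the stored matrix; they become contiguous only after a suitable permutation.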
3. Existing methods
- Matrix Factorization (PCA, ICA, NNMF, ...)
- Assume sensibly defined addition and multiplication of matrix elements.
- Not necessarily the case: likelihoods/non-ordinal input, etc.
[Figure: matrix factorization diagram; labels X, Y, Z, N, M, C]
4. Existing methods
- Clustering (hierarchical, K-means, ...)
- Assume each row/column is in a single cluster
- Similarity based on the entire row/column
- We may wish for each row/column to be explained
by several clusters.
5. Probabilistic Model for MTA
- Input for an N×M matrix X:
- T: the number of tiles to search for
- By ranging over T we can do model selection.
- L0, L: user-defined data likelihoods of X's elements under the background model and the foreground model, respectively
- The model can easily be extended to more than one foreground model.
For example: background (L0), tile elements (L).
6. Probabilistic Model for MTA
- Latent indicator variables for each tile:
- r_i^t = 1 if the ith row of X contains elements in tile t
- c_j^t = 1 if the jth column of X contains elements in tile t
P(X, r, c) = P(r, c) P(X | r, c)
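
As a rough illustration of the likelihood term P(X | r, c), the sketch below scores a configuration of row and column indicators. It assumes that element x_ij belongs to tile t exactly when both its row and column indicators for t are 1, that elements are conditionally independent, and that the user supplies per-element log-likelihood matrices `logL0` (background) and `logL` (foreground). These names and the array layout are illustrative. The prior P(r, c) is not specified on the slide, so it is left out here.

```python
import numpy as np

def log_likelihood(logL0, logL, r, c):
    """Sketch of log P(X | r, c).

    r has shape (T, N) and c has shape (T, M), with 0/1 entries.
    logL0[i, j] and logL[i, j] are the log-likelihoods of x_ij under
    the background and foreground models.
    Returns -inf when some element is claimed by more than one tile.
    """
    # membership[t, i, j] = r[t, i] * c[t, j]
    membership = r[:, :, None] * c[:, None, :]
    count = membership.sum(axis=0)
    if (count > 1).any():
        return -np.inf
    in_tile = count == 1
    return float(np.where(in_tile, logL, logL0).sum())

logL0 = np.full((2, 3), -1.0)
logL = np.full((2, 3), -0.5)

# One tile covering row 0 and columns 0 and 1.
r = np.array([[1, 0]])
c = np.array([[1, 1, 0]])
score = log_likelihood(logL0, logL, r, c)

# Two tiles that both claim element (0, 1).
r_bad = np.array([[1, 0], [1, 0]])
c_bad = np.array([[1, 1, 0], [0, 1, 1]])
score_bad = log_likelihood(logL0, logL, r_bad, c_bad)
```

Working with log-likelihood matrices rather than raw data keeps the sketch independent of how the likelihoods are defined, which matches the slide's point that L0 and L are user-defined.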
7. Factor Graph for MTA
- We introduce dummy indicator variables for every matrix element in every tile.
[Figure: factor graph with components labeled Tile 1 through Tile T]
Factor-graph annotations:
- Enforces the constraint that if x_ij is in tile t, the corresponding row and column indicators must be active.
- Enforces the constraint that each element is accounted for by at most one tile.
MTA-SP: perform inference using the sum-product algorithm (loopy belief propagation).
8. Alternative methods to sum-product for solving MTA
- MTA-ICM: iterative row/column updates such that the constraints are not violated.
- PCA: extract T principal components (of e.g. the columns); for each component, threshold to find the corresponding component elements (which row is in which tile); project the matrix columns.
- Plaid (Lazzeroni & Owen, 2000): interpret each layer as either a background or a foreground tile.
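
The ICM-style alternative can be sketched as a generic greedy coordinate ascent. This is not necessarily the authors' exact update schedule. It assumes the same element-membership rule and user-supplied log-likelihood matrices as the model (the function and argument names are hypothetical). It flips one row or column indicator at a time and keeps a flip only if it raises the score without violating the one-tile-per-element constraint.

```python
import numpy as np

def log_lik(logL0, logL, r, c):
    """Log-likelihood of a configuration; -inf if tiles overlap."""
    count = (r[:, :, None] * c[:, None, :]).sum(axis=0)
    if (count > 1).any():
        return -np.inf
    return float(np.where(count == 1, logL, logL0).sum())

def icm(logL0, logL, r, c, max_sweeps=50):
    """Greedy coordinate ascent over the row and column indicators."""
    r, c = r.copy(), c.copy()
    best = log_lik(logL0, logL, r, c)
    for _ in range(max_sweeps):
        changed = False
        for ind in (r, c):
            for idx in np.ndindex(ind.shape):
                ind[idx] ^= 1          # tentatively flip one indicator
                score = log_lik(logL0, logL, r, c)
                if score > best:
                    best, changed = score, True
                else:
                    ind[idx] ^= 1      # revert: no gain or constraint violated
        if not changed:
            break
    return r, c, best

# Toy problem: the foreground model is favoured on rows {0, 1}, columns {0, 1, 2}.
logL0 = np.zeros((4, 5))
logL = -np.ones((4, 5))
logL[np.ix_([0, 1], [0, 1, 2])] = 1.0

# Start from a single-element seed; an all-zero start cannot improve by
# single flips, because a tile needs both a row and a column to be active.
r0 = np.array([[1, 0, 0, 0]])
c0 = np.array([[1, 0, 0, 0, 0]])
r_hat, c_hat, best = icm(logL0, logL, r0, c0)
```

As with any greedy local search, the result depends on the initialization and may be a local optimum.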
9. Experimental Results
- Generate synthetic tile data
- Corrupt with noise
- X: 40×40, T = 5, σ² = 0.0316
[Figure: result panels for MTA-SP, MTA-ICM, PCA, Plaid]
- Evaluation criteria: Hamming distance, classification error, cost
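
A minimal sketch of this kind of experiment follows. The slide gives only the matrix size, T, and the noise variance, so several details here are assumptions: tiles are square with a fixed side length, foreground elements have mean 1 and background elements mean 0, and the noise is additive Gaussian. The thresholding step is only a stand-in for an inference method, and the Hamming distance is taken as the fraction of elements whose tile/background status differs from the ground truth.

```python
import numpy as np

def make_tiles(n, m, T, side, rng, max_tries=1000):
    """Sample T non-overlapping tiles by rejection.

    Returns labels with labels[i, j] = 0 for background and t for tile t.
    """
    labels = np.zeros((n, m), dtype=int)
    placed = 0
    for _ in range(max_tries):
        if placed == T:
            break
        rows = rng.choice(n, size=side, replace=False)
        cols = rng.choice(m, size=side, replace=False)
        block = np.ix_(rows, cols)
        if (labels[block] == 0).all():   # accept only if no element is taken
            labels[block] = placed + 1
            placed += 1
    if placed < T:
        raise RuntimeError("could not place all tiles without overlap")
    return labels

def hamming_distance(a, b):
    """Fraction of positions where two equally shaped arrays differ."""
    return float(np.mean(np.asarray(a) != np.asarray(b)))

rng = np.random.default_rng(0)
labels = make_tiles(40, 40, T=5, side=6, rng=rng)
truth = labels > 0

# Corrupt the clean tile pattern with Gaussian noise of variance 0.0316.
X = truth.astype(float) + rng.normal(0.0, np.sqrt(0.0316), size=truth.shape)

# Stand-in for an inference method: threshold each element independently.
guess = X > 0.5
print(hamming_distance(truth, guess))
```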
10. Experimental Results: Hamming Distance
[Figure: error rate vs. noise level (σ²)]
- Vary the number of tiles and the matrix size.
- Test across 7 noise levels, 20 different matrices per setting.
11. Experimental Results: Classification Error
[Figure: error rate vs. noise level (σ²)]
12. Experimental Results: Cost
13. Experimental Results on SGA Data
Synthetic genetic interactions (Tong et al., 2004): 135 × 1023 binary interaction matrix.
An interaction is recorded when the deletion of gene A or gene B alone yields a viable mutant, while the double knockout is lethal.
Functional enrichment by GO categories
14. Summary
- Introduced Matrix Tile Analysis
- Factorization of a matrix into non-overlapping similar tiles.
- Probabilistic model and inference algorithms.
- Comparison to existing methods, and application to synthetic and real data.