Title: Mining Discrete Patterns via Binary Matrix Factorization
1 Mining Discrete Patterns via Binary Matrix Factorization
- Jieping Ye
- Arizona State University
Joint work with Baohong Shen and Shuiwang Ji
2 Rank-One Binary Matrix Factorization
compression, clustering, pattern discovery
10110101110110
01110000000110
00110101110110
00110101110110
00110101110111
00000111101010
indicator vector
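A minimal sketch of what a rank-one binary approximation measures: the matrix is approximated by the outer product of a binary indicator vector (which rows carry the pattern) and a binary pattern vector, and the error is the number of mismatched entries. The six 14-bit rows are read off the slide; the particular `indicator` and `pattern` values and the helper names `bits` and `rank_one_error` are illustrative assumptions, not output of the authors' algorithm.

```python
def bits(s):
    """Turn a string such as '0110' into a list of 0/1 integers."""
    return [int(ch) for ch in s]

def rank_one_error(rows, indicator, pattern):
    """Count entries where A differs from the outer product indicator * pattern^T."""
    error = 0
    for row, x_i in zip(rows, indicator):
        for a_ij, y_j in zip(row, pattern):
            if a_ij != x_i * y_j:
                error += 1
    return error

rows = [bits(s) for s in [
    "10110101110110",
    "01110000000110",
    "00110101110110",
    "00110101110110",
    "00110101110111",
    "00000111101010",
]]
pattern = bits("00110101110110")   # assumed pattern vector
indicator = [1, 0, 1, 1, 1, 0]     # assumed indicator vector: rows that carry the pattern

error = rank_one_error(rows, indicator, pattern)
total_ones = sum(map(sum, rows))   # error of the trivial all-zero approximation
print(error, total_ones)
```

Comparing `error` with `total_ones` shows how much one pattern/indicator pair compresses the matrix relative to approximating it with all zeros.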
3 Application I: Image Compression
4 An Example of Tree for 45 Images from Stage Range 4-6, Built by Our Algorithm
Application II: Hierarchy Construction
5 An Example of Tree for 45 Images from Stage Range 4-6, Built by Our Algorithm
Application III: Pattern Discovery
M. Koyutürk, A. Grama, and N. Ramakrishnan,
Compression, clustering and pattern discovery in
very high dimensional discrete-attribute
datasets, IEEE TKDE, 2005.
6 Binary Rank-One Approximation: Problem Formulation
7 Binary Rank-One Approximation: Challenges
- Can we compute an approximate solution with a guaranteed error bound?
- Can we compute it efficiently?
- The problem is conjectured to be NP-hard.
- The existing approach is based on iterative updating:
- Koyutürk, M., Grama, A. PROXIMUS: A framework for analyzing very high dimensional discrete-attributed datasets. KDD'03.
- It is a heuristic, without known guarantees on approximation errors.
- It very often results in undesirable rank-one approximations.
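For orientation, iterative updating can be sketched as alternating between the two vectors: with the pattern vector fixed, each row joins the pattern only if doing so lowers the mismatch count, and symmetrically for columns. This is a generic alternating heuristic written under that assumption, not the PROXIMUS implementation; the function names, the starting vector, and the strict tie-breaking rule are all illustrative choices.

```python
def mismatch(A, x, y):
    """Number of entries where A differs from the outer product x * y^T."""
    return sum(A[i][j] != x[i] * y[j] for i in range(len(x)) for j in range(len(y)))

def alternating_rank_one(A, y, max_iters=100):
    """Alternately re-solve for x (rows) and y (columns) until neither changes."""
    n, m = len(A), len(A[0])
    x = [0] * n
    for _ in range(max_iters):
        ones_y = sum(y)
        # Row i helps iff it matches more of y's ones than it misses: 2 * (A y)_i > |y|.
        new_x = [1 if 2 * sum(a * b for a, b in zip(row, y)) > ones_y else 0 for row in A]
        ones_x = sum(new_x)
        # Column j helps iff more than half of the selected rows have a 1 in it.
        new_y = [1 if 2 * sum(A[i][j] * new_x[i] for i in range(n)) > ones_x else 0
                 for j in range(m)]
        if new_x == x and new_y == y:
            break
        x, y = new_x, new_y
    return x, y

A = [[int(ch) for ch in s] for s in [
    "10110101110110",
    "01110000000110",
    "00110101110110",
    "00110101110110",
    "00110101110111",
    "00000111101010",
]]
x, y = alternating_rank_one(A, y=list(A[0]))   # assumed start: the first row as pattern
print(x, y, mismatch(A, x, y))
```

Each update can only lower or keep the mismatch count, but the fixed point depends on the starting vector, which is why such a scheme carries no guarantee on the approximation error.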
8 Regularized Binary Rank-One Approximation
9 Equivalent Reformulation
Maximum Weight Problem (MWP)
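The slide's formula is not preserved, so the following is only a sketch of why such a reformulation is plausible in the unregularized case. It assumes weights U[i][j] = 2*A[i][j] - 1 (an assumption, +1 for a one and -1 for a zero) and checks by brute force that the squared error of a rank-one binary approximation equals the number of ones in A minus the total weight of the selected submatrix, so minimizing the error is the same as maximizing the weight.

```python
from itertools import product

def squared_error(A, x, y):
    """Squared Frobenius error of approximating A by the outer product x * y^T."""
    return sum((A[i][j] - x[i] * y[j]) ** 2
               for i in range(len(x)) for j in range(len(y)))

def selected_weight(U, x, y):
    """Total weight of the submatrix picked out by indicator vectors x and y."""
    return sum(U[i][j] * x[i] * y[j]
               for i in range(len(x)) for j in range(len(y)))

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]                                # small illustrative matrix
U = [[2 * a - 1 for a in row] for row in A]    # assumed weights, not from the slide
ones = sum(map(sum, A))

identity_holds = all(
    squared_error(A, x, y) == ones - selected_weight(U, x, y)
    for x in product([0, 1], repeat=3)
    for y in product([0, 1], repeat=3)
)
print(identity_holds)
```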
10 Our Main Contributions
- An exact formulation for MWP, using integer linear programming.
- A formulation for error-bounded approximation, using integer linear programming.
- A proof of the error bound.
- Efficient algorithms to solve the error-bounded approximation.
11 Overview
- This is the first polynomial time algorithm that
computes an approximate solution with a
guaranteed error bound.
Binary Rank-One Matrix Approximation -> (reformulation) -> Maximum Weight Problem (MWP)
- This is the first work that explicitly connects
binary matrix factorization and minimum s-t cut.
12 Formulation for Exact Solutions
Original formulation
- If $x_{1i} = x_{2j} = 1$, then $z_{i,j} \le 1$.
- $U_{i,j} > 0 \Rightarrow z_{i,j} = 1$.
- If one of $x_{1i}$ and $x_{2j}$ is 0, then $z_{i,j} \le 0.5$.
- $z_{i,j}$ is an integer $\Rightarrow z_{i,j} = 0$.
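The reasoning in these bullets can be checked mechanically. The sketch below assumes the constraint behind them has the form 2*z <= x1_i + x2_j (an assumption; the slide's formula is not preserved) and confirms that the largest integer z allowed by it equals the product x1_i * x2_j, which is what lets a linear constraint stand in for the product when the weight is positive.

```python
from itertools import product

def best_z(x1_i, x2_j):
    """Largest z in {0, 1} satisfying the assumed constraint 2*z <= x1_i + x2_j."""
    return max(z for z in (0, 1) if 2 * z <= x1_i + x2_j)

# Enumerate all four combinations of the two binary variables.
table = {(a, b): best_z(a, b) for a, b in product((0, 1), repeat=2)}
for (a, b), z in sorted(table.items()):
    print(a, b, z)
```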
13 Formulation for Approximate Solutions I
14 Formulation for Approximate Solutions II
Proposition: The objective value of ILP2 is no less than that of ILP1 for the same problem instance.
15 Approximation Error
- ILP2 achieves an error-bounded approximation.
16 Linear Programming Relaxation of ILP2
- Proposition: The coefficient matrix of the constraints in ILP2 is totally unimodular.
- I. Heller and C. B. Tompkins. An extension of a theorem of Dantzig's. Ann. of Math. Stud., no. 38, pages 247-254, 1956.
- We can obtain an exact solution of ILP2 by solving its LP relaxation.
- LP is still computationally expensive for a large matrix A.
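Total unimodularity means every square submatrix has determinant -1, 0, or 1. The brute-force check below illustrates the definition on two tiny matrices chosen for illustration (a bipartite vertex-edge incidence matrix and a triangle incidence matrix); neither is the ILP2 constraint matrix, and the exponential enumeration is only practical at this size.

```python
from itertools import combinations

def det(M):
    """Integer determinant by Laplace expansion along the first row (tiny matrices only)."""
    if not M:
        return 1
    return sum(
        (-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
        for j in range(len(M))
    )

def is_totally_unimodular(M):
    """Check that every square submatrix of M has determinant in {-1, 0, 1}."""
    n_rows, n_cols = len(M), len(M[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[M[r][c] for c in cols] for r in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True

# Vertex-edge incidence matrix of a bipartite 4-cycle (rows: u1, u2, v1, v2).
bipartite = [[1, 1, 0, 0],
             [0, 0, 1, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1]]
# Vertex-edge incidence matrix of a triangle, which is not bipartite.
triangle = [[1, 1, 0],
            [0, 1, 1],
            [1, 0, 1]]

print(is_totally_unimodular(bipartite), is_totally_unimodular(triangle))
```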
17 Overview
- Binary Rank-One Matrix Approximation -> (reformulation) -> Maximum Weight Problem (MWP)
- MWP -> (reformulation) -> Integer Linear Programming (ILP1)
- ILP1 -> (error-bounded approximation) -> Integer Linear Programming (ILP2)
- ILP2 -> (LP relaxation) -> Linear Programming Relaxation of ILP2
- ILP2 -> (reformulation) -> minimum s-t cut problem
18 Generalized Independent Set Problem
- Generalized Independent Set Problem (GIS)
- An undirected graph G(V,E),
- A nonnegative weight w(v) for each vertex v in V,
- A nonnegative penalty p(e) for each edge e in E.
- GIS problem: find a vertex subset S in V that maximizes the total weight of S minus the total penalty of the edges with both endpoints in S.
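A small sketch of the GIS objective, using the usual definition (total weight of the chosen vertices minus the penalties of edges whose two endpoints are both chosen) and an exhaustive search. The four-vertex instance and the function names are made up for illustration, and the enumeration is exponential, so it only serves to make the definition concrete.

```python
from itertools import combinations

def gis_value(S, weight, penalty):
    """Weight of the chosen vertices minus penalties of edges inside the chosen set."""
    gain = sum(weight[v] for v in S)
    loss = sum(p for (u, v), p in penalty.items() if u in S and v in S)
    return gain - loss

def brute_force_gis(weight, penalty):
    """Try every vertex subset; the empty set (value 0) is the baseline."""
    vertices = sorted(weight)
    best_value, best_set = 0, frozenset()
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            value = gis_value(set(subset), weight, penalty)
            if value > best_value:
                best_value, best_set = value, frozenset(subset)
    return best_value, best_set

# Hypothetical instance: vertex weights and edge penalties.
weight = {"a": 3, "b": 2, "c": 2, "d": 4}
penalty = {("a", "c"): 1, ("a", "d"): 5, ("b", "d"): 1}

best_value, best_set = brute_force_gis(weight, penalty)
print(best_value, sorted(best_set))
```

With all penalties set very large, no edge can be afforded inside S and the problem reduces to maximum weight independent set, which is where the name comes from.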
19 Transform ILP2 into a GIS Problem
- ILP2 defines an instance of GIS, and the
corresponding graph is bipartite.
20 Efficient Approximation
- GIS is NP-hard for general graphs.
- However, it can be solved in polynomial time for bipartite graphs.
- GIS for bipartite graphs can be solved by solving minimum s-t cuts / maximum flows.
- Hochbaum, D. S., Pathria, A. Forest harvesting and minimum cuts: a new approach to handling spatial constraints. Forest Science, 43, 544-554, 1997.
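One way the bipartite case can be reduced to a minimum s-t cut is sketched below, as an illustration rather than the authors' implementation. The construction assumed here connects the source to each left vertex with capacity w(v), each right vertex to the sink with capacity w(v), and each left-right edge with capacity p(e); a cut then pays for every excluded vertex and every edge kept inside the chosen set, so the GIS optimum is the total weight minus the minimum cut. The maximum flow routine is a plain Edmonds-Karp, and the instance and all names are hypothetical.

```python
from collections import deque

def max_flow(cap, source, sink):
    """Edmonds-Karp. Returns (flow value, nodes reachable from source in the residual graph)."""
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)
        # Find the bottleneck along the augmenting path, then push flow along it.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

def solve_bipartite_gis(left_w, right_w, penalty):
    """GIS on a bipartite graph via min cut; assumes left and right vertex names are distinct."""
    cap = {"source": {}, "sink": {}}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)          # reverse residual edge

    for u, w in left_w.items():
        add_edge("source", ("L", u), w)
    for v, w in right_w.items():
        add_edge(("R", v), "sink", w)
    for (u, v), p in penalty.items():
        add_edge(("L", u), ("R", v), p)

    cut, reachable = max_flow(cap, "source", "sink")
    # Left vertices on the source side and right vertices on the sink side are chosen.
    chosen = {u for u in left_w if ("L", u) in reachable}
    chosen |= {v for v in right_w if ("R", v) not in reachable}
    total = sum(left_w.values()) + sum(right_w.values())
    return total - cut, chosen

value, chosen = solve_bipartite_gis(
    left_w={"a": 3, "b": 2},
    right_w={"c": 2, "d": 4},
    penalty={("a", "c"): 1, ("a", "d"): 5, ("b", "d"): 1},
)
print(value, sorted(chosen))
```

Any maximum flow routine could replace Edmonds-Karp here; it was chosen only because it is short enough to read in one block.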
21 Experimental Evaluation: Error Bound
- We present results by the minimum s-t cut (P1),
the improvement by iterative updating (P2), and
theoretical upper bounds.
23 Experimental Evaluation: Running Time
- One dimension is fixed at 1000.
24 Conclusion
- Binary Rank-One Matrix Approximation -> (reformulation) -> Maximum Weight Problem (MWP)
- MWP -> (reformulation) -> Integer Linear Programming (ILP1)
- ILP1 -> (error-bounded approximation) -> Integer Linear Programming (ILP2)
- ILP2 -> (LP relaxation) -> Linear Programming Relaxation of ILP2
- ILP2 -> (reformulation) -> minimum s-t cut problem
25 Thank you!