Title: Multilinear Algebra for Analyzing Data with Multiple Linkages
1. Multilinear Algebra for Analyzing Data with Multiple Linkages
- Tamara G. Kolda
- plus Brett Bader, Danny Dunlavy, Philip Kegelmeyer
- Sandia National Labs
- TRICAP 2006, Chania, Greece, June 4-9, 2006
2. Linear Algebra for Data with Linkages
Circle-Square Matrix
SVD Rank-k Approximation (k = 2)
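The rank-k approximation named on this slide can be sketched with a truncated SVD. This is a minimal illustration only: the small incidence matrix below is a hypothetical stand-in for the circle-square matrix (the slide's actual matrix is not preserved), and `rank_k_approx` is an illustrative helper name.

```python
import numpy as np

def rank_k_approx(X, k):
    """Return the rank-k truncated-SVD approximation of X and all singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xk = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return Xk, s

# Hypothetical incidence matrix: rows are "circle" nodes, columns are "square" nodes.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

X2, s = rank_k_approx(X, k=2)

# Frobenius-norm error of the approximation; by the Eckart-Young theorem it
# equals the root-sum-of-squares of the discarded singular values.
err = np.linalg.norm(X - X2)
```

The truncated SVD gives the best rank-k approximation in the Frobenius norm, which is the property the later slides ask about for tensors.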
3. Latent Semantic Indexing (LSI) for Text Retrieval
Term-Document Matrix
SMART Retrieval System: G. Salton (1971)
LSI: S. Dumais et al. (1988)
Query: Car Service
- S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In CHI '88, pp. 281-285, 1988.
- S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci., 41(6):391-407, 1990.
- M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37(4):573-595, 1995.
4. Applications of LSI
Graph the Results using U2 and V2
[Figure: terms (car, service, military, repair) and documents (d1, d2, d3); panels labeled Term-Document Similarities, Term-Term, and Document-Document]
5. Caveats for LSI
- Term-document matrix weighting is critical!
Local weight: log of f_ij (frequency)
Global term weight: inverse document frequency (N = total docs, n_i = docs with term i)
Normalization factor: cosine
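The three weighting ingredients above (log local weight, inverse document frequency, cosine normalization) can be combined as in the sketch below. The exact formulas from the slide are not preserved, so the variants used here, log(1 + f_ij) and log(N / n_i), are assumptions chosen because they are common; the count matrix `F` and the function name are hypothetical.

```python
import numpy as np

def log_idf_cosine(F):
    """Weight a term-by-document count matrix F (terms in rows, docs in columns)."""
    N = F.shape[1]                          # total number of documents
    n_i = np.count_nonzero(F, axis=1)       # number of documents containing term i
    local = np.log1p(F)                     # local weight: log(1 + f_ij)  (assumed variant)
    idf = np.log(N / np.maximum(n_i, 1))    # global weight: log(N / n_i)  (assumed variant)
    W = local * idf[:, None]
    norms = np.linalg.norm(W, axis=0)       # cosine normalization: unit-length columns
    norms[norms == 0] = 1.0                 # leave all-zero documents untouched
    return W / norms

# Hypothetical counts for terms (car, service, military, repair) in docs d1..d3.
F = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 0, 3],
              [0, 1, 1]], dtype=float)

W = log_idf_cosine(F)
column_norms = np.linalg.norm(W, axis=0)
```

Note that under this IDF variant a term appearing in every document receives weight zero, which is one reason the choice of weighting matters so much.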
6. Citation/Link Analysis (Same Nodes)
Link Matrix
Hub Scores
Doc 1 is the most important hub!
Co-Citation Matrix
Authority Scores
Examples: citation data, web links
Doc 3 is the most important authority!
Co-Reference Matrix
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604-632, 1999.
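Hub and authority scores in the sense of Kleinberg's HITS can be read off the leading singular vectors of the link matrix. The sketch below uses a hypothetical four-document link matrix (the slide's own matrix is not preserved), constructed so that, as on the slide, document 1 is the top hub and document 3 is the top authority.

```python
import numpy as np

# Hypothetical link matrix: X[i, j] = 1 if document i+1 links to (cites) document j+1.
X = np.array([[0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X)

# Leading left singular vector  = principal eigenvector of X X^T  -> hub scores.
# Leading right singular vector = principal eigenvector of X^T X  -> authority scores.
# The sign of a singular vector is arbitrary, so take absolute values.
hubs = np.abs(U[:, 0])
authorities = np.abs(Vt[0, :])

top_hub = int(np.argmax(hubs)) + 1              # 1-based document number
top_authority = int(np.argmax(authorities)) + 1
```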
7. Multiple Links?
Suppose the connections between nodes are
labeled in some fashion. In other words, we
have meta-data on the connections. Can we
somehow use multilinear algebra for link analysis?
8. PARAFAC
- PARAFAC: Parallel Factors
- a.k.a. CANDECOMP: Canonical Decomposition
- Higher-order analogue of the SVD
- Columns of A, B, and C are not orthonormal
- If R is minimal, then R is called the rank of the tensor (Kruskal 1977)
- Can have rank(X) > min{I, J, K}
- Often guaranteed to be a unique rank decomposition!
[Figure: X (I x J x K) written as a core I (R x R x R) with factor matrices A (I x R), B (J x R), and C (K x R)]
- R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1-84, 1970.
- J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35:283-319, 1970.
9. Many ways to write PARAFAC
Kruskal Operator
Easy to write N-way case
J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95-138, 1977.
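One way to see why the N-way case is easy to write: the Kruskal operator is just a sum of R outer products, one column from each factor matrix. The sketch below is an illustrative NumPy version (the function name `kruskal` and the random factors are assumptions, not the authors' MATLAB code).

```python
import numpy as np

def kruskal(*factors):
    """Sum of R rank-one tensors; factor n has shape (I_n, R)."""
    rank = factors[0].shape[1]
    shape = tuple(U.shape[0] for U in factors)
    X = np.zeros(shape)
    for r in range(rank):
        term = factors[0][:, r]
        for U in factors[1:]:
            term = np.multiply.outer(term, U[:, r])   # grow the outer product mode by mode
        X += term
    return X

rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((n, 3)) for n in (4, 5, 6, 2))

X3 = kruskal(A, B, C)       # three-way PARAFAC model: x_ijk = sum_r a_ir b_jr c_kr
X4 = kruskal(A, B, C, D)    # the same operator with a fourth mode
```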
10. Properties of the Kruskal Operator
PARAFAC core for a Tucker decomposition
Matricize (arbitrary map of indices to rows and
columns)
Mode-n matricize
Norm of a PARAFAC decomposition
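Two of these properties can be checked numerically. The slide's formulas are not preserved, so the sketch below uses the standard identities as an assumption: the mode-1 matricization of [[A, B, C]] equals A times the transposed Khatri-Rao product of C and B, and the squared norm equals the sum of the entries of the elementwise product of the Gram matrices.

```python
import numpy as np

def khatri_rao(C, B):
    """Columnwise Kronecker product; row k*J + j holds C[k, :] * B[j, :]."""
    K, R = C.shape
    J = B.shape[0]
    return np.einsum('kr,jr->kjr', C, B).reshape(K * J, R)

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 5, 2
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
X = np.einsum('ir,jr,kr->ijk', A, B, C)      # dense tensor [[A, B, C]]

# Mode-1 matricization: column index is j + k*J (column-major ordering of j, k).
X1 = X.reshape(I, J * K, order='F')
X1_from_factors = A @ khatri_rao(C, B).T

# Norm computed from the factors only, without forming the dense tensor.
norm_sq_from_factors = np.sum((A.T @ A) * (B.T @ B) * (C.T @ C))
```

The second identity is what makes norms cheap for large problems: it costs only R x R Gram matrices rather than a pass over I x J x K entries.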
11. PARAFAC for sparse data approximations
- Our interest in the mathematical operations is motivated on two fronts
- (1) Sparse computations
- (2) Using tensor decompositions for approximation
- Ex.: Considering how to efficiently implement PARAFAC-ALS for sparse data
- Can PARAFAC be used for the best rank-k approximation, rather than finding an exact decomposition (excepting noise)?
- What does it even mean in this case??
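The dominant cost in PARAFAC-ALS is the product of a matricized tensor with a Khatri-Rao product. For sparse data this can be computed from the nonzeros alone, as in the sketch below. The coordinate-list storage (`subs`, `vals`) and the function name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sparse_mttkrp_mode0(subs, vals, n_rows, B, C):
    """Compute X_(1) (C kr B) using only the nonzeros of a 3-way tensor.

    subs: (nnz, 3) integer indices (i, j, k); vals: (nnz,) nonzero values.
    """
    M = np.zeros((n_rows, B.shape[1]))
    # Each nonzero x_ijk contributes x_ijk * (B[j, :] * C[k, :]) to row i.
    contrib = vals[:, None] * B[subs[:, 1]] * C[subs[:, 2]]
    np.add.at(M, subs[:, 0], contrib)     # unbuffered add handles repeated row indices
    return M

# Hypothetical sparse 4 x 3 x 3 tensor with four nonzeros.
subs = np.array([[0, 1, 2],
                 [3, 0, 1],
                 [3, 2, 2],
                 [1, 1, 0]])
vals = np.array([1.0, 2.0, -1.0, 0.5])
shape = (4, 3, 3)

rng = np.random.default_rng(0)
B = rng.standard_normal((shape[1], 2))
C = rng.standard_normal((shape[2], 2))
M = sparse_mttkrp_mode0(subs, vals, shape[0], B, C)
```

The work is proportional to the number of nonzeros times R, and no dense I x J x K or JK x R intermediate is formed.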
12. Multilink Analysis using PARAFAC
- Quick review: tensors for web link analysis
- page x page x anchor text (TOPHITS)
- New work: tensors for publication data analysis
- Case 1: doc x doc x similarity
- Case 2: term x doc x author (HO-LSA??)
13. TOPHITS: PARAFAC for Web Link Analysis
Graph representation shows basic connectivity
A set of four hyperlinked web pages
Labeled edges capture context
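A page x page x anchor text tensor can be assembled directly from labeled edges. The four pages and anchor terms below are hypothetical (the slide's example graph is not preserved); the sketch only shows how labeled edges add a third mode to the usual adjacency matrix.

```python
import numpy as np

# Hypothetical labeled hyperlinks: (source page, destination page, anchor term).
edges = [
    ("page1", "page2", "news"),
    ("page1", "page3", "sports"),
    ("page2", "page3", "sports"),
    ("page3", "page4", "news"),
    ("page4", "page1", "weather"),
]

pages = sorted({p for src, dst, _ in edges for p in (src, dst)})
terms = sorted({term for _, _, term in edges})
page_idx = {p: i for i, p in enumerate(pages)}
term_idx = {t: k for k, t in enumerate(terms)}

# X[i, j, k] counts links from page i to page j whose anchor text contains term k.
X = np.zeros((len(pages), len(pages), len(terms)))
for src, dst, term in edges:
    X[page_idx[src], page_idx[dst], term_idx[term]] += 1.0

# Summing over the anchor-text mode recovers the plain connectivity graph.
adjacency = X.sum(axis=2)
```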
14. Analyzing Publication Data: Doc x Doc x Similarity Representation
15. Computing Different Doc-Doc Similarities
Computing term-based similarities (k = 1, 2, 3)
- 5022 papers
- 16617 unique terms (ignoring stop words, words with length less than 3 or greater than 30 characters, and words that appear fewer than 2 times)
- Titles: 5164
- Abstracts: 15752
- Keywords: 5248
- 6891 authors
- 2659 citations
Enforces sparseness!
Computing author similarities (k = 4)
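The similarity formulas from this slide are not preserved. As one plausible reading of "enforces sparseness", the sketch below computes cosine similarities between documents and zeroes every value below a threshold. Both the use of cosine similarity and the threshold `tau` are assumptions made for illustration.

```python
import numpy as np

def thresholded_cosine(T, tau):
    """Doc-doc cosine similarity from a term-by-doc matrix T; entries below tau become 0."""
    norms = np.linalg.norm(T, axis=0)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty documents
    Tn = T / norms
    S = Tn.T @ Tn
    S[S < tau] = 0.0                 # thresholding keeps the similarity slice sparse
    return S

# Hypothetical 3-term x 3-document matrix (e.g. built from titles only).
T = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)

S = thresholded_cosine(T, tau=0.8)
```

Each similarity type (titles, abstracts, keywords, authors, citations) would fill one frontal slice of the doc x doc x similarity tensor.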
16. PARAFAC for Doc x Doc x Similarity
Central idea: each triplet provides a core grouping of the data, i.e., a specific topic.
- H: hubs
- A: authorities
- C: connections
- Rank-30 decomposition
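A dense, small-scale alternating least squares (ALS) loop illustrates how the H, A, C triplets are obtained. This is a textbook sketch under stated assumptions, not the authors' sparse implementation: the toy tensor has two planted topic groups and rank 2 rather than 30, and there is no normalization or convergence test.

```python
import numpy as np

def khatri_rao(U, V):
    """Columnwise Kronecker product; row u*V.shape[0] + v holds U[u, :] * V[v, :]."""
    return np.einsum('ur,vr->uvr', U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    """Mode-n unfolding with the remaining indices in row-major order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in X.shape]
    for _ in range(n_iter):
        for n in range(3):
            U, V = [factors[m] for m in range(3) if m != n]
            gram = (U.T @ U) * (V.T @ V)     # equals khatri_rao(U, V).T @ khatri_rao(U, V)
            factors[n] = unfold(X, n) @ khatri_rao(U, V) @ np.linalg.pinv(gram)
    return factors

# Toy doc x doc x similarity tensor: docs 0-2 form one topic, docs 3-5 another.
h_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
a_true = np.array([[1, 0], [0.5, 0], [1, 0], [0, 1], [0, 2], [0, 1]], dtype=float)
c_true = np.array([[1.0, 0.2], [0.5, 1.0]])    # two connection types
X = np.einsum('ir,jr,kr->ijk', h_true, a_true, c_true)

H, A, C = cp_als(X, rank=2)
X_hat = np.einsum('ir,jr,kr->ijk', H, A, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Each column triplet (H[:, r], A[:, r], C[:, r]) is one grouping: the largest entries of H and A pick out the documents, and C shows which connection types bind them.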
17. Sample Grouping 1
18. Sample Grouping 10
19. Applications of the H, A, C Decomposition
- Latent document similarities
- Calculate $S = \tfrac{1}{2} H H^T + \tfrac{1}{2} A A^T$
- Analyzing a body of work
- $c_h$ = hub centroid, $c_a$ = authority centroid
- $s = \tfrac{1}{2} H c_h + \tfrac{1}{2} A c_a$
- Disambiguation (EXAMPLE)
- Calculate centroids using A (could also use H or AH)
- Calculate similarities of centroids
- Journal prediction
- Use matrix A as features for input to a decision tree ensemble classifier
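The first two applications reduce to a few matrix products. In the sketch below, H and A are random stand-ins for the hub and authority factors, the body of work is a hypothetical set of document indices, and each centroid is assumed to be the mean of the corresponding factor rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, rank = 6, 3
H = rng.random((n_docs, rank))    # stand-in hub factors, one row per document
A = rng.random((n_docs, rank))    # stand-in authority factors

# Latent document similarities: S = 1/2 H H^T + 1/2 A A^T.
S = 0.5 * H @ H.T + 0.5 * A @ A.T

# Analyzing a body of work: score every document against the centroids
# of a chosen set of documents (hypothetical indices).
body = [0, 2, 5]
c_h = H[body].mean(axis=0)        # hub centroid (assumed: mean of rows)
c_a = A[body].mean(axis=0)        # authority centroid
s = 0.5 * H @ c_h + 0.5 * A @ c_a

most_similar = np.argsort(-s)     # documents ranked by similarity to the body of work
```

With mean centroids, `s` is exactly the average of the columns of `S` belonging to the body of work, so the two formulas are consistent.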
20. Example of Disambiguation Results
Two authors with missing middle initials.
3 possible matches
Matrix of Similarities
21. Analyzing Publication Data: Term x Doc x Author Representation
Terms must appear in at least 3 documents and no more than 10% of all documents. Moreover, each term must have at least 2 characters and no more than 30.
767 documents, 2251 terms, 1072 authors, 59738 nonzeros
Element (i,j,k) is nonzero only if author k wrote document j using term i.
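The nonzero rule above can be written as an outer combination of a term-document matrix and a document-author matrix. The small matrices below are hypothetical, and binary values are an assumption (the slide does not say how nonzero entries are weighted).

```python
import numpy as np

# Hypothetical data: 2 terms, 3 documents, 2 authors.
term_doc = np.array([[1, 0, 1],      # term 0 appears in docs 0 and 2
                     [0, 1, 1]])     # term 1 appears in docs 1 and 2
doc_author = np.array([[1, 0],       # doc 0 written by author 0
                       [0, 1],       # doc 1 written by author 1
                       [1, 1]])      # doc 2 written by both authors

# X[i, j, k] = term_doc[i, j] * doc_author[j, k]:
# nonzero only if author k wrote document j and document j uses term i.
X = np.einsum('ij,jk->ijk', term_doc, doc_author)

nnz = np.count_nonzero(X)
```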
22. Different Graph Interpretations for Term x Doc x Author
- term-doc with author links
- term-author with doc links
- author-doc with term links
- term-doc-author with links
[Figure: term-doc graph; different author links are represented by different colors]
23. Author Data is Too Sparse
Result: the tensor has just a few nonzero columns in each lateral slice.
[Figure: tensor with axes term, author, doc]
Experimentally, PARAFAC seems to overfit such data and not do a good job of mixing different authors.
24. Idea: Use Tucker Transformation to Compress
We transform the tensor to a smaller tensor as follows [equation], or, equivalently, [equation].
This transformation forces the authors to be mixed and produces a dense result.
Main problem: how to transform a sparse tensor without creating dense intermediate results?
Compute rank-25 PARAFAC on compressed tensor and transform.
25. Tucker + PARAFAC
- Want PARAFAC for X in term x doc x author space
- First, apply dimensionality reduction to X to obtain Y
- Y in conceptual space
- Next, compute PARAFAC on Y
- Finally, reassemble results to yield PARAFAC for X
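The compress-then-reassemble steps can be sketched as follows. The slide's transformation formulas are not preserved, so this example assumes orthonormal bases U, V, W taken from the leading singular vectors of each unfolding, and it uses random factors as a stand-in for the PARAFAC of Y. It checks only the reassembly identity: expanding [[A, B, C]] by U, V, W gives [[UA, VB, WC]].

```python
import numpy as np

def mode_mult(X, M, mode):
    """Mode-n product: multiply every mode-n fiber of X by the matrix M."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

def leading_basis(X, mode, k):
    """Leading k left singular vectors of the mode-n unfolding (assumed choice of basis)."""
    unfolded = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U[:, :k]

def kruskal3(A, B, C):
    return np.einsum('ir,jr,kr->ijk', A, B, C)

rng = np.random.default_rng(0)
X = rng.random((6, 5, 4))                      # stand-in term x doc x author tensor

# Step 1: dimensionality reduction, Y = X x1 U^T x2 V^T x3 W^T.
U, V, W = (leading_basis(X, m, 3) for m in range(3))
Y = mode_mult(mode_mult(mode_mult(X, U.T, 0), V.T, 1), W.T, 2)

# Step 2: PARAFAC on Y (random stand-in factors here instead of an actual fit).
A, B, C = (rng.standard_normal((3, 2)) for _ in range(3))

# Step 3: reassemble a PARAFAC model in the original space.
lifted = kruskal3(U @ A, V @ B, W @ C)
expanded = mode_mult(mode_mult(mode_mult(kruskal3(A, B, C), U, 0), V, 1), W, 2)
```

This sketch forms Y densely, which is acceptable at toy size; avoiding dense intermediates for a large sparse X is exactly the open problem the previous slide raises.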
26. Three-Way Fingerprints
- Each of the Terms, Docs, and Authors has a rank-k (k = 25) fingerprint from the PARAFAC approximation
- All items can be directly compared in concept space
- Thus, we can compare any of the following
- Term-Term
- Doc-Doc
- Term-Doc
- Author-Author
- Author-Term
- Author-Doc
- The fingerprints can be used as inputs for clustering, classification, etc.
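Because every item lives in the same k-dimensional concept space, cross-type comparisons are ordinary vector comparisons. The sketch below uses cosine similarity, which is an assumption (the slides do not name the measure), and random matrices as stand-ins for the fingerprint rows.

```python
import numpy as np

def cosine_matrix(P, Q):
    """All-pairs cosine similarity between the rows of P and the rows of Q."""
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return Pn @ Qn.T

rng = np.random.default_rng(1)
k = 25
term_fp = rng.standard_normal((8, k))      # stand-in term fingerprints
doc_fp = rng.standard_normal((5, k))       # stand-in doc fingerprints
author_fp = rng.standard_normal((3, k))    # stand-in author fingerprints

author_doc = cosine_matrix(author_fp, doc_fp)     # Author-Doc comparison
author_term = cosine_matrix(author_fp, term_fp)   # Author-Term comparison
term_term = cosine_matrix(term_fp, term_fp)       # Term-Term comparison

# Documents ranked by similarity to author 0.
ranked_docs = np.argsort(-author_doc[0])
```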
27. Sample Results: Term
28. Sample Results: Term
29. Sample Results: Author
30. Summary & Future Work
- PARAFAC provides a technique for analyzing semantic graphs
- Third dimension captures different connection types
- Or may consider it as the interconnection of 3 different node types
- Analyzed journal articles using different tensor representations
- Doc x Doc x Connection
- Need to make definitive case of why 3D is better than 2D
- Term x Doc x Author
- Too sparse?
- Still working towards large-scale, sparse problems
- Need implicit compression for PARAFAC
- 5M nonzeros
- Other decompositions?
- Other hybrids
- Symmetry
31. Acknowledgments & More Information
Thank You!
- Thanks to
- Brett Bader, Danny Dunlavy, Philip Kegelmeyer
- Web data: Joe Kenny, Travis Bauer et al., Ken Kolda
- Journal data: Kevin Boyack
- Graph viz: Ann Yoshimura
- Related papers
- Algorithm xxx: MATLAB Tensor Classes for Fast Algorithm Prototyping (with B. W. Bader), ACM TOMS, to appear.
- Multilinear algebra for analyzing data with multiple linkages (with D. Dunlavy and W. P. Kegelmeyer), Technical Report SAND2006-2079, Apr. 2006.
- Temporal analysis of social networks using three-way DEDICOM (with B. W. Bader and R. Harshman), Technical Report SAND2006-2161, Apr. 2006.
- Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Apr. 2006.
- The TOPHITS model for higher-order web link analysis (with B. Bader), in Proc. Workshop on Link Analysis, Counterterrorism and Security, SDM06, Apr. 2006.
- Higher-order web link analysis using multilinear algebra (with B. W. Bader), ICDM 2005, pp. 242-249, Nov. 2005.
- Contact Info
- tgkolda_at_sandia.gov
- http://csmr.ca.sandia.gov/tgkolda/