Title: Linear Algebra and Terrorist Threats:
1Linear Algebra and Terrorist Threats
- Finding Relationships in Large Sets of Text
Catherine Crawford, October 31, 2007, Elmhurst College
2Acknowledgements
- Relationship Discovery in Large Text Collections Using Latent Semantic Indexing
- R. B. Bradford, SAIC
- 2006 SIAM Conference on Data Mining Workshop on Link Analysis, Counterterrorism and Security
- William M. Pottenger, Ph.D.
- DyDAn Center, Rutgers University
- 2007 DIMACS Reconnect Conference on Data Analysis in Law Enforcement and Homeland Security
3Linear Algebra Concepts
- For an $n \times n$ matrix $A$, a nonzero vector $x$ is called an eigenvector of $A$ if
- $Ax = \lambda x$ for some scalar $\lambda$.
- The scalar $\lambda$ is called an eigenvalue.
4Eigenvalues $\lambda_1 = 6$ and $\lambda_2 = -1$, and eigenvectors [example matrix and eigenvectors not recoverable]
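As a minimal check of the eigenvector definition, the sketch below uses NumPy on a hypothetical 2 x 2 matrix chosen so that its eigenvalues are 6 and -1. The matrix is an assumption for illustration and is not necessarily the one shown on the slide.

```python
import numpy as np

# Hypothetical matrix (an assumption, not taken from the slides):
# its characteristic polynomial is (lambda - 6)(lambda + 1).
A = np.array([[1.0, 2.0],
              [5.0, 4.0]])

# Column i of `eigenvectors` pairs with eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eig(A)

# Verify the defining relation A x = lambda x for every pair.
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)

print(sorted(eigenvalues))
```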
5Diagonalization
- An $n \times n$ matrix $A$ is said to be diagonalizable if it is similar to a diagonal matrix, i.e.
- $A = PDP^{-1}$
- for some diagonal matrix $D$ and invertible matrix $P$.
6Diagonalization $A = PDP^{-1}$
- Columns of $P$: eigenvectors ($n$ linearly independent)
- Entries of $D$: eigenvalues
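The factorization can be checked numerically. This is a small sketch on a hypothetical diagonalizable matrix (chosen for illustration, not taken from the slides): the eigenvectors go into the columns of `P`, the eigenvalues onto the diagonal of `D`, and the product recovers `A`.

```python
import numpy as np

# Hypothetical diagonalizable matrix (an assumption for illustration).
A = np.array([[1.0, 2.0],
              [5.0, 4.0]])

eigenvalues, P = np.linalg.eig(A)  # columns of P are eigenvectors of A
D = np.diag(eigenvalues)           # diagonal entries of D are the eigenvalues

# Rebuild A from the factorization A = P D P^{-1}.
A_rebuilt = P @ D @ np.linalg.inv(P)
print(np.allclose(A_rebuilt, A))
```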
7Limitations
- What if $A$ does not have $n$ linearly independent eigenvectors?
- What if $A$ is not square, i.e. $A$ is $m \times n$?
- Not diagonalizable, but
8- If $A$ is $m \times n$, then $A^TA$ is $n \times n$
- $A^TA$ is symmetric
- $A^TA$ can be diagonalized, i.e. $A^TA = PDP^{-1}$
- $D$ is a diagonal matrix with the eigenvalues of $A^TA$ as the diagonal entries
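A short sketch of these properties on an arbitrary non-square matrix (the random matrix and its 4 x 3 shape are assumptions for illustration). It uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, which returns an orthogonal `P`, so that $P^{-1} = P^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # arbitrary m x n matrix with m = 4, n = 3

AtA = A.T @ A                    # n x n
is_symmetric = np.allclose(AtA, AtA.T)

# eigh is intended for symmetric matrices; the columns of P are orthonormal.
eigenvalues, P = np.linalg.eigh(AtA)
D = np.diag(eigenvalues)

# Because P is orthogonal, P^{-1} = P^T, so A^T A = P D P^T.
print(AtA.shape, is_symmetric, np.allclose(P @ D @ P.T, AtA))
```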
9Singular Values
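The matrix $A^TA$ has eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$.
The singular values of the $m \times n$ matrix $A$ are given by $\sigma_i = \sqrt{\lambda_i}$ for $i = 1, \dots, n$.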
10Singular Value Decomposition
- Any $m \times n$ matrix $A$ can be factored as
- $A = USV^T$
- where $U$ is an $m \times m$ orthogonal matrix,
- $V$ is an $n \times n$ orthogonal matrix, and
- $S$ is an $m \times n$ matrix of the form
11r is the rank of A
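$$S = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}$$
- The zero blocks fill the last $m - r$ rows and the last $n - r$ columns.
- $D$ is an $r \times r$ diagonal matrix with the first $r$ singular values of $A$ along the diagonal.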
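12Eigenvalues of $A^TA$ are $\lambda_1 = 360$, $\lambda_2 = 90$, and $\lambda_3 = 0$.
So the singular values of $A$ are $\sigma_1 = \sqrt{360} = 6\sqrt{10}$, $\sigma_2 = \sqrt{90} = 3\sqrt{10}$, and $\sigma_3 = 0$.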
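The step from eigenvalues to singular values can be sketched numerically. The 2 x 3 matrix below is an assumption: it is one matrix whose $A^TA$ has eigenvalues 360, 90, and 0, and it may or may not be the matrix used on the slide.

```python
import numpy as np

# Assumed example matrix: A^T A has eigenvalues 360, 90 and 0.
A = np.array([[4.0, 11.0, 14.0],
              [8.0,  7.0, -2.0]])

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order;
# reverse them so that lambda_1 >= lambda_2 >= lambda_3.
eigenvalues = np.linalg.eigvalsh(A.T @ A)[::-1]

# Clip tiny negative round-off before taking square roots.
eigenvalues = np.clip(eigenvalues, 0.0, None)
singular_values = np.sqrt(eigenvalues)

print(eigenvalues)
print(singular_values)
```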
13$A = USV^T$
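A hedged sketch of the full factorization using `np.linalg.svd`. The example matrix is an assumption chosen for illustration. NumPy returns the singular values as a vector, so the $m \times n$ matrix $S$ has to be assembled by hand.

```python
import numpy as np

# Assumed 2 x 3 example matrix (not confirmed to be the slide's matrix).
A = np.array([[4.0, 11.0, 14.0],
              [8.0,  7.0, -2.0]])
m, n = A.shape

# full_matrices=True gives U as m x m and Vt (that is, V^T) as n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of an m x n zero matrix.
S = np.zeros((m, n))
S[:len(s), :len(s)] = np.diag(s)

A_rebuilt = U @ S @ Vt
print(np.allclose(A_rebuilt, A))
```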
14Information Retrieval
- Given a set of documents and a query
- (a collection of terms)
- Return the documents ranked by their similarity
to the query
15Search Engines
- Google, Ask Jeeves, etc.
- Term Matching
- Type in words (query) and it returns webpages (documents) that contain those words
- Focus on one document at a time
- e.g. "movies in 2006" returns sites that list movies from 2006
- But what if I want to know the common theme(s), if any, of popular movies in 2006?
16Common Themes Movies 2006
- Sites containing the term "common themes", often in a review of a 2006 movie
- Results of someone else having researched the question and posted their comments (if you are lucky)
17We want
- Computer to search several documents, infer the answer, and return it to us.
- (Finding relationships in large sets of text)
- → Latent Semantic Indexing
18Latent: Use information that might not be obvious
Semantic: Focus on context and meaning rather than just matching
Indexing: Organize data to provide efficient search and retrieval
19Latent Semantic Indexing (LSI)
- Steps
- Construct the Term-Document Matrix $A$
- Factor $A$ into its Singular Value Decomposition (SVD), i.e. $A = USV^T$
- Reduce the dimensions of the matrices
- Compute the final LSI Space
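The four steps can be sketched end to end on a toy corpus. Everything concrete here is an assumption made for illustration: the four short documents, the use of raw term counts as matrix entries, and the choice k = 2. None of these reflect the settings of the actual study.

```python
import numpy as np

# Hypothetical toy corpus; each string is one document.
documents = [
    "matrix eigenvalue eigenvector",
    "matrix singular value decomposition",
    "movie review theme",
    "movie theme actor",
]

# Step 1: term-document matrix A, where A[i, j] counts term i in document j.
terms = sorted({word for doc in documents for word in doc.split()})
term_index = {term: i for i, term in enumerate(terms)}
A = np.zeros((len(terms), len(documents)))
for j, doc in enumerate(documents):
    for word in doc.split():
        A[term_index[word], j] += 1

# Step 2: singular value decomposition A = U S V^T (economy-size form).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep only the k largest singular values and matching vectors.
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Step 4: the reduced space. Rows of Uk represent terms, columns of Vtk
# represent documents, and Ak is the rank-k approximation of A.
Ak = Uk @ Sk @ Vtk
print(Ak.shape, Vtk.shape)
```

In this toy corpus the first two documents share the term "matrix" and the last two share "movie" and "theme", so the two retained dimensions roughly separate the two topics.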
20Latent Semantic Indexing
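[Figure: the term-document matrix $A$ (columns indexed by documents) factored as $U \times S \times V^T$, then reduced to $U_k \times S_k \times V_k^T$]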
See Deerwester et al., Indexing by Latent
Semantic Analysis, Journal of the American
Society for Information Science, 41(6), pp.
391-407, October, 1990.
21Interpretation of SVD
$A = USV^T$
Concept: inherent, latent, underlying information
22LSI and Terrorist Threats
- Database
- 158,492 Documents
- English-language News Articles from 2002-2003
23Entity Extraction (Preprocessing)
[Figure: example results of entity extraction for a document]
24Entity Extraction Results
- 334,557 Unique Entities
- 126,372 Persons
- 37,706 Locations
- 170,479 Organizations
- 101,533 Unique Entities Occurring More than Once
25LSI Indexing
- 332,386 Indexed Objects
- 158,492 Documents
- 230,853 Individual Terms
- 101,533 Entity Names
26Example
[Figure: example Term-Document Matrix $A$ with query vector $q$]
Query: "Is GSPC planning an attack on the cathedral in Strasbourg?"
27Latent Semantic Indexing
Reduced Dimension Term-Document Matrix $A_k = U_k S_k V_k^T$
See Deerwester et al., Indexing by Latent
Semantic Analysis, Journal of the American
Society for Information Science, 41(6), pp.
391-407, October, 1990.
28Reduced Dimensions
- Query Vector
- Document Vector
Reduced-dimension $j$th document vector = $j$th column of $V_k^T$
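The query-vector formula from this slide did not survive extraction. One common choice in the LSI literature, assumed here rather than confirmed by the slides, is to fold the query into the reduced space as $q_k = S_k^{-1} U_k^T q$. The sketch below shows why this is consistent with the document vectors: folding in column $j$ of $A$ returns exactly column $j$ of $V_k^T$. The matrix and the value of k are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 6-term x 5-document count matrix.
A = rng.integers(0, 3, size=(6, 5)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def fold_in(q):
    """Map a term-space vector q to the k-dimensional space: S_k^{-1} U_k^T q."""
    return (Uk.T @ q) / sk

j = 3
document_vector = Vtk[:, j]      # jth column of V_k^T
folded_column = fold_in(A[:, j]) # jth document treated as if it were a query

print(np.allclose(folded_column, document_vector))
```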
29Representation Vectors
[Figure: LSI Representation Space containing entity, document, and term vectors]
Cosine between vectors is a measure of similarity
30Similarity Measure
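The formula on this slide was not preserved. Since the deck uses the cosine between representation vectors as its measure of similarity, the standard cosine formula is a reasonable reconstruction. The symbols $q_k$ (reduced query vector) and $d_j$ (reduced $j$th document vector) are labels assumed here.

```latex
\mathrm{sim}(q_k, d_j) = \cos\theta_j
  = \frac{q_k \cdot d_j}{\lVert q_k \rVert \, \lVert d_j \rVert}
```

The value lies between -1 and 1, and a value closer to 1 means the two vectors point in more nearly the same direction.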
31Rankings
- Results of sim( ) rank how similar each document is to the query.
- Document with the highest value is most similar
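A minimal sketch of the ranking step, assuming cosine similarity as the sim( ) function. The query and document vectors are hypothetical two-dimensional values standing in for vectors from the LSI space.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical reduced-dimension vectors (one row per document).
query_vector = np.array([1.0, 0.0])
document_vectors = np.array([
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
])

scores = [cosine(query_vector, d) for d in document_vectors]

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranking)
```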
32Entities of Particular Interest
Groups
Individuals
Targets
Weapons
Activities
33Relationships of Particular Interest
- Group - Group
- Person - Group
- Person - Person
- Group - Target
- Group - Weapon
34Procedure (Cont'd)
Create Matrix of Items to be Compared
35Terrorist Groups vs. Targets
36Review of Procedure
- Pre-Process Text with Entity Extraction Software
- Create LSI Representation Space
  - Construct the Term-Document Matrix $A$
  - Factor $A$ into its SVD, i.e. $A = USV^T$
  - Reduce the dimensions of the matrices
  - Compute the final LSI Space
- Create Matrix of Items to Be Compared
- Compute Cosines between Pairs of Representation Vectors
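The last two steps can be sketched as a single matrix of pairwise cosines, for example groups against targets. The vectors below are random placeholders, assumed purely for illustration. In practice they would be the representation vectors of the extracted entities in the LSI space.

```python
import numpy as np

def cosine_matrix(X, Y):
    """Return M where M[i, j] is the cosine between row i of X and row j of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

rng = np.random.default_rng(2)
# Placeholder representation vectors in a 4-dimensional LSI space.
group_vectors = rng.standard_normal((3, 4))   # 3 hypothetical groups
target_vectors = rng.standard_normal((5, 4))  # 5 hypothetical targets

# Rows correspond to groups, columns to targets.
similarity = cosine_matrix(group_vectors, target_vectors)
print(similarity.shape)
```

Large entries in such a matrix flag group and target pairs whose vectors lie close together in the LSI space, which are candidates for closer review.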
37References
- Overview of the LSI Technique
  - Deerwester et al., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6), October 1990, pp. 391-407.
- Review of the LSI Literature
  - Dumais, S., Latent Semantic Analysis, in Annual Review of Information Science and Technology, Vol. 38, Information Today Inc., Medford, New Jersey, 2004, pp. 189-230.
- Effects of LSI Parameter Choices
  - Dumais, S., Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval, Bellcore Technical Report TM-ARH-017527, 1990.
- LSI Capture of Higher-order Associations
  - Kontostathis, A. and Pottenger, W. M., A Framework for Understanding LSI Performance, Information Processing & Management, Volume 42, Issue 1, January 2006, pp. 56-73.
- Utility of Matrix Decomposition Techniques in Social Network Analysis
  - Skillicorn, D., Social Network Analysis via Matrix Decompositions, in Emergent Information Technologies and Enabling Policies for Counter Terrorism, IEEE-Wiley, 2006.
- Information Retrieval through LSI
  - DIMACS Education Module (High School)
38Questions?