Title: Text Databases
1. Text Databases
2. Outline
- Spatial Databases
- Temporal Databases
- Spatio-temporal Databases
- Data Mining
- Multimedia Databases
- Text databases
- Image and video databases
- Time Series databases
3. Text - Detailed outline
- Text databases
- problem
- full text scanning
- inversion
- signature files (a.k.a. Bloom Filters)
- Vector model and clustering
- information filtering and LSI
4. Vector Space Model and Clustering
- Keyword (free-text) queries (vs. Boolean)
- each document -> vector (HOW?)
- each query -> vector
- search for similar vectors
5. Vector Space Model and Clustering
- main idea: each document is a vector of size d, where d is the number of different terms in the database (= vocabulary size)
[Figure: a document mapped to a d-dimensional vector, with one slot per term: aaron, ..., data, ..., indexing, ..., zoo]
6. Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- A vector is like an array of floating point numbers
- Has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse (a minimal sketch follows)
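A minimal sketch of this bag-of-words representation (the toy corpus and function names are illustrative, not from the slides):

```python
from collections import Counter

docs = {  # toy two-document corpus, made up for illustration
    "A": "nova galaxy heat nova nova",
    "B": "galaxy galaxy nova",
}

# the vocabulary: one vector slot for every term in the collection
vocab = sorted({t for text in docs.values() for t in text.split()})

def tf_vector(text):
    """Raw term-frequency vector over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocab]  # mostly zeros -> sparse

vectors = {name: tf_vector(text) for name, text in docs.items()}
print(vocab)         # ['galaxy', 'heat', 'nova']
print(vectors["A"])  # [1, 1, 3]
```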
7. Document Vectors: one location for each word
       nova  galaxy  heat  h'wood  film  role  diet  fur
  A     10     5      3
  B      5    10
  C                  10      8      7
  D                          9     10     5
  E                                            10    10
  F                                             9    10
  G      5     7      9
  H      6    10                    2     8
  I                          7      5           1     3
Nova occurs 10 times in text A, Galaxy occurs 5 times in text A, Heat occurs 3 times in text A. (Blank means 0 occurrences.)
8. Document Vectors: one location for each word (same table as above)
Hollywood occurs 7 times in text I, Film occurs 5 times in text I, Diet occurs 1 time in text I, Fur occurs 3 times in text I.
9. Document Vectors (same table as above; the row labels A-I are document ids)
10. We Can Plot the Vectors
[Figure: documents plotted against two axes, "Star" and "Diet": a doc about movie stars, a doc about astronomy, and a doc about mammal behavior]
11. Vector Space Model and Clustering
- Then, group nearby vectors together
- Q1: cluster search?
- Q2: cluster generation?
- Two significant contributions:
- ranked output
- relevance feedback
12. Vector Space Model and Clustering
- cluster search: visit the (k) closest superclusters; continue recursively
[Figure: clusters of CS TRs and MD TRs]
13. Vector Space Model and Clustering
[Figure: clusters of CS TRs and MD TRs]
14. Vector Space Model and Clustering
- relevance feedback (brilliant idea) [Rocchio '73]
[Figure: clusters of CS TRs and MD TRs]
15. Vector Space Model and Clustering
- relevance feedback (brilliant idea) [Rocchio '73]
- How?
[Figure: clusters of CS TRs and MD TRs]
16. Vector Space Model and Clustering
- How? A: by adding the "good" vectors and subtracting the "bad" ones (see the sketch below)
[Figure: clusters of CS TRs and MD TRs]
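One common way to write this "add the good, subtract the bad" update is the standard Rocchio form; a hedged sketch (the weights alpha, beta, gamma are conventional defaults, not from the slides):

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant docs and away from non-relevant ones."""
    q = np.asarray(q, dtype=float)
    r = np.mean(relevant, axis=0) if len(relevant) else 0.0
    n = np.mean(nonrelevant, axis=0) if len(nonrelevant) else 0.0
    return alpha * q + beta * r - gamma * n

# toy 2-D example: the new query drifts toward the "good" vector
print(rocchio([0.4, 0.8], relevant=[[0.8, 0.3]], nonrelevant=[[0.2, 0.7]]))
```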
17. Cluster generation
- Problem:
- given N points in V dimensions,
- group them
18. Cluster generation
- Problem:
- given N points in V dimensions,
- group them (typically k-means or AGNES is used; a minimal sketch follows)
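For concreteness, a minimal k-means sketch in plain NumPy (illustrative; in practice a library implementation would be used):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers, labels = kmeans(pts, k=2)   # finds the two well-separated groups
```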
19. Assigning Weights to Terms
- Binary Weights
- Raw term frequency
- tf x idf
- Recall the Zipf distribution
- Want to weight terms highly if they are
- frequent in relevant documents BUT
- infrequent in the collection as a whole
20. Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector
21. Raw Term Weights
- The frequency of occurrence for the term in each
document is included in the vector
22. Assigning Weights
- tf x idf measure:
- term frequency (tf)
- inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
23. tf x idf
- standard form (the slide's formula image is assumed to be this definition): weight(i, j) = tf(i, j) x log(N / df(j)), where N is the total number of documents and df(j) the number of documents containing term j
24. Inverse Document Frequency
- IDF provides high values for rare words and low values for common words
- For a collection of 10,000 documents (worked example below):
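The slide's table of example values did not survive extraction; a hedged reconstruction using the common definition idf(t) = log10(N / df(t)):

```python
import math

N = 10_000                    # collection size from the slide
for df in (1, 100, 10_000):   # df = number of docs containing the term
    print(df, round(math.log10(N / df), 2))
# df = 1      -> idf = 4.0   (rare word: high value)
# df = 100    -> idf = 2.0
# df = 10000  -> idf = 0.0   (word in every doc: no discriminating power)
```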
25. Similarity Measures for document vectors (sketches below)
- Simple matching (coordination level match)
- Dice's coefficient
- Jaccard's coefficient
- Cosine coefficient
- Overlap coefficient
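Hedged sketches of these coefficients for binary term sets X and Y (standard textbook definitions; the slide's own formulas were lost in extraction):

```python
import math

def matching(X, Y): return len(X & Y)                         # coordination level
def dice(X, Y):     return 2 * len(X & Y) / (len(X) + len(Y))
def jaccard(X, Y):  return len(X & Y) / len(X | Y)
def cosine(X, Y):   return len(X & Y) / math.sqrt(len(X) * len(Y))
def overlap(X, Y):  return len(X & Y) / min(len(X), len(Y))

X, Y = {"nova", "galaxy", "heat"}, {"nova", "galaxy", "film"}
print(matching(X, Y), dice(X, Y), jaccard(X, Y), cosine(X, Y), overlap(X, Y))
# 2  0.67  0.5  0.67  0.67   (rounded)
```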
26. tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
- to "normalize" usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive (see the sketch below)
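One common instance is cosine (length) normalization: divide each weight vector by its Euclidean norm, so every document vector has length 1. A small illustrative sketch:

```python
import numpy as np

def normalize(w):
    """Scale a weight vector to unit length (no-op for the zero vector)."""
    norm = np.linalg.norm(w)
    return w / norm if norm else w

print(normalize(np.array([10., 5., 3.])))  # entries now fall in [0, 1]
```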
27. Vector space similarity (use the weights to compare the documents)
28. Computing Similarity Scores
[Figure: query and document vectors plotted on axes from 0 to 1]
29. Vector Space with Term Weights and Cosine Matching
- D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit); Q = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit)
[Figure: two-term example on axes Term A and Term B, with Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7); worked numerically below]
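Reproducing the slide's two-term example numerically, with the usual cosine of the angle between the weight vectors:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(cosine(Q, D1))   # ~0.73
print(cosine(Q, D2))   # ~0.98 -> D2 points almost the same way as Q
```

So D2 is ranked above D1, even though D1 has the larger weight on Term A: cosine matching rewards direction, not magnitude.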
30. Text - Detailed outline
- Text databases
- problem
- full text scanning
- inversion
- signature files (a.k.a. Bloom Filters)
- Vector model and clustering
- information filtering and LSI
31. Information Filtering + LSI
- [Foltz '92] Goal:
- users specify their interests (= keywords)
- the system alerts them on suitable news documents
- Major contribution: LSI = Latent Semantic Indexing
- latent ("hidden") concepts
32. Information Filtering + LSI
- Main idea:
- map each document into some "concepts"
- map each term into some "concepts"
- "Concept": a set of terms, with weights, e.g.
- "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept
33. Information Filtering + LSI
- Pictorially: term-document matrix (BEFORE)
34. Information Filtering + LSI
- Pictorially: concept-document matrix and ...
35. Information Filtering + LSI
- ... and concept-term matrix
36. Information Filtering + LSI
- Q: How to search, e.g., for "system"?
37. Information Filtering + LSI
- A: find the corresponding concept(s) and the corresponding documents
38. Information Filtering + LSI
- A: find the corresponding concept(s) and the corresponding documents (a sketch follows)
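A sketch of this search step on a toy term-document matrix in the spirit of the CS/MD example (the matrix values and the query are made up; only the mechanics follow the slides):

```python
import numpy as np

# rows: documents; columns: terms (data, inf., retrieval, brain, lung)
A = np.array([[1., 1., 1., 0., 0.],   # CS-flavored docs
              [2., 2., 2., 0., 0.],
              [0., 0., 0., 1., 1.],   # MD-flavored docs
              [0., 0., 0., 2., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

q = np.array([0., 1., 0., 0., 0.])   # query: the single term "inf."
q_concept = q @ Vt.T                  # fold the query into concept space
doc_concept = U * s                   # documents in concept space
print(doc_concept @ q_concept)        # CS docs score high, MD docs ~0
```

With a truncated (low-rank) SVD, a document that uses only "data" and "retrieval" would still project onto the same concept and score above zero, which is exactly the thesaurus-like behavior of slide 39.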
39. Information Filtering + LSI
- Thus, it works like an (automatically constructed) thesaurus:
- we may retrieve documents that DON'T have the term "system", but contain almost everything else ("data", "retrieval")
40. SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
41. SVD - Motivation
- problem #1: text - LSI: find "concepts"
- problem #2: compression / dim. reduction
42. SVD - Motivation
- problem #1: text - LSI: find "concepts"
43. SVD - Motivation
- problem #2: compress / reduce dimensionality
44. Problem - specs
- 10^6 rows; 10^3 columns; no updates
- random access to any cell(s); small error: OK
45. SVD - Motivation
46. SVD - Motivation
47. SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
48. SVD - Definition
- A[n x m] = U[n x r] L[r x r] (V[m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- L: r x r diagonal matrix (strength of each concept) (r: rank of the matrix)
- V: m x r matrix (m terms, r concepts)
- (a numerical check follows)
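A quick check of the definition with NumPy's reduced SVD (the random matrix is just for illustration; here r = min(n, m)):

```python
import numpy as np

A = np.random.default_rng(0).random((6, 4))       # n=6 docs, m=4 terms
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s holds the diagonal of L
L = np.diag(s)                                    # strengths, sorted descending
print(np.allclose(A, U @ L @ Vt))                 # True: A = U L V^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))   # True: U column-orthonormal
```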
49. SVD - Properties
- THEOREM [Press '92]: it is always possible to decompose matrix A into A = U L V^T, where
- U, L, V: unique (*)
- U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other)
- U^T U = I; V^T V = I (I: identity matrix)
- L: eigenvalues are positive, and sorted in decreasing order
50. SVD - Example
[Figure: a term-document matrix A (terms: data, inf., retrieval, brain, lung; rows: CS and MD documents), written as the product U x L x V^T]
51. SVD - Example
[Figure: the same decomposition, with the two concepts labeled "CS-concept" and "MD-concept"]
52. SVD - Example
[Figure: U highlighted as the doc-to-concept similarity matrix]
53. SVD - Example
[Figure: the corresponding diagonal entry of L highlighted: strength of the CS-concept]
54. SVD - Example
[Figure: V^T highlighted as the term-to-concept similarity matrix]
55. SVD - Example
[Figure: the CS-concept row of the term-to-concept similarity matrix highlighted]
56. SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
57. SVD - Interpretation 1
- "documents", "terms" and "concepts":
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- L: its diagonal elements give the strength of each concept
58. SVD - Interpretation 2
- best axis to project on ("best" = min sum of squares of projection errors)
59. SVD - Motivation
60. SVD - Interpretation 2
- SVD gives the best axis to project on
[Figure: a 2-D cloud of points with the first axis v1 drawn through it]
61-63. SVD - Interpretation 2
[Figures: the points projected onto v1, and the variance ("spread") on the v1 axis]
64. SVD - Interpretation 2
- A = U L V^T - example:
- U L gives the coordinates of the points on the projection axis
[Figure: the decomposition, with the product U L highlighted]
65. SVD - Interpretation 2
- More details
- Q: how exactly is dim. reduction done?
66. SVD - Interpretation 2
- More details
- Q: how exactly is dim. reduction done?
- A: set the smallest eigenvalues to zero (see the sketch below)
[Figure: the decomposition, with the smallest eigenvalue zeroed out]
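A sketch of that answer: zero the smallest singular values (the slides' "eigenvalues") and rebuild the matrix as a low-rank approximation (the function name is illustrative):

```python
import numpy as np

def low_rank(A, k):
    """Rank-k approximation of A: keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s[k:] = 0.0                      # set the smallest values to zero
    return U @ np.diag(s) @ Vt

A = np.random.default_rng(1).random((6, 4))
print(np.linalg.matrix_rank(low_rank(A, 2)))  # 2
```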
67-70. SVD - Interpretation 2
[Figures: the reconstruction after zeroing the smallest eigenvalue: a low-rank approximation of A]
71. SVD - Interpretation 2
- Equivalent:
- "spectral decomposition" of the matrix
72. SVD - Interpretation 2
- Equivalent:
- "spectral decomposition" of the matrix:
- A = l1 u1 v1^T + l2 u2 v2^T + ...
73-74. SVD - Interpretation 2
- "spectral decomposition" of the matrix:
- A (n x m) = l1 u1 v1^T + l2 u2 v2^T + ..., with r terms, where each u_i is an n x 1 column vector and each v_i^T a 1 x m row vector
75. SVD - Interpretation 2
- approximation / dim. reduction:
- by keeping the first few terms (Q: how many?)
- assume l1 > l2 > ...
76. SVD - Interpretation 2
- A: (heuristic [Fukunaga]) keep 80-90% of the "energy" (= sum of squares of the l_i's); a sketch follows
- assume l1 > l2 > ...
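A sketch of the energy heuristic: keep the smallest k whose leading values capture the chosen fraction of the total energy (the values below are made up):

```python
import numpy as np

def k_for_energy(s, fraction=0.9):
    """Smallest k s.t. the first k values hold >= `fraction` of the energy."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, fraction)) + 1

s = np.array([10., 5., 2., 0.5])  # assumed sorted: l1 > l2 > ...
print(k_for_energy(s))            # 2: the first two values hold ~97% of energy
```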
77-78. SVD - Interpretation 3
- finds non-zero "blobs" in a data matrix
[Figure: a block-structured data matrix and its decomposition]
79. SVD - Interpretation 3
- Drill: find the SVD, by inspection!
- Q: rank = ??
[Figure: a small block-structured matrix to decompose]
80. SVD - Interpretation 3
- A: rank = 2 (2 linearly independent rows/cols)
[Figure: the two blocks give two rank-1 terms]
81. SVD - Interpretation 3
- A: rank = 2 (2 linearly independent rows/cols)
- Q: are the column vectors orthogonal?
82. SVD - Interpretation 3
- column vectors: orthogonal - but not unit vectors
[Figure: the candidate U and V matrices, mostly zeros outside the blocks]
83. SVD - Interpretation 3
[Figure: the decomposition after normalizing the columns to unit length]
84. SVD - Interpretation 3
- A: SVD properties:
- the matrix product should give back matrix A
- matrix U should be column-orthonormal, i.e., columns should be unit vectors, orthogonal to each other
- ditto for matrix V
- matrix L should be diagonal, with positive values
85. SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
86. SVD - Complexity
- O(n * m * m) or O(n * n * m) (whichever is less)
- less work, if we just want the eigenvalues
- or if we want the first k eigenvectors
- or if the matrix is sparse [Berry]
- Implemented in any linear algebra package (LINPACK, Matlab, S-Plus, Mathematica, ...)
87. SVD - Complexity
- Faster algorithms for approximate eigenvector computations exist:
- Alan Frieze, Ravi Kannan, Santosh Vempala: Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. Proc. 39th FOCS, p. 370, November 8-11, 1998
- Sudipto Guha, Dimitrios Gunopulos, Nick Koudas: Correlating Synchronous and Asynchronous Data Streams. KDD 2003: 529-534
88. SVD - conclusions so far
- SVD: A = U L V^T: unique (*)
- U: document-to-concept similarities
- V: term-to-concept similarities
- L: strength of each concept
- dim. reduction: keep the first few strongest eigenvalues (80-90% of the "energy")
- SVD picks up linear correlations
- SVD picks up non-zero "blobs"
89. References
- Berry, Michael: http://www.cs.utk.edu/lsi/
- Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press.
- Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C. Cambridge University Press.