Transcript and Presenter's Notes

Title: Information Retrieval


1
Information Retrieval & Data Mining: A Linear Algebraic Perspective
Petros Drineas, Rensselaer Polytechnic Institute, Computer Science Department
To access my web page: drineas
2
Modern data
Facts: Computers make it easy to collect and store data. Costs of storage are very low and are dropping very fast. (Most laptops have a storage capacity of more than 100 GB.) When it comes to storing data, the current policy typically is "store everything in case it is needed later" instead of deciding what could be deleted.
3
Data mining
Facts: Computers make it easy to collect and store data. Costs of storage are very low and are dropping very fast. (Most laptops have a storage capacity of more than 100 GB.) When it comes to storing data, the current policy typically is "store everything in case it is needed later" instead of deciding what could be deleted.
Data Mining: Extract useful information from the massive amount of available data.
4
About the tutorial
Tools: Introduce matrix algorithms and matrix decompositions for data mining and information retrieval applications.
Goal: Learn a model for the underlying physical system generating the dataset.
5
About the tutorial
Tools: Introduce matrix algorithms and matrix decompositions for data mining and information retrieval applications.
Goal: Learn a model for the underlying physical system generating the dataset.
Math is necessary to design and analyze principled algorithmic techniques to data-mine the massive datasets that have become ubiquitous in scientific research.
(Diagram connecting data, mathematics, and algorithms.)
6
Why linear (or multilinear) algebra?
Data are represented by matrices: Numerous modern datasets are in matrix form.
Data are represented by tensors: Data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature in the last few years.
7
Why linear (or multilinear) algebra?
Data are represented by matrices: Numerous modern datasets are in matrix form.
Data are represented by tensors: Data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature in the last few years.
Linear algebra (and numerical analysis) provides the fundamental mathematical and algorithmic tools to deal with matrix and tensor computations. (This tutorial will focus on matrices; pointers to some tensor decompositions will be provided.)
8
Why matrix decompositions?
  • Matrix decompositions
  • (e.g., SVD, QR, SDD, CX and CUR, NMF, MMMF, etc.)
  • They use the relationships between the available
    data in order to identify components of the
    underlying physical system generating the data.
  • Some assumptions on the relationships between
    the underlying components are necessary.
  • Very active area of research: some matrix
    decompositions are more than a century old,
    whereas others are very recent.

9
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

10
Datasets in the form of matrices
We are given m objects and n features describing the objects. (Each object has n numeric values describing it.)
Dataset: An m-by-n matrix A, where Aij shows the importance of feature j for object i. Every row of A represents an object.
Goal: We seek to understand the structure of the data, e.g., the underlying process generating the data.
11
Market basket matrices
n products (e.g., milk, bread, wine, etc.)
Common representation for association rule
mining.
  • Data mining tasks
  • Find association rules
  • E.g., customers who buy product x buy product y
    with probability 89%.
  • Such rules are used to make item display
    decisions, advertising decisions, etc.

m customers
Aij: quantity of the j-th product purchased by the
i-th customer
12
Social networks (e-mail graph)
n users
Represents the email communications between
groups of users.
  • Data mining tasks
  • cluster the users
  • identify dense networks of users (dense
    subgraphs)

n users
Aij: number of emails exchanged between users i
and j during a certain time period
13
Document-term matrices
A collection of documents is represented by an
m-by-n matrix (bag-of-words model).
n terms (words)
  • Data mining tasks
  • Cluster or classify documents
  • Find nearest neighbors
  • Feature selection: find a subset of terms that
    (accurately) clusters or classifies documents.

m documents
Aij: frequency of the j-th term in the i-th document
14
Document-term matrices
A collection of documents is represented by an
m-by-n matrix (bag-of-words model).
n terms (words)
  • Data mining tasks
  • Cluster or classify documents
  • Find nearest neighbors
  • Feature selection: find a subset of terms that
    (accurately) clusters or classifies documents.

m documents
Aij: frequency of the j-th term in the i-th document
Example later
15
Recommendation systems
The m-by-n matrix A represents m customers and n
products.
products
Data mining task: Given a few samples from A, recommend high-utility products to customers.
customers
Aij: utility of the j-th product to the i-th customer
16
Biology: microarray data
Microarray Data: Rows: genes (≈ 5,500). Columns: 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.).
Task: Pick a subset of genes (if it exists) that suffices to identify the cancer type of a patient.
Nielsen et al., Lancet, 2002
17
Biology: microarray data
Microarray Data: Rows: genes (≈ 5,500). Columns: 46 soft-tissue tumour specimens (different types of cancer, e.g., LIPO, LEIO, GIST, MFH, etc.).
Task: Pick a subset of genes (if it exists) that suffices to identify the cancer type of a patient.
Example later
Nielsen et al., Lancet, 2002
18
Human genetics
Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
(Illustration: an individuals-by-SNPs genotype matrix; each entry is a two-letter genotype such as AG, CT, GG, TT.)
Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available.
Task: split the individuals into different clusters depending on their ancestry, and find a small subset of genetic markers that are ancestry informative.
19
Human genetics
Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
(Illustration: an individuals-by-SNPs genotype matrix; each entry is a two-letter genotype such as AG, CT, GG, TT.)
Matrices including hundreds of individuals and more than 300,000 SNPs are publicly available.
Task: split the individuals into different clusters depending on their ancestry, and find a small subset of genetic markers that are ancestry informative.
Example later
20
Tensors: recommendation systems
  • Economics
  • Utility is an ordinal, not a cardinal, concept.
  • Compare products; don't assign utility values.
  • Recommendation Model Revisited
  • Every customer has an n-by-n matrix (whose
    entries are ±1) representing pair-wise
    product comparisons.
  • There are m such matrices, forming an
    n-by-n-by-m 3-mode tensor A.

21
Tensors: hyperspectral images
Spectrally resolved images may be viewed as a tensor.
Task: Identify and analyze regions of significance in the images.
22
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

23
The Singular Value Decomposition (SVD)
Recall: data matrices have m rows (one for each object) and n columns (one for each feature).
Matrix rows: points (vectors) in a Euclidean space, e.g., given 2 objects, each described with respect to two features, we get a 2-by-2 matrix. Two objects are close if the angle between their corresponding vectors is small.
24
SVD, intuition
Let the blue circles represent m data points in a
2-D Euclidean space. Then, the SVD of the m-by-2
matrix of the data will return
25
Singular values
σ1 measures how much of the data variance is explained by the first singular vector. σ2 measures how much of the data variance is explained by the second singular vector.
26
SVD: formal definition
A = UΣVT. ρ: rank of A. U (V): orthogonal matrix containing the left (right) singular vectors of A. Σ: diagonal matrix containing the singular values of A.
27
Rank-k approximations via the SVD
(Diagram: A = U Σ VT; the top-k singular directions are marked "significant" and the remaining ones "noise". Rows of A: objects; columns: features.)
28
Rank-k approximations (Ak)
Ak = UkΣkVkT. Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
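A minimal numpy sketch of the rank-k approximation just described; the matrix size and the value of k below are made up for illustration.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T, via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).random((5, 3))   # 5 objects, 3 features (toy data)
A2 = rank_k_approximation(A, 2)
print(np.linalg.matrix_rank(A2))              # 2
print(np.linalg.norm(A - A2, 'fro'))          # here: the smallest singular value of A
```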
29
PCA and SVD
Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. SVD is the algorithmic tool behind MultiDimensional Scaling (MDS) and Factor Analysis.
"SVD is the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra." (Dianne O'Leary, MMDS 06)
30
Ak as an optimization problem
Minimize, over all m-by-k matrices Y and k-by-n matrices X, ||A - YX||F (Frobenius norm).
Given Y, it is easy to find X from standard least squares. However, the fact that we can find the optimal Y is intriguing!
31
Ak as an optimization problem
Minimize, over all m-by-k matrices Y and k-by-n matrices X, ||A - YX||F (Frobenius norm).
Given Y, it is easy to find X from standard least squares. However, the fact that we can find the optimal Y is intriguing! Optimal Y = Uk, optimal X = UkT A.
32
LSI: Ak for document-term matrices (Berry, Dumais, and O'Brien 92)
Latent Semantic Indexing (LSI): Replace A by Ak; apply clustering/classification algorithms on Ak.
n terms (words)
  • Pros
  • Less storage for small k.
  • O(km + kn) vs. O(mn)
  • Improved performance.
  • Documents are represented in a concept space.

m documents
Aij: frequency of the j-th term in the i-th document
33
LSI: Ak for document-term matrices (Berry, Dumais, and O'Brien 92)
Latent Semantic Indexing (LSI): Replace A by Ak; apply clustering/classification algorithms on Ak.
n terms (words)
  • Pros
  • Less storage for small k.
  • O(km + kn) vs. O(mn)
  • Improved performance.
  • Documents are represented in a concept space.
  • Cons
  • Ak destroys sparsity.
  • Interpretation is difficult.
  • Choosing a good k is tough.

m documents
Aij: frequency of the j-th term in the i-th document
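A small sketch of the LSI idea above, using a made-up 4-document, 5-term frequency matrix; documents are compared in the k-dimensional concept space given by Uk Σk.

```python
import numpy as np

# hypothetical document-term frequency matrix (rows: documents, columns: terms)
A = np.array([[2., 1., 0., 0., 0.],
              [1., 2., 1., 0., 0.],
              [0., 0., 0., 3., 1.],
              [0., 0., 1., 2., 2.]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs = U[:, :k] * s[:k]                      # documents in the concept space

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(docs[0], docs[1]))              # similar documents (shared terms)
print(cosine(docs[0], docs[2]))              # dissimilar documents
```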
34
Ak and k-means clustering (Drineas, Frieze, Kannan, Vempala, and Vinay 99)
k-means clustering: A standard objective function that measures cluster quality. (It often also denotes an iterative algorithm that attempts to optimize the k-means objective function.)
k-means objective: Input: a set of m points in Rn and a positive integer k. Output: a partition of the m points into k clusters that minimizes the sum of the squared Euclidean distances from each point to its cluster centroid.
35
k-means, cont'd
We seek to split the input points into 5 clusters.
36
k-means, cont'd
We seek to split the input points into 5 clusters. The cluster centroid is the average of all the points in the cluster.
37
k-means: a matrix formulation
Let A be the m-by-n matrix representing m points in Rn. Then, we seek to minimize ||A - XXTA||F^2 over cluster membership matrices X.
X is a special cluster membership matrix: Xij denotes whether the i-th point belongs to the j-th cluster.
38
k-means: a matrix formulation
Let A be the m-by-n matrix representing m points in Rn. Then, we seek to minimize ||A - XXTA||F^2 over cluster membership matrices X.
X is a special cluster membership matrix: Xij denotes whether the i-th point belongs to the j-th cluster (rows of X correspond to points, columns to clusters).
  • Columns of X are normalized to have unit length.
  • (We divide each column by the square root of the
    number of points in the cluster.)
  • Every row of X has at most one non-zero element.
  • (Each point belongs to at most one cluster.)
  • X is an orthogonal matrix, i.e., XTX = I.
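A short sketch of the cluster-membership matrix X described above; the point-to-cluster assignment below is made up, and the check confirms XTX = I.

```python
import numpy as np

def membership_matrix(labels, k):
    """Normalized cluster-indicator matrix: X[i, j] = 1/sqrt(|cluster j|) if point i
    is in cluster j, and 0 otherwise."""
    X = np.zeros((len(labels), k))
    for j in range(k):
        members = np.where(labels == j)[0]
        X[members, j] = 1.0 / np.sqrt(len(members))
    return X

labels = np.array([0, 0, 1, 2, 2, 2])        # hypothetical assignment of 6 points
X = membership_matrix(labels, 3)
print(np.allclose(X.T @ X, np.eye(3)))       # True: columns are orthonormal
```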
39
SVD and k-means
If we only require that X is an orthogonal matrix (XTX = I) and remove the condition on the number of non-zero entries per row of X, then the relaxed problem is easy to minimize! The solution is X = Uk.
40
SVD and k-means
If we only require that X is an orthogonal matrix (XTX = I) and remove the condition on the number of non-zero entries per row of X, then the relaxed problem is easy to minimize! The solution is X = Uk.
  • Using SVD to solve k-means
  • We can get a 2-approximation algorithm for
    k-means.
  • (Drineas, Frieze, Kannan, Vempala, and Vinay 99,
    04)
  • We can get heuristic schemes to assign points to
    clusters.
  • (Zha, He, Ding, Simon, and Gu 01)
  • There exist PTAS (based on random projections)
    for the k-means problem.
  • (Ostrovsky and Rabani 00, 02)
  • Deeper connections between SVD and clustering in
    Kannan, Vempala, and Vetta 00, 04.

41
Ak and Kleinberg's HITS algorithm (Kleinberg 98, 99)
Hypertext Induced Topic Selection (HITS): A link analysis algorithm that rates Web pages by their authority and hub scores.
Authority score: an estimate of the value of the content of the page. Hub score: an estimate of the value of the links from this page to other pages. These values can be used to rank Web search results.
42
Ak and Kleinberg's HITS algorithm
Hypertext Induced Topic Selection (HITS): A link analysis algorithm that rates Web pages by their authority and hub scores.
Authority score: an estimate of the value of the content of the page. Hub score: an estimate of the value of the links from this page to other pages. These values can be used to rank Web search results.
Authority: a page that is pointed to by many pages with high hub scores. Hub: a page pointing to many pages that are good authorities. Recursive definition: notice that each node has two scores.
43
Ak and Kleinberg's HITS algorithm
Phase 1: Given a query term (e.g., "jaguar"), find all pages containing the query term (the root set). Expand the resulting graph by one move forward and backward (the base set).
44
Ak and Kleinberg's HITS algorithm
Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ Rn be the vectors of hub (authority) scores. Then h = Aa and a = ATh, so h = AATh and a = ATAa.
45
Ak and Kleinberg's HITS algorithm
Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ Rn be the vectors of hub (authority) scores. Then h = Aa and a = ATh, so h = AATh and a = ATAa.
Thus, the top left (right) singular vector of A corresponds to hub (authority) scores.
46
Ak and Kleinberg's HITS algorithm
Phase 2: Let A be the adjacency matrix of the (directed) graph of the base set. Let h, a ∈ Rn be the vectors of hub (authority) scores. Then h = Aa and a = ATh, so h = AATh and a = ATAa.
Thus, the top left (right) singular vector of A corresponds to hub (authority) scores. What about the rest? They provide a natural way to extract additional densely linked collections of hubs and authorities from the base set. See the "jaguar" example in Kleinberg 99.
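A minimal sketch of Phase 2 on a made-up 4-page base set: hub and authority scores are computed by iterating h = Aa, a = ATh, and (up to normalization and sign) they match the top left and right singular vectors of A.

```python
import numpy as np

def hits_scores(A, iters=200):
    """Power iteration for HITS: h = A a, a = A^T h, normalized at each step."""
    h = np.ones(A.shape[0])
    a = np.ones(A.shape[1])
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a

# hypothetical adjacency matrix: A[i, j] = 1 if page i links to page j
A = np.array([[0., 1., 1., 0.],
              [0., 0., 1., 0.],
              [1., 0., 0., 0.],
              [0., 1., 1., 0.]])
h, a = hits_scores(A)
U, s, Vt = np.linalg.svd(A)
print(np.allclose(np.abs(h), np.abs(U[:, 0])), np.allclose(np.abs(a), np.abs(Vt[0])))
```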
47
SVD example: microarray data
Microarray Data (Nielsen et al., Lancet, 2002): Columns: genes (≈ 5,500). Rows: 32 patients, three different cancer types (GIST, LEIO, SynSarc).
48
SVD example: microarray data
Microarray Data: Applying k-means with k = 3 in this 3D space results in 3 misclassifications. Applying k-means with k = 3 but retaining 4 PCs results in one misclassification. Can we find actual genes (as opposed to eigengenes) that achieve similar results?
49
SVD example: ancestry-informative SNPs
Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
(Illustration: an individuals-by-SNPs genotype matrix; each entry is a two-letter genotype such as AG, CT, GG, TT.)
There are ≈ 10 million SNPs in the human genome, so this table could have 10 million columns.
50
Two copies of a chromosome (father, mother)
51
Two copies of a chromosome (father, mother)
(Illustration: an individuals-by-SNPs genotype matrix; each entry is a two-letter genotype such as AG, CT, GG, TT.)
52
Two copies of a chromosome (father, mother)
  • An individual could be
  • Heterozygotic (in our study, CT = TC)
  • Homozygotic at the first allele, e.g., C

(Illustration: an individuals-by-SNPs genotype matrix; each entry is a two-letter genotype such as AG, CT, GG, TT.)
53
Two copies of a chromosome (father, mother)
  • An individual could be
  • Heterozygotic (in our study, CT = TC): encode as 0
  • Homozygotic at the first allele, e.g., CC: encode as 1
  • Homozygotic at the second allele, e.g., TT: encode as -1

(Illustration: an individuals-by-SNPs genotype matrix; each entry is a two-letter genotype such as AG, CT, GG, TT.)
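A small sketch of the encoding rule above for a single SNP; which base is treated as the "first" allele is an arbitrary choice made here for illustration ('C' for a C/T SNP).

```python
import numpy as np

def encode_snp(genotypes, first_allele):
    """Encode two-letter genotypes as 0 (heterozygous), +1 (homozygous in the
    first allele) or -1 (homozygous in the second allele)."""
    codes = []
    for g in genotypes:
        if g[0] != g[1]:
            codes.append(0)
        elif g[0] == first_allele:
            codes.append(1)
        else:
            codes.append(-1)
    return np.array(codes)

print(encode_snp(['CT', 'CC', 'TT', 'TC'], first_allele='C'))   # [ 0  1 -1  0]
```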
54
(a) Why are SNPs really important?
Association studies: Locating causative genes for common complex disorders (e.g., diabetes, heart disease, etc.) is based on identifying associations between affection status and known SNPs. No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary.
The subsequent investigation of candidate genes
that are in physical proximity with the
associated SNPs is the first step towards
understanding the etiological pathway of a
disorder and designing a drug.
55
(b) Why are SNPs really important?
Among different populations (e.g., European,
Asian, African, etc.), different patterns of SNP
allele frequencies or SNP correlations are often
observed.
Understanding such differences is crucial in
order to develop the next generation of drugs
that will be population specific (eventually
genome specific) and not just disease
specific.
56
The HapMap project
  • Mapping the whole genome sequence of a single
    individual is very expensive.
  • Mapping all the SNPs is also quite expensive,
    but the costs are dropping fast.

HapMap project ($130,000,000 in funding from NIH and other sources): Map approx. 4 million SNPs for 270 individuals from 4 different populations (YRI, CEU, CHB, JPT), in order to create a genetic map to be used by researchers.
Also, funding from pharmaceutical companies, NSF,
the Department of Justice, etc.
Is it possible to identify the ethnicity of a
suspect from his DNA?
57
CHB and JPT
  • Let A be the 90-by-2.7 million matrix of the CHB and
    JPT populations in HapMap.
  • Run SVD (PCA) on A, keep the two (left) singular
    vectors, and plot the results.
  • Run a (naïve, e.g., k-means) clustering
    algorithm to split the data points in two
    clusters.

Paschou, Ziv, Burchard, Mahoney, and Drineas, to
appear in PLOS Genetics 07 (data from E. Ziv and
E. Burchard, UCSF)
Paschou, Mahoney, Javed, Kidd, Pakstis, Gu, Kidd,
and Drineas, Genome Research 07 (data from K.
Kidd, Yale University)
58
(No Transcript)
59
EigenSNPs cannot be assayed
Not altogether satisfactory: the (top two left) singular vectors are linear combinations of all SNPs, and of course cannot be assayed! Can we find actual SNPs that capture the information in the (top two left) singular vectors? (E.g., spanning the same subspace.) We will get back to this later.
60
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

61
CX decomposition
Constrain the m-by-k factor Y to contain exactly k columns of A. Notation: replace Y by C(olumns). It is easy to prove that the optimal X = C+A. (C+ is the Moore-Penrose pseudoinverse of C.) Also called interpolative approximation (some extra conditions on the elements of X are required).
62
CX decomposition
Why? If A is an object-feature matrix, then selecting representative columns is equivalent to selecting representative features. This leads to easier interpretability, compared to eigenfeatures, which are linear combinations of all features.
63
Column Subset Selection problem (CSS)
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C that minimizes ||A - PCA|| over all O(n^k) choices for C.
64
Column Subset Selection problem (CSS)
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C that minimizes ||A - PCA|| over all O(n^k) choices for C.
C+: the pseudoinverse of C, easily computed via the SVD of C. (If C = UΣVT, then C+ = VΣ^-1UT.)
PC = CC+ is the projection matrix onto the subspace spanned by the columns of C.
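A short numpy check of the identities above (C+ computed via the SVD of C, PC = CC+); the matrix and the chosen columns are made up.

```python
import numpy as np

A = np.random.default_rng(0).random((8, 6))
C = A[:, [0, 2, 5]]                      # a hypothetical choice of k = 3 columns

C_pinv = np.linalg.pinv(C)               # Moore-Penrose pseudoinverse (via the SVD of C)
X = C_pinv @ A                           # optimal X for this C
P_C = C @ C_pinv                         # projection onto the span of the columns of C

print(np.allclose(C @ X, P_C @ A))       # True: CX = P_C A
print(np.linalg.norm(A - P_C @ A, 'fro'))
```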
65
Column Subset Selection problem (CSS)
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C that minimizes ||A - PCA|| over all O(n^k) choices for C.
PC = CC+ is the projection matrix onto the subspace spanned by the columns of C.
Complexity of the problem? O(n^k mn) trivially works; the problem is NP-hard if k grows as a function of n. (NP-hardness shown in Civril & Magdon-Ismail 07)
66
Spectral norm
Given an m-by-n matrix A, find k columns of A forming an m-by-k matrix C such that
  • ||A - PCA||2 is minimized over all O(n^k) possible choices
    for C.
  • Remarks
  • PC A is the projection of A onto the subspace
    spanned by the columns of C.
  • The spectral (or 2-) norm of an m-by-n matrix X is
    ||X||2 = max over unit vectors y of ||Xy||2.

67
A lower bound for the CSS problem
For any m-by-k matrix C consisting of at most k columns of A: ||A - PCA||2 ≥ ||A - Ak||2.
  • Remarks
  • This is also true if we replace the spectral norm
    by the Frobenius norm.
  • This is a potentially weak lower bound.

68
Prior work: numerical linear algebra
  • Numerical Linear Algebra algorithms for CSS
  • Deterministic, typically greedy approaches.
  • Deep connection with the Rank Revealing QR
    factorization.
  • Strongest results so far (spectral norm): in
    O(mn^2) time, ||A - PCA||2 ≤ p(k,n) ||A - Ak||2 for
    some function p(k,n).
69
Prior work: numerical linear algebra
  • Numerical Linear Algebra algorithms for CSS
  • Deterministic, typically greedy approaches.
  • Deep connection with the Rank Revealing QR
    factorization.
  • Strongest results so far (Frobenius norm): in
    O(n^k) time.

70
Working on p(k,n): 1965 to today
71
Prior work: theoretical computer science
  • Theoretical Computer Science algorithms for CSS
  • Randomized approaches, with some failure
    probability.
  • More than k rows are picked, e.g., O(poly(k))
    rows.
  • Very strong bounds for the Frobenius norm in low
    polynomial time.
  • Not many spectral norm bounds

72
The strongest Frobenius norm bound
Given an m-by-n matrix A, there exists an O(mn^2) algorithm that picks at most O(k log k / ε^2) columns of A such that, with probability at least 1 - 10^-20, ||A - PCA||F ≤ (1 + ε) ||A - Ak||F.
73
The CX algorithm
Input: m-by-n matrix A, 0 < ε < 1 (the desired accuracy). Output: C, the matrix consisting of the selected columns.
  • CX algorithm
  • Compute probabilities pj summing to 1
  • Let c = O(k log k / ε^2)
  • For each j = 1, 2, ..., n, pick the j-th column of A
    with probability min{1, c·pj}
  • Let C be the matrix consisting of the chosen
    columns
  • (C has in expectation at most c columns)

74
Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)(i) are not.
75
Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)(i) are not.
Subspace sampling in O(mn^2) time: pj = ||(VkT)(j)||2^2 / k (normalization s.t. the pj sum up to 1).
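A sketch of the CX column sampling with the Frobenius-norm subspace sampling probabilities above; the data matrix, k, c, and the random seed are made up, and the constants of the actual algorithm are not reproduced.

```python
import numpy as np

def cx_columns(A, k, c, seed=0):
    """Keep column j with probability min{1, c * p_j}, where p_j is proportional
    to the squared norm of the j-th column of V_k^T (the p_j sum to 1)."""
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    p = np.sum(Vt[:k, :] ** 2, axis=0) / k        # subspace sampling probabilities
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
    return A[:, keep]

A = np.random.default_rng(1).random((100, 30))
C = cx_columns(A, k=5, c=20)
err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro')
s = np.linalg.svd(A, compute_uv=False)
print(C.shape[1], err, np.sqrt(np.sum(s[5:] ** 2)))   # ||A - CC+A||_F vs. ||A - A_5||_F
```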
76
Prior work in TCS
  • Drineas, Mahoney, and Muthukrishnan 2005
  • O(mn^2) time, O(k^2/ε^2) columns
  • Drineas, Mahoney, and Muthukrishnan 2006
  • O(mn^2) time, O(k log k/ε^2) columns
  • Deshpande and Vempala 2006
  • O(mnk^2) time and O(k^2 log k/ε^2) columns
  • They also prove the existence of
    k columns of A forming a matrix C, such that
  • Compare to prior best existence result

77
Open problems
  • Design
  • Faster algorithms (next slide)
  • Algorithms that achieve better approximation
    guarantees (a hybrid approach)

78
Prior work spanning NLA and TCS
  • Woolfe, Liberty, Rokhlin, and Tygert 2007
  • (also Martinsson, Rokhlin, and Tygert 2006)
  • O(mn log k) time, k columns, same spectral norm
    bounds as prior work
  • Beautiful application of the Fast
    Johnson-Lindenstrauss transform of Ailon-Chazelle

79
A hybrid approach (Boutsidis, Mahoney, and Drineas 07)
  • Given an m-by-n matrix A (assume m ≥ n for
    simplicity)
  • (Randomized phase) Run a randomized algorithm to
    pick c = O(k log k) columns.
  • (Deterministic phase) Run a deterministic
    algorithm on the above columns to pick exactly k
    columns of A and form an m-by-k matrix C.

Not so simple
80
A hybrid approach (Boutsidis, Mahoney, and Drineas 07)
  • Given an m-by-n matrix A (assume m ≥ n for
    simplicity)
  • (Randomized phase) Run a randomized algorithm to
    pick c = O(k log k) columns.
  • (Deterministic phase) Run a deterministic
    algorithm on the above columns to pick exactly k
    columns of A and form an m-by-k matrix C.

Not so simple
Our algorithm runs in O(mn^2) time and satisfies, with probability at least 1 - 10^-20,
81
Comparison: Frobenius norm
Our algorithm runs in O(mn^2) time and satisfies, with probability at least 1 - 10^-20,
  1. We provide an efficient algorithmic result.
  2. We guarantee a Frobenius norm bound that is at
    most (k log k)^1/2 worse than the best known
    existential result.

82
Comparison: spectral norm
Our algorithm runs in O(mn^2) time and satisfies, with probability at least 1 - 10^-20,
  1. Our running time is comparable with NLA
    algorithms for this problem.
  2. Our spectral norm bound grows as a function of
    (n-k)^1/4 instead of (n-k)^1/2!
  3. Do notice that with respect to k our bound is
    k^1/4 log^1/2 k worse than previous work.
  4. To the best of our knowledge, our result is the
    first asymptotic improvement of the work of Gu &
    Eisenstat 1996.

83
Randomized phase: O(k log k) columns
  • Randomized phase: c = O(k log k)
  • Compute probabilities pj summing to 1
  • For each j = 1, 2, ..., n, pick the j-th column of A
    with probability min{1, c·pj}
  • Let C be the matrix consisting of the chosen
    columns
  • (C has in expectation at most c columns)

84
Subspace sampling
Vk: orthogonal matrix containing the top k right singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
Remark: We need more elaborate subspace sampling probabilities than previous work.
Subspace sampling in O(mn^2) time; normalization s.t. the pj sum up to 1.
85
Deterministic phase: k columns
  • Deterministic phase
  • Let S1 be the set of indices of the columns
    selected by the randomized phase.
  • Let (VkT)S1 denote the set of columns of VkT
    with indices in S1.
  • (An extra technicality is that the columns of
    (VkT)S1 must be rescaled.)
  • Run a deterministic NLA algorithm on (VkT)S1 to
    select exactly k columns.
  • (Any algorithm with p(k,n) = k^1/2 (n-k)^1/2 will
    do.)
  • Let S2 be the set of indices of the selected
    columns (the cardinality of S2 is exactly k).
  • Return AS2 (the columns of A corresponding to
    indices in S2) as the final output.

86
Back to SNPs: CHB and JPT
Let A be the 90-by-2.7 million matrix of the CHB and JPT populations in HapMap.
Can we find actual SNPs that capture the
information in the top two left singular vectors?
87
Results
Number of SNPs    Misclassifications
40 (c = 400)      6
50 (c = 500)      5
60 (c = 600)      3
70 (c = 700)      1
  • Essentially as good as the best existing metric
    (informativeness).
  • However, our metric is unsupervised!
  • (Informativeness is supervised: it essentially
    identifies SNPs that are correlated with
    population membership, given such membership
    information.)
  • The fact that we can select ancestry informative
    SNPs in an unsupervised manner based on PCA is
    novel, and seems interesting.

88
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

89
CUR-type decompositions
For any matrix A, we can find C, U, and R such that the norm of A - CUR is almost equal to the norm of A - Ak. This might lead to a better understanding of the data.
90
Theorem: relative-error CUR (Drineas, Mahoney, Muthukrishnan 06, 07)
For any k, O(mn^2) time suffices to construct C, U, and R such that ||A - CUR||F ≤ (1 + ε) ||A - Ak||F holds with probability at least 1 - δ, by picking O(k log k log(1/δ) / ε^2) columns and O(k log^2 k log(1/δ) / ε^6) rows.
91
From SVD to CUR
Exploit structural properties of CUR to analyze
data
n features
A CUR-type decomposition needs O(min{mn^2, m^2n})
time.
m objects
  • Instead of reifying the Principal Components
  • Use PCA (a.k.a. SVD) to find how many Principal
    Components are needed to explain the data.
  • Run CUR and pick columns/rows instead of
    eigen-columns and eigen-rows!
  • Assign meaning to actual columns/rows of the
    matrix! Much more intuitive! Sparse!
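A minimal CUR sketch in the spirit of the above: after columns and rows have been chosen (here uniformly at random, purely for illustration; the theorem uses more careful sampling), U = C+ A R+ is the choice that minimizes ||A - CUR||F for the given C and R.

```python
import numpy as np

def cur(A, col_idx, row_idx):
    C = A[:, col_idx]
    R = A[row_idx, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # best U for this C and R
    return C, U, R

rng = np.random.default_rng(0)
A = rng.random((50, 40))
cols = rng.choice(40, size=10, replace=False)       # illustration: uniform sampling
rows = rng.choice(50, size=12, replace=False)
C, U, R = cur(A, cols, rows)
print(np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro'))
```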

92
CUR decompositions: a summary
G.W. Stewart (Num. Math. 99, TR 04): C: variant of the QR algorithm; R: variant of the QR algorithm; U: minimizes ||A - CUR||F. No a priori bounds; solid experimental performance.
Goreinov, Tyrtyshnikov, and Zamarashkin (LAA 97, Cont. Math. 01): C: columns that span max volume; U: W+ (pseudoinverse of the intersection W); R: rows that span max volume. Existential result; error bounds depend on ||W+||2; spectral norm bounds!
Williams and Seeger (NIPS 01): C: uniformly at random; U: W+; R: uniformly at random. Experimental evaluation; A is assumed PSD; connections to the Nyström method.
Drineas, Kannan, and Mahoney (SODA 03, 04): C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths. Randomized algorithm; provable, a priori bounds; explicit dependency on ||A - Ak||.
Drineas, Mahoney, and Muthu (05, 06): C: depends on singular vectors of A; U: (almost) W+; R: depends on singular vectors of C. (1+ε) approximation to ||A - Ak||; computable in SVDk(A) time.
93
Data applications of CUR
CMD factorization (Sun, Xie, Zhang, and Faloutsos 07; best paper award at the SIAM Conference on Data Mining 07): A CUR-type decomposition that avoids duplicate rows/columns that might appear in some earlier versions of CUR-type decompositions. Many interesting applications to large network datasets, DBLP, etc.; extensions to tensors.
Fast computation of Fourier Integral Operators (Demanet, Candès, and Ying 06): Application to seismology imaging data (PBytes of data can be generated). The problem boils down to solving integral equations, i.e., matrix equations after discretization. CUR-type structures appear; uniform sampling seems to work well in practice.
94
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

95
Decompositions that respect the data
Non-negative matrix factorization (Lee and Seung 00): Assume that the Aij are non-negative for all i, j.
96
The Non-negative Matrix Factorization
Non-negative matrix factorization (Lee and Seung 00): Assume that the Aij are non-negative for all i, j. Constrain Y and X to have only non-negative entries as well. This should respect the structure of the data better than Ak = UkΣkVkT, which introduces a lot of (difficult to interpret) negative entries.
97
The Non-negative Matrix Factorization
  • It has been extensively applied to
  • Image mining (Lee and Seung 00)
  • Enron email collection (Berry and Brown 05)
  • Other text mining tasks (Berry and Plemmons 04)
  • Algorithms for NMF
  • Multiplicative update rules (Lee and Seung 00,
    Hoyer 02)
  • Gradient descent (Hoyer 04, Berry and Plemmons
    04)
  • Alternating least squares (dating back to Paatero
    94)
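A compact sketch of the multiplicative update rules mentioned above (Lee-Seung style, for the Frobenius objective); the factor names W and H, the iteration count, and the toy data are illustrative choices.

```python
import numpy as np

def nmf(A, k, iters=500, eps=1e-9, seed=0):
    """Multiplicative updates for min ||A - W H||_F with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)     # keeps H non-negative
        W *= (A @ H.T) / (W @ H @ H.T + eps)     # keeps W non-negative
    return W, H

A = np.random.default_rng(1).random((30, 20))    # non-negative toy data
W, H = nmf(A, k=5)
print(np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro'))
```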

98
Algorithmic challenges for NMF
  • Algorithmic challenges for the NMF
  • NMF (as stated above) is convex given Y or X, but
    not if both are unknown.
  • No unique solution: many matrices Y and X
    minimize the error.
  • Other optimization objectives could be chosen
    (e.g., spectral norm, etc.)
  • NMF becomes harder if sparsity constraints are
    included (e.g., X has a small number of
    non-zeros).
  • For the multiplicative update rules there exists
    some theory proving that they converge to a fixed
    point; this might be a local optimum or a saddle
    point.
  • Little theory is known for the other algorithms.

99
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

100
SemiDiscrete Decomposition (SDD)
A ≈ Xk Dk YkT. Dk: diagonal matrix; Xk, Yk: all entries are in {-1, 0, 1}. SDD identifies regions of the matrix that have homogeneous density.
101
SemiDiscrete Decomposition (SDD)
SDD looks for blocks of similar-height "towers" and similar-depth "holes" (bump hunting). Applications include image compression and text mining.
O'Leary and Peleg 83, Kolda and O'Leary 98, 00, O'Leary and Roth 06. The figures are from D. Skillicorn's book on Data Mining with Matrix Decompositions.
102
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

103
Collaborative Filtering and MMMF
User ratings for movies. Goal: predict unrated movies (?).
104
Collaborative Filtering and MMMF
User ratings for movies. Goal: predict unrated movies (?).
Maximum Margin Matrix Factorization (MMMF): A novel, semi-definite programming based matrix decomposition that seems to perform very well on real data, including the Netflix challenge. Srebro, Rennie, and Jaakkola 04, Rennie and Srebro 05. Some pictures are from Srebro's presentation at NIPS 04.
105
A linear factor model
106
A linear factor model
User biases for different movie attributes
107
All users
(Possible) solution to collaborative filtering: fit a rank (exactly) k matrix X to Y. Fully observed Y ⇒ X is the best rank-k approximation to Y. Azar, Fiat, Karlin, McSherry, and Saia 01; Drineas, Kerenidis, and Raghavan 02.
108
Imputing the missing entries via SVD (Achlioptas
and McSherry 01, 06)
  • Reconstruction Algorithm
  • Compute the SVD of the matrix filling in the
    missing entries with zeros.
  • Some rescaling prior to computing the SVD is
    necessary, e.g., multiply by 1/(fraction of
    observed entries).
  • Keep the resulting top k principal components.

109
Imputing the missing entries via SVD (Achlioptas
and McSherry 01, 06)
  • Reconstruction Algorithm
  • Compute the SVD of the matrix filling in the
    missing entries with zeros.
  • Some rescaling prior to computing the SVD is
    necessary, e.g., multiply by 1/(fraction of
    observed entries).
  • Keep the resulting top k principal components.

Under assumptions on the quality of the
observed entries, reconstruction accuracy bounds
may be proven. The error bounds scale with the
Frobenius norm of the matrix.
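A small sketch of the reconstruction algorithm above: zero-fill, rescale by 1/(fraction of observed entries), and keep the top-k SVD. The rank-2 "ratings" matrix and the 60% observation rate are made up.

```python
import numpy as np

def svd_impute(Y, observed, k):
    Z = np.where(observed, Y, 0.0) / observed.mean()   # zero-fill and rescale
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
truth = rng.random((40, 2)) @ rng.random((2, 25))      # hypothetical rank-2 ratings
observed = rng.random(truth.shape) < 0.6               # 60% of entries observed
pred = svd_impute(truth, observed, k=2)
rel_err = np.linalg.norm((pred - truth)[~observed]) / np.linalg.norm(truth[~observed])
print(rel_err)                                         # error on the unobserved entries
```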
110
A convex formulation
  • MMMF
  • Focus on ±1 rankings (for simplicity).
  • Fit a prediction matrix X = UVT to the
    observations.

111
A convex formulation
  • MMMF
  • Focus on ±1 rankings (for simplicity).
  • Fit a prediction matrix X = UVT to the
    observations.
  • Objectives (CONVEX!)
  • Minimize the total number of mismatches between
    the observed data and the predicted data.
  • Keep the trace norm of X small.

112
A convex formulation
  • MMMF
  • Focus on ±1 rankings (for simplicity).
  • Fit a prediction matrix X = UVT to the
    observations.
  • Objectives (CONVEX!)
  • Minimize the total number of mismatches between
    the observed data and the predicted data.
  • Keep the trace norm of X small.

113
MMMF and SDP
MMMF: This may be formulated as a semi-definite program, and thus may be solved efficiently.
114
Bounding the factor contribution
MMMF: Instead of a hard rank constraint (non-convex), a softer constraint is introduced. The total number of contributing factors (number of columns/rows in U/VT) is unbounded, but their total contribution is bounded.
115
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

116
Tensors
  • Tensors appear both in Math and CS.
  • Connections to complexity theory (e.g., matrix
    multiplication complexity)
  • Dataset applications (e.g., Independent
    Component Analysis, higher-order statistics,
    etc.)
  • Also, many practical applications, e.g., Medical
    Imaging, Hyperspectral Imaging, video,
    Psychology, Chemometrics, etc.

However, there does not exist a definition of
tensor rank (and associated tensor SVD) with the
nice properties found in the matrix case.
117
Tensor rank
A definition of tensor rank: Given a tensor, find the minimum number of rank-one tensors into which it can be decomposed.
  • only weak bounds are known
  • tensor rank depends on the underlying ring of
    scalars
  • computing it is NP-hard
  • successive rank-one approximations are no good

118
Tensor decompositions
  • Many tensor decompositions matricize the tensor
  • PARAFAC, Tucker, Higher-Order SVD, DEDICOM, etc.
  • Most are computed via iterative algorithms (e.g.,
    alternating least squares).

(Diagram: given a 3-mode tensor, unfold it to create the unfolded (matricized) matrix.)
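A small sketch of matricizing (unfolding) a 3-mode tensor with numpy; note that the exact ordering of the columns in the unfolded matrix is a convention that differs between papers.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: the mode-n fibers become the columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # a 2 x 3 x 4 tensor
print(unfold(T, 0).shape)                   # (2, 12)
print(unfold(T, 1).shape)                   # (3, 8)
print(unfold(T, 2).shape)                   # (4, 6)
```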
119
Useful links on tensor decompositions
  • Workshop on Algorithms for Modern Massive Data
    Sets (MMDS) 06
  • http://www.stanford.edu/group/mmds/
  • Check the tutorial by Lek-Heng Lim on tensor
    decompositions.
  • Tutorial by Faloutsos, Kolda, and Sun at the SIAM
    Data Mining Conference 07
  • Tammy Kolda's web page
  • http://csmr.ca.sandia.gov/~tgkolda/

120
Overview
  • Datasets in the form of matrices (and tensors)
  • Matrix Decompositions
  • Singular Value Decomposition (SVD)
  • Column-based Decompositions (CX, interpolative
    decomposition)
  • CUR-type decompositions
  • Non-negative matrix factorization
  • Semi-Discrete Decomposition (SDD)
  • Maximum-Margin Matrix Factorization (MMMF)
  • Tensor decompositions
  • Regression
  • Coreset constructions
  • Fast algorithms for least-squares regression

121
Problem definition and motivation
In many applications (e.g., statistical data
analysis and scientific computation), one has n
observations of the form
122
Least-norm approximation problems
Recall a linear measurement model
In order to estimate x, solve
123
Application: data analysis in science
  • First application: Astronomy
  • Predicting the orbit of the asteroid Ceres (in
    1801!).
  • Gauss (1809) -- see also Legendre (1805) and
    Adrain (1808).
  • First application of least squares
    optimization, and it runs in O(nd^2) time!
  • Data analysis: Fit parameters of a biological,
    chemical, economic, physical (astronomical),
    social, internet, etc. model to experimental
    data.

124
Norms of common interest
Let y ≈ b and define the residual r = y - b.
Least-squares approximation: minimize ||r||2.
Chebyshev or mini-max approximation: minimize ||r||∞.
Sum of absolute residuals approximation: minimize ||r||1.
125
Lp norms and their unit balls
Recall the Lp norm for p ≥ 1: ||x||p = (Σi |xi|^p)^(1/p).
126
Lp regression problems
We are interested in over-constrained Lp regression problems: n >> d. Typically, there is no x such that Ax = b. We want to find the "best" x such that Ax ≈ b. Lp regression problems are convex programs (or better). There exist poly-time algorithms. We want to solve them faster.
127
Exact solution to L2 regression
Cholesky Decomposition: If A is full rank and well-conditioned, decompose ATA = RTR, where R is upper triangular, and solve the normal equations RTRx = ATb.
QR Decomposition: Slower but numerically stable, especially if A is rank-deficient. Write A = QR, and solve Rx = QTb.
Singular Value Decomposition: Most expensive, but best if A is very ill-conditioned. Write A = UΣVT, in which case xOPT = A+b = VΣ^-1UTb.
Complexity is O(nd^2), but the constant factors differ.
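A short numpy illustration of the three exact approaches above on a made-up over-constrained problem (n = 1000, d = 5); all three agree for a well-conditioned A.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations (Cholesky-style)

Q, R = np.linalg.qr(A)                            # A = QR
x_qr = np.linalg.solve(R, Q.T @ b)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Sigma V^T
x_svd = Vt.T @ ((U.T @ b) / s)

print(np.allclose(x_normal, x_qr), np.allclose(x_qr, x_svd))
```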
128
Questions
Approximation algorithms: Can we approximately solve Lp regression faster than exact methods?
Core-sets (or induced sub-problems): Can we find a small set of constraints such that solving the Lp regression on those constraints gives an approximation to the original problem?
129
Randomized algorithms for Lp regression
Alg. 1: p = 2. Sampling (core-set). (1+ε)-approx. O(nd^2). Drineas, Mahoney, and Muthu 06, 07
Alg. 2: p = 2. Projection (no core-set). (1+ε)-approx. O(nd log d). Sarlos 06; Drineas, Mahoney, Muthu, and Sarlos 07
Alg. 3: p ∈ [1, ∞). Sampling (core-set). (1+ε)-approx. O(nd^5), o(exact). DasGupta, Drineas, Harb, Kumar, Mahoney 07
Note: Clarkson 05 gets a (1+ε)-approximation for L1 regression in O(d^3.5/ε^4) time. He preprocesses A, b to make it "well-rounded" or well-conditioned and then samples.
130
Algorithm 1: Sampling for L2 regression
  • Algorithm
  • Fix a set of probabilities pi, i = 1, ..., n, summing up
    to 1.
  • Pick the i-th row of A and the i-th element of b
    with probability
  • min{1, r·pi},
  • and rescale both by (1/min{1, r·pi})^1/2.
  • Solve the induced problem.

Note: in expectation, at most r rows of A and r
elements of b are kept.
131
Sampling algorithm for L2 regression
(Diagram: the induced problem is formed by the sampled rows of A and the corresponding sampled elements of b.)
132
Our results for p = 2
If the pi satisfy a condition, then with probability at least 1 - δ,
κ(A): condition number of A
The sampling complexity is
133
Notation
U(i): the i-th row of U
ρ: rank of A. U: orthogonal matrix containing the left singular vectors of A.
134
Condition on the probabilities
The condition that the pi must satisfy is, for some β ∈ (0, 1]: pi ≥ β ||U(i)||2^2 / ρ.
  • Notes
  • O(nd^2) time suffices (to compute the probabilities
    and to construct a core-set).
  • Important question
  • Is O(nd^2) necessary? Can we compute the pi's, or
    construct a core-set, faster?
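A sketch of the sampling algorithm above with probabilities built from the rows of U (taking β = 1); the problem sizes and the value of r below are made up.

```python
import numpy as np

def sampled_least_squares(A, b, r, seed=0):
    """Keep row i with probability min{1, r*p_i}, p_i = ||U_(i)||^2 / rank(A),
    rescale by 1/sqrt(min{1, r*p_i}), and solve the induced problem."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(U ** 2, axis=1)                 # row norms of U; they sum to rank(A)
    q = np.minimum(1.0, r * lev / lev.sum())
    keep = rng.random(A.shape[0]) < q
    scale = 1.0 / np.sqrt(q[keep])
    x, *_ = np.linalg.lstsq(A[keep] * scale[:, None], b[keep] * scale, rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 10))
b = rng.standard_normal(5000)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_apx = sampled_least_squares(A, b, r=400)
print(np.linalg.norm(A @ x_apx - b) / np.linalg.norm(A @ x_opt - b))   # close to 1
```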

135
The Johnson-Lindenstrauss lemma
  • Results for J-L
  • Johnson & Lindenstrauss 84: project to a random
    subspace
  • Frankl & Maehara 88: random orthogonal matrix
  • Dasgupta & Gupta 99: matrix with entries from
    N(0,1), normalized
  • Indyk & Motwani 98: matrix with entries from
    N(0,1)
  • Achlioptas 03: matrix with entries in {-1, 0, 1}
  • Alon 03: optimal dependency on n, and almost
    optimal dependency on ε
136
Fast J-L transform (1 of 2) (Ailon & Chazelle 06)
137
Fast J-L transform (2 of 2) (Ailon & Chazelle 06)
  • Multiplication of the vectors by PHD is fast,
    since
  • (Du) is O(d) - since D is diagonal
  • (HDu) is O(d log d) - use Fast Fourier Transform
    algorithms
  • (PHDu) is O(poly(log n)) - P has on average
    O(poly(log n)) non-zeros per row.

138
O(nd log d) L2 regression
Fact 1: since Hn (the n-by-n Hadamard matrix) and Dn (an n-by-n diagonal matrix with ±1 entries on the diagonal, chosen uniformly at random) are orthogonal matrices, ||HnDn(Ax - b)||2 = ||Ax - b||2.
Thus, we can work with HnDnAx ≈ HnDnb. Let's use our sampling approach.
139
O(nd log d) L2 regression
Fact 1: since Hn (the n-by-n Hadamard matrix) and Dn (an n-by-n diagonal matrix with ±1 entries on the diagonal, chosen uniformly at random) are orthogonal matrices, ||HnDn(Ax - b)||2 = ||Ax - b||2.
Thus, we can work with HnDnAx ≈ HnDnb. Let's use our sampling approach.
Fact 2: Using a Chernoff-type argument, we can prove that the lengths of all the rows of the left singular vectors of HnDnA are, with probability at least .9, roughly uniform.
140
O(nd log d) L2 regression
DONE! We can perform uniform sampling in order to keep r = O(d log d/ε^2) rows of HnDnA; our L2 regression theorem guarantees the accuracy of the approximation. The running time is O(nd log d), since we can use the fast Hadamard-Walsh transform to multiply Hn and DnA.
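A small sketch of the approach above; for illustration it applies Hn as a dense matrix via scipy (so n must be a power of 2 here) rather than a fast transform, and the sizes and r are made up.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sampled_lstsq(A, b, r, seed=0):
    """Rotate with H_n D_n (D_n: random +/-1 diagonal), then sample r rows uniformly."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)                 # orthogonal n-by-n Hadamard matrix
    HDA, HDb = H @ (D[:, None] * A), H @ (D * b)
    rows = rng.choice(n, size=r, replace=False)  # uniform sampling now suffices
    x, *_ = np.linalg.lstsq(HDA[rows], HDb[rows], rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((1024, 8))               # n = 1024 is a power of 2
b = rng.standard_normal(1024)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_apx = hadamard_sampled_lstsq(A, b, r=200)
print(np.linalg.norm(A @ x_apx - b) / np.linalg.norm(A @ x_opt - b))   # close to 1
```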
141
Open problem: sparse approximations
Sparse approximations and l2 regression (Natarajan 95, Tropp 04, 06): In the sparse approximation problem, we are given a d-by-n matrix A forming a redundant dictionary for Rd and a target vector b ∈ Rd, and we seek to minimize the number of non-zero entries of x subject to ||Ax - b||2 ≤ ε.
In words, we seek a sparse, bounded-error representation of b in terms of the vectors in the dictionary.
142
Open problem: sparse approximations
Sparse approximations and l2 regression (Natarajan 95, Tropp 04, 06): In the sparse approximation problem, we are given a d-by-n matrix A forming a redundant dictionary for Rd and a target vector b ∈ Rd, and we seek to minimize the number of non-zero entries of x subject to ||Ax - b||2 ≤ ε.
In words, we seek a sparse, bounded-error representation of b in terms of the vectors in the dictionary. This is (sort of) under-constrained least squares regression. Can we use the aforementioned ideas to get better and/or faster approximation algorithms for the sparse approximation problem?
143
Application: feature selection for RLSC
Regularized Least Squares Classification (RLSC): Given a term-document matrix A and a class label for each document, find xopt to minimize ||Ax - c||2^2 + λ||x||2^2.
Here c is the vector of labels. For simplicity assume two classes, thus ci = ±1.
144
Application: feature selection for RLSC
Regularized Least Squares Classification (RLSC): Given a term-document matrix A and a class label for each document, find xopt to minimize ||Ax - c||2^2 + λ||x||2^2.
Here c is the vector of labels. For simplicity assume two classes, thus ci = ±1. Given a new document-vector q, its classification is determined by the sign of qT xopt.
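A minimal sketch assuming the ridge-style RLSC objective written above; the tiny term-document matrix, the labels, λ, and the query vector are all made up.

```python
import numpy as np

def rlsc_train(A, c, lam):
    """Solve min_x ||A x - c||_2^2 + lam ||x||_2^2 via the normal equations."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ c)

def rlsc_classify(q, x_opt):
    return np.sign(q @ x_opt)               # sign of q^T x_opt

A = np.array([[2., 1., 0., 0.],             # rows: documents, columns: terms
              [1., 2., 0., 1.],
              [0., 0., 3., 1.],
              [0., 1., 2., 2.]])
c = np.array([1., 1., -1., -1.])            # +/-1 class labels
x = rlsc_train(A, c, lam=0.1)
print(rlsc_classify(np.array([1., 1., 0., 0.]), x))   # expected: 1.0
```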
145
Feature selection for RLSC
Feature selection for RLSC: Is it possible to select a small number of actual features (terms) and apply RLSC only to the selected terms without a huge loss in accuracy?
Well-studied problem: supervised algorithms (they employ the class label vector c) exist. We applied our L2 regression sampling scheme to select terms, unsupervised!
146
A smaller RLSC problem
147
A smaller RLSC problem
148
TechTC data from the ODP (Gabrilovich and Markovitch 04)
TechTC data: 100 term-document matrices; average size ≈ 20,000 terms and ≈ 150 documents.
In prior work, feature selection was performed
using a supervised metric called information gain
(IG), an entropic measure of correlation with
class labels.
Conclusion of the experiments: Our unsupervised technique had (on average) comparable performance to IG.
149
(No Transcript)
150
Conclusions
Linear Algebraic techniques (e.g., matrix
decompositions and regression) are fundamental in
data mining and information retrieval. Randomized
algorithms for linear algebra computations
contribute novel results and ideas, both from a
theoretical as well as an applied perspective.
151
Conclusions and future directions
  • Linear Algebraic techniques (e.g., matrix
    decompositions and regression) are fundamental in
    data mining and information retrieval.
  • Randomized algorithms for linear algebra
    computations contribute novel results and ideas,
    both from a theoretical as well as an applied
    perspective.
  • Important direction