Title: Dennis DeCoste
1. Optimizing Nearest-Neighbor Methods
- Dennis DeCoste
- Yahoo! Research Labs
- http://research.yahoo.com/staff/algorithms/decosted.xml
Yahoo! Research Labs Spot Workshop on Recommender Systems, August 24, 2004
2. Recommendation via Nearest Neighbors
- given
  - database X of n (d-dimensional) (sparse) example vectors
    - e.g. n people, d movies, X(i,j) = person i's rating of movie j
  - (d-dimensional) (sparse) query vector Q
  - suitable similarity/distance metric dist(x,y)
- (one) approach based on k-NN (user-user CF), sketched in code below
  - find Q's k most-similar neighbors (X1, ..., Xi, ..., Xk)
    - e.g. the k people who rated movies most similarly to person Q
  - report a recommendation over the neighbors
    - e.g. movie j with Q(j) unknown and highest mean(X(1,j), ..., X(k,j))
  - common dist(x,y): Euclidean, with missing vals imputed by a mean
    - e.g. mean rating per person, per movie, or a combination
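A minimal sketch of the user-user CF recipe above, assuming a dense NumPy ratings matrix with NaN for missing ratings; the per-movie mean imputation, k, and the toy data are my own illustrative choices, not from the slides.

```python
# User-user CF via k-NN (illustrative sketch; ratings matrix and k are hypothetical).
import numpy as np

def knn_recommend(X, q, k=3):
    """X: n-by-d ratings matrix with np.nan for missing; q: length-d query person."""
    col_mean = np.nanmean(X, axis=0)              # impute missing vals by per-movie mean
    Xf = np.where(np.isnan(X), col_mean, X)
    qf = np.where(np.isnan(q), col_mean, q)
    dists = np.linalg.norm(Xf - qf, axis=1)       # Euclidean distance to every row (linear scan)
    nbrs = np.argsort(dists)[:k]                  # indices of Q's k most-similar people
    scores = np.nanmean(X[nbrs], axis=0)          # mean neighbor rating per movie
    scores[~np.isnan(q)] = -np.inf                # only recommend movies Q has not rated
    return int(np.nanargmax(scores))              # movie j with highest neighbor mean

# toy usage: 4 people x 3 movies, recommend for a query who rated only movie 0
X = np.array([[5, 4, 1], [4, 5, 2], [1, 2, 5], [2, 1, 4]], dtype=float)
q = np.array([5, np.nan, np.nan])
print(knn_recommend(X, q, k=2))   # -> 1 (neighbors who liked movie 0 also liked movie 1)
```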
3. Fast Nearest Neighbors: The Landscape
- linear scan: compute query Q's similarity to all n examples Xi
- speed-up via faster and/or fewer distance computations
  - faster (query-time cost still O(n), but less than O(d n))
    - low-dimensional embeddings to quickly prune clearly-far examples
    - PCA, FastMap
  - fewer (cost potentially < O(n))
    - spatial indexing
      - kd-trees (generalize binary search to d dims; ideal case gets cost O(log(n)))
    - metric indexing
      - vp-trees (for Euclidean dists, roughly combines PCA with kd-trees)
- many other open/frontier issues
  - trading upfront indexing work for faster search
  - two biggies: 1) disk vs RAM, and 2) what is the right distance metric?
  - approximate vs exact NN (proving we found the NN dominates query time)
  - exploiting labels (nearest class vs nearest neighbor per se)
  - many alternatives (locality-preserving hashing, IGrid, X-tree, TV-tree, ...)
4. When Complexity Linear in n Isn't Good Enough
- in DM, n may be so large that linear in n is too slow
  - especially at query time
  - common goal: O(dn) train, O(d) or O(d log n) query time
- for many tasks, subsampling n may suffice
  - e.g. linear regression, for which cost is O(nd^2 + d^3)
    - a subsample of huge n often leads to a very similar covariance matrix
- but, for k-nearest neighbor, subsampling can be bad
  - k-NN error only converges to within 2x Bayes error as n → ∞
  - especially hard to pick good subsamples when d is large
- linear scan's O(nd) time cost can be too high
  - especially for model (dist) selection search (multiple n queries)
  - often prefer some pre-query cost for faster queries ...
5. Linear Scan: The Thing to Beat
- many clever indexing methods try to do better than O(nd)
  - R-tree, kd-tree, vp-tree, ball-tree, X-tree, TV-tree, etc. ...
  - for high d, rarely much faster than a (good) linear scan
  - often useless if intrinsic d is even moderate (e.g. d > 20)
- linear scan's simplicity makes it highly optimizable
  - modern CPU prefetch ops; no pipeline stalls due to branches
  - branch and bound is relatively easy
    - order the d coordinates via PCA, stop when partial dist > best-so-far (see the sketch after this list)
    - use a priority queue for best-first search (e.g. exact match in O(d))
- also, O(n) is not so bad -- O(n d) is often the problem
  - dim reduction (PCA, FastMap, random projections) often suffices ...
  - at the very least: linear scan within buckets (tree leafs)
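A minimal sketch of the branch-and-bound linear scan described above; it assumes the coordinates of X have already been rotated/ordered by PCA so the early dimensions carry the most variance (that preprocessing is not shown).

```python
# Branch-and-bound linear scan with partial-distance early termination (illustrative).
import numpy as np

def linear_scan_nn(X, q):
    """Exact 1-NN by linear scan; prunes as soon as the partial dist exceeds best-so-far."""
    best_i, best_d2 = -1, np.inf
    for i, x in enumerate(X):
        d2 = 0.0
        for xj, qj in zip(x, q):          # accumulate squared distance one dim at a time
            d2 += (xj - qj) ** 2
            if d2 > best_d2:              # partial distance already exceeds best-so-far: prune
                break
        else:                             # loop ran to completion: full distance computed
            best_i, best_d2 = i, d2
    return best_i, np.sqrt(best_d2)

# usage: rows of X are examples; early dims should be the high-variance (PCA) ones
X = np.random.randn(10000, 64)
q = np.random.randn(64)
print(linear_scan_nn(X, q))
```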
6. GEMINI Framework
- Q's length (d) often large (100-1000)
  - even more so for multi-variate data ...
- high d performs poorly for similarity search
  - linear scan slow: O(nd)
  - indexing methods (e.g. kd-trees) degrade for high d
- solution: dimensionality reduction to d' << d
  - linear scan becomes d/d' faster; indexing works well for d' < 20
  - using PCA, FastMap, ...
- GEMINI framework [Faloutsos et al.]
  - lower bounding lemma: dist_{d'}(A', B') ≤ dist_d(A, B) (see the filter-and-refine sketch below)
  - thus, no false dismissals of top matches (100% recall)
    - since we only stop the search when all (partial) dists > best so far
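A minimal filter-and-refine sketch in the GEMINI spirit, assuming `project` is any lower-bounding map to d' dimensions (here a top-d' PCA projection, which contracts Euclidean distances and so never overestimates them); the function names and parameters are my own.

```python
# GEMINI-style filter (cheap d'-dim lower bounds) and refine (full distances) -- illustrative.
import numpy as np

def gemini_nn(X, q, project):
    """Exact 1-NN: no false dismissals because the projected distance lower-bounds the true one."""
    Xp, qp = project(X), project(q.reshape(1, -1))[0]
    lb = np.linalg.norm(Xp - qp, axis=1)          # lower bounds on the true distances
    order = np.argsort(lb)                        # visit candidates by increasing lower bound
    best_i, best_d = -1, np.inf
    for i in order:
        if lb[i] >= best_d:                       # all remaining lower bounds are at least as large: stop
            break
        d = np.linalg.norm(X[i] - q)              # full-dimensional (expensive) distance
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# usage with a PCA-style projection onto the top-2 principal directions
X = np.random.randn(5000, 100)
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
project = lambda A: (A - X.mean(0)) @ Vt[:2].T
print(gemini_nn(X, np.random.randn(100), project))
```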
7. FastMap: MDS Suitable for DM
[Faloutsos & Lin, SIGMOD 1995]
- FastMap finds a k-dimensional embedding of the n data points
  - like PCA, but using distances (does not require vector data)
  - MDS computes n^2 dists; FastMap approximates using O(k n)
- achieved by approximately finding pivot pairs (a,b) that are far apart, with
  successive pivot directions (roughly) orthogonal
  - choose-distant heuristic: O(n) distance computations per pass, with a
    limited number (e.g. 5) of repeats, until a,b are stable
[figure: choose-distant pivot selection, alternating updates of a and b]
8. FastMap: Compute Embedding Coord x_i from Pivots a,b
- Pythagorean theorem directly gives
  - d_ai^2 = x_i^2 + z^2
  - d_bi^2 = (d_ab - x_i)^2 + z^2
- z^2 equivalences
  - d_ai^2 - x_i^2
  - = d_bi^2 - (d_ab - x_i)^2
  - = d_bi^2 - d_ab^2 + 2 d_ab x_i - x_i^2
- cancelling the x_i^2 terms
  - d_ai^2 = d_bi^2 - d_ab^2 + 2 d_ab x_i
  - x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)   (a numeric sanity check follows below)
[figure: pivot points a,b (selected by choose-distant, so d_ab is maximized);
 object i at height z above line ab, at distance x_i from a and d_ab - x_i from b]
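As a quick sanity check on the x_i formula above, this tiny snippet (my own toy configuration, not from the slides) places three points in the plane and confirms that the formula recovers the projection of object i onto the line ab.

```python
# Sanity check of the FastMap coordinate formula on a toy 2-d configuration.
import numpy as np

a, b, o = np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([1.0, 2.0])
d_ab = np.linalg.norm(a - b)
d_ai = np.linalg.norm(a - o)
d_bi = np.linalg.norm(b - o)

x_i = (d_ai**2 + d_ab**2 - d_bi**2) / (2 * d_ab)
print(x_i)            # 1.0 -- the projection of o onto line ab, as expected
```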
9. FastMap: Computing Embedded Distances
- project the data onto hyperplane H orthogonal to line ab
  - D'(O_I,O_J)^2 = D(O_I,O_J)^2 - (x_I - x_J)^2   for I,J = 1,...,n
- to get distances after k pivot pairs
  - recursively: D_k(O_I,O_J)^2 = D(O_I,O_J)^2 - Σ_{s=1..k} (x_I(s) - x_J(s))^2
- after k=n steps, all D'(O_I,O_J) = 0; some reach 0 for k < n
10. FastMap: k-Dimensional Embeddings
- after k pivot pairs: D_k(O_I,O_J)^2 = D(O_I,O_J)^2 - Σ_{s=1..k} (x_I(s) - x_J(s))^2
- Q_k = (D_{k,aq}^2 + D_{k,ab}^2 - D_{k,bq}^2) / (2 D_{k,ab})
- so, the first k embedding coords Q_1, Q_2, ..., Q_k require O(1+2+...+k) = O(k^2) work
- alternative: Cholesky, S = Z^T Z
- (note: q^T q - (Q_1^2 + ... + Q_k^2) represents the reconstruction error when
  using only k < d dims)
[figure: query q embedded against pivot-pair 1, pivot-pair 2, ..., giving coords Q_1 ... Q_k]
11. (FastMap = MDS Which Can Embed Queries)
- R = FastMap(X) returns a result R containing
  - k pivot id pairs (a_1,b_1) ... (a_k,b_k)
  - k distances between the pivot pairs (d_{a_i b_i})
  - k^2 embedding coords (k coords for each of the k pivots)
- Z = FastMap_embed(Y, R, X)   (a Python sketch follows)
  - embeds d-by-n Y into k-by-n Z
  - for p = 1 to k do
    - (a,b) = p-th pivot pair
    - use d_ai^2 = D_ai^2 - Σ_{s=1..p-1} (Z_sa - Z_si)^2 to compute all d_ai^2 (and d_bi^2)
    - Z_pi = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)   for i = 1,...,n
  - where D_IJ = dists (computed only as needed) in the original input space
  - O(2k) original-space dists (d_ai^2 and d_bi^2) to embed each query column of Y
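A compact FastMap sketch under stated assumptions: the distance oracle is Euclidean over the columns of X (FastMap itself only needs distances), and the pivot search is a simplified choose-distant alternation rather than the full heuristic; all names here are mine.

```python
# Minimal FastMap sketch (illustrative, Euclidean distance oracle).
import numpy as np

def fastmap(X, k, n_sweeps=5):
    """X: d-by-n data (columns are objects). Returns k-by-n embedding Z and pivot info."""
    n = X.shape[1]
    Z = np.zeros((k, n))
    pivots = []

    def dist2(i, j, p):
        # squared distance in the residual space after the first p pivot pairs
        d2 = np.sum((X[:, i] - X[:, j]) ** 2) - np.sum((Z[:p, i] - Z[:p, j]) ** 2)
        return max(d2, 0.0)

    for p in range(k):
        a = 0                                          # choose-distant: alternate endpoints
        b = max(range(n), key=lambda j: dist2(a, j, p))
        for _ in range(n_sweeps):
            a = max(range(n), key=lambda j: dist2(b, j, p))
            b = max(range(n), key=lambda j: dist2(a, j, p))
        d_ab2 = dist2(a, b, p)
        if d_ab2 == 0.0:                               # all residual distances are zero: done early
            break
        pivots.append((a, b, d_ab2))
        d_ab = np.sqrt(d_ab2)
        for i in range(n):                             # law-of-cosines coordinate for every object
            Z[p, i] = (dist2(a, i, p) + d_ab2 - dist2(b, i, p)) / (2 * d_ab)
    return Z, pivots

# usage: embed 200 points from 50-d down to k=3 dims
X = np.random.randn(50, 200)
Z, pivots = fastmap(X, k=3)
print(Z.shape, len(pivots))
```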
12. Spatial Indexing: kd-Trees
- classic binary search (CS 101, kindergarten, ...)
  - given univariate (d=1) real-valued data x = (x_1, x_2, ..., x_n)
  - prepare by sorting x once -- O(n log(n)) time, O(n) space
  - fast queries via binary search -- O(log(n)) time
  - use a binary search tree of the data
    - balanced, so height = log2(n)
- kd-tree: generalizes binary search to data X with d > 1
  - goal: still get O(d log(n)) query times and O(dn) space
  - prepare (indexing) time: O(d n log(n)) to build the kd-tree
13. 2d Example: Constructing a kd-Tree
- each node partitions its subtree's data by one dimension
- issue 1: which dimension i = 1..d to use for a given node?
  - common: max spread of the subtree's values in dim i (so max pruning)
- issue 2: where to cut for dimension i?
  - common: at the median of the values (so that the tree is evenly balanced)
- (a build sketch follows)
[figure: 2-d data, first cut by dimension d2, then by d1]
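A minimal build sketch using the median-of-max-spread rule described above; the node layout, field names, and leaf size are my own illustrative choices.

```python
# kd-tree construction sketch (illustrative).
import numpy as np

class KDNode:
    def __init__(self, dim, cut, left, right, points):
        self.dim, self.cut = dim, cut        # splitting dimension and cut value
        self.left, self.right = left, right  # child subtrees (None at a leaf)
        self.points = points                 # leaf only: the bucket of points

def build_kdtree(X, leaf_size=16):
    """X: n-by-d array of points. Returns the root KDNode."""
    if len(X) <= leaf_size:
        return KDNode(None, None, None, None, X)
    spreads = X.max(axis=0) - X.min(axis=0)
    dim = int(np.argmax(spreads))            # issue 1: dimension of max spread
    cut = float(np.median(X[:, dim]))        # issue 2: cut at the median (balanced tree)
    left_mask = X[:, dim] <= cut
    if left_mask.all() or not left_mask.any():     # degenerate split (many ties): make a leaf
        return KDNode(None, None, None, None, X)
    return KDNode(dim, cut,
                  build_kdtree(X[left_mask], leaf_size),
                  build_kdtree(X[~left_mask], leaf_size),
                  None)

root = build_kdtree(np.random.randn(10000, 2))
```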
14. Constructing kd-Tree: Pivot Strategies
[figure: two 2-d trees, one built with median-of-max-spread, one with
 closest-to-center-of-widest-dim]
- the latter is preferred, except after some depth prefer the former (for balance)
15. 2d Example: Querying a kd-Tree
- first, DFS directly to the leaf containing Q
  - often immediately finds a decent neighbor (with best-so-far dist r)
- often, the kd-tree finds a decent neighbor quickly, and spends most of the
  search proving it's the best (motivates approx-NN)
[figure: query Q and best-so-far radius r in the partitioned 2-d space]
16. 2d Example: Querying a kd-Tree (continued)
- second, back up to the leaf's parent; check if the other child needs to be searched
- checking if hyperrectangle HR intersects the hypersphere (radius r, center Q):
  find the point p in HR closest to Q; they intersect iff dist(p,Q) ≤ r
  - p_i = hr_i^min if Q_i ≤ hr_i^min;  Q_i if hr_i^min < Q_i < hr_i^max;  hr_i^max if Q_i ≥ hr_i^max
  - (a query sketch using this test follows)
[figure: hyperrectangle HR whose closest point p to Q lies farther than r, so Q need not search this HR]
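A minimal query sketch using the HR/hypersphere intersection test above; it builds on the hypothetical KDNode structure from the earlier build sketch, and the traversal order (near child first) is the usual heuristic.

```python
# kd-tree 1-NN query with hyperrectangle pruning (illustrative).
import numpy as np

def closest_point_in_hr(hr_min, hr_max, Q):
    """Clamp Q into the hyperrectangle [hr_min, hr_max]: the closest point p."""
    return np.minimum(np.maximum(Q, hr_min), hr_max)

def kdtree_nn(node, Q, hr_min, hr_max, best=(None, np.inf)):
    """Exact 1-NN: descend toward Q first, then prune siblings whose HR is too far."""
    if node is None:
        return best
    p = closest_point_in_hr(hr_min, hr_max, Q)
    if np.linalg.norm(p - Q) > best[1]:              # HR cannot contain anything within r: skip it
        return best
    if node.points is not None:                      # leaf: linear scan within the bucket
        for x in node.points:
            d = np.linalg.norm(x - Q)
            if d < best[1]:
                best = (x, d)
        return best
    near_first = Q[node.dim] <= node.cut             # visit the child containing Q first
    for go_left in ([True, False] if near_first else [False, True]):
        lo, hi = hr_min.copy(), hr_max.copy()
        if go_left:
            hi[node.dim] = node.cut
            best = kdtree_nn(node.left, Q, lo, hi, best)
        else:
            lo[node.dim] = node.cut
            best = kdtree_nn(node.right, Q, lo, hi, best)
    return best

X = np.random.randn(10000, 2)
root = build_kdtree(X)                               # from the earlier sketch
Q = np.random.randn(2)
x_best, r = kdtree_nn(root, Q, np.full(2, -np.inf), np.full(2, np.inf))
```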
17. Querying kd-Tree: Best and Worst Cases
[figure: good-case vs bad-case partitionings]
- expected search cost depends on the distribution of X and Q
- worst-case query time is more like O(2^d log(n)) than O(d log(n))
18. Example kd-Tree Performance
[figure: number of DB nodes examined during query search (n=10,000) vs
 dimensionality of data (d)]
19. Training Time of k-NN Revisited
- so, in practice, training a k-NN is usually not O(1)
  - e.g. O(2dn log(n)) to build a kd-tree
    - for faster query times: O(d log(n)) vs O(d n) for a linear scan, for k=1
  - or O(d n^2) time to LOO cross-validate to find the best k
  - or other pre-processing
    - i.e. condensation (prune the train set to reduce the effective n)
    - feature selection/weighting
    - finding the best distance metric to use (model selection)
    - ...
20. Metric Indexing: Vantage Point Trees
["Data Structures and Algorithms for Nearest Neighbor Search in General Metric
 Spaces" (vp-trees), by Peter Yianilos. Also: ball trees, gh-trees, ...]
- does not require data in vector form (e.g. SVM kernel dists)
- exploits any structure among the n data points
  - nothing helps for pathological cases
    - e.g. all data at unit hypercube corners <1,0,0>, <0,1,0>, <0,0,1>
- define the distance between objects (q,b,...) as seen through pivot p
  - d_p(q,b) ≡ |d(q,p) - d(b,p)|
  - d(q,b) ≥ |d(q,p) - d(b,p)| = d_p(q,b)
    - via the triangular inequality: d(q,p) ≤ d(q,b) + d(b,p)
- lower bounding, so
  - d_p(q,b) > r  ⇒  d(q,b) > r
  - can stop the search for b as q's neighbor early, once d_p(q,b) > best-so-far (r)
[figure: triangle over p, q, b with sides d(p,q), d(q,b), d(p,b)]
21. vp-Tree Pivot Choices Depend on the Distribution
[figure: for data uniform over a square, a corner pivot is optimal -- closest
 to a 50/50 split]
22. 2d Example: vp-Trees vs kd-Trees
[figure: the 2-d partitionings induced by a vp-tree vs a kd-tree]
23. Constructing a Simple vp-Tree
function node = build(S)
  if S = ∅ then return node = ∅
  p = select_vp(S);  node.p = p
  m = median_{s∈S} d(p,s);  node.m = m
  L = {s ∈ S - {p} | d(p,s) < m}
  R = {s ∈ S - {p} | d(p,s) ≥ m}
  node.left = build(L);  node.right = build(R)

function bestp = select_vp(S)
  P = random sample of S;  best = 0
  for p ∈ P
    Q = random sample of S
    s = variance of {d(p,q) : q ∈ Q} about its median
    if (s > best) then best = s;  bestp = p

- O(c n log(n)) build time (c = time to compute a distance; often c = d); O(n) space
- when can we skip a node's children? (a Python build sketch follows)
  - skip n.R if d(n.p,q) + r < n.m
  - skip n.L if d(n.p,q) - r > n.m
[figure: pivot p, median radius m splitting L (inside) from R (outside), query q with radius r]
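A minimal build sketch following the pseudocode above; dist() is assumed to be any metric (Euclidean here for concreteness), and the sample sizes and class layout are my own.

```python
# vp-tree construction sketch (illustrative).
import random
import numpy as np

def dist(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

class VPNode:
    def __init__(self, p, m, left, right):
        self.p, self.m = p, m                # vantage point and its median distance
        self.left, self.right = left, right

def select_vp(S, sample_size=10):
    """Pick the candidate whose distances have the largest spread about their median."""
    best_p, best_spread = S[0], -1.0
    for p in random.sample(S, min(sample_size, len(S))):
        Q = random.sample(S, min(sample_size, len(S)))
        ds = [dist(p, q) for q in Q]
        med = float(np.median(ds))
        spread = float(np.mean([(x - med) ** 2 for x in ds]))   # 2nd moment about the median
        if spread > best_spread:
            best_p, best_spread = p, spread
    return best_p

def build_vptree(S):
    if not S:
        return None
    p = select_vp(S)
    rest = [s for s in S if s is not p]
    if not rest:
        return VPNode(p, 0.0, None, None)
    ds = [dist(p, s) for s in rest]
    m = float(np.median(ds))
    L = [s for s, d in zip(rest, ds) if d < m]
    R = [s for s, d in zip(rest, ds) if d >= m]
    return VPNode(p, m, build_vptree(L), build_vptree(R))

data = [np.random.randn(8) for _ in range(1000)]
root = build_vptree(data)
```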
24. Searching a vp-Tree
simple VP search: visit R if x > m - r;  visit L if x < m + r

function search(n, q)
  if n = ∅ then return
  x = d(q, n.id)
  if (x < r) then r = x;  best = n.id
  middle = (n.leftBndHigh + n.rightBndLow) / 2
  if (x < middle) then
    if (x ∈ I_L) search(n.left, q)
    if (x ∈ I_R) search(n.right, q)
  else
    if (x ∈ I_R) search(n.right, q)
    if (x ∈ I_L) search(n.left, q)

- heuristic ordering of (L,R) or (R,L); often both are searched
- the intervals I_L, I_R can be narrowed by ε for approximate search
- (a Python search sketch follows)
[figure: per-node cached bounds leftBndHigh and rightBndLow around the median;
 the query radius r determines whether x falls in I_L, I_R, or both]
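A minimal NN-search sketch using the simple pruning rule above (search L only if x < m + r, search R only if x > m - r), operating on the hypothetical VPNode structure, dist(), and root from the previous sketch.

```python
# vp-tree 1-NN search sketch (illustrative).
import numpy as np

def vptree_nn(root, q):
    best = [None, np.inf]                    # [best point, best-so-far distance r]

    def search(node):
        if node is None:
            return
        x = dist(q, node.p)                  # distance from query to this vantage point
        if x < best[1]:
            best[0], best[1] = node.p, x
        order = ("L", "R") if x < node.m else ("R", "L")   # descend toward q's side first
        for side in order:
            if side == "L" and x < node.m + best[1]:       # L may still contain something within r
                search(node.left)
            elif side == "R" and x > node.m - best[1]:     # R may still contain something within r
                search(node.right)

    search(root)
    return best[0], best[1]

q = np.random.randn(8)
print(vptree_nn(root, q)[1])                 # distance to the nearest of the 1000 points
```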
25. Results for vp-Tree on 2d Data in a 10d Space
[figure: number of vectors examined (of 4000 total) for a 2d vs a 10d hypercube,
 with queries Q sampled from the 2d and from the 10d distributions]
- since it does not use axis-aligned partitioning like kd-trees do, the vp-tree's
  complexity stays close to its 2d performance and beats the 10d random
  hypercube case (i.e. it builds the same vp-tree regardless of how the data is
  rotated -- no need for PCA)
26. Same 2d Random Data in 3d to 50d Space + Noise
[figure: vp-tree search cost vs the amplitude of added noise]
27. Examples of vp-Trees on Real Image Data (32x32)
[figure: n = 1 million images; fraction of DB touched per query (average);
 kd-tree ≈ linear scan by d=256; curves for "Q & DB similar" and "Q & DB general"]
28. kd-Tree vs vp-Tree
- kd-tree
  - build time: O(d n log(n))
  - query time: O(min(dn, 2^(d-1) log(n))) worst-case
  - each query path (root to some leaf) takes O(log(n))
- vp-tree (where d' ≤ d is the effective dimensionality)
  - assuming each distance computation is O(d)
  - build time: O(d n log(n)) using random pivot samples
  - query time: O(min(dn, d 2^(d'-1) log(n))) worst-case
  - each query path (root to some leaf) takes O(d log(n))
- so, the vp-tree is not always better than the kd-tree: each path costs O(d) more
  - but it exploits local structure for each subtree (no need for PCA)
  - the approaches can also be mixed (e.g. SR-trees, etc.)
29. FastMap Revisited (wrt kd-tree and vp-tree)
- FastMap can make linear scan faster
  - FastMap(k) is like PCA(k), but only requires dists, not vectors
  - satisfies the lower bounding lemma (just as PCA does)
- however, FastMap's time complexity is O(dkn + k^2 n)
  - or O(dk + k^2) per query for a k-dim embedding, with O(d) per dist
  - pro: avoids the O(dn^2) time of MDS when k << n
  - con: the k^2 is due to enforcing orthogonality of the embedding axes
    - at the k-th pivot pair: D_k(O_I,O_J)^2 = D(O_I,O_J)^2 - Σ_{s=1..k} (x_I(s) - x_J(s))^2
  - negligible (vs linear scan's O(nd)) for the common case of k ≈ sqrt(d)
    - e.g. often just k=2 or k=3 for simple visualization applications
  - but k could be as large as n (e.g. for a poly-2 kernel distance) ...
- so, for large k a vp-tree can be faster than FastMap + kd-tree
  - e.g. O(d log(n)) < O(k^2 log(n)), where k ≈ d and log(n) < d
30. Triangular Inequality Insufficient for (Very) High d
- vp-trees (and other metric trees) are based on triangular inequalities (TIs)
  - for all p,q,b:  d(p,q) ≤ d(p,b) + d(b,q), etc.
- for query q, prune point b via pivot p whenever we see
  - |d(q,p) - d(p,b)| > r, since then d(q,b) > r must be true via the TI
  - e.g. the vp-tree caches bounds on d(p,b) (over p's subtree)
- unfortunately, in high d, the TI is often USELESS
  - let minD = smallest (non-self) d(a,b) distance in the database
  - let maxD = largest d(a,b) in the database
  - if 2 minD > maxD (i.e. maxD - minD < minD), no pruning is possible
    - since for most any query q:  |d(q,p) - d(p,b)| < minD but r > minD
  - (a numerical illustration follows)
[figure: triangle over p, q, b with sides d(p,q), d(q,b), d(p,b)]
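A small numerical illustration of this effect (my own, not from the slides): for i.i.d. uniform data, pairwise distances concentrate as d grows, so maxD - minD quickly falls below minD and the TI pruning condition can essentially never fire.

```python
# Distance concentration in high d: when 2*minD > maxD, TI-based pruning is useless.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))                     # 200 points uniform in the unit d-cube
    G = X @ X.T
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2.0 * G     # squared pairwise distances
    iu = np.triu_indices(len(X), k=1)            # non-self pairs only
    D = np.sqrt(np.maximum(D2[iu], 0.0))
    minD, maxD = D.min(), D.max()
    print(f"d={d:4d}  minD={minD:.3f}  maxD={maxD:.3f}  TI pruning possible: {2 * minD <= maxD}")
```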
31. Is High d Hopeless?
- only high intrinsic d (e.g. > 20) is a problem
  - else PCA, the triangular inequality, etc. will handle it fine
- also, kernel methods like SVMs handle high d fine
  - e.g. a poly-9 kernel uses a huge implicit dim, O((d+9)! / (d! 9!))
  - but the SVM weights that feature space -- only O(n) non-zero
  - the SVM doesn't employ the triangular inequality
    - training determines which examples (h SVs, h ≤ n) need to be searched
- the key is often to use an appropriate distance metric
  - Euclidean (L2) distance over the original d is usually not the best ...
32. STOP!
33. VA-File
- accept that a linear scan is inevitable for high d
- compress the data so the linear scan is faster (e.g. 10x)
  - Vector-Approximation file: quantize each dimension to a few bits
  - fit more of the data in RAM
  - fewer/faster disk scans
34. Locality-Preserving Hashing
- a hash table is very fast: O(1) time
  - but it's unlikely the query and its nearest neighbor would collide
- so, use multiple (L > 1) hash tables
  - each one hashes over a random k of the d dimensions
  - linear scan over the union of all L hashed buckets for the query
- gives fast approximate NN (within a 1+ε factor of the optimal dist)
  - experiments suggest L = 10-100 often gives reasonable error (≈ 2-20%)
  - with max bucket size B (e.g. B = 100)
  - optimize k: maximize the prob (p1) that near points hash to the same bucket,
    minimize the prob (p2) that far points do
  - k = log_{1/p2}(n/B),  L = (n/B)^ρ,  where ρ = ln(1/p1) / ln(1/p2)
  - (a toy sketch follows)
- e.g. "Similarity Search in High Dimensions via Hashing", by Gionis, Indyk,
  and Motwani, VLDB 99.
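A toy sketch of the "random k of d dimensions per table, union the L buckets" idea above; the coarse quantization width w, class name, and parameter values are my own illustrative assumptions, not the scheme from the paper.

```python
# Toy locality-preserving hashing (illustrative).
import numpy as np
from collections import defaultdict

class SimpleLSH:
    def __init__(self, X, k=8, L=20, w=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.X, self.w = X, w
        self.subsets = [rng.choice(X.shape[1], size=k, replace=False) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for i, x in enumerate(X):
            for t, dims in zip(self.tables, self.subsets):
                t[self._key(x, dims)].append(i)

    def _key(self, x, dims):
        return tuple(np.floor(x[dims] / self.w).astype(int))   # grid cell on the chosen dims

    def query(self, q):
        cand = set()
        for t, dims in zip(self.tables, self.subsets):          # union of the L matching buckets
            cand.update(t.get(self._key(q, dims), []))
        if not cand:
            return None, np.inf
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.X[cand] - q, axis=1)            # linear scan over candidates only
        j = int(np.argmin(d))
        return int(cand[j]), float(d[j])

X = np.random.rand(10000, 32)
lsh = SimpleLSH(X, k=8, L=20)
print(lsh.query(X[0] + 0.01 * np.random.randn(32)))             # likely returns index 0 (approx NN)
```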
35. IGrid (Inverted Grid Index): Motivations
(Aggarwal et al., KDD 2000)
- the problem with high d is that (Dmax - Dmin)/Dmin → 0
  - i.e. as d grows, var(dists) tends to zero (relative to the mean)
- for very high d
  - the nearest neighbor and the farthest neighbor could swap places with a
    relatively small perturbation in the vector representations of the data
  - even the most-similar records likely have some well-separated dims
    - due to noise effects (e.g. Brownian motion)
- these problems suggest
  - the need for a more meaningful distance measure
  - simple Euclidean (L2) is sensible for d=2,3, not d=100
  - if crafting a new distance score, it might as well be fast too ...
36. IGrid Similarity Functions
- for two vectors x = (x1, x2, ..., xd) and y = (y1, y2, ..., yd)
- basic scale-normalized similarity
  - IDist(x,y) = [ Σ_{i=1..d} ( 1 - |xi - yi| / (ui - li) )^p ]^(1/p)
    - where ui, li are the upper/lower range on dimension i's values
    - p = 2 is similar to Euclidean L2
- proximity-based similarity (a sketch follows)
  - bin each of the d dimensions into kd equi-depth ranges
  - PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
    - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
    - mi, ni are the upper/lower bounds of that (shared) bin for dimension i
  - the value of PIDist ranges from 0 to |S(x,y,kd)|
  - on average, the summation has d/kd bin matches (noise reduction)
  - equi-depth is similar to IDF-normalization in IR (relative to all data)
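A minimal PIDist sketch under stated assumptions: equi-depth bin edges per dimension are taken as empirical quantiles of the database, and the helper names are mine; variable names otherwise follow the slide.

```python
# PIDist over equi-depth bins (illustrative).
import numpy as np

def equi_depth_edges(X, kd):
    """Per-dimension bin edges so that each of the kd bins holds ~n/kd records."""
    qs = np.linspace(0.0, 1.0, kd + 1)
    return np.quantile(X, qs, axis=0).T          # shape: d x (kd+1)

def pidist(x, y, edges, p=2):
    sim = 0.0
    for i, e in enumerate(edges):                # loop over the d dimensions
        bx = np.searchsorted(e, x[i], side="right") - 1
        by = np.searchsorted(e, y[i], side="right") - 1
        bx, by = min(bx, len(e) - 2), min(by, len(e) - 2)   # clamp the top edge into the last bin
        if bx == by:                             # i ∈ S(x,y,kd): both values share a bin
            mi, ni = e[bx + 1], e[bx]            # upper/lower bounds of that bin
            if mi > ni:
                sim += (1.0 - abs(x[i] - y[i]) / (mi - ni)) ** p
    return sim ** (1.0 / p)

X = np.random.rand(1000, 40)
edges = equi_depth_edges(X, kd=20)               # kd = 0.5 * d, per the alpha in [0.5, 1] rule
print(pidist(X[0], X[1], edges), pidist(X[0], X[0], edges))
```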
37. IGrid Index Structure
- uses an inverted index on a grid representation of the data
  - e.g. much like Google, by discretizing the data into "words"
- the IGrid structure is simply a d-by-kd matrix R[i,j], with
  - R[i,j].hi and R[i,j].lo (two real values)
    - upper and lower range bounds for each bin j of each dimension i
  - R[i,j].ids[1..n/kd] (integers from the range 1..n)
    - lists the records whose value for dim i is within R[i,j]'s range
    - the length of each list will be n/kd, due to equi-depth binning
  - R[i,j].vals[1..n/kd] (real values)
    - each record in R[i,j]'s list also stores its actual value for dim i
- note: the inverted index's size ≈ 2x the size of X
  - so, IGrid only requires O(d n) space
38. Computing Proximity Similarity Given IGrid
- given the d-by-kd R[i,j].hi, R[i,j].lo, R[i,j].ids, R[i,j].vals
- query-time similarity computation is very simple and fast
  - Google would just intersect d id lists (each of size n/kd)
  - general way: hash table h[1..n] of similarities, all 0 initially
    - for each dim i and the corresponding bin j containing the query's value for dim i
      - update h[id] (accumulate partial PIDist) for each id in R[i,j].ids
  - avoiding hash-table overflow (when O(nd) space > RAM)
    - divide a large X into chunks (each of max_hash_size * kd records)
    - per chunk: build the inverted index and query it separately; report the best
  - at most 1/kd of the data is accessed at all (i.e. requires PIDist accumulations)
    - NOTE: thus, speed is O(n/kd), so it improves with increasing kd
- recall: PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
  - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
- (an accumulation sketch follows)
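A minimal sketch of the query-time accumulation described above; it builds on the hypothetical equi_depth_edges() helper from the previous sketch and accumulates the p-th powers of the per-dimension PIDist terms into a dict keyed by record id.

```python
# IGrid-style inverted index and query-time accumulation (illustrative).
import numpy as np
from collections import defaultdict

def build_igrid(X, edges):
    """R[i][j] = (lo, hi, ids, vals) for bin j of dimension i."""
    R = []
    for i, e in enumerate(edges):
        bins = np.clip(np.searchsorted(e, X[:, i], side="right") - 1, 0, len(e) - 2)
        R.append([(e[j], e[j + 1],
                   np.flatnonzero(bins == j),            # ids whose dim-i value falls in bin j
                   X[bins == j, i])                      # and their actual values for dim i
                  for j in range(len(e) - 1)])
    return R

def igrid_query(R, edges, q, p=2):
    h = defaultdict(float)                               # record id -> accumulated sum of terms
    for i, e in enumerate(edges):
        j = int(np.clip(np.searchsorted(e, q[i], side="right") - 1, 0, len(e) - 2))
        lo, hi, ids, vals = R[i][j]
        if hi > lo:
            for rid, v in zip(ids, vals):                # only records sharing the query's bin
                h[rid] += (1.0 - abs(q[i] - v) / (hi - lo)) ** p
    best = max(h, key=h.get)
    return int(best), h[best] ** (1.0 / p)               # most-similar record and its PIDist

X = np.random.rand(1000, 40)
edges = equi_depth_edges(X, kd=20)
R = build_igrid(X, edges)
print(igrid_query(R, edges, X[0]))                       # record 0 should match itself best
```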
39. Picking a Good kd for IGrid
- as we saw, higher kd leads to faster query time
- higher kd also improves meaningfulness
  - ignores the exact degree of dissimilarity on noisy dims
- too large a kd is bad: loss of information
- tradeoff resolution: pick the minimum kd that avoids (Dmax - Dmin)/Dmin → 0
  - assume a Bernoulli distribution (bin-match prob 1/kd)
  - the 0/1 match random variable Mi has μ = 1/kd, σ^2 = (1/kd)(1 - 1/kd)
  - L = Σ_{i=1..d} Mi has μ = d/kd, σ = sqrt((d/kd)(1 - 1/kd))
    - σ = sqrt((d/kd)(kd - 1)/kd) = sqrt(d(kd - 1)) / kd
  - if lim_{d→∞} σ/μ = 0 then (Dmax - Dmin)/Dmin → 0
  - so we want lim_{d→∞} sqrt((kd - 1)/d) > 0
  - this implies picking kd at least linear in d; they use kd ∝ d ...
40. Pick kd = α·d Using Some α ∈ [0.5, 1]
- for a linear dependence on d (kd = α·d)
  - α=1 exactly gives the Lp norm distance for d=1
    - but increasingly different from Lp as d grows
  - α=0.5 and p=1 is similar to the L1 norm distance for d=2
    - but increasingly different from L1 as d grows
- suggests α ∈ [0.5, 1] is a good range
  - also ensures that speed improves as d grows (since kd grows)
- recall: PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
  - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
41. Handling Edge Effects of Binning
- when we bin (discretize), we need to consider edge effects
- example
  - the value of xi could be in bin j, very close to that bin's high end
  - the value of yi could be in bin j+1, very close to that bin's low end
  - the PIDist contribution would then be 0
  - recall: PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
    - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
- eliminate the edge effect by
  - subdividing each of the kd bins into L equi-depth sub-ranges
    - each getting a list of n/(kd·L) record ids
  - at search time: find the query's sub-bin, then also use the (L-1)/2 sub-bins on each side
  - L=3 seems sufficient
42. Supporting Multiple α At Query Time
- building one index which can support multiple distance metrics at query time
  is often desirable
  - since the right one is often not clear pre-query
- IGrid supports multiple α < α0 easily
  - with ratio α/α0, use the correspondingly wider set of bins nearest the
    query's bin for each query dim
  - i.e. treat it as a hierarchy of binnings (most detailed for α0)
- suggests building for a large α0 (e.g. α0 = 1) first
43. IGrid Issues
- does this approach really make sense for huge d?
  - is it reasonable to discretize into 3000 bins for d=1000?
    - e.g. many real values come from 8- or 16-bit A/D sensors
- nevertheless, it may be useful for moderate d
  - still larger than the d ≈ 20 where kd-trees fail
- also, PCA on the data might help even for huge d
  - PCA would spread the variance per dim i
    - especially for the earlier dimensions (top principal components)
    - for the bottom dims (less variance), values only match if VERY close
  - for huge n, equi-depth with O(d) bins might make sense
- the KDD'00 paper does not discuss this issue; it seems critical
44. UCI Data Meaningfulness Checks
[table: k=5, α=1; numbers show agreement with class labels, using 5-NN with
 various distance measures]
- this table shows the PIDist metric may be useful, but it doesn't prove that
  IGrid's use of many bins (kd ∝ d) really makes sense when d is huge (the
  largest here is only d=160) ...
45. Meaningfulness With Increasing Dimension
[figure: relative meaningfulness (Dmax - Dmin)/Dmean vs dimension, for the L2
 norm and for PIDist, using random, uniformly distributed data (n=100);
 PIDist does not tend toward 0]
46. Query Speed with Increasing Dimensionality
[figure: fraction of DB accessed vs dimensionality, for α=1]
- the IGrid fraction of the DB accessed can be predicted analytically: 2/(α·d)