Title: Dennis DeCoste
1. Optimizing Nearest-Neighbor Methods
- Dennis DeCoste
- Yahoo! Research Labs
- http://research.yahoo.com/staff/algorithms/decosted.xml
Yahoo! Research Labs Spot Workshop on Recommender Systems, August 24, 2004
2. Recommendation via Nearest Neighbors
- given
  - database X of n (d-dimensional) (sparse) example vectors
    - e.g. n people, d movies, X(i,j) = person i's rating of movie j
  - (d-dimensional) (sparse) query vector Q
  - suitable similarity/distance metric dist(x,y)
- (one) approach based on k-NN (user-user CF), sketched in code below
  - find Q's k most-similar neighbors (X1, ..., Xi, ..., Xk)
    - e.g. the k people who rated movies most similarly to person Q
  - report a recommendation over the neighbors
    - e.g. movie j with Q(j) unknown and highest mean(X(1,j), ..., X(k,j))
  - common dist(x,y): Euclidean, with missing vals imputed by a mean
    - e.g. mean rating per person, per movie, or a combination
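A minimal sketch of the user-user CF recipe above, assuming a dense NumPy ratings matrix with NaN for missing ratings; the per-movie mean imputation, k, and the toy data are my own illustrative choices, not from the slides.

```python
# User-user CF via k-NN (illustrative sketch; ratings matrix and k are hypothetical).
import numpy as np

def knn_recommend(X, q, k=3):
    """X: n-by-d ratings matrix with np.nan for missing; q: length-d query person."""
    col_mean = np.nanmean(X, axis=0)              # impute missing vals by per-movie mean
    Xf = np.where(np.isnan(X), col_mean, X)
    qf = np.where(np.isnan(q), col_mean, q)
    dists = np.linalg.norm(Xf - qf, axis=1)       # Euclidean distance to every row (linear scan)
    nbrs = np.argsort(dists)[:k]                  # indices of Q's k most-similar people
    scores = np.nanmean(X[nbrs], axis=0)          # mean neighbor rating per movie
    scores[~np.isnan(q)] = -np.inf                # only recommend movies Q has not rated
    return int(np.nanargmax(scores))              # movie j with highest neighbor mean

# toy usage: 4 people x 3 movies, recommend for a query who rated only movie 0
X = np.array([[5, 4, 1], [4, 5, 2], [1, 2, 5], [2, 1, 4]], dtype=float)
q = np.array([5, np.nan, np.nan])
print(knn_recommend(X, q, k=2))   # -> 1 (neighbors who liked movie 0 also liked movie 1)
```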
3. Fast Nearest Neighbors: The Landscape
- linear scan: compute query Q's similarity to all n examples Xi
- speed-up via faster and/or fewer distance computations
  - faster (query-time cost still O(n), but less than O(d n))
    - low-dimensional embeddings to quickly prune clearly-far examples
    - PCA, FastMap
  - fewer (cost potentially < O(n))
    - spatial indexing
      - kd-trees (generalize binary search to d dims; ideal case gets cost O(log(n)))
    - metric indexing
      - vp-trees (for Euclidean dists, roughly combines PCA with kd-trees)
- many other open/frontier issues
  - trading upfront indexing work for faster search
  - two biggies: 1) disk vs RAM, and 2) what is the right distance metric?
  - approximate vs exact NN (proving we found the NN dominates query time)
  - exploiting labels (nearest class vs nearest neighbor per se)
  - many alternatives (locality-preserving hashing, IGrid, X-tree, TV-tree, ...)
4. When Complexity Linear in n Isn't Good Enough
- in DM, n may be so large that linear in n is too slow
  - especially at query time
  - common goal: O(dn) train, O(d) or O(d log n) query time
- for many tasks, subsampling n may suffice
  - e.g. linear regression, for which cost is O(nd^2 + d^3)
    - a subsample of huge n often leads to a very similar covariance matrix
- but, for k-nearest neighbor, subsampling can be bad
  - k-NN error only converges to within 2x Bayes error as n → ∞
  - especially hard to pick good subsamples when d is large
- linear scan's O(nd) time cost can be too high
  - especially for model (dist) selection search (multiple n queries)
  - often prefer some pre-query cost for faster queries ...
5. Linear Scan: The Thing to Beat
- many clever indexing methods try to do better than O(nd)
  - R-tree, kd-tree, vp-tree, ball-tree, X-tree, TV-tree, etc. ...
  - for high d, rarely much faster than a (good) linear scan
  - often useless if intrinsic d is even moderate (e.g. d > 20)
- linear scan's simplicity makes it highly optimizable
  - modern CPU prefetch ops; no pipeline stalls due to branches
  - branch and bound is relatively easy
    - order the d coordinates via PCA, stop when partial dist > best-so-far (see the sketch after this list)
    - use a priority queue for best-first search (e.g. exact match in O(d))
- also, O(n) is not so bad -- O(n d) is often the problem
  - dim reduction (PCA, FastMap, random projections) often suffices ...
  - at the very least: linear scan within buckets (tree leafs)
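A minimal sketch of the branch-and-bound linear scan described above; it assumes the coordinates of X have already been rotated/ordered by PCA so the early dimensions carry the most variance (that preprocessing is not shown).

```python
# Branch-and-bound linear scan with partial-distance early termination (illustrative).
import numpy as np

def linear_scan_nn(X, q):
    """Exact 1-NN by linear scan; prunes as soon as the partial dist exceeds best-so-far."""
    best_i, best_d2 = -1, np.inf
    for i, x in enumerate(X):
        d2 = 0.0
        for xj, qj in zip(x, q):          # accumulate squared distance one dim at a time
            d2 += (xj - qj) ** 2
            if d2 > best_d2:              # partial distance already exceeds best-so-far: prune
                break
        else:                             # loop ran to completion: full distance computed
            best_i, best_d2 = i, d2
    return best_i, np.sqrt(best_d2)

# usage: rows of X are examples; early dims should be the high-variance (PCA) ones
X = np.random.randn(10000, 64)
q = np.random.randn(64)
print(linear_scan_nn(X, q))
```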
6. GEMINI Framework
- Q's length (d) often large (100-1000)
  - even more so for multi-variate data ...
- high d performs poorly for similarity search
  - linear scan slow: O(nd)
  - indexing methods (e.g. kd-trees) degrade for high d
- solution: dimensionality reduction to d' << d
  - linear scan becomes d/d' faster; indexing works well for d' < 20
  - using PCA, FastMap, ...
- GEMINI framework [Faloutsos et al.]
  - lower bounding lemma: dist_{d'}(A', B') ≤ dist_d(A, B) (see the filter-and-refine sketch below)
  - thus, no false dismissals of top matches (100% recall)
    - since we only stop the search when all (partial) dists > best so far
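A minimal filter-and-refine sketch in the GEMINI spirit, assuming `project` is any lower-bounding map to d' dimensions (here a top-d' PCA projection, which contracts Euclidean distances and so never overestimates them); the function names and parameters are my own.

```python
# GEMINI-style filter (cheap d'-dim lower bounds) and refine (full distances) -- illustrative.
import numpy as np

def gemini_nn(X, q, project):
    """Exact 1-NN: no false dismissals because the projected distance lower-bounds the true one."""
    Xp, qp = project(X), project(q.reshape(1, -1))[0]
    lb = np.linalg.norm(Xp - qp, axis=1)          # lower bounds on the true distances
    order = np.argsort(lb)                        # visit candidates by increasing lower bound
    best_i, best_d = -1, np.inf
    for i in order:
        if lb[i] >= best_d:                       # all remaining lower bounds are at least as large: stop
            break
        d = np.linalg.norm(X[i] - q)              # full-dimensional (expensive) distance
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# usage with a PCA-style projection onto the top-2 principal directions
X = np.random.randn(5000, 100)
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
project = lambda A: (A - X.mean(0)) @ Vt[:2].T
print(gemini_nn(X, np.random.randn(100), project))
```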
7. FastMap: MDS Suitable for DM
[Faloutsos & Lin, SIGMOD 1995]
- FastMap finds a k-dimensional embedding of the n data points
  - like PCA, but using distances (does not require vector data)
  - MDS computes n^2 dists; FastMap approximates using O(k n)
- achieved by approximately finding pivot pairs (a,b) that are far apart, with
  successive pivot directions (roughly) orthogonal
  - choose-distant heuristic: O(n) distance computations per pass, with a
    limited number (e.g. 5) of repeats, until a,b are stable
[figure: choose-distant pivot selection, alternating updates of a and b]
8. FastMap: Compute Embedding Coord x_i from Pivots a,b
- Pythagorean theorem directly gives
  - d_ai^2 = x_i^2 + z^2
  - d_bi^2 = (d_ab - x_i)^2 + z^2
- z^2 equivalences
  - d_ai^2 - x_i^2
  - = d_bi^2 - (d_ab - x_i)^2
  - = d_bi^2 - d_ab^2 + 2 d_ab x_i - x_i^2
- cancelling the x_i^2 terms
  - d_ai^2 = d_bi^2 - d_ab^2 + 2 d_ab x_i
  - x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)   (a numeric sanity check follows below)
[figure: pivot points a,b (selected by choose-distant, so d_ab is maximized);
 object i at height z above line ab, at distance x_i from a and d_ab - x_i from b]
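As a quick sanity check on the x_i formula above, this tiny snippet (my own toy configuration, not from the slides) places three points in the plane and confirms that the formula recovers the projection of object i onto the line ab.

```python
# Sanity check of the FastMap coordinate formula on a toy 2-d configuration.
import numpy as np

a, b, o = np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([1.0, 2.0])
d_ab = np.linalg.norm(a - b)
d_ai = np.linalg.norm(a - o)
d_bi = np.linalg.norm(b - o)

x_i = (d_ai**2 + d_ab**2 - d_bi**2) / (2 * d_ab)
print(x_i)            # 1.0 -- the projection of o onto line ab, as expected
```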
9. FastMap: Computing Embedded Distances
- project the data onto hyperplane H orthogonal to line ab
  - D'(O_I,O_J)^2 = D(O_I,O_J)^2 - (x_I - x_J)^2   for I,J = 1,...,n
- to get distances after k pivot pairs
  - recursively: D_k(O_I,O_J)^2 = D(O_I,O_J)^2 - Σ_{s=1..k} (x_I(s) - x_J(s))^2
- after k=n steps, all D'(O_I,O_J) = 0; some reach 0 for k < n
10. FastMap: k-Dimensional Embeddings
- after k pivot pairs: D_k(O_I,O_J)^2 = D(O_I,O_J)^2 - Σ_{s=1..k} (x_I(s) - x_J(s))^2
- Q_k = (D_{k,aq}^2 + D_{k,ab}^2 - D_{k,bq}^2) / (2 D_{k,ab})
- so, the first k embedding coords Q_1, Q_2, ..., Q_k require O(1+2+...+k) = O(k^2) work
- alternative: Cholesky, S = Z^T Z
- (note: q^T q - (Q_1^2 + ... + Q_k^2) represents the reconstruction error when
  using only k < d dims)
[figure: query q embedded against pivot-pair 1, pivot-pair 2, ..., giving coords Q_1 ... Q_k]
11. (FastMap = MDS Which Can Embed Queries)
- R = FastMap(X) returns a result R containing
  - k pivot id pairs (a_1,b_1) ... (a_k,b_k)
  - k distances between the pivot pairs (d_{a_i b_i})
  - k^2 embedding coords (k coords for each of the k pivots)
- Z = FastMap_embed(Y, R, X)   (a Python sketch follows)
  - embeds d-by-n Y into k-by-n Z
  - for p = 1 to k do
    - (a,b) = p-th pivot pair
    - use d_ai^2 = D_ai^2 - Σ_{s=1..p-1} (Z_sa - Z_si)^2 to compute all d_ai^2 (and d_bi^2)
    - Z_pi = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)   for i = 1,...,n
  - where D_IJ = dists (computed only as needed) in the original input space
  - O(2k) original-space dists (d_ai^2 and d_bi^2) to embed each query column of Y
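A compact FastMap sketch under stated assumptions: the distance oracle is Euclidean over the columns of X (FastMap itself only needs distances), and the pivot search is a simplified choose-distant alternation rather than the full heuristic; all names here are mine.

```python
# Minimal FastMap sketch (illustrative, Euclidean distance oracle).
import numpy as np

def fastmap(X, k, n_sweeps=5):
    """X: d-by-n data (columns are objects). Returns k-by-n embedding Z and pivot info."""
    n = X.shape[1]
    Z = np.zeros((k, n))
    pivots = []

    def dist2(i, j, p):
        # squared distance in the residual space after the first p pivot pairs
        d2 = np.sum((X[:, i] - X[:, j]) ** 2) - np.sum((Z[:p, i] - Z[:p, j]) ** 2)
        return max(d2, 0.0)

    for p in range(k):
        a = 0                                          # choose-distant: alternate endpoints
        b = max(range(n), key=lambda j: dist2(a, j, p))
        for _ in range(n_sweeps):
            a = max(range(n), key=lambda j: dist2(b, j, p))
            b = max(range(n), key=lambda j: dist2(a, j, p))
        d_ab2 = dist2(a, b, p)
        if d_ab2 == 0.0:                               # all residual distances are zero: done early
            break
        pivots.append((a, b, d_ab2))
        d_ab = np.sqrt(d_ab2)
        for i in range(n):                             # law-of-cosines coordinate for every object
            Z[p, i] = (dist2(a, i, p) + d_ab2 - dist2(b, i, p)) / (2 * d_ab)
    return Z, pivots

# usage: embed 200 points from 50-d down to k=3 dims
X = np.random.randn(50, 200)
Z, pivots = fastmap(X, k=3)
print(Z.shape, len(pivots))
```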
12. Spatial Indexing: kd-Trees
- classic binary search (CS 101, kindergarten, ...)
  - given univariate (d=1) real-valued data x = (x_1, x_2, ..., x_n)
  - prepare by sorting x once -- O(n log(n)) time, O(n) space
  - fast queries via binary search -- O(log(n)) time
  - use a binary search tree of the data
    - balanced, so height = log2(n)
- kd-tree: generalizes binary search to data X with d > 1
  - goal: still get O(d log(n)) query times and O(dn) space
  - prepare (indexing) time: O(d n log(n)) to build the kd-tree
13. 2d Example: Constructing a kd-Tree
- each node partitions its subtree's data by one dimension
- issue 1: which dimension i = 1..d to use for a given node?
  - common: max spread of the subtree's values in dim i (so max pruning)
- issue 2: where to cut for dimension i?
  - common: at the median of the values (so that the tree is evenly balanced)
- (a build sketch follows)
[figure: 2-d data, first cut by dimension d2, then by d1]
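A minimal build sketch using the median-of-max-spread rule described above; the node layout, field names, and leaf size are my own illustrative choices.

```python
# kd-tree construction sketch (illustrative).
import numpy as np

class KDNode:
    def __init__(self, dim, cut, left, right, points):
        self.dim, self.cut = dim, cut        # splitting dimension and cut value
        self.left, self.right = left, right  # child subtrees (None at a leaf)
        self.points = points                 # leaf only: the bucket of points

def build_kdtree(X, leaf_size=16):
    """X: n-by-d array of points. Returns the root KDNode."""
    if len(X) <= leaf_size:
        return KDNode(None, None, None, None, X)
    spreads = X.max(axis=0) - X.min(axis=0)
    dim = int(np.argmax(spreads))            # issue 1: dimension of max spread
    cut = float(np.median(X[:, dim]))        # issue 2: cut at the median (balanced tree)
    left_mask = X[:, dim] <= cut
    if left_mask.all() or not left_mask.any():     # degenerate split (many ties): make a leaf
        return KDNode(None, None, None, None, X)
    return KDNode(dim, cut,
                  build_kdtree(X[left_mask], leaf_size),
                  build_kdtree(X[~left_mask], leaf_size),
                  None)

root = build_kdtree(np.random.randn(10000, 2))
```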
14. Constructing kd-Tree: Pivot Strategies
[figure: two 2-d trees, one built with median-of-max-spread, one with
 closest-to-center-of-widest-dim]
- the latter is preferred, except after some depth prefer the former (for balance)
15. 2d Example: Querying a kd-Tree
- first, DFS directly to the leaf containing Q
  - often immediately finds a decent neighbor (with best-so-far dist r)
- often, the kd-tree finds a decent neighbor quickly, and spends most of the
  search proving it's the best (motivates approx-NN)
[figure: query Q and best-so-far radius r in the partitioned 2-d space]
16. 2d Example: Querying a kd-Tree (continued)
- second, back up to the leaf's parent; check if the other child needs to be searched
- checking if hyperrectangle HR intersects the hypersphere (radius r, center Q):
  find the point p in HR closest to Q; they intersect iff dist(p,Q) ≤ r
  - p_i = hr_i^min if Q_i ≤ hr_i^min;  Q_i if hr_i^min < Q_i < hr_i^max;  hr_i^max if Q_i ≥ hr_i^max
  - (a query sketch using this test follows)
[figure: hyperrectangle HR whose closest point p to Q lies farther than r, so Q need not search this HR]
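A minimal query sketch using the HR/hypersphere intersection test above; it builds on the hypothetical KDNode structure from the earlier build sketch, and the traversal order (near child first) is the usual heuristic.

```python
# kd-tree 1-NN query with hyperrectangle pruning (illustrative).
import numpy as np

def closest_point_in_hr(hr_min, hr_max, Q):
    """Clamp Q into the hyperrectangle [hr_min, hr_max]: the closest point p."""
    return np.minimum(np.maximum(Q, hr_min), hr_max)

def kdtree_nn(node, Q, hr_min, hr_max, best=(None, np.inf)):
    """Exact 1-NN: descend toward Q first, then prune siblings whose HR is too far."""
    if node is None:
        return best
    p = closest_point_in_hr(hr_min, hr_max, Q)
    if np.linalg.norm(p - Q) > best[1]:              # HR cannot contain anything within r: skip it
        return best
    if node.points is not None:                      # leaf: linear scan within the bucket
        for x in node.points:
            d = np.linalg.norm(x - Q)
            if d < best[1]:
                best = (x, d)
        return best
    near_first = Q[node.dim] <= node.cut             # visit the child containing Q first
    for go_left in ([True, False] if near_first else [False, True]):
        lo, hi = hr_min.copy(), hr_max.copy()
        if go_left:
            hi[node.dim] = node.cut
            best = kdtree_nn(node.left, Q, lo, hi, best)
        else:
            lo[node.dim] = node.cut
            best = kdtree_nn(node.right, Q, lo, hi, best)
    return best

X = np.random.randn(10000, 2)
root = build_kdtree(X)                               # from the earlier sketch
Q = np.random.randn(2)
x_best, r = kdtree_nn(root, Q, np.full(2, -np.inf), np.full(2, np.inf))
```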
17. Querying kd-Tree: Best and Worst Cases
[figure: good-case vs bad-case partitionings]
- expected search cost depends on the distribution of X and Q
- worst-case query time is more like O(2^d log(n)) than O(d log(n))
18. Example kd-Tree Performance
[figure: number of DB nodes examined during query search (n=10,000) vs
 dimensionality of data (d)]
19. Training Time of k-NN Revisited
- so, in practice, training a k-NN is usually not O(1)
  - e.g. O(2dn log(n)) to build a kd-tree
    - for faster query times: O(d log(n)) vs O(d n) for a linear scan, for k=1
  - or O(d n^2) time to LOO cross-validate to find the best k
  - or other pre-processing
    - i.e. condensation (prune the train set to reduce the effective n)
    - feature selection/weighting
    - finding the best distance metric to use (model selection)
    - ...
20. Metric Indexing: Vantage Point Trees
["Data Structures and Algorithms for Nearest Neighbor Search in General Metric
 Spaces" (vp-trees), by Peter Yianilos. Also: ball trees, gh-trees, ...]
- does not require data in vector form (e.g. SVM kernel dists)
- exploits any structure among the n data points
  - nothing helps for pathological cases
    - e.g. all data at unit hypercube corners <1,0,0>, <0,1,0>, <0,0,1>
- define the distance between objects (q,b,...) as seen through pivot p
  - d_p(q,b) ≡ |d(q,p) - d(b,p)|
  - d(q,b) ≥ |d(q,p) - d(b,p)| = d_p(q,b)
    - via the triangular inequality: d(q,p) ≤ d(q,b) + d(b,p)
- lower bounding, so
  - d_p(q,b) > r  ⇒  d(q,b) > r
  - can stop the search for b as q's neighbor early, once d_p(q,b) > best-so-far (r)
[figure: triangle over p, q, b with sides d(p,q), d(q,b), d(p,b)]
21. vp-Tree Pivot Choices Depend on the Distribution
[figure: for data uniform over a square, a corner pivot is optimal -- closest
 to a 50/50 split]
22. 2d Example: vp-Trees vs kd-Trees
[figure: the 2-d partitionings induced by a vp-tree vs a kd-tree]
23. Constructing a Simple vp-Tree
function node = build(S)
  if S = ∅ then return node = ∅
  p = select_vp(S);  node.p = p
  m = median_{s∈S} d(p,s);  node.m = m
  L = {s ∈ S - {p} | d(p,s) < m}
  R = {s ∈ S - {p} | d(p,s) ≥ m}
  node.left = build(L);  node.right = build(R)

function bestp = select_vp(S)
  P = random sample of S;  best = 0
  for p ∈ P
    Q = random sample of S
    s = variance of {d(p,q) : q ∈ Q} about its median
    if (s > best) then best = s;  bestp = p

- O(c n log(n)) build time (c = time to compute a distance; often c = d); O(n) space
- when can we skip a node's children? (a Python build sketch follows)
  - skip n.R if d(n.p,q) + r < n.m
  - skip n.L if d(n.p,q) - r > n.m
[figure: pivot p, median radius m splitting L (inside) from R (outside), query q with radius r]
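A minimal build sketch following the pseudocode above; dist() is assumed to be any metric (Euclidean here for concreteness), and the sample sizes and class layout are my own.

```python
# vp-tree construction sketch (illustrative).
import random
import numpy as np

def dist(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

class VPNode:
    def __init__(self, p, m, left, right):
        self.p, self.m = p, m                # vantage point and its median distance
        self.left, self.right = left, right

def select_vp(S, sample_size=10):
    """Pick the candidate whose distances have the largest spread about their median."""
    best_p, best_spread = S[0], -1.0
    for p in random.sample(S, min(sample_size, len(S))):
        Q = random.sample(S, min(sample_size, len(S)))
        ds = [dist(p, q) for q in Q]
        med = float(np.median(ds))
        spread = float(np.mean([(x - med) ** 2 for x in ds]))   # 2nd moment about the median
        if spread > best_spread:
            best_p, best_spread = p, spread
    return best_p

def build_vptree(S):
    if not S:
        return None
    p = select_vp(S)
    rest = [s for s in S if s is not p]
    if not rest:
        return VPNode(p, 0.0, None, None)
    ds = [dist(p, s) for s in rest]
    m = float(np.median(ds))
    L = [s for s, d in zip(rest, ds) if d < m]
    R = [s for s, d in zip(rest, ds) if d >= m]
    return VPNode(p, m, build_vptree(L), build_vptree(R))

data = [np.random.randn(8) for _ in range(1000)]
root = build_vptree(data)
```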
24. Searching a vp-Tree
simple VP search: visit R if x > m - r;  visit L if x < m + r

function search(n, q)
  if n = ∅ then return
  x = d(q, n.id)
  if (x < r) then r = x;  best = n.id
  middle = (n.leftBndHigh + n.rightBndLow) / 2
  if (x < middle) then
    if (x ∈ I_L) search(n.left, q)
    if (x ∈ I_R) search(n.right, q)
  else
    if (x ∈ I_R) search(n.right, q)
    if (x ∈ I_L) search(n.left, q)

- heuristic ordering of (L,R) or (R,L); often both are searched
- the intervals I_L, I_R can be narrowed by ε for approximate search
- (a Python search sketch follows)
[figure: per-node cached bounds leftBndHigh and rightBndLow around the median;
 the query radius r determines whether x falls in I_L, I_R, or both]
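A minimal NN-search sketch using the simple pruning rule above (search L only if x < m + r, search R only if x > m - r), operating on the hypothetical VPNode structure, dist(), and root from the previous sketch.

```python
# vp-tree 1-NN search sketch (illustrative).
import numpy as np

def vptree_nn(root, q):
    best = [None, np.inf]                    # [best point, best-so-far distance r]

    def search(node):
        if node is None:
            return
        x = dist(q, node.p)                  # distance from query to this vantage point
        if x < best[1]:
            best[0], best[1] = node.p, x
        order = ("L", "R") if x < node.m else ("R", "L")   # descend toward q's side first
        for side in order:
            if side == "L" and x < node.m + best[1]:       # L may still contain something within r
                search(node.left)
            elif side == "R" and x > node.m - best[1]:     # R may still contain something within r
                search(node.right)

    search(root)
    return best[0], best[1]

q = np.random.randn(8)
print(vptree_nn(root, q)[1])                 # distance to the nearest of the 1000 points
```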
25. Results for vp-Tree on 2d Data in a 10d Space
[figure: number of vectors examined (of 4000 total) for a 2d vs a 10d hypercube,
 with queries Q sampled from the 2d and from the 10d distributions]
- since it does not use axis-aligned partitioning like kd-trees do, the vp-tree's
  complexity stays close to its 2d performance and beats the 10d random
  hypercube case (i.e. it builds the same vp-tree regardless of how the data is
  rotated -- no need for PCA)
26. Same 2d Random Data in 3d to 50d Space + Noise
[figure: vp-tree search cost vs the amplitude of added noise]
27. Examples of vp-Trees on Real Image Data (32x32)
[figure: n = 1 million images; fraction of DB touched per query (average);
 kd-tree ≈ linear scan by d=256; curves for "Q & DB similar" and "Q & DB general"]
28. kd-Tree vs vp-Tree
- kd-tree
  - build time: O(d n log(n))
  - query time: O(min(dn, 2^(d-1) log(n))) worst-case
  - each query path (root to some leaf) takes O(log(n))
- vp-tree (where d' ≤ d is the effective dimensionality)
  - assuming each distance computation is O(d)
  - build time: O(d n log(n)) using random pivot samples
  - query time: O(min(dn, d 2^(d'-1) log(n))) worst-case
  - each query path (root to some leaf) takes O(d log(n))
- so, the vp-tree is not always better than the kd-tree: each path costs O(d) more
  - but it exploits local structure for each subtree (no need for PCA)
  - the approaches can also be mixed (e.g. SR-trees, etc.)
29. FastMap Revisited (wrt kd-tree and vp-tree)
- FastMap can make linear scan faster
  - FastMap(k) is like PCA(k), but only requires dists, not vectors
  - satisfies the lower bounding lemma (just as PCA does)
- however, FastMap's time complexity is O(dkn + k^2 n)
  - or O(dk + k^2) per query for a k-dim embedding, with O(d) per dist
  - pro: avoids the O(dn^2) time of MDS when k << n
  - con: the k^2 is due to enforcing orthogonality of the embedding axes
    - at the k-th pivot pair: D_k(O_I,O_J)^2 = D(O_I,O_J)^2 - Σ_{s=1..k} (x_I(s) - x_J(s))^2
  - negligible (vs linear scan's O(nd)) for the common case of k ≈ sqrt(d)
    - e.g. often just k=2 or k=3 for simple visualization applications
  - but k could be as large as n (e.g. for a poly-2 kernel distance) ...
- so, for large k a vp-tree can be faster than FastMap + kd-tree
  - e.g. O(d log(n)) < O(k^2 log(n)), where k ≈ d and log(n) < d
30. Triangular Inequality Insufficient for (Very) High d
- vp-trees (and other metric trees) are based on triangular inequalities (TIs)
  - for all p,q,b:  d(p,q) ≤ d(p,b) + d(b,q), etc.
- for query q, prune point b via pivot p whenever we see
  - |d(q,p) - d(p,b)| > r, since then d(q,b) > r must be true via the TI
  - e.g. the vp-tree caches bounds on d(p,b) (over p's subtree)
- unfortunately, in high d, the TI is often USELESS
  - let minD = smallest (non-self) d(a,b) distance in the database
  - let maxD = largest d(a,b) in the database
  - if 2 minD > maxD (i.e. maxD - minD < minD), no pruning is possible
    - since for most any query q:  |d(q,p) - d(p,b)| < minD but r > minD
  - (a numerical illustration follows)
[figure: triangle over p, q, b with sides d(p,q), d(q,b), d(p,b)]
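A small numerical illustration of this effect (my own, not from the slides): for i.i.d. uniform data, pairwise distances concentrate as d grows, so maxD - minD quickly falls below minD and the TI pruning condition can essentially never fire.

```python
# Distance concentration in high d: when 2*minD > maxD, TI-based pruning is useless.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))                     # 200 points uniform in the unit d-cube
    G = X @ X.T
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2.0 * G     # squared pairwise distances
    iu = np.triu_indices(len(X), k=1)            # non-self pairs only
    D = np.sqrt(np.maximum(D2[iu], 0.0))
    minD, maxD = D.min(), D.max()
    print(f"d={d:4d}  minD={minD:.3f}  maxD={maxD:.3f}  TI pruning possible: {2 * minD <= maxD}")
```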
31. Is High d Hopeless?
- only high intrinsic d (e.g. > 20) is a problem
  - else PCA, the triangular inequality, etc. will handle it fine
- also, kernel methods like SVMs handle high d fine
  - e.g. a poly-9 kernel uses a huge implicit dim, O((d+9)! / (d! 9!))
  - but the SVM weights that feature space -- only O(n) non-zero
  - the SVM doesn't employ the triangular inequality
    - training determines which examples (h SVs, h ≤ n) need to be searched
- the key is often to use an appropriate distance metric
  - Euclidean (L2) distance over the original d is usually not the best ...
32. STOP!
33. VA-File
- accept that a linear scan is inevitable for high d
- compress the data so the linear scan is faster (e.g. 10x)
  - Vector-Approximation file: quantize each dimension to a few bits
  - fit more of the data in RAM
  - fewer/faster disk scans
34. Locality-Preserving Hashing
- a hash table is very fast: O(1) time
  - but it's unlikely the query and its nearest neighbor would collide
- so, use multiple (L > 1) hash tables
  - each one hashes over a random k of the d dimensions
  - linear scan over the union of all L hashed buckets for the query
- gives fast approximate NN (within a 1+ε factor of the optimal dist)
  - experiments suggest L = 10-100 often gives reasonable error (≈ 2-20%)
  - with max bucket size B (e.g. B = 100)
  - optimize k: maximize the prob (p1) that near points hash to the same bucket,
    minimize the prob (p2) that far points do
  - k = log_{1/p2}(n/B),  L = (n/B)^ρ,  where ρ = ln(1/p1) / ln(1/p2)
  - (a toy sketch follows)
- e.g. "Similarity Search in High Dimensions via Hashing", by Gionis, Indyk,
  and Motwani, VLDB 99.
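A toy sketch of the "random k of d dimensions per table, union the L buckets" idea above; the coarse quantization width w, class name, and parameter values are my own illustrative assumptions, not the scheme from the paper.

```python
# Toy locality-preserving hashing (illustrative).
import numpy as np
from collections import defaultdict

class SimpleLSH:
    def __init__(self, X, k=8, L=20, w=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.X, self.w = X, w
        self.subsets = [rng.choice(X.shape[1], size=k, replace=False) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for i, x in enumerate(X):
            for t, dims in zip(self.tables, self.subsets):
                t[self._key(x, dims)].append(i)

    def _key(self, x, dims):
        return tuple(np.floor(x[dims] / self.w).astype(int))   # grid cell on the chosen dims

    def query(self, q):
        cand = set()
        for t, dims in zip(self.tables, self.subsets):          # union of the L matching buckets
            cand.update(t.get(self._key(q, dims), []))
        if not cand:
            return None, np.inf
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.X[cand] - q, axis=1)            # linear scan over candidates only
        j = int(np.argmin(d))
        return int(cand[j]), float(d[j])

X = np.random.rand(10000, 32)
lsh = SimpleLSH(X, k=8, L=20)
print(lsh.query(X[0] + 0.01 * np.random.randn(32)))             # likely returns index 0 (approx NN)
```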
35. IGrid (Inverted Grid Index): Motivations
(Aggarwal et al., KDD 2000)
- the problem with high d is that (Dmax - Dmin)/Dmin → 0
  - i.e. as d grows, var(dists) tends to zero (relative to the mean)
- for very high d
  - the nearest neighbor and the farthest neighbor could swap places with a
    relatively small perturbation in the vector representations of the data
  - even the most-similar records likely have some well-separated dims
    - due to noise effects (e.g. Brownian motion)
- these problems suggest
  - the need for a more meaningful distance measure
  - simple Euclidean (L2) is sensible for d=2,3, not d=100
  - if crafting a new distance score, it might as well be fast too ...
36. IGrid Similarity Functions
- for two vectors x = (x1, x2, ..., xd) and y = (y1, y2, ..., yd)
- basic scale-normalized similarity
  - IDist(x,y) = [ Σ_{i=1..d} ( 1 - |xi - yi| / (ui - li) )^p ]^(1/p)
    - where ui, li are the upper/lower range on dimension i's values
    - p = 2 is similar to Euclidean L2
- proximity-based similarity (a sketch follows)
  - bin each of the d dimensions into kd equi-depth ranges
  - PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
    - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
    - mi, ni are the upper/lower bounds of that (shared) bin for dimension i
  - the value of PIDist ranges from 0 to |S(x,y,kd)|
  - on average, the summation has d/kd bin matches (noise reduction)
  - equi-depth is similar to IDF-normalization in IR (relative to all data)
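A minimal PIDist sketch under stated assumptions: equi-depth bin edges per dimension are taken as empirical quantiles of the database, and the helper names are mine; variable names otherwise follow the slide.

```python
# PIDist over equi-depth bins (illustrative).
import numpy as np

def equi_depth_edges(X, kd):
    """Per-dimension bin edges so that each of the kd bins holds ~n/kd records."""
    qs = np.linspace(0.0, 1.0, kd + 1)
    return np.quantile(X, qs, axis=0).T          # shape: d x (kd+1)

def pidist(x, y, edges, p=2):
    sim = 0.0
    for i, e in enumerate(edges):                # loop over the d dimensions
        bx = np.searchsorted(e, x[i], side="right") - 1
        by = np.searchsorted(e, y[i], side="right") - 1
        bx, by = min(bx, len(e) - 2), min(by, len(e) - 2)   # clamp the top edge into the last bin
        if bx == by:                             # i ∈ S(x,y,kd): both values share a bin
            mi, ni = e[bx + 1], e[bx]            # upper/lower bounds of that bin
            if mi > ni:
                sim += (1.0 - abs(x[i] - y[i]) / (mi - ni)) ** p
    return sim ** (1.0 / p)

X = np.random.rand(1000, 40)
edges = equi_depth_edges(X, kd=20)               # kd = 0.5 * d, per the alpha in [0.5, 1] rule
print(pidist(X[0], X[1], edges), pidist(X[0], X[0], edges))
```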
37. IGrid Index Structure
- uses an inverted index on a grid representation of the data
  - e.g. much like Google, by discretizing the data into "words"
- the IGrid structure is simply a d-by-kd matrix R[i,j], with
  - R[i,j].hi and R[i,j].lo (two real values)
    - upper and lower range bounds for each bin j of each dimension i
  - R[i,j].ids[1..n/kd] (integers from the range 1..n)
    - lists the records whose value for dim i is within R[i,j]'s range
    - the length of each list will be n/kd, due to equi-depth binning
  - R[i,j].vals[1..n/kd] (real values)
    - each record in R[i,j]'s list also stores its actual value for dim i
- note: the inverted index's size ≈ 2x the size of X
  - so, IGrid only requires O(d n) space
38. Computing Proximity Similarity Given IGrid
- given the d-by-kd R[i,j].hi, R[i,j].lo, R[i,j].ids, R[i,j].vals
- query-time similarity computation is very simple and fast
  - Google would just intersect d id lists (each of size n/kd)
  - general way: hash table h[1..n] of similarities, all 0 initially
    - for each dim i and the corresponding bin j containing the query's value for dim i
      - update h[id] (accumulate partial PIDist) for each id in R[i,j].ids
  - avoiding hash-table overflow (when O(nd) space > RAM)
    - divide a large X into chunks (each of max_hash_size * kd records)
    - per chunk: build the inverted index and query it separately; report the best
  - at most 1/kd of the data is accessed at all (i.e. requires PIDist accumulations)
    - NOTE: thus, speed is O(n/kd), so it improves with increasing kd
- recall: PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
  - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
- (an accumulation sketch follows)
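A minimal sketch of the query-time accumulation described above; it builds on the hypothetical equi_depth_edges() helper from the previous sketch and accumulates the p-th powers of the per-dimension PIDist terms into a dict keyed by record id.

```python
# IGrid-style inverted index and query-time accumulation (illustrative).
import numpy as np
from collections import defaultdict

def build_igrid(X, edges):
    """R[i][j] = (lo, hi, ids, vals) for bin j of dimension i."""
    R = []
    for i, e in enumerate(edges):
        bins = np.clip(np.searchsorted(e, X[:, i], side="right") - 1, 0, len(e) - 2)
        R.append([(e[j], e[j + 1],
                   np.flatnonzero(bins == j),            # ids whose dim-i value falls in bin j
                   X[bins == j, i])                      # and their actual values for dim i
                  for j in range(len(e) - 1)])
    return R

def igrid_query(R, edges, q, p=2):
    h = defaultdict(float)                               # record id -> accumulated sum of terms
    for i, e in enumerate(edges):
        j = int(np.clip(np.searchsorted(e, q[i], side="right") - 1, 0, len(e) - 2))
        lo, hi, ids, vals = R[i][j]
        if hi > lo:
            for rid, v in zip(ids, vals):                # only records sharing the query's bin
                h[rid] += (1.0 - abs(q[i] - v) / (hi - lo)) ** p
    best = max(h, key=h.get)
    return int(best), h[best] ** (1.0 / p)               # most-similar record and its PIDist

X = np.random.rand(1000, 40)
edges = equi_depth_edges(X, kd=20)
R = build_igrid(X, edges)
print(igrid_query(R, edges, X[0]))                       # record 0 should match itself best
```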
39. Picking a Good kd for IGrid
- as we saw, higher kd leads to faster query time
- higher kd also improves meaningfulness
  - ignores the exact degree of dissimilarity on noisy dims
- too large a kd is bad: loss of information
- tradeoff resolution: pick the minimum kd that avoids (Dmax - Dmin)/Dmin → 0
  - assume a Bernoulli distribution (bin-match prob 1/kd)
  - the 0/1 match random variable Mi has μ = 1/kd, σ^2 = (1/kd)(1 - 1/kd)
  - L = Σ_{i=1..d} Mi has μ = d/kd, σ = sqrt((d/kd)(1 - 1/kd))
    - σ = sqrt((d/kd)(kd - 1)/kd) = sqrt(d(kd - 1)) / kd
  - if lim_{d→∞} σ/μ = 0 then (Dmax - Dmin)/Dmin → 0
  - so we want lim_{d→∞} sqrt((kd - 1)/d) > 0
  - this implies picking kd at least linear in d; they use kd ∝ d ...
40. Pick kd = α·d Using Some α ∈ [0.5, 1]
- for a linear dependence on d (kd = α·d)
  - α=1 exactly gives the Lp norm distance for d=1
    - but increasingly different from Lp as d grows
  - α=0.5 and p=1 is similar to the L1 norm distance for d=2
    - but increasingly different from L1 as d grows
- suggests α ∈ [0.5, 1] is a good range
  - also ensures that speed improves as d grows (since kd grows)
- recall: PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
  - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
41. Handling Edge Effects of Binning
- when we bin (discretize), we need to consider edge effects
- example
  - the value of xi could be in bin j, very close to that bin's high end
  - the value of yi could be in bin j+1, very close to that bin's low end
  - the PIDist contribution would then be 0
  - recall: PIDist(x,y,kd) = [ Σ_{i∈S(x,y,kd)} ( 1 - |xi - yi| / (mi - ni) )^p ]^(1/p)
    - S(x,y,kd) is the set of dimensions for which xi and yi fall in the same bin
- eliminate the edge effect by
  - subdividing each of the kd bins into L equi-depth sub-ranges
    - each getting a list of n/(kd·L) record ids
  - at search time: find the query's sub-bin, then also use the (L-1)/2 sub-bins on each side
  - L=3 seems sufficient
42. Supporting Multiple α At Query Time
- building one index which can support multiple distance metrics at query time
  is often desirable
  - since the right one is often not clear pre-query
- IGrid supports multiple α < α0 easily
  - with ratio α/α0, use the correspondingly wider set of bins nearest the
    query's bin for each query dim
  - i.e. treat it as a hierarchy of binnings (most detailed for α0)
- suggests building for a large α0 (e.g. α0 = 1) first
43. IGrid Issues
- does this approach really make sense for huge d?
  - is it reasonable to discretize into 3000 bins for d=1000?
    - e.g. many real values come from 8- or 16-bit A/D sensors
- nevertheless, it may be useful for moderate d
  - still larger than the d ≈ 20 where kd-trees fail
- also, PCA on the data might help even for huge d
  - PCA would spread the variance per dim i
    - especially for the earlier dimensions (top principal components)
    - for the bottom dims (less variance), values only match if VERY close
  - for huge n, equi-depth with O(d) bins might make sense
- the KDD'00 paper does not discuss this issue; it seems critical
44. UCI Data Meaningfulness Checks
[table: k=5, α=1; numbers show agreement with class labels, using 5-NN with
 various distance measures]
- this table shows the PIDist metric may be useful, but it doesn't prove that
  IGrid's use of many bins (kd ∝ d) really makes sense when d is huge (the
  largest here is only d=160) ...
45. Meaningfulness With Increasing Dimension
[figure: relative meaningfulness (Dmax - Dmin)/Dmean vs dimension, for the L2
 norm and for PIDist, using random, uniformly distributed data (n=100);
 PIDist does not tend toward 0]
46. Query Speed with Increasing Dimensionality
[figure: fraction of DB accessed vs dimensionality, for α=1]
- the IGrid fraction of the DB accessed can be predicted analytically: 2/(α·d)