1
Optimizing Nearest-Neighbor Methods
  • Dennis DeCoste
  • Yahoo! Research Labs
  • http://research.yahoo.com/staff/algorithms/decosted.xml

Yahoo! Research Labs Spot Workshop on
Recommender Systems August 24, 2004
2
Recommendation via Nearest Neighbors
  • given
  • database X of n (d-dimensional) (sparse) example vectors
  • e.g. n people, d movies, X(i,j) = person i's rating of movie j
  • (d-dimensional) (sparse) query vector Q
  • suitable similarity/distance metric dist(x,y)
  • (one) approach based on k-NN (user-user CF); a minimal sketch follows this list
  • find Q's k most-similar neighbors (X_1, ..., X_i, ..., X_k)
  • e.g. the k people who rated movies most similarly to person Q
  • report a recommendation over the neighbors
  • e.g. movie j with Q(j) unknown and highest mean(X(1,j), ..., X(k,j))
  • common dist(x,y): Euclidean, with missing values filled by a mean
  • e.g. mean rating per person, per movie, or a combination
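A minimal sketch (not from the slides) of this user-user CF recipe in Python; it assumes a dense n-by-d ratings matrix X with NaN marking unrated movies, and the names (dist, recommend) are illustrative.

import numpy as np

def dist(x, y, fill):
    # Euclidean distance with missing ratings imputed by per-movie means
    x = np.where(np.isnan(x), fill, x)
    y = np.where(np.isnan(y), fill, y)
    return np.linalg.norm(x - y)

def recommend(X, q, k=5):
    movie_means = np.nanmean(X, axis=0)               # mean rating per movie
    d = np.array([dist(X[i], q, movie_means) for i in range(len(X))])
    neighbors = np.argsort(d)[:k]                     # Q's k most-similar people
    pred = np.nanmean(X[neighbors], axis=0)           # mean neighbor rating per movie
    pred = np.nan_to_num(pred, nan=-np.inf)           # movies no neighbor rated
    pred[~np.isnan(q)] = -np.inf                      # only recommend unrated movies
    return int(np.argmax(pred))                       # movie j with highest mean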

3
Fast Nearest Neighbors: The Landscape
  • linear scan: compute query Q's similarity to all n examples X_i
  • speed-up via faster and/or fewer distance computations
  • faster (query time cost still O(n), but less than O(d n))
  • low-dimensional embeddings to quickly prune clearly-far examples
  • PCA, FastMap
  • fewer (cost potentially < O(n))
  • spatial indexing
  • kd-trees (generalize binary search to d dims; the ideal case gets cost O(log(n)))
  • metric indexing
  • vp-trees (for Euclidean dists, roughly combines PCA with kd-trees)
  • up front: many other open/frontier issues (speed vs. search-quality tradeoffs)
  • two biggies: 1) disk vs RAM and 2) what is the right distance metric?
  • approximate vs exact NN (proving the found NN is exact dominates query time)
  • exploiting labels (nearest class vs nearest neighbor per se)
  • many alternatives (locality-preserving hashing, IGrid, X-tree, TV-tree, ...)

4
When Complexity Linear in n Isn't Good Enough
  • in DM, n may be so large that linear in n is too slow
  • especially at query time
  • common goal: O(dn) train, O(d) or O(d log n) query time
  • for many tasks, subsampling n may suffice
  • e.g. linear regression, for which cost is O(nd² + d³)
  • a subsample of a huge n often leads to a very similar covariance matrix
  • but, for k-nearest neighbor, a subsample can be bad
  • k-NN error only converges to within 2× the Bayes error as n → ∞
  • especially hard to pick good subsamples when d is large
  • linear scan's O(nd) time cost can be too high
  • especially for model (distance) selection search (many queries over n)
  • often prefer some pre-query cost for faster queries ...

5
Linear Scan: The Thing to Beat
  • many clever indexing methods try to do better than O(nd)
  • R-tree, kd-tree, vp-tree, ball-tree, X-tree, TV-tree, etc. ...
  • for high d, rarely much faster than a (good) linear scan
  • often useless if intrinsic d is even moderate (e.g. d > 20)
  • linear scan's simplicity makes it highly optimizable (see the sketch after this list)
  • modern CPU prefetch ops; no pipeline stalls due to branches
  • branch and bound is relatively easy
  • order the d coordinates via PCA, stop when partial dist > best-so-far
  • use a priority queue for best-first search (e.g. exact match in O(d))
  • also, O(n) is not so bad -- O(n d) is often the problem
  • dim reduction (PCA, FastMap, random projections) often suffices ...
  • at the very least, linear scan within buckets (tree leaves)
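A hedged sketch of this optimized linear scan: coordinates are assumed to be pre-ordered by PCA (largest variance first), and the partial squared distance is abandoned as soon as it exceeds the best-so-far; the function name is illustrative.

import numpy as np

def nn_linear_scan(X, q):
    # branch-and-bound linear scan with partial-distance early exit
    best_i, best_d2 = -1, np.inf
    for i, x in enumerate(X):
        d2 = 0.0
        for xj, qj in zip(x, q):        # accumulate one coordinate at a time
            d2 += (xj - qj) ** 2
            if d2 > best_d2:            # partial dist already > best-so-far: prune
                break
        else:
            best_i, best_d2 = i, d2     # completed the loop: new best neighbor
        # (with a priority queue, examples could also be visited best-first)
    return best_i, np.sqrt(best_d2)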

6
GEMINI Framework
  • Q's dimensionality (d) is often large (100-1000)
  • even more so for multi-variate data ...
  • high d performs poorly for similarity search
  • linear scan is slow: O(nd)
  • indexing methods (e.g. kd-trees) degrade for high d
  • solution: dimensionality reduction to d' << d
  • linear scan d/d' faster; indexing works well for d' < 20
  • using PCA, FastMap, ...
  • GEMINI framework [Faloutsos et al.]
  • lower bounding lemma: dist_d'(A,B) ≤ dist_d(A,B)
  • thus, no false dismissals of top matches (100% recall)
  • since we only stop the search when all (partial) dists > best so far (see the filter-and-refine sketch below)
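A minimal filter-and-refine sketch in the GEMINI spirit (names are illustrative): Xr and qr are assumed to be lower-bounding d'-dimensional projections of X and q (e.g. the top-d' PCA coordinates), so the reduced distance never exceeds the full distance.

import numpy as np

def gemini_nn(X, Xr, q, qr):
    # scan candidates in order of their cheap lower-bound distance
    order = np.argsort(np.linalg.norm(Xr - qr, axis=1))
    best_i, best_d = -1, np.inf
    for i in order:
        lb = np.linalg.norm(Xr[i] - qr)        # lower bound on the true distance
        if lb > best_d:                        # lower-bounding lemma: safe to stop,
            break                              # no false dismissals possible
        d = np.linalg.norm(X[i] - q)           # refine with the full d-dim distance
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d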

7
FastMap: MDS Suitable for DM
[Faloutsos & Lin, SIGMOD 1995]
  • FastMap finds a k-dimensional embedding of the n data points
  • like PCA, but using distances (does not require vector data)
  • MDS computes n² dists; FastMap approximates using O(k·n)
  • achieved by approximately finding farthest pairs

[figure: choosing distant pivots a,b -- alternating O(n) farthest-point passes, with limited (e.g. 5) repeats until a,b are stable; successive pivot pairs are distant and orthogonal]
8
FastMap: Compute Embedded Coordinate x_i from Pivots a,b
  • the Pythagorean theorem directly gives
  • d_ai² = x_i² + z²
  • d_bi² = (d_ab − x_i)² + z²
  • z² equivalences:
  • d_ai² − x_i²
  • d_bi² − (d_ab − x_i)²
  • = d_bi² − d_ab² + 2·d_ab·x_i − x_i²
  • cancelling the x_i² terms:
  • d_ai² = d_bi² − d_ab² + 2·d_ab·x_i
  • x_i = (d_ai² + d_ab² − d_bi²) / (2·d_ab)

[figure: pivots a,b selected by choose-distant so that d_ab is maximized; point i projects onto line ab at distance x_i from a, with height z and remaining segment d_ab − x_i]
9
FastMap: Computing Embedded Distances
  • project the data onto the hyperplane H orthogonal to line ab
  • D'(O_I,O_J)² = D(O_I,O_J)² − (x_I − x_J)², for I,J = 1,…,n
  • to get distances after k pivot pairs:
  • recursively, D_k(O_I,O_J)² = D(O_I,O_J)² − Σ_{s=1..k} (x_I(s) − x_J(s))²
  • after k=n steps all D_k(O_I,O_J) = 0; some reach 0 for k < n

10
FastMap: k-Dimensional Embeddings
after k pivot pairs: D_k(O_I,O_J)² = D(O_I,O_J)² − Σ_{s=1..k} (x_I(s) − x_J(s))²
embedding a query q against pivot pair k: Q_k = (D_k(a,q)² + D_k(a,b)² − D_k(b,q)²) / (2·D_k(a,b))
so the first k embedding coordinates Q_1, Q_2, …, Q_k require O(1+2+…+k) = O(k²) work
[figure: query q embedded against pivot pairs 1…k, yielding coordinates Q_1, Q_2, …, Q_k]
alternative: Cholesky factorization S = Zᵀ·Z
(note: qᵀq − (Q_1² + … + Q_k²) represents the reconstruction error when using only k < d dims)
11
(FastMap = MDS which can embed queries)
  • R = FastMap(X) returns a result holding:
  • k pivot id pairs (a_1,b_1), …, (a_k,b_k)
  • k distances between the pivot pairs (d(a_i,b_i))
  • k² embedding coords (k coords for each of the k pivots)
  • Z = FastMap_embed(Y, R, X):
  • embeds d-by-n Y into k-by-n Z
  • for p = 1 to k do
  • (a,b) = p-th pivot pair
  • use d_ai² = D_ai² − Σ_{s=1..p−1} (Z_sa − Z_si)² to compute all d_ai² (and d_bi²)
  • Z_pi = (d_ai² + d_ab² − d_bi²) / (2·d_ab), for i = 1,…,n
  • where D_IJ = dists (computed only as needed) in the original input space
  • O(2k) dists (d_ai² and d_bi²) to embed each query of Y (a Python rendering follows below)
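A hedged Python rendering of the FastMap_embed pseudocode above. The result structure R is assumed to hold, for each pivot pair p, the pivot objects a[p] and b[p], their original-space distance d_ab[p], and the pivots' own embedding coordinates Za, Zb; D(u,v) is any distance in the original input space. The layout of R is an assumption for illustration.

import numpy as np

def fastmap_embed(Y, R, D):
    """Embed the d-by-m array Y (one column per object) into k-by-m Z."""
    a, b, d_ab, Za, Zb = R          # pivot objects, pivot dists, pivot coordinates
    k, m = len(d_ab), Y.shape[1]
    Z = np.zeros((k, m))
    for p in range(k):
        for i in range(m):
            # residual squared distances to the p-th pivots: subtract the
            # contribution of the first p coordinates already assigned
            dai2 = D(a[p], Y[:, i]) ** 2 - np.sum((Za[:p, p] - Z[:p, i]) ** 2)
            dbi2 = D(b[p], Y[:, i]) ** 2 - np.sum((Zb[:p, p] - Z[:p, i]) ** 2)
            Z[p, i] = (dai2 + d_ab[p] ** 2 - dbi2) / (2.0 * d_ab[p])
    return Z

Note the O(2k) distance evaluations per query: two calls to D per pivot pair.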

12
Spatial Indexing: kd-Trees
  • classic binary search (CS 101, kindergarten, ...)
  • given univariate (d=1) real-valued data x = (x_1, x_2, …, x_n)
  • prepare by sorting x once -- O(n log(n)) time, O(n) space
  • fast queries via binary search -- O(log(n)) time
  • use a binary search tree of the data
  • balanced, so height = log2(n)
  • kd-tree: generalize binary search to data X with d > 1
  • goal: still get O(d log(n)) query times and O(dn) space
  • prepare (indexing) time: O(d n log(n)) to build the kd-tree

13
2d Example: Constructing a kd-Tree
  • each node partitions its subtree's data by one dimension
  • issue 1: which dimension i = 1…d to use for a given node?
  • common: the max spread of the subtree's data in dim i (so max pruning)
  • issue 2: where to cut for dimension i?
  • common: at the median of values, so the tree is evenly balanced (a build sketch follows after the figure)

[figure: 2d point set over dimensions d1, d2; the first cut is by dimension d2]
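A minimal kd-tree construction sketch following the two heuristics above (split on the dimension of maximum spread, cut at the median); class and parameter names are illustrative, not from the slides.

import numpy as np

class KDNode:
    def __init__(self, dim=None, cut=None, left=None, right=None, points=None):
        self.dim, self.cut = dim, cut          # splitting dimension and cut value
        self.left, self.right = left, right
        self.points = points                   # data stored at a leaf

def build_kdtree(X, leaf_size=8):
    X = np.asarray(X)
    if len(X) <= leaf_size:
        return KDNode(points=X)
    spread = X.max(axis=0) - X.min(axis=0)
    dim = int(np.argmax(spread))               # issue 1: dimension of max spread
    cut = float(np.median(X[:, dim]))          # issue 2: cut at the median
    left_mask = X[:, dim] < cut
    if left_mask.all() or not left_mask.any(): # degenerate split (ties at median)
        return KDNode(points=X)
    return KDNode(dim, cut,
                  build_kdtree(X[left_mask], leaf_size),
                  build_kdtree(X[~left_mask], leaf_size))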
14
Constructing kd-Tree: Pivot Strategies
[figure: two 2d trees, one built with median-of-max-spread, one with closest-to-center-of-widest-dim]
the latter is preferred, except after some depth prefer the former (for balance)
15
2d Example: Querying the kd-Tree
first, DFS directly to the leaf containing Q;
this often immediately finds a decent neighbor (with best-so-far dist r)
in fact, the kd-tree often finds a decent neighbor quickly
and then spends most of the search proving it is the best
(which motivates approximate NN ...)
[figure: query Q and best-so-far radius r within the partitioned 2d space]
16
2d Example: Querying the kd-Tree
second, back up to the leaf's parent and check whether the other child needs to be searched:
does the hyperrectangle HR intersect the hypersphere (radius r, center Q)?
find the point p in HR closest to Q; they intersect iff dist(p,Q) ≤ r
[figure: when dist(p,Q) > r, Q need not search this HR]
p_i = hr_i^min if Q_i ≤ hr_i^min;   Q_i if hr_i^min < Q_i < hr_i^max;   hr_i^max if Q_i ≥ hr_i^max
(a small intersection-test sketch follows below)
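A small sketch of this intersection test: clamping Q coordinate-wise into HR yields exactly the closest point p of the piecewise formula above, and the HR needs to be searched only if dist(p,Q) ≤ r.

import numpy as np

def hr_intersects_ball(hr_min, hr_max, Q, r):
    # p_i = hr_i^min, Q_i, or hr_i^max, per the piecewise rule
    p = np.clip(Q, hr_min, hr_max)
    return np.linalg.norm(p - Q) <= r     # intersect iff dist(p, Q) <= r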
17
Querying kd-Tree: Best and Worst Cases
[figure: good-case vs. bad-case query geometries]
expected search cost depends on the expected distribution of X and Q;
worst-case query time is more like O(2^d·log(n)) than O(d·log(n))
18
Example kd-Tree Performance
[plot: number of DB nodes examined during query search (n = 10,000) vs. dimensionality of data (d)]
19
Training Time of k-NN Revisited
  • so, in practice, training a k-NN is usually not O(1)
  • e.g. O(2d n log(n)) to build the kd-tree
  • for faster query times: O(d log(n)) vs O(d n) for linear scan, for k=1
  • or O(d n) time to LOO cross-validate to find the best k
  • or other pre-processing
  • e.g. condensation (prune the training set to reduce the effective n)
  • feature selection/weighting
  • finding the best distance metric to use (model selection)
  • ...

20
Metric Indexing: Vantage Point Trees
"Data Structures and Algorithms for NN Search in General Metric Spaces" (vp-trees), by Peter Yianilos. Also ball trees, gh-trees, ...
  • does not require data in vector form (e.g. SVM kernel dists)
  • exploits any structure among the n data points
  • (anything is helpless for pathological cases)
  • e.g. all data at unit hypercube corners <1,0,0>, <0,1,0>, <0,0,1>, ...
  • define a distance, relative to pivot p, between objects (q, b, ...):
  • d_p(q,b) ≡ |d(q,p) − d(b,p)|
  • d(q,b) ≥ |d(q,p) − d(b,p)| = d_p(q,b)
  • via the triangle inequality: d(q,p) ≤ d(q,b) + d(b,p)
  • lower bounding, so
  • d_p(q,b) > r ⇒ d(q,b) > r
  • can stop considering b as q's neighbor early, once the bound exceeds the best-so-far (r)

[figure: triangle with vertices p, q, b and edges d(p,q), d(q,b), d(p,b)]
21
vp-Tree Pivot Choice Depends on Distribution
[figure: for data uniform over a square, a corner pivot is optimal -- closest to a 50/50 split]
22
2d Example: vp-Trees vs kd-Trees
23
Constructing Simple vp-Tree
function node = build(S)
  if S = ∅ then return node = ∅
  p = select_vp(S);  node.p = p
  m = median over s∈S of d(p,s);  node.m = m
  L = { s ∈ S−{p} : d(p,s) < m }
  R = { s ∈ S−{p} : d(p,s) ≥ m }
  node.left = build(L);  node.right = build(R)

function bestp = select_vp(S)
  P = random sample of S;  best = 0
  for p ∈ P
    Q = random sample of S
    s = spread (variance about the median) of { d(p,q) : q∈Q }
    if (s > best) { best = s;  bestp = p }

O(c·n·log(n)) time (c = time to compute a distance, often c ≈ d); O(n) space
[figure: pivot p with median radius m splitting S into L (inside) and R (outside); query q with best-so-far radius r]
when can we skip children?  skip n.R if d(n.p,q) + r < n.m;  skip n.L if d(n.p,q) − r > n.m
24
Searching vp-Tree
simple VP search: search L if x < m + r, search R if x > m − r

function search(n, q)
  if n = ∅ then return
  x = d(q, n.id)
  if (x < r) then { r = x;  best = n.id }
  middle = (n.leftBndHigh + n.rightBndLow) / 2
  if (x < middle) then
    if (x ∈ I_L) search(n.left, q)
    if (x ∈ I_R) search(n.right, q)
  else
    if (x ∈ I_R) search(n.right, q)
    if (x ∈ I_L) search(n.left, q)

heuristic ordering of L then R (or R then L); often both must be searched
[figure: distance axis from the vantage point showing leftBndHigh, rightBndLow, and middle; the query intervals I_L and I_R are each widened by r, and can be narrowed for approximate search]
25
Results for vp-Tree on 2d Data in 10d Space
[plot: number (of 4000 total) vectors examined, for a 2d hypercube vs. a 10d hypercube, with Q sampled from the 2d data vs. from the 10d cube]
since its partitioning is not axis-aligned like a kd-tree's,
the vp-tree's complexity stays close to the 2d performance
and beats the 10d random hypercube (i.e. it builds the same
vp-tree regardless of how the data is rotated -- no need for PCA)
26
Same 2d Random Data in 3d to 50d Space + Noise
[plot: vp-tree search cost vs. amplitude of added noise]
27
Examples of vp-Trees on Real Image Data
[plot: fraction of DB touched per query (average), n = 1 million 32x32 images; kd-tree ≈ linear scan by d = 256; curves for "Q and DB similar" vs. "Q and DB general"]
28
kd-Tree vs vp-Tree
  • kd-tree
  • build time O(d·n·log(n))
  • query time O(min(d·n, 2^(d−1)·log(n))) worst-case
  • each query path (root to some leaf) takes O(log(n))
  • vp-tree (where d' ≤ d is the effective dimensionality)
  • assuming each distance computation is O(d)
  • build time O(d·n·log(n)) using random pivot samples
  • query time O(min(d·n, d·2^(d'−1)·log(n))) worst-case
  • each query path (root to some leaf) takes O(d·log(n))
  • so the vp-tree is not always better than the kd-tree: each path costs O(d) more
  • but it exploits local structure within each subtree (no need for PCA)
  • the two can also be mixed (e.g. SR-trees, etc.)

29
FastMap Revisited (w.r.t. kd-tree and vp-tree)
  • FastMap can make linear scan faster
  • FastMap(k) is like PCA(k), but only requires dists, not vectors
  • satisfies the lower bounding lemma (just as PCA does)
  • however, FastMap's time complexity is O(d·k·n + k²·n)
  • or O(d·k + k²) per query for a k-dim embedding, given O(d) dists
  • pro: avoids the O(d·n²) time of MDS when k << n
  • con: the k² term, due to enforcing orthogonality of the embedding axes
  • at the k-th pivot pair: D_k(O_I,O_J)² = D(O_I,O_J)² − Σ_{s=1..k} (x_I(s) − x_J(s))²
  • negligible (vs. linear scan's O(n·d)) for the common case of k ≈ sqrt(d)
  • e.g. often just k=2 or k=3 for simple visualization applications
  • but k could be as large as n (e.g. for a poly-2 kernel distance) ...
  • so, for large k, the vp-tree is faster than FastMap + kd-tree
  • e.g. O(d·log(n)) < O(k²·log(n)), where k ≈ d and log(n) < d

30
Triangle Inequality Insufficient for (Very) High d
  • the vp-tree (and other metric trees) are based on triangle inequalities (TI)
  • for all p,q,b: d(p,q) ≤ d(p,b) + d(b,q)
  • etc.
  • for query q, prune point b via pivot p whenever we see
  • |d(q,p) − d(p,b)| > r, since d(q,b) > r must then be true via the TI
  • e.g. the vp-tree caches bounds on d(p,b) (over p's subtree)
  • unfortunately, in high d, the TI is often USELESS
  • let minD = the smallest (non-self) d(a,b) distance in the database
  • let maxD = the largest d(a,b) in the database
  • if 2·minD > maxD (i.e. maxD − minD < minD), no pruning is possible
  • since for most any q ∈ X: |d(q,p) − d(p,b)| < minD but r ≥ minD (a small numeric illustration follows after the figure)

[figure: triangle with vertices p, q, b and edges d(p,q), d(p,b), d(b,q)]
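A small illustrative experiment (not from the slides) showing why this bound collapses: for uniform random data, 2·minD quickly exceeds maxD as d grows, so the pruning condition |d(q,p) − d(p,b)| > r can never fire.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    X = rng.random((200, d))                         # 200 random points in [0,1]^d
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    D[np.diag_indices_from(D)] = np.inf              # ignore self-distances
    minD, maxD = D.min(), D[np.isfinite(D)].max()
    print(f"d={d:4d}  minD={minD:.3f}  maxD={maxD:.3f}  "
          f"pruning possible: {2 * minD <= maxD}")   # no pruning once 2*minD > maxD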
31
High d Hopeless?
  • only high intrinsic d (e.g. > 20) is a problem
  • else PCA, the triangle inequality, etc. will handle it fine
  • also, kernel methods like SVMs handle high d fine
  • e.g. a poly-9 kernel uses a huge implicit dimension, O((d+9)!/(d!·9!))
  • but the SVM weights that feature space -- only O(n) non-zero
  • the SVM doesn't employ the triangle inequality
  • training determines which examples (the h SVs, h ≤ n) need to be searched
  • the key is often to use an appropriate distance metric
  • Euclidean (L2) distance over the original d is usually not best ...

32
STOP!
33
VA-File
  • accept that a linear scan is inevitable for high d
  • compress the data so the linear scan is faster (e.g. 10x)
  • Vector Approximation
  • more of the data fits in RAM
  • fewer/faster disk scans

34
Locality-Preserving Hashing
  • a hash table is very fast: O(1) time
  • but it is unlikely that a query and its nearest neighbor would collide
  • so, use multiple (L > 1) hash tables
  • each one hashes over a random k of the d dimensions
  • linear scan over the union of all L hashed buckets for the query (see the sketch below)
  • gives fast approximate NN (within a 1+ε factor of the optimal distance)
  • experiments suggest L ≈ 10-100 often gives reasonable error (ε ≈ 2-20%)
  • with max bucket size B (e.g. B = 100)
  • optimal k: maximize the prob (p1) that near points hash to the same bucket, minimize the prob (p2) that far points do
  • k = log_{1/p2}(n/B),  L = (n/B)^ρ,  where ρ = ln(1/p1) / ln(1/p2)
  • e.g. "Similarity Search in High Dimensions via Hashing", by Gionis, Indyk, and Motwani, VLDB '99
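A minimal multi-table sketch in the spirit of the scheme above, assuming binary data (bit-sampling hash over a random k of the d coordinates); the class name and parameter defaults are illustrative, and the p1/p2-based formulas for k and L are not wired in.

import numpy as np
from collections import defaultdict

class LSH:
    def __init__(self, X, k=8, L=20, seed=0):
        rng = np.random.default_rng(seed)
        self.X = np.asarray(X)
        d = self.X.shape[1]
        # each table hashes a different random subset of k coordinates
        self.subsets = [rng.choice(d, size=k, replace=False) for _ in range(L)]
        self.tables = []
        for S in self.subsets:
            table = defaultdict(list)
            for i, x in enumerate(self.X):
                table[tuple(x[S])].append(i)
            self.tables.append(table)

    def query(self, q):
        candidates = set()
        for S, table in zip(self.subsets, self.tables):
            candidates.update(table.get(tuple(q[S]), []))
        if not candidates:                            # fall back to a full scan
            candidates = range(len(self.X))
        cand = list(candidates)                       # linear scan over the union
        dists = np.linalg.norm(self.X[cand] - q, axis=1)
        return cand[int(np.argmin(dists))]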

35
IGrid (Inverted Grid Index): Motivations
(Aggarwal et al., KDD 2000)
  • the problem with high d is that (Dmax − Dmin)/Dmin → 0
  • i.e. as d grows, var(dists) tends to zero (relative to the mean)
  • for very high d
  • the nearest neighbor and the farthest neighbor could swap places
  • with a relatively small perturbation in the vector representations of the data
  • even the most-similar records likely have some well-separated dims
  • due to noise effects (e.g. Brownian motion)
  • these problems suggest
  • we need a more meaningful distance measure
  • simple Euclidean (L2) is sensible for d=2,3, not d=100
  • if crafting a new distance score, it might as well be fast too ...

36
IGrid Similarity Functions
  • for two vectors x = (x_1,x_2,…,x_d) and y = (y_1,y_2,…,y_d)
  • basic scale-normalized similarity:
  • IDIST(x,y) = [ Σ_{i=1..d} ( 1 − |x_i − y_i| / (u_i − l_i) )^p ]^{1/p}
  • where u_i, l_i are the upper/lower range of dimension i's values
  • p = 2 is similar to Euclidean L2
  • proximity-based similarity:
  • bin each of the d dimensions into k_d equi-depth ranges
  • PIDIST(x,y,k_d) = [ Σ_{i∈S(x,y,k_d)} ( 1 − |x_i − y_i| / (m_i − n_i) )^p ]^{1/p}
  • S(x,y,k_d) is the set of dimensions for which x_i and y_i fall in the same bin
  • where m_i, n_i are the upper/lower bounds of that (shared) bin for dimension i
  • the value of PIDIST ranges from 0 to |S(x,y,k_d)|^{1/p}
  • on average, the summation has d/k_d bin matches (noise reduction)
  • equi-depth binning is similar to IDF-normalization in IR (relative to all the data); a direct computation sketch follows below
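A direct, hedged rendering of the PIDIST definition above; bin_edges[i] is assumed to hold the k_d + 1 precomputed equi-depth bin boundaries for dimension i.

import numpy as np

def pidist(x, y, bin_edges, p=2):
    total = 0.0
    for i, edges in enumerate(bin_edges):
        # locate the bins containing x_i and y_i
        jx = np.clip(np.searchsorted(edges, x[i], side='right') - 1, 0, len(edges) - 2)
        jy = np.clip(np.searchsorted(edges, y[i], side='right') - 1, 0, len(edges) - 2)
        if jx == jy:                                # dimension i is in S(x, y, k_d)
            n_i, m_i = edges[jx], edges[jx + 1]     # lower/upper bounds of the shared bin
            if m_i > n_i:
                total += (1.0 - abs(x[i] - y[i]) / (m_i - n_i)) ** p
    return total ** (1.0 / p)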

37
IGrid Index Structure
  • uses an inverted index on a grid representation of the data
  • e.g. much like Google, by discretizing the data into "words"
  • the IGrid structure is simply a d-by-k_d matrix R[i,j], with
  • R[i,j].hi and R[i,j].lo (two real values)
  • the upper and lower range bounds of bin j for dimension i
  • R[i,j].ids[1…n/k_d] (integers from the range 1…n)
  • lists the records whose value for dim i falls within R[i,j]'s range
  • the length of each list will be n/k_d due to equi-depth binning
  • R[i,j].vals[1…n/k_d] (real values)
  • each record in R[i,j]'s list also stores its actual value for dim i
  • note: the inverted index's size is about 2x the size of X
  • so, IGrid only requires O(d·n) space

38
Computing Proximity Similarity Given IGrid
  • given the d-by-k_d R[i,j].hi, R[i,j].lo, R[i,j].ids, R[i,j].vals
  • query-time similarity computation is very simple and fast
  • Google would just intersect the d ids lists (each of length n/k_d)
  • general way: a hash table h[1…n] of similarities, all 0 initially
  • for each dim i, and the corresponding bin j whose range contains the query's value for dim i:
  • update h[id] (accumulate partial PIDIST) for each id in R[i,j].ids
  • avoiding hash-table overflow (when the O(n·d) space > RAM):
  • divide a large X into chunks (each of max_hash_size · k_d records)
  • at most 1/k_d of the data is accessed at all (i.e. requires PIDIST accumulations)
  • NOTE: thus, speed is O(n/k_d), so it improves with increasing k_d
  • for each chunk, build an inverted index and query it separately; report the best
  • PIDIST(x,y,k_d) = [ Σ_{i∈S(x,y,k_d)} ( 1 − |x_i − y_i| / (m_i − n_i) )^p ]^{1/p}
  • S(x,y,k_d) is the set of dimensions for which x_i and y_i fall in the same bin
(a build-and-query sketch follows below)
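A hedged build-and-query sketch of the structure just described: R[i][j] stores the bin bounds plus the ids and values of the records falling in bin j of dimension i, and a query accumulates partial PIDIST contributions only for the ids sharing its bin on each dimension (the chunking used to fit RAM is omitted, and all names are illustrative).

import numpy as np

def build_igrid(X, kd):
    n, d = X.shape
    R = []
    for i in range(d):
        edges = np.quantile(X[:, i], np.linspace(0, 1, kd + 1))   # equi-depth bins
        bins = np.clip(np.searchsorted(edges, X[:, i], side='right') - 1, 0, kd - 1)
        R.append([{'lo': edges[j], 'hi': edges[j + 1],
                   'ids': np.where(bins == j)[0],
                   'vals': X[bins == j, i]} for j in range(kd)])
    return R

def igrid_query(R, q, p=2):
    n = sum(len(b['ids']) for b in R[0])
    h = np.zeros(n)                                # partial PIDIST accumulators
    for i, bins in enumerate(R):
        j = next((j for j, b in enumerate(bins)    # bin containing the query value
                  if b['lo'] <= q[i] <= b['hi']), None)
        if j is None:
            continue
        b = bins[j]
        width = b['hi'] - b['lo']
        if width > 0:
            h[b['ids']] += (1.0 - np.abs(b['vals'] - q[i]) / width) ** p
    return int(np.argmax(h)), h.max() ** (1.0 / p)  # best record and its PIDIST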

39
Picking a Good k_d for IGrid
  • as we saw, a higher k_d leads to faster query time
  • a higher k_d also improves meaningfulness
  • it ignores the exact degree of dissimilarity on noisy dims
  • but too large a k_d is bad: loss of information
  • tradeoff resolution: pick the minimum k_d that avoids the → 0 degeneracy
  • assume a Bernoulli distribution (bin-match prob 1/k_d)
  • the 0/1 match random variable M_i has μ = 1/k_d, σ² = (1/k_d)(1 − 1/k_d)
  • L = Σ_{i=1..d} M_i has μ = d/k_d, σ = sqrt((d/k_d)(1 − 1/k_d))
  • σ = sqrt((d/k_d)·(k_d − 1)/k_d) = sqrt(d·(k_d − 1)) / k_d
  • if lim_{d→∞} σ/μ = 0 then (Dmax − Dmin)/Dmin → 0
  • so we want lim_{d→∞} sqrt((k_d − 1)/d) > 0
  • which implies picking k_d at least linear in d; they use k_d ∝ d ...

40
Pick k_d ∝ d Using Some α ∈ [0.5, 1]
  • for linear dependence on d (k_d = α·d)
  • α=1 exactly gives the Lp norm distance for d=1
  • increasingly different from Lp as d grows
  • α=0.5 and p=1 is similar to the L1 norm distance for d=2
  • increasingly different from L1 as d grows
  • suggests α ∈ [0.5, 1] is a good range
  • this also ensures that speed improves as d grows (since k_d grows)
  • PIDIST(x,y,k_d) = [ Σ_{i∈S(x,y,k_d)} ( 1 − |x_i − y_i| / (m_i − n_i) )^p ]^{1/p}
  • S(x,y,k_d) is the set of dimensions for which x_i and y_i fall in the same bin

41
Handling Edge Effects of Binning
  • when we bin (discretize), we need to consider edge effects
  • example:
  • the value of x_i could be in bin j, very close to that bin's high end
  • the value of y_i could be in bin j+1, very close to that bin's low end
  • yet the PIDIST contribution would be 0
  • recall: PIDIST(x,y,k_d) = [ Σ_{i∈S(x,y,k_d)} ( 1 − |x_i − y_i| / (m_i − n_i) )^p ]^{1/p}
  • S(x,y,k_d) is the set of dimensions for which x_i and y_i fall in the same bin
  • eliminate the edge effect by:
  • subdividing each of the k_d bins into L equi-depth sub-ranges
  • each getting a list of n/(k_d·L) record ids
  • at search time: find the sub-bin for the query, then include the (L−1)/2 sub-bins on each side
  • L=3 seems sufficient

42
Supporting Multiple α At Query Time
  • building one index which can support multiple distance metrics at query time is often desirable
  • since the right one is often not clear pre-query
  • IGrid supports multiple α < α0 easily
  • the ratio α0/α determines how many adjacent bins to use: take that many bins nearest the query's value for each query dim
  • i.e. treat it as a hierarchy of binnings (most detailed for α0)
  • this suggests building for a large α0 (e.g. α0 = 1) first

43
IGrid Issues
  • does this approach really make sense for huge d?
  • is it reasonable to discretize into 3000 bins for d=1000?
  • e.g. many real values come from 8- or 16-bit A/D sensors
  • nevertheless, it may be useful for moderate d
  • still larger than the d ≈ 20 where kd-trees fail
  • also, PCA on the data might help even for huge d
  • PCA would spread the variance per dim i
  • especially for the earlier dimensions (top principal components)
  • for the bottom dims (less variance), values only match if they are VERY close
  • for huge n, equi-depth binning into O(d) bins might make sense
  • the KDD'00 paper does not discuss this issue, which seems critical

44
UCI Data Meaningfulness Checks
(k=5, α=1)
[table: agreement with class labels, using 5-NN with various distance measures]
this table shows the PIdist metric may be useful,
but it doesn't prove that IGrid's use of many bins (k_d ∝ d)
really makes sense when d is huge (the largest here is only d=160) ...
45
Meaningfulness With Increasing Dimension
[plot: relative meaningfulness (Dmax − Dmin) / Dmean vs. dimension, for the L2 norm and PIdist, using random uniformly distributed data (n=100); PIdist does not tend toward 0]
46
Query Speed with Increasing Dimensionality
[plot: query speed vs. dimensionality, α=1; the IGrid fraction of the DB accessed can be predicted analytically as 2/(α·d)]