Efficient Top-k Queries in Large-Scale Networks

Slides: 39

Provided by: cao4

Learn more at: http://crypto.stanford.edu

Transcript and Presenter's Notes

Title: Efficient Top-k Queries in Large-Scale Networks

1
Efficient Top-k Queries in Large-Scale Networks

Pei Cao
Cisco Systems, Inc.
Consulting Faculty, Stanford University

2
Motivation

Enterprise content delivery networks (CDNs)
CE: web cache and streaming media cache combined
Number of branches: 50 - 2000

[Diagram: a Central Manager in the Data Center, connected over 56 Kbps, 128 Kbps, or DSL links to the CEs in the Branch Offices]
3
Top-k Queries in CDNs

Example queries
Across all CEs, which URLs are accessed most
often?
Across all CEs, which domains consume the most
storage?
Across all CEs, which cached objects produced the
biggest bandwidth savings?
etc.

4
Definitions

a network of m nodes, connected to a central manager (CM)
each node i has a reverse-sorted list of (x, V_i(x)) pairs
an object's sum: V(x) = V_1(x) + V_2(x) + ... + V_m(x)
Problem: find the k objects with the highest sums
Goal: answer this question with minimum network traffic
⇒ A generic problem in distributed systems
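The definition above can be written down directly as a brute-force computation in which every node ships its whole list to the CM. The following is a minimal Python sketch, not part of the presentation: it assumes each node's list fits in a dict and uses only the visible entries of the three example lists shown on the Threshold Algorithm slide (the elided tails are ignored). The names `NODES` and `top_k_by_sum` are illustrative.

```python
from collections import Counter

# Visible entries of the three example lists (assumption: the ". . ." tails are ignored).
NODES = [
    {"A": 10, "C": 8, "E": 8, "F": 8, "B": 7, "D": 5, "J": 1, "K": 1},
    {"B": 10, "D": 9, "F": 8, "H": 6, "G": 5, "C": 1, "A": 1},
    {"C": 10, "A": 9, "G": 8, "J": 7, "F": 6, "D": 4, "B": 1},
]


def top_k_by_sum(nodes, k):
    """Return the k (object, V(x)) pairs with the highest sums V(x) = sum_i V_i(x)."""
    totals = Counter()
    for scores in nodes:
        totals.update(scores)  # adds V_i(x) to the running sum of each object x
    # Sort by descending sum; break ties by object name to keep the result deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


print(top_k_by_sum(NODES, 2))
```

This is the correctness reference for the algorithms on the later slides: it sends every pair over the network, which is exactly the cost they try to avoid.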

5
Existing Methods

Each node sends the full list of objects and their values to the Central Manager
Pro: simple to implement; works fine when the number of objects is small
Con: when the number of objects is large, consumes too much network bandwidth
Use the threshold algorithm (TA)
Proposed by multiple groups in the database research community

6
The Threshold Algorithm (TA)

Example: find top 2 objects with max sums in three columns

Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) (K, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
Central Manager (CM), one step per row of sorted access:
⇒ T = 30; V(A) = 20, V(C) = 19, V(B) = 18
⇒ T = 26; V(A) = 20, V(C) = 19, . . .
⇒ T = 24; V(F) = 22, V(A) = 20, . . .
⇒ T = 21; V(F) = 22, V(A) = 20, . . .
⇒ T = 18; V(F) = 22, V(A) = 20, . . .
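The row-by-row trace above can be reproduced with a short sketch of the threshold algorithm. This is an illustrative Python version under stated assumptions, not the presenter's code: lists are finite (only the visible entries are used), a missing object counts as 0 on a node, and the run stops as soon as the k-th best known sum is at least the threshold T. `LISTS` and `threshold_algorithm` are hypothetical names.

```python
# Visible entries of the three reverse-sorted example lists (elided tails ignored).
LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def threshold_algorithm(lists, k):
    """Return (top-k pairs, rows read, final threshold T)."""
    lookup = [dict(lst) for lst in lists]  # stands in for random access by object name
    totals, best, threshold = {}, [], 0
    depth = max(len(lst) for lst in lists)
    for row in range(depth):
        threshold = 0
        for lst in lists:
            if row < len(lst):
                obj, value = lst[row]
                threshold += value  # T is the sum of the values in the current row
                if obj not in totals:
                    # Random lookup: fetch the object's value from every list.
                    totals[obj] = sum(node.get(obj, 0) for node in lookup)
        best = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        if len(best) == k and best[-1][1] >= threshold:
            return best, row + 1, threshold
    return best, depth, threshold


print(threshold_algorithm(LISTS, 2))
```

On the example data the loop stops at the fifth row, where T = 18 no longer exceeds the second-best sum V(A) = 20.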
7
Adapting TA for Distributed Environments

Consists of multiple rounds
Each round has two round trips
Round-trip 1 (sorted access): CM asks for the next B objects on the lists, and nodes respond
Round-trip 2 (random lookup): CM sends a list of object names to nodes, and nodes supply values
B = k

8
TA Running over Networks
CM
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
⇒ T = 26; looks up A, B, C, D ⇒ V(A) = 20, V(C) = 19; can't stop
⇒ T = 21; looks up E, F, G, H, J ⇒ V(F) = 22, V(A) = 20; can't stop
⇒ T = 10; stop
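The three rounds of this trace can be simulated with a small sketch of the batched variant. It is an illustration only, assuming B = k sorted accesses per node per round, finite lists (visible entries only), and a threshold equal to the sum of the last value each node has sent. `LISTS` and `distributed_ta` are hypothetical names.

```python
# Visible entries of the three reverse-sorted example lists (elided tails ignored).
LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def distributed_ta(lists, k, batch=None):
    """Return (top-k pairs, log of (threshold, names looked up) per round)."""
    batch = batch or k  # the slides use B = k
    lookup = [dict(lst) for lst in lists]  # stands in for each node's random lookup
    totals, log, best, pos = {}, [], [], 0
    longest = max(len(lst) for lst in lists)
    while pos < longest:
        end = pos + batch
        # Round-trip 1 (sorted access): every node returns its next `batch` pairs.
        seen = {obj for lst in lists for obj, _ in lst[pos:end]}
        # Threshold: last value sent by each node (0 once a list is exhausted).
        threshold = sum(lst[end - 1][1] if end <= len(lst) else 0 for lst in lists)
        # Round-trip 2 (random lookup): ask all nodes about names not yet resolved.
        asked = sorted(seen - set(totals))
        for obj in asked:
            totals[obj] = sum(node.get(obj, 0) for node in lookup)
        best = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        log.append((threshold, asked))
        if len(best) == k and best[-1][1] >= threshold:
            break
        pos = end
    return best, log


best, log = distributed_ta(LISTS, 2)
for threshold, asked in log:
    print("T =", threshold, "lookups:", asked)
print(best)
```

The log shows why the number of round trips depends on the data: the loop runs until the threshold drops below the k-th best sum, however long that takes.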
9
Problems with TA in Large Networks

The number of round trips required varies with the input data
High bandwidth consumption when the number of nodes is large
In round trip 2, the list of random-lookup objects is the union of all objects sent by the m nodes in round trip 1
In round trip 2, the list goes to all m nodes

10
New Algorithm: Two-Phase Uniform Threshold (TPUT)

Motivation: the algorithm should terminate in a fixed (and small) number of round trips
Operates in two phases
Phase 1: get a lower-bound estimate of the bottom value in the top-k set (the true bottom is denoted E)
Phase 2: all nodes send the objects whose sums are potentially higher than the lower bound; CM aggregates the info, refines the estimate, determines the candidates, and looks up the candidates in all nodes

11
Partial Sums and Upper Bounds

Partial sum: PS(x) = Σ_i V'_i(x)
Upper bound: U(x) = Σ_i U_i(x)

V'_i(x) = V_i(x), if x has been reported by node i to CM
V'_i(x) = 0, otherwise
U_i(x) = V_i(x), if x has been reported by node i to CM
U_i(x) = L_i, otherwise
L_i is the lowest value that node i has reported to CM
12
Examples
CM
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
⇒ PS(A) = 10 + 0 + 9 = 19, U(A) = 10 + 9 + 9 = 28
PS(B) = 0 + 10 + 0 = 10, U(B) = 8 + 10 + 9 = 27
For any object O, PS(O) ≤ V(O) ≤ U(O)
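The two numbers in this example follow mechanically from the definitions of PS and U. A minimal Python sketch, assuming each node has so far reported only its top 2 pairs; `REPORTS` and `partial_sum_and_upper_bound` are illustrative names, and every node is assumed to have reported at least one pair:

```python
# What each node has reported to the CM so far: its top 2 pairs.
REPORTS = [
    {"A": 10, "C": 8},   # node 1, so L_1 = 8
    {"B": 10, "D": 9},   # node 2, so L_2 = 9
    {"C": 10, "A": 9},   # node 3, so L_3 = 9
]


def partial_sum_and_upper_bound(reports, obj):
    """Return (PS(obj), U(obj)) given the pairs each node has reported."""
    ps = ub = 0
    for sent in reports:
        lowest = min(sent.values())      # L_i: lowest value node i has reported
        ps += sent.get(obj, 0)           # unreported values count as 0 in PS
        ub += sent.get(obj, lowest)      # unreported values are at most L_i
    return ps, ub


print(partial_sum_and_upper_bound(REPORTS, "A"))
print(partial_sum_and_upper_bound(REPORTS, "B"))
```

Because each list is reverse-sorted, any value a node has not yet reported cannot exceed L_i, which is what makes U(x) a valid upper bound on V(x).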
13
Steps in TPUT

Round-trip 1
Manager → Nodes: start top-k query
Nodes → Manager: here are my top-k objects
Manager:
Calculate partial sums of all objects and sort them
Take the kth value, call it E1; E1 ≤ E
Set t = E1/m
Round-trip 2
Manager → Nodes: send me all objects with value ≥ t
Nodes → Manager: here they are
Manager:
Calculate partial sums of all objects and sort them; take the kth value, call it E2; E1 ≤ E2 ≤ E
For each object, calculate its upper bound; select those objects whose upper bounds are ≥ E2; call the set S

14
TPUT

Round-trip 3
Manager → Nodes: here is S; send me the values of all objects in S
Nodes → Manager: here they are
Manager: calculate sums for objects in S; select the top k objects

15
Example
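CM
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
⇒ PS(A) = 19, PS(C) = 18 ⇒ E1 = 18, t = 6
⇒ PS(F) = 22, PS(A) = 19 ⇒ E2 = 19; with t = 6 standing in for each unreported value, U(H) = 6 + 6 + 6 = 18 and U(J) = 6 + 6 + 7 = 19 ⇒ H and J are out! (J's unreported values are strictly below t, so its true sum is below 19.) S = {A, B, C, D, E, F, G}
⇒ V(F) = 22, V(A) = 20, V(C) = 19 ⇒ Top 2 objects are F and A.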
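The three round trips can be put together in one short sketch and checked against this example. This is an illustrative Python version, not the presenter's implementation. It makes two assumptions: unreported values are bounded by t in round-trip 2, and an object whose upper bound only ties E2 is kept only if its partial sum already reaches E2 (its unreported values are strictly below t, so its true sum cannot reach E2 otherwise). `NODES`, `rank`, `kth_value`, `partial_sums`, and `tput` are hypothetical names.

```python
# Visible entries of the three example lists (elided tails ignored).
NODES = [
    {"A": 10, "C": 8, "E": 8, "F": 8, "B": 7, "D": 5, "J": 1},
    {"B": 10, "D": 9, "F": 8, "H": 6, "G": 5, "C": 1, "A": 1},
    {"C": 10, "A": 9, "G": 8, "J": 7, "F": 6, "D": 4, "B": 1},
]


def rank(scores):
    """Sort (object, value) pairs by descending value, ties broken by name."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def kth_value(sums, k):
    ranked = rank(sums)
    return ranked[k - 1][1] if len(ranked) >= k else 0


def partial_sums(reports):
    sums = {}
    for sent in reports:
        for obj, value in sent.items():
            sums[obj] = sums.get(obj, 0) + value
    return sums


def tput(nodes, k, alpha=1.0):
    m = len(nodes)
    # Round-trip 1: each node reports its top k pairs; E1 is the kth partial sum.
    reports = [dict(rank(scores)[:k]) for scores in nodes]
    e1 = kth_value(partial_sums(reports), k)
    t = alpha * e1 / m
    # Round-trip 2: each node additionally reports every pair with value >= t.
    for sent, scores in zip(reports, nodes):
        sent.update({obj: v for obj, v in scores.items() if v >= t})
    partial = partial_sums(reports)
    e2 = kth_value(partial, k)

    def upper(obj):
        # Anything a node did not report is below t.
        return sum(sent.get(obj, t) for sent in reports)

    candidates = {obj for obj, ps in partial.items() if ps >= e2 or upper(obj) > e2}
    # Round-trip 3: look up the exact sums of the candidates only.
    totals = {obj: sum(scores.get(obj, 0) for scores in nodes) for obj in candidates}
    return {"E1": e1, "t": t, "E2": e2, "S": candidates, "top": rank(totals)[:k]}


result = tput(NODES, 2)
print(result["E1"], result["t"], result["E2"], sorted(result["S"]), result["top"])
```

Unlike the threshold algorithm, the function contains no data-dependent loop over rounds: it always finishes after the same three exchanges.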
16
Improving the Pruning Power

Observation: if E2 = E1, then no object can be pruned away
Solution: set t = (E1/m) · α, where 0 < α < 1
Effect
Increases traffic in round-trip 2
But shrinks the set of candidates, hence reduces traffic in round-trip 3
Optimal α depends on data set
Default α = 0.5
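The trade-off can be seen on the running three-node example by counting what round-trip 2 sends and how many candidates survive for α = 1 and α = 0.5. The sketch below is illustrative only: it is a simplified model that uses only the round-trip 2 reports (pairs with value at least t), bounds unreported values by t, and uses the visible list entries. `NODES` and `round2_cost_and_candidates` are hypothetical names.

```python
# Visible entries of the three example lists (elided tails ignored).
NODES = [
    {"A": 10, "C": 8, "E": 8, "F": 8, "B": 7, "D": 5, "J": 1},
    {"B": 10, "D": 9, "F": 8, "H": 6, "G": 5, "C": 1, "A": 1},
    {"C": 10, "A": 9, "G": 8, "J": 7, "F": 6, "D": 4, "B": 1},
]


def round2_cost_and_candidates(nodes, k, alpha):
    """Return (t, pairs sent in round-trip 2, size of the candidate set S)."""
    m = len(nodes)

    def kth(sums):
        ordered = sorted(sums.values(), reverse=True)
        return ordered[k - 1] if len(ordered) >= k else 0

    def add_up(reports):
        sums = {}
        for sent in reports:
            for obj, v in sent.items():
                sums[obj] = sums.get(obj, 0) + v
        return sums

    # Round-trip 1: top-k pairs per node give E1, then t = alpha * E1 / m.
    tops = [dict(sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]) for s in nodes]
    t = alpha * kth(add_up(tops)) / m
    # Round-trip 2: every pair with value >= t.
    reports = [{o: v for o, v in s.items() if v >= t} for s in nodes]
    partial = add_up(reports)
    e2 = kth(partial)
    # Keep an object if it already reaches E2 or its upper bound exceeds E2.
    size_s = sum(
        1 for o, ps in partial.items()
        if ps >= e2 or sum(r.get(o, t) for r in reports) > e2
    )
    return t, sum(len(r) for r in reports), size_s


for alpha in (1.0, 0.5):
    print(alpha, round2_cost_and_candidates(NODES, 2, alpha))
```

On this small data set, lowering α from 1 to 0.5 raises the round-trip 2 traffic from 14 to 17 pairs and shrinks the candidate set from 7 to 4 objects.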

17
Compression via Hashing

Problem: object IDs can be too long
Solution: send hashed keys of object IDs
Nodes report to CM: (hash(o), V(o))
If hash(o1) = hash(o2), then V = max(V(o1), V(o2))
Candidate set S is a set of hashed keys
Size of key: log(# of objects in all nodes)
Effect
Maintains correctness of pruning by upper bounds
However, might need an additional round trip
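The collision rule can be illustrated with a short sketch of one node's report. This is an assumption-laden example, not the scheme's actual encoding: the hash function (a truncated SHA-256), the 2-bit key width chosen to force collisions, and the URLs and values are all made up for illustration. `short_key` and `hashed_report` are hypothetical names.

```python
import hashlib


def short_key(obj_id, bits):
    """Map an object ID to a `bits`-wide integer key (illustrative hash choice)."""
    digest = hashlib.sha256(obj_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def hashed_report(scores, bits):
    """Build one node's report as {hashed key: value}, keeping the max on collisions."""
    report = {}
    for obj_id, value in scores.items():
        key = short_key(obj_id, bits)
        report[key] = max(report.get(key, 0), value)
    return report


# Hypothetical per-node values for eight long object IDs.
SCORES = {f"http://example.com/some/long/path/page{i}.html": 10 * i for i in range(1, 9)}
report = hashed_report(SCORES, bits=2)  # only 4 possible keys, so collisions are certain
print(len(SCORES), "objects ->", len(report), "keys")
```

Keeping the maximum means the value reported under a key is never smaller than the true value of any object that hashes to it, so upper bounds computed from hashed reports remain valid. Exact values for colliding objects are no longer known, which is consistent with the slide's note that an extra round trip may be needed.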

18
Evaluating TPUT Algorithm

Trace-driven simulation
Optimality analysis

19
Trace Data for Simulations

NLANR-10: daily web accesses from 10 NLANR proxies
Worldcup-30: 2-hr web accesses from 30 servers hosting 1997 WorldCup
DEC-64: split one-day DEC traces into 64 sub-traces by client IP
Simulating an enterprise with 64 branch offices
DEC-128: split two-day DEC traces into 128 sub-traces by client IP
Simulating an enterprise with 128 branch offices
NLANR-208: split NLANR traces into 208 sub-proxy traces by client IP
Simulating an enterprise CDN of 208 nodes
Berkeley-512: split one-week UCB traces into 512 sub-traces
Simulating 512 branch offices with 16 people per branch

20
Performance Metrics

Communication costs
Messages are always compressed by gzip
Unicast-bytes: assuming CM communicates with nodes via unicast
Multicast-bytes: assuming CM broadcasts to nodes

21
Results on Unicast-Bytes
22
Results on Multicast-Bytes
23
Optimality Analysis

Main results
TPUT is instance optimal for data sets following a log-log slope function.
Zipf distribution is a special case.
Zipf distribution opt-ratio: (m-1)·2m + km
Setting α < 1 reduces cost qualitatively.
Zipf distribution ratio: (m-1)·O(√m) + k·m/α

24
General Instance Optimality

Definition
An algorithm T is instance-optimal with optimality ratio C1 if there exists C2 such that for any data series D and any algorithm A,
cost(T, D) ≤ C1 · cost(A, D) + C2
cost is the amount of network traffic
Threshold Algorithm is instance optimal with opt-ratio O(m^2)

25
Worst Cases for Fixed-Number Round Trip Algorithms

TPUT is not general instance optimal
Nor can any algorithm that terminates in a fixed number of round trips regardless of input

Finding the object with the highest sum:
Node 1: (A, 1) (C, 1) (X1, 0.6) (X2, 0.6) . . . (Xn, 0.6) (B, 0.5) . . .
Node 2: (B, 1) (D, 0.2) . . .
26
Log-Log Slope Function

L(j) is the value at position j in a reverse-sorted list
The list satisfies log-log slope function C(n) if, for all j ≥ k, L(j·C(n)) < L(j)/n
For a Zipf-like distribution L(j) = 1/j^θ, C(n) = n^(1/θ).

[Figure: a reverse-sorted list; the value at position j is L(j), and the value at position j·C(n) is < L(j)/n]
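The slope condition can be checked numerically on a Zipf-like list. The sketch below is illustrative and makes two assumptions: positions are 1-based, and the comparison is non-strict (≤) because an exact Zipf list meets the bound with equality at C(n) = n^(1/θ). Exact fractions are used to avoid floating-point noise. `follows_slope` is a hypothetical helper name.

```python
from fractions import Fraction


def follows_slope(values, k, n, c_n):
    """True if L(j * c_n) <= L(j) / n for every position j >= k that stays in range.

    `values` is a reverse-sorted list and positions are 1-based, so L(j) = values[j - 1].
    """
    for j in range(k, len(values) // c_n + 1):
        if values[j * c_n - 1] > values[j - 1] / n:
            return False
    return True


zipf_1 = [Fraction(1, j) for j in range(1, 1001)]        # L(j) = 1/j,   theta = 1
zipf_2 = [Fraction(1, j * j) for j in range(1, 1001)]    # L(j) = 1/j^2, theta = 2

print(follows_slope(zipf_1, k=1, n=4, c_n=4))  # C(4) = 4^(1/1) = 4
print(follows_slope(zipf_1, k=1, n=4, c_n=3))  # too few positions skipped
print(follows_slope(zipf_2, k=1, n=4, c_n=2))  # C(4) = 4^(1/2) = 2
```

The second call fails because moving only 3j positions down a 1/j list reduces the value by a factor of 3, not the required factor of 4.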
27
Properties of the Two Lower Bounds

E1 ≥ E/m, where E is the true bottom
E2 > E/2
E2 ≥ E1
For any x, V(x) - PS(x) < (m-1)·t ⇒ V(x) - PS(x) < (m-1)·E1/m ⇒ E - E2 < E1·(m-1)/m ⇒ E2 > E - E1·(m-1)/m
E2 > (m/(2m-1))·E
Consequently
Since L(k) ≤ E1 in every node, each node sends at most k·C(m) objects to the manager in round trip 2
A candidate in round trip 3 has average value R > E/2m
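The step from the two inequalities on E2 to the bound E2 > (m/(2m-1))·E can be spelled out as a short worked derivation. It uses only the symbols on this slide and assumes t = E1/m (that is, α = 1) and E > 0:

```latex
\begin{align*}
E_2 &> E - \frac{m-1}{m}\,E_1
      && \text{(from } V(x) - PS(x) < (m-1)\,t,\ t = E_1/m\text{)} \\
    &\ge E - \frac{m-1}{m}\,E_2
      && \text{(since } E_2 \ge E_1\text{)} \\
\Longrightarrow\quad \frac{2m-1}{m}\,E_2 &> E \\
\Longrightarrow\quad E_2 &> \frac{m}{2m-1}\,E \;>\; \frac{E}{2}
\end{align*}
```

The last inequality holds because m/(2m-1) > 1/2 for every m ≥ 1, which gives the weaker bound E2 > E/2 listed first on the slide.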

28
Restricted Instance Optimality of TPUT (α = 1)

Assume D is a collection of m lists all following log-log slope function C(n); then for any algorithm A,
cost(TPUT, D) ≤ cost(A, D) · ((m-1)·C(2m) + C(m)·k)
Proof: assume the optimal algorithm for D stops at position b_i on list i; then L(b_i) < E ⇒ the number of candidates in round-trip 3 is ≤ Σb_i · C(2m)

29
Effect of α < 1

Intuition
if an object appears in few nodes and still makes the cut, then its average value must be high
if an object has a small value and makes the cut, then it must appear in many nodes
Let l_i be the number of objects that appear in exactly i nodes from round-trip 2; then
1·l_1 + 2·l_2 + 3·l_3 + ... + m·l_m ≤ C(m(1+α)/α) · Σb_i
For each i: if an object appears in fewer than i nodes and still makes the cut, then its average value R ≥ E2·(1-α)/i ⇒ l_1 + l_2 + ... + l_i ≤ C(i(1+α)/(1-α)) · Σb_i
Size of candidate set is l_1 + l_2 + ... + l_m

30
Analysis of α < 1

What's the maximum l_1 + l_2 + ... + l_m under the following constraints?
1·l_1 + 2·l_2 + 3·l_3 + ... + m·l_m ≤ C(m(1+α)/α) · B
l_1 ≤ C(1·β) · B
l_1 + l_2 ≤ C(2·β) · B
...
l_1 + l_2 + ... + l_m ≤ C(m·β) · B
where β = (1+α)/(1-α), B = Σb_i
Solution: maximize l_1, l_2, ..., l_d, and set l_(d+1), l_(d+2), ..., l_m to 0
l_i = C(i·β)·B - C((i-1)·β)·B
d·C(d·β)·B - ΣC(i·β)·B ≤ C(m(1+α)/α)·B
Candidate set size S ≤ C(d·β)·B
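The greedy solution above can be evaluated numerically to see how the candidate bound grows with m. The sketch below is illustrative: it works per unit of B, assumes the Zipf case C(n) = n by default, fills l_1, l_2, ... to their caps until the weighted-sum constraint would be violated, and ignores any fractional remainder for l_(d+1). `candidate_bound` is a hypothetical name.

```python
from fractions import Fraction


def candidate_bound(m, alpha, C=lambda n: n):
    """Return (d, C(d * beta)): the cut-off index and the candidate bound per unit of B."""
    beta = (1 + alpha) / (1 - alpha)
    budget = C(m * (1 + alpha) / alpha)   # right-hand side of the weighted-sum constraint
    used, d = 0, 0
    for i in range(1, m + 1):
        l_i = C(i * beta) - C((i - 1) * beta)   # largest l_i allowed by the prefix constraints
        if used + i * l_i > budget:
            break
        used += i * l_i
        d = i
    return d, C(d * beta)


alpha = Fraction(1, 2)
for m in (100, 400):
    print(m, candidate_bound(m, alpha))
```

Quadrupling m from 100 to 400 roughly doubles the bound (39 to 81 per unit of B), which matches growth on the order of √m in the Zipf case.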

31
α For Zipf Distributions

For Zipf distribution, where C(n) = n, size of candidate set is O(√m) · Σb_i
⇒ Optimality ratio for TPUT with α < 1 is (m-1)·c·√m + mk
Optimal α depends on m, but should be > 1/3; default 0.5

32
Summary and Open Questions

TPUT algorithm works well for top-k queries in distributed networks
Introducing α = 0.5 improves performance significantly
TPUT is instance-optimal under the log-log slope function assumption
Easy to extend the algorithm to hierarchical networks
Open question
Is TPUT instance optimal compared with all fixed round trip algorithms over all data sets?

33
Performance of Threshold Algorithm
Trace | Raw Data | K=10 TA UniCast | K=10 TA MultiCast | K=100 TA UniCast | K=100 TA UniCast
NL-10 | 26MB
WC-30
DEC-64
DEC-128
NL-208
UCB-512
34
Unicast-Bytes for Top-100 Objects
35
Multicast-Bytes for Top-100 Objects
36
Fixed-Number Round Trip Algorithms