Efficient Top-k Queries in Large-Scale Networks

Provided by: cao4

Learn more at: http://crypto.stanford.edu

1
Efficient Top-k Queries in Large-Scale Networks

Pei Cao
Cisco Systems, Inc.
Consulting Faculty, Stanford University

2
Motivation

Enterprise content delivery networks (CDNs)
CE: web cache and streaming media cache combined
Number of branches: 50 to 2000

[Diagram: a Central Manager in the data center, connected over 56 kbps, 128 kbps, or DSL links to a CE in each branch office]
3
Top-k Queries in CDNs

Example queries:
Across all CEs, which URLs are accessed most often?
Across all CEs, which domains consume the most storage?
Across all CEs, which cached objects produced the biggest bandwidth savings?
etc.

4
Definitions

A network of $m$ nodes, connected to a central manager (CM)
Each node $i$ has a reverse-sorted list of pairs $(x, V_i(x))$
An object's sum: $V(x) = V_1(x) + V_2(x) + \cdots + V_m(x)$
Problem: find the $k$ objects with the highest sums
Goal: answer this question with minimum network traffic
⇒ A generic problem in distributed systems
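The definitions above can be made concrete with a centralized reference computation. This is a minimal sketch, not part of the talk: it uses the three node lists from the slides' running example, treats the entries hidden behind ". . ." as absent (an assumption), and the names `NODE_LISTS` and `exact_top_k` are hypothetical.

```python
from collections import defaultdict

# Per-node reverse-sorted lists from the slides' running example.
# Entries elided by ". . ." on the slides are left out (assumption).
NODE_LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def exact_top_k(node_lists, k):
    """Compute V(x) = V_1(x) + ... + V_m(x) for every x; return the k largest."""
    sums = defaultdict(int)
    for node_list in node_lists:
        for name, value in node_list:
            sums[name] += value
    # Sort by descending sum; break ties by name so the result is deterministic.
    ranked = sorted(sums.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


print(exact_top_k(NODE_LISTS, 2))
```

Computing the answer this way requires every node to ship its whole list, which is the traffic the rest of the talk tries to avoid.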

5
Existing Methods

Naïve Algorithm
Each node sends the full list of objects and their values to the Central Manager
Threshold Algorithm (TA)
Proposed by multiple groups in the database research community

6
The Threshold Algorithm (TA)

Example: find the top 2 objects with the maximum sums across three lists

Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) (K, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .

Central Manager (CM):
T = 30: V(A) = 20, V(C) = 19, V(B) = 18
T = 26: V(A) = 20, V(C) = 19
T = 24: V(F) = 22, V(A) = 20
T = 21: V(F) = 22, V(A) = 20
T = 18: V(F) = 22, V(A) = 20
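The threshold sequence in this example can be reproduced with a small centralized sketch of TA. It is an illustration under assumptions, not the talk's implementation: objects missing from a node's (truncated) list are treated as having value 0 there, and `threshold_algorithm` is a hypothetical name.

```python
NODE_LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def threshold_algorithm(node_lists, k):
    """Walk all lists in lockstep; stop once the k-th best sum reaches T."""
    lookup = [dict(node_list) for node_list in node_lists]
    sums = {}        # exact sums of every object seen so far
    thresholds = []  # T after each depth, kept for inspection
    top = []
    for depth in range(max(len(node_list) for node_list in node_lists)):
        threshold = 0
        for node_list in node_lists:
            if depth >= len(node_list):
                continue  # exhausted list contributes 0 to T
            name, value = node_list[depth]
            threshold += value
            if name not in sums:
                # Random lookup: missing entries count as 0 (assumption).
                sums[name] = sum(values.get(name, 0) for values in lookup)
        thresholds.append(threshold)
        top = sorted(sums.items(), key=lambda item: (-item[1], item[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break  # no unseen object can beat the current k-th sum
    return top, thresholds


top, thresholds = threshold_algorithm(NODE_LISTS, 2)
print(thresholds)  # T after each depth
print(top)
```

The run stops at the first depth where the second-best known sum is at least T, which is why the number of steps depends on the data.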
7
Adapting TA for Distributed Environments

Consists of multiple rounds, each round having two round trips
Round-trip 1 (sorted access): the CM asks for the next B objects on the lists, and the nodes respond
Round-trip 2 (random lookup): the CM sends a list of object names to the nodes, and the nodes supply the values
B = k
Issues:
Number of rounds is unpredictable
$O(m^2)$ network traffic
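A rough sketch of this batched variant, under stated assumptions: the nodes are simulated in-process, each loop iteration stands for one round (two round trips), missing entries count as 0, and `distributed_ta` is a hypothetical name. With B = k = 2 on the running example, the recorded rounds agree with the "TA Running over Networks" backup slide.

```python
NODE_LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def distributed_ta(node_lists, k, batch=None):
    """Return (top-k, rounds); each round is (threshold, names looked up)."""
    batch = k if batch is None else batch  # the slide uses B = k
    lookup = [dict(node_list) for node_list in node_lists]
    sums, rounds, top = {}, [], []
    position = 0
    longest = max(len(node_list) for node_list in node_lists)
    while position < longest:
        # Round trip 1 (sorted access): every node sends its next `batch` objects.
        new_names, threshold = set(), 0
        for node_list in node_lists:
            chunk = node_list[position:position + batch]
            new_names.update(name for name, _ in chunk if name not in sums)
            if len(chunk) == batch:
                threshold += chunk[-1][1]  # last value sent bounds unseen ones
        # Round trip 2 (random lookup): fetch exact values of the new names.
        for name in new_names:
            sums[name] = sum(values.get(name, 0) for values in lookup)
        rounds.append((threshold, sorted(new_names)))
        top = sorted(sums.items(), key=lambda item: (-item[1], item[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break
        position += batch
    return top, rounds


top, rounds = distributed_ta(NODE_LISTS, 2)
for threshold, names in rounds:
    print(threshold, names)
```

The length of `rounds` is only known after the run, which is the unpredictability listed under Issues.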

8
New Algorithm: Three-Phase Uniform Threshold (TPUT)

Motivation: terminate in a fixed number of round trips regardless of input
Operates in three phases:
Lower-bound estimation
Pruning
Final lookup

9
Partial Sums and Upper Bounds

Partial sum: $PS(x) = \sum_i V'_i(x)$
Upper bound: $U(x) = \sum_i U_i(x)$

$V'_i(x) = V_i(x)$ if $x$ has been reported by node $i$ to the CM, and $0$ otherwise
$U_i(x) = V_i(x)$ if $x$ has been reported by node $i$ to the CM, and $T_i$ otherwise
$T_i$: node $i$ sends all objects with values $> T_i$
10
Examples
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .

CM:
PS(A) = 10 + 0 + 9 = 19, U(A) = 10 + 9 + 9 = 28
PS(B) = 0 + 10 + 0 = 10, U(B) = 8 + 10 + 9 = 27

⇒ For any object O, PS(O) ≤ V(O) ≤ U(O)
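The two bounds can be computed directly from what each node has reported. The sketch below assumes the state implied by the slide's numbers: every node has reported its top 2 objects, and $T_i$ is the smallest value node $i$ reported. The names `REPORTS`, `THRESHOLDS`, and `partial_sum_and_upper_bound` are hypothetical.

```python
# What each node has reported so far (its top 2 objects), and the
# per-node threshold T_i = smallest reported value (assumption that
# matches the numbers on the slide).
REPORTS = [
    {"A": 10, "C": 8},   # node 1
    {"B": 10, "D": 9},   # node 2
    {"C": 10, "A": 9},   # node 3
]
THRESHOLDS = [8, 9, 9]


def partial_sum_and_upper_bound(name, reports, thresholds):
    """PS counts unreported values as 0; U counts them as T_i."""
    partial = sum(report.get(name, 0) for report in reports)
    upper = sum(report.get(name, t) for report, t in zip(reports, thresholds))
    return partial, upper


for name in ("A", "B"):
    print(name, partial_sum_and_upper_bound(name, REPORTS, THRESHOLDS))
```

Because an unreported value lies between 0 and $T_i$, the true sum is always bracketed by the two results.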
11
Steps in TPUT

Phase 1
Manager → Nodes: start top-k query
Nodes → Manager: here are my top-k objects
Manager:
Calculate partial sums of all objects
Take the kth partial sum $E_1$ ($E_1 \le E$); set $t = E_1/m$
Phase 2
Manager → Nodes: send me all objects with value $\ge t$
Nodes → Manager: here they are
Manager:
Calculate partial sums again; take the kth partial sum $E_2$ ($E_1 \le E_2 \le E$)
Calculate upper bounds of all objects
S = objects whose upper bounds are $\ge E_2$
Phase 3
Manager → Nodes: here is S; send me the values of all objects in S
Nodes → Manager: here they are
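The three phases can be sketched end to end as follows. This is an illustrative in-process simulation rather than the author's code: it assumes each list is sorted in descending order, that at least k distinct objects are reported in Phase 1, and that objects missing from a list have value 0. The function name `tput` and its return shape are hypothetical.

```python
NODE_LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def tput(node_lists, k):
    """Return (top-k, (E1, t, E2, candidate set S))."""
    m = len(node_lists)
    lookup = [dict(node_list) for node_list in node_lists]

    def partial_sums(reports):
        sums = {}
        for report in reports:
            for name, value in report.items():
                sums[name] = sums.get(name, 0) + value
        return sums

    # Phase 1: every node reports its top-k objects.
    reports = [dict(node_list[:k]) for node_list in node_lists]
    e1 = sorted(partial_sums(reports).values(), reverse=True)[k - 1]
    t = e1 / m

    # Phase 2: every node reports all objects with value >= t.
    for report, node_list in zip(reports, node_lists):
        report.update((name, value) for name, value in node_list if value >= t)
    partial = partial_sums(reports)
    e2 = sorted(partial.values(), reverse=True)[k - 1]
    # Unreported values are below t, so t is a valid per-node upper bound.
    upper = {name: sum(r.get(name, t) for r in reports) for name in partial}
    candidates = sorted(name for name in partial if upper[name] >= e2)

    # Phase 3: look up the exact sums of the candidates only.
    exact = {name: sum(v.get(name, 0) for v in lookup) for name in candidates}
    top = sorted(exact.items(), key=lambda item: (-item[1], item[0]))[:k]
    return top, (e1, t, e2, candidates)


top, (e1, t, e2, candidates) = tput(NODE_LISTS, 2)
print(e1, t, e2, candidates)
print(top)
```

Each phase is exactly one request and one response per node, so the number of round trips does not depend on the data.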

12
Example
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .

CM:
Final sums: S(F) = 22, S(A) = 20, S(C) = 19
The top 2 objects are F and A.
13
Improving the Pruning Power

Set $t = (E_1/m) \cdot \alpha$, where $0 < \alpha < 1$

[Diagram labels: $E_2/m$, $t$]
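The trade-off behind a smaller threshold can be seen on the running example. The sketch below is an assumption-laden illustration: it simulates Phases 1 and 2 in-process with $t = \alpha E_1 / m$, counts every object at or above $t$ as sent in Phase 2 (including ones already sent in Phase 1), and uses the hypothetical name `phase2_summary`.

```python
NODE_LISTS = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def phase2_summary(node_lists, k, alpha=1.0):
    """Return (objects sent in Phase 2, candidate set) for t = alpha * E1 / m."""
    m = len(node_lists)

    def partial_sums(reports):
        sums = {}
        for report in reports:
            for name, value in report.items():
                sums[name] = sums.get(name, 0) + value
        return sums

    reports = [dict(node_list[:k]) for node_list in node_lists]  # Phase 1
    e1 = sorted(partial_sums(reports).values(), reverse=True)[k - 1]
    t = alpha * e1 / m

    sent = 0
    for report, node_list in zip(reports, node_lists):           # Phase 2
        above = [(name, value) for name, value in node_list if value >= t]
        sent += len(above)
        report.update(above)

    partial = partial_sums(reports)
    e2 = sorted(partial.values(), reverse=True)[k - 1]
    candidates = sorted(
        name for name in partial
        if sum(report.get(name, t) for report in reports) >= e2
    )
    return sent, candidates


for alpha in (1.0, 0.5):
    print(alpha, phase2_summary(NODE_LISTS, 2, alpha))
```

On this data, lowering $\alpha$ makes the nodes send a few more objects in Phase 2 but shrinks the candidate set that Phase 3 has to look up.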
14
Compression via Hashing

Problem: object IDs can be too long
Solution: send hashed keys of object IDs
Nodes report to the CM: (hash(o), V(o))
If hash(o1) = hash(o2), then V = max(V(o1), V(o2))
Candidate set S is a set of hashed keys
Size of key: log(total number of objects in all nodes)
Effect:
The algorithm is still correct
However, it might need an additional round trip
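A node-side sketch of the hashed report is shown below. The slide does not name a hash function or a logarithm base, so SHA-1 truncated to a fixed number of bits and base 2 are assumptions made only for illustration; `key_bits`, `short_key`, and `hashed_report` are hypothetical names.

```python
import hashlib
import math


def key_bits(total_objects):
    """Key size of about log2(total number of objects); base 2 is an assumption."""
    return max(1, math.ceil(math.log2(total_objects)))


def short_key(object_id, bits):
    """Map an object ID to a `bits`-bit integer key (1 <= bits <= 64)."""
    digest = hashlib.sha1(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


def hashed_report(pairs, bits):
    """Build {hash(o): V(o)}; colliding objects keep the maximum value."""
    report = {}
    for object_id, value in pairs:
        key = short_key(object_id, bits)
        report[key] = max(report.get(key, 0), value)
    return report


pairs = [("http://a.example/x", 10), ("http://b.example/y", 8), ("http://c.example/z", 7)]
print(hashed_report(pairs, key_bits(1_000_000)))
```

Keeping the maximum on a collision means a key's reported value never understates any object behind it, so the upper bounds used for pruning stay valid.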

15
Evaluating TPUT Algorithm

Trace-driven simulation
Optimality analysis

16
Trace Data for Simulations
NLANR-10: daily web access from 10 NLANR proxies
WorldCup-30: 2-hr logs from 30 WorldCup web servers
DEC-64: split 1-day DEC proxy traces into 64 sub-traces by client IP
DEC-128: split 2-day DEC proxy traces into 128 sub-traces by client IP
NLANR-203: split NLANR traces into 203 sub-proxy traces by client IP
Berkeley-512: split one week of UCB traces into 512 sub-traces by client IP
17
Performance Metrics

Communication costs
Unicast-bytes
Multicast-bytes
Messages are all compressed by gzip

18
Results on Unicast-Bytes
[Figure: unicast bytes, one panel per trace: m = 10, 30, 64, 128, 203, 512]
19
Number of Objects Looked-Up
Trace          K=10 TA   K=10 TPUT/0.5   K=100 TA   K=100 TPUT/0.5
NLANR-10           166              18       1486              176
WorldCup-30         46              12        238              101
DEC-64            3164              31       9817              244
DEC-128           6928              28      26680              250
NLANR-203         5576              28      43954              238
Berkeley-512     47899              41     180550              132
20
Results on Multicast-Bytes
[Figure: multicast bytes, one panel per trace: m = 10, 30, 64, 128, 203, 512]
21
Optimality Analysis

Main results:
TPUT is instance optimal for data sets with a log-log slope function C(n)
Zipf distribution: $C(n) = n$
Zipf distribution: opt-ratio $= (m-1) \cdot 2m + km$
Setting $\alpha < 1$ reduces cost qualitatively.
Zipf distribution: opt-ratio $= (m-1) \cdot O(\sqrt{m}) + km/\alpha$

22
General Instance Optimality

Definition:
An algorithm R is instance-optimal with optimality ratio $C_1$ if there exists $C_2$ such that, for any data series D and any algorithm A,
$\mathrm{cost}(R, D) \le C_1 \cdot \mathrm{cost}(A, D) + C_2$
Cost is the amount of network traffic
TA is instance optimal with opt-ratio $O(m^2)$

23
Worst Cases for Fixed Number Round-Trip Algorithms

TPUT is not general instance optimal
Nor is any algorithm that terminates in a fixed number of round trips regardless of input

Example: finding the object with the highest sum
Node 1: (A, 1) (C, 1) (X1, 0.6) (X2, 0.6) . . . (Xn, 0.6) (B, 0.5) . . .
Node 2: (B, 1) (D, 0.2) . . .
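One way to read this instance is to count what TPUT's Phase 2 would send on it for k = 1. The sketch below is an interpretation rather than something stated on the slide: it applies the Phase 1 and Phase 2 rules with $t = E_1/m$, treats entries hidden behind ". . ." as absent, and uses the hypothetical names `worst_case` and `phase2_objects_sent`.

```python
def worst_case(n):
    """Build the two-node instance from the slide with n filler objects X1..Xn."""
    node1 = [("A", 1), ("C", 1)]
    node1 += [(f"X{i}", 0.6) for i in range(1, n + 1)]
    node1 += [("B", 0.5)]
    node2 = [("B", 1), ("D", 0.2)]
    return [node1, node2]


def phase2_objects_sent(node_lists, k):
    """Number of objects each node sends in Phase 2 when t = E1 / m."""
    m = len(node_lists)
    partial = {}
    for node_list in node_lists:
        for name, value in node_list[:k]:  # Phase 1: top-k per node
            partial[name] = partial.get(name, 0) + value
    e1 = sorted(partial.values(), reverse=True)[k - 1]
    t = e1 / m
    return [sum(1 for _, value in node_list if value >= t) for node_list in node_lists]


for n in (10, 100, 1000):
    print(n, phase2_objects_sent(worst_case(n), 1))
```

Under these assumptions the Phase 2 traffic from node 1 grows linearly with n, even though the winning object B is within one entry of the top of node 2's list.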
24
Log-Log Slope Function

L(j) is the value at position j in a reverse-sorted list
The list satisfies log-log slope function C(n) if, for all $j \ge k$, $L(j \cdot C(n)) < L(j)/n$
For a Zipf-like distribution $L(j) \propto 1/j^{\theta}$, $C(n) = n^{1/\theta}$

[Diagram: a reverse-sorted list; position $j$ holds the value $L(j)$, and position $j \cdot C(n)$ holds a value $< L(j)/n$]
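The Zipf case can be checked in one line. In the sketch below the exponent is written as $\theta$ and the scale constant as $c$; both symbols are assumptions introduced here for illustration.

```latex
L(j) = \frac{c}{j^{\theta}}
\quad\Longrightarrow\quad
L\!\left(j \cdot n^{1/\theta}\right)
  = \frac{c}{\left(j \cdot n^{1/\theta}\right)^{\theta}}
  = \frac{c}{j^{\theta}\, n}
  = \frac{L(j)}{n}
```

So moving $n^{1/\theta}$ times further down a Zipf-like list divides the value by exactly $n$, which makes $C(n) = n^{1/\theta}$ the boundary case of the slope condition; with $\theta = 1$ this gives $C(n) = n$.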
25
Properties of the Two Lower Bounds

Let $E$ be the true bottom
$E_1 \ge E/m$
$E_2 > E/2$
$E_2 \ge E_1$
$E_2 > E - E_1 (m-1)/m$
For any $x$, $V(x) - PS(x) < (m-1)t$ ⇒ $V(x) - PS(x) < (m-1) E_1/m$ ⇒ $E - E_2 < E_1 (m-1)/m$
$E_2 > (m/(2m-1)) E$
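The last bound follows from the two lines before it. A short derivation, using only the inequalities listed on this slide:

```latex
E_2 \;>\; E - \frac{m-1}{m}\,E_1
    \;\ge\; E - \frac{m-1}{m}\,E_2
\quad\Longrightarrow\quad
\frac{2m-1}{m}\,E_2 \;>\; E
\quad\Longrightarrow\quad
E_2 \;>\; \frac{m}{2m-1}\,E \;>\; \frac{E}{2}
```

The middle step substitutes $E_1 \le E_2$, and the final comparison holds because $m/(2m-1) > 1/2$ for every $m \ge 1$.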

26
Restricted Instance Optimality of TPUT (α = 1)

Assume D is a collection of m lists all following log-log slope function C(n); then, for any algorithm A,
$\mathrm{cost}(TPUT, D) \le \mathrm{cost}(A, D) \cdot ((m-1)\, C(2m) + C(m)\, k)$
Proof: assume the optimal algorithm for D stops at position $b_i$ on list $i$; then $L(b_i) < E$
The number of objects in S from node $i$ is $\le b_i \cdot C(2m)$
Each node sends $\le C(m) \cdot k$ objects in round-trip 2

27
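Effect of α < 1

Property:
If object $x$ appears in $n$ nodes in Phase 2 and $U(x) \ge E_2$, then its average value in those nodes $R(x) \ge E_2 (1-\alpha)/n$
Let $l_i$ = the number of objects in S that appear in exactly $i$ nodes in Phase 2; then
$1 \cdot l_1 + 2 \cdot l_2 + 3 \cdot l_3 + \cdots + m \cdot l_m \le C(m(1+\alpha)/\alpha) \sum_j b_j$
$l_1 + l_2 + \cdots + l_i \le C(i(1+\alpha)/(1-\alpha)) \sum_j b_j$
Size of S is $l_1 + l_2 + \cdots + l_m$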

28
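Analysis of α < 1

What's the maximum $l_1 + l_2 + \cdots + l_m$ under the following constraints?
$1 \cdot l_1 + 2 \cdot l_2 + 3 \cdot l_3 + \cdots + m \cdot l_m \le C(m(1+\alpha)/\alpha)\, B$
$l_1 \le C(1 \cdot \beta)\, B$
$l_1 + l_2 \le C(2\beta)\, B$
...
$l_1 + l_2 + \cdots + l_m \le C(m\beta)\, B$
where $\beta = (1+\alpha)/(1-\alpha)$, $B = \sum_j b_j$
Solution: maximize $l_1, l_2, \ldots, l_d$, and set $l_{d+1}, l_{d+2}, \ldots, l_m$ to 0
$l_i = C(i\beta)\, B - C((i-1)\beta)\, B$
$d \cdot C(d\beta)\, B - \sum_{i=1}^{d-1} C(i\beta)\, B = C(m(1+\alpha)/\alpha)\, B$
Candidate set size $|S| \le C(d\beta)\, B$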

29
α for Zipf Distributions

For Zipf distribution, where $C(n) = n$, the size of candidate set S is $c\sqrt{m}\, B$
⇒ Optimality ratio for TPUT with $\alpha < 1$ is $(m-1)\, c \sqrt{m} + (m/\alpha)\, k$
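The $\sqrt{m}$ growth can be traced back to the constraint system on the preceding slide. The following is a rough derivation under assumptions that are not on the slides: $C(n) = n$ with $C(0) = 0$, $d$ treated as a continuous quantity, and $d(d+1) \approx d^2$. The closed form for $c$ is only what this approximation yields.

```latex
% With C(n) = n, each l_i = i\beta B - (i-1)\beta B = \beta B, so
\sum_{i=1}^{d} i\, l_i
  = \beta B\, \frac{d(d+1)}{2}
  = \frac{m(1+\alpha)}{\alpha}\, B
\quad\Longrightarrow\quad
d(d+1) = \frac{2m(1+\alpha)}{\alpha \beta} = \frac{2m(1-\alpha)}{\alpha}

% Approximating d(d+1) by d^2:
d \approx \sqrt{\frac{2m(1-\alpha)}{\alpha}},
\qquad
|S| \le C(d\beta)\, B = d\, \beta\, B
  \approx \underbrace{\beta \sqrt{\frac{2(1-\alpha)}{\alpha}}}_{c}\; \sqrt{m}\; B
```

With $\beta = (1+\alpha)/(1-\alpha)$, the constant depends only on $\alpha$, so for a fixed $\alpha$ the candidate set grows like $\sqrt{m}\, B$ instead of the $2m\, B$ implied by the α = 1 bound with $C(2m) = 2m$.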

30
TPUT for Hierarchical Networks
Phase 1: Lower-Bound Estimation
Phase 2: Selection by value (Pruning)
Phase 3: Final lookup

[Diagram: a hierarchy of nodes annotated with the candidate set S and the thresholds t = (E/m) α and t = (E/mn) α]
31
Summary and Future Work

TPUT should be used for top-k queries in distributed networks
TPUT is instance-optimal under the log-log slope function assumption
Introducing α < 1 improves performance significantly
Future work:
Evaluating TPUT for hierarchical and P2P networks
Distributed algorithms for other aggregate statistics

32
Backup Slides
33
Bandwidth Consumption of Threshold Algorithm
Trace     Raw Data   K=10 TA Unicast   K=10 TA Multicast   K=100 TA Unicast   K=100 TA Multicast
NL-10     26MB       56.3KB            25.9KB              318KB              132KB
WC-30     426KB      31KB              22KB                96KB               80KB
DEC-64    7.4MB      1.7MB             160KB               4.6MB              359KB
DEC-128   15MB       7.2MB             419KB               24.6MB             1.2MB
NL-203    44MB       22MB              1.2MB               143MB              4.2MB
UCB-512   78MB       423MB             16.1MB              1.47GB             31MB
34
Bandwidth Consumption of TPUTHash
Trace     Raw Data   K=10 TPUT-H Unicast   K=10 TPUT-H Multicast   K=100 TPUT-H Unicast   K=100 TPUT-H Multicast
NL-10     26MB       8KB                   7KB                     52KB                   49KB
WC-30     426KB      44KB                  38KB                    99KB                   89KB
DEC-64    7.4MB      64KB                  59KB                    322KB                  300KB
DEC-128   15MB       161KB                 150KB                   870KB                  828KB
NL-203    44MB       154KB                 139KB                   764KB                  687KB
UCB-512   78MB       1.03MB                978KB                   15.8MB                 15.3MB
35
Unicast-Bytes for Top-100 Objects
36
Multicast-Bytes for Top-100 Objects
37
Varying α
38
Fixed-Number Round Trip Algorithms

Criteria by which a node decides to send objects:
By position
By name
By value
Any fixed-number round trip algorithm must include a by-value operation
Any algorithm that includes a by-value operation won't be instance optimal

39
TA Running over Networks
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .

CM:
T = 26: looks up A, B, C, D ⇒ V(A) = 20, V(C) = 19; can't stop
T = 21: looks up E, F, G, H, J ⇒ V(F) = 22, V(A) = 20; can't stop
T = 10: stop
40
TPUT