Efficient Top-k Queries in Large-Scale Networks - PowerPoint PPT Presentation

Slides: 41
Provided by: cao4
1
Efficient Top-k Queries in Large-Scale Networks
  • Pei Cao
  • Cisco Systems, Inc.
  • Consulting Faculty, Stanford University

2
Motivation
  • Enterprise content delivery networks (CDNs)
  • CE: web cache and streaming media cache combined
  • Number of branches: 50 to 2000

[Figure: a Central Manager in the data center connected to CEs in branch offices over 56 Kbps, 128 Kbps, or DSL links]
3
Top-k Queries in CDNs
  • Example queries
  • Across all CEs, which URLs are accessed most
    often?
  • Across all CEs, which domains consume the most
    storage?
  • Across all CEs, which cached objects produced the
    biggest bandwidth savings?
  • etc.

4
Definitions
  • A network of m nodes, connected to a central
    manager (CM)
  • Each node i has a reverse-sorted list of pairs
    $(x, V_i(x))$
  • An object's sum:
  • $V(x) = V_1(x) + V_2(x) + \dots + V_m(x)$
  • Problem: find the k objects with the highest sums
  • Goal: answer this question with minimum network
    traffic
  • ⇒ A generic problem in distributed systems

5
Existing Methods
  • Naïve Algorithm
  • Each node sends the full list of objects and
    their values to the Central Manager
  • Threshold Algorithm (TA)
  • Proposed by multiple groups in the database
    research community

6
The Threshold Algorithm (TA)
  • Example: find the top 2 objects with the maximum
    sums across three columns

Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) (K, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
Central Manager (CM), threshold T after each row:
T = 30: V(A) = 20, V(C) = 19, V(B) = 18
T = 26: V(A) = 20, V(C) = 19
T = 24: V(F) = 22, V(A) = 20
T = 21: V(F) = 22, V(A) = 20
T = 18: V(F) = 22, V(A) = 20
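
The trace on this slide can be reproduced with a short Python sketch. It is only an illustration, not the presenter's code: the function name `threshold_algorithm` is made up here, the elided list tails (". . .") are assumed to be empty, and the scan reads one row (one entry per node) per step.

```python
# Lists from the slide; the elided tails (". . .") are treated as empty (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def threshold_algorithm(nodes, k):
    """Return (top-k list, thresholds seen), scanning one row per step."""
    lookup = [dict(lst) for lst in nodes]  # random access by object name
    sums = {}                              # full sums of the objects seen so far
    thresholds = []
    for depth in range(max(len(lst) for lst in nodes)):
        row = [lst[depth] for lst in nodes if depth < len(lst)]
        for name, _ in row:
            if name not in sums:
                sums[name] = sum(d.get(name, 0) for d in lookup)
        # No unseen object can have a sum above the sum of this row's values.
        threshold = sum(value for _, value in row)
        thresholds.append(threshold)
        top = sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top, thresholds
    return sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))[:k], thresholds


top, thresholds = threshold_algorithm(NODES, k=2)
print(top)         # [('F', 22), ('A', 20)]
print(thresholds)  # [30, 26, 24, 21, 18]
```

The scan stops at T = 18 because the second-best known sum, V(A) = 20, is no longer below the threshold.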
7
Adapting TA for Distributed Environments
  • Consists of multiple rounds, each round having
    two round trips
  • Round-trip 1 (sorted access): CM asks for the
    next B objects on the lists and nodes respond
  • Round-trip 2 (random lookup): CM sends a list of
    object names to nodes and nodes supply values
  • B = k
  • Issues
  • Number of rounds is unpredictable
  • $O(m^2)$ network traffic
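
The round structure can be sketched in Python as follows. This is a hedged illustration under stated assumptions: `distributed_ta` and its `batch` parameter (the slide's B) are names invented here, the three lists are the ones used in the deck's TA example, and elided list tails are treated as empty.

```python
# Example lists from the deck; elided tails are treated as empty (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1), ("K", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def distributed_ta(nodes, k, batch):
    """Return (top-k list, rounds used, final threshold)."""
    lookup = [dict(lst) for lst in nodes]
    sums = {}
    pos = 0
    rounds = 0
    while True:
        rounds += 1
        # Round-trip 1 (sorted access): the next `batch` entries of every list.
        new_names = set()
        for lst in nodes:
            for name, _ in lst[pos:pos + batch]:
                if name not in sums:
                    new_names.add(name)
        pos += batch
        # Round-trip 2 (random lookup): nodes supply values for the new names.
        for name in new_names:
            sums[name] = sum(d.get(name, 0) for d in lookup)
        # Threshold: sum of the last value seen on each list (0 once exhausted).
        threshold = sum(lst[pos - 1][1] if pos <= len(lst) else 0 for lst in nodes)
        top = sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        exhausted = all(pos >= len(lst) for lst in nodes)
        if (len(top) == k and top[-1][1] >= threshold) or exhausted:
            return top, rounds, threshold


result = distributed_ta(NODES, k=2, batch=2)
print(result)  # ([('F', 22), ('A', 20)], 3, 10)
```

With B = k = 2 the sketch needs three rounds, and the thresholds after each round (26, 21, 10) agree with the backup slide "TA Running over Networks". The number of rounds depends on the data, which is the first issue listed above.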

8
New Algorithm: Three-Phase Uniform Threshold (TPUT)
  • Motivation: terminate in a fixed number of round
    trips regardless of input
  • Operates in three phases:
  • Lower-bound estimation
  • Pruning
  • Final lookup

9
Partial Sums and Upper Bounds
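  • Partial sum: $PS(x) = \sum_i V'_i(x)$
  • Upper bound: $U(x) = \sum_i U_i(x)$

$V'_i(x) = \begin{cases} V_i(x), & \text{if } x \text{ has been reported by node } i \text{ to CM} \\ 0, & \text{otherwise} \end{cases}$

$U_i(x) = \begin{cases} V_i(x), & \text{if } x \text{ has been reported by node } i \text{ to CM} \\ T_i, & \text{otherwise} \end{cases}$

$T_i$: node $i$ sends all objects with values $> T_i$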
10
Examples
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
CM:
PS(A) = 10 + 0 + 9 = 19, U(A) = 10 + 9 + 9 = 28
PS(B) = 0 + 10 + 0 = 10, U(B) = 8 + 10 + 9 = 27
For any object O, $PS(O) \le V(O) \le U(O)$
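
The numbers in this example are consistent with each node having reported its top two entries, so that the per-node thresholds are 8, 9, and 9. Under that assumption (it is inferred from the arithmetic, not stated on the slide), a small Python sketch recomputes the partial sums and upper bounds. The helper name `bounds` is hypothetical.

```python
# Example lists from the slide; elided tails are treated as empty (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def bounds(nodes, reported_per_node, name):
    """Return (PS, U) for `name` when each node has reported its first entries."""
    ps = ub = 0
    for lst in nodes:
        reported = dict(lst[:reported_per_node])
        t_i = lst[reported_per_node - 1][1]  # smallest value node i has reported
        if name in reported:
            ps += reported[name]
            ub += reported[name]
        else:
            ub += t_i  # an unreported value cannot exceed T_i
    return ps, ub


bounds_a = bounds(NODES, 2, "A")
bounds_b = bounds(NODES, 2, "B")
print(bounds_a, bounds_b)  # (19, 28) (10, 27)
```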
11
Steps in TPUT
  • Phase 1
  • Manager → Nodes: start top-k query
  • Nodes → Manager: here are my top-k objects
  • Manager:
  • Calculate partial sums of all objects
  • Take the kth partial sum $E_1$ ($E_1 \le E$, where E is
    the true kth-highest sum); set $t = E_1/m$
  • Phase 2
  • Manager → Nodes: send me all objects with value
    $\ge t$
  • Nodes → Manager: here they are
  • Manager:
  • Calculate partial sums again; take the kth
    partial sum $E_2$ ($E_1 \le E_2 \le E$)
  • Calculate upper bounds of all objects
  • S = objects whose upper bounds are $\ge E_2$
  • Phase 3
  • Manager → Nodes: here is S; send me all objects
    in S
  • Nodes → Manager: here they are
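
The three phases can be sketched end to end in Python. This is a minimal single-process illustration, not the presenter's implementation: the function name `tput` is made up, the lists are the deck's running example with elided tails treated as empty, and the "network" is just in-memory lists.

```python
# Example lists from the deck; elided tails are treated as empty (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def tput(nodes, k):
    """Return (top-k list, E1, t, E2, candidate set S)."""
    m = len(nodes)

    def partial_sums(reported):
        sums = {}
        for rep in reported:
            for name, value in rep.items():
                sums[name] = sums.get(name, 0) + value
        return sums

    def kth_largest(sums):
        return sorted(sums.values(), reverse=True)[k - 1]

    # Phase 1: every node reports its top-k objects.
    reported = [dict(lst[:k]) for lst in nodes]
    e1 = kth_largest(partial_sums(reported))
    t = e1 / m

    # Phase 2: every node reports all objects with value >= t.
    for rep, lst in zip(reported, nodes):
        rep.update((name, value) for name, value in lst if value >= t)
    ps = partial_sums(reported)
    e2 = kth_largest(ps)
    # An unreported value is below t, so t bounds it from above.
    upper = {name: sum(rep.get(name, t) for rep in reported) for name in ps}
    candidates = {name for name, u in upper.items() if u >= e2}

    # Phase 3: look up the exact sums of the candidates only.
    full = [dict(lst) for lst in nodes]
    exact = {name: sum(d.get(name, 0) for d in full) for name in candidates}
    top = sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return top, e1, t, e2, candidates


top, e1, t, e2, candidates = tput(NODES, k=2)
print(top)         # [('F', 22), ('A', 20)]
print(e1, t, e2)   # 18 6.0 19
```

On this input the candidate set S has eight objects (A, B, C, D, E, F, G, J); only H is pruned before the final lookup.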

12
Example
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
CM: S(F) = 22, S(A) = 20, S(C) = 19. The top 2 objects
are F and A.
13
Improving the Pruning Power
  • Set $t = (E_1/m)\,\alpha$, where $0 < \alpha < 1$

[Figure: a sorted list with the positions of $E_2/m$ and $t$ marked]
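
To see the effect of the scaling factor on pruning, the sketch below runs only the first two phases on the deck's three example lists and returns the candidate set. It is an illustration under assumptions: `candidate_set` is a hypothetical helper, elided list tails are treated as empty, and the scaling factor is passed as `alpha`.

```python
# Example lists from the deck; elided tails are treated as empty (assumption).
NODES = [
    [("A", 10), ("C", 8), ("E", 8), ("F", 8), ("B", 7), ("D", 5), ("J", 1)],
    [("B", 10), ("D", 9), ("F", 8), ("H", 6), ("G", 5), ("C", 1), ("A", 1)],
    [("C", 10), ("A", 9), ("G", 8), ("J", 7), ("F", 6), ("D", 4), ("B", 1)],
]


def candidate_set(nodes, k, alpha):
    """Run phases 1 and 2 with t = alpha * E1 / m; return (t, sorted candidates)."""
    m = len(nodes)

    def kth_partial_sum(reported):
        sums = {}
        for rep in reported:
            for name, value in rep.items():
                sums[name] = sums.get(name, 0) + value
        return sorted(sums.values(), reverse=True)[k - 1], sums

    reported = [dict(lst[:k]) for lst in nodes]          # phase 1 reports
    e1, _ = kth_partial_sum(reported)
    t = alpha * e1 / m
    for rep, lst in zip(reported, nodes):                # phase 2 reports
        rep.update((name, value) for name, value in lst if value >= t)
    e2, ps = kth_partial_sum(reported)
    upper = {name: sum(rep.get(name, t) for rep in reported) for name in ps}
    return t, sorted(name for name, u in upper.items() if u >= e2)


plain = candidate_set(NODES, k=2, alpha=1.0)
scaled = candidate_set(NODES, k=2, alpha=0.5)
print(plain)   # (6.0, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'J'])
print(scaled)  # (3.0, ['A', 'B', 'C', 'F'])
```

On this small input, halving the threshold makes nodes send a few more objects in phase 2, but the candidate set for the final lookup shrinks from eight objects to four.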
14
Compression via Hashing
  • Problem: object IDs can be too long
  • Solution: send hashed keys of object IDs
  • Node reports to CM: (hash(o), V(o))
  • If hash(o1) = hash(o2), then V = max(V(o1), V(o2))
  • Candidate set S is a set of hashed keys
  • Size of key: log(total number of objects in all
    nodes)
  • Effect
  • Algorithm is still correct
  • However, might need an additional round trip
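
A node-side report with hashed keys might look like the Python sketch below. The slides do not name a hash function, so the truncated SHA-1 digest, the `key_bits` parameter, and both function names are assumptions made for illustration only.

```python
import hashlib


def hashed_key(object_id: str, key_bits: int) -> int:
    """Map an object ID to a key of `key_bits` bits (1 to 160).

    Truncated SHA-1 is an arbitrary choice for this sketch.
    """
    digest = hashlib.sha1(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (160 - key_bits)


def hashed_report(objects, key_bits):
    """Build {hashed key: value}; colliding objects share the maximum value."""
    report = {}
    for object_id, value in objects:
        key = hashed_key(object_id, key_bits)
        report[key] = max(report.get(key, value), value)
    return report


# A 1-bit key forces collisions among three objects, which shows the max rule.
report = hashed_report(
    [("example.com/a", 5), ("example.com/b", 7), ("example.com/c", 3)],
    key_bits=1,
)
print(report)
```

Taking the maximum means the value reported for a key is at least the true value of every object that shares the key, so a reported value never understates an object.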

15
Evaluating TPUT Algorithm
  • Trace-driven simulation
  • Optimality analysis

16
Trace Data for Simulations
NLANR-10: daily web access from 10 NLANR proxies
WorldCup-30: 2-hr logs from 30 WorldCup web servers
DEC-64: split 1-day DEC proxy traces into 64 sub-traces by client IP
DEC-128: split 2-day DEC proxy traces into 128 sub-traces by client IP
NLANR-203: split NLANR traces into 203 sub-proxy traces by client IP
Berkeley-512: split one-week UCB traces into 512 sub-traces by client IP
17
Performance Metrics
  • Communication costs
  • Unicast-bytes
  • Multicast-bytes
  • Messages are all compressed by gzip

18
Results on Unicast-Bytes
[Figure: unicast bytes, one panel per trace: m = 10, 30, 64, 128, 203, 512]
19
Number of Objects Looked-Up
Trace          K=10 TA   K=10 TPUT/0.5   K=100 TA   K=100 TPUT/0.5
NLANR-10           166              18       1486              176
WorldCup-30         46              12        238              101
DEC-64            3164              31       9817              244
DEC-128           6928              28      26680              250
NLANR-203         5576              28      43954              238
Berkeley-512     47899              41     180550              132
20
Results on Multicast-Bytes
[Figure: multicast bytes, one panel per trace: m = 10, 30, 64, 128, 203, 512]
21
Optimality Analysis
  • Main results
  • TPUT is instance optimal for data sets with a
    log-log slope function C(n)
  • Zipf distribution: $C(n) = n$
  • Zipf distribution: opt-ratio $= (m-1) \cdot 2m + km$
  • Setting $\alpha < 1$ reduces cost qualitatively.
  • Zipf distribution: opt-ratio $= (m-1) \cdot O(\sqrt{m}) + km/\alpha$

22
General Instance Optimality
  • Definition
  • An algorithm R is instance-optimal with
    optimality ratio $C_1$ if there exists $C_2$ such that,
    for any data series D and any algorithm A,
  • $\mathrm{cost}(R, D) \le C_1 \cdot \mathrm{cost}(A, D) + C_2$
  • cost is the amount of network traffic
  • TA is instance optimal with opt-ratio $O(m^2)$

23
Worst Cases for Fixed Number Round-Trip Algorithms
  • TPUT is not general instance optimal
  • Nor is any algorithm that terminates in a fixed
    number of round trips regardless of input

Example: finding the object with the highest sum
Node 1: (A, 1) (C, 1) (X1, 0.6) (X2, 0.6) . . . (Xn, 0.6) (B, 0.5) . . .
Node 2: (B, 1) (D, 0.2) . . .
24
Log-Log Slope Function
  • L(j) is the value at position j in a
    reverse-sorted list
  • The list satisfies log-log slope function C(n)
    if, for all $j \ge k$, $L(j \cdot C(n)) < L(j)/n$
  • For a Zipf-like distribution $L(j) = 1/j^{\theta}$,
    $C(n) = n^{1/\theta}$.

[Figure: a reverse-sorted list; the value at position j is L(j), and the value at position $j \cdot C(n)$ is $< L(j)/n$]
25
Properties of the Two Lower Bounds
  • Let E be the true bottom (the true kth-highest sum)
  • $E_1 \ge E/m$
  • $E_2 > E/2$
  • $E_2 \ge E_1$
  • $E_2 > E - E_1 (m-1)/m$
  • For any x, $V(x) - PS(x) < (m-1)\,t$ ⇒ $V(x) - PS(x) < (m-1)\,E_1/m$
    ⇒ $E - E_2 < E_1 (m-1)/m$
  • $E_2 > (m/(2m-1))\,E$
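
The last bound follows from the two bullets before it. A short worked derivation, assuming $t = E_1/m$ as on this slide and using only $E_1 \le E_2$:

```latex
\begin{align*}
E - E_2 &< \frac{m-1}{m}\,E_1 \le \frac{m-1}{m}\,E_2
  && \text{(since } E_1 \le E_2\text{)} \\
E &< \left(1 + \frac{m-1}{m}\right) E_2 = \frac{2m-1}{m}\,E_2 \\
E_2 &> \frac{m}{2m-1}\,E > \frac{E}{2}
  && \text{(since } 2m > 2m-1\text{)}
\end{align*}
```

This also recovers the weaker statement $E_2 > E/2$ listed earlier on the slide.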

26
Restricted Instance Optimality of TPUT ($\alpha = 1$)
  • Assume D is a collection of m lists all following
    log-log slope function C(n); then, for any
    algorithm A,
  • $\mathrm{cost}(TPUT, D) \le \mathrm{cost}(A, D) \cdot ((m-1)\,C(2m) + C(m)\,k)$
  • Proof: assume the optimal algorithm for D stops
    at position $b_i$ on list i; then $L(b_i) < E$
  • The number of objects in S from node i is
    $\le b_i \cdot C(2m)$
  • Each node sends $\le C(m) \cdot k$ objects in round-trip
    2

27
Effect of $\alpha < 1$
  • Property
  • If object x appears in n nodes in Phase 2 and
    $U(x) \ge E_2$, then its average value in those nodes
    $R(x) \ge E_2 (1-\alpha)/n$
  • Let $l_i$ = the number of objects in S that appear in
    exactly i nodes in Phase 2; then
  • $1 \cdot l_1 + 2 l_2 + 3 l_3 + \dots + m\,l_m \le C(m(1+\alpha)/\alpha) \sum_i b_i$
  • $l_1 + l_2 + \dots + l_i \le C(i(1+\alpha)/(1-\alpha)) \sum_i b_i$
  • Size of S is $l_1 + l_2 + \dots + l_m$

28
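Analysis of $\alpha < 1$
  • What's the maximum $l_1 + l_2 + \dots + l_m$ under the
    following constraints?
  • $1 \cdot l_1 + 2 l_2 + 3 l_3 + \dots + m\,l_m \le C(m(1+\alpha)/\alpha)\,B$
  • $l_1 \le C(1 \cdot \beta)\,B$
  • $l_1 + l_2 \le C(2\beta)\,B$
  • ...
  • $l_1 + l_2 + \dots + l_m \le C(m\beta)\,B$
  • where $\beta = (1+\alpha)/(1-\alpha)$, $B = \sum_i b_i$
  • Solution: maximize $l_1, l_2, \dots, l_d$, and
    set $l_{d+1}, l_{d+2}, \dots, l_m$ to 0
  • $l_i = C(i\beta)\,B - C((i-1)\beta)\,B$
  • $d\,C(d\beta)\,B - \sum_{i=1}^{d-1} C(i\beta)\,B = C(m(1+\alpha)/\alpha)\,B$
  • Candidate set size $|S| \le C(d\beta)\,B$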
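
The greedy solution can be explored numerically. The sketch below is a rough illustration under several assumptions: it takes the constraints as sums bounded from above, factors out B, restricts d to whole numbers (so the budget is met with an inequality rather than exactly), and defaults to the Zipf case C(n) = n. The name `candidate_bound` is hypothetical.

```python
def candidate_bound(m, alpha, C=lambda n: n):
    """Return (d, C(d * beta)): the largest whole d within budget, and the
    resulting bound on |S| / B. Defaults to the Zipf case C(n) = n."""
    beta = (1 + alpha) / (1 - alpha)
    budget = C(m * (1 + alpha) / alpha)
    d = 0
    while d < m:
        nxt = d + 1
        # Weighted sum 1*l_1 + ... + nxt*l_nxt with l_i = C(i*beta) - C((i-1)*beta)
        used = nxt * C(nxt * beta) - sum(C(i * beta) for i in range(1, nxt))
        if used > budget:
            break
        d = nxt
    return d, C(d * beta)


bound = candidate_bound(m=100, alpha=0.5)
print(bound)  # (13, 39.0)
```

For m = 100 and a scaling factor of 0.5, the bound on |S| / B is 39, compared with C(m β) = 300 if all m of the prefix constraints were tight.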

29
$\alpha$ for Zipf Distributions
  • For a Zipf distribution, where $C(n) = n$, the size of
    candidate set S is $c\sqrt{m}\,B$
  • ⇒ Optimality ratio for TPUT with $\alpha < 1$ is
    $(m-1)\,c\sqrt{m} + (m/\alpha)\,k$

30
TPUT for Hierarchical Networks
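Phase 1: Lower-Bound Estimation
Phase 2: Selection by value (Pruning)
Phase 3: Final lookup
[Figure: a hierarchical network annotated with the candidate set S at each level and thresholds $t = (E/m)\,\alpha$ and $t = (E/(mn))\,\alpha$]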
31
Summary and Future Work
  • TPUT should be used for top-k queries in
    distributed networks
  • TPUT is instance-optimal under the log-log slope
    function assumption
  • Introducing $\alpha < 1$ improves performance
    significantly
  • Future work
  • Evaluating TPUT for hierarchical and P2P networks
  • Distributed algorithms for other aggregate
    statistics

32
Backup Slides
33
Bandwidth Consumption of Threshold Algorithm
Trace     Raw Data   K=10 TA Unicast   K=10 TA Multicast   K=100 TA Unicast   K=100 TA Multicast
NL-10     26MB       56.3KB            25.9KB              318KB              132KB
WC-30     426KB      31KB              22KB                96KB               80KB
DEC-64    7.4MB      1.7MB             160KB               4.6MB              359KB
DEC-128   15MB       7.2MB             419KB               24.6MB             1.2MB
NL-203    44MB       22MB              1.2MB               143MB              4.2MB
UCB-512   78MB       423MB             16.1MB              1.47GB             31MB
34
Bandwidth Consumption of TPUTHash
Trace     Raw Data   K=10 TPUT-H Unicast   K=10 TPUT-H Multicast   K=100 TPUT-H Unicast   K=100 TPUT-H Multicast
NL-10     26MB       8KB                   7KB                     52KB                   49KB
WC-30     426KB      44KB                  38KB                    99KB                   89KB
DEC-64    7.4MB      64KB                  59KB                    322KB                  300KB
DEC-128   15MB       161KB                 150KB                   870KB                  828KB
NL-203    44MB       154KB                 139KB                   764KB                  687KB
UCB-512   78MB       1.03MB                978KB                   15.8MB                 15.3MB
35
Unicast-Bytes for Top-100 Objects
36
Multicast-Bytes for Top-100 Objects
37
Varying $\alpha$
38
Fixed-Number Round Trip Algorithms
  • Criteria by which a node decides to send objects:
  • By position
  • By name
  • By value
  • Any fixed-number round trip algorithm must
    include a "by value" operation
  • Any algorithm that includes a "by value"
    operation won't be instance optimal

39
TA Running over Networks
Node 1: (A, 10) (C, 8) (E, 8) (F, 8) (B, 7) (D, 5) (J, 1) . . .
Node 2: (B, 10) (D, 9) (F, 8) (H, 6) (G, 5) (C, 1) (A, 1) . . .
Node 3: (C, 10) (A, 9) (G, 8) (J, 7) (F, 6) (D, 4) (B, 1) . . .
CM:
T = 26: looks up A, B, C, D → V(A) = 20, V(C) = 19; can't stop
T = 21: looks up E, F, G, H, J → V(F) = 22, V(A) = 20; can't stop
T = 10: stop
40
TPUT
  • Phase 3
  • Manager → Nodes: here is S; send me all objects
    in S
  • Nodes → Manager: here they are
  • Manager: calculate sums for objects in S; select
    the top k objects