Title: SCUD: Scalable Counting of Unique Data
Dmitry Kit, Prince Mahajan, Navendu Jain, Praveen Yalagandula, Mike Dahlin, and Yin Zhang
Laboratory for Advanced Systems Research, The University of Texas at Austin
Hewlett-Packard Labs, Palo Alto, CA
Our Approach
- Push query processing into the network
- Chaining of aggregation functions
Counting information about unique data is an important basic operation in large-scale distributed applications.
- Web Demo
  - Perform two operations on the system: put and get.
  - Can specify the address from which requests originate.
  - View the top-10 list of clients who performed a put or a get.
- Details
  - Uses Bamboo DHT as the storage system.
  - When Bamboo receives a put or a get request, it notifies SDIMS with the new access count (old_count + 1).
  - When this information is fully aggregated, SDIMS contacts Bamboo.
  - Bamboo updates its local top-10 list.
  - If there is a change, this list is inserted into SDIMS as a different aggregation.
  - The lists from each root are combined to form the final Top-10 list.
- Location
  - http://z.cs.utexas.edu/users/dkit/bamboo_test.php
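The last step of the demo flow, combining the per-root lists into the final Top-10 list, can be sketched as follows. This is a minimal illustration, not the demo's actual code: the function names and sample addresses are hypothetical, and it assumes each client's count is fully aggregated at exactly one root, so a client never appears in two root lists.

```python
import heapq
from typing import Dict, List, Tuple

Entry = Tuple[str, int]  # (client address, access count)


def local_top_k(counts: Dict[str, int], k: int = 10) -> List[Entry]:
    """Top-k clients among the totals aggregated at one root (hypothetical helper)."""
    # Sort by count, breaking ties by address so the result is deterministic.
    return heapq.nlargest(k, counts.items(), key=lambda e: (e[1], e[0]))


def merge_top_k(root_lists: List[List[Entry]], k: int = 10) -> List[Entry]:
    """Combine per-root lists; assumes each client appears in at most one list."""
    merged = [entry for lst in root_lists for entry in lst]
    return heapq.nlargest(k, merged, key=lambda e: (e[1], e[0]))


# Made-up counts at two roots, for illustration only.
root_a = local_top_k({"10.0.0.1": 40, "10.0.0.2": 7})
root_b = local_top_k({"10.0.0.3": 25, "10.0.0.4": 3})
final = merge_top_k([root_a, root_b], k=3)
```

Under the one-root-per-client assumption the merge is exact: any client in the global top k is also in the top k of its own root, so no candidate is lost by truncating the local lists.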
- Use one aggregation function to aggregate information per attribute.
  - Example: the number of accesses made by a client can be aggregated across multiple DHT trees.
  - Use a basic SUM aggregation for each client.
- Chain the output of the first aggregation function into the second aggregation, e.g., Top-K.
  - The results of the first aggregation reside on a large number of nodes (e.g., 1,000,000).
  - The chaining process allows us to combine this distributed information.
  - Example: maintain a Top-10 list of heavy-hitter flows by source IP in terms of the total bytes sent.
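The chaining idea above can be illustrated with a small single-process sketch: a SUM stage totals bytes per source IP, and its output feeds a Top-K stage. The function names and the sample reports are hypothetical, and the sketch runs in one process; it does not model how SDIMS distributes the two stages over DHT trees.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def sum_aggregate(node_reports: Iterable[Dict[str, int]]) -> Counter:
    """First stage: SUM the bytes reported for each source IP across all nodes."""
    totals: Counter = Counter()
    for report in node_reports:
        totals.update(report)  # Counter.update adds the counts per key
    return totals


def top_k_aggregate(totals: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Second stage: Top-K over the output of the first stage."""
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


def chained(node_reports: Iterable[Dict[str, int]], k: int) -> List[Tuple[str, int]]:
    """Chain the SUM output into the Top-K aggregation."""
    return top_k_aggregate(sum_aggregate(node_reports), k)


# Made-up per-node byte counts by source IP, for illustration only.
reports = [
    {"10.0.0.1": 500, "10.0.0.2": 100},
    {"10.0.0.2": 700, "10.0.0.3": 50},
    {"10.0.0.1": 400},
]
heavy_hitters = chained(reports, k=2)
```

The order of the stages matters: a source can be small at every individual node yet large in total, so Top-K is applied to the summed values rather than to each node's local counts.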
- Several Applications
  - Distributed storage, e.g., Bamboo/OpenDHT
    - Top-k users in a storage system grouped by activity type.
    - Put: store information.
    - Get: retrieve information.
  - Network monitoring, e.g., heavy hitters
    - Top-k flows by bytes and packets.
    - Aggregate MAX/MIN/AVG of incoming flows in an organization.
  - Content distribution networks, e.g., Akamai
    - Counting the number of unique accesses per webpage.
  - Telematics applications, e.g., BMW Assist
    - Counting the number of unique cars per highway.
Top-10 aggregation
<9.2.5.89, 4.2MB> <9.2.5.15, 2.2MB> <9.2.5.122, 1.5MB> <9.2.5.67, 1.3MB> <9.2.5.25, 900KB>
<9.2.5.230, 800KB>
<9.2.5.202, 760KB>
<9.2.5.122, 1.5MB>
<9.2.5.121, 400KB>
<9.2.5.25, 900KB>
<9.2.5.67, 1.3MB>
- Scalability
  - Large number of unique items.
  - Large number of distributed data sources at which the information about these items is being updated.
- Existing schemes don't scale
  - High bandwidth cost, large processing delay, high response latency.
- Hierarchical aggregation
  - Root node and nodes near the root: O(number of items) message cost and storage cost.
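The cost concern for hierarchical aggregation can be made concrete with a rough sketch. In a single aggregation tree the root holds one entry per unique item; if each item instead selects its own root by hashing its key, the entries spread over many roots. The hashing scheme, node count, and item names below are assumptions for illustration, not the system's actual key mapping.

```python
import hashlib
from collections import Counter
from typing import Iterable


def root_for(item: str, num_nodes: int) -> int:
    """Pick a root node for an item by hashing its key (illustrative scheme)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def per_root_load(items: Iterable[str], num_nodes: int) -> Counter:
    """Number of unique items whose aggregate ends up at each root."""
    return Counter(root_for(item, num_nodes) for item in set(items))


# Hypothetical workload: 10,000 unique items, 100 nodes.
items = [f"client-{i}" for i in range(10_000)]

single_tree_root_load = len(set(items))        # one root stores every item
multi_tree_load = per_root_load(items, 100)    # items spread over hashed roots
busiest_root_load = max(multi_tree_load.values())
```

With a single tree, the root's storage grows linearly with the number of unique items; with hashed roots, the busiest root holds only a fraction of them, at the price of a second aggregation step to combine the per-root results.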
<9.2.5.10, 653KB>
<9.2.5.56, 257KB>
<9.2.5.240, 100KB>
<9.2.5.210, 120KB>
<9.2.5.15, 2.2MB>
<9.2.5.89, 4.2MB>
URL: http://www.cs.utexas.edu/users/ypraveen/sdims
Email: sdims_at_cs.utexas.edu
Estimated Network Size: 18 nodes

Top 10 Gets by client
Client IP         Number of Gets
127.0.0.1         1182
128.83.144.30     243
128.83.120.172    168
128.83.120.245    147
128.83.120.138    120
128.83.120.21     90
128.83.144.43     75
128.83.130.11     48
128.83.120.114    27
128.83.144.241    12

Top 10 Puts by client
Client IP         Number of Puts
127.0.0.1         1100
128.83.120.138    180
128.83.144.30     144