Title: Processing AdHoc Joins on Mobile Devices
1Processing Ad-Hoc Joins on Mobile Devices
- HKU CSIS DB Seminar
- 10 Oct 2003
- Speaker Eric Lo
2Mobile Devices and Databases
- Cellular phones and Personal Data Assistants
(PDAs) are capable to ask information from remote
database(s) anywhere and anytime - The connection channel is wireless
- E.g., WAP, IEEE 802.11 (also WiFi), GPRS, 3G
3Example
HK Stock Exchange
SELECT Stock_Price FROM DB WHERE Stock_Code 8
8 - PCCW 1156am HKD 2.5
1155am What is the stock price of PCCW now?
4There are no free lunchOption 1 Charged by
airtime
HK Stock Exchange
SELECT Stock_Price FROM DB WHERE Stock_Code 8
8 - PCCW 1156am HKD 2.5
1155am What is the stock price of PCCW now?
1
2.8
4.6
10.2
5Option 2 Charged by amount of data transferred
- Network traffic and QoS of wireless data
networking are strongly dependent on factors
like - Network workloads
- Availability of network stations
- Charged by amount of data transferred
- Minimizing the transfer cost ? dollar!
6Query more than one data source
- Mobile users may wish to combine information from
more than one remote databases - E.g., A vegetarian visits Hong Kong and looks for
some restaurants recommended by both HK tourist
office and HK vegetarian community
7Example relations
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
Join Query
8Motivations
- Evaluating join queries on mobile devices
- Considerations
- Mobile device has limited memory
- Minimizing the transfer cost (dollar )
- Databases are non-collaborative
- Query results are small in sizes compare to input
relations (ad-hoc)
9Download all relations?
- Download both relations (HK tourist office and HK
vegetarian directory) onto the mobile device and
evaluate the join on the device locally - Wont be able to hold the large amount of data
from the remote databases (for most mobile
devices) - The transfer cost is very high though the result
size is very small
10Outline
- Introduction and motivation
- A simple late-projection strategy
- Block-merge join
- Ship-data-as-queries join
- RAMJ Recursive and Adaptive Mobile Join
- Experiment result
- Conclusions and future work
11A simple late-projection strategy
- Traditional distributed query processing
techniques like semi-join involves - Shipping of join columns and (whole) tuples
- Across the trusted distributed nodes directly
- In high selective join, most tuples fails to
contribute to the final result - Semi-join d/l the non-key attributes which may
not be included in the result - Download and join the distinct values of join
keys only (Do not download the non-key
attributes) - Only tuples belong to join result entails
downloading the rest of non-key attributes
12A late projection strategy
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
13Step 1
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
14Step 2
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
15Step 3
- Evaluate T ? Name (R1) ? Name (R2) locally
-
T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
16Step 4
- Evaluate ? Name,Address (?NameT (R1))
T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
17Step 5
- Evaluate ? Name,Cost (?NameT (R2))
T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
18Step 6
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
19Block-merge join (BMJ)
- Late-projection still insufficient if the whole
join column cannot fit into the memory of mobile
devices - Applying sort-merge join, with the sorting part
on the servers - 1 block of ordered join keys are downloaded from
each server and join them locally, until one
block is exhausted - Each block must
- Cover same data range
- Sorted in same order (e.g., both in ascending
order) - Small enough to be resided in memory
- Each block can be downloaded by using ROWNUM or
LIMIT SQL statements
20Ship-data-as-queries join (SDAQ)
- If R1
- Download the join column of R1 to the mobile
deviceSELECT Name FROM R1 - Send the join keys to R2 in form of SQL selection
queries (e.g., if two results returned in step
1) SELECT Name FROM R2WHERE Name in (Beta
Food,Ceta Food)) - The result from R2 are the joined keys
Very few results returned
21Can we do even better?
- Block-merge join (BMJ) can handle the limited
memory problem, BUT download all join keys
essentially - Ship-data-as-queries (SDAQ) can do better ONLY if
the sizes of two relations differ much - Pay small overhead ?
- Build histograms that capture the data
distribution of target relations - Join space pruning
- Bucket-base joining
22Pruning the data space
23Constructing Histogram
- Problem
- Mobile devices are not able to receive those
histograms (they are some internal data
structures in remote databases) - Solution
- Constructing some queries that build histogram
through SQL
24String Histogram
- Using the SUBSTRING function
- SELECT SUBSTRING(Name,1,1) AS Bucket,COUNT(Name)
As CountFROM R1GROUP BY SUBSTRING(Name,1,1)HAVI
NG COUNT(Name) 0
25Numeric Histogram
- Using ROUND function
- SELECT ROUND(Cost/(D/G)) AS Bucket,COUNT(ROUND(Co
st/(D/G)) As CountFROM R2GROUP BY
ROUND(Cost/(D/G)HAVING COUNT(ROUND(Cost/(D/G))
0 - G is granularity that specifies the number of
bucket - D is the numeric domain
26A Bucket-base Approach
- So far we know that
- SDAQ is good when input relations have large size
difference - But, BMJ is better when two input relations have
similar size (Why?) - Histogram helpful to prune join space
- Which method is better?
- The histogram can do more!
- The histogram already partitioned the data space
in form of buckets - Depends on the data distribution of each bucket,
assign the best action to them adaptively
27Direct Join (BMJ)
28Ship-join-keys (SDAQ)
29Recursive Partitioning
30Recursive Partitioning
- Further breaks a partition into more refined ones
and further request histogram for it - Hoping some sub-buckets are being pruned in
future - Or hoping cheap ship-join-keys join can be
applied on some future sub-buckets
31Recursive Partitioning
- SELECT SUBSTRING(Name,1,2) AS Bucket,COUNT(Name)
AS CountFROM R2WHERE SUBSTRING(Name,1,1)AGR
OUP BY SUBSTRING(Name,1,2)HAVING COUNT(Name) 0
32Which action is the best for each bucket? The
cost model!
- The largest amount of data that can be
transferred in one packet is called MTU (Maximum
Transfer Unit) - The largest segment of TCP data that can be
transmitted is called MSS(Maximum Segment Size) - MTU MSS BH (BH is the size of headers)
- To transfer B bytes data, the actual number of
bytes to be transferred is
33Cost Model
- Assume CR1 and CR2 be the cost of accessing R1
and R2, respectively - Send a selection query Q to a server needs T(BSQL
Bkey) bytes - Bkey 4 bytes for numeric attributes
- Bkey 2L bytes for string attributes in length L
34Cost Model
- Under these settings, we have to determine the
cost of - Direct Join C1
- Ship-join-key Join C2
- Recursive Partitioning C3
- Execute the minimal cost action for each bucket
adaptively
35C1 Direct Join
- ai,ßi be the i-th histogram bucket summarizing
the same data region - Send a selection query to R1 CR1 T(BSQL Bkey)
- Receiving CR1 T(aiBkey) bytes.
- Send a selection query to R2 CR2 T(BSQL Bkey)
- Receiving CR2 T(ßiBkey) bytes.
- C1(ai,ßi) (CR1 CR2 )T(BSQL Bkey)
CR1 T(aiBkey) CR2 T(ßiBkey)
36C2 Ship-join-keys
- ai
- Send a selection query to the smaller relation R2
that holds ai CR2 T(BSQL Bkey) - Receiving CR2 T(aiBkey) bytes from R2
- Send a selection query to larger relation R1 to
check existence of ai keys CR1 T(BSQL
aiBkey) - Receiving at most ai keys from R1 CR1
T(aiBkey) - C2(ai,ßi) CR1 (T(BSQL Bkey) T(aiBkey))
CR2 T(T(BSQL 2aiBkey))
37C3 Recursive Partitioning
- Have to estimate the cost of
- Ask for finer histograms for that bucket from R1
- Ask for finer histograms for that bucket from R2
- For each pair of (future/virtual) sub-buckets,
each of them may execute direct-join,
ship-join-key or recursive partitioning again
38Recursive Partitioning C3
- Ask for finer histograms from R1
- Ch(G,R1) CR1(T(G(Bkey4))T(BSQLBkey))
- Ask for finer histograms from R2
- Ch(G,R2) CR2(T(G(Bkey4))T(BSQLBkey))
- For each pair of (future) sub-buckets, each
sub-bucket pair may recursively follow
direct-join, ship-join-key or recursive
partitioning again
39Recursive Partitioning C3
- C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)
?
40Recursive Partitioning Optimistic Estimation
- Optimistically assume that buckets in next level
are all being pruned - It will hold if the data distribution in the two
datasets is very different - Since all future (next-level) sub-buckets are
being pruned, they would NOT have any actions - Therefore
- C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)
41Recursive Partitioning Linear Interpolation
Estimation
- More accurate. Higher computational cost
- Exploit the histogram in current level to
estimate the distribution of next level
We have histograms in this level
We DONT have histograms in this level
42Linear Interpolation Estimation
- Select adjacent buckets as interpolation points
- Preserve the current trend
- Resistance to fluctuated distribution
b3
b3
b2
b2
b4
b4
b5
b5
b1
b1
43One problem left
Level 1
b3
b3
b2
b2
b4
b4
b5
b5
b1
b1
Level 2
- Level 1 The cost of RP on b2 is?
- Estimate Level 2 by Linear Interpolation
- b2,1, b2,2, , b2,5 are found
- b2,1, b2,2, , b2,5 are found,
determine which action is the most
cost-efficient (C1,C2 or C3) for each sub-bucket? - C1, C2 of b2,1, b2,2, , b2,5
can be determine - C3 of b2,1, b2,2, , b2,5 ?
Started from step 1 again
b2,1
2
3
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
44If you dont understand
- Cost 3 of one level depends on next level
- Fortunately the cost of CRP (ai,ßi) is bounded by
the following inequality
C1
ai,ßi
C2
C1
C3
C2
C1
C3
C2
C3
45Recursive Partitioning Linear Interpolation
Estimation
- C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)
- In optimistic estimation, we omit the last item
optimistically - CRP (ai,ßi) is bounded by the inequality
- Summing up everything
46RAMJ Algorithm
- Recursive and Adaptive Mobile Join
47Real Data Experiment
- Real Data Set 1
- DBLP
- Join relations Conference (235K tuples) and
Journal (125K tuples) in order to find the set
of publications that have the same conference and
journal title - 3836 publications have same title in both
conference and journal paper - SELECT R1.Title
- FROM Conference R1, Journal R2
- WHERE R1.Title R2.Title
48Real Data Experiment
- Real Data Set 2
- Restaurants Data Set
- Crawled from www.restaurantrow.com
- Join relation Steak (4573 tuples) and
Vegetarian (2098 tuples) in order to find the
set of restaurants that offer both steak and
vegetarian dishes (163 joined) - SELECT R1.Name
- FROM Steak R1, Vegetarian R2
- WHERE R1.Name R2.Name
49Real Data Experiment Result
50Synthetic Data Experiment
- Generate 3 relations with different
distributions - Gaussian
- Negative Exponential
- Zipf (skewness ? 1)
- Default
- 10,000 tuples
- Domain 100,000
- G 20
51Synthetic Data Experiment Result
52The impact of data skew
53The impact of memory size
54Conclusions and Future Work
- Identify the requirements and limitation on
evaluating ad-hoc join on mobile devices - A recursive and adaptive algorithm RAMJ
- Extension to multi-way joins and multi-attributes
- Extension to Top-K-Join
- Existing approaches on Top-K problem ONLY works
on collaborative database
55Q A
56Approach 2 Mediator
- User queries are free-form
- i.e., User KY may issue a join query that
involves DB x and DB y, whereas user BY may issue
a join query that involves DB e and DB f - Mediator cannot answer those queries without
prior preparation like data integration, schema
matching - Mediator services may charge the users as well
57Approach 2Distributed query processing?
- Existing distributed database work on trusted
environment only - DB1.Name DB2.CustomerName
- Semi-join
- Site DB1 Evaluate J ? Name DB1 J All Names
- Send J from DB1 to DB2
- Site DB2 Evaluate K ? CustomerName J( DB2 )
- Find all CustomerName Name
in DB1 - Send K from DB2 to DB1
58Approach 2Distributed query processing?
- Not work on our problem!
- DB1 and DB2 are non-collaborative
- Would not accept data structures as input
(e.g., a join column in semi-join or a
hash-table in bloom-join - Accept SQL only
- Semi-join is worked by some modifications
- Send J and K through the mobile device
- ? High transfer cost
59References
- Processing Ad-hoc joins on mobile devices,
submitted to EDBT 04 - P.A. Bernstein and N. Goodman. Power of natural
semijoin. SAIM Journal of Computing, 1981 - P.A. Bernstein, N. Goodman, et. al. Query
processing in a system for distributed databases
(sdd-1). ACM TODS, 1981 - N. Mamoulis, P. Kalnis, et. al. Optimization of
spatial joins on mobile devices. SSTD, 2003
60Other Join Types
- Equi-join with selection constraints
- E.g., We are interested in restaurants appear in
both datasets and the expense is less than 20 - SELECT R1.Name, R1.Address, R2.Cost
- FROM R1, R2
- WHERE R1.Name R2.Name
- AND R2.Cost
- Add this condition in histogram construction
- SELECT SUBSTRING(Name,1,1) AS Bucket,
- COUNT(Name) As Count
- FROM R1
- WHERE R2.Cost
- GROUP BY SUBSTRING(Name,1,1)
- HAVING COUNT(Name) 0
61Iceberg Semi-join
- Find all restaurants in R1 which are recommended
by at least 10 users in a discussion group R2 - Properties
- Equi-join between R1 and R2
- Results are comes from R1 only
- Condition is applied on R2 only (10 users)
- SELECT SUBSTRING(Name,1,1) AS Bucket,
- COUNT(Name) As Count
- FROM R2
- GROUP BY SUBSTRING(Name,1,1)
- HAVING COUNT(Name) t