Processing AdHoc Joins on Mobile Devices - PowerPoint PPT Presentation

1 / 61

About This Presentation

Title:

Processing AdHoc Joins on Mobile Devices

Description:

Mobile Devices and Databases ... Mobile users may wish to combine information from more ... Send a selection query Q to a server needs T(BSQL Bkey) bytes ... – PowerPoint PPT presentation

Number of Views:76

Avg rating:3.0/5.0

Slides: 62

Provided by: Eri6

Category:

more less

Transcript and Presenter's Notes

Title: Processing AdHoc Joins on Mobile Devices

1
Processing Ad-Hoc Joins on Mobile Devices

HKU CSIS DB Seminar
10 Oct 2003
Speaker Eric Lo

2
Mobile Devices and Databases

Cellular phones and Personal Data Assistants
(PDAs) are capable to ask information from remote
database(s) anywhere and anytime
The connection channel is wireless
E.g., WAP, IEEE 802.11 (also WiFi), GPRS, 3G

3
Example
HK Stock Exchange
SELECT Stock_Price FROM DB WHERE Stock_Code 8
8 - PCCW 1156am HKD 2.5
1155am What is the stock price of PCCW now?
4
There are no free lunchOption 1 Charged by
airtime
HK Stock Exchange
SELECT Stock_Price FROM DB WHERE Stock_Code 8
8 - PCCW 1156am HKD 2.5
1155am What is the stock price of PCCW now?
1
2.8
4.6
10.2
5
Option 2 Charged by amount of data transferred

Network traffic and QoS of wireless data
networking are strongly dependent on factors
like
Network workloads
Availability of network stations
Charged by amount of data transferred
Minimizing the transfer cost ? dollar!

6
Query more than one data source

Mobile users may wish to combine information from
more than one remote databases
E.g., A vegetarian visits Hong Kong and looks for
some restaurants recommended by both HK tourist
office and HK vegetarian community

7
Example relations
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
Join Query
8
Motivations

Evaluating join queries on mobile devices
Considerations
Mobile device has limited memory
Minimizing the transfer cost (dollar )
Databases are non-collaborative
Query results are small in sizes compare to input
relations (ad-hoc)

9
Download all relations?

Download both relations (HK tourist office and HK
vegetarian directory) onto the mobile device and
evaluate the join on the device locally
Wont be able to hold the large amount of data
from the remote databases (for most mobile
devices)
The transfer cost is very high though the result
size is very small

10
Outline

Introduction and motivation
A simple late-projection strategy
Block-merge join
Ship-data-as-queries join
RAMJ Recursive and Adaptive Mobile Join
Experiment result
Conclusions and future work

11
A simple late-projection strategy

Traditional distributed query processing
techniques like semi-join involves
Shipping of join columns and (whole) tuples
Across the trusted distributed nodes directly
In high selective join, most tuples fails to
contribute to the final result
Semi-join d/l the non-key attributes which may
not be included in the result
Download and join the distinct values of join
keys only (Do not download the non-key
attributes)
Only tuples belong to join result entails
downloading the rest of non-key attributes

12
A late projection strategy
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
13
Step 1

Download ? Name R1

SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
14
Step 2

Download ? Name R2

SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
15
Step 3

Evaluate T ? Name (R1) ? Name (R2) locally

T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
16
Step 4

Evaluate ? Name,Address (?NameT (R1))

T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
17
Step 5

Evaluate ? Name,Cost (?NameT (R2))

T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
18
Step 6

Join the two resultsets

SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
19
Block-merge join (BMJ)

Late-projection still insufficient if the whole
join column cannot fit into the memory of mobile
devices
Applying sort-merge join, with the sorting part
on the servers
1 block of ordered join keys are downloaded from
each server and join them locally, until one
block is exhausted
Each block must
Cover same data range
Sorted in same order (e.g., both in ascending
order)
Small enough to be resided in memory
Each block can be downloaded by using ROWNUM or
LIMIT SQL statements

20
Ship-data-as-queries join (SDAQ)

If R1
Download the join column of R1 to the mobile
deviceSELECT Name FROM R1
Send the join keys to R2 in form of SQL selection
queries (e.g., if two results returned in step
1) SELECT Name FROM R2WHERE Name in (Beta
Food,Ceta Food))
The result from R2 are the joined keys

Very few results returned
21
Can we do even better?

Block-merge join (BMJ) can handle the limited
memory problem, BUT download all join keys
essentially
Ship-data-as-queries (SDAQ) can do better ONLY if
the sizes of two relations differ much
Pay small overhead ?
Build histograms that capture the data
distribution of target relations
Join space pruning
Bucket-base joining

22
Pruning the data space
23
Constructing Histogram

Problem
Mobile devices are not able to receive those
histograms (they are some internal data
structures in remote databases)
Solution
Constructing some queries that build histogram
through SQL

24
String Histogram

Using the SUBSTRING function
SELECT SUBSTRING(Name,1,1) AS Bucket,COUNT(Name)
As CountFROM R1GROUP BY SUBSTRING(Name,1,1)HAVI
NG COUNT(Name) 0

25
Numeric Histogram

Using ROUND function
SELECT ROUND(Cost/(D/G)) AS Bucket,COUNT(ROUND(Co
st/(D/G)) As CountFROM R2GROUP BY
ROUND(Cost/(D/G)HAVING COUNT(ROUND(Cost/(D/G))
0
G is granularity that specifies the number of
bucket
D is the numeric domain

26
A Bucket-base Approach

So far we know that
SDAQ is good when input relations have large size
difference
But, BMJ is better when two input relations have
similar size (Why?)
Histogram helpful to prune join space
Which method is better?
The histogram can do more!
The histogram already partitioned the data space
in form of buckets
Depends on the data distribution of each bucket,
assign the best action to them adaptively

27
Direct Join (BMJ)
28
Ship-join-keys (SDAQ)
29
Recursive Partitioning
30
Recursive Partitioning

Further breaks a partition into more refined ones
and further request histogram for it
Hoping some sub-buckets are being pruned in
future
Or hoping cheap ship-join-keys join can be
applied on some future sub-buckets

31
Recursive Partitioning

SELECT SUBSTRING(Name,1,2) AS Bucket,COUNT(Name)
AS CountFROM R2WHERE SUBSTRING(Name,1,1)AGR
OUP BY SUBSTRING(Name,1,2)HAVING COUNT(Name) 0

32
Which action is the best for each bucket? The
cost model!

The largest amount of data that can be
transferred in one packet is called MTU (Maximum
Transfer Unit)
The largest segment of TCP data that can be
transmitted is called MSS(Maximum Segment Size)
MTU MSS BH (BH is the size of headers)
To transfer B bytes data, the actual number of
bytes to be transferred is

33
Cost Model

Assume CR1 and CR2 be the cost of accessing R1
and R2, respectively
Send a selection query Q to a server needs T(BSQL
Bkey) bytes
Bkey 4 bytes for numeric attributes
Bkey 2L bytes for string attributes in length L

34
Cost Model

Under these settings, we have to determine the
cost of
Direct Join C1
Ship-join-key Join C2
Recursive Partitioning C3
Execute the minimal cost action for each bucket
adaptively

35
C1 Direct Join

ai,ßi be the i-th histogram bucket summarizing
the same data region
Send a selection query to R1 CR1 T(BSQL Bkey)
Receiving CR1 T(aiBkey) bytes.
Send a selection query to R2 CR2 T(BSQL Bkey)
Receiving CR2 T(ßiBkey) bytes.
C1(ai,ßi) (CR1 CR2 )T(BSQL Bkey)
CR1 T(aiBkey) CR2 T(ßiBkey)

36
C2 Ship-join-keys

ai
Send a selection query to the smaller relation R2
that holds ai CR2 T(BSQL Bkey)
Receiving CR2 T(aiBkey) bytes from R2
Send a selection query to larger relation R1 to
check existence of ai keys CR1 T(BSQL
aiBkey)
Receiving at most ai keys from R1 CR1
T(aiBkey)
C2(ai,ßi) CR1 (T(BSQL Bkey) T(aiBkey))
CR2 T(T(BSQL 2aiBkey))

37
C3 Recursive Partitioning

Have to estimate the cost of
Ask for finer histograms for that bucket from R1
Ask for finer histograms for that bucket from R2
For each pair of (future/virtual) sub-buckets,
each of them may execute direct-join,
ship-join-key or recursive partitioning again

38
Recursive Partitioning C3

Ask for finer histograms from R1
Ch(G,R1) CR1(T(G(Bkey4))T(BSQLBkey))
Ask for finer histograms from R2
Ch(G,R2) CR2(T(G(Bkey4))T(BSQLBkey))
For each pair of (future) sub-buckets, each
sub-bucket pair may recursively follow
direct-join, ship-join-key or recursive
partitioning again

39
Recursive Partitioning C3

C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)

?
40
Recursive Partitioning Optimistic Estimation

Optimistically assume that buckets in next level
are all being pruned
It will hold if the data distribution in the two
datasets is very different
Since all future (next-level) sub-buckets are
being pruned, they would NOT have any actions
Therefore
C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)

41
Recursive Partitioning Linear Interpolation
Estimation

More accurate. Higher computational cost
Exploit the histogram in current level to
estimate the distribution of next level

We have histograms in this level
We DONT have histograms in this level
42
Linear Interpolation Estimation

Select adjacent buckets as interpolation points
Preserve the current trend
Resistance to fluctuated distribution

b3
b3
b2
b2
b4
b4
b5
b5
b1
b1
43
One problem left
Level 1
b3
b3
b2
b2
b4
b4
b5
b5
b1
b1
Level 2

Level 1 The cost of RP on b2 is?
Estimate Level 2 by Linear Interpolation
b2,1, b2,2, , b2,5 are found
b2,1, b2,2, , b2,5 are found,
determine which action is the most
cost-efficient (C1,C2 or C3) for each sub-bucket?
C1, C2 of b2,1, b2,2, , b2,5
can be determine
C3 of b2,1, b2,2, , b2,5 ?
Started from step 1 again

b2,1
2
3
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
44
If you dont understand

Cost 3 of one level depends on next level
Fortunately the cost of CRP (ai,ßi) is bounded by
the following inequality

C1
ai,ßi
C2
C1
C3
C2
C1
C3
C2
C3
45
Recursive Partitioning Linear Interpolation
Estimation

C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)
In optimistic estimation, we omit the last item
optimistically
CRP (ai,ßi) is bounded by the inequality
Summing up everything

46
RAMJ Algorithm

Recursive and Adaptive Mobile Join

47
Real Data Experiment

Real Data Set 1
DBLP
Join relations Conference (235K tuples) and
Journal (125K tuples) in order to find the set
of publications that have the same conference and
journal title
3836 publications have same title in both
conference and journal paper
SELECT R1.Title
FROM Conference R1, Journal R2
WHERE R1.Title R2.Title

48
Real Data Experiment

Real Data Set 2
Restaurants Data Set
Crawled from www.restaurantrow.com
Join relation Steak (4573 tuples) and
Vegetarian (2098 tuples) in order to find the
set of restaurants that offer both steak and
vegetarian dishes (163 joined)
SELECT R1.Name
FROM Steak R1, Vegetarian R2
WHERE R1.Name R2.Name

49
Real Data Experiment Result
50
Synthetic Data Experiment

Generate 3 relations with different
distributions
Gaussian
Negative Exponential
Zipf (skewness ? 1)
Default
10,000 tuples
Domain 100,000
G 20

51
Synthetic Data Experiment Result
52
The impact of data skew
53
The impact of memory size
54
Conclusions and Future Work

Identify the requirements and limitation on
evaluating ad-hoc join on mobile devices
A recursive and adaptive algorithm RAMJ
Extension to multi-way joins and multi-attributes
Extension to Top-K-Join
Existing approaches on Top-K problem ONLY works
on collaborative database

55
Q A

56
Approach 2 Mediator

User queries are free-form
i.e., User KY may issue a join query that
involves DB x and DB y, whereas user BY may issue
a join query that involves DB e and DB f
Mediator cannot answer those queries without
prior preparation like data integration, schema
matching
Mediator services may charge the users as well

57
Approach 2Distributed query processing?

Existing distributed database work on trusted
environment only
DB1.Name DB2.CustomerName
Semi-join
Site DB1 Evaluate J ? Name DB1 J All Names
Send J from DB1 to DB2
Site DB2 Evaluate K ? CustomerName J( DB2 )
Find all CustomerName Name
in DB1
Send K from DB2 to DB1

58
Approach 2Distributed query processing?

Not work on our problem!
DB1 and DB2 are non-collaborative
Would not accept data structures as input
(e.g., a join column in semi-join or a
hash-table in bloom-join
Accept SQL only
Semi-join is worked by some modifications
Send J and K through the mobile device
? High transfer cost

59
References

Processing Ad-hoc joins on mobile devices,
submitted to EDBT 04
P.A. Bernstein and N. Goodman. Power of natural
semijoin. SAIM Journal of Computing, 1981
P.A. Bernstein, N. Goodman, et. al. Query
processing in a system for distributed databases
(sdd-1). ACM TODS, 1981
N. Mamoulis, P. Kalnis, et. al. Optimization of
spatial joins on mobile devices. SSTD, 2003

60
Other Join Types

Equi-join with selection constraints
E.g., We are interested in restaurants appear in
both datasets and the expense is less than 20
SELECT R1.Name, R1.Address, R2.Cost
FROM R1, R2
WHERE R1.Name R2.Name
AND R2.Cost
Add this condition in histogram construction
SELECT SUBSTRING(Name,1,1) AS Bucket,
COUNT(Name) As Count
FROM R1
WHERE R2.Cost
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) 0

61
Iceberg Semi-join

Find all restaurants in R1 which are recommended
by at least 10 users in a discussion group R2
Properties
Equi-join between R1 and R2
Results are comes from R1 only
Condition is applied on R2 only (10 users)
SELECT SUBSTRING(Name,1,1) AS Bucket,
COUNT(Name) As Count
FROM R2
GROUP BY SUBSTRING(Name,1,1)
HAVING COUNT(Name) t