Title: Data Dissemination
1. Data Dissemination
- Data Dissemination Problems
- Basic Dissemination Methods: Push and Pull
- Broadcast Disk (for read-only transactions)
- Basic Schemes for Data Broadcast
- The Hybrid (Push and Pull) Approach
- Temporal Consistency and Currency
- Broadcasting Consistent Data
- Multi-version Data Broadcast
- Update First with Order (UFO)
2. Proactive Services
[Figure (from Dollimore): 1. A user enters the room wearing an active badge. 2. An infrared sensor detects the user's ID. 3. The display responds to the user ("Hello Roy").]
Firstly, when a new object enters a smart space, the list of services
provided in the space has to be downloaded to the object. How should this
information be delivered to the new object? Secondly, the object may
generate requests to be served by other objects within the space. How
should these requests be processed? Thirdly, the object may have submitted
a query or continuous query (CQ) to monitor the status of the space. How
should the execution of the query/CQ be supported?
3. Distributed Computing: Query Processing Strategies
- Query shipping
  - The server (the service provider) maintains the latest versions of data objects
  - Queries from clients are sent to a server for processing
  - Query results are returned to the clients
- Data shipping
  - Clients send data requests to the server (the service provider)
  - The server returns the requested data objects to the clients
  - The queries are processed by the clients
4. Query Shipping vs. Data Shipping
[Figure: Query shipping: 1. the client sends requests to the server over the uplink channel; 2. the server processes the query; 3. the results are returned over the downlink channel. Data shipping: 1. the client sends data requests to the server; 2. the server returns the data; 3. the client processes the query.]
What are the trade-offs in transmission overhead, processing cost,
scalability and processing delay? They depend on application and system
characteristics, e.g., data size and number of queries.
5. Query Shipping vs. Data Shipping
- Which one is more suitable for pervasive computing?
  - A large number of moving objects submit (location-dependent) queries to access different types of real-time information
  - Monitoring of real-time sensor data using the in-network processing approach
  - Lack of powerful nodes at the device (network) level
- Mobile networks
  - Low bandwidth
  - Asymmetric bandwidth: uplink bandwidth << downlink bandwidth
- Transaction types
  - Mostly read-only transactions (queries)
  - Sensor readings of the environment
  - Continuous queries with a begin and end time for event detection (e.g., navigation, tracing)
  - Location-dependent queries: queries at different locations request to read different sets of data objects (spatial and temporal properties)
6. Data Dissemination Problems
- Data shipping: how to provide the required data items to a large number of queries (from moving objects) for execution?
- Note: since they are read-only operations, there is no need to update any data items at the server.
Reading (detect events) -> responses
Broadcast data to clients through a mobile network
7. Performance Objectives in Data Dissemination
- Workload. Minimize the total data transmission workload
- Waiting delay. Some queries may have a deadline on their completion time. Meeting the deadline is important (to respond to critical events occurring in the system environment)
- Tune-in time (to conserve energy)
  - The clients may not know whether their required data will be shipped
  - A client may sleep and then wake up to tune in to the broadcast channel to get its required data item (to avoid continuously monitoring the broadcast channel for data items)
- Currency. Since most of the queries ask for real-time information about the environment (e.g., temperature, traffic conditions), the data items provided to a query have to be the latest versions (not outdated)
- Consistency. Ensure the consistency (correctness) of the data items provided to a query (temporal consistency: two data items report the status of the environment at the same time point). Otherwise, incorrect results may be generated
8. Data Dissemination Methods: Pull vs. Push
- Using data shipping for processing of mobile queries
  - Scalability (avoid processing queries at the server and serving each data request one by one). The arrival rate of data requests could be very high due to the large number of mobile clients
  - Suitable for monitoring and surveillance applications (continuous queries). Why?
  - How can data shipping be applied to in-network processing?
- Pure Pull (on demand)
  - Clients explicitly (periodically, for data monitoring) send data requests to a server through the uplink channel
  - The server returns the requested data items to the clients through the downlink channel
- Design problems
  - What is the pulling period (based on the dynamic properties of the data)? Pulling period vs. transmission workload
  - Scalability problem (although the server does not need to process the queries, it needs to serve the data requests one by one)
9. Data Dissemination Method: Pulling
[Figure: Point-to-point communication. Clients 1 to n each send requests to the server and receive data in return. The server maintains a request queue and processes the data requests one by one.]
10. Data Dissemination Methods: Pull vs. Push
- Pure Push (broadcast)
  - Data shipping with prediction
  - Predict what the data requirements of the queries are
  - Spatial and temporal properties (queries issued at similar times and from clients at similar locations have similar data requirements)
  - The server defines a broadcast schedule, e.g., based on the popularity of data items (identify the hot items using previous access statistics)
  - A broadcast schedule is a sequence of data items to be broadcast by the server
  - The server repetitively broadcasts data according to a broadcast schedule to a client population without receiving any data requests
  - Clients monitor the broadcast channel and retrieve the data items they need as they appear in the broadcast channel
- Applications of Push (data broadcast)
  - Listening to radio and watching TV
  - Information feeds such as stock quotes, sports tickers, electronic newsletters, traffic and weather information, cable TV
11. Data Dissemination Method: Pushing
[Figure: One-to-many communication. The server selects data for broadcast, and the broadcast server broadcasts it to all the clients. The clients monitor the broadcast channel for their needed data items.]
12. Comparison: Pull vs. Push (Pull)
- Pull or Push? Which one is more suitable for pervasive computing applications? It depends on ...
- Pull places a higher demand on uplink bandwidth
  - Sending pulling requests and returning data
  - Each data transmission can only serve one query
- No waste in bandwidth, although the bandwidth for serving each request is higher
  - All the transmitted data are needed by clients
- In Pull, a query knows when its required data items will come (approximately, why?). The clients play an active role: they tell the server what they want
- The workload at the server and the network depends on the arrival rate of requests and the number of clients. If there are many data requests, the waiting time will be very long. Will queries miss their completion deadlines?
  - Arrival rate A, service rate C
  - Utilization U = A/C
  - Queue length = U/(1-U)
  - Assuming Poisson arrivals and exponential service times
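As a rough illustration of the queueing argument on this slide, the sketch below evaluates U = A/C and U/(1-U) for a single pull server under the stated Poisson arrival and exponential service assumptions. The function name mm1_metrics and the example rates are hypothetical and chosen only for illustration.

```python
import math


def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    """Utilization and mean queue length for one server (Poisson/exponential).

    arrival_rate is A (requests per second), service_rate is C.
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    utilization = arrival_rate / service_rate  # U = A/C
    if utilization >= 1:
        # Requests arrive faster than they can be served: the queue grows
        # without bound, so the formula U/(1-U) no longer applies.
        queue_length = math.inf
    else:
        queue_length = utilization / (1 - utilization)  # U/(1-U)
    return {"utilization": utilization, "queue_length": queue_length}


if __name__ == "__main__":
    for a in (5.0, 8.0, 9.5, 12.0):
        print(a, mm1_metrics(a, service_rate=10.0))
```

The queue length rises sharply as U approaches 1, which is why a pull server can miss query deadlines once the number of clients grows.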
13. Comparison: Pull vs. Push (Push)
- Some data items are not wanted by any queries (a waste of bandwidth)
  - The schedule is only a prediction
- Push is suitable for disseminating data items to a large number of clients (more scalable) with similar data requirements (hot data items)
  - One data push could meet the data requirements of multiple queries (if the prediction is correct)
  - E.g., many clients may want to know the latest traffic conditions at the cross-harbor tunnel
  - E.g., TV broadcast vs. video on demand
- In Push, the total broadcast workload is determined by the server (i.e., the pushing rate, the number of data items to be pushed each second)
  - The server may introduce a delay between two broadcast schedules to reduce the broadcast workload
- Push is suitable for systems with a small database and small data items
- Push is suitable for systems where the access probabilities of data items are uneven (hot data vs. cold data)
14. Design Problems in Using Push
- How to define the broadcast schedule?
  - What is the length of a broadcast schedule (number of data items)? All items in the database?
- The access to required data items is sequential (a query consists of several read operations, and the operations are processed one after another)
- Clients need to listen continuously to the downlink channel
  - How to reduce the listening time to conserve energy? Doze and wake-up modes of operation
15. Broadcast Index
- To reduce the monitoring (tune-in) time, an index is broadcast before each broadcast schedule starts
- A broadcast schedule consists of two parts
  - A header index and a sequence of buckets of data items (one data item per bucket, assuming same-size items)
- The broadcast index indicates the broadcast schedule of the data items in a broadcast cycle
- From the index, a client can calculate the broadcast time of its required data items (current time + position of the data item in the broadcast schedule x the time to broadcast a data item)
- The client reads the broadcast index and then sleeps until the required data item is about to be broadcast
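The wake-up calculation can be sketched as follows. This is a minimal illustration, assuming equal-size items, a 0-based position counted in buckets after the index, and that "current time" is the moment the index ends and the first bucket starts. The function names and the example sizes are hypothetical.

```python
def broadcast_time(index_end_time, position, item_size, bandwidth):
    """Time at which the bucket at `position` starts being broadcast.

    position is the number of buckets ahead of the wanted item (0-based).
    item_size / bandwidth is the time to broadcast one data item.
    """
    slot_time = item_size / bandwidth
    return index_end_time + position * slot_time


def sleep_duration(index, wanted, now, item_size, bandwidth):
    """How long a client can doze after reading the index at time `now`."""
    position = index.index(wanted)  # raises ValueError if not scheduled
    return broadcast_time(now, position, item_size, bandwidth) - now


if __name__ == "__main__":
    index = ["a", "b", "c", "d"]  # broadcast order announced by the index
    # 1,000,000-bit items on a 2,000,000 bit/s channel: 0.5 s per bucket
    print(sleep_duration(index, "c", now=10.0,
                         item_size=1_000_000, bandwidth=2_000_000))
```

A client that does not find its item in the index learns immediately that the item will not be shipped in this cycle, which addresses the tune-in concern raised earlier.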
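16. Broadcast Index
[Figure: A broadcast schedule that starts with an index, drawn along a size axis marked from 1 M to 5 M. The time to broadcast item i is its size divided by the broadcast bandwidth. The client tunes in to read the index, sleeps, and tunes in again when its required item is broadcast.]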
17. Broadcast Schedules
- A broadcast schedule is a sequence of data items (buckets)
- When a broadcast schedule is finished, the next schedule will be defined and then started immediately (or after a fixed delay)
- The method used to define the broadcast schedule affects the waiting time for data items
- Two types of (read-only) queries
  - Each query consists of a set of read operations
  - Unordered: the operations can be executed in ANY order, depending on the arrival order of their required data items
  - Ordered: the operations have a predefined execution sequence. E.g., query i consists of two operations, Read_i(x) and Read_i(y), and it may be defined that Read_i(x) has to be completed before Read_i(y) can start
- The response time of a query is the time interval from its generation time to the time when it receives all its required data items (ignoring the processing time of the last item)
18. Broadcast Schedules
- The waiting time for a data item depends on
  - The length of a schedule
  - The position of the data item in a schedule
- To minimize the mean waiting time of queries
  - Hot data items (popular data items) should be broadcast with a higher frequency
[Figure: Query x with the operations Read(i), Read(j) and Read(k), shown once as separate operations and once as the sequence Read(i) Read(j) Read(k).]
19. Broadcast Schedules
[Figure: The server passes the broadcast schedule to the broadcast server, which broadcasts the index and the schedule to clients 1 to n.]
20. Basic Schemes for Data Broadcast
- Flat Disk, Skew Disk and Multi-Disk
- Flat Disk (if it is difficult to identify the hot items)
  - A broadcast schedule consists of all the items in the database
  - In each broadcast cycle, all the data items in the database are broadcast one after another until the end of the database (cycle). Then the next cycle starts from the first item
  - The time to complete one broadcast cycle equals the time to broadcast all data items in the database
  - It is suitable for small databases, e.g., broadcast of stock items (currently we have about 1000 stock items)
  - Not scalable and not suitable for large database systems and multimedia broadcast
  - The waiting time of a query for its required data items depends on the size of the database and the sizes of the data items
  - The mean waiting time for a data item is half the cycle length
  - What will be the waiting time for multiple data items?
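The half-cycle figure can be checked with a small calculation. The sketch below assumes equal-size slots, a client that starts waiting at a uniformly random instant, and waiting time measured up to the start of the item's next broadcast. Under those assumptions the mean wait for an item is the sum of the squared gaps between its consecutive broadcasts divided by twice the cycle length. The function name mean_wait is hypothetical.

```python
from fractions import Fraction


def mean_wait(schedule, item):
    """Mean wait (in slots) until the next broadcast of `item` starts.

    schedule is one broadcast cycle, repeated forever. The client is
    assumed to arrive at a uniformly random instant within the cycle.
    """
    n = len(schedule)
    positions = [i for i, x in enumerate(schedule) if x == item]
    if not positions:
        raise ValueError(f"{item!r} is never broadcast")
    gaps = []
    for j, p in enumerate(positions):
        nxt = positions[(j + 1) % len(positions)]
        # Cyclic distance to the next occurrence; a single occurrence
        # has a gap of one full cycle.
        gaps.append((nxt - p) % n or n)
    return Fraction(sum(g * g for g in gaps), 2 * n)


if __name__ == "__main__":
    print(mean_wait(list("ABC"), "B"))   # flat disk: half the cycle length
    print(mean_wait(list("ABAC"), "A"))  # A is broadcast twice per cycle
    print(mean_wait(list("ABAC"), "B"))
```

For a flat disk every item appears once, so the single gap equals the cycle length T and the result reduces to T/2. The same function also shows how repeating a hot item shortens its wait at the cost of a longer cycle for the others.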
21. Flat Disk Schedule
- All the data items in the database (A, B and C) are broadcast with the same frequency
- Could it be?
- Unordered: the operations can be performed in any order, e.g., calculation of a mean
  - Mean waiting time: T/2
  - T is the time to finish one broadcast cycle
- How about queries with ordered operations?
22. Skewed Broadcast
- Some data items are identified as hot data items
- Hot data items should be broadcast with a higher frequency since queries are more likely to be waiting for them
- In skewed broadcast, a broadcast schedule consists of a subset of the hot data items in the database
- How to define the length of a broadcast schedule, and how to choose the data items to be included in a broadcast schedule?
  - Order the data items according to their access probabilities, which are calculated using previous access statistics reported by the clients
  - (Some) mobile clients may be requested to generate an access report periodically for the broadcast server (similar to a market survey of a product)
23. Skewed Broadcast
- Design issues
  - Size of a broadcast schedule
  - Calculation of the access probability for each item
[Figure: Data items ordered by increasing access probability; the items with the highest access probabilities, up to the broadcast schedule size, are selected for broadcast.]
24. Multi-Disk Schedule
- Divide the data items in the database into several groups based on their hot/cold properties (access probability)
- Each group forms a flat disk, and the items in the same disk have the same broadcast frequency
  - Note that the groups need not be the same size
  - The broadcast of data items in the same disk is sequential, i.e., like a flat disk
- Different disks have different broadcast frequencies
  - Multiple broadcast disks -> Multi-Disk
  - Changing the disk speeds changes their broadcast frequencies
- How to define the broadcast frequencies and the schedule?
  - Using the average access probability of the group of data items
25. Multi-Disk Schedule
- Design issues
  - Calculation of the access probability for each item
  - Grouping of data items
  - Assigning broadcast frequencies
[Figure: Data items ordered by increasing access probability and divided into groups G1 to G5.]
26. Multiple Disk Schedule
- Multiple disks of different sizes and speeds are superimposed on the broadcast channel
- Data item A is a hot data item. Its broadcast frequency is higher than that of B and C
- Could it be?
- What is the difference?
- How to interleave the broadcast of cold and hot data items so that the inter-arrival time between two instances of the same data item matches the clients' needs?
- What is the length of a broadcast schedule in multi-disk?
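One possible way to interleave the disks is the chunk-based construction known from the broadcast disks literature; the slides do not name a specific algorithm, so this choice is an assumption. Each disk gets an integer relative frequency, slower disks are split into more chunks, and one chunk from every disk is emitted in each minor cycle. The function name multi_disk_schedule and the example disks are hypothetical. The sketch needs Python 3.9 or later for math.lcm.

```python
from math import lcm


def multi_disk_schedule(disks, rel_freqs):
    """Build one broadcast cycle from several disks.

    disks:     list of lists of data items, hottest disk first
    rel_freqs: integer relative broadcast frequency of each disk
    """
    if len(disks) != len(rel_freqs):
        raise ValueError("one relative frequency per disk is required")
    if any(f < 1 for f in rel_freqs):
        raise ValueError("relative frequencies must be positive integers")
    max_chunks = lcm(*rel_freqs)
    chunked = []
    for items, freq in zip(disks, rel_freqs):
        num_chunks = max_chunks // freq
        size = -(-len(items) // num_chunks)  # ceiling division
        # A disk with fewer items than chunks yields some empty chunks;
        # nothing is broadcast for it in those minor cycles.
        chunked.append([items[k * size:(k + 1) * size]
                        for k in range(num_chunks)])
    schedule = []
    for minor in range(max_chunks):
        for chunks in chunked:
            schedule.extend(chunks[minor % len(chunks)])
    return schedule


if __name__ == "__main__":
    # A is hot (frequency 2); B and C share a slower disk (frequency 1).
    print(multi_disk_schedule([["A"], ["B", "C"]], [2, 1]))
```

With this construction the copies of a hot item are evenly spaced, and the length of the schedule is simply the length of the returned list.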
27. Data Dissemination Methods
- On-demand (Pull) broadcast
  - Clients send data requests to the server using the uplink channel (if uplink bandwidth is available)
  - The server defines the broadcast schedule based on the received client requests and the access probabilities of the data items
- Hybrid: using both Push and Pull
  - The downlink channel is divided into two parts
    - Some of the bandwidth is reserved for sending data items to clients on demand
    - Some of the bandwidth is for data broadcast following the broadcast schedule
  - How much bandwidth should be reserved for pulling?
  - How to interleave the service of push and pull?
  - Suitable for queries that need to access multiple data items
  - Data requests are only sent after waiting for a long time
  - Using on-demand for cold items (data items in slow disks)
28. Push and Pull Broadcast Schedules
- Pre-defined broadcast frequency for each group of data items according to applications and access statistics
- How to divide the bandwidth between the broadcast schedule and the on-demand schedule?
- Access statistics
  - Periodic collection of access statistics from mobile clients
- Scheduling of on-demand requests
  - FCFS
  - Earliest deadline first (each query is assigned a deadline for completion)
  - Longest waiting time first (the deadline intervals of the queries are different)
29. Broadcast Schedules
[Figure: The broadcast schedule combines push (skew disk broadcasting) and pull (on-demand data requests from clients 1 to n, served after prioritization).]
30. Currency and Consistency in Data Broadcast
- A query may need to read a set of data items in a pre-defined sequence
- The definition of a transaction
  - Consists of a sequence of primitive operations enclosed between begin and end markers
  - The operations may be ordered or unordered (precedence constraints)
[Figure: A transaction with the operations R(x), R(y), R(z) and C.]
Partial order: R(x) and R(y) may execute concurrently or in any order
31. Execution Order and Data Broadcast
- The constraints on the execution of the operations in a query can greatly increase the waiting time for data items. Why?
- The waiting time for completing a query depends on both the broadcast schedule and the execution order of the operations in the query
- Since the operation Read(z) cannot be performed before Read(x) and Read(y), the query cannot read z from the broadcast channel (it does not know that it needs z) if it has not yet obtained data item x
- In the worst case, the waiting time is nC (C is the time to complete one broadcast cycle and n is the number of items)
- The problem becomes more serious when we consider two additional issues in data dissemination: currency and consistency
32. Meeting Currency Requirement
- Update transactions are performed at the database server to maintain the freshness of the data items in the database (update streams)
  - Sensors: periodic generation
  - Location updates: based on the adopted update generation method, e.g., speed dead-reckoning
- Data conflicts may occur between update transactions and mobile queries
  - Update transactions are performed at the database server to maintain the freshness of data objects in the database
  - Reads of data objects (by queries) occur concurrently
33. Meeting Consistency Requirement
- Definition of data conflict: two transactions have a data conflict if the first one reads a data object and the second one updates the same object before the commit (completion) of the first one
- How to resolve data conflicts in a database system?
- The conflict cannot be detected by locking or by the conventional concurrency control methods
  - Distributed concurrency control problem
  - The overhead of locking in a wireless network is too heavy
  - How to resolve the disconnection problem after granting a lock to a client program?
- Data conflicts in transaction execution may result in inconsistent data accesses
  - Generate incorrect results from the transactions
34. Broadcast Schedules
[Figure: The broadcast server broadcasts the index and the broadcast schedule to clients 1 to n, while updates are applied at the server.]
35. Concurrent Execution: Inconsistent Retrieval Problem
Transaction T: BankWithdraw(A, 100); BankDeposit(B, 100)
Transaction U: BankBranchTotal()
Interleaved execution (time runs downward; the number at the end of each line is the value shown on the slide):
  T: balance = A.Read()             200
  T: A.Write(balance - 100)         100
  U: balance = A.Read()             100
  U: balance = balance + B.Read()   300
  T: balance = B.Read()             200
  T: B.Write(balance + 100)         300
36. Correct Execution of Transactions
- A schedule shows the execution order of the operations of a set of transactions (update and mobile transactions)
- Serial execution (schedule)
  - Execute transactions one after another
  - The next transaction starts only after the previous one has been committed or aborted
  - If we have two transactions, we may have two different serial schedules, i.e., T1 then T2, and T2 then T1
  - Always maintains database consistency since all transactions start from a consistent database state
- Serial equivalence (serializable)
  - Transactions are executed concurrently but the result is equivalent to that of a serial schedule of the same set of transactions (which serial schedule? Any one)
37. Serial Execution
Transaction T: BankWithdraw(A, 100); BankDeposit(B, 100)
Transaction U: BankBranchTotal()
Serial execution, T then U (the number at the end of each line is the value shown on the slide):
  T: balance = A.Read()             200
  T: A.Write(balance - 100)         100
  T: balance = B.Read()             200
  T: B.Write(balance + 100)         300
  U: balance = A.Read()             100
  U: balance = balance + B.Read()   300
  U: balance = balance + C.Read()   400
  U: ...
38. Serial Equivalence
Transaction T: BankWithdraw(A, 4); BankDeposit(B, 4)
Transaction U: BankWithdraw(C, 3); BankDeposit(B, 3)
Interleaved execution (time runs downward; the number at the end of each line is the value shown on the slide):
  T: balance = A.Read()       100
  T: A.Write(balance - 4)     96
  U: balance = C.Read()       300
  U: C.Write(balance - 3)     297
  T: balance = B.Read()       200
  T: B.Write(balance + 4)     204
  U: balance = B.Read()       204
  U: B.Write(balance + 3)     207
39. Consistency in Data Broadcast
- How to determine the correctness of transaction execution? I.e., under which situations is a conflict harmful?
- Look at the execution order of the conflicting operations in a schedule
- Serialization graph (SG): each edge Ti -> Tj in an SG means that at least one of Ti's operations precedes and conflicts with one of Tj's operations
  - At the client, a query contains a read operation to read a data item x
  - At the server, an update transaction wants to update x
- Serializability theorem
  - A schedule H is serializable iff SG(H) is acyclic
40. Consistency in Data Broadcast
- Example 1: data conflict between a mobile transaction (MT) and an update transaction
- Suppose an update transaction, U, updates data item d5 and then data item d2, and an MT wants to read d2 and d5. Remember that the update is performed at the server and the MT is executed at a mobile client. Suppose the schedule is
  - Server broadcasts d2
  - MT reads d2
  - U updates d5 and d2
  - Server broadcasts d5
  - MT reads d5
- The MT may observe inconsistent data values. The serialization graph is cyclic (MT -> U -> MT), so the schedule is non-serializable
- The reason is that the MT reads one data item, d2, before U updates it, and reads another conflicting data item, d5, after U has updated it
41. Consistency in Data Broadcast
- Example 2: an MT conflicts with two (or more) update transactions
- Even though the serialization order between each update transaction and the mobile transaction is acyclic, the final serialization graph can still be cyclic due to transitive dependencies
- Suppose there are two updates U1 and U2 such that U1 updates d2 and then d1, and U2 updates d1 and then d5. Suppose the schedule is
  - Broadcast transaction (BT) broadcasts d2
  - MT reads d2
  - U1 updates d2 and d1
  - U2 updates d1 and d5
  - Broadcast transaction (BT) broadcasts d5
  - MT reads d5
- The serialization graph is cyclic: U2 -> MT -> U1 -> U2
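The two examples can be checked mechanically. The sketch below builds a serialization graph from a history of (transaction, operation, item) triples and tests it for cycles with a depth-first search. As a simplification that is not in the slides, each MT read is treated as reading the value that was current at broadcast time, so the broadcast steps are not listed separately. The function names are hypothetical.

```python
def serialization_graph(history):
    """Edges Ti -> Tj for every pair of conflicting operations.

    history is a list of (txn, op, item) with op "r" or "w", in execution
    order. Two operations conflict if they come from different
    transactions, touch the same item, and at least one is a write.
    """
    edges = set()
    for i, (t1, op1, x1) in enumerate(history):
        for t2, op2, x2 in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True  # back edge: cycle found
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


if __name__ == "__main__":
    example1 = [("MT", "r", "d2"), ("U", "w", "d5"),
                ("U", "w", "d2"), ("MT", "r", "d5")]
    example2 = [("MT", "r", "d2"), ("U1", "w", "d2"), ("U1", "w", "d1"),
                ("U2", "w", "d1"), ("U2", "w", "d5"), ("MT", "r", "d5")]
    for name, h in (("example 1", example1), ("example 2", example2)):
        sg = serialization_graph(h)
        print(name, sorted(sg), "cyclic" if has_cycle(sg) else "acyclic")
```

By the serializability theorem, a history is acceptable exactly when has_cycle returns False for its graph.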
42. How to resolve this problem?
- The conventional methods for concurrency control are not suitable
- Multi-version data broadcast
  - For flat disk only
- Update First with Order (UFO)
  - For flat disk, skew disk and multi-disk
43. Multi-Version Data Broadcast
- Multi-version data broadcast
  - Broadcast multiple versions of a data item (current version + previous versions). How many versions?
- A push-based method
  - No uplink data requests
  - No need to set any lock or to inform the database server before accessing any data items
- Maintains multiple versions of each data item
  - Each new update creates a new version, and the old versions are still maintained in the system
44. Multi-Version Data Broadcast
- Providing a consistent view to queries by batch updates
  - The updates on data items are batched until the end of a broadcast cycle, even if they arrive in the middle of a broadcast cycle
  - During updates, the broadcast of data items is suspended
  - The version number indicates at which cycle end the version was created
  - Even with no update, a new version is created from the old version at the end of each broadcast cycle
  - After the completion of the batch of updates, the database is consistent and each newly created data version is assigned the cycle number as its version number
- Accessing data versions in MV
  - If a query wants to access a data object, it gets the latest version of the data object from the broadcast cycle for its first read operation
  - The subsequent read operations of the query read the data objects with the same version number as the first operation
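The access rule can be sketched from the client's point of view. This is a minimal illustration, assuming a broadcast cycle is represented as a mapping from item name to a {version number: value} dictionary. The class name MVQuery, that representation, and the example values are hypothetical.

```python
class MVQuery:
    """A read-only query that follows the multi-version access rule."""

    def __init__(self):
        self.version = None  # fixed by the first read

    def read(self, item, broadcast):
        """Read `item` from one broadcast cycle.

        broadcast maps each item to the versions carried in this cycle.
        """
        versions = broadcast[item]
        if self.version is None:
            # First read: take the latest version on the air.
            self.version = max(versions)
        if self.version not in versions:
            # The version is no longer broadcast, for example because the
            # query has outlived the number of versions kept on the air.
            raise LookupError(
                f"version {self.version} of {item!r} is not broadcast")
        return versions[self.version]


if __name__ == "__main__":
    cycle3 = {"x": {3: 10, 2: 8}, "y": {3: 5, 2: 4}}
    cycle4 = {"x": {4: 11, 3: 10}, "y": {4: 7, 3: 5}}
    q = MVQuery()
    print(q.read("x", cycle3))  # first read fixes the version
    print(q.read("y", cycle4))  # later read uses the same version
```

Because every read returns values from the same version, the query sees the database as it was at one cycle boundary, even though its reads span several cycles.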
45. Multi-Version Data Broadcast
- How many versions should be broadcast?
- In MV, it is assumed that each query has a maximum life-span (L) and no query exists in the system longer than the life-span
  - The life-span can be considered as the deadline interval of a query: start time + deadline interval = deadline
  - After the deadline, the query will be aborted. Why?
- The maximum life-span of the queries, together with the time required for completing a broadcast cycle (BC), is used to calculate the number of versions and the versions to be broadcast in a cycle for a data item
  - L/BC
- Assuming the use of flat disk
  - Why? What will be the problem if a skew disk is used?
46. Multi-Version Data Broadcast
- Why is data consistency guaranteed in MV?
- The update and broadcast of data objects are NOT interleaved
- The view provided in each broadcast cycle is a CONSISTENT view at the start time of the broadcast cycle. What is the definition of a transaction?
- It is a consistent view since there are no incomplete (partially completed) transactions in the system at that time point
- Remember: if a transaction starts from a consistent view, the database is consistent when it completes (assuming a concurrency control method, e.g., 2PL, resolves the data conflicts among the conflicting transactions)
48. Multi-Version Data Broadcast
- MV can be applied to accessing cached data objects
- The clients may keep previous versions of data items in their caches, and the same rule used for accessing broadcast data is used for accessing cached items
- The multi-version method is very useful for systems where the mobile clients are frequently disconnected from the mobile network
49. Multi-Version Data Broadcast
- Consistency vs. currency
  - Although MV broadcast can ensure the consistency of the data objects provided to a mobile query, the currency of the data objects is sacrificed
  - Why? Delays in (and even skipping of) processing updates (batch updates)
  - The latest version of a data object broadcast in a cycle is the last version created before the start of the broadcast cycle (how about the others?)
  - Each data object has to be broadcast at least once in each cycle (flat disk). What will happen if not?
  - Overhead of broadcasting multiple versions
- Point consistency vs. interval consistency
  - MV provides a consistent view of the database between the start time and end time of a query
  - How about continuous queries, which want to generate results continuously over an interval? If some updates are skipped, some events are ignored
50. Update-first with Order (UFO)
- UFO is another algorithm to ensure data consistency for mobile queries
- In UFO, instead of detecting data conflicts between mobile queries and update transactions, the server checks for data conflicts between a broadcast transaction and an update transaction
  - The broadcast schedule is modeled as a broadcast transaction (BT)
  - The length of a BT is defined as the maximum life-time of a mobile query
- The basic principle of the UFO algorithm is to ensure that if data conflicts occur between a BT and an update transaction, the serialization order between them is always U -> BT
- Since mobile queries (MT) read data items from broadcast transactions, their serialization order is always BT -> MT
- The serialization order between the update transactions and the mobile queries is therefore always U -> MT, which is serializable
51. Update-first with Order (UFO)
- The execution of an update transaction (at the server) is divided into two phases: the execution phase and the update phase
- During the execution phase, an update transaction is executed and data conflicts with other update transactions are resolved according to the adopted concurrency control protocol
  - The updates of data items are written to a private workspace of the transaction during the execution phase
- When all operations of an update transaction have been executed, it enters the update phase
  - Permanent updates to the database are performed by copying the new values from the private workspace into the database
  - During the update phase, the broadcast of data items is stopped (the BT always observes a consistent database)
52. Update-first with Order (UFO)
- Before an update transaction starts its update phase, the system detects data conflicts between the update transaction and the broadcast transactions in the current and previous broadcast cycles
- At the start of the update phase, the set of data items to be updated by the update transaction is known, since all its operations have been completed
- At the same time, the set of data items to be read (broadcast) by a broadcast transaction is also known, since it results from the broadcast algorithm
- The two sets of data items are compared. If they overlap, there is a data conflict
  - The conflicting items will be rebroadcast
  - The overhead (rebroadcast) depends on the conflict probability
53. Update-first with Order (UFO)
- BT: the broadcast transaction for any current broadcast cycle i
- OBT: the set of data items of broadcast transaction BT
- OU: the set of data items of update transaction U
- BA = { x in OBT ∩ OU : x has already been broadcast when U arrives }
- Before the permanent update starts, the following algorithm is performed
- If OBT ∩ OU = ∅
  - Then BT and U have no dependency
- Else
  - If BA = ∅
    - Then the serialization order is U -> BT
  - Else
    - For each data item i in BA
      - re-broadcast data item i
    - Next
    - the serialization order is U -> BT
  - End If
- End If
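The check performed before the update phase can be sketched with plain set operations. This is an illustrative reading of the pseudocode above, not the authors' implementation. The function name ufo_conflict_check, its return format, and the explicit already_broadcast argument are assumptions made for the example.

```python
def ufo_conflict_check(obt, ou, already_broadcast):
    """Decide what must be re-broadcast before U's permanent update.

    obt:               items of the broadcast transaction BT (set OBT)
    ou:                items updated by the update transaction U (set OU)
    already_broadcast: items of BT already on the air when U arrives
    """
    overlap = set(obt) & set(ou)
    if not overlap:
        # OBT and OU are disjoint: BT and U have no dependency.
        return {"dependency": False, "rebroadcast": set()}
    # BA: conflicting items that clients may already have read.
    ba = overlap & set(already_broadcast)
    # Re-broadcasting every item in BA after the update keeps the
    # serialization order U -> BT; if BA is empty nothing is re-sent.
    return {"dependency": True, "rebroadcast": ba}


if __name__ == "__main__":
    # BT broadcasts d2 and d5; d2 is already out when U (d5, d2) arrives.
    print(ufo_conflict_check({"d2", "d5"}, {"d5", "d2"}, {"d2"}))
```

The size of the returned rebroadcast set is the overhead that the previous slide relates to the conflict probability.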
55. UFO Example
- The broadcast transaction (BT) broadcasts d2
- MT reads d2
- Compare the data sets of U and BT
- U updates d5
- U updates d2
- BT re-broadcasts d2
- MT reads the most recently updated value of d2
- BT continues its process and broadcasts d5
- MT reads d5
- The serialization graph is acyclic: U -> MT
56. MV and UFO
- Relative consistency problem
  - MV: accessing data with the same version
  - UFO: assigning time-stamps to data versions to indicate their validity. The checking then follows the requirement of relative consistency
- Ordered transaction problem
  - MV: more versions need to be included since the query life-span is longer
  - UFO: need to restart a query if the arrival order is different from the access order of the data items
  - The restart cost can be minimized by caching previously accessed data items
57. References
- Schiller, Mobile Communications, Ch. 6.1 and 6.2
- Fundamentals of Mobile and Pervasive Computing, Chapter 3