Title: Shetal Shah, IITB
1 Dissemination of Dynamic Data: Semantics, Algorithms, and Performance
Shetal Shah, IITB
Modified by Ajinkya Joshi For CS 632
2 More and more of the information we consume is dynamically constructed
3Buying a camera? Track auctions
4 Dynamic Data
- Data gathered by (wireless sensor) networks
- Sensors that monitor light, humidity, pressure, and heat
- Network traffic passing via switches
- Sports scores
- Score changes by 5 points
- Financials
- Rice price changes by Rs. 10 compared to previous day
- Total value of stock portfolio exceeds 10,000
5 Continual Queries
A CQ is a standing query coupled with a trigger/select condition.
CQ stock_monitor:
    SELECT stock_price FROM quotes
    WHEN stock_price - prev_stock_price > 0.5
CQ RFP_tracker:
    SELECT project_name, contact_info FROM RFP_DB
    WHERE skill_set_required ⊆ available_skills
Not every change at a source leads to a change in the result of the query.
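The stock_monitor query can be sketched in Python as follows. This is only an illustration: the class name `ContinualQuery` and its callback interface are hypothetical, and the sketch assumes that the WHEN clause means "the price exceeds the last reported price by more than 0.5".

```python
from typing import Callable, List, Optional


class ContinualQuery:
    """Standing query: evaluated on every source change, but it reports
    a result only when the trigger condition holds (hypothetical helper)."""

    def __init__(self, trigger: Callable[[float, float], bool]) -> None:
        self.trigger = trigger
        self.prev: Optional[float] = None  # last value reported to the user

    def on_change(self, value: float) -> Optional[float]:
        # The first value is always reported; later ones only if triggered.
        if self.prev is None or self.trigger(value, self.prev):
            self.prev = value
            return value
        return None


# Assumed reading of the WHEN clause: stock_price - prev_stock_price > 0.5
stock_monitor = ContinualQuery(lambda price, prev: price - prev > 0.5)
reported: List[float] = [
    p for p in (100.0, 100.2, 100.4, 100.75, 100.9)
    if stock_monitor.on_change(p) is not None
]
print(reported)
```

Only two of the five source changes produce a result here, which is the point of the last line of the slide: most changes at the source do not change the query result.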
6 Generic Architecture
[Figure: data sources (sensors, servers) connect through the network to proxies / caches / data aggregators, which connect through the network to end-hosts (wired host, mobile host).]
7 Where should the queries execute?
- At clients
- Can't optimize across clients, links
- At source (where changes take place)
- Advantages
- Minimum number of refresh messages, high fidelity
- Main challenge
- Scalability
- Multiple sources hard to handle
- At Data Aggregators (DAs/proxies) placed at the edge of the network
- Advantages
- Allows scalability through consolidation, multiple data sources
- Main challenge
- Need mechanisms for maintaining data consistency at DAs
8 Coherency of Dynamic Data
- Strong coherency
- The client and source are always in sync with each other
- Strong coherency is expensive!
- Relax strong coherency: Δ-coherency
- Time domain: Δt-coherency
- The client is never out of sync with the source by more than Δt time units
- e.g., traffic data not stale by more than a minute
- Value domain: Δv-coherency
- The difference in the data values at the client and the source is bounded by Δv at all times
- e.g., only interested in temperature changes larger than 1 degree
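The two relaxed notions of coherency can be written as simple predicates. The sketch below is illustrative only: the function names are hypothetical, `delta_t` and `delta_v` stand for the time-domain and value-domain bounds described above, and the time check assumes the client knows when its copy was last in sync with the source.

```python
def value_coherent(client_value: float, source_value: float, delta_v: float) -> bool:
    """Value domain: the client copy may differ from the source by at most delta_v."""
    return abs(client_value - source_value) <= delta_v


def time_coherent(last_in_sync: float, now: float, delta_t: float) -> bool:
    """Time domain: the client copy was in sync with the source at most
    delta_t time units ago (assumes last_in_sync is known to the checker)."""
    return now - last_in_sync <= delta_t


# Temperature example: only changes larger than 1 degree matter.
print(value_coherent(20.4, 21.0, delta_v=1.0))
# Traffic example: data must not be stale by more than 60 seconds.
print(time_coherent(last_in_sync=0.0, now=90.0, delta_t=60.0))
```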
9 Coherency Requirement (c)
temperature, max incoherency 1 degree
10 Data/Query Value
[Figure: data/query value at the client and at the server plotted over time T, showing the coherency bounds and a violation.]
11 Source pushes interesting changes
- Achieves Δv-coherency
- Keeps network overhead to a minimum
- Poor scalability (the source has to maintain state and keep connections open)
[Figure: the source pushes to the DA, which pushes to the user.]
12 Pull interesting changes
[Figure: the user pulls from the repository, which pulls from the server.]
- Pull after
- Time to Live (TTL)
- Time To Next Refresh (TTR / TNR)
- Can be implemented using the HTTP protocol
- Stateless and hence generally scalable with respect to state space and computation
- Need to estimate when a change of interest will happen
- Heavy polling for stringent coherency requirements or highly dynamic data
- Network overheads higher than for push
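To make "estimate when a change of interest will happen" concrete, here is a small Python sketch of one possible TTR estimator. The heuristic (coherency bound divided by the most recently observed rate of change, clamped to a range) is an assumption chosen for illustration, not the method from the slides; `next_ttr`, `ttr_min`, and `ttr_max` are hypothetical names.

```python
def next_ttr(prev_value: float, new_value: float, elapsed: float,
             delta_v: float, ttr_min: float = 1.0, ttr_max: float = 60.0) -> float:
    """Estimate the time to next refresh from the last observed rate of change.

    Assumed heuristic: if the value keeps changing at the observed rate, it
    drifts by delta_v after delta_v / rate time units, so poll again then.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    rate = abs(new_value - prev_value) / elapsed
    if rate == 0:
        return ttr_max  # no change observed: back off to the longest interval
    return max(ttr_min, min(ttr_max, delta_v / rate))


print(next_ttr(10.0, 11.0, elapsed=10.0, delta_v=0.5))  # slowly changing data
print(next_ttr(10.0, 20.0, elapsed=1.0, delta_v=0.5))   # highly dynamic data
```

The second call is clamped to `ttr_min`, which illustrates the heavy polling that pull needs for highly dynamic data or stringent coherency requirements.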
13Complementary Properties
14Dynamic Content Distribution Networks
To create a scalable content dissemination
network (CDN) for streaming/dynamic data.
15 Dissemination Network Example
[Figure: dissemination network over repositories A, B, C, and D. Data set: p, q, r. Max clients: 2.]
16 Challenges - I
- Given the data and coherency needs of repositories,
- how should repositories cooperate to satisfy these needs?
- How should repositories refresh the data such that
- coherency requirements of dependents are satisfied?
- How to make the repository network resilient to failures?
VLDB02, VLDB03, IEEE TKDE
17 Challenges - II
- Given the data and the coherency available at repositories in the network,
- how to assign clients to repositories?
- Given the data and coherency needs of clients in the network,
- what data should reside in each repository
- and at what coherency?
- If the client requirements keep changing,
- how and when should the repositories be reorganized?
RTSS 2004, VLDB 2005
18 Dynamics along three axes
- Data is dynamic, i.e., data changes rapidly and unpredictably
- Data items that a client is interested in also change dynamically
- Network is dynamic: nodes come and go
19Data Dissemination
20 Data Dissemination
- Different users have different coherency requirements for the same data item.
- Coherency requirement at a repository should be at least as stringent as that of the dependents.
- Repositories disseminate only changes of interest.
[Figure: repositories A, B, C, D and a client.]
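The rule that a repository must be at least as stringent as its dependents can be sketched as below. The function names are hypothetical, and the sketch assumes that a smaller coherency value means a more stringent requirement.

```python
from typing import Iterable


def required_coherency(own: float, dependents: Iterable[float]) -> float:
    """Coherency a repository must ask for: the most stringent (smallest)
    of its own requirement and those of its dependents."""
    return min([own, *dependents])


def can_serve(parent_coherency: float, child_coherency: float) -> bool:
    """A parent can serve a child only if it is at least as stringent."""
    return parent_coherency <= child_coherency


print(required_coherency(0.5, [0.3, 0.7]))  # the 0.3 dependent dominates
print(can_serve(0.5, 0.3))                  # too loose to serve this child
```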
21 Condition for Data dissemination
- P will send the update to Q only if: [formula not preserved]
Is this condition sufficient?
22 Data dissemination must be done with care
[Figure: worked example tracing the values 1, 1.2, 1.4, 1.5, 1.7 through the dissemination tree.]
Dissemination should prevent missed updates!
23 Source Based Dissemination Algorithm
- For each data item, the source maintains
- the unique coherency requirements of repositories
- the last update sent for each such coherency requirement
- For every change, the source
- finds the maximum coherency requirement for which the change must be disseminated
- tags the change with that coherency
- disseminates (changed data, tag)
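The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `Source` is hypothetical, a change is taken to violate a coherency requirement c when it differs from the last value sent for c by at least c, and after sending a tagged update the sketch resets the last-sent value for every requirement no looser than the tag (since those repositories all receive it).

```python
from typing import Dict, Iterable, Optional, Tuple


class Source:
    """Source-side bookkeeping for one data item (illustrative only)."""

    def __init__(self, initial: float, coherencies: Iterable[float]) -> None:
        # One entry per unique coherency requirement: the last value sent for it.
        self.last_sent: Dict[float, float] = {c: initial for c in set(coherencies)}

    def on_change(self, value: float) -> Optional[Tuple[float, float]]:
        # Requirements that this change violates (assumed test: |v - last| >= c).
        violated = [c for c, last in self.last_sent.items() if abs(value - last) >= c]
        if not violated:
            return None  # not a change of interest to any repository
        tag = max(violated)
        # Assumption: every repository with requirement <= tag gets the update.
        for c in self.last_sent:
            if c <= tag:
                self.last_sent[c] = value
        return (value, tag)


source = Source(1.0, [0.3, 0.5])
sent = [source.on_change(v) for v in (1.2, 1.4, 1.5, 1.7)]
print(sent)
```

With requirements 0.3 and 0.5, the value 1.5 is tagged 0.5 even though it is within 0.3 of the previously sent 1.4, so the repository with the looser requirement does not miss it.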
24 Source Based Dissemination Algorithm
[Figure: the same worked example (values 1, 1.2, 1.4, 1.5, 1.7) under the source-based algorithm.]
25Repository Based Dissemination Algorithm
A repository P sends changes of interest to the
dependent Q if
26 Repository Based Dissemination Algorithm
[Figure: the same worked example (values 1, 1.2, 1.4, 1.5, 1.7) under the repository-based algorithm.]
27 Building the content distribution network
Choose parents for repositories such that the overall fidelity observed by the repositories is high: reduce communication and computational delays.
28 If parents are not chosen judiciously
- It may result in
- Uneven distribution of load on repositories.
- Increase in the number of messages in the system.
[Figure: repositories A, B, C, D.]
Increase in loss of fidelity!
29 LeLA
- Looks for the position of Q level by level
- Each level has a load controller node
- For each repository on that level, it calculates a preference factor
- The smaller the preference factor, the better the chance of a repository becoming the parent
30Preference factor
- Data availability factor
- Computational delay factor
- Communication delay factor
- Preference factor
31 DiTA
- Repository N needs data item x
- If the source has available push connections, or the source is the only node in the dissemination tree for x
- N is made the child of the source
- Else
- the repository is inserted in the most suitable subtree, where
- N's ancestors have more stringent coherency requirements
- N is closest to the root
32 Most Suitable Subtree?
- l: smallest level in the subtree with coherency requirement less stringent than N's.
- d: communication delay from the root of the subtree to N.
- smallest (l x d): most suitable subtree.
Essentially, minimize communication and computational delays!
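The selection rule (smallest l x d) is easy to state in code. The sketch below is illustrative: the `Subtree` record, its field names, and the sample numbers are hypothetical; only the l x d score comes from the slide.

```python
from typing import List, NamedTuple


class Subtree(NamedTuple):
    root: str        # repository at the root of the candidate subtree
    level: int       # l: smallest level with a requirement less stringent than N's
    delay_ms: float  # d: communication delay from the subtree root to N


def most_suitable(candidates: List[Subtree]) -> Subtree:
    """Pick the subtree with the smallest l x d product."""
    return min(candidates, key=lambda s: s.level * s.delay_ms)


candidates = [
    Subtree("A", level=2, delay_ms=30.0),  # score 60
    Subtree("B", level=3, delay_ms=15.0),  # score 45
    Subtree("C", level=1, delay_ms=50.0),  # score 50
]
print(most_suitable(candidates).root)
```

The level term stands in for computational delay (more hops, more processing) and the delay term for communication delay, so the product trades the two off.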
33Example
Initially the network consists of the source.
34Example
D requests service of q with coherency
requirement 0.2
36 Comparison of LeLA and DiTA
- LeLA: each node does more work
- DiTA: high communication cost
37 Resiliency
38 Handling Failures in the Network
- Need to detect permanent/transient failures in the network and to recover from them
- Resiliency is obtained by adding redundancy
- Without redundancy, failures ⇒ loss in fidelity
- Adding redundancy can increase cost ⇒ possible loss of fidelity!
- Handle failures such that the cost of adding resiliency is low!
39 Passive/Active Failure Handling
- Passive failure detection
- Parent sends "I'm alive" messages at the end of every time interval.
- What should the time interval be?
- Active failure handling
- Always be prepared for failures.
- For example, 2 repositories can serve the same data item at the same coherency to a child.
- This means lots of work ⇒ greater loss in fidelity.
40 Middle Path
Let repository R want data item x with coherency c.
A backup parent B is found for each data item that the repository needs.
[Figure: parent P serves R at coherency c.]
At what coherency should B serve R?
41 If a parent fails
- Detection: the child gets two consecutive updates from the backup parent with no updates from the parent
- Recovery: the backup parent is asked to serve at coherency c till we get an update from the parent
[Figure: the parent serves R at coherency c; backup parent B serves R at coherency k x c.]
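The detection and recovery rules can be sketched from the child's point of view. This is a minimal sketch: the class `Child` and its method names are hypothetical, and it assumes (from the figure labels) that the backup parent normally serves at the looser coherency k x c.

```python
class Child:
    """Tracks who sent recent updates to decide whether the parent has failed."""

    def __init__(self) -> None:
        self.backup_streak = 0     # consecutive backup updates with no parent update
        self.parent_failed = False

    def on_update(self, sender: str) -> bool:
        """Record an update from 'parent' or 'backup'; return the failure flag."""
        if sender == "parent":
            # Any update from the parent ends the failure episode.
            self.backup_streak = 0
            self.parent_failed = False
        else:
            self.backup_streak += 1
            if self.backup_streak >= 2:
                self.parent_failed = True
        return self.parent_failed

    def backup_coherency(self, c: float, k: float) -> float:
        """Coherency to request from the backup: c during recovery, k x c otherwise."""
        return c if self.parent_failed else k * c


child = Child()
for sender in ("parent", "backup", "backup"):
    child.on_update(sender)
print(child.parent_failed, child.backup_coherency(c=0.1, k=2))
```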
42 Adding Resiliency to DiTA
- A sibling of P is chosen as the backup parent of R.
- If P fails,
- A serves B with coherency c
- ⇒ change is local.
- If P has no siblings, a sibling of the nearest ancestor is chosen.
- Else the source is made the backup parent.
[Figure: nodes A, B, and R, with edges labeled c and k x c.]
43 Markov Analysis for k
- Assumptions
- Data changes as a random walk along the line
- The probability of an increase is the same as that of a decrease
- No assumptions made about the unit of change or the time taken for a change
Expected number of misses for any k < 2k^2 - 2
For k = 2, expected number of misses < 6
44 Experimental Methodology
- Physical network: 4 servers, 600 routers, 100 repositories
- Communication delay: 20-30 ms
- Computation delay: 3-5 ms
- Real stock traces: 100-1000
- Time duration of observations: 10,000 s
- Tight coherency range: 0.01 to 0.05
- Loose coherency range: 0.5 to 0.99
45 Failure and Recovery Modelling
- Failures and recovery modeled based on trends observed in practice
- "Analysis of link failures in an IP backbone" by G. Iannaccone et al., Internet Measurement Workshop 2002
[Figure: trend for time between failures.]
Recovery: 10% > 20 min; 40% > 1 min and < 20 min; 50% < 1 min
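A simulator could draw recovery times from this distribution roughly as follows. The sketch assumes the recovery figures are percentages (10% over 20 min, 40% between 1 and 20 min, 50% under 1 min); the bucket labels and function name are hypothetical, and how a time is chosen within a bucket is left open because the slide does not say.

```python
import random
from collections import Counter

# Assumed reading of the recovery distribution on the slide.
BUCKETS = ["< 1 min", "1-20 min", "> 20 min"]
WEIGHTS = [50, 40, 10]


def sample_recovery_bucket(rng: random.Random) -> str:
    """Draw one recovery-time bucket with the assumed probabilities."""
    return rng.choices(BUCKETS, weights=WEIGHTS, k=1)[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
counts = Counter(sample_recovery_bucket(rng) for _ in range(10_000))
for bucket in BUCKETS:
    print(bucket, counts[bucket] / 10_000)
```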
46In the Presence of Failures, Varying Recovery
Times
Addition of resiliency does improve fidelity.
47In the Presence of Failures, Varying Data Items
Increasing Load
Fidelity improves with addition of resiliency
even for large number of data items.
48 In the Absence of Failures
Increasing Load
Often, fidelity improves with addition of
resiliency, even in the absence of failures!
50 Sources of delay
- Queuing delay
- Time delay between arrival of an update and start of processing
- Processing delay
- Check delay
- Data coherency requirements are checked
- Computation delay
- Computing the data to be pushed and actually pushing it
51 What is our goal?
- Aim is to improve average fidelity over all repositories
- This can be achieved using
- Better filtering of updates
- Better scheduling of disseminations
52 Better filtering of updates
- For every dependent, a repository maintains
- Coherency requirement (cr)
- Last pushed value
- New value is pushed if it differs from the last pushed value by cr or more
- This creates a window with
- Lower bound lb = last pushed value - cr
- Upper bound ub = last pushed value + cr
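The window test can be sketched per dependent as below. The class and method names are hypothetical, and the sketch assumes the window is centered on the last pushed value (lb = last - cr, ub = last + cr) and that a value exactly on a bound is pushed.

```python
from typing import Tuple


class Dependent:
    """Per-dependent filter state kept by a repository (illustrative only)."""

    def __init__(self, cr: float, last_pushed: float) -> None:
        self.cr = cr
        self.last_pushed = last_pushed

    def window(self) -> Tuple[float, float]:
        """(lb, ub) around the last pushed value."""
        return (self.last_pushed - self.cr, self.last_pushed + self.cr)

    def offer(self, value: float) -> bool:
        """Push the value if it falls outside the open window; report whether pushed."""
        lb, ub = self.window()
        if lb < value < ub:
            return False  # still within the coherency requirement: filter it out
        self.last_pushed = value
        return True


dep = Dependent(cr=0.5, last_pushed=10.0)
print(dep.window())      # window around 10 with cr = 0.5
print(dep.offer(10.3))   # inside the window, filtered
print(dep.offer(10.55))  # outside the window, pushed; the window moves
```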
53 Cr as ordering parameter?
- Is it possible to use Cr as an ordering parameter?
[Figure: the source sends update 10.3 to dependents A and B, each with last pushed value 10. Windows shown: (9.5, 10.5), (9.7, 10.3), and, after the update, (10, 10.6).]
Next update is 10.55. Can Cr still act as an ordering parameter?
54 Restriction on updates
- In order to use Cr as an ordering parameter, some restrictions on updates are needed
- If c1 < c2 => l2 <= l1 and u2 >= u1
- To satisfy these inequalities, a pseudo update value is used
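55 Update with pseudo value
[Figure: the source sends update 10.3 to dependents A and B, each with last pushed value 10. The dependent with window (9.7, 10.3) is sent the pseudo value 10.2, giving the window (9.9, 10.5); the window (9.5, 10.5) is unchanged.]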
56 How to calculate pseudo update?
[Figure: values 10.4 and 10.1; next update 10.55.]
If v < (lb + ci) then pseudo val = lb + ci
else if v > (ub - ci) then pseudo val = ub - ci
else pseudo val = v
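The pseudo-value rule amounts to clamping the update into a narrower range. The sketch below rests on two assumptions: that the second branch uses ub - ci (the mirror image of the first branch), and that lb and ub are the bounds of the enclosing window, i.e. the window of the dependent with the next larger coherency requirement. The function name is hypothetical.

```python
def pseudo_value(v: float, lb: float, ub: float, ci: float) -> float:
    """Clamp update v into [lb + ci, ub - ci] so that the window
    (pseudo - ci, pseudo + ci) stays inside the enclosing window (lb, ub)."""
    if v < lb + ci:
        return lb + ci
    if v > ub - ci:
        return ub - ci
    return v


# Numbers from the slides: enclosing window (9.5, 10.5), ci = 0.3, update 10.3.
p = pseudo_value(10.3, lb=9.5, ub=10.5, ci=0.3)
print(round(p, 2), (round(p - 0.3, 2), round(p + 0.3, 2)))
```

Under these assumptions the update 10.3 becomes the pseudo value 10.2 and the new window is (9.9, 10.5), matching the values on the preceding slide.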
57 Better Scheduling
- Order in which an update should be processed
- Order in which an update should be propagated
- Let u1, u2, ..., un be the updates that we want to process
58 Better Scheduling (Cont.)
- C(u1), C(u2), ... are the time delays for processing the updates (cost)
- B(u1), B(u2), ... are the total numbers of descendants that would benefit (benefit)
- Optimal ordering is by the score of
- B(ui)/C(ui)
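Ordering by benefit per unit cost can be sketched as a sort. The `Update` record and the sample numbers are hypothetical, and the sketch assumes that a higher B/C score means the update is processed earlier.

```python
from typing import List, NamedTuple


class Update(NamedTuple):
    name: str
    benefit: int  # B(u): number of descendants that would benefit
    cost: float   # C(u): time delay for processing the update


def schedule(updates: List[Update]) -> List[Update]:
    """Order updates by descending B(u)/C(u) (assumed: highest score first)."""
    return sorted(updates, key=lambda u: u.benefit / u.cost, reverse=True)


pending = [Update("u1", 4, 2.0), Update("u2", 9, 3.0), Update("u3", 1, 1.0)]
print([u.name for u in schedule(pending)])
```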
59Better Scheduling (Evaluation)
60Acknowledgements
- Allister Bernard Vivek Sharma
- S. Dharmarajan
- Shweta Agarwal
- T. Siva
- Prof. C. Ravishankar
- Prof. Sohoni and Prof. Rangaraj
- Prof. S. Sudarshan
- Prof. Krithi Ramamritham