1

PAPER PRESENTATION on "An Efficient and
Resilient Approach to Filtering & Disseminating
Streaming Data"
CMPE 521 Database Systems
Prepared by Mürsel Tasgin, Onur Kardes
2
Introduction
  • The internet and the web are increasingly used to
    disseminate fast-changing data.
  • Several examples of fast-changing data:
  • sensor readings,
  • traffic and weather information,
  • stock prices,
  • sports scores,
  • health monitoring information.

3
Introduction
  • The properties of this data:
  • highly dynamic,
  • streaming,
  • aperiodic.
  • Users are interested not only in monitoring
    streaming data but also in using it for on-line
    decision making.

4
Introduction
  • Replicating the Source

[Diagram: a SOURCE replicated to Repository 1,
Repository 2 and Repository 3]
5
Introduction
  • Services like Akamai.net and IBM's edge-server
    technology are exemplars of such networks of
    repositories, which aim to provide better
    service by shifting most of the work to the edge
    of the network (closer to the end users).
  • Although such systems scale quite well, if the
    data is changing at a fast rate, the quality of
    service at a repository farther from the data
    source deteriorates.

6
Introduction
  • In general:
  • replication can reduce the load on the sources,
  • but replication of time-varying data introduces
    new challenges:
  • coherency,
  • delays and scalability.

7
Introduction
  • Coherency requirement (cr): users specify a
    bound on the tolerable imprecision associated
    with each requested data item.

[Diagram: the SOURCE holds Microsoft at 60.85 (time
1143); Repository 1 serves 60.89 (time 1136) to
User 1, and Repository 2 serves 60.86 (time 1141)
to User 2]
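A minimal sketch of the coherency check this implies (the function name is illustrative; the values are taken from the diagram above):

```python
def within_coherency(source_value: float, user_value: float, c: float) -> bool:
    """True if the user's view of a data item deviates from the
    source's value by no more than the coherency requirement c."""
    return abs(source_value - user_value) <= c

# User 2 sees 60.86 while the source holds 60.85: a requirement of
# c = 0.05 is satisfied, while a stricter c = 0.005 is violated.
assert within_coherency(60.85, 60.86, 0.05)
assert not within_coherency(60.85, 60.86, 0.005)
```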
8
Introduction
  • A coherency-preserving system:
  • the delivered data must preserve the associated
    coherency requirements,
  • be resilient to failures,
  • and be efficient.
  • Necessary changes are pushed to the users
    instead of each user polling the source
    independently.

9
Introduction
  • Construction of an effective dissemination
    network of repositories
  • A logical overlay network of repositories is
    created according to
  • the coherency needs of the users attached to
    each repository,
  • the expected delays at each repository.
  • This network is called a dynamic data
    dissemination graph (d3g).

10
Introduction
  • Construction of an effective dissemination
    network of repositories
  • The previous algorithm for building a d3g,
    called LeLA, was unable to cope with a large
    number of data items.
  • A new algorithm (DiTA) is introduced to build
    dissemination networks that are scalable and
    resilient.

11
Introduction
  • Construction of an effective dissemination
    network of repositories
  • In DiTA, repositories with more stringent
    coherency requirements are placed closer to the
    source in the network as they are likely to get
    more updates than the ones with looser coherency
    requirements.
  • In DiTA, a dynamic data dissemination tree
    (d3t) is created for each data item x.

12
Introduction
Construction of an effective dissemination
network of repositories
[Diagram: a dissemination tree rooted at the SOURCE.
Repository 1 (c=0.2) and Repository 2 (c=0.3) sit
closest to the source; Repository 3 (c=0.8),
Repository 4 (c=0.7), Repository 6 (c=0.7) and
Repository 5 (c=0.9) sit lower in the tree.]
13
Introduction
  • Provision for the dissemination of dynamic data
    in spite of failures in the overlay network
  • To handle repository and communication-link
    failures, back-up parents are used.
  • A back-up parent is asked to deliver data with
    a coherency that is less stringent than the one
    associated with the parent.

14
Introduction
Provision for the dissemination of dynamic data
in spite of failures in the overlay network
[Diagram: a repository serving items x,y,z,t acts
as Parent for dependents that need subsets of those
items (e.g., y,z,t and x,t), while another
repository (serving a,b,c,x) acts as the Back-up
Parent.]
15
Introduction
  • Efficient filtering and scheduling techniques for
    repositories
  • Normally a repository receives updates and
    selectively disseminates them to its downstream
    dependents.
  • It is not always necessary to disseminate the
    exact values of the most recent updates, as long
    as the values presented preserve the coherency
    of the data.

16
The Basic Framework: Data Coherency and Overlay
Network
17
The Basic Framework: Data Coherency and Overlay
Network
  • A coherency requirement (c) is associated with
    a data item to denote the maximum permissible
    deviation of the user's view from the value of
    data item x at the source.
  • c can be specified in terms of
  • time (values should never be out-of-sync by more
    than 5 sec.), or
  • value (e.g., weather information where the
    temperature value should never be out-of-sync by
    more than 2 degrees).

18
The Basic Framework: Data Coherency and Overlay
Network
Each data item in the repository from which a
user obtains data must be refreshed in such a way
that the user-specified coherency requirement is
maintained, i.e., |P(t) - S(t)| < c, where S(t) is
the value at the source and P(t) the value seen by
the user at time t.
The fidelity f observed by a user can be defined
as the total length of time for which this
inequality holds.
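A hedged sketch of how fidelity could be computed from a logged trace; the uniform sampling interval and the names are assumptions of this sketch, not the paper's code:

```python
def fidelity_percent(trace: list[tuple[float, float]], c: float) -> float:
    """Percentage of samples for which |P(t) - S(t)| < c holds.
    `trace` holds (source_value, user_value) pairs sampled at
    uniform time intervals, so counting samples approximates the
    total length of time the inequality holds."""
    ok = sum(1 for s, p in trace if abs(p - s) < c)
    return 100.0 * ok / len(trace)

# The loss of fidelity used on the later slides is then:
# loss = 100.0 - fidelity_percent(trace, c)
```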
19
The Basic Framework: Data Coherency and Overlay
Network
  • Assume x is served by a single source.
  • Repositories R1,...,Rn are interested in x.
  • These repositories in turn serve a subset of the
    remaining repositories, such that the resulting
    network is in the form of a tree rooted at the
    source and consisting of repositories R1,...,Rn.
  • This defines a parent-dependent relationship.

20
The Basic Framework: Data Coherency and Overlay
Network
  • Since a repository disseminates updates to its
    users and dependents, the coherency requirement
    of a repository should be the most stringent
    requirement that it has to serve.
  • When a data change occurs at the source, the
    source checks which of its direct and indirect
    dependents are interested in the change and
    pushes it to them (see the sketch below).
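A minimal sketch of that push decision (the Dependent class and the push stub are illustrative, not from the paper):

```python
class Dependent:
    def __init__(self, name: str, cr: float):
        self.name = name
        self.cr = cr              # coherency requirement for this item
        self.last_pushed = None   # last value disseminated to it

def push(dep: Dependent, value: float) -> None:
    """Stub for the actual network dissemination."""
    print(f"push {value} to {dep.name} (cr={dep.cr})")

def on_change(new_value: float, dependents: list[Dependent]) -> None:
    """Push the new value only to dependents whose coherency
    requirement the change violates."""
    for d in dependents:
        if d.last_pushed is None or abs(new_value - d.last_pushed) > d.cr:
            d.last_pushed = new_value
            push(d, new_value)
```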

21
Building a d3t
  • Start with a physical layout of the communication
    network in the form of a graph, where the graph
    consists of a set of sources, repositories and
    the underlying network.
  • Try to build a d3t for a data item x.
  • The root of the d3t will be the source, which
    serves x.
  • A repository P serving repository Q with data
    item x is called the parent of Q, and Q is
    called the dependent of P for x.

22
Building a d3t
[Diagram: the source for data item x, with the d3t
for x overlaid on the network of repositories]
23
Building a d3t
  • A repository should ideally serve at least as
    many unique (dependent, data item) pairs as the
    number of data items served to it.
  • If a repository is currently serving fewer than
    this fixed number, we say that the repository
    has the resources to serve a new dependent.

[Table: (dependent, data item) pairs served by
repository R1]
Dependent   Data Item
R7          x
R11         y
R18         x
R9          z
R10         t
R21         x
24
Building a d3t
[Diagram: inserting a repository into the d3t.
Starting at the SOURCE (serving R4, c=0.1), nodes
without free resources answer "NO" and the request
descends through subtrees annotated with their
maximum coherency value (Max(c)=0.8, 0.7, 0.6)
until R5 (c=0.4) answers "YES". Comparing R5's
dependents R7 (c=0.8), R8 (c=0.6), R9 (c=0.7) and
R10 (c=0.3) with the newcomer: since c_R6 > c_R10,
the looser R6 is pushed down below the more
stringent R10.]
25
Building a d3t
[Diagram: the resulting tree, with updated Max(c)
annotations (0.8, 0.7, 0.6, 0.5) on the subtrees
under the SOURCE, R4 (c=0.1) and R5 (c=0.4), and
dependents R7 (c=0.8), R8 (c=0.6), R9 (c=0.7) and
R6 (c=0.5).]
This algorithm is called the Data-Item-at-a-Time
Algorithm (DiTA).
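A hedged sketch of the DiTA insertion idea as these slides describe it; the capacity handling and the descent policy are simplified assumptions, not the paper's exact algorithm:

```python
class Repo:
    def __init__(self, name: str, c: float, capacity: int = 2):
        self.name, self.c = name, c
        self.capacity = capacity      # how many dependents it can serve
        self.children: list["Repo"] = []

    def has_resources(self) -> bool:
        return len(self.children) < self.capacity

def insert(node: Repo, new_repo: Repo) -> None:
    """Insert new_repo into the d3t rooted at `node`.  Smaller c means
    a more stringent coherency requirement, which should end up closer
    to the source."""
    if node.has_resources():
        node.children.append(new_repo)
        return
    loosest = max(node.children, key=lambda ch: ch.c)
    if loosest.c > new_repo.c:
        # The newcomer is more stringent than the loosest child: it
        # takes that child's slot, and the displaced child is pushed down.
        node.children.remove(loosest)
        node.children.append(new_repo)
        insert(new_repo, loosest)
    else:
        # Otherwise descend; the choice of subtree is simplified here.
        insert(loosest, new_repo)
```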
26
Building a d3t
Traces: collection procedure and characteristics
  • Real-world stock price streams from
    http://finance.yahoo.com are used.
  • 10,000 values were polled for each of 1,000
    traces; approximately one new data value is
    obtained per second.

27
Building a d3t
Repositories: data, coherency and cooperation
characteristics
  • A coherency requirement c is associated with
    each of the chosen data items.
  • The c's associated with data in a repository are
    a mix of stringent tolerances (varying from 0.01
    to 0.05) and less stringent tolerances (varying
    from 0.5 to 0.99).
  • T% of the data items have stringent coherency
    requirements at each repository (the remaining
    (100 - T)% of data items have less stringent
    coherency requirements).

28
Building a d3t
Physical Network topology and delays
  • The router topology was generated using BRITE
    (http://www.cs.bu.edu/brite).
  • The repositories and the sources are selected
    randomly.
  • Node-node communication delays were derived from
    a Pareto distribution with density
    p(x) = a * x1^a / x^(a+1),
    where a = x̄ / (x̄ - x1),

29
Building a d3t
Physical Network topology and delays
  • x̄ is the mean and x1 is the minimum delay a
    link can have.
  • In the experiments, x̄ = 15 ms and x1 = 2 ms.
  • The computational delay for dissemination is
    taken to be 12.5 ms.
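A sketch of sampling link delays under these parameters; the inverse-transform step is an assumption of this sketch (the slides only give the distribution and its parameters):

```python
import random

def pareto_delay_ms(mean_ms: float = 15.0, min_ms: float = 2.0) -> float:
    """Sample one node-node link delay from a Pareto distribution with
    minimum x1 = min_ms and shape a = mean / (mean - x1); for a > 1
    this choice of a makes the distribution's mean equal mean_ms."""
    a = mean_ms / (mean_ms - min_ms)
    u = random.random()                      # uniform in [0, 1)
    return min_ms / (1.0 - u) ** (1.0 / a)   # inverse-transform sampling
```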

30
Building a d3t
Metrics
  • The key metric is the loss in fidelity of the
    data.
  • Fidelity is the total length of time for which
    the inequality |P(t) - S(t)| < c holds.
  • The fidelity of a repository is the mean
    fidelity over all data items stored in that
    repository.
  • The fidelity of the system is the mean fidelity
    of all repositories.
  • The loss of fidelity is then (100 - fidelity).
  • Another metric is the number of messages in the
    system (the system load).

31
Building a d3t
Performance Evaluation
  • For the base performance measurement, 600
    routers, 100 repositories and 4 servers were
    used.
  • Total number of data items served by servers was
    varied from 100 to 1000.
  • The T parameter was varied from 20% to 80%.
  • A previous algorithm, LeLA was used as a
    benchmark.

32
Building a d3t
Performance Evaluation
  • Each node in DiTA does less work than in LeLA.
  • Thus, in DiTA the height of the dissemination
    tree will be greater.
  • So, when computational delays are low but link
    delays are large, LeLA may perform better.
  • But this happens only for negligible
    computational delays (0.5 ms) and very high link
    delays (110 ms).

33
Enhancing the Resiliency of the Repository
Network
  • Active backups vs. passive backups:
  • passive backups may increase the load, which
    causes loss in fidelity,
  • so active backup parents are used.
  • A backup parent serves data to a dependent Q
    with a coherency cB > c.

34
Enhancing the Resiliency of the Repository
Network
  • If all changes are smaller than cB, the
    dependent cannot tell when its parent P fails.
    So P should send periodic "I'm alive" messages.
  • Once P fails, Q requests B to serve it the data
    at coherency c. When P recovers from the
    failure, Q requests B to serve the data item at
    cB again.
  • In this approach there is no backup for backups,
    so when both P and B fail, Q cannot get any
    updates.
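A sketch of this failover protocol from dependent Q's point of view; the heartbeat timeout, the method names and the message plumbing are assumptions of this sketch:

```python
import time

class FailoverDependent:
    """Dependent Q, served by parent P at coherency c and by an active
    backup B at the looser coherency cB = k * c."""

    def __init__(self, c: float, k: float = 2.0, alive_timeout_s: float = 30.0):
        self.c, self.cB = c, k * c
        self.alive_timeout_s = alive_timeout_s
        self.last_heard = time.monotonic()
        self.parent_alive = True

    def on_parent_message(self) -> None:
        """Called on every update or "I'm alive" heartbeat from P."""
        self.last_heard = time.monotonic()
        if not self.parent_alive:           # P has recovered
            self.parent_alive = True
            self.ask_backup(self.cB)        # relax B back to cB

    def check_parent(self) -> None:
        """Called periodically to detect a silent parent."""
        if time.monotonic() - self.last_heard > self.alive_timeout_s:
            if self.parent_alive:           # P has (apparently) failed
                self.parent_alive = False
                self.ask_backup(self.c)     # ask B to serve at the full c

    def ask_backup(self, coherency: float) -> None:
        print(f"request: backup should serve at coherency {coherency}")
```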

35
Enhancing the Resiliency of the Repository
Network
Choice of cB: using a probabilistic model
  • For the sake of simplicity, cB = k * c.
  • Here, the choice of k is important.
36
Enhancing the Resiliency of the Repository
Network
Choice of cB: using a probabilistic model
  • Assuming that the data values change with
    uniform probability, and
  • using a Markov chain model:
  • Misses = 2k^2 - 2
  • 2k^2 - 2 is the number of updates a dependent
    will miss before it detects that there is a
    failure.
  • According to the experiments, this number is
    rather pessimistic, nearly an upper limit.
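For example, with k = 2 (the value used in the experiments later), a
dependent may miss up to 2 * 2^2 - 2 = 6 updates before it detects the
failure.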

37
Enhancing the Resiliency of the Repository
Network
Choice of backup parents
[Diagram: dependent Q's parent is P, whose own
parent is R. If P has siblings ("YES" branch, here
B and C), one of them is chosen randomly as Q's
backup parent; the "NO" branch handles the case
where no sibling exists.]
38
Enhancing the Resiliency of the Repository
Network
Choice of backup parents
  • In case the coherency at which Q wants x from B
    is tighter than the coherency at which B wants
    x,
  • the parent of B is asked to serve x to Q with
    the required tighter coherency.
  • An advantage of choosing a sibling is that the
    change in coherency requirement is not percolated
    all the way up to the source.
  • However, if an ancestor of P and B is heavily
    loaded, then the delay due to the load will be
    reflected in the updates from both P and B.
    This might result in additional loss in
    fidelity.

39
Enchancing the Resiliency of the Repository
Network
Effect of Repository failures on Loss of Fidelity
  • Because the kinds of failures are memory-less, an
    exponential probability distribution is used for
    simulating them.
  • Pr (X gt t) e-?t
  • ? ?1 ? time to failure
  • ? ?2 ? time to recover
  • In this approach link failures are not taken into
    account. So the model is incomplete...

?2
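A sketch of simulating one repository's failure/recovery process under this memoryless model; the function and variable names are illustrative, and the default λ values are taken from the later slides:

```python
import random

def failure_schedule(lambda_fail: float = 0.0001,
                     lambda_recover: float = 0.01,
                     horizon_s: float = 100_000.0) -> list[tuple[float, float]]:
    """Return (fail_time, recover_time) pairs for one repository.
    Times to failure and to recovery are exponentially distributed,
    Pr(X > t) = exp(-λt), which random.expovariate(λ) samples."""
    t, events = 0.0, []
    while True:
        t += random.expovariate(lambda_fail)       # up-time until failure
        if t >= horizon_s:
            return events
        down = random.expovariate(lambda_recover)  # down-time until recovery
        events.append((t, t + down))
        t += down
```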
40
Enhancing the Resiliency of the Repository
Network
Performance evaluation
  • The effect of adding resiliency is shown.
  • k = 2 is used.
  • When 100 data items are used, 23% of the updates
    sent by backups are disseminated.
  • Some updates sent by backups arrived before
    those from the parents.

41
Enhancing the Resiliency of the Repository
Network
Performance evaluation
  • But when the backup parents are loaded (> 400
    data items), their updates are of no use and
    increase the loss of fidelity.
  • The dependent should filter them out using the
    time-stamps on the updates.

42
Enhancing the Resiliency of the Repository
Network
Performance evaluation
  • During the experiment, about 80-90% of the
    repositories experienced at least one failure,
  • and the maximum number of failures in the system
    at any given time for λ2 = 0.001 was around 12.
  • For λ2 = 0.01, the maximum number of failures
    was 5, and for λ2 = 0.1 the maximum was 2.

43
Enhancing the Resiliency of the Repository
Network
Performance evaluation
  • The effect of quick recovery is shown.
  • λ1 = 0.0001 and λ2 = 2.
  • For stringent coherency requirements, resiliency
    improves fidelity even for transient failures.

44
Enhancing the Resiliency of the Repository
Network
Performance evaluation
  • However, with resiliency and a very large number
    of data items (e.g., 1000), fidelity drops.
  • This is because, at this point, the cost of
    resiliency exceeds the benefit obtained from it,
    and hence the loss in fidelity increases.

45
Reducing the Delay at a Repository
  • Delays:
  • Queueing delay: the time between the arrival of
    an update and the time its processing starts.
  • Processing delay: the check delay (deciding if
    the update should be processed) plus the
    computation delay (computing the update and
    pushing data to the dependents).
46
Reducing the Delay at a Repository
  • Question: how can we reduce the average delays
    to improve fidelity?
  • This can be done by
  • better filtering, i.e., reducing the processing
    delay in determining whether an update needs to
    be disseminated to one or more dependents,
  • better scheduling of disseminations.

47
Reducing the Delay at a Repository
  • Better Filtering

For each dependent, the repository maintains its
coherency requirement (cr) and the last value
pushed to it:
  upper bound = last pushed value + cr
  lower bound = last pushed value - cr
The cr values of the dependents are kept sorted at
the repository (example: C1 = 0.7, C2 = 0.6,
C3 = 0.5, C4 = 0.3, C5 = 0.1, C6 = 0.05).
To find the dependents to which an update must be
disseminated, scan the sorted cr values for the
first (largest) cr whose window the update
violates; that dependent and the ones with tighter
cr need the update.
For every window a nesting rule must hold between
successive windows; if an update would violate
this rule, a pseudo-value is generated and pushed
in place of the actual value.
This technique is called dependent ordering (see
the sketch below).
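A hedged sketch of the dependent-ordering scan; maintaining the nesting invariant with pseudo-values is omitted, and the names and example values follow the slide:

```python
class Dep:
    def __init__(self, name: str, cr: float, last_pushed: float = 0.0):
        self.name, self.cr, self.last_pushed = name, cr, last_pushed

def dependents_to_update(deps_by_cr_desc: list[Dep], value: float) -> list[Dep]:
    """`deps_by_cr_desc` is sorted by decreasing cr (loosest first).
    Assuming each tighter window is nested inside the looser ones, a
    value outside some dependent's window is outside every tighter
    window too, so one scan finds the split point."""
    for i, d in enumerate(deps_by_cr_desc):
        if abs(value - d.last_pushed) > d.cr:
            return deps_by_cr_desc[i:]   # this dependent and all tighter ones
    return []                            # no dependent needs this update

deps = [Dep("C1", 0.7), Dep("C2", 0.6), Dep("C3", 0.5),
        Dep("C4", 0.3), Dep("C5", 0.1), Dep("C6", 0.05)]
# A change of 0.4 from the last pushed values violates C4, C5 and C6:
print([d.name for d in dependents_to_update(deps, 0.4)])
```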
48
Reducing the Delay at a Repository
  • Better Filtering
  • Better filtering provides:
  • updates of dynamic data are sent only to the end
    users who are actually interested in them;
  • no garbage data flows on the network (no
    flooding of data over the network), which
    improves communication times and gives better
    response times;
  • a more scalable system that resists unexpected
    heavy loads.

49
Reducing the Delay at a Repository
  • Better scheduling of disseminations

[Diagram: two queued updates u1 and u2, each with a
cost C(u) (the delay of processing and
disseminating the update) and a benefit b(u) (the
dependents that receive it).]
  • Approach:
  • instead of processing update requests in
    standard queue order, prioritizing them by the
    SCORE b(u)/C(u) performs better.
  • Each update request is scheduled according to
    this score. b(u) is the number of dependents
    that will receive the update; C(u) is the cost
    of dissemination to all dependents. b(u) values
    are stored at each repository, so they are
    precomputed automatically.
  • Advantages:
  • update requests that are important to many
    dependents are processed earlier → BUSINESS
    IMPORTANCE;
  • updates with a low score get delayed, and if a
    new update arrives the older ones are dropped,
    which improves performance especially in heavily
    loaded environments → SCALABILITY (see the
    sketch below).
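A minimal sketch of this scoring scheme; the Update type and the drop-on-supersede policy are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Update:
    item: str        # which data item this update refers to
    benefit: int     # b(u): number of dependents that will receive it
    cost: float      # C(u): estimated dissemination delay

def enqueue(pending: list[Update], u: Update) -> None:
    """Drop older pending updates for the same item, then queue u."""
    pending[:] = [p for p in pending if p.item != u.item]
    pending.append(u)

def next_update(pending: list[Update]) -> Update:
    """Pick and remove the pending update with the best b(u)/C(u)."""
    best = max(pending, key=lambda p: p.benefit / p.cost)
    pending.remove(best)
    return best
```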

50
Reducing the Delay at a Repository
Better scheduling of disseminations
  • Scheduling provides:
  • a priority scheme and a business-importance
    approach that achieve better results;
  • like filtering, it improves scalability: some
    out-of-date update requests are discarded from
    the queue, which saves unnecessary computation
    and queueing delays.

51
Reducing the Delay at a Repository
  • Experimental Results
  • Dependent ordering has a lower loss of fidelity
    than the simple algorithm. Scheduling does
    better still (by up to 15%).
  • Dependent ordering also needs fewer pushes than
    the simple algorithm.
  • The scheduling algorithm decreases computation
    delays because some updates are dropped from the
    queue when new updates arrive and the older ones
    become out of date.
  • The fidelity loss with scheduling is quantified:
    fidelity drops as the number of data items
    increases, but even with large increases in the
    number of data items and high update rates, the
    loss of fidelity stays within about 10%.
  • This provides better scalability.

52
Reducing the Delay at a Repository
  • Advantages of the better-performance approaches
  • Approach 1: maintaining the dependents ordered
    by cr values
  • reduces the number of checks required for
    processing each update,
  • reduces the number of pushes.
  • Approach 2: scheduling
  • reduces the overall delay to the end clients by
    processing updates that provide a higher benefit
    at a lower cost,
  • gives a better choice in dropping updates, as
    low-score updates are dropped,
  • due to the lower propagation delay, it provides
    better scalability and degrades gracefully under
    unexpected heavy loads.

53
Related Work
  • The simple decision procedure here is an
    advantage: many competing approaches use complex
    algorithms and database systems that take much
    computation time to keep a data repository up to
    date.
  • Some dynamic web-data dissemination algorithms
    also use a push-based scheme. Using coherency,
    however, improves scalability, and another
    important feature is that data repositories
    don't need to cooperate with each other to
    maintain coherence information (it is already up
    to date!).
  • This approach deals with rapidly changing
    dynamic data, while some similar approaches
    focus on web content that changes at slower
    time-scales.
  • The strongest side of this approach is that it
    deals with the problem of failure and forms a
    resilient dissemination network.

54
Conclusion
  • The key points in this architecture are:
  • the design of a push-based dissemination scheme
    for time-varying data: not all updates are
    disseminated to each repository, only the
    updates that meet the coherency requirements are
    pushed → EFFICIENT;
  • the design of a cooperative dissemination
    network: this provides a resilient network, and
    even if a failure occurs in the network, data
    coherency is not completely lost → RESILIENT;
  • intelligent filtering, scheduling and selective
    dissemination reduce the overhead in the
    network, provide better scalability, and make
    this a good alternative for dynamic data
    publishing → SCALABLE.