Title: QuickSilver: Middleware for Scalable Self-Regenerative Systems
1. QuickSilver: Middleware for Scalable Self-Regenerative Systems
- Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels
- Raytheon Corporation: Lou DiPalma, Paul Work
2. Our topic
- Computing systems are growing larger and more complex, and we are hoping to use them in a more and more unattended manner
- But the technology for managing growth and complexity is lagging
3. Our goal
- Build a new platform in support of massively scalable, self-regenerative applications
- Demonstrate it by offering a specific military application interface
- Work with Raytheon to apply it in other military settings
4. Representative scenarios
- Massive data centers maintained by the military (or by companies like Amazon)
- Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)
- Deployments of large numbers of lightweight sensors
- New network architectures to control autonomous vehicles over media shared with other mundane applications
5. How to approach the problem?
- The Web Services architecture has emerged as a likely standard for large systems
- But WS is document oriented and lacks:
  - High availability (or any kind of quick-response guarantees)
  - A convincing scalability story
  - Self-monitoring/adaptation features
6. Signs of trouble?
- Most technologies are way beyond their normal scalability limits in this kind of center: we are good at small clusters, but not huge ones
- Pub-sub was a big hit. No longer
- Curious side-bar: pub-sub is used heavily for point-to-point communication! (Why?)
- Extremely hard to diagnose problems
7. We lack the right tools!
- Today, our applications navigate in the dark
- They lack a way to find things
- They lack a way to sense system state
- There are no rules for adaptation, if/when needed
- In effect: we are starting to build very big systems, yet doing so in the usual client-server manner
- This denies applications any information about system state, configuration, loads, etc.
8. QuickSilver
- QuickSilver: a platform to help developers build these massive new systems
- It has four major components:
  - Astrolabe: a novel kind of virtual database
  - Bimodal Multicast: for faster few-to-many data transfer patterns
  - Kelips: a fast lookup mechanism
  - Group replication technologies based on virtual synchrony or other similar models
9. QuickSilver Architecture
[Architecture diagram, layered: Pub-sub (JMS, JBI) and Native API / Distributed query, event detection / Massively Scalable Group Communication / Composable Microprotocol Stacks / Monitoring, Indexing / Message Repository / Overlay Networks]
10. Astrolabe's role is to collect and report system state, which is used for many purposes including self-configuration and repair.
11. What does Astrolabe do?
- Astrolabe's role is to track information residing at a vast number of sources
- Structured to look like a database
- Approach: peer-to-peer gossip. Basically, each machine has a piece of a jigsaw puzzle; assemble it on the fly.
12. Astrolabe in a single domain
[Figure: a table with one row per machine and a numeric column of sample load values (1.9, 2.1, 0.8, ...)]
- A row can have many columns
- Total size should be kilobytes, not megabytes
- A configuration certificate determines what data is pulled into the table (and it can change); a sketch follows
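As a toy illustration of that idea (the column names, probe strings, and helper below are hypothetical, not Astrolabe's actual certificate format), a configuration certificate can be pictured as a small document naming the columns a machine contributes and how each value is obtained:

```python
# Hypothetical sketch of a configuration certificate: it names the columns each
# machine pulls into its own row and how each value is obtained.  Astrolabe's
# real certificate format and probes differ; this only illustrates the idea.
config_certificate = {
    "version": 17,                                  # certificates can be replaced at run time
    "columns": {
        "load":     "first field of /proc/loadavg",
        "free_mem": "MemFree from /proc/meminfo",
        "smtp_up":  "probe localhost:25",
    },
}

def refresh_own_row(certificate, probe):
    """Re-evaluate every configured column; the whole row stays small (kilobytes)."""
    return {name: spec_result for name, spec_result in
            ((name, probe(spec)) for name, spec in certificate["columns"].items())}
```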
13. So how does it work?
- Each computer has:
  - Its own row
  - Replicas of some objects (the configuration certificate, other rows, etc.)
- Periodically, at a fixed rate, each machine picks a friend pseudo-randomly and exchanges states efficiently (the amount of data exchanged is bounded), as sketched below
- States converge exponentially rapidly
- Loads are low and constant, and the protocol is robust against all sorts of disruptions!
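A minimal sketch of that periodic exchange, assuming each row carries a timestamp and the fresher copy wins (class and field names are illustrative; real Astrolabe also exchanges certificates and bounds the bytes per exchange):

```python
import random
import time

class AstrolabeNode:
    """Illustrative node in one Astrolabe domain (50-100 rows)."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = peers                                  # other nodes in this domain
        self.table = {name: {"time": time.time(), "load": 0.0}}

    def gossip_round(self):
        friend = random.choice(self.peers)                  # pseudo-random partner, fixed rate
        self.merge(friend.table)                            # pull the friend's view...
        friend.merge(self.table)                            # ...and push ours back

    def merge(self, remote_table):
        for key, row in remote_table.items():
            mine = self.table.get(key)
            if mine is None or row["time"] > mine["time"]:
                self.table[key] = dict(row)                 # keep whichever copy is fresher
```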
14-16. State Merge: the core of the Astrolabe epidemic
[Figure sequence: swift.cs.cornell.edu and cardinal.cs.cornell.edu exchange and merge their tables]
17. Observations
- The merge protocol has constant cost
  - One message sent and received (on average) per unit time
- The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so
- Information spreads in O(log N) time
  - But this assumes bounded region size
  - In Astrolabe, we limit regions to 50-100 rows
18. Scaling up and up
- With a stack of domains, we don't want every system to see every domain; the cost would be huge
- So instead, we'll see a summary
[Figure: cardinal.cs.cornell.edu and its summarized view of remote domains]
19. Build a hierarchy using a P2P protocol that assembles the puzzle without any servers
- A dynamically changing query output is visible system-wide
- An SQL query summarizes the data
[Figure: a two-level hierarchy over New Jersey and San Francisco domains]
20. (1) The query goes out, (2) each domain computes locally, (3) results flow to the top level of the hierarchy
[Figure: the three steps shown over the New Jersey and San Francisco domains]
21-22. The hierarchy is virtual; the data is replicated
[Figure: the New Jersey and San Francisco domains each hold the replicated aggregate data locally]
23. The key to self-* properties!
- A flexible, reprogrammable mechanism:
  - Which clustered services are experiencing timeouts, and what were they waiting for when they happened?
  - Find 12 idle machines with the NMR-3D package that can download a 20MB dataset rapidly
  - Which machines have inventory for warehouse 9?
  - Where's the cheapest gasoline in the area?
- Think of aggregation functions as small agents that look for information (see the sketch below)
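As an illustration of an aggregation function acting as a small agent (the column names, package name usage, and thresholds below are made up for the example, not a real Astrolabe query), one function summarizes the machines of a leaf domain and a companion function combines those summaries at higher levels:

```python
def leaf_summary(machine_rows, package="NMR-3D", load_limit=0.5):
    """Summarize one domain's machine rows into a single parent-level row."""
    idle = [r for r in machine_rows
            if package in r.get("packages", ()) and r.get("load", 1.0) < load_limit]
    return {
        "idle_count": len(idle),                             # candidates in this domain
        "min_load": min((r["load"] for r in idle), default=None),
    }

def combine_summaries(child_summaries):
    """Combine per-domain summaries one level up; the root sees the global answer."""
    loads = [s["min_load"] for s in child_summaries if s["min_load"] is not None]
    return {
        "idle_count": sum(s["idle_count"] for s in child_summaries),
        "min_load": min(loads, default=None),
    }
```

Because every level keeps the same small summary shape, one tree of aggregates can serve many related questions at once.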
24. What about security?
- Astrolabe requires:
  - Read permissions to see the database
  - Write permissions to contribute data
  - Administrative permission to change aggregation or configuration certificates
- Users decide what data Astrolabe can see
- A VPN setup can be used to hide Astrolabe's internal messages from intruders
- New! Byzantine Agreement based on threshold cryptography is used to secure aggregation functions
25. Data Mining
- Quite a hot area, usually done by collecting information to a centralized node and then querying within that node
- Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
- This is incredibly parallel, hence faster
- And more robust against disruption, too!
26. Cool Astrolabe Properties
- Parallel. Everyone does a tiny bit of work, so we accomplish huge tasks in seconds
- Flexible. Decentralized query evaluation, in seconds
- One aggregate can answer lots of questions. E.g. "where's the nearest supply shed?": the hierarchy encodes many answers in one tree!
27. Aggregation and Hierarchy
- Nearby information:
  - Maintained in more detail; can query it directly
  - Changes are seen sooner
- Remote information is summarized:
  - High-quality aggregated data
  - This also changes as the information evolves
28. Astrolabe summary
- Scalable: could support millions of machines
- Flexible: can easily extend the domain hierarchy, define new columns, or eliminate old ones. Adapts as conditions evolve.
- Secure:
  - Uses keys for authentication and can even encrypt
  - Handles firewalls gracefully, including issues of IP address re-use behind firewalls
- Performs well: updates propagate in seconds
- Cheap to run: tiny load, small memory impact
29. Bimodal Multicast
- A quick glimpse of scalable multicast
- Think about really large Internet configurations:
  - A data center as the data source
  - A typical publication might be going to thousands of client systems
30. The Swiss Stock Exchange problem: virtually synchronous multicast is fragile
[Figure: a process group in which most members are healthy]
31. Performance degrades as the system scales up
[Figure: virtually synchronous Ensemble multicast protocols; average throughput at non-perturbed members (0-250) vs. perturb rate (0-0.9), for group sizes 32, 64, and 96]
32. Why doesn't multicast scale?
- With weak semantics:
  - Faulty behavior may occur more often as system size increases (think of the Internet)
- With stronger reliability semantics:
  - We encounter a system-wide cost (e.g. membership reconfiguration, congestion control)
  - That cost can be triggered more often as a function of scale (more failures, more network events, or bigger latencies)
- A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)
33. But none of this is inevitable
- Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
- It also gives very steady throughput
- And it can take advantage of hardware support for multicast, if available
34. Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
35. Periodically (e.g. every 100 ms), each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
36. The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
37. Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time. One round of this repair is sketched below.
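Pulling slides 34-37 together, here is a minimal sketch of one repair round (the class and function names are illustrative; in the real protocol digests and solicitations are network messages, and the round is much longer than an RPC):

```python
import random

class Process:
    """Illustrative pbcast participant: it remembers the messages it has delivered."""
    def __init__(self):
        self.delivered = {}                         # (sender, seqno) -> message body

def gossip_round(sender, group):
    """One round: send a digest to a random member, which solicits whatever it lacks."""
    peer = random.choice([p for p in group if p is not sender])
    digest = set(sender.delivered)                  # identifies messages, doesn't include them

    missing = digest - set(peer.delivered)          # peer checks digest against its own history
    for msg_id in missing:                          # ...and solicits copies of missing messages
        peer.delivered[msg_id] = sender.delivered[msg_id]   # solicited retransmission
```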
38. This solves our problem!
[Figure: low-bandwidth and high-bandwidth comparisons of pbcast vs. traditional multicast throughput at perturbed and unperturbed hosts, as a function of perturb rate. Bimodal Multicast rides out disturbances!]
39. Bimodal Multicast Summary
- An extremely scalable technology
- Remains steady and reliable:
  - Even with high rates of message loss (in our tests, as high as 20%)
  - Even with large numbers of perturbed processes (we tested with up to 25%)
  - Even with router failures
  - Even when IP multicast fails
- And we've secured it using digital signatures
40. Kelips
- The third in our set of tools
- A P2P index:
  - put(name, value)
  - get(name)
- Kelips can do lookups with one RPC and is self-stabilizing after disruption
- Unlike Astrolabe, nodes can put varying amounts of data out there
41. Kelips
Take a collection of nodes.
[Figure: nodes 110, 230, 202, and 30]
42. Kelips
Map nodes to affinity groups.
[Figure: affinity groups 0, 1, and 2, with peer membership assigned through a consistent hash; nodes 110, 230, 202, and 30 are placed into their groups]
43. Kelips
Node 110 knows about the other members of its affinity group: 230 and 30.
[Figure: 110's affinity group view, with pointers to members 230 and 30]
44. Kelips
Node 202 is a contact for node 110 in group 2.
[Figure: 110's contact pointers into the other affinity groups, including 202 in group 2]
45. Kelips
"dot.com" maps to group 2, so node 110 tells group 2 to route inquiries about "dot.com" to it.
[Figure: the resource tuple for "dot.com" stored in affinity group 2; a gossip protocol replicates the data cheaply]
46. Kelips
To look up "dot.com", just ask some contact in group 2; it returns 110 (or forwards your request).
[Figure: a lookup routed through a contact in group 2]
47. Kelips summary
- Split the system into √N subgroups
- Map (key, value) pairs to some subgroup by hashing the key
- Replicate within that subgroup
- Each node tracks:
  - Its own group membership
  - k members of each of the other groups
- To look up a key, hash it and ask one or more of your contacts whether they know the value (a sketch follows)
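A minimal sketch of that placement and lookup path (the hash choice, contact selection, and replication-by-gossip are all simplified, and the class and method names are illustrative):

```python
import hashlib
import math
import random

def affinity_group(key, n_groups):
    """Hash a key (or node id) onto one of the ~sqrt(N) affinity groups."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n_groups

class KelipsNode:
    def __init__(self, node_id, n_nodes):
        self.n_groups = max(1, round(math.sqrt(n_nodes)))
        self.group = affinity_group(node_id, self.n_groups)
        self.tuples = {}        # (key -> value) pairs replicated within this node's group
        self.contacts = {}      # group number -> a few known members of that group

    def put(self, key, value):
        g = affinity_group(key, self.n_groups)
        if g == self.group:
            self.tuples[key] = value                         # replicated by gossip in reality
        else:
            random.choice(self.contacts[g]).put(key, value)  # hand off to the right group

    def get(self, key):
        g = affinity_group(key, self.n_groups)
        if g == self.group:
            return self.tuples.get(key)
        return random.choice(self.contacts[g]).get(key)      # one RPC to a contact
```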
48. Kelips summary
- O(√N) storage overhead, which is higher than for other DHTs
- The same space overhead covers the member list, the contact list, and the replicated data itself
- A heuristic is used to keep contacts fresh and to avoid contacts that seem to churn
- This buys us O(1) lookup cost
- And the background overhead is constant
49. Virtual Synchrony
- The last piece of the puzzle
- The outcome of a decade of DARPA-funded work, and the technology core of:
  - The AEGIS integrated console
  - The New York and Swiss Stock Exchanges
  - The French Air Traffic Control System
  - The Florida Electric Power and Light System
50. Virtual Synchrony Model
51. Roles in QuickSilver?
- Provides a way for groups of components to:
  - Replicate data and synchronize
  - Perform tasks in parallel (like parallel database lookups, for improved speed)
  - Detect failures and reconfigure to compensate by regenerating lost functionality (a sketch of the group abstraction follows)
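As a rough sketch of the group abstraction this gives QuickSilver components (the class and method names are illustrative, not the API of any particular toolkit): every member sees the same sequence of membership views, and messages are delivered in the view in which they were sent.

```python
class ProcessGroup:
    """Illustrative virtually synchronous group: one agreed-on view at a time."""

    def __init__(self, members):
        self.view = list(members)                       # current membership, seen by all

    def multicast(self, msg):
        for m in self.view:
            m.deliver(msg, view=tuple(self.view))       # delivered in the same view everywhere

    def member_failed(self, failed):
        self.view = [m for m in self.view if m is not failed]
        for m in self.view:
            m.on_view_change(tuple(self.view))          # survivors reconfigure and can
                                                        # regenerate the lost functionality
```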
52. Replication: the key to understanding QuickSilver
[Diagram relating the four components: Astrolabe, Bimodal Multicast, Kelips, Virtual Synchrony]
53. Metrics
- We plan to look at several:
  - Robustness to externally imposed stress and overload: we expect to demonstrate significant improvements
  - Scalability: graph performance and overheads as a function of scale, load, etc.
  - End-user power: implement JBI, sensor networks, and a data-center management platform
  - Total cost: with Raytheon, explore the impact on real military applications
- Under DURIP funding we have acquired a clustered evaluation platform.
54. Our plan
- Integrate these core components
- Then:
  - Build a JBI layer over the system
  - Integrate Johannes Gehrke's data mining technology into the platform
  - Support scalable overlay multicast (Francis)
- Raytheon: teaming with us to tackle military applications, notably for the Navy
55. More information?
- www.cs.cornell.edu/Info/Projects/QuickSilver