Title: QuickSilver: Middleware for Scalable Self-Regenerative Systems
1. QuickSilver: Middleware for Scalable Self-Regenerative Systems
- Cornell University: Ken Birman, Johannes Gehrke, Paul Francis, Robbert van Renesse, Werner Vogels
- Raytheon Corporation: Lou DiPalma, Paul Work
2. Our topic
- Computing systems are growing larger and more complex, and we are hoping to use them in a more and more unattended manner
- But the technology for managing growth and complexity is lagging
3. Our goal
- Build a new platform in support of massively scalable, self-regenerative applications
- Demonstrate it by offering a specific military application interface
- Work with Raytheon to apply it in other military settings
4. Representative scenarios
- Massive data centers maintained by the military (or by companies like Amazon)
- Enormous publish-subscribe information bus systems (broadly, OSD calls these GIG and NCES systems)
- Deployments of large numbers of lightweight sensors
- New network architectures to control autonomous vehicles over media shared with other mundane applications
5. How to approach the problem?
- The Web Services architecture has emerged as a likely standard for large systems
- But WS is document oriented and lacks:
  - High availability (or any kind of quick-response guarantees)
  - A convincing scalability story
  - Self-monitoring/adaptation features
6. Signs of trouble?
- Most technologies are way beyond their normal scalability limits in this kind of center: we are good at small clusters, but not huge ones
- Pub-sub was a big hit. No longer
- Curious side-bar: pub-sub is used heavily for point-to-point communication! (Why?)
- Extremely hard to diagnose problems
7. We lack the right tools!
- Today, our applications navigate in the dark
- They lack a way to find things
- They lack a way to sense system state
- There are no rules for adaptation, if/when needed
- In effect: we are starting to build very big systems, yet doing so in the usual client-server manner
- This denies applications any information about system state, configuration, loads, etc.
8. QuickSilver
- QuickSilver: a platform to help developers build these massive new systems
- It has four major components:
  - Astrolabe: a novel kind of virtual database
  - Bimodal Multicast: for faster few-to-many data transfer patterns
  - Kelips: a fast lookup mechanism
  - Group replication technologies based on virtual synchrony or other similar models
9. QuickSilver Architecture
[Architecture diagram, layered: Pub-sub (JMS, JBI) and Native API / Distributed query, event detection / Massively Scalable Group Communication / Composable Microprotocol Stacks / Monitoring, Indexing / Message Repository / Overlay Networks]
10. Astrolabe's role is to collect and report system state, which is used for many purposes including self-configuration and repair.
11. What does Astrolabe do?
- Astrolabe's role is to track information residing at a vast number of sources
- Structured to look like a database
- Approach: peer-to-peer gossip. Basically, each machine has a piece of a jigsaw puzzle; assemble it on the fly.
12. Astrolabe in a single domain
[Figure: a table with one row per machine and a numeric column of sample load values (1.9, 2.1, 0.8, ...)]
- A row can have many columns
- Total size should be kilobytes, not megabytes
- A configuration certificate determines what data is pulled into the table (and it can change); a sketch follows
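As a toy illustration of that idea (the column names, probe strings, and helper below are hypothetical, not Astrolabe's actual certificate format), a configuration certificate can be pictured as a small document naming the columns a machine contributes and how each value is obtained:

```python
# Hypothetical sketch of a configuration certificate: it names the columns each
# machine pulls into its own row and how each value is obtained.  Astrolabe's
# real certificate format and probes differ; this only illustrates the idea.
config_certificate = {
    "version": 17,                                  # certificates can be replaced at run time
    "columns": {
        "load":     "first field of /proc/loadavg",
        "free_mem": "MemFree from /proc/meminfo",
        "smtp_up":  "probe localhost:25",
    },
}

def refresh_own_row(certificate, probe):
    """Re-evaluate every configured column; the whole row stays small (kilobytes)."""
    return {name: spec_result for name, spec_result in
            ((name, probe(spec)) for name, spec in certificate["columns"].items())}
```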
13. So how does it work?
- Each computer has:
  - Its own row
  - Replicas of some objects (the configuration certificate, other rows, etc.)
- Periodically, at a fixed rate, each machine picks a friend pseudo-randomly and exchanges states efficiently (the amount of data exchanged is bounded), as sketched below
- States converge exponentially rapidly
- Loads are low and constant, and the protocol is robust against all sorts of disruptions!
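A minimal sketch of that periodic exchange, assuming each row carries a timestamp and the fresher copy wins (class and field names are illustrative; real Astrolabe also exchanges certificates and bounds the bytes per exchange):

```python
import random
import time

class AstrolabeNode:
    """Illustrative node in one Astrolabe domain (50-100 rows)."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = peers                                  # other nodes in this domain
        self.table = {name: {"time": time.time(), "load": 0.0}}

    def gossip_round(self):
        friend = random.choice(self.peers)                  # pseudo-random partner, fixed rate
        self.merge(friend.table)                            # pull the friend's view...
        friend.merge(self.table)                            # ...and push ours back

    def merge(self, remote_table):
        for key, row in remote_table.items():
            mine = self.table.get(key)
            if mine is None or row["time"] > mine["time"]:
                self.table[key] = dict(row)                 # keep whichever copy is fresher
```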
14-16. State Merge: the core of the Astrolabe epidemic
[Figure sequence: swift.cs.cornell.edu and cardinal.cs.cornell.edu exchange and merge their tables]
17. Observations
- The merge protocol has constant cost
  - One message sent and received (on average) per unit time
- The data changes slowly, so there is no need to run it quickly; we usually run it every five seconds or so
- Information spreads in O(log N) time
  - But this assumes bounded region size
  - In Astrolabe, we limit regions to 50-100 rows
18. Scaling up and up
- With a stack of domains, we don't want every system to see every domain; the cost would be huge
- So instead, we'll see a summary
[Figure: cardinal.cs.cornell.edu and its summarized view of remote domains]
19. Build a hierarchy using a P2P protocol that assembles the puzzle without any servers
- A dynamically changing query output is visible system-wide
- An SQL query summarizes the data
[Figure: a two-level hierarchy over New Jersey and San Francisco domains]
20. (1) The query goes out, (2) each domain computes locally, (3) results flow to the top level of the hierarchy
[Figure: the three steps shown over the New Jersey and San Francisco domains]
21-22. The hierarchy is virtual; the data is replicated
[Figure: the New Jersey and San Francisco domains each hold the replicated aggregate data locally]
23. The key to self-* properties!
- A flexible, reprogrammable mechanism:
  - Which clustered services are experiencing timeouts, and what were they waiting for when they happened?
  - Find 12 idle machines with the NMR-3D package that can download a 20MB dataset rapidly
  - Which machines have inventory for warehouse 9?
  - Where's the cheapest gasoline in the area?
- Think of aggregation functions as small agents that look for information (see the sketch below)
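As an illustration of an aggregation function acting as a small agent (the column names, package name usage, and thresholds below are made up for the example, not a real Astrolabe query), one function summarizes the machines of a leaf domain and a companion function combines those summaries at higher levels:

```python
def leaf_summary(machine_rows, package="NMR-3D", load_limit=0.5):
    """Summarize one domain's machine rows into a single parent-level row."""
    idle = [r for r in machine_rows
            if package in r.get("packages", ()) and r.get("load", 1.0) < load_limit]
    return {
        "idle_count": len(idle),                             # candidates in this domain
        "min_load": min((r["load"] for r in idle), default=None),
    }

def combine_summaries(child_summaries):
    """Combine per-domain summaries one level up; the root sees the global answer."""
    loads = [s["min_load"] for s in child_summaries if s["min_load"] is not None]
    return {
        "idle_count": sum(s["idle_count"] for s in child_summaries),
        "min_load": min(loads, default=None),
    }
```

Because every level keeps the same small summary shape, one tree of aggregates can serve many related questions at once.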
24. What about security?
- Astrolabe requires:
  - Read permissions to see the database
  - Write permissions to contribute data
  - Administrative permission to change aggregation or configuration certificates
- Users decide what data Astrolabe can see
- A VPN setup can be used to hide Astrolabe's internal messages from intruders
- New! Byzantine Agreement based on threshold cryptography is used to secure aggregation functions
25. Data Mining
- Quite a hot area, usually done by collecting information to a centralized node and then querying within that node
- Astrolabe does the comparable thing, but its query evaluation occurs in a decentralized manner
- This is incredibly parallel, hence faster
- And more robust against disruption, too!
26. Cool Astrolabe Properties
- Parallel. Everyone does a tiny bit of work, so we accomplish huge tasks in seconds
- Flexible. Decentralized query evaluation, in seconds
- One aggregate can answer lots of questions. E.g. "where's the nearest supply shed?": the hierarchy encodes many answers in one tree!
27. Aggregation and Hierarchy
- Nearby information:
  - Maintained in more detail; can query it directly
  - Changes are seen sooner
- Remote information is summarized:
  - High-quality aggregated data
  - This also changes as the information evolves
28. Astrolabe summary
- Scalable: could support millions of machines
- Flexible: can easily extend the domain hierarchy, define new columns, or eliminate old ones. Adapts as conditions evolve.
- Secure:
  - Uses keys for authentication and can even encrypt
  - Handles firewalls gracefully, including issues of IP address re-use behind firewalls
- Performs well: updates propagate in seconds
- Cheap to run: tiny load, small memory impact
29. Bimodal Multicast
- A quick glimpse of scalable multicast
- Think about really large Internet configurations:
  - A data center as the data source
  - A typical publication might be going to thousands of client systems
30. The Swiss Stock Exchange problem: virtually synchronous multicast is fragile
[Figure: a process group in which most members are healthy]
31. Performance degrades as the system scales up
[Figure: virtually synchronous Ensemble multicast protocols; average throughput at non-perturbed members (0-250) vs. perturb rate (0-0.9), for group sizes 32, 64, and 96]
32. Why doesn't multicast scale?
- With weak semantics:
  - Faulty behavior may occur more often as system size increases (think of the Internet)
- With stronger reliability semantics:
  - We encounter a system-wide cost (e.g. membership reconfiguration, congestion control)
  - That cost can be triggered more often as a function of scale (more failures, more network events, or bigger latencies)
- A similar observation led Jim Gray to speculate that parallel databases scale as O(n²)
33. But none of this is inevitable
- Recent work on probabilistic solutions suggests that a gossip-based repair strategy scales quite well
- It also gives very steady throughput
- And it can take advantage of hardware support for multicast, if available
34. Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
35. Periodically (e.g. every 100 ms), each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
36. The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
37. Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time. One round of this repair is sketched below.
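Pulling slides 34-37 together, here is a minimal sketch of one repair round (the class and function names are illustrative; in the real protocol digests and solicitations are network messages, and the round is much longer than an RPC):

```python
import random

class Process:
    """Illustrative pbcast participant: it remembers the messages it has delivered."""
    def __init__(self):
        self.delivered = {}                         # (sender, seqno) -> message body

def gossip_round(sender, group):
    """One round: send a digest to a random member, which solicits whatever it lacks."""
    peer = random.choice([p for p in group if p is not sender])
    digest = set(sender.delivered)                  # identifies messages, doesn't include them

    missing = digest - set(peer.delivered)          # peer checks digest against its own history
    for msg_id in missing:                          # ...and solicits copies of missing messages
        peer.delivered[msg_id] = sender.delivered[msg_id]   # solicited retransmission
```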
38. This solves our problem!
[Figure: low-bandwidth and high-bandwidth comparisons of pbcast vs. traditional multicast throughput at perturbed and unperturbed hosts, as a function of perturb rate. Bimodal Multicast rides out disturbances!]
39. Bimodal Multicast Summary
- An extremely scalable technology
- Remains steady and reliable:
  - Even with high rates of message loss (in our tests, as high as 20%)
  - Even with large numbers of perturbed processes (we tested with up to 25%)
  - Even with router failures
  - Even when IP multicast fails
- And we've secured it using digital signatures
40. Kelips
- The third in our set of tools
- A P2P index:
  - put(name, value)
  - get(name)
- Kelips can do lookups with one RPC and is self-stabilizing after disruption
- Unlike Astrolabe, nodes can put varying amounts of data out there
41. Kelips
Take a collection of nodes.
[Figure: nodes 110, 230, 202, and 30]
42. Kelips
Map nodes to affinity groups.
[Figure: affinity groups 0, 1, and 2, with peer membership assigned through a consistent hash; nodes 110, 230, 202, and 30 are placed into their groups]
43. Kelips
Node 110 knows about the other members of its affinity group: 230 and 30.
[Figure: 110's affinity group view, with pointers to members 230 and 30]
44. Kelips
Node 202 is a contact for node 110 in group 2.
[Figure: 110's contact pointers into the other affinity groups, including 202 in group 2]
45. Kelips
"dot.com" maps to group 2, so node 110 tells group 2 to route inquiries about "dot.com" to it.
[Figure: the resource tuple for "dot.com" stored in affinity group 2; a gossip protocol replicates the data cheaply]
46. Kelips
To look up "dot.com", just ask some contact in group 2; it returns 110 (or forwards your request).
[Figure: a lookup routed through a contact in group 2]
47. Kelips summary
- Split the system into √N subgroups
- Map (key, value) pairs to some subgroup by hashing the key
- Replicate within that subgroup
- Each node tracks:
  - Its own group membership
  - k members of each of the other groups
- To look up a key, hash it and ask one or more of your contacts whether they know the value (a sketch follows)
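A minimal sketch of that placement and lookup path (the hash choice, contact selection, and replication-by-gossip are all simplified, and the class and method names are illustrative):

```python
import hashlib
import math
import random

def affinity_group(key, n_groups):
    """Hash a key (or node id) onto one of the ~sqrt(N) affinity groups."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n_groups

class KelipsNode:
    def __init__(self, node_id, n_nodes):
        self.n_groups = max(1, round(math.sqrt(n_nodes)))
        self.group = affinity_group(node_id, self.n_groups)
        self.tuples = {}        # (key -> value) pairs replicated within this node's group
        self.contacts = {}      # group number -> a few known members of that group

    def put(self, key, value):
        g = affinity_group(key, self.n_groups)
        if g == self.group:
            self.tuples[key] = value                         # replicated by gossip in reality
        else:
            random.choice(self.contacts[g]).put(key, value)  # hand off to the right group

    def get(self, key):
        g = affinity_group(key, self.n_groups)
        if g == self.group:
            return self.tuples.get(key)
        return random.choice(self.contacts[g]).get(key)      # one RPC to a contact
```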
48. Kelips summary
- O(√N) storage overhead, which is higher than for other DHTs
- The same space overhead covers the member list, the contact list, and the replicated data itself
- A heuristic is used to keep contacts fresh and to avoid contacts that seem to churn
- This buys us O(1) lookup cost
- And the background overhead is constant
49. Virtual Synchrony
- The last piece of the puzzle
- The outcome of a decade of DARPA-funded work, and the technology core of:
  - The AEGIS integrated console
  - The New York and Swiss Stock Exchanges
  - The French Air Traffic Control System
  - The Florida Electric Power and Light System
50. Virtual Synchrony Model
51. Roles in QuickSilver?
- Provides a way for groups of components to:
  - Replicate data and synchronize
  - Perform tasks in parallel (like parallel database lookups, for improved speed)
  - Detect failures and reconfigure to compensate by regenerating lost functionality (a sketch of the group abstraction follows)
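As a rough sketch of the group abstraction this gives QuickSilver components (the class and method names are illustrative, not the API of any particular toolkit): every member sees the same sequence of membership views, and messages are delivered in the view in which they were sent.

```python
class ProcessGroup:
    """Illustrative virtually synchronous group: one agreed-on view at a time."""

    def __init__(self, members):
        self.view = list(members)                       # current membership, seen by all

    def multicast(self, msg):
        for m in self.view:
            m.deliver(msg, view=tuple(self.view))       # delivered in the same view everywhere

    def member_failed(self, failed):
        self.view = [m for m in self.view if m is not failed]
        for m in self.view:
            m.on_view_change(tuple(self.view))          # survivors reconfigure and can
                                                        # regenerate the lost functionality
```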
52. Replication: the key to understanding QuickSilver
[Diagram relating the four components: Astrolabe, Bimodal Multicast, Kelips, Virtual Synchrony]
53. Metrics
- We plan to look at several:
  - Robustness to externally imposed stress and overload: we expect to demonstrate significant improvements
  - Scalability: graph performance and overheads as a function of scale, load, etc.
  - End-user power: implement JBI, sensor networks, and a data-center management platform
  - Total cost: with Raytheon, explore the impact on real military applications
- Under DURIP funding we have acquired a clustered evaluation platform.
54. Our plan
- Integrate these core components
- Then:
  - Build a JBI layer over the system
  - Integrate Johannes Gehrke's data mining technology into the platform
  - Support scalable overlay multicast (Francis)
- Raytheon: teaming with us to tackle military applications, notably for the Navy
55. More information?
- www.cs.cornell.edu/Info/Projects/QuickSilver