Title: Scalable Applications and Real Time Response
1. Scalable Applications and Real Time Response
- Ashish Motivala
- CS 614
- April 17th, 2001
2. Scalable Applications and Real Time Response
- "Using Group Communication Technology to Implement a Reliable and Scalable Distributed IN Coprocessor." Roy Friedman and Ken Birman. TINA, 1996.
- "Manageability, Availability and Performance in Porcupine: A Highly Scalable, Cluster-Based Mail Service." Yasushi Saito, Brian N. Bershad and Henry M. Levy. Proceedings of the 17th ACM Symposium on Operating Systems Principles, 1999, pages 1-15.
3. Real-time
- Two categories of real-time:
- When an action needs to be predictably fast, i.e. critical applications
- When an action must be taken before a time limit passes
- More often than not, real-time doesn't mean "as fast as possible" but rather "slow and steady"
4. Real problems need real-time
- Air Traffic Control, Free Flight
- when planes are at various locations
- Medical Monitoring, Remote Tele-surgery
- doctors talk about how patients responded after a drug was given, or change therapy after some amount of time
- Process control software, Robot actions
- a process controller runs factory floors by coordinating the activities of machine tools
5. More real-time problems
- Video and multi-media systems
- synchronous communication protocols that coordinate video, voice, and other data sources
- Telecommunications systems
- guarantee real-time response despite failures, for example when switching telephone calls
6. Predictability
- If this is our goal
- Any well-behaved mechanism may be adequate
- But we should be careful about uncommon disruptive cases
- For example, the cost of failure handling is often overlooked
- The risk is that an infrequent scenario will be very costly when it occurs
7. Predictability Examples
- Probabilistic multicast protocol
- Very predictable if our desired latencies are larger than the expected convergence time
- Much less so if we seek latencies that bring us close to the expected latency of the protocol itself
8. Back to the paper
- Telephone networks need a mixture of properties
- Real-time response
- High performance
- Stable behavior even when failures and recoveries occur
- Can we use our tools to solve such a problem?
9. Role of coprocessor
- A simple database
- Switch does a query (sketched below)
- "How should I route a call to 1800-327-2777 from 607-266-8141?"
- Reply: "use output line 6"
- Time limit of 100ms on the transaction
- Caller ID, call conferencing, automatic transferring, voice menus, etc.
- Update database
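A minimal sketch of this kind of lookup and the 100ms budget it must respect; the table contents, function names and deadline handling here are all hypothetical, not taken from the paper:

```python
import time
from typing import Optional

# Hypothetical in-memory routing table: dialed 800 number -> output line.
ROUTING_TABLE = {"1800-327-2777": 6}

def route_query(dialed: str, caller: str, deadline_ms: float = 100.0) -> Optional[int]:
    """Answer "how should I route this call?" within the per-transaction deadline."""
    start = time.monotonic()
    line = ROUTING_TABLE.get(dialed)          # the lookup itself is trivial (caller unused here)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    if elapsed_ms > deadline_ms:              # the hard part is bounding the tail latency
        return None                           # the switch would treat this as a miss
    return line

print(route_query("1800-327-2777", "607-266-8141"))  # -> 6
```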
10. IN coprocessor
(Diagram: four SS7 switches)
11. IN coprocessor
(Diagram: the same four SS7 switches, each paired with a coprocessor)
12. Present coprocessor
- Right now, people use hardware fault-tolerant machines for this
- E.g. Stratus: a "pair and a spare"
- Mimics one computer but tolerates hardware failures
- Is performance an issue?
13. Goals for coprocessor
- Requirements
- Scalability: the ability to use a cluster of machines for the same task, with better performance as we use more nodes
- Fault-tolerance: a crash or recovery shouldn't disrupt the system
- Real-time response: must satisfy the 100ms limit at all times
- Downtime: any period when a series of requests might all be rejected
- Desired: 7 to 9 "nines" of availability
14. SS7 experiment
- Horus runs the 800-number database on a cluster of processors next to the switch
- Provides replication management tools
- Provides failure detection and automatic configuration
15. IN coprocessor example
(Diagram: an SS7 switch connected to two external adaptor (EA) processors, which front the query element (QE) processors)
- The switch itself asks for help when a remote-number call is sensed
- Query element (QE) processors do the number lookup (in-memory database). Goal: scalable memory without loss of processing performance as the number of nodes is increased
- External adaptor (EA) processors run the query protocol
- A primary-backup scheme is adapted (using small Horus process groups) to provide fault-tolerance with real-time guarantees
16. Options?
- A simple scheme:
- Organize nodes as groups of 2 processes
- Use virtual synchrony multicast
- For the query
- For the response
- Also for updates and membership tracking
17. IN coprocessor example
(Diagram on each of the following slides: the SS7 switch, the two EA processes, and the QE groups)
- Step 1: The switch sees an incoming request
18. IN coprocessor example
- Step 2: The switch waits while the EA processes multicast the request to the group of query elements (partitioned database)
19. IN coprocessor example
- Step 3: The query elements do the query in duplicate
20. IN coprocessor example
- Step 4: They reply to the group of EA processes
21. IN coprocessor example
- Step 5: The EA processes reply to the switch, which routes the call
22. Results!!
- Terrible performance!
- The solution has 2 Horus multicasts on each critical path
- Experience: about 600 queries per second, but no more
- Also slow to handle failures
- Freezes for as long as 6 seconds
- Performance doesn't improve much with scale either
23. Next try
- Consider taking Horus off the critical path
- The idea is to continue using Horus:
- It manages groups
- And we use it for updates to the database and for partitioning the QE set
- But no multicasts on the critical path
- Instead, use a hand-coded scheme
- Use sender (FIFO) ordering instead of total ordering
24. Hand-coded scheme
- Queue up a set of requests from an EA to a QE
- Periodically (every 15 ms), sweep the set into a message and send it as a batch (sketched below)
- Process the queries also as a batch
- Send the batch of replies back to the EA
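A rough sketch of the batching idea, assuming hypothetical `send_batch`, `lookup` and `send_replies` callbacks; the paper's actual implementation is not shown here:

```python
import queue
import time

SWEEP_INTERVAL = 0.015        # sweep every 15 ms, as described above

pending = queue.Queue()       # requests queued from an EA to a QE

def ea_sweeper(send_batch):
    """EA side: periodically sweep queued requests into one message and send them as a batch."""
    while True:
        time.sleep(SWEEP_INTERVAL)
        batch = []
        try:
            while True:
                batch.append(pending.get_nowait())
        except queue.Empty:
            pass
        if batch:
            send_batch(batch)             # one network send instead of one per request

def qe_handle_batch(batch, lookup, send_replies):
    """QE side: process the whole batch of queries, then return the replies as a batch."""
    send_replies([lookup(request) for request in batch])
```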
25. Clever twists
- Split into a primary and secondary EA for each request
- The secondary steps in if no reply is seen within 50ms
- The batch size is calculated so that 50ms should be long enough
- Alternate primary and secondary after each request
26. Handling Failure and Overload
- Failure
- QE failure: the backup EA reissues the request after half the deadline, without waiting for the failure detector
- EA failure: the other EA takes over and handles all the requests
- Overload
- Drop requests if there is no chance of servicing them, rather than missing all deadlines
- High and low watermarks (sketched below)
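A sketch of this failure/overload policy under simplifying assumptions; `primary_qe` and `backup_qe` are stand-in callables, and the watermark values are illustrative, not from the paper:

```python
DEADLINE = 0.100      # 100 ms per-transaction budget
HIGH_WATER = 5000     # illustrative watermark values
LOW_WATER = 4000

def handle(request, primary_qe, backup_qe, queue_len, shedding):
    """Apply the overload and failover rules described above to one request."""
    # Overload: start dropping above the high watermark, resume below the low one,
    # so that some requests meet their deadline instead of none.
    if queue_len > HIGH_WATER:
        shedding = True
    elif queue_len < LOW_WATER:
        shedding = False
    if shedding:
        return None, shedding                     # drop the request outright

    # Failure: give the primary half the deadline; the backup EA then reissues
    # the request without waiting for the failure detector to trip.
    reply = primary_qe(request, timeout=DEADLINE / 2)
    if reply is None:
        reply = backup_qe(request, timeout=DEADLINE / 2)
    return reply, shedding
```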
27. Results
- Able to sustain 22,000 emulated telephone calls per second
- Able to guarantee response within 100ms, with no more than 3% of calls dropped (randomly)
- Performance is not hurt by a single failure or recovery while the switch is running
- Can put the database in memory: memory size increases with the number of nodes in the cluster
28. Other settings with a strong temporal element
- Load balancing
- The idea is to track the load of a set of machines
- Can do this at an access point or in the client
- Then rebalance by issuing requests preferentially to less loaded servers
29. Load balancing in farms
- Akamai is widely cited
- They download the rarely-changing content from customer web sites
- Distribute this to their own web farm
- Then use a hacked DNS to redirect web accesses to a close-by, less-loaded machine
- Real-time aspects?
- The data on which this is based needs to be fresh, or we'll send to the wrong server
30. Conclusions
- Protocols like pbcast are potentially appealing in a subset of applications that are naturally probabilistic to begin with, and where we may have knowledge of expected load levels, etc.
- More traditional virtual synchrony protocols with strong consistency properties make more sense in standard networking settings
31. Future directions in real-time
- Expect GPS time sources to be common within five years
- Real-time tools like periodic process groups will also be readily available (members take actions in a temporally coordinated way)
- Increasing focus on predictable high performance rather than provable worst-case performance
- Increasing use of probabilistic techniques
32. Dimensions of Scalability
- We often say that we want systems that scale
- But what does scalability mean?
- As with reliability and security, the term "scalability" is very much in the eye of the beholder
33. Scalability
- As a reliability question
- Suppose a system experiences some rate of disruptions, r
- How does r change as a function of the size of the system?
- If r rises when the system gets larger, we would say that the system scales poorly
- Need to ask what "disruption" means, and what "size" means
34. Scalability
- As a management question
- Suppose it takes some amount of effort to set up the system
- How does this effort rise for a larger configuration?
- Can lead to surprising discoveries
- E.g. the 2-machine demo is easy, but the setup for 100 machines is extremely hard to define
35. Scalability
- As a question about throughput
- Suppose the system can do t operations each second
- Now I make the system larger
- Does t increase as a function of system size? Decrease?
- Is the behavior of the system stable, or unstable?
36. Scalability
- As a question about dependency on configuration
- Many technologies need to know something about the network setup or properties
- The larger the system, the less we know!
- This can make a technology fragile, hard to configure, and hence poorly scalable
37. Scalability
- As a question about costs
- Most systems have a basic cost
- E.g. 2PC costs 3N messages
- And many have a background overhead
- E.g. gossip involves sending one message per round, receiving (on average) one per round, and doing some retransmission work (rarely)
- Can ask how these costs change as we make the system larger, or make the network noisier, etc. (see the toy example below)
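A toy illustration of how these two kinds of cost scale; the 3N figure comes from the slide, everything else is a simplification of my own:

```python
def two_phase_commit_messages(n: int) -> int:
    """Basic cost: roughly 3N messages per 2PC round (prepare to N participants,
    N votes back, N commit/abort notifications)."""
    return 3 * n

def gossip_messages_per_node_per_round() -> int:
    """Background overhead: each node sends one gossip message per round and
    receives about one, independent of system size."""
    return 1

for n in (10, 100, 1000):
    print(f"n={n:5d}  2PC messages={two_phase_commit_messages(n):5d}  "
          f"gossip per-node per-round={gossip_messages_per_node_per_round()}")
```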
38. Scalability
- As a question about environments
- Small systems are well-behaved
- But large ones are more like the Internet
- Packet loss rates and congestion can be problems
- Performance gets bursty and erratic
- More heterogeneity of connections and of the machines on which applications run
- The larger the environment, the nastier it may be!
39. Scalability
- As a pro-active question
- How can we design for scalability?
- We know a lot about technologies
- Are certain styles of system more scalable than others?
40. Approaches
- Many ways to evaluate systems
- Experiments on the real system
- Emulation environments
- Simulation
- Theoretical (analytic)
- But we need to know what we want to evaluate
41. Dangers
- Lies, damn lies, and statistics
- It is much too easy to pick some random property of a system, graph it as a function of something, and declare success
- We need sophistication in designing our evaluation, or we'll miss the point
- Example: message overhead of gossip
- Technically, O(n)
- Does any process or link see this cost?
- Perhaps not, if the protocol is designed carefully (see the simulation sketch below)
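A small simulation sketch (my own, not from the lecture) that makes the point concrete: total gossip traffic per round is O(n), yet each process sends exactly one message and receives roughly one:

```python
import random

def gossip_round(n: int):
    """One push-gossip round: every process sends one message to a random peer."""
    received = [0] * n
    for p in range(n):
        peer = random.choice([q for q in range(n) if q != p])
        received[peer] += 1
    return n, max(received)          # total messages, worst per-process receive count

for n in (10, 100, 1000):
    total, max_recv = gossip_round(n)
    # Total cost grows linearly with n, but no single process or link carries it:
    # each process sends 1 message and receives ~1 (the max stays small, not O(n)).
    print(f"n={n:5d}  total={total:5d}  max received by one process={max_recv}")
```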
42. Technologies
- TCP/IP and O/S message-passing architectures like U-Net
- RPC and client-server architectures
- Transactions and nested transactions
- Virtual synchrony and replication
- Other forms of multicast
- Object-oriented architectures
- Cluster management facilities
43. You've Got Mail
- Cluster research has focused on web services
- Mail is an example of a write-intensive application
- disk-bound workload
- reliability requirements
- failure recovery
- Mail servers have relied on a brute-force approach to scaling
- Big-iron file server, RDBMS
44. Conventional Mail Servers
- Static partitioning
- Performance problems: no dynamic load balancing
- Manageability problems: manual data partitioning decisions
- Availability problems: limited fault tolerance
(Diagram: the Internet feeding a user DB server and statically partitioned NFS servers)
45. Porcupine's Goals
- Use commodity hardware to build a large, scalable mail service
- Performance: linear increase with cluster size
- Manageability: react to changes automatically
- Availability: survive failures gracefully
- Targets: 1 billion messages/day (100x existing systems), 100 million users (10x existing systems), 1000 nodes (50x existing systems)
46. Key Techniques and Relationships
- Framework: functional homogeneity (any node can perform any task)
- Techniques: automatic reconfiguration, load balancing, replication
- Goals: manageability, performance, availability
47. Porcupine Architecture
(Diagram: nodes A, B, ..., Z, each running a replication manager and holding a mail map, mailbox storage, and user profiles)
48. Basic Data Structures
- User map: apply a hash function to a user name (e.g. "bob") to find the node that manages that user
- Mail map / user info (per user, a set of nodes): bob -> {A, C}, suzy -> {A, C}, joe -> {B}, ann -> {B}
- Mailbox storage (per node): A holds Bob's and Suzy's messages, B holds Joe's and Ann's messages, C holds Bob's and Suzy's messages (sketched below)
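A minimal sketch of these two structures; the hash function, bucket count and node names are illustrative choices of mine, not Porcupine's actual code (its real user map has many more buckets):

```python
import hashlib

# User map: hash bucket -> node that manages the user (soft state, replicated everywhere).
USER_MAP = {0: "A", 1: "B", 2: "C", 3: "A"}

# Mail map, held by each user's managing node: user -> nodes storing mailbox fragments.
MAIL_MAP = {"bob": ["A", "C"], "suzy": ["A", "C"], "joe": ["B"], "ann": ["B"]}

def managing_node(user: str) -> str:
    """Apply the hash function to a user name to find the node that manages it."""
    bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % len(USER_MAP)
    return USER_MAP[bucket]

print(managing_node("bob"), MAIL_MAP["bob"])   # manager node, plus fragments on A and C
```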
49. Porcupine Operations
- Phases: protocol handling, user lookup, load balancing, message store
- (Diagram: the client reaches a front-end node chosen by DNS round-robin)
1. "Send mail to bob"
2. Who manages bob? -> A
3. Verify bob
4. "OK, bob has msgs on C and D"
5. Pick the best node to store the new msg -> C
6. Store msg (the whole delivery path is sketched below)
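A sketch of that delivery path; every callable here is a stand-in parameter rather than Porcupine's API:

```python
def deliver(msg: str, user: str, managing_node, mail_map_lookup, pick_best_node, store):
    """Walk the numbered steps above for one incoming message."""
    manager = managing_node(user)                 # step 2: which node manages this user?
    fragments = mail_map_lookup(manager, user)    # steps 3-4: verify user, get fragment nodes
    target = pick_best_node(user, fragments)      # step 5: load balancer picks the best node
    store(target, user, msg)                      # step 6: store the message there
    return target
```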
50. Measurement Environment
- 30-node cluster of not-quite-all-identical PCs
- 100 Mb/s Ethernet, 1 Gb/s hubs
- Linux 2.2.7
- 42,000 lines of C code
- Synthetic load
- Compared to sendmail+popd
51. Performance
- Goals
- Scale performance linearly with cluster size
- Strategy: avoid creating hot spots
- Partition data uniformly among nodes
- Fine-grain data partitioning
52. How does Performance Scale?
(Graph: labeled data points at 25m/day and 68m/day)
53. Availability
- Goals
- Maintain function after failures
- React quickly to changes regardless of cluster size
- Graceful performance degradation / improvement
- Strategy
- Hard state (email messages, user profile) -> optimistic fine-grain replication
- Soft state (user map, mail map) -> reconstruction after membership change
54. Soft-state Reconstruction
(Diagram, timeline: 1. membership protocol and user-map recomputation; 2. distributed disk scan rebuilding the mail map, e.g. bob -> {A, C}, suzy -> {A, B}, joe -> {C}, ann -> {B}; a sketch of the idea follows below)
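A rough sketch of the reconstruction steps under simplifying assumptions: the round-robin bucket reassignment and the helper names (`local_users_on`, `register`) are invented for illustration:

```python
def recompute_user_map(buckets: int, live_nodes: list) -> dict:
    """Step 1: after the membership protocol settles, reassign user-map buckets
    across the surviving nodes (here: simple round-robin)."""
    return {b: live_nodes[b % len(live_nodes)] for b in range(buckets)}

def rebuild_mail_maps(live_nodes: list, local_users_on, register) -> None:
    """Step 2: each surviving node scans its local disk and re-registers every user
    whose mailbox fragments it stores with that user's (possibly new) manager."""
    for node in live_nodes:
        for user in local_users_on(node):   # the distributed disk scan
            register(user, node)            # manager adds `node` to the user's mail map
```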
55. How does Porcupine React to Configuration Changes?
56. Hard-state Replication
- Goals
- Keep serving hard state after failures
- Handle unusual failure modes
- Strategy: exploit Internet semantics
- Optimistic, eventually consistent replication
- Per-message, per-user-profile replication
- Efficient during normal operation
- Small window of inconsistency (sketched below)
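A minimal sketch of this optimistic, eventually consistent style; the thread-per-push structure and the helper names are my simplifications, not Porcupine's replication manager:

```python
import threading

def replicate_message(msg_id: str, msg: str, peers: list, local_store: dict, push) -> str:
    """Store the message locally, acknowledge, then push to peer replicas in the background."""
    local_store[msg_id] = msg                 # durable local write comes first
    for node in peers:                        # replication proceeds asynchronously
        threading.Thread(target=push, args=(node, msg_id, msg), daemon=True).start()
    return "OK"                               # ack before every replica has confirmed

def apply_push(store: dict, msg_id: str, msg: str) -> None:
    """Receiving replica: applying the same message twice is harmless, so retries
    during the small window of inconsistency are safe."""
    store.setdefault(msg_id, msg)
```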
57. How Efficient is Replication?
(Graph: labeled data points at 24m/day and 68m/day)
58. How Efficient is Replication?
(Graph: labeled data points at 24m/day, 33m/day, and 68m/day)
59. Load balancing: deciding where to store messages
- Goals
- Handle skewed workloads well
- Support hardware heterogeneity
- Strategy: spread-based load balancing
- Spread: a soft limit on the number of nodes per mailbox
- Large spread -> better load balance
- Small spread -> better affinity
- Load is balanced within the spread
- Use the number of pending I/O requests as the load measure (sketched below)
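A sketch of spread-based selection, assuming a hash-offset construction of the spread; this is an illustration of the idea, not Porcupine's actual policy code:

```python
import hashlib

SPREAD = 2    # soft limit on the number of nodes per mailbox

def spread_nodes(user: str, nodes: list) -> list:
    """Hash the user to a deterministic set of SPREAD candidate nodes."""
    h = int(hashlib.md5(user.encode()).hexdigest(), 16)
    return [nodes[(h + i) % len(nodes)] for i in range(SPREAD)]

def pick_best_node(user: str, nodes: list, pending_io: dict) -> str:
    """Within the spread, pick the node with the fewest pending I/O requests."""
    return min(spread_nodes(user, nodes), key=lambda n: pending_io[n])

print(pick_best_node("bob", ["A", "B", "C"], {"A": 12, "B": 3, "C": 7}))
```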
60. How Well does Porcupine Support Heterogeneous Clusters?
(Graph: labeled data points at 16.8m/day (25) and 0.5m/day (0.8))
61. Claims
- Symmetric function distribution
- Distribute the user database and user mailboxes
- Lazy data management
- Self-management
- Automatic load balancing, membership management
- Graceful degradation
- The cluster remains functional despite any number of failures
62. Retrospect
- Questions
- How does the system scale?
- How costly is the failure recovery procedure?
- Two scenarios tested
- Steady state
- Node failure
- Does Porcupine scale?
- The paper says yes
- But in their work we can see a reconfiguration disruption when nodes fail or recover
- With larger scale, the frequency of such events will rise
- And the cost is linear in system size
- Very likely that on large clusters this overhead would become dominant!
63. Some Other Interesting Papers
- "The Next Generation Internet: Unsafe at Any Speed?" Ken Birman
- "Lessons from Giant-Scale Services," Eric Brewer, UCB