Transcript and Presenter's Notes

Title: Scalable Applications and Real Time Response


1
Scalable Applications and Real Time Response
  • Ashish Motivala
  • CS 614
  • April 17th 2001

2
Scalable Applications and Real Time Response
  • "Using Group Communication Technology to Implement a Reliable and Scalable Distributed IN Coprocessor", Roy Friedman and Ken Birman, TINA 1996.
  • "Manageability, Availability and Performance in Porcupine: a Highly Scalable, Cluster-based Mail Service", Yasushi Saito, Brian N. Bershad and Henry M. Levy, Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999, pages 1-15.

3
Real-time
  • Two categories of real-time:
  • When an action needs to be predictably fast, i.e. critical applications.
  • When an action must be taken before a time limit passes.
  • More often than not, real-time doesn't mean "as fast as possible"; it means "slow and steady".

4
Real problems need real-time
  • Air Traffic Control, Free Flight
  • knowing when planes will be at various locations.
  • Medical Monitoring, Remote Tele-surgery
  • doctors track how a patient responds after a drug is given, or change therapy after some amount of time.
  • Process control software, Robot actions
  • a process controller runs factory floors by coordinating machine tools' activities.

5
More real-time problems
  • Video and multi-media systems
  • synchronous communication protocols that
    coordinate video, voice, and other data sources
  • Telecommunications systems
  • guarantee real-time response despite failures,
    for example when switching telephone calls

6
Predictability
  • If this is our goal
  • Any well-behaved mechanism may be adequate
  • But we should be careful about uncommon
    disruptive cases
  • For example, cost of failure handling is often
    overlooked
  • Risk is that an infrequent scenario will be very
    costly when it occurs

7
Predictability Examples
  • Probabilistic multicast protocol
  • Very predictable if our desired latencies are larger than the protocol's expected convergence time
  • Much less so if we seek latencies close to the protocol's own expected latency

8
Back to the paper
  • Telephone networks need a mixture of properties
  • Real-time response
  • High performance
  • Stable behavior even when failures and recoveries
    occur
  • Can we use our tools to solve such a problem?

9
Role of coprocessor
  • A simple database (a toy version is sketched below)
  • Switch does a query:
  • "How should I route a call to 1800-327-2777 from 607-266-8141?"
  • Reply: use output line 6
  • Time limit of 100ms on the transaction
  • Caller ID, call conferencing, automatic transferring, voice menus, etc.
  • Update database
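To make the coprocessor's job concrete, here is a toy, in-memory version of such a routing database; the table contents and function names are my own illustration, not the paper's code.

    # Hypothetical in-memory 800-number database; entries are illustrative only.
    routes = {"1800-327-2777": 6}          # dialed number -> output line

    def query(dialed, caller):
        """Answer 'how should I route this call?'; must return within 100 ms."""
        return routes.get(dialed)           # e.g. 6 -> "use output line 6"

    def update(dialed, line):
        routes[dialed] = line               # database updates arrive separately

    print(query("1800-327-2777", "607-266-8141"))   # -> 6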

10
IN coprocessor
[Diagram: a network of SS7 switches]
11
IN coprocessor
[Diagram: the same network with a coprocessor attached to each SS7 switch]
12
Present coprocessor
  • Right now, people use hardware fault-tolerant
    machines for this
  • E.g. Stratus pair and a spare
  • Mimics one computer but tolerates hardware
    failures
  • Performance an issue?

13
Goals for coprocessor
  • Requirements:
  • Scalability: ability to use a cluster of machines for the same task, with better performance when we use more nodes
  • Fault-tolerance: a crash or recovery shouldn't disrupt the system
  • Real-time response: must satisfy the 100ms limit at all times
  • Downtime: any period when a series of requests might all be rejected
  • Desired: 7 to 9 "nines" of availability (downtime arithmetic sketched below)
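To make the "nines" target concrete, a quick back-of-the-envelope calculation (mine, not from the slides) of the yearly downtime each level allows:

    # Downtime budget per year for k nines of availability.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in (7, 8, 9):
        unavailability = 10 ** -nines            # fraction of time the service may be down
        print(f"{nines} nines: {unavailability * SECONDS_PER_YEAR:.2f} seconds/year")
    # 7 nines -> ~3.15 s/year; 9 nines -> ~0.03 s/year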

14
SS7 experiment
  • Horus runs the 800 number database on a cluster
    of processors next to the switch
  • Provide replication management tools
  • Provide failure detection and automatic
    configuration

15
IN coprocessor example
[Diagram: SS7 switch connected through a pair of external adaptor (EA) processors to groups of query element (QE) processors]
Switch itself asks for help when a remote-number call is sensed.
External adaptor (EA) processors run the query protocol.
Query Element (QE) processors do the number lookup (in-memory database). Goal: scalable memory without loss of processing performance as the number of nodes is increased.
Primary-backup scheme adapted (using small Horus process groups) to provide fault-tolerance with real-time guarantees.
16
Options?
  • A simple scheme
  • Organize nodes as groups of 2 processes
  • Use virtual synchrony multicast
  • For query
  • For response
  • Also for updates and membership tracking

17
IN coprocessor example
[Diagram: SS7 switch, EA processors, and QE groups]
Step 1: Switch sees incoming request
18
IN coprocessor example
[Diagram: as before]
Step 2: Switch waits while the EA processors multicast the request to a group of query elements (partitioned database)
19
IN coprocessor example
[Diagram: both QE groups marked "Think"]
Step 3: The query elements do the query in duplicate
20
IN coprocessor example
[Diagram: as before]
Step 4: They reply to the group of EA processes
21
IN coprocessor example
[Diagram: as before]
Step 5: EA processes reply to the switch, which routes the call
22
Results!!
  • Terrible performance!
  • Solution has 2 Horus multicasts on each critical
    path
  • Achieved about 600 queries per second, but no more
  • Also slow to handle failures
  • Freezes for as long as 6 seconds
  • Performance doesn't improve much with scale either

23
Next try
  • Consider taking Horus off the critical path
  • Idea is to continue using Horus
  • It manages groups
  • And we use it for updates to the database and for
    partitioning the QE set
  • But no multicasts on critical path
  • Instead use a hand-coded scheme
  • Use sender ordering (FIFO) instead of total ordering

24
Hand-coded scheme
  • Queue up a set of requests from an EA to a QE
  • Periodically (every 15 ms), sweep the set into a message and send it as a batch
  • Process queries also as a batch
  • Send the batch of replies back to the EA (see the sketch below)
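A rough sketch of that batching loop, assuming a single EA-to-QE queue and the 15 ms sweep period from the slide; everything else is invented for illustration.

    import time, queue, threading

    BATCH_INTERVAL = 0.015                   # 15 ms sweep period
    pending = queue.Queue()                  # requests queued at the EA

    def send_to_qe(batch):
        # Stand-in for the real transport; the QE would answer with a batch of replies.
        print("batch of", len(batch), "queries sent")

    def sweeper():
        while True:
            time.sleep(BATCH_INTERVAL)
            batch = []
            while not pending.empty():       # sweep everything queued so far
                batch.append(pending.get())
            if batch:
                send_to_qe(batch)

    threading.Thread(target=sweeper, daemon=True).start()
    for i in range(100):
        pending.put(("query", i))            # switch requests arriving
    time.sleep(0.05)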

25
Clever twists
  • Split into a primary and secondary EA for each
    request
  • Secondary steps in if no reply seen in 50ms
  • Batch size calculated so that 50ms should be
    long enough
  • Alternate primary and secondary after each request (see the sketch below).
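A minimal sketch of the primary/secondary alternation and the 50 ms takeover rule; only the decision logic is shown, and timers, transport and retries are omitted.

    EAS = ("EA1", "EA2")
    TAKEOVER_MS = 50                       # secondary steps in after 50 ms of silence

    def assign_roles(request_id):
        """Alternate primary/secondary for successive requests."""
        primary = EAS[request_id % 2]
        secondary = EAS[(request_id + 1) % 2]
        return primary, secondary

    def responder(request_id, reply_after_ms):
        """Which EA ends up answering, given when (if ever) the primary's reply is seen."""
        primary, secondary = assign_roles(request_id)
        if reply_after_ms is None or reply_after_ms > TAKEOVER_MS:
            return secondary               # no reply within 50 ms: secondary reissues
        return primary

    print(responder(0, 12))    # primary EA1 answers in time
    print(responder(1, None))  # primary EA2 silent, so EA1 takes over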

26
Handling Failure and Overload
  • Failure
  • QE failure: the backup EA reissues the request after half the deadline, without waiting for the failure detector
  • EA failure: the other EA takes over and handles all the requests
  • Overload
  • Drop requests if there is no chance of servicing them, rather than missing all deadlines
  • High and low watermarks (sketched below)
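The watermark rule might look like this; the thresholds are invented, and the point is only the hysteresis: stop admitting above the high mark, resume once the queue drains below the low mark.

    HIGH_WATERMARK = 800     # illustrative queue-length thresholds
    LOW_WATERMARK = 600

    shedding = False

    def admit(queue_length):
        """Return True if a new request should be accepted."""
        global shedding
        if queue_length >= HIGH_WATERMARK:
            shedding = True                  # no chance of meeting deadlines: start dropping
        elif queue_length <= LOW_WATERMARK:
            shedding = False                 # queue drained: accept again
        return not shedding

    print(admit(900), admit(700), admit(500))   # False False True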

27
Results
  • Able to sustain 22,000 emulated telephone calls per second
  • Able to guarantee response within 100ms; no more than 3% of calls are dropped (randomly)
  • Performance is not hurt by a single failure or recovery while the switch is running
  • Can put the database in memory; memory size increases with the number of nodes in the cluster
28
Other settings with a strong temporal element
  • Load balancing
  • Idea is to track load of a set of machines
  • Can do this at an access point or in the client
  • Then want to rebalance by issuing requests
    preferentially to less loaded servers

29
Load balancing in farms
  • Akamai widely cited
  • They download the rarely-changing content from
    customer web sites
  • Distribute this to their own web farm
  • Then use a hacked DNS to redirect web accesses to
    a close-by, less-loaded machine
  • Real-time aspects?
  • The data on which this is based needs to be fresh, or we'll send to the wrong server

30
Conclusions
  • Protocols like pbcast are potentially appealing
    in a subset of applications that are naturally
    probabilistic to begin with, and where we may
    have knowledge of expected load levels, etc.
  • More traditional virtual synchrony protocols with
    strong consistency properties make more sense in
    standard networking settings

31
Future directions in real-time
  • Expect GPS time sources to be common within five
    years
  • Real-time tools like periodic process groups will
    also be readily available (members take actions
    in a temporally coordinated way)
  • Increasing focus on predictable high performance
    rather than provable worst-case performance
  • Increasing use of probabilistic techniques

32
Dimensions of Scalability
  • We often say that we want systems that scale
  • But what does scalability mean?
  • As with reliability and security, the term scalability is very much in the eye of the beholder

33
Scalability
  • As a reliability question
  • Suppose a system experiences some rate of
    disruptions r
  • How does r change as a function of the size of
    the system?
  • If r rises when the system gets larger we would
    say that the system scales poorly
  • Need to ask what "disruption" means, and what "size" means

34
Scalability
  • As a management question
  • Suppose it takes some amount of effort to set up
    the system
  • How does this effort rise for a larger
    configuration?
  • Can lead to surprising discoveries
  • E.g. the 2-machine demo is easy, but setup for
    100 machines is extremely hard to define

35
Scalability
  • As a question about throughput
  • Suppose the system can do t operations each
    second
  • Now I make the system larger
  • Does t increase as a function of system size?
    Decrease?
  • Is the behavior of the system stable, or unstable?

36
Scalability
  • As a question about dependency on configuration
  • Many technologies need to know something about
    the network setup or properties
  • The larger the system, the less we know!
  • This can make a technology fragile, hard to
    configure, and hence poorly scalable

37
Scalability
  • As a question about costs
  • Most systems have a basic cost
  • E.g. 2PC costs 3N messages
  • And many have a background overhead
  • E.g. gossip involves sending one message per
    round, receiving (on avg) one per round, and
    doing some retransmission work (rarely)
  • Can ask how these costs change as we make our
    system larger, or make the network noisier, etc

38
Scalability
  • As a question about environments
  • Small systems are well-behaved
  • But large ones are more like the Internet
  • Packet loss rates and congestion can be problems
  • Performance gets bursty and erratic
  • More heterogeneity of connections and of machines
    on which applications run
  • The larger the environment, the nastier it may be!

39
Scalability
  • As a pro-active question
  • How can we design for scalability?
  • We know a lot about technologies
  • Are certain styles of system more scalable than
    others?

40
Approaches
  • Many ways to evaluate systems
  • Experiments on the real system
  • Emulation environments
  • Simulation
  • Theoretical (analytic)
  • But we need to know what we want to evaluate

41
Dangers
  • Lies, damn lies, and statistics
  • It is much too easy to pick some random property of a system, graph it as a function of something, and declare success
  • We need sophistication in designing our evaluation or we'll miss the point
  • Example: message overhead of gossip
  • Technically, O(n)
  • Does any process or link see this cost?
  • Perhaps not, if protocol is designed carefully

42
Technologies
  • TCP/IP and O/S message-passing architectures like
    U-Net
  • RPC and client-server architectures
  • Transactions and nested transactions
  • Virtual synchrony and replication
  • Other forms of multicast
  • Object oriented architectures
  • Cluster management facilities

43
You've Got Mail
  • Cluster research has focused on web services
  • Mail is an example of a write-intensive
    application
  • disk-bound workload
  • reliability requirements
  • failure recovery
  • Mail servers have relied on a brute-force approach to scaling
  • Big-iron file server, RDBMS

44
Conventional Mail Servers
Static partitioning:
  • Performance problems: no dynamic load balancing
  • Manageability problems: manual data-partition decisions
  • Availability problems: limited fault tolerance
[Diagram: mail servers reached over the Internet, backed by a user DB server and NFS servers]
45
Porcupine's Goals
  • Use commodity hardware to build a large, scalable mail service
  • Performance: linear increase with cluster size
  • Manageability: react to changes automatically
  • Availability: survive failures gracefully

Targets: 1 billion messages/day (100x existing systems), 100 million users (10x existing systems), 1000 nodes (50x existing systems)
46
Key Techniques and Relationships
  • Framework: functional homogeneity (any node can perform any task)
  • Techniques: automatic reconfiguration, load balancing, replication
  • Goals: manageability, performance, availability
47
Porcupine Architecture
[Diagram: every node (A, B, ..., Z) runs the same components: replication manager, mail map, user profile, and mailbox storage]
48
Basic Data Structures
  • User map: apply a hash function to the user name (e.g. "bob") to find the node that manages that user
  • Mail map / user info: per-user list of nodes holding mailbox fragments, e.g. bob → A,C; suzy → A,C; joe → B; ann → B
  • Mailbox storage: each node stores the message fragments for the users whose mail map points to it (Bob's and Suzy's msgs on A and C, Joe's and Ann's on B); a toy version is sketched below
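A toy version of the three structures, assuming an MD5-based user map over three nodes; none of this is Porcupine's actual code.

    import hashlib

    NODES = ["A", "B", "C"]

    # User map: a hash of the user name picks the managing node.
    def managing_node(user):
        h = int(hashlib.md5(user.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    # Mail map: per-user set of nodes that hold mailbox fragments.
    mail_map = {"bob": {"A", "C"}, "suzy": {"A", "C"}, "joe": {"B"}, "ann": {"B"}}

    # Mailbox storage: the fragments themselves, keyed by (node, user).
    mailbox = {("A", "bob"): ["msg1"], ("C", "bob"): ["msg2"]}

    print(managing_node("bob"), mail_map["bob"])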
49
Porcupine Operations
[Diagram phases: protocol handling, user lookup, load balancing, message store]
1. Send mail to bob (DNS-RR selection picks a front-end node)
2. Who manages bob? → A
3. Verify bob
4. OK, bob has msgs on C and D
5. Pick the best nodes to store the new msg → C
6. Store msg (the delivery path is sketched below)
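A condensed sketch of that delivery path; the helper functions are hypothetical stand-ins for Porcupine's internal steps, not its real interfaces.

    def who_manages(user):                     # step 2: user map lookup
        return "A"

    def verify_and_lookup(manager, user):      # steps 3-4: user profile check + mail map
        return ["C", "D"]                      # nodes already holding this user's msgs

    def pick_storage_node(candidates):         # step 5: load-balanced choice
        return candidates[0]

    def store(node, user, msg):                # step 6: append to a mailbox fragment
        print("stored", msg, "for", user, "on", node)

    def deliver(user, msg):                    # run by the front end chosen via DNS-RR
        manager = who_manages(user)
        candidates = verify_and_lookup(manager, user)
        store(pick_storage_node(candidates), user, msg)

    deliver("bob", "hello")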
50
Measurement Environment
  • 30 node cluster of not-quite-all-identical PCs
  • 100 Mb/s Ethernet, 1 Gb/s hubs
  • Linux 2.2.7
  • 42,000 lines of C code
  • Synthetic load
  • Compare to sendmail+popd

51
Performance
  • Goals
  • Scale performance linearly with cluster size
  • Strategy: avoid creating hot spots
  • Partition data uniformly among nodes
  • Fine-grain data partition

52
How does Performance Scale?
[Graph: throughput vs. cluster size; Porcupine reaches 68m messages/day on 30 nodes vs. about 25m/day for the sendmail+popd baseline]
53
Availability
  • Goals
  • Maintain function after failures
  • React quickly to changes regardless of cluster
    size
  • Graceful performance degradation / improvement
  • Strategy
  • Hard state (email messages, user profile) → optimistic fine-grain replication
  • Soft state (user map, mail map) → reconstruction after membership change

54
Soft-state Reconstruction
1. Membership protocol triggers user-map recomputation
2. Distributed disk scan rebuilds the mail map / user info
[Timeline diagram: as membership changes, user-map buckets are reassigned among nodes A, B and C, and mail-map entries (e.g. bob → A,C; suzy → A,B; joe → C; ann → B) are reconstructed from the disk scan]
55
How does Porcupine React to Configuration Changes?
56
Hard-state Replication
  • Goals:
  • Keep serving hard state after failures
  • Handle unusual failure modes
  • Strategy: exploit Internet semantics
  • Optimistic, eventually consistent replication
  • Per-message, per-user-profile replication
  • Efficient during normal operation
  • Small window of inconsistency (a generic sketch follows below)
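One way to picture optimistic, per-message replication is a last-writer-wins update push; this generic sketch is my own illustration, not Porcupine's protocol.

    import time

    class Replica:
        def __init__(self):
            self.store = {}                      # msg_id -> (timestamp, value)

        def apply(self, msg_id, ts, value):
            # Accept an update only if it is newer than what we already have.
            cur = self.store.get(msg_id)
            if cur is None or ts > cur[0]:
                self.store[msg_id] = (ts, value)

    def replicate(msg_id, value, replicas):
        ts = time.time()
        for r in replicas:                       # pushed in the background;
            r.apply(msg_id, ts, value)           # a brief window of inconsistency is tolerated

    a, c = Replica(), Replica()
    replicate("bob/42", "Hi Bob", [a, c])
    print(a.store == c.store)                    # True once the update has reached both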

57
How Efficient is Replication?
[Graph: 68m messages/day without replication vs. 24m/day with replication]
58
How Efficient is Replication?
[Graph: same comparison with an additional configuration at 33m messages/day between the two]
59
Load balancing: deciding where to store messages
  • Goals:
  • Handle skewed workload well
  • Support hardware heterogeneity
  • Strategy: spread-based load balancing
  • Spread: soft limit on # of nodes per mailbox
  • Large spread → better load balance
  • Small spread → better affinity
  • Load balanced within the spread
  • Use # of pending I/O requests as the load measure (see the sketch below)
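A small sketch of spread-based balancing, assuming a hash-derived spread of two candidate nodes and a pending-I/O count per node; all names and numbers are invented.

    import hashlib

    NODES = ["A", "B", "C", "D", "E"]
    SPREAD = 2                                   # soft limit on nodes per mailbox

    pending_io = {"A": 12, "B": 3, "C": 7, "D": 1, "E": 9}   # current load measure

    def spread_for(user):
        """Deterministic set of candidate nodes for this user's mailbox."""
        h = int(hashlib.md5(user.encode()).hexdigest(), 16)
        return [NODES[(h + i) % len(NODES)] for i in range(SPREAD)]

    def pick_node(user):
        # Balance only within the spread: the affinity vs. load-balance trade-off.
        return min(spread_for(user), key=lambda n: pending_io[n])

    print(spread_for("bob"), pick_node("bob"))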

60
How Well does Porcupine Support Heterogeneous
Clusters?
[Graph: with one faster node added, throughput improves by 16.8m/day (25%) under dynamic load balancing vs. 0.5m/day (0.8%) under static partitioning]
61
Claims
  • Symmetric function distribution
  • Distribute user database and user mailbox
  • Lazy data management
  • Self-management
  • Automatic load balancing, membership management
  • Graceful Degradation
  • Cluster remains functional despite any number of
    failures

62
Retrospect
  • Questions
  • How does the system scale?
  • How costly is the failure recovery procedure?
  • Two scenarios tested
  • Steady state
  • Node failure
  • Does Porcupine scale?
  • Paper says yes
  • But in their work we can see a reconfiguration
    disruption when nodes fail or recover
  • With larger scale, frequency of such events will
    rise
  • And the cost is linear in system size
  • Very likely that on large clusters this overhead
    would become dominant!

63
Some Other Interesting Papers
  • "The Next Generation Internet: Unsafe at Any Speed?", Ken Birman
  • "Lessons from Giant-Scale Services", Eric Brewer, UCB