Transcript and Presenter's Notes

Title: Scalable Applications and Real Time Response


1
Scalable Applications and Real Time Response
  • Ashish Motivala
  • CS 614
  • April 17th 2001

2
Scalable Applications and Real Time Response
  • "Using Group Communication Technology to Implement a Reliable and Scalable Distributed IN Coprocessor", Roy Friedman and Ken Birman, TINA 1996.
  • "Manageability, Availability and Performance in Porcupine: a Highly Scalable, Cluster-based Mail Service", Yasushi Saito, Brian N. Bershad and Henry M. Levy, Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999, pages 1-15.

3
Real-time
  • Two categories of real-time:
  • When an action needs to be predictably fast, i.e. critical applications.
  • When an action must be taken before a time limit passes.
  • More often than not, real-time doesn't mean "as fast as possible"; it means "slow and steady".

4
Real problems need real-time
  • Air Traffic Control, Free Flight
  • knowing when planes will be at various locations.
  • Medical Monitoring, Remote Tele-surgery
  • doctors track how a patient responds after a drug is given, or change therapy after some amount of time.
  • Process control software, Robot actions
  • a process controller runs factory floors by coordinating machine tools' activities.

5
More real-time problems
  • Video and multi-media systems
  • synchronous communication protocols that
    coordinate video, voice, and other data sources
  • Telecommunications systems
  • guarantee real-time response despite failures,
    for example when switching telephone calls

6
Predictability
  • If this is our goal
  • Any well-behaved mechanism may be adequate
  • But we should be careful about uncommon
    disruptive cases
  • For example, cost of failure handling is often
    overlooked
  • Risk is that an infrequent scenario will be very
    costly when it occurs

7
Predictability Examples
  • Probabilistic multicast protocol
  • Very predictable if our desired latencies are larger than the protocol's expected convergence time
  • Much less so if we seek latencies close to the protocol's own expected latency

8
Back to the paper
  • Telephone networks need a mixture of properties
  • Real-time response
  • High performance
  • Stable behavior even when failures and recoveries
    occur
  • Can we use our tools to solve such a problem?

9
Role of coprocessor
  • A simple database (a toy version is sketched below)
  • Switch does a query:
  • "How should I route a call to 1800-327-2777 from 607-266-8141?"
  • Reply: use output line 6
  • Time limit of 100ms on the transaction
  • Caller ID, call conferencing, automatic transferring, voice menus, etc.
  • Update database
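To make the coprocessor's job concrete, here is a toy, in-memory version of such a routing database; the table contents and function names are my own illustration, not the paper's code.

    # Hypothetical in-memory 800-number database; entries are illustrative only.
    routes = {"1800-327-2777": 6}          # dialed number -> output line

    def query(dialed, caller):
        """Answer 'how should I route this call?'; must return within 100 ms."""
        return routes.get(dialed)           # e.g. 6 -> "use output line 6"

    def update(dialed, line):
        routes[dialed] = line               # database updates arrive separately

    print(query("1800-327-2777", "607-266-8141"))   # -> 6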

10
IN coprocessor
[Diagram: a network of SS7 switches]
11
IN coprocessor
[Diagram: the same network with a coprocessor attached to each SS7 switch]
12
Present coprocessor
  • Right now, people use hardware fault-tolerant
    machines for this
  • E.g. Stratus pair and a spare
  • Mimics one computer but tolerates hardware
    failures
  • Performance an issue?

13
Goals for coprocessor
  • Requirements:
  • Scalability: ability to use a cluster of machines for the same task, with better performance when we use more nodes
  • Fault-tolerance: a crash or recovery shouldn't disrupt the system
  • Real-time response: must satisfy the 100ms limit at all times
  • Downtime: any period when a series of requests might all be rejected
  • Desired: 7 to 9 "nines" of availability (downtime arithmetic sketched below)
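To make the "nines" target concrete, a quick back-of-the-envelope calculation (mine, not from the slides) of the yearly downtime each level allows:

    # Downtime budget per year for k nines of availability.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in (7, 8, 9):
        unavailability = 10 ** -nines            # fraction of time the service may be down
        print(f"{nines} nines: {unavailability * SECONDS_PER_YEAR:.2f} seconds/year")
    # 7 nines -> ~3.15 s/year; 9 nines -> ~0.03 s/year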

14
SS7 experiment
  • Horus runs the 800 number database on a cluster
    of processors next to the switch
  • Provide replication management tools
  • Provide failure detection and automatic
    configuration

15
IN coprocessor example
[Diagram: SS7 switch connected through a pair of external adaptor (EA) processors to groups of query element (QE) processors]
Switch itself asks for help when a remote-number call is sensed.
External adaptor (EA) processors run the query protocol.
Query Element (QE) processors do the number lookup (in-memory database). Goal: scalable memory without loss of processing performance as the number of nodes is increased.
Primary-backup scheme adapted (using small Horus process groups) to provide fault-tolerance with real-time guarantees.
16
Options?
  • A simple scheme
  • Organize nodes as groups of 2 processes
  • Use virtual synchrony multicast
  • For query
  • For response
  • Also for updates and membership tracking

17
IN coprocessor example
[Diagram: SS7 switch, EA processors, and QE groups]
Step 1: Switch sees incoming request
18
IN coprocessor example
[Diagram: as before]
Step 2: Switch waits while the EA processors multicast the request to a group of query elements (partitioned database)
19
IN coprocessor example
[Diagram: both QE groups marked "Think"]
Step 3: The query elements do the query in duplicate
20
IN coprocessor example
[Diagram: as before]
Step 4: They reply to the group of EA processes
21
IN coprocessor example
[Diagram: as before]
Step 5: EA processes reply to the switch, which routes the call
22
Results!!
  • Terrible performance!
  • Solution has 2 Horus multicasts on each critical
    path
  • Achieved about 600 queries per second, but no more
  • Also slow to handle failures
  • Freezes for as long as 6 seconds
  • Performance doesn't improve much with scale either

23
Next try
  • Consider taking Horus off the critical path
  • Idea is to continue using Horus
  • It manages groups
  • And we use it for updates to the database and for
    partitioning the QE set
  • But no multicasts on critical path
  • Instead use a hand-coded scheme
  • Use sender ordering (FIFO) instead of total ordering

24
Hand-coded scheme
  • Queue up a set of requests from an EA to a QE
  • Periodically (every 15 ms), sweep the set into a message and send it as a batch
  • Process queries also as a batch
  • Send the batch of replies back to the EA (see the sketch below)
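A rough sketch of that batching loop, assuming a single EA-to-QE queue and the 15 ms sweep period from the slide; everything else is invented for illustration.

    import time, queue, threading

    BATCH_INTERVAL = 0.015                   # 15 ms sweep period
    pending = queue.Queue()                  # requests queued at the EA

    def send_to_qe(batch):
        # Stand-in for the real transport; the QE would answer with a batch of replies.
        print("batch of", len(batch), "queries sent")

    def sweeper():
        while True:
            time.sleep(BATCH_INTERVAL)
            batch = []
            while not pending.empty():       # sweep everything queued so far
                batch.append(pending.get())
            if batch:
                send_to_qe(batch)

    threading.Thread(target=sweeper, daemon=True).start()
    for i in range(100):
        pending.put(("query", i))            # switch requests arriving
    time.sleep(0.05)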

25
Clever twists
  • Split into a primary and secondary EA for each
    request
  • Secondary steps in if no reply seen in 50ms
  • Batch size calculated so that 50ms should be
    long enough
  • Alternate primary and secondary after each request (see the sketch below).
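A minimal sketch of the primary/secondary alternation and the 50 ms takeover rule; only the decision logic is shown, and timers, transport and retries are omitted.

    EAS = ("EA1", "EA2")
    TAKEOVER_MS = 50                       # secondary steps in after 50 ms of silence

    def assign_roles(request_id):
        """Alternate primary/secondary for successive requests."""
        primary = EAS[request_id % 2]
        secondary = EAS[(request_id + 1) % 2]
        return primary, secondary

    def responder(request_id, reply_after_ms):
        """Which EA ends up answering, given when (if ever) the primary's reply is seen."""
        primary, secondary = assign_roles(request_id)
        if reply_after_ms is None or reply_after_ms > TAKEOVER_MS:
            return secondary               # no reply within 50 ms: secondary reissues
        return primary

    print(responder(0, 12))    # primary EA1 answers in time
    print(responder(1, None))  # primary EA2 silent, so EA1 takes over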

26
Handling Failure and Overload
  • Failure
  • QE failure: the backup EA reissues the request after half the deadline, without waiting for the failure detector
  • EA failure: the other EA takes over and handles all the requests
  • Overload
  • Drop requests if there is no chance of servicing them, rather than missing all deadlines
  • High and low watermarks (sketched below)
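The watermark rule might look like this; the thresholds are invented, and the point is only the hysteresis: stop admitting above the high mark, resume once the queue drains below the low mark.

    HIGH_WATERMARK = 800     # illustrative queue-length thresholds
    LOW_WATERMARK = 600

    shedding = False

    def admit(queue_length):
        """Return True if a new request should be accepted."""
        global shedding
        if queue_length >= HIGH_WATERMARK:
            shedding = True                  # no chance of meeting deadlines: start dropping
        elif queue_length <= LOW_WATERMARK:
            shedding = False                 # queue drained: accept again
        return not shedding

    print(admit(900), admit(700), admit(500))   # False False True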

27
Results
  • Able to sustain 22,000 emulated telephone calls per second
  • Able to guarantee response within 100ms; no more than 3% of calls are dropped (randomly)
  • Performance is not hurt by a single failure or recovery while the switch is running
  • Can put the database in memory; memory size increases with the number of nodes in the cluster
28
Other settings with a strong temporal element
  • Load balancing
  • Idea is to track load of a set of machines
  • Can do this at an access point or in the client
  • Then want to rebalance by issuing requests
    preferentially to less loaded servers

29
Load balancing in farms
  • Akamai widely cited
  • They download the rarely-changing content from
    customer web sites
  • Distribute this to their own web farm
  • Then use a hacked DNS to redirect web accesses to
    a close-by, less-loaded machine
  • Real-time aspects?
  • The data on which this is based needs to be fresh, or we'll send to the wrong server

30
Conclusions
  • Protocols like pbcast are potentially appealing
    in a subset of applications that are naturally
    probabilistic to begin with, and where we may
    have knowledge of expected load levels, etc.
  • More traditional virtual synchrony protocols with
    strong consistency properties make more sense in
    standard networking settings

31
Future directions in real-time
  • Expect GPS time sources to be common within five
    years
  • Real-time tools like periodic process groups will
    also be readily available (members take actions
    in a temporally coordinated way)
  • Increasing focus on predictable high performance
    rather than provable worst-case performance
  • Increasing use of probabilistic techniques

32
Dimensions of Scalability
  • We often say that we want systems that scale
  • But what does scalability mean?
  • As with reliability and security, the term scalability is very much in the eye of the beholder

33
Scalability
  • As a reliability question
  • Suppose a system experiences some rate of
    disruptions r
  • How does r change as a function of the size of
    the system?
  • If r rises when the system gets larger we would
    say that the system scales poorly
  • Need to ask what "disruption" means, and what "size" means

34
Scalability
  • As a management question
  • Suppose it takes some amount of effort to set up
    the system
  • How does this effort rise for a larger
    configuration?
  • Can lead to surprising discoveries
  • E.g. the 2-machine demo is easy, but setup for
    100 machines is extremely hard to define

35
Scalability
  • As a question about throughput
  • Suppose the system can do t operations each
    second
  • Now I make the system larger
  • Does t increase as a function of system size?
    Decrease?
  • Is the behavior of the system stable, or unstable?

36
Scalability
  • As a question about dependency on configuration
  • Many technologies need to know something about
    the network setup or properties
  • The larger the system, the less we know!
  • This can make a technology fragile, hard to
    configure, and hence poorly scalable

37
Scalability
  • As a question about costs
  • Most systems have a basic cost
  • E.g. 2PC costs 3N messages
  • And many have a background overhead
  • E.g. gossip involves sending one message per
    round, receiving (on avg) one per round, and
    doing some retransmission work (rarely)
  • Can ask how these costs change as we make our
    system larger, or make the network noisier, etc

38
Scalability
  • As a question about environments
  • Small systems are well-behaved
  • But large ones are more like the Internet
  • Packet loss rates and congestion can be problems
  • Performance gets bursty and erratic
  • More heterogeneity of connections and of machines
    on which applications run
  • The larger the environment, the nastier it may be!

39
Scalability
  • As a pro-active question
  • How can we design for scalability?
  • We know a lot about technologies
  • Are certain styles of system more scalable than
    others?

40
Approaches
  • Many ways to evaluate systems
  • Experiments on the real system
  • Emulation environments
  • Simulation
  • Theoretical (analytic)
  • But we need to know what we want to evaluate

41
Dangers
  • Lies, damn lies, and statistics
  • It is much too easy to pick some random property of a system, graph it as a function of something, and declare success
  • We need sophistication in designing our evaluation or we'll miss the point
  • Example: message overhead of gossip
  • Technically, O(n)
  • Does any process or link see this cost?
  • Perhaps not, if protocol is designed carefully

42
Technologies
  • TCP/IP and O/S message-passing architectures like
    U-Net
  • RPC and client-server architectures
  • Transactions and nested transactions
  • Virtual synchrony and replication
  • Other forms of multicast
  • Object oriented architectures
  • Cluster management facilities

43
You've Got Mail
  • Cluster research has focused on web services
  • Mail is an example of a write-intensive
    application
  • disk-bound workload
  • reliability requirements
  • failure recovery
  • Mail servers have relied on a brute-force approach to scaling
  • Big-iron file server, RDBMS

44
Conventional Mail Servers
Static partitioning:
  • Performance problems: no dynamic load balancing
  • Manageability problems: manual data-partition decisions
  • Availability problems: limited fault tolerance
[Diagram: mail servers reached over the Internet, backed by a user DB server and NFS servers]
45
Porcupine's Goals
  • Use commodity hardware to build a large, scalable mail service
  • Performance: linear increase with cluster size
  • Manageability: react to changes automatically
  • Availability: survive failures gracefully

Targets: 1 billion messages/day (100x existing systems), 100 million users (10x existing systems), 1000 nodes (50x existing systems)
46
Key Techniques and Relationships
  • Framework: functional homogeneity (any node can perform any task)
  • Techniques: automatic reconfiguration, load balancing, replication
  • Goals: manageability, performance, availability
47
Porcupine Architecture
[Diagram: every node (A, B, ..., Z) runs the same components: replication manager, mail map, user profile, and mailbox storage]
48
Basic Data Structures
  • User map: apply a hash function to the user name (e.g. "bob") to find the node that manages that user
  • Mail map / user info: per-user list of nodes holding mailbox fragments, e.g. bob → A,C; suzy → A,C; joe → B; ann → B
  • Mailbox storage: each node stores the message fragments for the users whose mail map points to it (Bob's and Suzy's msgs on A and C, Joe's and Ann's on B); a toy version is sketched below
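A toy version of the three structures, assuming an MD5-based user map over three nodes; none of this is Porcupine's actual code.

    import hashlib

    NODES = ["A", "B", "C"]

    # User map: a hash of the user name picks the managing node.
    def managing_node(user):
        h = int(hashlib.md5(user.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    # Mail map: per-user set of nodes that hold mailbox fragments.
    mail_map = {"bob": {"A", "C"}, "suzy": {"A", "C"}, "joe": {"B"}, "ann": {"B"}}

    # Mailbox storage: the fragments themselves, keyed by (node, user).
    mailbox = {("A", "bob"): ["msg1"], ("C", "bob"): ["msg2"]}

    print(managing_node("bob"), mail_map["bob"])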
49
Porcupine Operations
[Diagram phases: protocol handling, user lookup, load balancing, message store]
1. Send mail to bob (DNS-RR selection picks a front-end node)
2. Who manages bob? → A
3. Verify bob
4. OK, bob has msgs on C and D
5. Pick the best nodes to store the new msg → C
6. Store msg (the delivery path is sketched below)
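A condensed sketch of that delivery path; the helper functions are hypothetical stand-ins for Porcupine's internal steps, not its real interfaces.

    def who_manages(user):                     # step 2: user map lookup
        return "A"

    def verify_and_lookup(manager, user):      # steps 3-4: user profile check + mail map
        return ["C", "D"]                      # nodes already holding this user's msgs

    def pick_storage_node(candidates):         # step 5: load-balanced choice
        return candidates[0]

    def store(node, user, msg):                # step 6: append to a mailbox fragment
        print("stored", msg, "for", user, "on", node)

    def deliver(user, msg):                    # run by the front end chosen via DNS-RR
        manager = who_manages(user)
        candidates = verify_and_lookup(manager, user)
        store(pick_storage_node(candidates), user, msg)

    deliver("bob", "hello")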
50
Measurement Environment
  • 30 node cluster of not-quite-all-identical PCs
  • 100 Mb/s Ethernet, 1 Gb/s hubs
  • Linux 2.2.7
  • 42,000 lines of C code
  • Synthetic load
  • Compare to sendmail+popd

51
Performance
  • Goals
  • Scale performance linearly with cluster size
  • Strategy: avoid creating hot spots
  • Partition data uniformly among nodes
  • Fine-grain data partition

52
How does Performance Scale?
[Graph: throughput vs. cluster size; Porcupine reaches 68m messages/day on 30 nodes vs. about 25m/day for the sendmail+popd baseline]
53
Availability
  • Goals
  • Maintain function after failures
  • React quickly to changes regardless of cluster
    size
  • Graceful performance degradation / improvement
  • Strategy
  • Hard state (email messages, user profile) → optimistic fine-grain replication
  • Soft state (user map, mail map) → reconstruction after membership change

54
Soft-state Reconstruction
1. Membership protocol triggers user-map recomputation
2. Distributed disk scan rebuilds the mail map / user info
[Timeline diagram: as membership changes, user-map buckets are reassigned among nodes A, B and C, and mail-map entries (e.g. bob → A,C; suzy → A,B; joe → C; ann → B) are reconstructed from the disk scan]
55
How does Porcupine React to Configuration Changes?
56
Hard-state Replication
  • Goals:
  • Keep serving hard state after failures
  • Handle unusual failure modes
  • Strategy: exploit Internet semantics
  • Optimistic, eventually consistent replication
  • Per-message, per-user-profile replication
  • Efficient during normal operation
  • Small window of inconsistency (a generic sketch follows below)
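One way to picture optimistic, per-message replication is a last-writer-wins update push; this generic sketch is my own illustration, not Porcupine's protocol.

    import time

    class Replica:
        def __init__(self):
            self.store = {}                      # msg_id -> (timestamp, value)

        def apply(self, msg_id, ts, value):
            # Accept an update only if it is newer than what we already have.
            cur = self.store.get(msg_id)
            if cur is None or ts > cur[0]:
                self.store[msg_id] = (ts, value)

    def replicate(msg_id, value, replicas):
        ts = time.time()
        for r in replicas:                       # pushed in the background;
            r.apply(msg_id, ts, value)           # a brief window of inconsistency is tolerated

    a, c = Replica(), Replica()
    replicate("bob/42", "Hi Bob", [a, c])
    print(a.store == c.store)                    # True once the update has reached both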

57
How Efficient is Replication?
[Graph: 68m messages/day without replication vs. 24m/day with replication]
58
How Efficient is Replication?
[Graph: same comparison with an additional configuration at 33m messages/day between the two]
59
Load balancing: deciding where to store messages
  • Goals:
  • Handle skewed workload well
  • Support hardware heterogeneity
  • Strategy: spread-based load balancing
  • Spread: soft limit on # of nodes per mailbox
  • Large spread → better load balance
  • Small spread → better affinity
  • Load balanced within the spread
  • Use # of pending I/O requests as the load measure (see the sketch below)
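A small sketch of spread-based balancing, assuming a hash-derived spread of two candidate nodes and a pending-I/O count per node; all names and numbers are invented.

    import hashlib

    NODES = ["A", "B", "C", "D", "E"]
    SPREAD = 2                                   # soft limit on nodes per mailbox

    pending_io = {"A": 12, "B": 3, "C": 7, "D": 1, "E": 9}   # current load measure

    def spread_for(user):
        """Deterministic set of candidate nodes for this user's mailbox."""
        h = int(hashlib.md5(user.encode()).hexdigest(), 16)
        return [NODES[(h + i) % len(NODES)] for i in range(SPREAD)]

    def pick_node(user):
        # Balance only within the spread: the affinity vs. load-balance trade-off.
        return min(spread_for(user), key=lambda n: pending_io[n])

    print(spread_for("bob"), pick_node("bob"))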

60
How Well does Porcupine Support Heterogeneous
Clusters?
[Graph: with one faster node added, throughput improves by 16.8m/day (25%) under dynamic load balancing vs. 0.5m/day (0.8%) under static partitioning]
61
Claims
  • Symmetric function distribution
  • Distribute user database and user mailbox
  • Lazy data management
  • Self-management
  • Automatic load balancing, membership management
  • Graceful Degradation
  • Cluster remains functional despite any number of
    failures

62
Retrospect
  • Questions
  • How does the system scale?
  • How costly is the failure recovery procedure?
  • Two scenarios tested
  • Steady state
  • Node failure
  • Does Porcupine scale?
  • Paper says yes
  • But in their work we can see a reconfiguration
    disruption when nodes fail or recover
  • With larger scale, frequency of such events will
    rise
  • And the cost is linear in system size
  • Very likely that on large clusters this overhead
    would become dominant!

63
Some Other Interesting Papers
  • "The Next Generation Internet: Unsafe at Any Speed?", Ken Birman
  • "Lessons from Giant-Scale Services", Eric Brewer, UCB