Transcript and Presenter's Notes

Title: Arseniy Khobotkov


1
Team 7 (supari)
17-654 Analysis of Software Artifacts
18-846 Dependability Analysis of Middleware
  • Arseniy Khobotkov
  • Arunesh Gupta
  • Dhananjay Khaitan
  • Saurabh Sharma

2
Team Members
Arunesh Gupta arunesh@andrew.cmu.edu
Saurabh Sharma saurabhs@andrew.cmu.edu
Arseniy Khobotkov akhobotk@andrew.cmu.edu
Dhananjay Khaitan dkhaitan@andrew.cmu.edu
http://www.ece.cmu.edu/ece846/team7/index.html
3
Graphical User Interface
  • Registration Screen
  • New User
  • Old User
  • Transaction Screen
  • Updated stock feed
  • Buy / Sell
  • History

4
Baseline Application
  • Financial Stock Trading Application
  • Clients can register accounts with
    username/password
  • Clients receive constantly updated stock prices
    from independent data feed
  • Clients can buy/sell stock at guaranteed prices
  • Clients can view transaction history
  • Innovative Features
  • Guarantee reliability of system within 5s
  • Guarantee price when client decides to make a
    transaction
  • Advanced security as Stock Feed server is not
    connected to main system
  • Stock feed and transaction system are SHA1 signed
  • Stock Feed server generates random variation in
    stock prices
  • Encrypted stock feed to prevent tampering
  • Technical Design Information
  • Java-based components with CORBA running on
    Unix/Linux
  • PostgreSQL database
  • Java front end

5
Security Against Malicious Users
  • Every client has a socket connection to the Stock
    Feed server
  • Mimics NASDAQ system
  • The prices are signed by the Stock Feed server
  • Signing uses SHA1 with RSA (see the sketch below)
  • Stock Feed server adds an expiration time of 10
    seconds to all data
  • Clocks on database are synchronized with Stock
    Feed clock
  • Further prevents stock price tampering
  • No possibility of using an old price to perform a
    transaction
  • The server verifies the signature to authenticate
    the price data
  • Reports an exception to the client if authentication
    fails
  • Client locks further transaction attempts
  • Database watches for time expiration
  • Returns exception to client if time expired
  • Allows client to retry with new data
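
The sketch below illustrates the kind of SHA1-with-RSA signing and verification described on this slide: the Stock Feed server signs each price quote (including its expiration time) with its private key, and the trading server verifies the signature before accepting a transaction. The quote format and class name are assumptions for illustration, not the project's actual code.

    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class QuoteSigner {
        public static void main(String[] args) throws Exception {
            // Key pair held by the Stock Feed server (illustrative only).
            KeyPair keys = KeyPairGenerator.getInstance("RSA").generateKeyPair();

            // Stock Feed side: sign the quote together with its expiration time.
            byte[] quote = "MSFT,27.35,expires=1038861610000".getBytes("UTF-8");
            Signature signer = Signature.getInstance("SHA1withRSA");
            signer.initSign(keys.getPrivate());
            signer.update(quote);
            byte[] signature = signer.sign();

            // Trading-server side: verify the signature before accepting the price.
            Signature verifier = Signature.getInstance("SHA1withRSA");
            verifier.initVerify(keys.getPublic());
            verifier.update(quote);
            System.out.println("Quote authentic: " + verifier.verify(signature));
        }
    }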

6
Baseline Architecture
7
Fault-Tolerance Goals: Passive Replication
  • System with n replicas tolerates the failure of
    n-1 replicas
  • All state is stored in the database with a
    transaction ID / user ID tuple
  • Transaction can just be retried from the client
    in case of failure
  • Case 1: Failure on the route to the database. Simple
    retry by the client
  • Case 2: Failure on the route back. The client retries,
    but the transaction ID prevents duplication (see the
    sketch below)
  • Replicated Components
  • The server is replicated three times, as it is the
    core component of the system
  • The replicas run on different unix4x machines
  • Sacred Components
  • ORB daemon
  • Stock Server
  • Replication Manager
  • Database Systems
  • Primary components
  • Replication Manager kills faulty replicas and
    launches new ones
  • Fault Detector located in client
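
A minimal sketch of how the transaction ID / user ID tuple can suppress duplicates on a client retry (Case 2 above), assuming the PostgreSQL table has a primary key over those two columns. The table, column names, and surrounding method are illustrative, not the project's actual schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TradeStore {
        // Records a trade exactly once; a retried request with the same
        // (userId, txnId) hits the primary key and is silently ignored.
        public void record(Connection db, String userId, String txnId,
                           String symbol, int quantity) throws SQLException {
            String sql = "INSERT INTO trades (user_id, txn_id, symbol, quantity) "
                       + "VALUES (?, ?, ?, ?)";
            try (PreparedStatement stmt = db.prepareStatement(sql)) {
                stmt.setString(1, userId);
                stmt.setString(2, txnId);
                stmt.setString(3, symbol);
                stmt.setInt(4, quantity);
                stmt.executeUpdate();
            } catch (SQLException e) {
                // 23505 is PostgreSQL's unique_violation: the trade was already
                // recorded by an earlier attempt, so the retry is safe to ignore.
                if (!"23505".equals(e.getSQLState())) {
                    throw e;
                }
            }
        }
    }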

8
Fault Injector
  • Runs on a separate thread of the Replication
    Manager
  • Fault Injector Thread spawned by the Replication
    Manager
  • Accesses Replication Manager state to ensure 2
    replicas are always running
  • Kills a random replica every 15 seconds (see the
    sketch below)
  • It takes some time for a dead server to be
    discovered and restarted
  • This makes for a consistent pattern of two
    transactions with all the servers up before a server
    is killed
  • Provides for consistent analysis of data
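
A rough sketch of the fault-injector thread described above, assuming the Replication Manager exposes the list of running replicas and a kill operation. The interface and method names are made up for illustration.

    import java.util.List;
    import java.util.Random;

    // Assumed view of the Replication Manager's state (illustration only).
    interface ReplicaControl {
        List<String> runningReplicas();
        void killReplica(String replicaId);
    }

    public class FaultInjector extends Thread {
        private final ReplicaControl manager;
        private final Random random = new Random();

        public FaultInjector(ReplicaControl manager) {
            this.manager = manager;
        }

        @Override
        public void run() {
            while (!isInterrupted()) {
                try {
                    Thread.sleep(15000);                 // one kill per 15-second cycle
                    List<String> up = manager.runningReplicas();
                    if (up.size() >= 2) {                // only inject when 2+ replicas are up
                        String victim = up.get(random.nextInt(up.size()));
                        manager.killReplica(victim);
                    }
                } catch (InterruptedException e) {
                    return;                              // Replication Manager is shutting down
                }
            }
        }
    }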

9
FT-Baseline Architecture
10
Mechanisms for Fail-Over
  • Choose an arbitrary replica and, in case of failure,
    try another one
  • The client detects faults as CORBA IDL exceptions
  • COMM_FAILURE: system exception thrown by CORBA
  • Generic_Exception: non-application-specific errors
  • On an exception (see the sketch below):
  • Contact Replication Manager to deal with faulty
    replica
  • ORBd Naming Service to locate a new working
    replica
  • Replication Manager
  • Receives indication of failed replica from client
  • Kills faulty replica
  • Spawns a new replica on another machine
  • A minimum of two replicas is maintained
  • The Replication Manager ensures that we always have
    two working replicas in case one fails
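
A sketch of the client-side fail-over path just described. COMM_FAILURE is the real CORBA system exception; TradingServer, TradingServerHelper, ReplicationManager, and the naming entry are stand-ins for the project's IDL-generated interfaces, not taken from the actual code.

    import org.omg.CORBA.COMM_FAILURE;
    import org.omg.CosNaming.NamingContextExt;

    // Stand-ins for the IDL-generated stub, helper, and manager interface.
    interface TradingServer {
        String buy(String user, String txnId, String symbol, int qty);
    }
    interface ReplicationManager {
        void reportFault(TradingServer faulty);
    }
    class TradingServerHelper {
        static TradingServer narrow(org.omg.CORBA.Object ref) {
            return (TradingServer) ref;   // the generated helper performs a proper narrow
        }
    }

    public class FailoverClient {
        private TradingServer server;        // current replica
        private NamingContextExt naming;     // ORBd Naming Service
        private ReplicationManager manager;

        String buyWithFailover(String user, String txnId, String symbol, int qty)
                throws Exception {
            try {
                return server.buy(user, txnId, symbol, qty);     // normal path
            } catch (COMM_FAILURE e) {
                manager.reportFault(server);                     // faulty replica is killed and replaced
                org.omg.CORBA.Object ref = naming.resolve_str("TradingServer");
                server = TradingServerHelper.narrow(ref);        // rebind to a working replica
                return server.buy(user, txnId, symbol, qty);     // same txnId, so no duplicate trade
            }
        }
    }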

11
Fail-Over Measurements
  • We expected that a faulty case would take longer
    at higher loads
  • This was not the case
  • CPU load (measured on a per-minute basis) did not
    correlate with the spikes
  • Database transaction times and network latencies
    accounted for the bulk of the execution time
  • Average FT execution: 289 ms
  • Average fail-over execution: 458 ms

12
Fail-Over Measurements
  • Transaction Time Dominates (Database Updates /
    Queries)
  • Not much we can do about that
  • Receive Exception / Get Invocation / Contact
    Replica
  • These form the focus of our later optimizations.

13
RT-FT-Baseline Architecture (Active Replication)
  • The FT-Baseline already had three replicas; they now
    run actively
  • Spawn a thread to contact each of the replicas (see
    the sketch below)
  • In case of a fault, this removes:
  • The time spent contacting the naming service to find
    the next active replica
  • The time taken to retry a transaction, since two
    duplicate transactions are already in-flight
  • We get the fastest active response to the client
  • Number of replicas always maintained by the
    Replication Manager
  • CORBA does not support multicasting, thus the
    need for three threads.
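
A sketch of the three-thread call pattern described above, using a thread pool and invokeAny() so the fastest successful replica response is returned to the client. The TradingServer interface and buy() signature are assumptions standing in for the IDL-generated stubs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ActiveReplicationClient {
        // Stand-in for the IDL-generated stub (illustration only).
        public interface TradingServer {
            String buy(String user, String txnId, String symbol, int qty) throws Exception;
        }

        public static String callAll(List<TradingServer> replicas, String user,
                                     String txnId, String symbol, int qty)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
            try {
                List<Callable<String>> calls = new ArrayList<>();
                for (TradingServer replica : replicas) {
                    calls.add(() -> replica.buy(user, txnId, symbol, qty));
                }
                // invokeAny returns the first successful result, so the client
                // sees the fastest replica; the other in-flight duplicates are
                // absorbed by duplicate suppression / caching at the database.
                return pool.invokeAny(calls);
            } finally {
                pool.shutdownNow();
            }
        }
    }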

14
RT-FT-Baseline Architecture (Active Replication)
15
Bounded Real-Time Fail-Over Measurements
  • Much more bounded behavior with Active
    Replication
  • One Naming Service / Replication Manager spike
  • 850 ms fail-over bound
  • Average Active Replication execution: 197 ms, a 31.8%
    improvement
  • Average Active Replication fault execution time:
    382 ms, a 16.6% improvement

16
RT-FT-Performance Strategy
  • Active Replication with caching responses at the
    Database
  • Database time is the main bottleneck in the
    system
  • When multiple threads reach the database with the
    same transaction, they now get a cached copy of the
    response rather than going through the database again
    (see the sketch below)
  • Previously, we'd touch the database for every thread
  • VERY inefficient
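
A minimal sketch of the caching idea described above, assuming responses are keyed by transaction ID so the second and third active-replication threads reuse the first thread's result instead of touching PostgreSQL again. The class and the runTransaction() placeholder are illustrative.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class ResponseCache {
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

        // computeIfAbsent runs the database transaction at most once per
        // transaction ID; later threads carrying the same ID get the cached copy.
        public String execute(String txnId, String order) {
            return cache.computeIfAbsent(txnId, id -> runTransaction(order));
        }

        private String runTransaction(String order) {
            return "FILLED " + order;   // placeholder for the real PostgreSQL update
        }
    }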

17
RT-FT Performance Architecture
18
Performance Measurements
  • Caching results in significantly reduced times
  • Fault recovery and execution time average: 278 ms
  • A 27.2% improvement from the FT baseline
  • Transaction time average: 168 ms
  • A 14.7% improvement over un-cached times

19
Active Replication vs. Passive Replication
  • Active replication is far more relevant in a
    stock trading context
  • Seconds can be the difference between millions of
    dollars
  • When we started this project, we knew we'd have
    to go this route
  • No point in making a stock trading system with
    high response times and irregular,
    non-transparent faulty behavior
  • Measurements become far more difficult in Active
    Replication
  • Three threads running
  • Timing the behavior of each thread is not
    possible
  • Threads pre-empt each other at undefined
    intervals dependent on the JVM

20
Other Features
  • Just some things we decided to throw in to the
    project
  • Java Socket communication
  • Hashed Caching
  • SHA1 data signatures
  • What lessons did you learn from this additional
    effort?
  • These are all cool features.
  • But when you're trying to make the basic system
    work, they are a huge distraction.
  • Get the base working, then move onto the complex.

21
Open Issues
  • Issues for resolution
  • Truly persistent active threads
  • Currently we kill and re-invoke threads
  • Inefficient; five continually running threads would
    be much better
  • A thread pool.
  • Additional Features
  • Separate client and fault detector
  • Heartbeat system for even faster fault detection
  • Cache write-back policy
  • Save database writes for periods of low activity
  • Greatly reduce database access time
  • Multicast!
  • Get rid of the thread-based implementation
  • Reduce thread overhead and complexities
  • Have all three requests sent in parallel
  • Load Reduction System
  • Clients send request to only 2 replicas per
    transaction to distribute load

22
Insights from Measurements
  • What insights did you gain from the three sets of
    measurements, and from analyzing the data?
  • Baseline FT (Passive Replication)
  • Inefficient recovery from faults.
  • Contacting the Naming Service at each fault was a
    major bottleneck.
  • This led us to believe active replication had to
    be done for RT behavior.
  • Baseline FT-RT (Active Replication)
  • Times for recovery were significantly lowered.
  • However, time for transaction was now an
    overwhelming bottleneck.
  • Performance enhancements were thus based on methods
    to reduce transaction costs
  • Baseline FT-RT Performance (Active Replication
    with Caching)
  • Caching meant we no longer had to contact the
    database from every thread
  • Allows much faster thread performance
  • Greatly improved transaction times.

23
Conclusions
  • What did you learn?
  • Building up and improving on a baseline program
    is very cool.
  • Interesting to see the progression.
  • Caveats of replication and duplicate suppression
  • Reliable distributed systems are difficult to
    design and code
  • The difficulty is not in the application but the
    reliability systems
  • What did you accomplish?
  • We are the only ones to have done active
    replication
  • Digitally signed and encrypted communications
  • A dynamic system with so many components works
    reliably
  • What would you do differently, if you could start
    the project from scratch now?
  • Add state in the replicas. It's much harder, but
    also much more relevant
  • Modularized the project better
  • Better defined interfaces