1. Team 7 supari
17-654: Analysis of Software Artifacts
18-846: Dependability Analysis of Middleware
- Arseniy Khobotkov
- Arunesh Gupta
- Dhananjay Khaitan
- Saurabh Sharma
2. Team Members
- Arunesh Gupta (arunesh_at_andrew.cmu.edu)
- Saurabh Sharma (saurabhs_at_andrew.cmu.edu)
- Arseniy Khobotkov (akhobotk_at_andrew.cmu.edu)
- Dhananjay Khaitan (dkhaitan_at_andrew.cmu.edu)
http://www.ece.cmu.edu/ece846/team7/index.html
3. Graphical User Interface
- Registration screen
  - New user
  - Old user
- Transaction screen
  - Updated stock feed
  - Buy / sell
  - History
4. Baseline Application
- Financial stock trading application
  - Clients can register accounts with a username/password
  - Clients receive constantly updated stock prices from an independent data feed
  - Clients can buy/sell stock at guaranteed prices
  - Clients can view their transaction history
- Innovative features
  - Reliability of the system guaranteed within 5 s
  - Price guaranteed at the moment the client decides to make a transaction
  - Added security: the Stock Feed server is not connected to the main system
  - Stock feed and transaction data are SHA-1 signed
  - The Stock Feed server generates random variation in stock prices
  - Encrypted stock feed to prevent tampering
- Technical design
  - Java-based components with CORBA, running on Unix/Linux
  - PostgreSQL database
  - Java front end
5. Security Against Malicious Users
- Every client has a socket connection to the Stock Feed server
  - Mimics the NASDAQ system
- Prices are signed by the Stock Feed server
  - Signed using SHA-1 with RSA
- The Stock Feed server attaches a 10-second expiration time to all data
  - Database clocks are synchronized with the Stock Feed clock
  - Further prevents stock-price tampering: an old price cannot be used to perform a transaction
- The server verifies the signature to authenticate the data
  - Reports an exception to the client if authentication fails
  - The client then locks further transaction attempts
- The database watches for time expiration
  - Returns an exception to the client if the time has expired
  - Allows the client to retry with new data
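The signed, expiring quote can be sketched with the JDK's standard SHA1withRSA signature. This is a minimal illustration of the idea, not the project's code; the class and field names (SignedQuote, EXPIRY_MS, the "symbol|price|timestamp" payload) are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

/** Sketch of the signed-quote idea: the Stock Feed server signs
 *  "symbol|price|issuedAt" with SHA1withRSA, and the receiver both verifies
 *  the signature and rejects quotes older than the 10-second expiration. */
public class SignedQuote {
    static final long EXPIRY_MS = 10_000;  // 10-second expiration from the design

    public static byte[] sign(KeyPair kp, String payload) throws Exception {
        Signature s = Signature.getInstance("SHA1withRSA");
        s.initSign(kp.getPrivate());
        s.update(payload.getBytes(StandardCharsets.UTF_8));
        return s.sign();
    }

    public static boolean verify(KeyPair kp, String payload, byte[] sig,
                                 long now, long issuedAt) throws Exception {
        if (now - issuedAt > EXPIRY_MS) return false;       // stale quote: client must retry
        Signature s = Signature.getInstance("SHA1withRSA");
        s.initVerify(kp.getPublic());
        s.update(payload.getBytes(StandardCharsets.UTF_8));
        return s.verify(sig);                               // a tampered payload fails here
    }

    public static void main(String[] args) throws Exception {
        KeyPair kp = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        long t0 = System.currentTimeMillis();
        String quote = "GOOG|101.25|" + t0;
        byte[] sig = sign(kp, quote);
        System.out.println(verify(kp, quote, sig, t0 + 1_000, t0));        // fresh, authentic: true
        System.out.println(verify(kp, quote + "9", sig, t0 + 1_000, t0));  // tampered price: false
        System.out.println(verify(kp, quote, sig, t0 + 15_000, t0));       // expired: false
    }
}
```

With clocks synchronized between database and Stock Feed, the expiry check and the signature check together rule out both replayed and altered prices.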
6. Baseline Architecture
7. Fault-Tolerance Goals: Passive Replication
- A system with n replicas tolerates the failure of n-1 replicas
- All state is stored in the database, keyed by a (transaction ID, user ID) tuple
- A transaction can simply be retried from the client in case of failure
  - Case 1 (failure on the route to the database): simple retry by the client
  - Case 2 (failure on the route back): the client retries, but the transaction ID prevents duplication
- Replicated components
  - The server is replicated three times, as it is the core component of the system
  - Replicas run on different unix4x machines
- Sacred (non-replicated) components
  - ORB daemon
  - Stock Feed server
  - Replication Manager
  - Database system
- Primary components
  - The Replication Manager kills faulty replicas and launches new ones
  - The fault detector is located in the client
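The "Case 2" duplicate suppression can be sketched as a server-side map keyed by transaction ID: a retry whose original reply was lost gets the recorded result back instead of executing the trade twice. Names (DedupServer, doTrade) are illustrative, not from the project.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of transaction-ID duplicate suppression: the trade runs at most
 *  once per ID, and a client retry with the same ID gets the stored result. */
public class DedupServer {
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    public String execute(String txnId, String order) {
        // computeIfAbsent executes the trade only for a previously unseen ID;
        // a retry with the same ID returns the recorded outcome.
        return completed.computeIfAbsent(txnId, id -> doTrade(order));
    }

    private String doTrade(String order) {
        return "EXECUTED:" + order;   // stand-in for the real database update
    }

    public static void main(String[] args) {
        DedupServer s = new DedupServer();
        String first = s.execute("txn-42", "BUY 10 GOOG");
        String retry = s.execute("txn-42", "BUY 10 GOOG");  // reply lost, client retried
        System.out.println(first.equals(retry));            // prints true: no duplicate trade
    }
}
```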
8. Fault Injector
- Runs on a separate thread, spawned by the Replication Manager
- Accesses Replication Manager state to ensure two replicas are always running
- Kills a random replica every 15 seconds
  - It takes some time for a dead server to be discovered and restarted
  - This yields a consistent pattern: two transactions run with all servers up before a server is killed
  - Provides consistent data for analysis
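The injector thread can be sketched with a scheduled task that removes a random replica from the live set, leaving recovery to the Replication Manager. Replica handles here are plain strings and the demo shortens the 15-second period so it terminates; all names are illustrative.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of the fault-injector thread: on a fixed schedule it picks a
 *  random live replica and kills it. */
public class FaultInjector {
    static final List<String> live =
            new CopyOnWriteArrayList<>(List.of("replica-1", "replica-2", "replica-3"));
    static final Random rng = new Random();

    static String killRandomReplica() {
        if (live.size() <= 1) return null;           // keep the system minimally alive
        String victim = live.remove(rng.nextInt(live.size()));
        System.out.println("killed " + victim);      // the RM later spawns a replacement
        return victim;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // 15-second period in the project; shortened here so the demo terminates quickly
        timer.scheduleAtFixedRate(FaultInjector::killRandomReplica, 100, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        timer.shutdown();
        System.out.println("replicas left: " + live.size());
    }
}
```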
9. FT-Baseline Architecture
10. Mechanisms for Fail-Over
- Choose an arbitrary replica; in case of failure, try another one
- The client detects faults as CORBA IDL exceptions
  - COMM_FAILURE: exception thrown by CORBA
  - Generic_Exception: non-application-specific errors
- On an exception, the client
  - Contacts the Replication Manager to deal with the faulty replica
  - Queries the ORBd Naming Service to locate a new working replica
- Replication Manager
  - Receives the indication of a failed replica from the client
  - Kills the faulty replica
  - Spawns a new replica on another machine
  - Maintains a minimum of two replicas, so that a working replica is always available when one fails
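The client-side fail-over loop can be sketched as "invoke one replica, and on a communication failure fall through to the next". In the project the failure is a CORBA COMM_FAILURE and the next replica comes from the ORBd Naming Service; here replicas are plain functions and a RuntimeException stands in for the CORBA exception.

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of client fail-over: try replicas in order, return the first reply. */
public class FailOverClient {
    public static String invoke(List<Function<String, String>> replicas, String request) {
        RuntimeException last = null;
        for (Function<String, String> replica : replicas) {
            try {
                return replica.apply(request);      // happy path: this replica answers
            } catch (RuntimeException e) {          // stands in for COMM_FAILURE
                last = e;                           // real client notifies the RM here
            }
        }
        throw last != null ? last : new IllegalStateException("no replicas");
    }

    public static void main(String[] args) {
        Function<String, String> dead  = req -> { throw new RuntimeException("COMM_FAILURE"); };
        Function<String, String> alive = req -> "OK:" + req;
        System.out.println(invoke(List.of(dead, alive), "BUY 10 GOOG"));  // prints OK:BUY 10 GOOG
    }
}
```

Combined with the transaction-ID duplicate suppression, this retry is safe even when the first replica executed the trade but its reply was lost.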
11. Fail-Over Measurements
- We expected a faulty case to take longer at higher loads; this was not the case
- CPU load (measured on a per-minute basis) was not correlated with the spikes
- Database transaction times and network latencies took up the bulk of the time
- Average FT execution: 289 ms
- Average fail-over execution: 458 ms
12. Fail-Over Measurements
- Transaction time dominates (database updates/queries)
  - Not much we can do about that
- Receive exception / get invocation / contact replica
  - These form the focus of our later optimizations
13. RT-FT-Baseline Architecture (Active Replication)
- The FT baseline already had three replicas running; now they are used actively
- The client spawns a thread to contact each of the replicas
- In case of a fault this removes
  - the time spent contacting the naming service to find the next active replica
  - the time taken to retry a transaction (two duplicate transactions are already in flight)
- The client gets the fastest active response
- The number of replicas is always maintained by the Replication Manager
- Our CORBA setup does not support multicasting, hence the need for three threads
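The fastest-response pattern above maps naturally onto `ExecutorService.invokeAny`, which submits one task per replica and returns the first successful result. This is a sketch with simulated replica latencies, not the project's CORBA threads; duplicate suppression at the database is assumed to absorb the slower replies.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch of the active-replication call path: contact all replicas in
 *  parallel and take whichever reply arrives first. */
public class ActiveClient {
    public static String fastestReply(ExecutorService pool) throws Exception {
        List<Callable<String>> calls = List.of(
                () -> { Thread.sleep(300); return "replica-1"; },
                () -> { Thread.sleep(50);  return "replica-2"; },   // fastest replica
                () -> { Thread.sleep(200); return "replica-3"; });
        return pool.invokeAny(calls);   // returns as soon as one call succeeds
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);   // one thread per replica
        System.out.println(fastestReply(pool));                   // prints replica-2
        pool.shutdown();
    }
}
```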
14. RT-FT-Baseline Architecture (Active Replication)
15. Bounded Real-Time Fail-Over Measurements
- Much more bounded behavior with active replication
- One Naming Service / Replication Manager spike
- 850 ms fail-over bound
- Average active-replication execution: 197 ms (a 31.8% improvement)
- Average active-replication fault execution time: 382 ms (a 16.6% improvement)
16. RT-FT-Performance Strategy
- Active replication with caching of responses at the database
- Database time is the main bottleneck in the system
- When multiple threads reach the database with the same transaction, they now get a cached copy of the response rather than going through the database again
- Previously, we'd touch the database for every thread, which was very inefficient
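The response cache can be sketched as a concurrent map in front of the database, keyed by transaction ID: only the first of the three replica threads pays for the query, and a counter makes the saved database trips visible. Names (ResponseCache, queryDatabase) are illustrative, not from the project.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of response caching at the database: repeated lookups for the
 *  same transaction hit the cache instead of re-running the query. */
public class ResponseCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger dbHits = new AtomicInteger();

    public String lookup(String txnId) {
        return cache.computeIfAbsent(txnId, this::queryDatabase);
    }

    private String queryDatabase(String txnId) {
        dbHits.incrementAndGet();          // the expensive part we want to avoid
        return "RESULT:" + txnId;
    }

    public static void main(String[] args) {
        ResponseCache c = new ResponseCache();
        for (int thread = 0; thread < 3; thread++) {   // three active-replication threads
            c.lookup("txn-7");
        }
        System.out.println(c.dbHits.get());   // prints 1: one database trip instead of three
    }
}
```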
17. RT-FT Performance Architecture
18. Performance Measurements
- Caching results in significantly reduced times
- Fault recovery and execution time average: 278 ms (a 27.2% improvement over the FT baseline)
- Transaction time average: 168 ms (a 14.7% improvement over un-cached times)
19. Active Replication vs. Passive Replication
- Active replication is far more relevant in a stock-trading context
  - Seconds can be the difference between millions of dollars
  - When we started this project, we knew we'd have to go this route
  - There is no point in building a stock-trading system with high response times and irregular, non-transparent faulty behavior
- Measurements become far more difficult with active replication
  - Three threads are running
  - Timing the behavior of each thread individually is not possible
  - Threads pre-empt each other at undefined intervals, dependent on the JVM
20. Other Features
- Just some things we decided to throw into the project
  - Java socket communication
  - Hashed caching
  - SHA-1 data signatures
- What lessons did you learn from this additional effort?
  - These are all cool features
  - But when you're trying to make the basic system work, they are a huge distraction
  - Get the base working, then move on to the complex
21. Open Issues
- Issues for resolution
  - Truly persistent active threads
    - Currently we kill and re-invoke threads
    - Inefficient; five continually running threads (a thread pool) would be much better
- Additional features
  - Separate the client and the fault detector
  - Heartbeat system for even faster fault detection
  - Cache write-back policy
    - Save database writes for periods of low activity
    - Would greatly reduce database access time
  - Multicast
    - Get rid of the thread-based implementation
    - Reduce thread overhead and complexity
    - Have all three requests sent in parallel
  - Load-reduction system
    - Clients send each request to only two replicas per transaction to distribute load
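The proposed heartbeat fault detection could look like the sketch below: each replica periodically reports a heartbeat, and a replica is suspected faulty once its last heartbeat is older than a timeout, instead of waiting for a client call to fail. The timeout value and all names are assumptions, since this feature was not implemented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a heartbeat-based fault detector for the open-issues list. */
public class HeartbeatDetector {
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();
    private final long timeoutMs;

    public HeartbeatDetector(long timeoutMs) { this.timeoutMs = timeoutMs; }

    /** Called by (or on behalf of) a replica each time it checks in. */
    public void beat(String replica, long nowMs) { lastBeat.put(replica, nowMs); }

    /** A replica is suspected faulty once it misses its heartbeat window. */
    public boolean isFaulty(String replica, long nowMs) {
        Long last = lastBeat.get(replica);
        return last == null || nowMs - last > timeoutMs;
    }

    public static void main(String[] args) {
        HeartbeatDetector d = new HeartbeatDetector(2_000);   // 2 s window (illustrative)
        d.beat("replica-1", 0);
        System.out.println(d.isFaulty("replica-1", 1_000));   // prints false
        System.out.println(d.isFaulty("replica-1", 5_000));   // prints true
    }
}
```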
22. Insights from Measurements
- What insights did you gain from the three sets of measurements, and from analyzing the data?
- Baseline FT (passive replication)
  - Inefficient recovery from faults
  - Contacting the Naming Service on each fault was a major bottleneck
  - This led us to conclude that active replication was required for real-time behavior
- Baseline FT-RT (active replication)
  - Recovery times were significantly lowered
  - However, transaction time was now the overwhelming bottleneck
  - Performance enhancements were therefore based on reducing transaction costs
- Baseline FT-RT performance (active replication with caching)
  - Caching meant we no longer had to contact the database from every thread
  - Allows much faster thread performance
  - Greatly improved transaction times
23. Conclusions
- What did you learn?
  - Building up and improving on a baseline program is very rewarding; it was interesting to see the progression
  - The caveats of replication and duplicate suppression
  - Reliable distributed systems are difficult to design and code; the difficulty lies not in the application but in the reliability machinery
- What did you accomplish?
  - We are the only team to have done active replication
  - Digitally signed and encrypted communications
  - A dynamic system with many components that works reliably
- What would you do differently, if you could start the project from scratch now?
  - Add state in the replicas; it is much harder, but also much more relevant
  - Modularize the project better
  - Define interfaces more clearly