Title: Building Fault-Tolerant Enterprise Applications
1Building Fault-Tolerant Enterprise Applications
- Greg Hinkle
- Chariot Solutions
- chariotsolutions.com
Adapted from original presentation by Erin Mulder and Brian McCallister
2Agenda
- Goals of Fault Tolerance
- User Recoverable Errors
- Expected Application Errors
- System Failure
- Useful Strategies
- Discussion
3Goals of Fault Tolerance
What are we really worried about?
- Availability
- Integrity
- Confidentiality
- Usability
- Cost
4Goals of Fault Tolerance
What can go wrong?
- User Error
- Concurrent Changes
- Bugs
- Resource Failure/Downtime
- System Overload
- Misconfiguration
- Sabotage
5Goals of Fault Tolerance
Themes we'll keep visiting
- Prevention
- Code Guidelines and Reviews
- Automated Validation and Regression Testing
- Performance / Stress Testing
- Negative / Security Testing
- Detection
- Logging and Auditing
- Validation Patterns
- Monitoring
- Recovery
- Exception handling patterns
- Error feedback loop
- Redundancy
7User Recoverable Errors
Simple validation error
- What do you do when the user
- Leaves a required field blank
- Enters a value too big for the database field
- Types letters in a numeric field
- Selects inconsistent options
- Tries to do things in the wrong order
8User Recoverable Errors
Simple validation error
- Fault tolerance is more than detection
- Prevent the user from making errors
- Set maxlengths on input fields
- Use character masks
- Specify units
- Show example input
- Don't allow the selection of inconsistent options
- Don't present navigation options that aren't meant to be followed
- Guide the user through longer processes
9User Recoverable Errors
Simple validation error
- Help the user recover quickly
- Highlight all errors clearly
- Show help text and examples for invalid fields
- If some other action is required first, launch it instead of interrupting the flow with frustrating errors
- Perception is everything!
- Log the error for later analysis
- Save enough information to recreate
- Start automatically handling common mistakes
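One way to highlight all errors at once is to collect every field problem in a single pass instead of stopping at the first. The following is a minimal sketch, not a prescribed implementation: the class name `OrderFormValidator`, the field names, and the 40-character limit are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical validator: gathers every field error so the page can show them together.
final class OrderFormValidator {
    // Assumed to mirror the database column size; not taken from the slides.
    static final int MAX_NAME_LENGTH = 40;

    /** Returns field name -> message with an example; an empty map means the form is valid. */
    static Map<String, String> validate(String name, String quantity) {
        Map<String, String> errors = new LinkedHashMap<>();
        if (name == null || name.trim().isEmpty()) {
            errors.put("name", "Name is required, e.g. \"Jane Smith\"");
        } else if (name.length() > MAX_NAME_LENGTH) {
            errors.put("name", "Name must be at most " + MAX_NAME_LENGTH + " characters");
        }
        if (quantity == null || !quantity.matches("\\d{1,4}")) {
            errors.put("quantity", "Quantity must be a whole number from 0 to 9999, e.g. 12");
        }
        return errors;
    }
}
```

The returned map keeps insertion order, so messages can be listed at the top of the screen in the same order as the fields they refer to.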
10User Recoverable Errors
Optimistic concurrency clash
- Everything looks good until the save
- Then
- Item has just gone out of stock
- Another user has just updated the same document
- Time has passed and action is no longer allowed
11User Recoverable Errors
Optimistic concurrency clash
- Increase save points
- Alert user to potential risk
- Low stock
- Another user just accessed this record
- Another user has soft lock on record
- Offer useful options for resolving collision
- Merge changes
- Backorder
- Automatically retry later
- Email me when it is available
- Give tips for avoiding future collisions
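Detecting the clash in the first place usually relies on a version check at save time. The sketch below illustrates the idea with an in-memory map; `VersionedStore` and `Doc` are hypothetical names, and a database-backed application would more likely compare a version column in the `UPDATE` statement.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory store illustrating an optimistic version check.
final class VersionedStore {
    static final class Doc {
        final int version;
        final String body;
        Doc(int version, String body) {
            this.version = version;
            this.body = body;
        }
    }

    private final ConcurrentMap<String, Doc> docs = new ConcurrentHashMap<>();

    void create(String id, String body) {
        docs.put(id, new Doc(1, body));
    }

    Doc load(String id) {
        return docs.get(id);
    }

    /** Returns true if saved; false if another user changed the document since it was loaded. */
    boolean save(String id, Doc loaded, String newBody) {
        Doc next = new Doc(loaded.version + 1, newBody);
        // replace() succeeds only while the stored value is still the object we loaded
        return docs.replace(id, loaded, next);
    }
}
```

When `save` returns false, the caller can reload the current version and offer the options listed above (merge, retry later, and so on) instead of showing a bare error.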
12User Recoverable Errors
Bookmarks, back buttons and browsers
- User escapes normal page flow
- Bookmarks login page or internal page
- Uses back button
- Opens a new window within same session
- Session times out
- Missing context from previous requests
- Next click is like bookmark to internal page
- Other browser oddities
- Double-clicking submit buttons
- Pressing stop button in the middle of a request
13User Recoverable Errors
Bookmarks, back buttons and sessions
- Prevention is difficult: the user is in control
- JavaScript can sometimes help
- JavaScript can sometimes hurt
- Plan for and test each of these scenarios
- Plan for handling out-of-sequence requests
- Limit state, or key it uniquely
14User Recoverable Errors
Bookmarks, back buttons and sessions
- To seamlessly handle session timeouts and out-of-sequence requests, consider
- Persistent sessions (saved to database)
- Passing state in every request (form fields or URL rewriting)
- Storing state in custom cookies
- Adding custom logic to recover from timed-out sequences
- Resubmit requests after re-authentication
- To simply detect and alert, consider
- Using listener to catch session expiration
- Using state validation to catch out-of-sequence requests
- Redirecting user to session expiration page
- To improve process
- Log session losses (requests within expired session)
- Consider increasing session timeout
- Consider using prevention techniques described above
15User Recoverable Errors
Bookmarks, back buttons and sessions
- To minimize impact of back button, consider
- Techniques described for out-of-sequence requests
- Redirecting to GETs instead of returning responses to POSTs
- To work around double submissions, consider
- Utilize unique transaction identifiers stored in session
- Forward action submissions to separate response pages
- Response pages automatically display on double submit
- To handle multiple windows, consider
- Passing state in every request
- Pass state in hidden fields throughout a wizard
- Adapting web frameworks to map state (e.g. Struts form beans) by primary key or request ID instead of a static name
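The unique transaction identifier approach to double submissions can be sketched as a one-time token: issue it when the form is rendered, and accept a submission only if the token is still outstanding. `SubmitTokens` is a hypothetical stand-in for per-session storage; a real web application would probably keep the set in the HTTP session.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-session token registry for detecting double submits.
final class SubmitTokens {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    /** Call when rendering the form; embed the result in a hidden field. */
    String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    /** Returns true only for the first submission carrying a token that was issued here. */
    boolean consume(String token) {
        return token != null && outstanding.remove(token);
    }
}
```

A second click on the submit button carries the same token, so `consume` returns false and the application can redisplay the response page rather than repeat the action.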
17Expected Application Errors
Resource is unavailable
- Database is down for maintenance
- No connection to integrated partner service
- Resource is overloaded
- Out of DB connections
- JMS Queue full
18Expected Application Errors
Resource is unavailable
- To prevent, consider
- Coordinating maintenance schedules
- Planning for failover at the resource level
- Increasing hardware budget?
- Increasing transaction timeout seconds (caution: last resort)
- To handle, analyze transactional requirements
- Is immediate user response necessary?
- Can the resource access be handled asynchronously with an extended, logical transaction?
- Plan rollbacks carefully to allow for retries (consider idempotence, sub-transactions)
- Alert operator/admin if out of SLA
- Log all outages (study for patterns)
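If the operation is idempotent, a bounded retry with a growing delay is one way to ride out a short outage. This is a sketch under that assumption; the helper name `Retry.withRetries` and its parameters are hypothetical, and the doubling delay is just one common choice.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper; only safe for actions that are idempotent.
final class Retry {
    static <T> T withRetries(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure so the final one can be rethrown
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // back off so a struggling resource is not hammered
                }
            }
        }
        throw last;
    }
}
```

The last failure is rethrown once attempts run out, so the caller can still log the outage and alert an operator.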
19Expected Application Errors
Application is overloaded
- Mentioned on CNBC
- Linked from Slashdot
- Denial of Service
20Expected Application Errors
Application is overloaded
- Test under heavy load
- Plan for growth
- Tune hot spots
- Run with excess capacity
- Throttle at network level
- Use JMS and other asynchronous technologies to throttle on backend
- Tune application server to degrade gracefully
- Monitor carefully
- Be prepared to scale out, not just up
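Throttling inside the application can be as simple as capping how many requests run a costly operation at once and answering the rest with a quick "busy" response. The sketch below uses a semaphore for that; `Throttle` and its fallback parameter are hypothetical names for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical throttle: at most maxConcurrent callers run the work at once.
final class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the work if a permit is free; otherwise returns the fallback immediately. */
    <T> T run(Supplier<T> work, T busyFallback) {
        if (!permits.tryAcquire()) {
            return busyFallback; // shed load instead of queueing without bound
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always give the permit back, even on failure
        }
    }
}
```

Returning a fallback rather than blocking is one way to degrade gracefully: users see a short "try again" message instead of a hung page.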
21Expected Application Errors
Bugs and other undocumented features
- Friendly bug
- Triggers invalid state
- Causes VM or app server to throw exception
- Greedy bug
- Monopolizes resources
- Leaks connections
- Silent and deadly bug
- Corrupts data
22Expected Application Errors
Bugs and other undocumented features
- To handle friendly bugs
- Bulletproof your transaction rollbacks
- Write coding and design guidelines
- Conduct peer code reviews (share best practices)
- For client applications, catch Throwable
- Map exception handling in server container
- The finally clause is your friend
- Display sanitized errors to user
- Give enough information to map back to logs
- Log carefully to allow easy debugging
- Configure timestamp, thread id output
- Log data together not individually
- Alert operator/administrator
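Several of these points fit together in one top-level handler: clean up in `finally`, log the full detail, and show the user only a sanitized message with an ID that maps back to the log. This is a minimal sketch; `SafeHandler` is a hypothetical name, and it catches `Exception` rather than `Throwable`, which is a more conservative choice than the slide suggests for client applications.

```java
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical top-level handler: sanitized message out, full detail in the log.
final class SafeHandler {
    private static final Logger LOG = Logger.getLogger(SafeHandler.class.getName());

    static String handle(AutoCloseable resource, Callable<String> work) {
        try {
            return work.call();
        } catch (Exception e) {
            // The ID lets support staff find the matching log entry without exposing internals.
            String errorId = UUID.randomUUID().toString();
            LOG.log(Level.SEVERE, "errorId=" + errorId + " request failed", e);
            return "Sorry, something went wrong. Reference: " + errorId;
        } finally {
            // Runs on success and on failure, so the resource is never leaked.
            try {
                resource.close();
            } catch (Exception e) {
                LOG.log(Level.WARNING, "failed to release resource", e);
            }
        }
    }
}
```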
23Expected Application Errors
Bugs and other undocumented features
- To handle greedy bugs
- Reduce transaction timeout seconds
- Handle timeouts in the same way as friendly bugs
- Monitor carefully
- Log statistics (number of transaction timeouts, CPU usage, memory usage, GC, network traffic, stuck threads)
- Automate log analysis
- Trigger a thread dump (kill -3) during hot spots
- Alert operator/administrator to hot spots
- Use clustering to contain damage
24Expected Application Errors
Bugs and other undocumented features
- To handle silent and deadly bugs
- Bulletproof transaction settings
- Validate on multiple levels, use referential integrity
- Audit everything
- Unless performance/cost prohibits, keep a complete audit trail on every table (easy with triggers, aspects or code generators), try to include transaction ID
- Flush caches regularly
- After a save, load the record from the database and display back to the user
- Run periodic audits with human review
- Plan for how to use audit trail to recover from data corruption
- Early detection is key: escalate user concerns!
26System Failure
Never have an unplanned outage
- Determine acceptable downtime
- Plan clustering / failover accordingly
- Monitor carefully so outages are detected immediately
- Be ready with a tiny planned outage page and server in advance
- Consider offsite host
- Build this functionality into non-Web clients at development time
- Plan for transaction recovery
- Plan for JMS recovery
- Use quiescing load balancing to bring servers offline for maintenance
27System Failure
Sabotage
- Encrypt data in database
- Security through obscurity
- Key entry on startup
- Credit cards should be two-way encrypted (resist the urge to ROT13)
- Passwords should be one-way hashed
- Create new temporary passwords for forgotten passwords
- SQL Injection Prevention
- Don't dynamically generate SQL with user input
- Use prepared statements
- Cross-site scripting
- Cleanse any user data republished on a site
- Don't publish extra information
- Turn off server headers, require SSL on login or throughout
- Create a DMZ
- Two firewalls
- Use SSL between tiers
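For the one-way password hashing point, a salted, deliberately slow hash is the usual approach. The sketch below uses PBKDF2 from the standard Java crypto API. The class name `PasswordHasher`, the iteration count, and the key size are assumptions made for illustration, not values from the presentation.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical helper showing salted one-way password hashing with PBKDF2.
final class PasswordHasher {
    static final int ITERATIONS = 100_000; // assumed work factor; tune for your hardware
    static final int KEY_BITS = 256;

    /** Generates a fresh random salt; store it alongside the hash. */
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    static byte[] hash(char[] password, byte[] salt) throws GeneralSecurityException {
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } finally {
            spec.clearPassword(); // wipe the spec's internal copy of the password
        }
    }

    /** Re-hashes the attempt with the stored salt and compares the results. */
    static boolean verify(char[] attempt, byte[] salt, byte[] expected)
            throws GeneralSecurityException {
        return MessageDigest.isEqual(hash(attempt, salt), expected);
    }
}
```

Because the hash cannot be reversed, a forgotten password cannot be mailed back to the user, which is why issuing a new temporary password is the recovery path.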
29Useful Strategies
Be sure that you develop guidelines for
- Error Messages
- Validation (format, business rules, size, cleansing)
- Logging (when, where, what)
- Auditing
- Monitoring (level of automation, alerts)
- Transactions (who rolls back, checked vs. unchecked)
- Sessions and Caching (request vs. session, flushing)
- Clustering
30Useful Strategies
Error Messages
- For validation errors, be sure to
- Include format and size hints
- Show examples
- Give more information than the basic field label
- Mention the error at the top of the screen and highlight the field
- Catch all errors at the same time
- For other user-recoverable errors
- Let the user know what to do next
- If the user can't recover
- Apologize
- Give no details
- Suggest workarounds
- (Silently log and alert!)
31Useful Strategies
Validation
- If possible, validate at all levels
- Common strategies
- Externalize validation rules and use a framework that supports rich validation
- Clearly define which layers are responsible for which types of validation. For example
- All format errors handled in web tier
- All business rule violations handled in application tier
- All field lengths enforced at data tier
32Useful Strategies
Logging
- Log in all tiers
- Define logging levels and when they are used
- Log user failures at different levels than system failures
- Include timestamp, user, thread ID, transaction ID, etc.
- Don't make logs a source of failure (watch disk space, JMS load, etc.)
- Log information in a single call
- Aggregate server logs
- Socket appender
- Scripts and mounting
Bad:
log.trace("Searching " + keyword);
log.trace("Found " + results.size());
Good:
log.trace("Searching " + keyword + " Found " + results.size());
33Useful Strategies
Auditing
- Audit operations where possible
- Provides accountability
- Easier to support users
- Easier to debug
- Easier to recover from disaster
- Easier to detect attacks
- Include
- Timestamp
- Current User
- Some sort of thread ID, transaction ID, etc.
- Complete data record or diff
34Useful Strategies
Monitoring
- Common strategies include
- 24/7 operations center
- Business hours operation center
- Automated, redundant processes that analyze logs and raise alerts to on-call administrators
- SNMP and monitors
- Logs show more than critical errors
- Ideally, mine them for clues on usability, performance problems and attacks
- JMX clients
35Useful Strategies
Monitoring - Tools
- Free
- Nagios (Host, Network, Service monitoring)
- Groundwork Monitor
- MC4J
- EJTools
- Cost
- AdventNet
- OpenView
36Useful Strategies
Transactions
- Top server-side tier creates a user transaction, catches all errors and then determines its fate
- Container-managed transactions with session façade
- Top level methods responsible for rollbacks
- Business methods responsible for rollbacks
- Unchecked exceptions not recommended with EJB
- Unchecked exceptions with Spring
37Useful Strategies
Sessions and Caching
- Use session sparingly
- Common strategies
- Hidden form fields
- Cookies (encrypted)
- URL rewriting
- HTTP Session
- Shared caches (OSCache, Tangosol)
- When to flush cache?
- Caches can mask data problems
- Data should have timeouts
- Shared caches should limit usage (LRU)
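The last two points, timeouts on cached data and an LRU size limit, can be combined in a small cache. This is a sketch built on `LinkedHashMap` in access order; the class name `LruTtlCache` is hypothetical, and the current time is passed in explicitly only to keep the example easy to test.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache: evicts the least recently used entry and expires stale data.
final class LruTtlCache<K, V> {
    private static final class Timed<T> {
        final T value;
        final long expiresAt;
        Timed(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final long ttlMillis;
    private final Map<K, Timed<V>> map;

    LruTtlCache(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently used entry first in iteration order
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        map.put(key, new Timed<>(value, nowMillis + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has timed out. */
    synchronized V get(K key, long nowMillis) {
        Timed<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAt) {
            map.remove(key); // stale data is dropped so it cannot mask a problem
            return null;
        }
        return entry.value;
    }
}
```

In production code the caller would typically pass `System.currentTimeMillis()` for `nowMillis`. Reads do not extend an entry's lifetime here, so every value is reloaded from the source at least once per timeout period.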
38Useful Strategies
Clustering
- Why use clusters?
- Availability
- Scalability
- Will this application need a cluster?
- Can you take it offline for maintenance?
- Can you take it offline to scale it up?
- Are you sure you won't need to scale it out?
- Can be expensive and complicated
- Can require more expensive licensing
- Requires serializable data in session
- Limit the use of session and re-put objects on edit
- Requires more testing (test failover conditions)
39Useful Strategies
Clustering
- JBoss and Tomcat have limited cluster sizes
- Multicast can require network and operating system changes
- Multiple JVMs and log files to monitor
- Configuration management issues
- Synchronizing updates
- Custom settings per instance
40Discussion
Get the slides online at http://www.chariotsolutions.com/slides