Building Fault-Tolerant Enterprise Applications

1
Building Fault-Tolerant Enterprise Applications

Greg Hinkle
Chariot Solutions
chariotsolutions.com

Adapted from original presentation by Erin
Mulder and Brian McCallister
2
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

3
Goals of Fault Tolerance
What are we really worried about?

Availability
Integrity
Confidentiality
Usability
Cost

4
Goals of Fault Tolerance
What can go wrong?

User Error
Concurrent Changes
Bugs
Resource Failure/Downtime
System Overload
Misconfiguration
Sabotage

5
Goals of Fault Tolerance
Themes we'll keep visiting

Prevention
Code Guidelines and Reviews
Automated Validation and Regression Testing
Performance / Stress Testing
Negative / Security Testing
Detection
Logging and Auditing
Validation Patterns
Monitoring
Recovery
Exception handling patterns
Error feedback loop
Redundancy

6
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

7
User Recoverable Errors
Simple validation error

What do you do when the user
Leaves a required field blank
Enters a value too big for the database field
Types letters in a numeric field
Selects inconsistent options
Tries to do things in the wrong order

8
User Recoverable Errors
Simple validation error

Fault tolerance is more than detection
Prevent the user from making errors
Set maxlengths on input fields
Use character masks
Specify units
Show example input
Don't allow the selection of inconsistent options
Don't present navigation options that aren't
meant to be followed
Guide the user through longer processes

9
User Recoverable Errors
Simple validation error

Help the user recover quickly
Highlight all errors clearly
Show help text and examples for invalid fields
If some other action is required first, launch it
instead of interrupting the flow with frustrating
errors
Perception is everything!
Log the error for later analysis
Save enough information to recreate
Start automatically handling common mistakes
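
The validation cases above (blank required field, oversized value, letters in a numeric field) can be sketched as a validator that collects every error in one pass, so all of them can be highlighted together. This is only an illustrative sketch: the class name, field names, and the 40-character limit are assumptions, not part of the original presentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical form validator: gathers all field errors in a single pass
// so the user sees every problem at once instead of one per submit.
class OrderFormValidator {
    static final int MAX_NAME_LENGTH = 40; // assumed database column size

    // Returns field name -> message; an empty map means the input is valid.
    static Map<String, String> validate(String name, String quantity) {
        Map<String, String> errors = new LinkedHashMap<>();
        if (name == null || name.trim().isEmpty()) {
            errors.put("name", "Name is required.");
        } else if (name.length() > MAX_NAME_LENGTH) {
            errors.put("name", "Name must be at most " + MAX_NAME_LENGTH + " characters.");
        }
        if (quantity == null || quantity.trim().isEmpty()) {
            errors.put("quantity", "Quantity is required, for example 12.");
        } else if (!quantity.trim().matches("\\d{1,6}")) {
            errors.put("quantity", "Quantity must be a whole number, for example 12.");
        }
        return errors;
    }
}
```

Each message carries an example value, which matches the advice to show help text and examples for invalid fields.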

10
User Recoverable Errors
Optimistic concurrency clash

Everything looks good until the save
Then
Item has just gone out of stock
Another user has just updated the same document
Time has passed and action is no longer allowed

11
User Recoverable Errors
Optimistic concurrency clash

Increase save points
Alert user to potential risk
Low stock
Another user just accessed this record
Another user has soft lock on record
Offer useful options for resolving collision
Merge changes
Backorder
Automatically retry later
Email me when it is available
Give tips for avoiding future collisions
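
One common way to detect the "another user has just updated the same document" case is a version number checked at save time. The sketch below uses an in-memory map as a stand-in for the database; `DocumentStore` and `VersionedDocument` are hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Immutable snapshot of a document together with the version it was read at.
class VersionedDocument {
    final String text;
    final long version;

    VersionedDocument(String text, long version) {
        this.text = text;
        this.version = version;
    }
}

// In-memory stand-in for a table with a version column.
class DocumentStore {
    private final Map<String, VersionedDocument> docs = new ConcurrentHashMap<>();

    void create(String id, String text) {
        docs.put(id, new VersionedDocument(text, 1));
    }

    VersionedDocument load(String id) {
        return docs.get(id);
    }

    // Saves only if nobody else has saved since expectedVersion was read.
    // Returns false on a collision so the caller can offer merge or retry.
    boolean save(String id, String newText, long expectedVersion) {
        VersionedDocument current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        return docs.replace(id, current, new VersionedDocument(newText, expectedVersion + 1));
    }
}
```

A `false` result is the point at which the application would offer the options listed above, such as merging changes or retrying later.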

12
User Recoverable Errors
Bookmarks, back buttons and browsers

User escapes normal page flow
Bookmarks login page or internal page
Uses back button
Opens a new window within same session
Session times out
Missing context from previous requests
Next click is like bookmark to internal page
Other browser oddities
Double-clicking submit buttons
Pressing stop button in the middle of a request

13
User Recoverable Errors
Bookmarks, back buttons and sessions

Prevention is difficult: the user is in control
Javascript can sometimes help
Javascript can sometimes hurt
Plan for and test each of these scenarios
Plan for handling out-of-sequence requests
Limit state, or key it uniquely

14
User Recoverable Errors
Bookmarks, back buttons and sessions

To seamlessly handle session timeouts and
out-of-sequence requests, consider
Persistent sessions (saved to database)
Passing state in every request (form fields or
URL rewriting)
Storing state in custom cookies
Adding custom logic to recover from timed-out
sequences
Resubmit requests after re-authentication
To simply detect and alert, consider
Using listener to catch session expiration
Using state validation to catch out-of-sequence
requests
Redirecting user to session expiration page
To improve process
Log session losses (requests within expired
session)
Consider increasing session timeout
Consider using prevention techniques described
above

15
User Recoverable Errors
Bookmarks, back buttons and sessions

To minimize impact of back button, consider
Techniques described for out-of-sequence requests
Redirecting to GETs instead of returning
responses to POSTs
To work around double submissions, consider
Using unique transaction identifiers stored in
session
Forwarding action submissions to separate response
pages
Response pages automatically display on double
submit
To handle multiple windows, consider
Passing state in every request
Pass state in hidden fields throughout a wizard
Adapting web frameworks to map state (e.g. Struts
form beans) by primary key or request ID instead
of a static name
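
The "unique transaction identifiers stored in session" idea for double submissions can be sketched as a one-time token. In this sketch a plain set stands in for the user's session, and `SubmitTokenGuard` is a hypothetical helper, not an API from any web framework.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for per-session storage; in a web application this set of
// outstanding tokens would typically live in the user's HTTP session.
class SubmitTokenGuard {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    // Call when rendering the form; embed the token in a hidden field.
    String issueToken() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    // Call when handling the submission. Only the first request carrying
    // a given token returns true; repeats and unknown tokens return false.
    boolean consume(String token) {
        return token != null && outstanding.remove(token);
    }
}
```

When `consume` returns false, the handler can skip the action and simply redisplay the response page for the original submission.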

16
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

17
Expected Application Errors
Resource is unavailable

Database is down for maintenance
No connection to integrated partner service
Resource is overloaded
Out of DB connections
JMS Queue full

18
Expected Application Errors
Resource is unavailable

To prevent, consider
Coordinating maintenance schedules
Planning for failover at the resource level
Increasing hardware budget ?
Increasing transaction timeout seconds (caution:
last resort)
To handle, analyze transactional requirements
Is immediate user response necessary?
Can the resource access be handled asynchronously
with an extended, logical transaction?
Plan rollbacks carefully to allow for retries
(consider idempotence, sub-transactions)
Alert operator/admin if out of SLA
Log all outages (study for patterns)
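
Planning for retries against an unavailable resource might look like the sketch below. It assumes the operation is idempotent, as the slide cautions, and the `Retry` helper and its parameters are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;

class Retry {
    // Runs an idempotent operation up to maxAttempts times, doubling the
    // pause between attempts. Rethrows the last failure if all attempts fail.
    static <T> T withBackoff(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e; // a real system would log the outage here
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

Retrying a non-idempotent operation this way could apply it twice, which is why the rollback and idempotence planning comes first.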

19
Expected Application Errors
Application is overloaded

Mentioned on CNBC
Linked from Slashdot
Denial of Service

20
Expected Application Errors
Application is overloaded

Test under heavy load
Plan for growth
Tune hot spots
Run with excess capacity
Throttle at network level
Use JMS and other asynchronous technologies to
throttle on backend
Tune application server to degrade gracefully
Monitor carefully
Be prepared to scale out, not just up
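
The slide recommends throttling at the network level and with JMS on the backend. As a simpler in-process illustration of the same idea (this swaps in a semaphore and is not the JMS approach itself), the sketch below caps concurrent work and sheds the excess with a fallback. `Throttle` is a hypothetical class.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the work if a permit is free; otherwise returns the fallback
    // immediately instead of queueing, so overload degrades gracefully.
    <T> T run(Supplier<T> work, T busyFallback) {
        if (!permits.tryAcquire()) {
            return busyFallback;
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }
}
```

The fallback could be a "please try again shortly" response, which is usually preferable to letting every request time out.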

21
Expected Application Errors
Bugs and other undocumented features

Friendly bug
Triggers invalid state
Causes VM or app server to throw exception
Greedy bug
Monopolizes resources
Leaks connections
Silent and deadly bug
Corrupts data

22
Expected Application Errors
Bugs and other undocumented features

To handle friendly bugs
Bulletproof your transaction rollbacks
Write coding and design guidelines
Conduct peer code reviews (share best practices)
For client applications, catch Throwable
Map exception handling in server container
The finally clause is your friend
Display sanitized errors to user
Give enough information to map back to logs
Log carefully to allow easy debugging
Configure timestamp, thread id output
Log data together not individually
Alert operator/administrator
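
Several of the points above (the finally clause, sanitized errors, and giving enough information to map back to logs) can be combined in one handler. The sketch below is an assumption-laden illustration: `SafeAction`, its nested `Resource`, and the message wording are hypothetical, and it uses `java.util.logging` only because it ships with the JDK.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

class SafeAction {
    private static final Logger LOG = Logger.getLogger(SafeAction.class.getName());

    // Hypothetical resource; stands in for a connection, file, or lock.
    static class Resource implements AutoCloseable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    // Runs the action, always releases the resource, and returns a
    // user-safe message that never exposes exception details.
    static String execute(Resource resource, Runnable action) {
        try {
            action.run();
            return "Done.";
        } catch (RuntimeException e) {
            // The reference ID appears in both the log and the user message,
            // so support staff can map a user report back to the log entry.
            String errorId = UUID.randomUUID().toString();
            LOG.log(Level.SEVERE,
                    "errorId=" + errorId + " thread=" + Thread.currentThread().getName(), e);
            return "Sorry, something went wrong. Reference: " + errorId;
        } finally {
            resource.close();
        }
    }
}
```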

23
Expected Application Errors
Bugs and other undocumented features

To handle greedy bugs
Reduce transaction timeout seconds
Handle timeouts in the same way as friendly bugs
Monitor carefully
Log statistics (number of transaction timeouts, CPU
usage, memory usage, GC, network traffic, stuck
threads)
Automate log analysis
Trigger a thread dump (kill -3) during hot spots
Alert operator/administrator to hot spots
Use clustering to contain damage
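
Besides `kill -3`, a thread dump can also be captured from inside the JVM, which makes it possible for a monitor to trigger one automatically during a hot spot. The following is a minimal sketch using the standard `Thread.getAllStackTraces()` call; the `ThreadDump` class and its output format are assumptions for illustration.

```java
import java.util.Map;

class ThreadDump {
    // Builds a plain-text dump of every live thread and its stack frames.
    static String capture() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }
}
```

The resulting text can be written to the log alongside the other statistics so that stuck threads are visible after the fact.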

24
Expected Application Errors
Bugs and other undocumented features

To handle silent and deadly bugs
Bulletproof transaction settings
Validate on multiple levels, use referential
integrity
Audit everything
Unless performance/cost prohibits, keep a
complete audit trail on every table (easy with
triggers, aspects or code generators), try to
include transaction ID
Flush caches regularly
After a save, load the record from the database
and display back to the user
Run periodic audits with human review
Plan for how to use audit trail to recover from
data corruption
Early detection is key: escalate user concerns!

25
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

26
System Failure
Never have an unplanned outage

Determine acceptable downtime
Plan clustering / failover accordingly
Monitor carefully so outages are detected
immediately
Be ready with a tiny planned outage page and
server in advance
Consider offsite host
Build this functionality into non-Web clients at
development time
Plan for transaction recovery
Plan for JMS recovery
Use quiescing load balancing to bring servers
offline for maintenance

27
System Failure
Sabotage

Encrypt data in database
Security through obscurity
Key entry on startup
Credit cards should be two-way encrypted (resist
the urge to Rot13)
Passwords should be one-way hashed
Create new temporary passwords for forgotten
passwords
SQL Injection Prevention
Don't dynamically generate SQL with user input
Use prepared statements
Cross-site scripting
Cleanse any user data republished on a site
Don't publish extra information
Turn off server headers, require SSL on login or
throughout
Create a DMZ
Two firewalls
Use SSL between tiers
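
For "passwords should be one-way hashed", one possible sketch uses the JDK's PBKDF2 implementation with a per-user random salt. The slide does not name an algorithm, so PBKDF2, the iteration count, and the `PasswordHasher` class are all assumptions made for this example.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

class PasswordHasher {
    private static final int ITERATIONS = 100_000; // assumed work factor
    private static final int KEY_BITS = 256;

    // Generates a fresh random salt; store it next to the hash.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Derives a one-way hash; the original password cannot be recovered from it.
    static byte[] hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available on this JVM", e);
        }
    }

    // Re-hashes the candidate and compares it with the stored hash.
    static boolean matches(char[] candidate, byte[] salt, byte[] expectedHash) {
        return MessageDigest.isEqual(hash(candidate, salt), expectedHash);
    }
}
```

Because the hash is one-way, a forgotten password cannot be mailed back to the user, which is why the slide suggests issuing a new temporary password instead.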

28
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

29
Useful Strategies
Be sure that you develop guidelines for

Error Messages
Validation (format, business rules, size,
cleansing)
Logging (when, where, what)
Auditing
Monitoring (level of automation, alerts)
Transactions (who rolls back, checked vs.
unchecked)
Sessions and Caching (request vs. session,
flushing)
Clustering

30
Useful Strategies
Error Messages

For validation errors, be sure to
Include format and size hints
Show examples
Give more information than the basic field label
Mention the error at the top of the screen and
Highlight the field
Catch all errors at the same time
For other user-recoverable errors
Let the user know what to do next
If the user can't recover
Apologize
Give no details
Suggest workarounds
(Silently log and alert!)

31
Useful Strategies
Validation

If possible, validate at all levels
Common strategies
Externalize validation rules and use a framework
that supports rich validation
Clearly define which layers are responsible for
which types of validation. For example
All format errors handled in web tier
All business rule violations handled in
application tier
All field lengths enforced at data tier

32
Useful Strategies
Logging

Log in all tiers
Define logging levels and when they are used
Log user failures at different levels than system
failures
Include timestamp, user, thread ID, transaction
ID, etc.
Don't make logs a source of failure (watch disk
space, JMS load, etc.)
Log information in a single call
Aggregate server logs
Socket appender
Scripts and mounting

Bad:
log.trace("Searching " + keyword);
log.trace("Found " + results.size());
Good:
log.trace("Searching " + keyword + " Found " + results.size());

33
Useful Strategies
Auditing

Audit operations where possible
Provides accountability
Easier to support users
Easier to debug
Easier to recover from disaster
Easier to detect attacks
Include
Timestamp
Current User
Some sort of thread ID, transaction ID, etc.
Complete data record or diff
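
The audit fields listed above (timestamp, current user, transaction ID, and a diff) could be captured in a small record like the one sketched here. `AuditEntry` and its string-map representation of a row are hypothetical simplifications, and this version only reports added or changed fields, not removed ones.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

class AuditEntry {
    final Instant timestamp;
    final String user;
    final String transactionId;
    final Map<String, String> changes; // field name -> "old -> new"

    private AuditEntry(Instant timestamp, String user, String transactionId,
                       Map<String, String> changes) {
        this.timestamp = timestamp;
        this.user = user;
        this.transactionId = transactionId;
        this.changes = changes;
    }

    // Builds an entry containing only the fields whose values differ
    // between the record before the save and the record after it.
    static AuditEntry of(String user, String transactionId,
                         Map<String, String> before, Map<String, String> after) {
        Map<String, String> changes = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : after.entrySet()) {
            String oldValue = before.get(field.getKey());
            if (!Objects.equals(oldValue, field.getValue())) {
                changes.put(field.getKey(), oldValue + " -> " + field.getValue());
            }
        }
        return new AuditEntry(Instant.now(), user, transactionId, changes);
    }
}
```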

34
Useful Strategies
Monitoring

Common strategies include
24/7 operations center
Business hours operation center
Automated, redundant processes that analyze logs
and raise alerts to on-call administrators
SNMP and monitors
Logs show more than critical errors
Ideally, mine them for clues on usability,
performance problems and attacks
JMX clients

35
Useful Strategies
Monitoring - Tools

Free
Nagios (Host, Network, Service monitoring)
Groundwork Monitor
MC4J
EJTools
Cost
AdventNet
OpenView

36
Useful Strategies
Transactions

Top server-side tier creates a user transaction,
catches all errors and then determines its fate
Container-managed transactions with session
façade
Top level methods responsible for rollbacks
Business methods responsible for rollbacks
Unchecked exceptions not recommended with EJB
Unchecked exceptions with Spring

37
Useful Strategies
Sessions and Caching

Use session sparingly
Common strategies
Hidden form fields
Cookies (encrypted)
URL rewriting
HTTP Session
Shared caches (OSCache, Tangosol)
When to flush cache?
Caches can mask data problems
Data should have timeouts
Shared caches should limit usage (LRU)
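
The point that caches should limit usage with LRU eviction can be illustrated with the JDK's `LinkedHashMap` in access order. This is a minimal single-threaded sketch, not a substitute for the shared cache products named above; `LruCache` is a hypothetical class, and it does not implement the data timeouts the slide also recommends.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded cache that evicts the least recently used entry when full.
// Not thread-safe; wrap or synchronize it before sharing across threads.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, oldest first
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // removes the eldest (least recently accessed) entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```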

38
Useful Strategies
Clustering

Why use clusters?
Availability
Scalability
Will this application need a cluster?
Can you take it offline for maintenance?
Can you take it offline to scale it up?
Are you sure you won't need to scale it out?
Can be expensive and complicated
Can require more expensive licensing
Requires serializable data in session
Limit the use of session and re-put objects on
edit
Requires more testing (test fail over conditions)

39
Useful Strategies
Clustering

JBoss and Tomcat have limited cluster sizes
Multicast can require network and operating
system changes
Multiple JVMs and log files to monitor
Configuration management issues
Synchronizing updates
Custom settings per instance

40
Discussion
Get the slides online at http://www.chariotsolutions.com/slides
41
Building Fault-Tolerant Enterprise Applications

Greg Hinkle
Chariot Solutions
chariotsolutions.com
