1
Building Fault-Tolerant Enterprise Applications
  • Greg Hinkle
  • Chariot Solutions
  • chariotsolutions.com

Adapted from original presentation by Erin Mulder and Brian McCallister
2
Agenda
  • Goals of Fault Tolerance
  • User Recoverable Errors
  • Expected Application Errors
  • System Failure
  • Useful Strategies
  • Discussion

3
Goals of Fault Tolerance
What are we really worried about?
  • Availability
  • Integrity
  • Confidentiality
  • Usability
  • Cost

4
Goals of Fault Tolerance
What can go wrong?
  • User Error
  • Concurrent Changes
  • Bugs
  • Resource Failure/Downtime
  • System Overload
  • Misconfiguration
  • Sabotage

5
Goals of Fault Tolerance
Themes we'll keep visiting
  • Prevention
    • Code Guidelines and Reviews
    • Automated Validation and Regression Testing
    • Performance / Stress Testing
    • Negative / Security Testing
  • Detection
    • Logging and Auditing
    • Validation Patterns
    • Monitoring
  • Recovery
    • Exception handling patterns
    • Error feedback loop
    • Redundancy

6
Agenda
  • Goals of Fault Tolerance
  • User Recoverable Errors
  • Expected Application Errors
  • System Failure
  • Useful Strategies
  • Discussion

7
User Recoverable Errors
Simple validation error
  • What do you do when the user:
    • Leaves a required field blank
    • Enters a value too big for the database field
    • Types letters in a numeric field
    • Selects inconsistent options
    • Tries to do things in the wrong order

8
User Recoverable Errors
Simple validation error
  • Fault tolerance is more than detection
  • Prevent the user from making errors
    • Set maxlengths on input fields
    • Use character masks
    • Specify units
    • Show example input
    • Don't allow the selection of inconsistent options
    • Don't present navigation options that aren't
      meant to be followed
    • Guide the user through longer processes

9
User Recoverable Errors
Simple validation error
  • Help the user recover quickly
  • Highlight all errors clearly
  • Show help text and examples for invalid fields
  • If some other action is required first, launch it
    instead of interrupting the flow with frustrating
    errors
  • Perception is everything!
  • Log the error for later analysis
  • Save enough information to recreate
  • Start automatically handling common mistakes

10
User Recoverable Errors
Optimistic concurrency clash
  • Everything looks good until the save
  • Then
  • Item has just gone out of stock
  • Another user has just updated the same document
  • Time has passed and action is no longer allowed

11
User Recoverable Errors
Optimistic concurrency clash
  • Increase save points
  • Alert user to potential risk
    • Low stock
    • Another user just accessed this record
    • Another user has soft lock on record
  • Offer useful options for resolving collision
    • Merge changes
    • Backorder
    • Automatically retry later
    • Email me when it is available
  • Give tips for avoiding future collisions
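
One common way to detect the clash described on these two slides is a version number that is checked at save time. The sketch below is only an illustration under assumptions: the class name VersionedStore and its in-memory map are hypothetical, and a real application would more likely keep the version in a database column and test it in the UPDATE statement.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store used only to illustrate the version check.
final class VersionedStore {
    static final class Entry {
        final String value;
        final int version;

        Entry(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Entry> rows = new HashMap<>();

    synchronized Entry read(String key) {
        return rows.get(key);
    }

    // Saves only if the caller read the latest version. Returns false on a
    // clash so the caller can offer merge or retry options instead of
    // silently overwriting the other user's change.
    synchronized boolean save(String key, String newValue, int versionRead) {
        Entry current = rows.get(key);
        int currentVersion = (current == null) ? 0 : current.version;
        if (currentVersion != versionRead) {
            return false;
        }
        rows.put(key, new Entry(newValue, currentVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.save("doc", "draft", 0);
        int seenByAlice = store.read("doc").version;
        int seenByBob = store.read("doc").version;
        System.out.println("alice saved: " + store.save("doc", "alice edit", seenByAlice));
        System.out.println("bob saved:   " + store.save("doc", "bob edit", seenByBob));
    }
}
```

When save returns false, the application still has both the stored record and the user's pending edit, which is what makes the "merge changes" or "retry later" options possible.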

12
User Recoverable Errors
Bookmarks, back buttons and browsers
  • User escapes normal page flow
    • Bookmarks login page or internal page
    • Uses back button
    • Opens a new window within same session
  • Session times out
    • Missing context from previous requests
    • Next click is like bookmark to internal page
  • Other browser oddities
    • Double-clicking submit buttons
    • Pressing stop button in the middle of a request

13
User Recoverable Errors
Bookmarks, back buttons and sessions
  • Prevention is difficult: the user is in control
  • JavaScript can sometimes help
  • JavaScript can sometimes hurt
  • Plan for and test each of these scenarios
  • Plan for handling out-of-sequence requests
  • Limit state, or key it uniquely

14
User Recoverable Errors
Bookmarks, back buttons and sessions
  • To seamlessly handle session timeouts and
    out-of-sequence requests, consider
    • Persistent sessions (saved to database)
    • Passing state in every request (form fields or
      URL rewriting)
    • Storing state in custom cookies
    • Adding custom logic to recover from timed-out
      sequences
    • Resubmitting requests after re-authentication
  • To simply detect and alert, consider
    • Using a listener to catch session expiration
    • Using state validation to catch out-of-sequence
      requests
    • Redirecting user to session expiration page
  • To improve process
    • Log session losses (requests within expired
      session)
    • Consider increasing session timeout
    • Consider using prevention techniques described
      above

15
User Recoverable Errors
Bookmarks, back buttons and sessions
  • To minimize impact of back button, consider
    • Techniques described for out-of-sequence requests
    • Redirecting to GETs instead of returning
      responses to POSTs
  • To work around double submissions, consider
    • Using unique transaction identifiers stored in
      session
    • Forwarding action submissions to separate response
      pages
      • Response pages automatically display on double
        submit
  • To handle multiple windows, consider
    • Passing state in every request
      • Pass state in hidden fields throughout a wizard
    • Adapting web frameworks to map state (e.g. Struts
      form beans) by primary key or request ID instead
      of a static name
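
The "unique transaction identifiers stored in session" idea can be sketched as a one-time token: issue it when the form is rendered, and accept the submit only the first time the token comes back. The class name SubmitTokenGuard is hypothetical, and in a servlet application the token set would live in the user's session rather than in a standalone object.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper illustrating a one-time submit token.
final class SubmitTokenGuard {
    private final Set<String> outstanding = new HashSet<>();

    // Call when rendering the form; embed the result in a hidden field.
    synchronized String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    // Call when the form is posted. Returns true only for the first submit
    // of a given token; repeats (double-click, back then resubmit) and
    // unknown tokens return false.
    synchronized boolean consume(String token) {
        return outstanding.remove(token);
    }

    public static void main(String[] args) {
        SubmitTokenGuard guard = new SubmitTokenGuard();
        String token = guard.issue();
        System.out.println("first submit accepted:  " + guard.consume(token));
        System.out.println("second submit accepted: " + guard.consume(token));
    }
}
```

On a rejected token the application can simply show the response page for the original submit, which matches the "response pages automatically display on double submit" bullet.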

16
Agenda
  • Goals of Fault Tolerance
  • User Recoverable Errors
  • Expected Application Errors
  • System Failure
  • Useful Strategies
  • Discussion

17
Expected Application Errors
Resource is unavailable
  • Database is down for maintenance
  • No connection to integrated partner service
  • Resource is overloaded
  • Out of DB connections
  • JMS Queue full

18
Expected Application Errors
Resource is unavailable
  • To prevent, consider
  • Coordinating maintenance schedules
  • Planning for failover at the resource level
  • Increasing hardware budget?
  • Increasing transaction timeout seconds (caution:
    last resort)
  • To handle, analyze transactional requirements
  • Is immediate user response necessary?
  • Can the resource access be handled asynchronously
    with an extended, logical transaction?
  • Plan rollbacks carefully to allow for retries
    (consider idempotence, sub-transactions)
  • Alert operator/admin if out of SLA
  • Log all outages (study for patterns)
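
The point about planning for retries can be illustrated with a small retry helper. This is a sketch under assumptions: the name Retry.withBackoff, the doubling delay, and the choice to retry on any RuntimeException are illustrative, not taken from the slides. It is only safe when the operation is idempotent, as the slide cautions.

```java
import java.util.function.Supplier;

// Hypothetical retry helper; only use it around idempotent operations.
final class Retry {
    // Runs op up to maxAttempts times, doubling the pause after each
    // failure. Rethrows the last failure if every attempt fails.
    static <T> T withBackoff(Supplier<T> op, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("resource unavailable");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A production version would usually also log each failed attempt and alert an operator once the retries are exhausted, in line with the last two bullets above.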

19
Expected Application Errors
Application is overloaded
  • Mentioned on CNBC
  • Linked from Slashdot
  • Denial of Service

20
Expected Application Errors
Application is overloaded
  • Test under heavy load
  • Plan for growth
  • Tune hot spots
  • Run with excess capacity
  • Throttle at network level
  • Use JMS and other asynchronous technologies to
    throttle on backend
  • Tune application server to degrade gracefully
  • Monitor carefully
  • Be prepared to scale out, not just up
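
Throttling on the backend with an asynchronous queue can be sketched in-process with a bounded queue. This is not JMS: BoundedWorkQueue is a hypothetical stand-in that only shows the shape of the idea, namely that the front end gets an immediate "full" answer instead of piling more work onto an overloaded backend.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical in-process stand-in for a bounded message queue.
final class BoundedWorkQueue {
    private final BlockingQueue<String> queue;

    BoundedWorkQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking: returns false when the queue is full so the caller can
    // show a "busy, please try again" page rather than wait.
    boolean submit(String request) {
        return queue.offer(request);
    }

    // The backend worker drains at its own pace; returns null when empty.
    String next() {
        return queue.poll();
    }

    public static void main(String[] args) {
        BoundedWorkQueue work = new BoundedWorkQueue(2);
        System.out.println(work.submit("order-1"));
        System.out.println(work.submit("order-2"));
        System.out.println(work.submit("order-3")); // rejected: queue is full
        System.out.println(work.next());
    }
}
```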

21
Expected Application Errors
Bugs and other undocumented features
  • Friendly bug
  • Triggers invalid state
  • Causes VM or app server to throw exception
  • Greedy bug
  • Monopolizes resources
  • Leaks connections
  • Silent and deadly bug
  • Corrupts data

22
Expected Application Errors
Bugs and other undocumented features
  • To handle friendly bugs
  • Bulletproof your transactions and rollbacks
  • Write coding and design guidelines
  • Conduct peer code reviews (share best practices)
  • For client applications, catch Throwable
  • Map exception handling in server container
  • The finally clause is your friend
  • Display sanitized errors to user
  • Give enough information to map back to logs
  • Log carefully to allow easy debugging
  • Configure timestamp, thread id output
  • Log data together, not individually
  • Alert operator/administrator
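
"Display sanitized errors" and "give enough information to map back to logs" fit together naturally if each failure gets a reference ID that appears both in the log and in the user-facing message. The sketch below assumes a hypothetical SafeAction wrapper and uses java.util.logging; it catches Exception rather than Throwable, which is a choice made here for the example.

```java
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper: full detail goes to the log, only a reference ID
// goes to the user.
final class SafeAction {
    private static final Logger LOG = Logger.getLogger(SafeAction.class.getName());

    static String run(Callable<String> action) {
        try {
            return action.call();
        } catch (Exception e) {
            String incidentId = UUID.randomUUID().toString();
            // One log call carries the ID, the thread and the stack trace.
            LOG.log(Level.SEVERE,
                    "incident=" + incidentId + " thread=" + Thread.currentThread().getName(), e);
            // The exception message is deliberately not shown to the user.
            return "Sorry, something went wrong. Please quote reference " + incidentId + ".";
        }
    }

    public static void main(String[] args) {
        System.out.println(run(() -> "saved"));
        System.out.println(run(() -> {
            throw new IllegalStateException("connection pool exhausted");
        }));
    }
}
```

Support staff can then search the logs for the reference the user quotes, without the page ever exposing stack traces or internal details.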

23
Expected Application Errors
Bugs and other undocumented features
  • To handle greedy bugs
  • Reduce transaction timeout seconds
  • Handle timeouts in the same way as friendly bugs
  • Monitor carefully
  • Log statistics (number of transaction timeouts, CPU
    usage, memory usage, GC, network traffic, stuck
    threads)
  • Automate log analysis
  • Trigger a thread dump (kill -3) during hot spots
  • Alert operator/administrator to hot spots
  • Use clustering to contain damage

24
Expected Application Errors
Bugs and other undocumented features
  • To handle silent and deadly bugs
  • Bulletproof transaction settings
  • Validate on multiple levels, use referential
    integrity
  • Audit everything
  • Unless performance/cost prohibits, keep a
    complete audit trail on every table (easy with
    triggers, aspects or code generators), try to
    include transaction ID
  • Flush caches regularly
  • After a save, load the record from the database
    and display back to the user
  • Run periodic audits with human review
  • Plan for how to use audit trail to recover from
    data corruption
  • Early detection is key: escalate user concerns!

25
Agenda
  • Goals of Fault Tolerance
  • User Recoverable Errors
  • Expected Application Errors
  • System Failure
  • Useful Strategies
  • Discussion

26
System Failure
Never have an unplanned outage
  • Determine acceptable downtime
  • Plan clustering / failover accordingly
  • Monitor carefully so outages are detected
    immediately
  • Be ready with a tiny planned outage page and
    server in advance
  • Consider offsite host
  • Build this functionality into non-Web clients at
    development time
  • Plan for transaction recovery
  • Plan for JMS recovery
  • Use quiescing load balancing to bring servers
    offline for maintenance

27
System Failure
Sabotage
  • Encrypt data in database
    • Security through obscurity
    • Key entry on startup
    • Credit cards should be two-way encrypted (resist
      the urge to Rot13)
    • Passwords should be one-way hashed
    • Create new temporary passwords for forgotten
      passwords
  • SQL Injection Prevention
    • Don't dynamically generate SQL with user input
    • Use prepared statements
  • Cross-site scripting
    • Cleanse any user data republished on a site
  • Don't publish extra information
    • Turn off server headers, require SSL on login or
      throughout
  • Create a DMZ
    • Two firewalls
    • Use SSL between tiers
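
As an illustration of "passwords should be one-way hashed", here is a minimal sketch using PBKDF2 from the standard Java crypto API. The slides do not name an algorithm; PBKDF2, the iteration count, the salt size and the class name PasswordHasher are all assumptions made for this example.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical helper: salted, one-way password hashing with PBKDF2.
final class PasswordHasher {
    private static final int ITERATIONS = 100_000; // assumed work factor
    private static final int KEY_BITS = 256;

    // A fresh random salt per password; store it next to the hash.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    static byte[] hash(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available", e);
        }
    }

    // Re-hashes the attempt with the stored salt and compares the results.
    static boolean matches(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt, salt), storedHash);
    }

    public static void main(String[] args) {
        byte[] salt = newSalt();
        byte[] stored = hash("correct horse".toCharArray(), salt);
        System.out.println(matches("correct horse".toCharArray(), salt, stored));
        System.out.println(matches("wrong guess".toCharArray(), salt, stored));
    }
}
```

Because the hash cannot be reversed, a forgotten password cannot be mailed back to the user, which is why the slide pairs hashing with issuing new temporary passwords.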

28
Agenda
  • Goals of Fault Tolerance
  • User Recoverable Errors
  • Expected Application Errors
  • System Failure
  • Useful Strategies
  • Discussion

29
Useful Strategies
Be sure that you develop guidelines for
  • Error Messages
  • Validation (format, business rules, size,
    cleansing)
  • Logging (when, where, what)
  • Auditing
  • Monitoring (level of automation, alerts)
  • Transactions (who rolls back, checked vs.
    unchecked)
  • Sessions and Caching (request vs. session,
    flushing)
  • Clustering

30
Useful Strategies
Error Messages
  • For validation errors, be sure to
    • Include format and size hints
    • Show examples
    • Give more information than the basic field label
    • Mention the error at the top of the screen and
      highlight the field
    • Catch all errors at the same time
  • For other user-recoverable errors
    • Let the user know what to do next
  • If the user can't recover
    • Apologize
    • Give no details
    • Suggest workarounds
    • (Silently log and alert!)

31
Useful Strategies
Validation
  • If possible, validate at all levels
  • Common strategies
    • Externalize validation rules and use a framework
      that supports rich validation
    • Clearly define which layers are responsible for
      which types of validation. For example:
      • All format errors handled in web tier
      • All business rule violations handled in
        application tier
      • All field lengths enforced at data tier
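
To make "externalize validation rules" and "catch all errors at the same time" concrete, the sketch below declares rules as data and checks every one of them in a single pass. FormValidator is a hypothetical class written for this example; it is not a real validation framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical rule-based validator that reports all failures at once.
final class FormValidator {
    private static final class Rule {
        final String field;
        final Predicate<String> ok;
        final String message;

        Rule(String field, Predicate<String> ok, String message) {
            this.field = field;
            this.ok = ok;
            this.message = message;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    FormValidator rule(String field, Predicate<String> ok, String message) {
        rules.add(new Rule(field, ok, message));
        return this;
    }

    // Checks every rule and returns field -> first failing message, in rule
    // order. An empty map means the form is valid.
    Map<String, String> validate(Map<String, String> form) {
        Map<String, String> errors = new LinkedHashMap<>();
        for (Rule r : rules) {
            String value = form.get(r.field);
            if (value == null) {
                value = ""; // treat a missing field as blank
            }
            if (!r.ok.test(value)) {
                errors.putIfAbsent(r.field, r.message);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        FormValidator validator = new FormValidator()
                .rule("name", s -> !s.trim().isEmpty(), "Name is required")
                .rule("quantity", s -> s.matches("\\d{1,4}"),
                        "Quantity must be 1 to 4 digits, e.g. 12");
        Map<String, String> form = new LinkedHashMap<>();
        form.put("name", "");
        form.put("quantity", "twelve");
        System.out.println(validator.validate(form));
    }
}
```

Each message carries a format hint and an example, as the error-message guidelines recommend, and the user sees every problem on the first round trip.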

32
Useful Strategies
Logging
  • Log in all tiers
  • Define logging levels and when they are used
  • Log user failures at different levels than system
    failures
  • Include timestamp, user, thread ID, transaction
    ID, etc.
  • Don't make logs a source of failure (watch disk
    space, JMS load, etc.)
  • Log information in a single call
  • Aggregate server logs
  • Socket appender
  • Scripts and mounting

Bad:
  log.trace("Searching " + keyword);
  log.trace("Found " + results.size());
Good:
  log.trace("Searching " + keyword + " Found " + results.size());
33
Useful Strategies
Auditing
  • Audit operations where possible
  • Provides accountability
  • Easier to support users
  • Easier to debug
  • Easier to recover from disaster
  • Easier to detect attacks
  • Include
  • Timestamp
  • Current User
  • Some sort of thread ID, transaction ID, etc.
  • Complete data record or diff

34
Useful Strategies
Monitoring
  • Common strategies include
  • 24/7 operations center
  • Business hours operation center
  • Automated, redundant processes that analyze logs
    and raise alerts to on-call administrators
  • SNMP and monitors
  • Logs show more than critical errors
  • Ideally, mine them for clues on usability,
    performance problems and attacks
  • JMX clients

35
Useful Strategies
Monitoring - Tools
  • Free
  • Nagios (Host, Network, Service monitoring)
  • Groundwork Monitor
  • MC4J
  • EJTools
  • Cost
  • AdventNet
  • OpenView

36
Useful Strategies
Transactions
  • Top server-side tier creates a user transaction,
    catches all errors and then determines its fate
  • Container-managed transactions with session
    façade
  • Top level methods responsible for rollbacks
  • Business methods responsible for rollbacks
  • Unchecked exceptions not recommended with EJB
  • Unchecked exceptions with Spring

37
Useful Strategies
Sessions and Caching
  • Use session sparingly
  • Common strategies
    • Hidden form fields
    • Cookies (encrypted)
    • URL rewriting
    • HTTP Session
    • Shared caches (OSCache, Tangosol)
  • When to flush cache?
    • Caches can mask data problems
    • Data should have timeouts
    • Shared caches should limit usage (LRU)
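
The "limit usage (LRU)" bullet can be sketched with the JDK's LinkedHashMap, which supports access ordering and an eviction hook. The class name LruCache and the size limit are illustrative; the sketch is not thread-safe and does not implement the per-entry timeouts the slide also recommends.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal least-recently-used cache; not thread-safe, no entry timeouts.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insert; returning true evicts the
    // least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" is now the most recently used entry
        cache.put("c", 3); // evicts "b", the least recently used entry
        System.out.println(cache.keySet());
    }
}
```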

38
Useful Strategies
Clustering
  • Why use clusters?
    • Availability
    • Scalability
  • Will this application need a cluster?
    • Can you take it offline for maintenance?
    • Can you take it offline to scale it up?
    • Are you sure you won't need to scale it out?
  • Can be expensive and complicated
    • Can require more expensive licensing
    • Requires serializable data in session
    • Limit the use of session and re-put objects on
      edit
    • Requires more testing (test failover conditions)

39
Useful Strategies
Clustering
  • JBoss and Tomcat have limited cluster sizes
  • Multicast can require network and operating
    system changes
  • Multiple JVMs and log files to monitor
  • Configuration management issues
    • Synchronizing updates
    • Custom settings per instance

40
Discussion
Get the slides online at http://www.chariotsolutions.com/slides
41
Building Fault-Tolerant Enterprise Applications
  • Greg Hinkle
  • Chariot Solutions
  • chariotsolutions.com