Title: Christopher A. Monaco
1. STEREO: The Solar TErrestrial RElations Observatory Flight Software's Unconventional Solution to Floating Point Error Handling
- Christopher A. Monaco
- JHU/APL
- FSW-07 Workshop
- Laurel, MD
- 5/6-Nov-2007
2. Early STEREO FSW Development
- A task of the STEREO FSW team during development was to implement a floating point exception handler
- On the RAD6000 this is not as straightforward as it sounds
- This is what we learned and what we did about it
3. Background
- Each of the two STEREO spacecraft has 2 RAD6000 processors that operate the spacecraft bus
  - One for the C&DH subsystem and one for the G&C subsystem
- The C&DH and G&C processors run the VxWorks 5.3.1 operating system
- The RAD6000 is a POWER architecture processor
  - Based on the IBM Federal Systems RSC6000 circa 1985
- The first RAD6000 was launched in 1996 on Mars Pathfinder
- Since then approximately 150 RAD6000s have been flown on various missions, including
  - Rovers Spirit and Opportunity, Deep Space 1, Genesis and Stardust, Mars Polar Lander, Mars Climate Orbiter, APL's MESSENGER
4. RSC6000
- RSC6000 has 3 semi-autonomous processor units that each implement their own instruction pipeline
  - Instruction Stream processor or Branch Unit (BU)
  - Fixed Point Unit (FXU)
  - Floating Point Unit (FPU)
- The 3 processing units execute somewhat independently
  - Several instructions may be in various phases of execution at any particular instant
  - Instructions across the 3 pipelines often finish in a different order from that defined by the program
- This is in contrast to the sequential model of program execution
  - Each instruction must completely finish before the next begins
- Pipelined instruction execution is responsible for significant performance improvements made by the POWER architecture
5. IEEE 754 Floating Point Standard
- The IEEE 754 Floating Point Standard was first adopted by IEEE/ANSI in 1985
- The standard requires that a faulting instruction be accurately identifiable within the exception trap
- In general, chip designers meet this requirement by implementing a precise interrupt
- Implementing a precise interrupt is complicated by the 3 somewhat independent pipelines of the RSC6000
  - Out-of-order instruction sequencing
- Precise interrupt: an interrupt or exception is precise if the saved processor state corresponds with the sequential model of program execution, where one instruction's execution ends before the next begins
6. RSC6000 Floating Point
- Designers of the RSC6000 had a choice among three options
- (1) Implement precise interrupt: invent a complex scheme for identifying a faulting instruction, rolling back instructions that executed out of sequence in the other pipelines, and restoring processor state
- (2) Implement precise interrupt: enforce explicit instruction execution sequencing (serialize the pipelines)
  - Each instruction must complete (exception-free) before subsequent instructions may begin
  - Performance hit of 2-3x
- (3) Give up the ability to identify the faulting instruction
  - Software must poll floating point registers for exceptions
7. Floating Point Exceptions
- Buried at the end of the Floating Point Exceptions section of the POWER Processor Architecture Manual, Version 1.52, regarding trapping floating point exceptions:
  - System performance with MSR(FE) = 1 may be significantly degraded
- Regarding polling for floating point exceptions, RSC6000 and RAD6000 literature offers little guidance
  - Inserting test code after each floating point operation
- Adding test code after each floating point operation is too invasive, particularly since a significant portion of the G&C code is autogenerated MATLAB RTW code
  - Each task involving floating point operations would require modification
- The compiler may provide several options at different levels: subroutine, loop exit, statement assignment, or after each floating point instruction
- No obvious compiler solution offered detection and appropriate handling of floating point errors while also avoiding performance loss
8. STEREO FP Exception Handling Options
- (2) Enforce explicit instruction sequencing: trap floating point exceptions
  - Pro: system-wide solution
  - Pro: conventional
  - Pro: no latency in error detection
  - Con: significant overall system performance degradation (2-3x) associated with serializing the 3 pipelines
- (3) Give up precise interrupt: poll for floating point exceptions
  - Pro: we don't really NEED to know the exact instruction that caused the error. Reset the system in case of a critical task floating point error
  - Con: polling results in latency between the occurrence and detection of an error
    - Small latency can be tolerated
    - With good software practices, floating point exceptions should be VERY rare!
  - Pro: 2-3x faster than option (2). Take advantage of parallel pipelines!
9. VxWorks
- VxWorks associates a copy of the FPU registers with each user task
  - VxWorks saves and restores FPU registers at context switches
  - Polling in a particular task context would only catch exceptions occurring within that task since the last poll
- VxWorks offers a hook into the OS in which developers may insert user code to execute at task context switches
  - taskHookLib: STATUS taskSwitchHookAdd (FUNCPTR switchHook)
  - Arguments to the user-supplied task switch hook are pointers to the old_tcb and new_tcb
10. STEREO Floating Point Exception Polling
- VxWorks saves and restores a task's registers prior to calling the user task switch hook routine
- The switch hook behaves as though it were executing within the context of the new task. Therefore, FPSCRread() within the task switch hook supplies the FPSCR associated with the new task
  - Did the new task suffer a floating point exception the last time it ran?
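The polling step described above can be sketched in C. This is a minimal host-side sketch under stated assumptions, not the STEREO flight code: fake_tcb_t, sim_fpscr, fp_poll_switch_hook, last_faulting_task and last_fault_bits are hypothetical names; FPSCRread() is stubbed to return a simulated value; and the exception bit masks assume the PowerPC FPSCR layout, which should be checked against the RAD6000 documentation. The source does not say which exception bits were monitored, so the mask is illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: sticky exception bits follow the PowerPC FPSCR layout
 * (IBM bit 0 is the most significant bit). Verify for the RAD6000. */
#define FPSCR_VX 0x20000000u /* invalid operation summary */
#define FPSCR_OX 0x10000000u /* overflow */
#define FPSCR_ZX 0x04000000u /* divide by zero */

/* ASSUMPTION: illustrative choice of which exceptions count as errors. */
#define FPSCR_ERR_MASK (FPSCR_VX | FPSCR_OX | FPSCR_ZX)

/* Hypothetical stand-in for the VxWorks task control block. */
typedef struct {
    int task_id;
} fake_tcb_t;

/* Host-side stand-in for the FPSCR. On the target, FPSCRread() would
 * read the register that the OS has just restored for the new task. */
uint32_t sim_fpscr = 0u;

uint32_t FPSCRread(void)
{
    return sim_fpscr;
}

/* Hypothetical record of the most recent detection. */
int last_faulting_task = -1;
uint32_t last_fault_bits = 0u;

/* Task switch hook: runs at every context switch, after the new task's
 * registers are in place, so the FPSCR describes the new task. */
void fp_poll_switch_hook(fake_tcb_t *old_tcb, fake_tcb_t *new_tcb)
{
    uint32_t faults;

    (void)old_tcb; /* not needed for polling */
    assert(new_tcb != NULL);

    /* Read, mask and test: the entire monitoring cost per switch. */
    faults = FPSCRread() & FPSCR_ERR_MASK;
    if (faults != 0u) {
        last_faulting_task = new_tcb->task_id;
        last_fault_bits = faults;
    }
}
```

On the target, a hook of this shape would take WIND_TCB pointers and be registered once at startup with taskSwitchHookAdd().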
11. Detection Latency
- Maximum latency is deterministic for each task
- Example: G&C attitude controller task at 50 Hz
  - Maximum latency for G&C floating point error detection: one task period, 1/(50 Hz) = 20 ms
  - Acceptable
12. STEREO Floating Point Exception Polling
- This approach guarantees that all tasks will be monitored
  - Every task runs as a result of a context switch
- Floating point error monitoring occurs at a bounded rate
  - Minimum: scheduled task rate
  - Maximum: on the order of the rate of the highest rate task in the system
- Acceptable detection latency
- Insignificant overhead cost
  - Monitoring consists of FPSCRread(), mask and test
- Floating point error handling can easily discriminate based upon the predefined criticality of the faulting task
  - Floating point errors in non-critical tasks are recorded and the FPSCR is cleared
  - Floating point errors in critical tasks are recorded and initiate a system reset
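The criticality-based handling described above might look like the following. This is a hedged sketch rather than the flight implementation: task_fp_state_t, fp_action_t, handle_fp_status and the fault counter are hypothetical, the saved FPSCR copy is modeled as a plain struct field, and the error mask assumes the PowerPC FPSCR bit layout (invalid operation, overflow, divide by zero). The system reset is represented by a return value so the sketch can run on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: VX | OX | ZX in the PowerPC FPSCR layout; illustrative. */
#define FPSCR_ERR_MASK 0x34000000u

/* Hypothetical outcome of one polling pass for one task. */
typedef enum {
    FP_NONE,                 /* no error bits set */
    FP_RECORDED_AND_CLEARED, /* non-critical task: log and clear */
    FP_RESET_REQUESTED       /* critical task: log and request reset */
} fp_action_t;

/* Hypothetical per-task bookkeeping. */
typedef struct {
    int task_id;
    int is_critical;      /* predefined criticality of the task */
    uint32_t fpscr;       /* the task's saved FPSCR copy */
    unsigned fault_count; /* recorded floating point errors */
} task_fp_state_t;

/* Mask and test the task's FPSCR, then act on its criticality. */
fp_action_t handle_fp_status(task_fp_state_t *t)
{
    assert(t != NULL);

    if ((t->fpscr & FPSCR_ERR_MASK) == 0u) {
        return FP_NONE;
    }

    t->fault_count++; /* every detected error is recorded */

    if (t->is_critical) {
        /* The caller would initiate the system reset. */
        return FP_RESET_REQUESTED;
    }

    /* Non-critical: clear the sticky bits so the same error is not
     * reported again at the next context switch. */
    t->fpscr &= ~FPSCR_ERR_MASK;
    return FP_RECORDED_AND_CLEARED;
}
```

Returning an action instead of resetting directly keeps the policy decision (record, clear, reset) separate from the mechanism that carries it out, which is one way to customize handling per task.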
13. STEREO Floating Point Exception Polling
- Interesting note
- Rarely, the OS executes the task switch hook with new task ID 0, and the FPSCR register contains seemingly erroneous data
- Task ID 0, coincidentally, matches the task ID of the C&DH and G&C Idler tasks, which perform no floating point operations
    IdlerTask()
    {
        for (;;)
        {
        }
    }
- This was investigated extensively by the STEREO FSW team; a specific cause could not be identified
- Empirically determined through all phases of testing to be a false positive
- Over 4 processor-years of this code operating post-launch validate this assertion
14. STEREO Floating Point Exception Polling
- Task switch hook FPSCR polling is a good software solution to the feature traditionally implemented in hardware that has come to be taken for granted
- System-wide approach
  - Can easily be customized per task
  - Guarantees that all system tasks are monitored within one solution
- Tasks are monitored at a sufficiently high rate
15. STEREO Floating Point Exception Polling