Christopher A. Monaco - PowerPoint PPT Presentation


1
STEREO (The Solar TErrestrial RElations
Observatory) Flight Software's Unconventional
Solution to Floating Point Error Handling

Christopher A. Monaco
JHU/APL
FSW-07 Workshop
Laurel, MD
5/6-Nov-2007

2
Early STEREO FSW Development

A task of the STEREO FSW team during development
was to implement a floating point exception
handler
On the RAD6000 this is not as straightforward as
it sounds
This is what we learned and what we did about it

3
Background

Each of the two STEREO spacecraft has 2 RAD6000
processors that operate the SC bus
One for the Command and Data Handling (C&DH)
subsystem and one for the Guidance and Control
(G&C) subsystem
The C&DH and G&C processors run the VxWorks 5.3.1
operating system
The RAD6000 is a POWER-architecture processor
Based on the IBM Federal Systems RSC6000 circa
1985
The first RAD6000 was launched in 1996 on the
Mars Pathfinder.
Since then approximately 150 RAD6000s have been
flown on various missions, including the
rovers Spirit and Opportunity, Deep Space 1,
Genesis and Stardust, Mars Polar Lander, Mars
Climate Orbiter, and APL's MESSENGER

4
RSC6000

The RSC6000 has 3 semi-autonomous processing units
that each implement their own instruction
pipeline:
Instruction Stream processor or Branch Unit (BU)
Fixed Point Unit (FXU)
Floating Point Unit (FPU)
The 3 processing units execute somewhat
independently.
Several instructions may be in various phases of
execution at any particular instant
Instructions across the 3 pipelines often finish
in a different order from that defined by the
program
This is in contrast to the sequential model of
program execution, in which each instruction must
completely finish before the next begins
Pipelined instruction execution is responsible
for significant performance improvements made by
the POWER architecture

5
IEEE 754 Floating Point Standard

The IEEE 754 Floating Point Standard was first
adopted by IEEE/ANSI in 1985
The standard requires that a faulting instruction
be accurately identifiable within the exception
trap
In general this requirement is met by chip
designers by implementing a precise interrupt
Implementing a precise interrupt is
complicated by the 3 somewhat independent
pipelines of the RSC6000 and their
out-of-order instruction sequencing
Precise interrupt: an interrupt or exception
is precise if the saved processor state
corresponds with the sequential model of program
execution, where one instruction's execution ends
before the next begins.

6
RSC6000 Floating Point

Designers of the RSC6000 had a choice:
(1) Implement precise interrupt - Invent a
complex scheme for identifying a faulting
instruction, rolling back instructions that
executed out-of-sequence in the other pipelines,
and restoring processor state
(2) Implement precise interrupt - Enforce
explicit instruction execution sequencing
(serialize the pipelines)
Each instruction must complete (exception-free)
before subsequent instructions may begin
Performance hit of 2-3x
(3) Give up the ability to identify the faulting
instruction
Software must poll floating point registers for
exceptions

7
Floating Point Exceptions

Buried at the end of the Floating Point
Exceptions section in the POWER Processor
Architecture Manual Version 1.52, regarding
trapping floating point exceptions:
System performance with MSR(FE) = 1 may be
significantly degraded
Regarding polling for floating point exceptions,
RSC6000 and RAD6000 literature offers little
guidance:
inserting test code after each floating point
operation
Adding test code after each floating point
operation is too invasive, particularly since a
significant portion of the G&C code is
autogenerated MATLAB RTW code
Each task involving floating point operations
would require modification
The compiler may provide several options at
different levels: subroutine, loop exit,
statement assignment, or after each floating
point instruction
No obvious compiler solution offered detection
and appropriate handling of floating point errors
while also avoiding performance loss
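
To make the invasiveness concrete, here is a minimal host-side sketch of per-operation test code. It is not STEREO or RAD6000 code: `fp_check`, `ratio_plain`, `ratio_checked`, and `fp_error_seen` are hypothetical names, and the standard C `isfinite` test on each result is an assumed stand-in for reading and testing the FPSCR after each operation.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical global flag; a flight system would log or reset instead. */
static int fp_error_seen = 0;

/* Test code to run after every floating point operation.
 * Assumption: isfinite() on the result stands in for an FPSCR
 * read, mask, and test. */
static double fp_check(double result)
{
    if (!isfinite(result)) {
        fp_error_seen = 1;
    }
    return result;
}

/* Original computation, roughly as a code generator might emit it. */
static double ratio_plain(double a, double b, double c)
{
    return (a + b) / c;
}

/* Same computation with a check inserted after each operation. */
static double ratio_checked(double a, double b, double c)
{
    /* Sketch precondition: inputs are assumed finite. */
    assert(isfinite(a) && isfinite(b) && isfinite(c));
    double sum = fp_check(a + b);
    return fp_check(sum / c);
}
```

Even this two-operation expression has to be split into separate statements, which suggests why the same treatment across every task, including generated code, was judged too invasive.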

8
STEREO FP Exception Handling Options

(2) Enforce explicit instruction sequencing: Trap
floating point exceptions
+ System-wide solution
+ Conventional
+ No latency in error detection
- Significant overall system performance
degradation (2-3x) associated with serializing the
3 pipelines

(3) Give up precise interrupt and poll for
exceptions: Polling for floating point exceptions
+ We don't really NEED to know the exact
instruction that caused the error. Reset system
in case of critical task floating point error
- Polling results in latency between the
occurrence and detection of error
Small latency can be tolerated
Good software practices => floating point
exceptions should be VERY rare!
+ 2-3x faster than option (2). Take advantage
of parallel pipelines!

9
VxWorks

VxWorks associates a copy of the FPU registers
with each user task
VxWorks saves and restores FPU registers at
context switches
Polling in a particular task context would only
catch exceptions occurring within that task since
the last poll
VxWorks offers a hook into the OS in which
developers may insert user code to execute at
task context switches
taskHookLib: STATUS taskSwitchHookAdd (FUNCPTR switchHook)
Arguments to the user-supplied task switch hook
are pointers to the old_tcb and new_tcb
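
The registration step can be sketched as follows. This is a host-side illustration only: on a real target the declarations would come from the VxWorks headers, whereas here `STATUS`, `OK`, `ERROR`, `FUNCPTR`, `WIND_TCB`, and `taskSwitchHookAdd` itself are simplified stand-ins so that the sketch compiles without VxWorks. `fpMonitorHook`, `fpMonitorInit`, `simulate_switch`, and the `taskId` field are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/* ---- Host-side stand-ins for VxWorks declarations (assumptions) ---- */
typedef int STATUS;
#define OK    0
#define ERROR (-1)
typedef int (*FUNCPTR)(void);
typedef struct wind_tcb { int taskId; } WIND_TCB; /* minimal, hypothetical */

typedef void (*SWITCH_HOOK)(WIND_TCB *pOldTcb, WIND_TCB *pNewTcb);
static SWITCH_HOOK installedHook = NULL;

/* Stand-in with the signature quoted on the slide. */
STATUS taskSwitchHookAdd(FUNCPTR switchHook)
{
    if (switchHook == NULL) {
        return ERROR;
    }
    installedHook = (SWITCH_HOOK)switchHook;
    return OK;
}

/* Stand-in for the kernel invoking the hook at a context switch. */
static void simulate_switch(WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
{
    if (installedHook != NULL) {
        installedHook(pOldTcb, pNewTcb);
    }
}

/* ---- User code: the hook and its registration ---- */
static int switchCount   = 0;
static int lastNewTaskId = -1;

/* Receives pointers to the outgoing and incoming task control blocks. */
static void fpMonitorHook(WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
{
    (void)pOldTcb;
    assert(pNewTcb != NULL);
    switchCount++;
    lastNewTaskId = pNewTcb->taskId;
}

static STATUS fpMonitorInit(void)
{
    return taskSwitchHookAdd((FUNCPTR)fpMonitorHook);
}
```

The hook here only counts switches and records the incoming task; the point is that one registration call covers every task in the system.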

10
STEREO Floating Point Exception Polling

VxWorks saves and restores a task's registers
prior to calling the user task switch hook routine
The switch hook behaves as though it were
executing within the context of the new task.
Therefore, FPSCRread() within the task switch
hook supplies the FPSCR associated with the new
task
Did the new task suffer a floating point exception
the last time it ran?
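
The ordering described above (save the old task's registers, restore the new task's, then call the hook) can be simulated on a host as a rough sketch. Nothing here is VxWorks or STEREO code: `Task`, `hw_fpscr`, `context_switch`, `switch_hook`, and `fault_flagged` are hypothetical, and this `FPSCRread` is a stand-in that returns a simulated register. As a simplification, any nonzero FPSCR value counts as a fault.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TASKS 2

typedef struct {
    int      id;           /* index into fault_flagged */
    uint32_t saved_fpscr;  /* per-task copy kept by the (simulated) OS */
} Task;

static uint32_t hw_fpscr = 0;          /* simulated hardware FPSCR */
static int fault_flagged[NUM_TASKS];   /* set by the hook */

/* Stand-in for the routine named on the slide. */
static uint32_t FPSCRread(void)
{
    return hw_fpscr;
}

/* Runs after the new task's registers are in place, so it sees
 * the FPSCR left behind by the new task's previous run. */
static void switch_hook(const Task *old_task, const Task *new_task)
{
    (void)old_task;
    assert(new_task->id >= 0 && new_task->id < NUM_TASKS);
    if (FPSCRread() != 0u) {
        fault_flagged[new_task->id] = 1;
    }
}

static void context_switch(Task *old_task, Task *new_task)
{
    old_task->saved_fpscr = hw_fpscr;  /* save outgoing task's FPSCR */
    hw_fpscr = new_task->saved_fpscr;  /* restore incoming task's FPSCR */
    switch_hook(old_task, new_task);   /* hook is called last */
}
```

In this model a fault raised while a task runs is not seen when that task is switched out; it is seen when the task is next switched in, which is the source of the detection latency.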

11
Detection Latency

Maximum latency is deterministic for each task
Example: G&C attitude controller task at 50 Hz
Maximum latency for G&C floating point error
detection = 20 ms
Acceptable
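
The 20 ms figure follows from the task period, assuming the task runs on schedule: a fault raised during one run is seen no later than the task's next switch-in, one period later. The symbols below are notation introduced here, not from the slides.

```latex
T_{\max} = \frac{1}{f_{\text{task}}} = \frac{1}{50\ \text{Hz}} = 0.020\ \text{s} = 20\ \text{ms}
```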

[Figure: Actual Latency]

12
STEREO Floating Point Exception Polling

This approach guarantees that all tasks will be
monitored
Every task runs as a result of a context switch
Floating point error monitoring occurs at a
bounded rate
Minimum: the scheduled task rate
Maximum: on the order of the rate of the
highest-rate task in the system
Acceptable detection latency
Insignificant overhead cost
Monitoring consists of FPSCRread(), mask, and
test
Floating point error handling can easily
discriminate based upon the predefined
criticality of the faulting task
Floating point errors in non-critical tasks are
recorded and FPSCR is cleared
Floating point errors in critical tasks are
recorded and initiate a system reset
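
The mask-and-test step and the criticality policy might look like the sketch below. Several things are assumptions rather than facts from the slides: the bit masks follow the PowerPC-style FPSCR layout (bit 0 is the most significant bit), the choice of which exceptions to monitor is illustrative, and clearing only the monitored bits is one reading of "FPSCR is cleared". `TaskFpState`, `FpAction`, and `fp_monitor` are hypothetical names, and the per-task FPSCR is simulated as a struct field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed PowerPC-style FPSCR exception bits (bit 0 = MSB). */
#define FPSCR_VX 0x20000000u  /* invalid operation summary */
#define FPSCR_OX 0x10000000u  /* overflow */
#define FPSCR_ZX 0x04000000u  /* zero divide */

/* Illustrative choice of monitored exceptions (assumption). */
#define FP_ERROR_MASK (FPSCR_VX | FPSCR_OX | FPSCR_ZX)

typedef enum {
    FP_NONE,                /* nothing to report */
    FP_LOGGED_AND_CLEARED,  /* non-critical task: record and clear */
    FP_RESET_REQUESTED      /* critical task: record and reset system */
} FpAction;

typedef struct {
    int      is_critical;   /* predefined criticality of the task */
    uint32_t fpscr;         /* simulated per-task FPSCR */
    unsigned error_count;   /* stands in for the error record */
} TaskFpState;

/* Read, mask, test, then act according to task criticality. */
static FpAction fp_monitor(TaskFpState *task)
{
    assert(task != NULL);
    uint32_t errors = task->fpscr & FP_ERROR_MASK;
    if (errors == 0u) {
        return FP_NONE;
    }
    task->error_count++;               /* record the error */
    if (task->is_critical) {
        return FP_RESET_REQUESTED;     /* caller initiates the reset */
    }
    task->fpscr &= ~FP_ERROR_MASK;     /* clear monitored bits only */
    return FP_LOGGED_AND_CLEARED;
}
```

Returning an action instead of resetting directly keeps the sketch testable on a host; the per-switch cost is one read, one AND, and one compare in the common no-error case.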

13
STEREO Floating Point Exception Polling

Interesting Note
Rarely, the OS executes the task switch hook with
new task ID 0 and the FPSCR register contains
seemingly erroneous data
Task ID 0, coincidentally, matches the task ID
of the C&DH and G&C Idler tasks, which perform no
floating point operations
IdlerTask() { for (;;); }
This was investigated extensively by the STEREO
FSW team; a specific cause could not be
identified
Empirically determined through all phases of
testing to be a false positive
Over 4 processor-years of this code operating
post-launch validate this assertion

14
STEREO Floating Point Exception Polling

Task switch hook FPSCR polling is a good software
substitute for a feature traditionally implemented
in hardware that has come to be taken for granted
System-wide approach
Can easily be customized per task
Guarantees that all system tasks are monitored
within one solution
Tasks are monitored at a sufficiently high rate
