Title: IV&V Lessons Learned: Mars Exploration Rovers and the Spirit SOL-18 Anomaly
1. IV&V Lessons Learned: Mars Exploration Rovers and the Spirit SOL-18 Anomaly
NASA IV&V Involvement
August 2004
Kenneth Costello, Senior IV&V Manager
NASA IV&V Facility, 100 University Dr, Fairmont, West Virginia 26508
Kenneth.A.Costello_at_nasa.gov, 304 367 8343
2. Introduction
- Purpose
  - This is an informational presentation that gives a quick overview of IV&V and shares some lessons learned for IV&V from the Mars Exploration Rover project
- Agenda
  - Overview of NASA IV&V
  - Background on IV&V involvement with the MER program
  - IV&V issues related to the system memory and file system
  - IV&V Lessons Learned
  - Summary
3. What is NASA IV&V?
- NASA IV&V is a program managed by the Safety and Mission Assurance Office
- The program is delegated to the Goddard Space Flight Center and is managed from the NASA IV&V Facility in Fairmont, West Virginia
  - Facility was dedicated in 1994
  - Focus on unmanned missions began around 2000
- The program has two main roles for the Agency
  - The first role is to provide an independent V&V capability and ensure mission software readiness for critical projects, focused on risk and safety
  - The second role is to enhance software readiness by providing IV&V domain expertise to projects to identify issues/defects and propose possible solutions
4. Scope Determination: Software Integrity Level Assessment Process
For each software component:
- Criticality (each category rated with a 1-5 score, where 5 is highest)
  - Human Safety
  - Asset Safety
  - Performance
- Error Potential (each category rated with a 1-5 score, where 5 is highest)
  - Development Organization
  - Development Process
  - Software Characteristics
- Each factor has an associated weight applied before being combined into a final value.
- Final values are plotted on a 5x5 matrix for reference. Values are also used to cross-reference a task matrix.
- Task selection is from a standardized list of tasks. Allocation is based on criticality value and on error potential value individually.
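The weighting step described on this slide can be sketched as below. This is only an illustration under stated assumptions: the slide does not give the weights, the combination rule, or the rounding used for the 5x5 matrix, so the weight values, the weighted-average formula, the half-up rounding, and the names `weighted_score` and `matrix_cell` are all hypothetical.

```python
import math

# Hypothetical weights: the slide says each factor is weighted before
# combination but gives no values, so these numbers are illustrative only.
CRITICALITY_WEIGHTS = {"human_safety": 5, "asset_safety": 3, "performance": 2}
ERROR_POTENTIAL_WEIGHTS = {
    "development_organization": 3,
    "development_process": 3,
    "software_characteristics": 4,
}


def weighted_score(ratings, weights):
    """Combine 1-5 category ratings into one value as a weighted average."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must name the same categories")
    for name, rating in ratings.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"{name}: rating must be between 1 and 5")
    total_weight = sum(weights.values())
    return sum(ratings[name] * weights[name] for name in weights) / total_weight


def matrix_cell(criticality, error_potential):
    """Round each combined value half-up to place a component on the 5x5 matrix."""
    return (math.floor(criticality + 0.5), math.floor(error_potential + 0.5))


# Made-up ratings for one software component (not MER assessment data).
criticality = weighted_score(
    {"human_safety": 1, "asset_safety": 5, "performance": 5},
    CRITICALITY_WEIGHTS,
)
error_potential = weighted_score(
    {
        "development_organization": 2,
        "development_process": 2,
        "software_characteristics": 3,
    },
    ERROR_POTENTIAL_WEIGHTS,
)
print(criticality, error_potential, matrix_cell(criticality, error_potential))
```

Keeping the two values separate, rather than merging them into one number, mirrors the slide's statement that task allocation is based on the criticality value and the error potential value individually.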
5. IV&V Lifecycle Flow
[Figure: IV&V lifecycle flow. Development phases shown: Concept Phase, System Requirements, Software Planning, Software Requirements, Design, Implementation, Validation Testing (Simulator/Environment/Hardware), Maintenance. A Verification activity is attached to each phase.]
Figure annotations:
- Focused activity at the earliest point. System requirements and software role important. Issues are introduced at lowest level.
- Covers all levels of testing. Ensure that system meets the needs of the mission.
- IV&V in phase with development (not testing only).
- Later life cycle activity still important. Issues are still introduced at lowest level. Focused more on individual components.
- IV&V support continues over initial operational phase.
6. IV&V Activities for MER
- Initial assessment of the MER project performed in June 2001
- Results of assessment noted that the file system was a very critical portion of the flight software (FSW); however, the scores for the technology being used and the maturity of the software indicated low risk
- Some portions were rated as high complexity
- Overall, the file system software was within the IV&V scope, though at a low level
- Initial estimate of the IV&V resources was 9-10 FTEs
- The MER Project had not budgeted for that level of IV&V resources
- Final IV&V resources were 4-5 FTEs
- Reduction in resources necessitated changes in the approach to IV&V
- Goal was to cover the MER FSW to a reasonable depth so that the IV&V Team could feel comfortable supporting launch and operational readiness reviews for the project
- Tasking was pulled up to a higher level than normal: analysis was applied at a complete FSW level rather than at a software component level
- Additional issue: only a limited number of FSW requirement artifacts were available
7. Summary of Spirit Sol-18 System Memory Consumption
- Sol 18
  - 9:00 LST: The planned DTE HGA communication session began.
  - 9:11 LST: Event Reports were received indicating uplink errors were occurring. Downlink was spotty.
  - 9:16 LST: The signal was lost. This was 14 minutes earlier than expected.
  - 11:20 LST: Commanded a 30-minute high priority HGA communication session. No signal was seen.
  - 12:45 LST: Commanded an LGA beep. The beep occurred as predicted (start and duration).
  - 16:18 LST: Odyssey UHF pass over Spirit; no carrier seen.
- Sol 19
  - 1:45 LST: The MGS UHF communications session lasted only 2 minutes and 20 seconds. It did start at the correct time, but only a repeating pseudo-noise code was present in the data.
  - 4:39 LST: No early morning UHF communication session with the Odyssey spacecraft (no signal or data).
  - 9:00 LST: No morning HGA DTE communication session. No signal or data were detected.
  - 11:00 LST: Looked for 10 bps LGA DTE communication session initiated by a system fault protection response. No signal was seen.
  - 14:40 LST: Commanded beep at 7.8125 bps. Beep was seen!
  - 15:24: No afternoon UHF communication session with the Odyssey spacecraft (no signal or data).
  - 15:27: Attempted to command an LGA DTE communication session. No signal or data was received.
- A system level fault had occurred on Sol 19 that put the rover in a degraded communication state and allowed some commanding
- Eventually, JPL was able to determine that FSW was in a continuous delayed reset loop. The first reset seemed to occur during the Sol 18 morning DTE session, coincident with an actuator checkout
- Both commanded and autonomous shutdowns were failing, and the vehicle probably had not shut down in a while
8. Root Cause
- The root cause was traced to two configuration parameters in the VxWorks operating system
  - Configuration parameters of the dosFsLib module permitted the unbounded consumption of memory from the system memory heap as the FLASH file system was populated with an increasing number of files
  - The configuration parameters of the memPartLib module were set so that the logic would suspend the execution of any task that requested memory when no additional memory was available
  - This had the undesirable effect of suspending a critical task when the memory space was exhausted
- Other effects included memory corruption, inability to turn the vehicle off (due to task deadlock), and repeating system resets
- Contributing factors included the compressed development schedule, unanticipated behavior of the FSW, incomplete development (analysis of the effects of the dosFsLib parameters was never fully completed), a test program that was not equivalent to operational use, and inadequate telemetry
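The interaction between the two settings can be illustrated with a toy model. This is a sketch only: it does not use or reproduce the VxWorks dosFsLib or memPartLib interfaces, and the `SimulatedHeap` class, the `create_files` helper, the task names, and every size below are hypothetical values chosen to make the behavior easy to follow.

```python
class HeapExhausted(Exception):
    """Raised when the allocator reports failure instead of suspending."""


class SimulatedHeap:
    """Toy fixed-size heap; not the VxWorks memPartLib or dosFsLib API."""

    def __init__(self, capacity_bytes, suspend_on_exhaustion):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.suspend_on_exhaustion = suspend_on_exhaustion
        self.suspended_tasks = []

    def allocate(self, task_name, size_bytes):
        """Return True on success. On exhaustion, suspend the caller or raise."""
        if self.used_bytes + size_bytes <= self.capacity_bytes:
            self.used_bytes += size_bytes
            return True
        if self.suspend_on_exhaustion:
            # Models the "suspend any task that requests memory" setting.
            self.suspended_tasks.append(task_name)
            return False
        raise HeapExhausted(f"{task_name} could not allocate {size_bytes} bytes")


def create_files(heap, file_count, bytes_per_file, task_name="file_system_task"):
    """Create files, each pinning bytes_per_file of heap; return how many succeeded."""
    created = 0
    for _ in range(file_count):
        if not heap.allocate(task_name, bytes_per_file):
            break
        created += 1
    return created


# Hypothetical sizes: heap use grows with the number of files, with no cap.
heap = SimulatedHeap(capacity_bytes=1000, suspend_on_exhaustion=True)
files_created = create_files(heap, file_count=100, bytes_per_file=64)

# Once the heap is full, an unrelated critical task is suspended as well.
critical_ok = heap.allocate("critical_task", 128)
print(files_created, critical_ok, heap.suspended_tasks)
```

In this model, constructing the heap with `suspend_on_exhaustion=False` makes the failure visible to the caller as an exception rather than silently suspending the task. That is shown only as a contrast for the sketch, not as a description of the fix applied to the flight software.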
9. IV&V Findings Related to the System Memory
- Requirement and test completeness
  - IV&V Risk 1 on Requirements (extended to include test) remained in Significant Concern status at the time of upload
  - Chief concern was that software requirements discovery was not complete and that software had not been adequately tested at the time of the upload
- Specific TIMs
  - Specific TIMs were written against the insufficient unit tests for portions of the file system using the system memory
  - Project asserted testing was complete but without documentation
  - These TIMs were still in Open state at the time of the final upload
- Code Complexity
  - Portions of the file system using the system memory were consistently reported to be very complex
  - Modules were reported to have poor testability and poor maintainability
- Code Stability
  - File system modules were being worked on until the last release (R8.1d, 11/20/03)
  - File Meta Engine had 10% of its total code changed as late as Release 8.0, and had 9% of its total code changed for Release 8.1
- Note that the file system was not the cause of the problem, but brought the lack of memory to light and created the task deadlock
10. IV&V Concerns over Requirements and Test
- Upload Readiness Review (11/25/03)
  - Plans were to upload final FSW on 12/2/03; the review was to determine readiness
  - IV&V recommended further testing before upload, delaying upload past Dec 2
- Operational Readiness Review (12/5/03)
  - Aggregate of requirement and test issues represents a risk being tracked in IV&V Risks
  - Final Requirements Risk status was Significant Concern (middle of three possible levels)
  - IV&V Concern: "There remains an IV&V concern about the possibility of requirements-related surprises during operations. IV&V has a less optimistic view of the requirements discovery than does the project."
  - Potential Consequence for Surface ops: "Possible loss of science return" ("Possible loss of science return" means the situation we are currently seeing: significant time to detect, understand, and correct problems on the surface)
  - Reiteration of the 11/25/03 IV&V recommendation for further testing before upload (by 12/5/03 the upload had already occurred, the project having proceeded with the planned upload on 12/2/03)
  - Recommendation to "Continue testing to the extent possible"
  - Recommendation to "Ensure test results are adequately reviewed"
- Project emphasis on "test as you fly" (vs. formal unit and requirements-based tests) didn't find the problem
11. IV&V Lessons Learned
- Resources
  - The low level of resources applied to such a large and complex project was not sufficient
  - The goal of analyzing the software at a depth that would allow the IV&V Team to feel confident when supporting project readiness reviews had to be maintained
  - This forced a shift from a software component approach to a more whole-system approach
  - Resources for IV&V should be such that a software component approach can be maintained throughout a project SDLC
- Lack of Artifacts
  - Current IV&V Facility processes are very requirements driven
  - The lack of FSW requirements artifacts on the MER Project affected the IV&V work being performed and also helped to move the approach away from a component level analysis
  - Additionally, projects are not generally required to follow a standardized software development life cycle
  - The IV&V Facility needs to examine its requirements-driven approach and generate some alternative approaches to performing IV&V on projects lacking software artifacts
12. IV&V Lessons Learned
- Pursuing Risks
  - Early on, the IV&V Team documented the requirements risk
  - With the IV&V Team, the project would only address specific problems that were realizations of the risk, not the risk itself
  - Otherwise, the planned testing program mitigated the risk in the project's eyes
  - The IV&V Team was still concerned, but the lack of FSW requirements made it difficult to fully examine the consequences and likelihood of the risk
  - The IV&V Team eventually accepted the test program as a mitigation to the risk
  - However, as milestone reviews neared, the testing in some cases had not been completed
  - The project continued testing up to the last minute
  - Additionally, the lack of requirements artifacts placed the MER Project into the position of testing with incomplete requirements
  - Testing was driven more by scenarios generated by system engineers such that they felt that the system was fully exercised; IV&V had no insight into how the scenarios were developed
  - The IV&V Team needs to be more proactive in assessing mitigation efforts early in the SDLC so as to more effectively support projects
  - Additionally, projects should enforce and follow good software engineering practices, including good requirements development to support a mature test program
13. IV&V Contributing Factors
- The IV&V Team needs to be intimately involved with the development team
  - The MER project's compressed schedule created a schedule risk from outside parties
  - The IV&V team was not able to work directly with the developer
  - Additionally, there was no access to the development issue database or the low level testing artifacts that would allow IV&V to perform a more in-depth analysis
  - Projects need to integrate the IV&V process into the development process in order to gain maximum advantage of the resources being offered
  - Need to monitor the relationship to ensure that independence is not lost
- More specific attention to COTS products
  - The root cause in this case was the incorrect use of a COTS product
  - The IV&V team usually analyzes the use of and interfaces between COTS and developed code, since the content of most COTS products is not visible
  - The IV&V team was not able to perform that level of analysis on this mission due to resource constraints
14. Summary
- The IV&V approach was modified based on various project-specific factors that caused the analysis approach to be elevated to a full system approach rather than the normal software component approach
- Even at the full system approach, the IV&V team identified potentially troubling areas involving the system memory usage: risk tracking, issue tracking, code analysis, requirements analysis, test analysis, code complexity, and code stability
- However, the lack of complete requirements documents and testing documentation, both identified by IV&V as project deficiencies, hindered finding the specific problem prior to upload
- The IV&V Facility is examining the lessons learned to determine what actions to take to ensure better service on other IV&V projects