Title: Design and Test Technology for Automotive Electronic Systems
1Design and Test Technology for Automotive Electronic Systems
Andreas Steininger, Vienna University of Technology
2My contact data
Andreas Steininger
Vienna University of Technology, Faculty of Informatics
Institute of Computer Engineering, Embedded Computing Systems Group
Treitlstrasse 3, A-1040 Vienna, Austria
steininger@ecs.tuwien.ac.at
http://ti.tuwien.ac.at/ecs
3Outline
- Automotive electronics: the specific situation
- Node-level view
  - designing cost efficient dependable nodes
  - test: purpose & techniques
- System-level view
  - communication system
  - test: purposes, challenges & techniques
- Summary
4Main Contributors to this Material
- Dr. Thomas Kottke, R. Bosch AG / EADS
- Dr. Christoph Scherrer, Alcatel / Thales
- Dr. Eric Armengaud, DecomSys / VirtualVehicle
- Dr. Karl Thaller, DecomSys / Elektrobit Austria
- Dr. Martin Horauer, UAT Technikum Wien
5Electronics in Cars: some Facts
- high proportion of value
  - up to 30%
- high development potential
  - more than 80% of the innovations
- high number of Electronic Control Units (ECUs)
  - up to 70
- complex distributed system
  - different networks & topologies
6Electronics in Cars - Benefits
- cheap alternative to existing mechanical solutions
  - lighter, smaller, cheaper, more flexible, ...
- enabler for further optimizations
  - electronic ignition, motor management, ...
- key to new functionality
  - safety: ESP, active suspension, crash sensing
  - comfort: air conditioning, infotainment, ...
  - security: immobilizer, alarm, electronic key, GPS tracking, ...
  - autonomy: anticipatory braking, lane keeping, ...
7Key Demands
- Safety
- Real-Time
- Low Cost
- Robustness
- Testability
8Key Demands: Safety
- high risk potential (energy!)
- high public awareness
- no safe state (in general)
- certification required (EN 61508, ISO 26262)
- high complexity of system & application
- legal issues (liability)
9Key Demands: Real-Time
- engine: 6000 rpm = 1 revolution per 10 ms
- VDM: 100 km/h = 28 cm per 10 ms
- need to synchronize distributed activities
- real-time communication
- image processing tasks
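The two timing figures on this slide (one engine revolution per 10 ms at 6000 rpm, roughly 28 cm of travel per 10 ms at 100 km/h) are plain unit conversions. A minimal Python sketch; the function names are illustrative and not taken from the slides:

```python
def ms_per_revolution(rpm: float) -> float:
    """Time for one crankshaft revolution in milliseconds."""
    return 60_000.0 / rpm

def cm_travelled(speed_kmh: float, interval_ms: float) -> float:
    """Distance in centimetres covered within interval_ms at speed_kmh."""
    metres_per_second = speed_kmh / 3.6
    return metres_per_second * (interval_ms / 1000.0) * 100.0

print(ms_per_revolution(6000))          # 10.0
print(round(cm_travelled(100, 10), 1))  # 27.8
```

A control loop that misses a single 10 ms deadline therefore acts on data that is a full revolution, or about a quarter of a metre, out of date.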
10Key Demands: Low Cost
- extreme competition
- high cost inhibits introduction
- tailored safety concepts
  - minimum degree of replication
  - use structural redundancies
- generic solutions
  - scalable, configurable, flexible
- marginal costs beat NRE
11Key Demands: Robustness
- wide temperature range
- temperature cycles
- humidity
- vibrations
- EMI (radiated & conducted)
- decreasing noise margins
- service by non-experts
12Key Demands: Testability
- complex distributed system
- many options & configs
- multi-vendor system
- startup in less than 1 sec
- high availability & reliability
- diagnosis by non-expert
- online testing required?
13How to attain Safe Operation?
- fault avoidance
- fault tolerance
- failure mode
- fault type
- number of faults
14How to attain Safe Operation? (fault avoidance)
- protect system from faults
- impractical
  - hostile environment
  - MTTF up to 10^9 h to be guaranteed
  - susceptibility of electronics is known
15How to attain Safe Operation? (fault tolerance)
- accept occurrence of faults, BUT detect & handle them appropriately to avoid failure
- introduce redundancy
16How to attain Safe Operation? (failure mode)
- determined by application
- fail safe
  - detect error => safe state
  - single channel with ED
  - duplication & comparison
- fail operational
  - detect and mask error
  - voting over redundant results
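The difference between the two failure modes can be sketched in a few lines of Python. This is an illustration under simple assumptions (results are directly comparable values, replicas fail independently); the function names are hypothetical:

```python
from collections import Counter

def duplicate_and_compare(a, b):
    """Fail safe: return (value, ok). A mismatch is detected, not masked,
    so the caller has to move the system to its safe state."""
    if a == b:
        return a, True
    return None, False

def majority_vote(results):
    """Fail operational: mask a faulty replica by majority voting.
    Raises if there is no strict majority (more faults than assumed)."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: fault hypothesis violated")
    return value

print(duplicate_and_compare(42, 42))   # (42, True)
print(duplicate_and_compare(42, 43))   # (None, False)
print(majority_vote([42, 43, 42]))     # 42
```

Duplication needs two replicas but can only stop; voting needs at least three replicas but keeps delivering a result after one of them fails.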
17How to attain Safe Operation? (fault type)
- random fault
  - hits one replica only => replication sufficient
- systematic fault
  - common cause fault => avoid shared resources
  - design fault => use diversity
18How to attain Safe Operation? (number of faults)
- single fault
  - usual assumption
  - system recovered from F1 before F2 occurs
- multiple fault
  - fault accumulation
  - fault bursts
19Current Status
- fail safe functions realized
  - shut off upon error
  - mechanical fall-back system assumes control: no true by-wire functions
  - single-channel solutions sufficient
- tolerance against random faults
  - avoid design faults by field experience => no diversity
  - avoid common cause faults by design (?)
- single fault assumption
  - keep faults rare (shielding, etc.)
20Outline
- Automotive electronics: the specific situation
- Node-level view
  - designing cost efficient dependable nodes
  - test: purpose & techniques
- System-level view
  - communication system
  - test: purposes, challenges & techniques
- Summary
21Node-level Solution
- mission: make a node (processor) fault tolerant
  - need to consider CPU and memory
  - aim is fail safe (but keep option for fail op in mind)
- simplex unit with error detection capabilities
- duplication and comparison
- hybrid approach
- Ideas?
22Options for the CPU Core
- modify custom CPU core
  - parity for buses
  - two-rail coding for signals
  - self-checking implementation of simple units
  - duplicate & compare for complex units
  - careful layout
- Single core + ED
- Dual core + cmp
- Superscalar proc. + cmp + ED
23Options for the CPU Core
- duplicate custom CPU core
  - master/checker operation
  - shared (safe) memory
  - validity check for inputs
  - self-checking comparator checks equality of outputs
  - option: clock delay
  - option: mode switch
- Single core + ED
- Dual core + cmp
- Superscalar proc. + cmp + ED
24Solution Example: Dual Core Frame
- benefits
  - can use custom core without modifications
  - safety analysis valid for other cores as well
  - promises high ED coverage with moderate efforts
  - CPU is hard to protect otherwise
- crucial points
  - enable easy recovery (=> keep outage short)
  - eliminate single points of failure
  - detect common cause faults
25Protection in the Dual Core Frame
[Figure: block diagram of the dual core frame. Core 1 (Master) and Core 2 (Checker) each expose Instr. Addr., Instr., Data Addr., Data in and Data out; both are connected to Instr. Mem and Data Mem, and comparators between the cores drive Error_Sig. Protection mechanisms marked in the diagram: safe memories, parity for buses, dual-rail coding, self-checking comparators.]
26Potential for Common Cause Faults
- identical input data
- identical clock (lock step)
- shared clock generator
- shared power supply
- both processors on same die
27Temporal Diversity
- operate checker with a delay against master
  - same fault hits at different point of computation
  - therefore different effect
  - therefore better chance to detect by comparison
  - store master output for comparison
- choose delay of 1.5 clock cycles
  - larger delay causes high effort for little gain (=> experiments)
  - non-integer cycle number against clock related effects
  - easy to implement by clock inversion
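Why a delay helps against common cause faults can be shown with a small simulation. The sketch below rests on simplifying assumptions that are not from the slides: the delay is an integer number of ticks (the half-cycle offset is not modelled), the workload `f` is arbitrary, and a glitch flips bit 0 of whatever each core outputs at that tick.

```python
def run(delay, glitch_time, steps=8):
    """Return the step indices at which the comparator flags a mismatch.

    Both cores compute f(step); the checker lags `delay` ticks behind.
    A common-cause glitch at wall-clock tick `glitch_time` flips bit 0 of
    the result each core is producing at that tick.
    """
    def f(step):
        return (step * 7 + 3) & 0xFF  # arbitrary deterministic workload

    master, checker = {}, {}
    for tick in range(steps + delay):
        glitch = 1 if tick == glitch_time else 0
        m_step, c_step = tick, tick - delay
        if m_step < steps:
            master[m_step] = f(m_step) ^ glitch
        if 0 <= c_step < steps:
            checker[c_step] = f(c_step) ^ glitch
    # Master outputs are buffered until the checker result is available.
    return [s for s in range(steps) if master[s] != checker[s]]

print(run(delay=0, glitch_time=3))  # []     identical corruption, undetected
print(run(delay=2, glitch_time=3))  # [1, 3] glitch hits different steps
```

In lock step both cores are corrupted identically and the comparator stays silent; with a delay the same glitch lands on different computation steps and shows up as a mismatch.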
28Temporal Diversity Implementation
[Figure: the dual core frame (Core 1 Master, Core 2 Checker, Instr. Mem, Data Mem, comparators driving Error) extended by blocks labelled DT in the signal paths.]
29Fail Safe Dual Core Frame: Summary
- safe memories for instructions and data
- comparison of all core outputs
- parity protection for buses (data, address)
- dual rail coding for single signals (int, rst, err)
- totally self-checking comparators
- temporal diversity
- How safe is the proposed solution?
30Assessment of the Solution's Quality
- How to measure quality? (Aim is fail safe)
  - error detection coverage => detect all errors
  - error detection latency => detect them quickly
- Which method to choose?
  - theoretical analysis / modelling
  - experimental fault injection
  - field observation
31Fault Injection Experiment 1
- 2 SPEAR cores in fail safe frame (= DUT)
- synthesized to EDIF netlist
- faults injected one by one into netlist
- exhaustive list of stuck-at-1 and stuck-at-0 faults
- download to FPGA, application run
- golden device as reference (= REF)
- upon mismatch (DUT ≠ REF) => check comparator
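The structure of such a campaign (exhaustive stuck-at list, one fault at a time, comparison against a golden reference) can be imitated in software. The following is a toy analogue only: the netlist is a hypothetical full adder rather than the SPEAR frame, and simulation replaces the FPGA run.

```python
from itertools import product

# Toy netlist of a full adder in topological order: (output net, op, input nets).
NETLIST = [
    ("x1", "xor", ("a", "b")),
    ("sum", "xor", ("x1", "cin")),
    ("a1", "and", ("a", "b")),
    ("a2", "and", ("x1", "cin")),
    ("cout", "or", ("a1", "a2")),
]
INPUTS, OUTPUTS = ("a", "b", "cin"), ("sum", "cout")
OPS = {"xor": lambda x, y: x ^ y, "and": lambda x, y: x & y, "or": lambda x, y: x | y}

def simulate(pattern, fault=None):
    """Evaluate the netlist; `fault` is (net, stuck_value) or None (golden)."""
    nets = dict(zip(INPUTS, pattern))

    def force(net):
        if fault and fault[0] == net:
            nets[net] = fault[1]

    for net in INPUTS:
        force(net)
    for out, op, ins in NETLIST:
        nets[out] = OPS[op](*(nets[i] for i in ins))
        force(out)
    return tuple(nets[o] for o in OUTPUTS)

all_nets = list(INPUTS) + [gate[0] for gate in NETLIST]
faults = [(n, v) for n in all_nets for v in (0, 1)]  # exhaustive stuck-at list
patterns = list(product((0, 1), repeat=len(INPUTS)))

# A fault counts as visible if at least one pattern makes DUT differ from REF.
visible = [f for f in faults
           if any(simulate(p, f) != simulate(p) for p in patterns)]
print(f"{len(visible)}/{len(faults)} faults cause a DUT/REF mismatch")
```

In the real experiment each mismatch is additionally checked against the comparator output, which is what yields error detection coverage and latency.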
32Results of FI Experiment 1
Temporal diversity causes detection latency
Recovery becomes difficult!
33Delayed WR as a Remedy
- problem status
  - data corruption during ED latency (due to temporal diversity)
- solution approach
  - delay data output / memory WR until comparison complete
  - allow memory RD without delay (performance!)
- restrictions
  - RD-after-WR => data conflict
  - WR-after-RD => stale data
  - => compiler has to take care of this
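The delayed-write idea can be sketched as a one-entry write buffer that commits only after the comparator has agreed. This is an illustrative model, not the hardware design; the class and method names are assumptions.

```python
class DelayedWriteMemory:
    """Memory whose writes are committed only after the comparator agrees."""

    def __init__(self, size):
        self.cells = [0] * size
        self.pending = None  # at most one (address, value) awaiting the verdict

    def write(self, addr, value):
        assert self.pending is None, "previous WR not yet resolved"
        self.pending = (addr, value)

    def read(self, addr):
        # Reads bypass the write buffer: a RD right after a WR to the same
        # address would see the old value, so the compiler must avoid that.
        return self.cells[addr]

    def comparator_verdict(self, outputs_match):
        """Commit the pending write on a match, discard it on a mismatch."""
        if self.pending is not None and outputs_match:
            addr, value = self.pending
            self.cells[addr] = value
        self.pending = None

mem = DelayedWriteMemory(4)
mem.write(1, 0xAB)
mem.comparator_verdict(outputs_match=True)    # committed
mem.write(2, 0xFF)
mem.comparator_verdict(outputs_match=False)   # discarded: memory stays clean
print(mem.cells)  # [0, 171, 0, 0]
```

Because a corrupted value never reaches the cell array, recovery can start from uncorrupted memory contents.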
34Enabling fast Recovery
- error signal (dual rail)
  - notifies external component / memory
  - turns any further WR into RD (error confinement)
  - triggers processor interrupt
- status register (memory mapped)
  - updated by HW
  - indicates source of error (data parity, address mismatch, ...)
- recovery
  - can build on uncorrupted status
  - can benefit from detailed status information
35Results of FI Experiment 2
No change of memory contents in case of error
Erroneous read access is uncritical
36Fail Safe Dual Core: Summary
- duplicate & compare
  - generic approach, applicable to any core type
  - covers all (local) errors
  - need to carefully eliminate single points of failure
  - need to complement with protection for signals & buses
- temporal diversity
  - mitigates (many) common cause failures
  - requires output delay to ensure error confinement
37Squeezing out more Efficiency
- dual core is expensive?
  - normally yields performance improvement
  - would be welcome here as well: increasing performance demand @ limited clock rates
  - but exclusively dedicated to safety here
- observation: not all tasks are safety critical
- enable flexible switching between safety mode and performance mode
38Operation in Performance Mode
- cores execute different instruction streams in parallel
- both cores have direct access to memory / peripherals
- instruction caches introduced to minimize penalties from conflicting access
- temporal diversity disabled
- comparator disabled
39Requirements on the Mode Switching
- coherent operation in safety mode
  - internal states of cores must be aligned before switching to safety mode (register file, cache)
- safe operation in safety mode
  - switching must not introduce safety leakage
  - no corruption of safety-relevant data in perform. mode
- low performance penalty for mode switching
  - slow or complicated switching would spoil the anticipated performance gain
40Implementation of the Split Core Frame
41Instruction RAM Control Unit (ICU)
- handles all accesses to the instruction RAM (in case of cache miss)
- safety mode: core 1 exclusively supplies the instruction address
- performance mode: both cores request instructions independently, ICU resolves simultaneous requests
42Data RAM Control Unit (DCU)
- handles accesses to peripherals and data memory
- maintains a unique identification bit for each core
- provides a memory locking mechanism (for atomic RAM operations)
43Mode Switch Detect Units
- implemented as core-external units to still allow the use of standard cores
- snoop the bus for the mode switch instruction
- trigger the mode switch when the mode switch instruction is encountered
44Mode Switch: Safety => Performance
Instruction sequence with annotations:
  LDL r1, 248        ; load ID reg address
  LDH r1, 255
  (mode switching)   ; mode switch instr => core1 wait => core2 wait => clk align => switch mode
  LDW r2, r1         ; load & check ID bit => cond branch core2
  BTEST r2, 1
  JMPI_CT
45Mode Switch: Performance => Safety
core1 encounters mode switch instr => trigger MSU (core1 signal) => halt core1 (wait1) => interrupt core2 (message2)
core2 encounters interrupt => save context => jump to mode switch instr
core2 executes mode switch => halt core2 & switch clock => resume core1 => resume core2 after delay
46Analysis Example: Clock Switching
- switching controlled by core mode signal (dual rail coded)
- special clock routing ensures detection of all opens
- watchdog with independent time source detects clock failure
- using core mode signal as trigger detects failure to switch back to safety mode
47Fault Injection in Safety Mode
Delayed WR still ensures error confinement
48Fault Injection in Performance Mode
- fault injected in performance mode, then switch to safety mode
No undetected effects / late detections in safety mode
Watchdog important to prevent hang-up in perf mode
49Options for the CPU Core
- Single core + ED
- Dual core + cmp
- Superscalar proc. + cmp + ED
  - duplicate pipeline, modify rest
  - shared register file
  - shared (safe) memory
  - validity check for inputs
  - self-checking comparator checks equality of outputs
  - option: mode switch
50Superscalar Proc Implementation
51Mode-Switch Principle
- Mode switch from safety mode to performance mode
- Mode switch from performance mode to safety mode
Legend: MS = mode switch, Px = instr in perf mode, Sx = instr in safe mode
52Overall Comparison of Options
results from the PhD thesis of T. Kottke
53We still need a Safe Memory
- key parameters
  - 128 kB / 32-bit words
  - 0.25 µm technology, 100 MHz clock
  - soft error rate λ = 10^-12/h per bit
  - permanent error rate λ = 10^-15/h to 10^-14/h
  - operating lifetime 10 h
  - working lifetime 10^4 h (10 years)
- allowed failure rate λ_spec < 10^-10/h overall; λ_native ≈ 10^-5/h
54Implementing a Safe Memory
Why not duplicate & compare?
- detect bit flips in storage cells
  - parity (or EDC/ECC)
- detect erroneous address decoding
  - special decoder logic design
- protect interfaces
  - parity for data, address and control buses
- prevent illegal WR access
  - provide mask input for write enable
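As a reminder of what plain parity can and cannot do for a stored word, here is a minimal Python sketch. It assumes a single even-parity bit per 32-bit word; the helper names are illustrative.

```python
def parity(word: int) -> int:
    """Even parity bit: 1 if the number of set bits in word is odd."""
    return bin(word).count("1") & 1

def store(word: int) -> tuple[int, int]:
    """Return the (data, parity) pair as it would be written to memory."""
    return word, parity(word)

def check(data: int, stored_parity: int) -> bool:
    """True if the stored parity still matches the data read back."""
    return parity(data) == stored_parity

data, p = store(0x1234_5678)
print(check(data, p))             # True
print(check(data ^ (1 << 7), p))  # False: single bit flip detected
print(check(data ^ 0b11, p))      # True: double flip escapes plain parity
```

The last line shows why stronger codes (EDC/ECC) are listed as an alternative: any even number of flips within one word is invisible to a single parity bit.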
56Possible Address Decoder Errors
- correct behavior
  - any given address activates exactly one assigned memory cell
- erroneous behaviors
  - an address activates no memory cell at all
  - an address activates more than one memory cell
  - an address activates a wrong memory cell
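The four cases can be told apart by looking at the decoder's select lines. The sketch below is a behavioural classification in Python, assuming the select lines are available as a list of bits; it is not the checker circuit used in the safe memory.

```python
def classify_decoder(select_lines, expected_index):
    """Classify one decoder response given its select lines (one per cell)."""
    active = [i for i, bit in enumerate(select_lines) if bit]
    if not active:
        return "no cell activated"
    if len(active) > 1:
        return "multiple cells activated"
    return "correct" if active[0] == expected_index else "wrong cell activated"

print(classify_decoder([0, 0, 1, 0], expected_index=2))  # correct
print(classify_decoder([0, 0, 0, 0], expected_index=2))  # no cell activated
print(classify_decoder([0, 1, 1, 0], expected_index=2))  # multiple cells activated
print(classify_decoder([0, 1, 0, 0], expected_index=2))  # wrong cell activated
```

The first two error classes violate the one-hot property and can be checked on the select lines alone; the last one needs additional information about the address, which is why parity is re-checked behind the cell array.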
57Checking the Address Decoder
- check for missing or multiple cell activations: compare XOR(upper half) with XOR(lower half)
- re-check parity behind cell array: compare OR over even cells with parity
- large decoders built from cascade of smaller ones
58Outline
- Automotive electronics: the specific situation
- Node-level view
  - designing cost efficient dependable nodes
  - test: purpose & techniques
- System-level view
  - communication system
  - test: purposes, challenges & techniques
- Summary
59Node Testing: What's the Purpose?
- factory test
  - unveil manufacturing defects
- startup test
  - check function before starting mission
- on-line test
  - check function during mission
60Basic Principle of Testing
source: Agilent
61The Complexity Problem
- Functional Testing
  - apply all possible/relevant input patterns
  - number of required vectors explodes with DUT complexity
  - example: SW self-test of processor
- Structural Testing
  - check function of all constituting components
  - if no defect in components => no defect in DUT
  - need access to internal components => scan test
62The Scan Test Principle
circuit registers are chained to one or more
shift registers
63Why Care for On-line Testing?
- Errors are detected anyway
- we have provided lots of mechanisms
- What about rarely used resources?
- faults will rarely get activated there
- so who cares when they are faulty and unused?
- What about fault accumulation?
- faults may accumulate in rarely used resources
- our assumption was single faults!
64A Toy Example: Steer by Wire
65Example Architecture
single fault assumption!
Within a Fault Tolerant Unit (FTU) two computer nodes operate in active redundancy.
66Reliability Model (Markov)
[Figure: Markov state diagram with transition rates α, β and ω.]
Additional parameters: δ = rarely used resources, σ = activation rate of these
67Model Results
[Figure: surface plot of log(MTTF) (about 6 to 9) over log(activation rate) [1/h] (2 down to -8) and rarely used resources (0 to 100). The maximum lies at high activation rate and no rarely used resources.]
rare activation of resources impairs MTTF => fault accumulation
68Are there rarely used Resources?
[Figure: resource breakdown into memory, interconnect, comb. logic and flip-flops, with two portions marked "irregular use".]
example: TTP/C controller prototype chip
69Conclusion of the Analysis
- irregularly used resources deserve specific attention
  - danger of fault accumulation
- memory is often the dominant resource in a system
  - therefore relatively high error probability
- hardware resources tend to exhibit higher and more regular activation than software tasks/memory cells
- it is wise to protect memory from fault accumulation
  - on-line testing of memory
70Testing versus Error Detection
- concurrent error detection
  - checks ongoing activities for certain properties
  - does not perform explicit stimulation
  - does not cover unused resources and irrelevant errors
  - detects error as soon as it becomes activated
- testing
  - applies explicit stimuli
  - checks for expected result
  - covers all resources included in the test scope
  - detects defect only upon test execution
71Transparent On-line Memory Test
- problem: on-line test needs to be transparent
  - do not destroy memory contents or degrade reaction time
- solution: systematic inversion of memory contents
  - instead of writing 0 or 1 => flip bit
  - application of standard test algorithm possible (March, e.g.)
  - upon CPU read or write => suspend test
  - keep track of inverted cells => re-invert upon CPU read
- drawbacks
  - need to introduce multiplexer => access delay
  - increased memory activity => power consumption
  - test controller not protected
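The inversion idea can be illustrated on a single word: write the complement, verify it, then invert again so the original content is back. The Python sketch below is a strong simplification (no March sequence, no suspension on CPU access); `StuckBitMemory` is a hypothetical fault model used only for the demonstration.

```python
def transparent_cell_test(memory, addr, width=8):
    """Test one word by inverting it twice; returns True if the cell passes.

    On success the original content is restored, so the test is
    transparent to the application.
    """
    mask = (1 << width) - 1
    original = memory[addr]
    memory[addr] = ~original & mask          # first inversion
    if memory[addr] != (~original & mask):
        return False
    memory[addr] = ~memory[addr] & mask      # second inversion restores data
    return memory[addr] == original

class StuckBitMemory(list):
    """Hypothetical faulty memory: bit 0 of word 2 is stuck at 1."""
    def __setitem__(self, addr, value):
        super().__setitem__(addr, value | 1 if addr == 2 else value)

good = [0x5A, 0x00, 0xFF, 0x3C]
print(all(transparent_cell_test(good, a) for a in range(4)))  # True
print(good == [0x5A, 0x00, 0xFF, 0x3C])                       # True

bad = StuckBitMemory([0x5A, 0x00, 0xFF, 0x3C])
print([transparent_cell_test(bad, a) for a in range(4)])      # [True, True, False, True]
```

A real implementation has to track which cells are currently inverted, because the CPU may access the memory between the two inversions.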
72TOMT Implementation
[Figure: block diagram showing Processor and Memory.]
73Outline
- Automotive electronics: the specific situation
- Node-level view
  - designing cost efficient dependable nodes
  - test: purpose & techniques
- System-level view
  - communication system
  - test: purposes, challenges & techniques
- Summary
74Interaction between Subsystems
- is the key to today's automotive innovations
  - allows exchange of status
  - allows sharing of resources (sensors, e.g.)
  - allows coherent distributed activities
  - enables completely new types of applications
  - no way around that
- is the nightmare of every system validator
  - applications become mutually dependent
  - further explosion of test space
  - thousands of options, versions etc.
  - products from dozens of different vendors must interact
75The Role of the Bus System
- point to point connection on demand
  - too inflexible
  - too much cabling (several km!)
- one generic bus system for the whole car
  - demands are too different
  - safety issues
  - waste of bandwidth
- mix of different bus systems (plus bridges for interconnect)
  - communication partly in parallel
  - selection of bus protocol according to demands
76An Example Architecture
77Time-Triggered Communication
source: G. Bauer
78The Temporal Firewall Principle
source: G. Bauer
79Benefits of TT Communication
- temporal firewall
  - decouples activities of individual nodes
  - reduces coupling to desired data exchange only
- global periodic schedule
  - complete temporal specification for global activities
  - allows isolated development (and test!) of components
  - enables systematic planning of resource utilisation
  - provides life-sign for every sender
  - allows masking of babbling idiots by bus guardian
- builds on existence of global time
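How a global periodic schedule yields both a bus guardian rule and a life-sign check can be sketched in Python. Slot length, node names and the schedule itself are made-up values for illustration and do not describe any particular protocol.

```python
SLOT_US = 100                        # slot length in microseconds (assumed)
SCHEDULE = ["A", "B", "C", "A"]      # one TDMA round: slot index -> sender

def slot_owner(global_time_us: int) -> str:
    """Node that is allowed to send at the given global time."""
    slot = (global_time_us // SLOT_US) % len(SCHEDULE)
    return SCHEDULE[slot]

def guardian_allows(node: str, global_time_us: int) -> bool:
    """Bus guardian: open the bus for a node only during its own slot."""
    return slot_owner(global_time_us) == node

def missing_life_signs(received: dict[int, str]) -> list[int]:
    """Slots of one round in which the scheduled sender stayed silent."""
    return [s for s, owner in enumerate(SCHEDULE) if received.get(s) != owner]

print(slot_owner(250))                                # C
print(guardian_allows("B", 250))                      # False: babbling node B is blocked
print(missing_life_signs({0: "A", 2: "C", 3: "A"}))   # [1]
```

Both checks depend only on the schedule and the global time, which is why a common time base is a prerequisite for the whole approach.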
80Outline
- Automotive electronics: the specific situation
- Node-level view
  - designing cost efficient dependable nodes
  - test: purpose & techniques
- System-level view
  - communication system
  - test: purposes, challenges & techniques
- Summary
81System-level Test: The Concept
- principle of structural testing
  - if all components OK => system OK
- decoupling of components by TT-approach
  - every component is fully specified and can be developed and tested in isolation
everything's great!
82System-level Test: The Reality
- dozens of complaints about mysterious interactions of subsystems reported
- virtually all brands are affected
- What happened to our system-level test concept?
83The Root of the Problem
- Recall the purpose of structural testing
  - identify defects
- Are we actually looking for defects?
  - Our test concept is still very good at that!
- What do we actually want to test for?
  - configuration errors
  - system design errors (systems are too complex to verify)
  - SOS (slightly off specification) errors
  - white spots in the specifications (bus protocol, ...)
We need to test the system function!
84Solutions Ahead?
- need to determine manageable and sufficient set of (functional) test cases
  - divide and conquer in the functional domain
  - hierarchic testing (requirements => properties)
  - inclusion of formal tools (model driven testing)
  - inclusion of statistics
  - inclusion of field experiences
- need to consider practical constraints
  - limited accessibility, black boxes
  - cost
85Solving the Complexity Problem
- decomposition into services & mechanisms
- clearly defined inputs, outputs and config. parameters for each mechanism
- use hierarchical structuring of mechanisms for diagnosis
[Figure: layer model with Application, Transport, Data link and Physical layers.]
details: http://embsys.technikum-wien.at/steacs.html
86Solving the Accessibility Problem
details: http://www.ecs.tuwien.ac.at/armengaud/extract
87Forcing the Clock Synchronization
details: http://www.ecs.tuwien.ac.at/armengaud/extract
88Summary
- the automotive domain has its own laws and rules
  - need extremely cost-effective robust solutions for safety-critical real-time applications, versatile and custom tailored
- on node level
  - different redundancy concepts applicable
  - example: dual core CPU and memory with protection mechs
  - on-line testing for memory may be required
- on system level
  - crucial role of communication infrastructure
  - advantages of time triggered approach
  - insufficient suitability of structural testing
89Hungry for more?
- http://ti.tuwien.ac.at/ecs
- steininger@ecs.tuwien.ac.at
90Related publications of my group (1)
- [1] T. Kottke and A. Steininger, "A Fail-Silent Memory for Automotive Applications", 9th IEEE European Test Symposium, Corsica, 2004.
- [2] T. Kottke and A. Steininger, "A Generic Dual Core Architecture with Error Containment", Journal of Computing and Informatics, vol. 23, no. 5, 2004.
- [3] T. Kottke and A. Steininger, "A Reconfigurable Generic Dual-Core Architecture", Int'l Conference on Dependable Systems and Networks (DSN 2006), Philadelphia, 2006.
- [4] T. Kottke and A. Steininger, "A Fail-Silent Reconfigurable Superscalar Processor", 13th IEEE Pacific Rim Int'l Symposium on Dependable Computing, Melbourne, 2007.
- [5] C. El Salloum, A. Steininger, P. Tummeltshammer and W. Harter, "Recovery Mechanisms for Dual Core Architectures", 21st IEEE Int'l Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'06), Washington, 2006.
- [6] A. Steininger and C. Temple, "Economic Self-Test in the Time-Triggered Architecture", IEEE Design & Test of Computers, vol. 3/1999.
- [7] A. Steininger, "Testing and Built-in Self-Test: A Survey", Journal of Systems Architecture 46 (2000).
91Related publications of my group (2)
- [8] A. Steininger and C. Scherrer, "On the Necessity of BIST in Safety-Critical Applications: A Case Study", 29th Annual Int'l Symposium on Fault-Tolerant Computing (FTCS-29), Madison, 1999.
- [9] C. Scherrer and A. Steininger, "How does Resource Utilization Affect Fault Tolerance?", 2000 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'00), Yamanashi, 2001.
- [10] C. Scherrer and A. Steininger, "How to Tune the MTTF of a Fail-Silent System", 2001 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'01), San Francisco, 2001.
- [11] C. Scherrer and A. Steininger, "Dealing with Dormant Faults in an Embedded Fault-Tolerant Computer System", IEEE Transactions on Reliability, vol. 52, no. 4, 2003.
- [12] K. Thaller and A. Steininger, "A Transparent Online Memory Test for Simultaneous Detection of Functional Faults and Soft Errors in Memories", IEEE Transactions on Reliability, vol. 52, no. 4, 2003.
92Related publications of my group (3)
- [13] E. Armengaud, F. Rothensteiner, A. Steininger, R. Pallierer, M. Horauer, M. Zauner, "A Structured Approach for the Systematic Test of Embedded Automotive Communication Systems", Int'l Test Conference 2005, Austin, 2005.
- [14] E. Armengaud, A. Steininger, M. Horauer, R. Pallierer, "A Layer Model for the Systematic Test of Time-Triggered Automotive Communication Systems", 5th IEEE Int'l Workshop on Factory Communication Systems, Vienna, 2005.
- [15] E. Armengaud, A. Steininger and M. Horauer, "Automatic Parameter Identification in FlexRay based Automotive Communication Networks", 11th IEEE Int'l Conference on Emerging Technologies and Factory Automation, Prague, 2006.
- [16] E. Armengaud and A. Steininger, "Pushing the Limits of Remote Online Diagnosis in Embedded Real-Time Networks", 6th IEEE Int'l Workshop on Factory Communication Systems, Torino, 2006.
- [17] P. Milbredt, A. Steininger and M. Horauer, "Automated Testing of FlexRay Clusters for System Inconsistencies in Automotive Networks", 4th Int'l Symposium on Electronic Design, Test and Applications (DELTA 2008), Hong Kong, 2008.
93Related Theses & Projects
- T. Kottke, "Untersuchung von fehlertoleranten Prozessorarchitekturen für sicherheitsrelevante Automobilanwendungen", PhD thesis, Vienna University of Technology, 2005. (German)
- C. Scherrer, "Zuverlässigkeit zweifach redundanter Architekturen unter besonderer Berücksichtigung latenter Fehler", PhD thesis, Vienna University of Technology, 2002. (German)
- K. Thaller, "A Transparent Online Memory Test", PhD thesis, Vienna University of Technology, 2001.
- E. Armengaud, "A Transparent Online Test Approach for Time-Triggered Communication Protocols", PhD thesis, Vienna University of Technology, 2008.
- STEACS (Systematic Test of Embedded Automotive Communication Systems): http://embsys.technikum-wien.at/projects/steacs/index.html
- EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing): http://ti.tuwien.ac.at/ecs/research/projects/extract