Title: Why do so many chips fail?
1. Why do so many chips fail?
- Ira Chayut, Verification Architect
- (opinions are my own and do not necessarily
represent the opinion of my employer)
2. Failure rate of first silicon is rising
- Research by Collett International revealed that 52% of complex application specific integrated circuits (ASICs) required a respin and the reason was largely due to functional errors. (http://www.techonline.com/community/ed_resource/feature_article/36655)
- Who is to blame? (There must be someone to blame!)
  - Management: they didn't provide enough resources
  - HW Engineering: they created the functional errors
  - Verification: they didn't catch the functional errors
  - Architecture: they didn't focus on testability
  - Marketing: they kept changing the specs
3. People don't kill chips, complexity kills chips
- http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s99/papers/2_src.pdf (1999)
- Projected numbers are a bit lower than current reality: a dual-core AMD Opteron has 233 million transistors and the Intel Itanium 2 has 592 million transistors
4. Complexity increases exponentially
- Chip component count increases exponentially over time (Moore's law)
- Interactions increase super-exponentially
- IP reuse and parallel design teams facilitate more functions with fewer HW engineers per function and more functions per chip
- Verification effort gets combinatorially more difficult as functions are added
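To make the growth rates above concrete, here is a minimal Python sketch. It assumes a simplified model that is not taken from the slides: any pair of functions may interact, and any subset of two or more functions is a potential interaction scenario. The function names are illustrative.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n functions (n choose 2)."""
    return comb(n, 2)


def multiway_interactions(n: int) -> int:
    """Number of subsets that contain two or more of n functions."""
    # 2**n subsets in total, minus the empty set and the n single-function sets.
    return 2 ** n - n - 1


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, pairwise_interactions(n), multiway_interactions(n))
```

In this model the subset count is already exponential in the number of functions. If the number of functions itself grows exponentially over time, the subset count grows as an exponential of an exponential.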
5. Why verification is not able to keep up
- Verification effort gets combinatorially more difficult as functions are added
- BUT
- Verification staffing/time cannot be made combinatorially larger to compensate
- AND
- Chip lifetimes are too short to allow for complete testing
- THUS
- Chips will continue to have ever-increasing functional errors as chips get more complex
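A back-of-the-envelope calculation illustrates why complete testing does not fit in a chip's lifetime. The sketch below assumes a block whose behavior depends only on an n-bit input, and a hypothetical test rate of one billion vectors per second. Both figures are illustrative assumptions, not measurements.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_exhaust(input_bits: int, vectors_per_second: float) -> float:
    """Years needed to apply every possible input vector exactly once."""
    total_vectors = 2 ** input_bits
    return total_vectors / vectors_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Assumed rate: 1e9 vectors per second (hypothetical).
    for bits in (32, 48, 64):
        print(f"{bits} input bits: {years_to_exhaust(bits, 1e9):.3g} years")
```

Under these assumptions a 32-bit input space is covered in a few seconds, while a 64-bit input space takes more than 500 years, and that is before any internal state is considered.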
6. Limiting the number of architectural and functional errors
- Thorough unit-level verification testing
  - Small simulations run faster
  - Avoids combinatorial explosion of interactions
- Well-defined interfaces between blocks, with assertions and formal verification techniques to reduce inter-block problems
- Emulation or FPGA prototyping to accelerate testing
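As an illustration of an interface assertion, the following Python sketch checks a recorded trace against one rule of a hypothetical valid/ready handshake: once `valid` is raised without `ready`, `valid` must stay high and `data` must hold steady until the transfer is accepted. The signal names and the rule are assumptions made for the example. In practice such checks are usually written in a hardware assertion language and run in simulation or formal tools.

```python
from typing import NamedTuple, Optional, Sequence


class Cycle(NamedTuple):
    """Sampled interface signals for one clock cycle (hypothetical names)."""
    valid: bool
    ready: bool
    data: int


def first_violation(trace: Sequence[Cycle]) -> Optional[int]:
    """Return the index of the first cycle that breaks the rule, or None."""
    pending: Optional[int] = None  # data value still waiting to be accepted
    for i, cycle in enumerate(trace):
        if pending is not None:
            # A transfer is outstanding: valid must stay high, data must not change.
            if not cycle.valid or cycle.data != pending:
                return i
        # The transfer stays outstanding only while valid is high and ready is low.
        pending = cycle.data if (cycle.valid and not cycle.ready) else None
    return None
```

A check like this sits on the boundary between two blocks, so each block can be verified alone against the same rule.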
7. How to live with functional errors
- Successful companies have learned how to ship chips with functional and architectural errors: time-to-market pressures and chip complexity force the delivery of chips that are not perfect (even if perfection were possible). How can this be done better?
- For a long while, DRAMs have been made with extra components to allow a less-than-perfect chip to provide full device function and to ship
- How to do the same with architectural features? How can full device function exist in the presence of architectural or implementation omissions or errors?
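The DRAM approach can be sketched in a few lines. This is a simplified model, assuming row-level redundancy with a small pool of spare rows and a remap table that is programmed when a defect is found at test time. The class and method names are hypothetical.

```python
from typing import Dict, List


class RepairableMemory:
    """Toy memory with spare rows that can stand in for defective ones."""

    def __init__(self, rows: int, spares: int) -> None:
        self.rows = rows
        # Spare rows live at physical indices just past the normal rows.
        self.free_spares: List[int] = list(range(rows, rows + spares))
        self.remap: Dict[int, int] = {}   # logical row -> spare physical row
        self.cells: Dict[int, int] = {}   # physical row -> stored value

    def mark_defective(self, row: int) -> bool:
        """Route a bad row to a spare; return False if no spare is left."""
        if row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[row] = self.free_spares.pop(0)
        return True

    def physical_row(self, row: int) -> int:
        """Translate a logical row to the physical row that backs it."""
        if not 0 <= row < self.rows:
            raise IndexError(row)
        return self.remap.get(row, row)

    def write(self, row: int, value: int) -> None:
        self.cells[self.physical_row(row)] = value

    def read(self, row: int) -> int:
        return self.cells.get(self.physical_row(row), 0)
```

The user still sees the full set of logical rows. The device ships as long as the number of defective rows does not exceed the number of spares.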
8. Architecture support
- Embrace Perl's motto, "There's More Than One Way to Do It": allow for multiple ways of accomplishing all critical specified functions
- Analogous to Design for Test (DFT) and Design for Verification (DFV), we should start thinking about Architect for Verification (AFV)
  - Thanks to Dave Whipp for the AFV phrase and acronym
- In some problem domains, such as networking, upper-layer protocols can recover from some silicon errors, though there is a performance penalty when this is used
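The networking case can be illustrated with a toy model. The sketch assumes a hypothetical link that corrupts the first few frames it carries, and an upper layer that detects corruption with a CRC and retransmits. The retry count stands in for the performance penalty. `zlib.crc32` is from the Python standard library; everything else is invented for the example.

```python
import zlib
from typing import Callable, Tuple

Link = Callable[[bytes], bytes]


def make_flaky_link(corrupt_first: int) -> Link:
    """Return a link that flips one bit in the first `corrupt_first` frames."""
    remaining = [corrupt_first]

    def link(frame: bytes) -> bytes:
        if remaining[0] > 0:
            remaining[0] -= 1
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame

    return link


def send_reliably(payload: bytes, link: Link, max_tries: int = 5) -> Tuple[bytes, int]:
    """Send payload plus CRC, retrying until the CRC checks out.

    Returns the received payload and the number of transmissions used.
    """
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")
    for tries in range(1, max_tries + 1):
        received = link(frame)
        body, crc = received[:-4], int.from_bytes(received[-4:], "big")
        if zlib.crc32(body) == crc:
            return body, tries
    raise RuntimeError("link did not recover within max_tries")
```

With a link that corrupts two frames, the payload still arrives intact, but only on the third transmission.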
9. Architecture support, continued
- A programmable abstraction layer between the real hardware and the user's API can hide functional warts
  - Hardware catches specific operations and either directs them to one of multiple hardware implementations, or signals a software trap
  - Pyramid minicomputers hid the assembly language from users; the compiler could work around problems
  - Transmeta maps standard machine language to a hidden processor architecture; translation software can work around problems
- Soft hardware can allow chip redesign after silicon is frozen (and shipped!)
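One way to picture the abstraction layer is as a patchable dispatch table. The sketch below assumes each operation has an ordered list of hardware paths plus a software trap handler. A post-silicon patch marks a path as broken, and the layer routes around it. All names are hypothetical, and the "hardware" paths are ordinary Python functions standing in for real units.

```python
from typing import Callable, Dict, List, Set

Impl = Callable[[int, int], int]


class AbstractionLayer:
    """Routes each operation to the first usable implementation."""

    def __init__(self) -> None:
        self.paths: Dict[str, List[Impl]] = {}  # op -> hardware paths, in order
        self.traps: Dict[str, Impl] = {}        # op -> software fallback
        self.disabled: Set[Impl] = set()        # paths known to be broken

    def register(self, op: str, hw_paths: List[Impl], software_trap: Impl) -> None:
        self.paths[op] = list(hw_paths)
        self.traps[op] = software_trap

    def disable(self, impl: Impl) -> None:
        """Apply a post-silicon patch: never use this path again."""
        self.disabled.add(impl)

    def execute(self, op: str, a: int, b: int) -> int:
        for impl in self.paths[op]:
            if impl not in self.disabled:
                return impl(a, b)
        # No working hardware path is left: fall back to the software trap.
        return self.traps[op](a, b)
```

The user's API, `execute("mul", a, b)` in this sketch, stays the same whether the result comes from the primary path, an alternate path, or software. Only performance changes.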
10. Summary
- Ever-increasing chip complexity prevents total testing before tape-out (or even before shipping)
- AFV techniques can make chip verification not subject to combinatorial explosion
- We have to accept that there will be architectural and functional failures in every advanced chip that is built
- Architecture support is needed to allow failures to be worked around or fixed post-silicon