Title: On Modeling the Lifetime Reliability of Homogeneous Manycore Systems
1On Modeling the Lifetime Reliability of
Homogeneous Manycore Systems
- Lin Huang and Qiang Xu
- CUhk REliable computing laboratory (CURE)
- The Chinese University of Hong Kong
2Integrated Circuit (IC) Product Reliability
- IC errors can be broadly classified into two
categories
- Soft errors
- Do not fundamentally damage the circuits
- Hard errors
- Permanent once manifest
- E.g., time dependent dielectric breakdown (TDDB)
in the gate oxides, electromigration (EM) and
stress migration (SM) in the interconnects, and
thermal cycling (TC)
3Manycore Systems
- State-of-the-art computing systems have started
to employ multiple cores on a single die
- General-purpose processors, multi-digital signal
processor systems
- Power-efficiency
- Short time-to-market
Source: Intel
Source: Nvidia
4Problem Formulation
- To model the lifetime reliability of homogeneous
manycore systems using a load-sharing
nonrepairable k-out-of-n:G system with general
failure distributions
- Key features
- k-out-of-n:G systems to provide fault tolerance
- Load-sharing: each embedded core carries only
part of the load assigned by the operating system
- Nonrepairable: embedded cores are integrated on a
single silicon die
- General failure distribution: embedded cores age
in operation
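As a baseline for the k-out-of-n:G idea, the sketch below computes the probability that at least k of n cores are good. It deliberately assumes independent cores that are each good with the same probability p, so it ignores load sharing and aging, which are exactly what the proposed model adds. The function name and parameters are illustrative and do not come from the slides.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent cores are good.

    Simplifying assumption (not the model in the talk): every core is good
    with the same probability p, independently of the others.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probabilities of having exactly i good cores, i = k..n.
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))
```

For example, `k_out_of_n_reliability(1, 2, 0.5)` is the chance that at least one of two such cores is good.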
5Queueing Model for Task Allocation
- Embedded cores execute tasks independently and
one core can perform at most one task at a time
- Consider a manycore system composed of a set of
identical embedded cores
- The cores are divided into the set of active cores,
the set of spare cores, and the set of faulty cores
6Queueing Model for Task Allocation
- A general-purpose parallel processing system with
a central queue and bulk arrivals is modeled as
a queueing system
- The probability that a certain active core is
occupied by tasks (also called utilization) is
computed from this queueing model
- Target systems
- Gracefully degrading systems
- Standby redundant systems
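A minimal sketch of how utilization could be evaluated for a stable multi-server queue with batch arrivals. It uses the standard work-conservation relation (task arrival rate times mean service time, divided by the number of servers). The expression used in the talk may be written differently, and all parameter names here are assumptions.

```python
def core_utilization(batch_rate: float, mean_batch_size: float,
                     service_rate: float, active_cores: int) -> float:
    """Long-run fraction of time one active core is busy.

    batch_rate      -- batches arriving per unit time (hypothetical name)
    mean_batch_size -- average number of tasks per batch
    service_rate    -- tasks one core completes per unit time
    active_cores    -- number of cores sharing the central queue
    """
    if active_cores <= 0 or service_rate <= 0:
        raise ValueError("active_cores and service_rate must be positive")
    offered_load = batch_rate * mean_batch_size / service_rate
    utilization = offered_load / active_cores
    if utilization >= 1.0:
        # The queue would grow without bound, so no steady state exists.
        raise ValueError("system is overloaded (utilization >= 1)")
    return utilization
```

With fewer active cores the same load is spread over fewer servers, so the utilization of each surviving core rises after every failure in a gracefully degrading system.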
7Lifetime Reliability of Entire System
Gracefully Degrading System
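- A functioning manycore system may contain between
k and n good cores
- Consider the probability that the system has a
given number of active cores at a given time
- The system reliability can therefore be expressed
as the sum of these probabilities over all
functioning configurations
- Thus, the Mean Time to Failure (MTTF) of the
system can be written as the integral of the
system reliability over time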
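The two relations on this slide can be written out as follows. The symbols are assumptions introduced here for illustration: P_i(t) is the probability that exactly i cores are good at time t, and R_sys(t) is the system reliability. The second identity is the standard expression for the mean of a nonnegative lifetime.

```latex
R_{\mathrm{sys}}(t) = \sum_{i=k}^{n} P_i(t),
\qquad
\mathrm{MTTF} = \int_{0}^{\infty} R_{\mathrm{sys}}(t)\,\mathrm{d}t
```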
8Lifetime Reliability of Entire System
Gracefully Degrading System
- To determine the probability that the system has a
given number of active cores, conditional
probabilities are used
- What remains is how to compute these conditional
probabilities
9Behavior of Single Processor Core
- States of cores
- Spare mode: cold standby
- Active mode
- Processing state
- Wait state: warm standby
- Reliability functions in the processing and wait
states have the same shape but different scale
parameters
10Lifetime Reliability of A Single Core
Gracefully Degrading System
- Define a core's accumulated time in a certain state
at a given time as how long it has spent in that
state up to that time
- Calculation
11Lifetime Reliability of A Single Core
Gracefully Degrading System
- Theorem 1: Suppose a manycore system with a
gracefully degrading scheme has experienced a
number of core failures, listed in order of
occurrence time. For any core that has survived
until a given time, the theorem gives
- its accumulated time in the processing state up
to that time
- its accumulated time as warm standby up to that
time
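One plausible reading of how such accumulated times could be evaluated is sketched below. It assumes that the surviving core has been active since time zero and that, between two consecutive failures, it is busy for a fraction of time equal to the utilization in force during that interval. This is an illustrative assumption and is not claimed to be the exact statement of Theorem 1. All names are hypothetical.

```python
def accumulated_times(failure_times, utilizations, now):
    """Split the interval [0, now] into processing time and wait time.

    failure_times -- sorted times of the core failures seen so far
    utilizations  -- utilizations[j] applies after the j-th failure
                     (so one more entry than failure_times)
    now           -- current time, not earlier than the last failure
    Returns (processing_time, wait_time).
    """
    if len(utilizations) != len(failure_times) + 1:
        raise ValueError("need one utilization per inter-failure interval")
    boundaries = [0.0, *failure_times, now]
    if any(later < earlier for earlier, later in zip(boundaries, boundaries[1:])):
        raise ValueError("times must be non-decreasing and not exceed 'now'")
    # Weight each interval's length by the utilization during that interval.
    processing = sum(
        rho * (end - start)
        for rho, start, end in zip(utilizations, boundaries, boundaries[1:])
    )
    return processing, now - processing
```

Because each failure raises the load on the remaining cores, the utilization list would normally be increasing, which makes later intervals contribute more processing time per unit of wall-clock time.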
12Lifetime Reliability of A Single Core
Gracefully Degrading System
- Recall that the reliability functions in the wait and
processing states have the same shape but
different scale parameters
- General reliability function, with an abbreviated
notation
- Reliability function in the processing state, with
its own notation
- Reliability function in the wait state, with its
own notation
- Relationships between them
13Lifetime Reliability of A Single Core
Gracefully Degrading System
- A subdivision of the time
- By the continuity of the reliability function, the
intervals can be combined
(Figure labels: processing, wait; accumulated time in
the processing state; accumulated time in the wait
state)
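A common way to splice state-dependent reliability functions that share a shape but differ in scale is a cumulative-exposure model: time spent in each state is divided by that state's scale parameter and the results are added into one normalized age. The sketch below applies this to a Weibull lifetime. That it coincides with the derivation in the talk is an assumption, and every name is hypothetical.

```python
import math


def normalized_age(processing_time, wait_time, scale_processing, scale_wait):
    """Combine the time spent in both states into one dimensionless age."""
    return processing_time / scale_processing + wait_time / scale_wait


def weibull_survival(age, shape):
    """Weibull reliability evaluated at a normalized (scale-free) age."""
    return math.exp(-(age ** shape))


def conditional_survival(age_earlier, age_later, shape):
    """P(core survives to age_later | it has survived to age_earlier)."""
    if age_later < age_earlier:
        raise ValueError("age_later must not be smaller than age_earlier")
    return weibull_survival(age_later, shape) / weibull_survival(age_earlier, shape)
```

A larger wait-state scale than processing-state scale makes waiting age the core more slowly than processing, which is the intended effect of a warm standby state.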
14Lifetime Reliability of A Single Core
Gracefully Degrading System
- Theorem 2: Given a gracefully degrading manycore
system that has experienced a number of core
failures at known occurrence times, the theorem
gives the probability that a certain core survives
at a later time, provided that it has survived
until an earlier time
15Lifetime Reliability of Entire System Standby
Redundant System
- A standby redundant system is functioning if it
contains at least k good cores, of which k are
configured as active ones and the remaining are
spares
- To determine the system reliability, the key
point is again to compute the reliability of a
single core
16Lifetime Reliability of A Single Core Standby
Redundant System
- Define a core's birth time as the time point
when it is configured as an active one
- Theorem 3: In a standby redundant manycore
system, for any core with a given birth time that
has survived until a given time, the theorem gives
- its accumulated time in the processing state up
to that time
- its accumulated time as warm standby up to that
time
17Lifetime Reliability of A Single Core Standby
Redundant System
- Theorem 4: In a manycore system with a standby
redundant scheme, the theorem gives the probability
that a certain core with a given birth time
survives at a given time
18Experimental Setup
- Lifetime distributions
- Exponential
- Weibull
- Linear failure rate
- System parameters
- Consider a manycore system consisting of
identical cores
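The three lifetime distributions listed above have simple closed-form reliability functions, and the MTTF of any of them can be approximated by numerically integrating the reliability function. The sketch below uses the standard textbook forms. The parameter names and the trapezoidal integrator are illustrative choices and do not reflect the settings used in the experiments.

```python
import math


def exponential_reliability(t, rate):
    """Constant failure rate: R(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def weibull_reliability(t, shape, scale):
    """Weibull: R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def linear_failure_rate_reliability(t, a, b):
    """Failure rate h(t) = a + b * t, so R(t) = exp(-(a*t + b*t**2 / 2))."""
    return math.exp(-(a * t + 0.5 * b * t * t))


def mttf(reliability, horizon, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) over [0, horizon].

    The horizon must be large enough that R(horizon) is negligible.
    """
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h
```

For instance, `mttf(lambda t: exponential_reliability(t, 0.5), horizon=100.0)` approximates the mean lifetime of a component with a constant failure rate of 0.5 per unit time.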
19Misleading Caused by Exponential Assumption
Sojourn time in each failure state (years) and expected lifetime of the system (years)
Redundant  Redundancy  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected
Cores      Scheme      State      State      State      State      State      Lifetime
0          -           0.2188     -          -          -          -          0.2188
1          Degrading   0.2121     0.2188     -          -          -          0.4309
1          Standby     0.2188     0.2188     -          -          -          0.4376
2          Degrading   0.2059     0.2121     0.2188     -          -          0.6368
2          Standby     0.2188     0.2188     0.2188     -          -          0.6564
3          Degrading   0.2000     0.2059     0.2121     0.2188     -          0.8368
3          Standby     0.2188     0.2188     0.2188     0.2188     -          0.8752
4          Degrading   0.1944     0.2000     0.2059     0.2121     0.2188     1.0312
4          Standby     0.2188     0.2188     0.2188     0.2188     0.2188     1.0940
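The pattern in the table follows from the memoryless property: if every active core fails at the same constant rate, the time until the next failure among m active cores is exponential with rate m times the per-core rate, regardless of how long the cores have already run. The sketch below illustrates this. The slides do not list the parameters behind the table, so the values suggested in the usage note (32 required cores, a per-core failure rate of 1/7 per year), the function name, and the assumption that cold spares never fail are illustrative assumptions only.

```python
def sojourn_times(required_cores, redundant_cores, failure_rate, scheme):
    """Expected time spent in each failure state under exponential lifetimes.

    scheme is "degrading" (all good cores are active) or "standby"
    (only the required cores are active; spares are assumed not to fail).
    Assumes the per-core failure rate does not depend on load.
    """
    times = []
    for failures in range(redundant_cores + 1):
        if scheme == "degrading":
            active = required_cores + redundant_cores - failures
        elif scheme == "standby":
            active = required_cores
        else:
            raise ValueError("scheme must be 'degrading' or 'standby'")
        # Minimum of `active` independent exponential lifetimes.
        times.append(1.0 / (active * failure_rate))
    return times
```

Calling `sojourn_times(32, 4, 1 / 7, "standby")` gives five equal sojourn times, and summing either scheme's list gives its expected lifetime. Under this assumption each extra core adds a nearly constant amount of lifetime, a behavior that does not carry over to distributions in which cores age.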
20Lifetime Reliability for Non-Exponential Lifetime
Distribution
(a) Weibull Distribution
(b) Linear Failure Rate Distribution
21Detailed Results for Gracefully Degrading System
Sojourn time in each failure state (years) and expected lifetime of the system (years)
Distribution         Redundant  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected
                     Cores      State      State      State      State      State      Lifetime
Weibull              0          2.2039     -          -          -          -          2.2039
Weibull              1          2.2153     0.5573     -          -          -          2.7726
Weibull              2          2.2260     0.5600     0.3055     -          -          3.0915
Weibull              3          2.2359     0.5626     0.3142     0.1040     -          3.2167
Weibull              4          2.2452     0.5649     0.2988     0.0955     0.0820     3.2864
Linear Failure Rate  0          1.8572     -          -          -          -          1.8572
Linear Failure Rate  1          1.8463     1.1367     -          -          -          2.9830
Linear Failure Rate  2          1.8354     1.1325     0.8926     -          -          3.8605
Linear Failure Rate  3          1.8243     1.1282     0.8798     0.6941     -          4.5264
Linear Failure Rate  4          1.8133     1.1237     0.8762     0.7055     0.6269     5.1456
22The Impact of Workload
23Comparison Between Gracefully Degrading System
and Standby Redundant System
Distribution         Redundant  Redundancy  Hot      Warm     Warm     Warm     Warm     Cold
                     Cores      Scheme      Standby  Standby  Standby  Standby  Standby  Standby
Weibull              2          Degrading   1.5039   1.8232   2.1497   2.2930   2.4265   2.6258
Weibull              2          Standby     1.5314   1.8227   2.1133   2.2488   2.3484   2.5309
Weibull              4          Degrading   1.5046   1.8521   2.2305   2.4432   2.5771   2.8376
Weibull              4          Standby     1.5577   1.8545   2.1715   2.3103   2.4266   2.6261
Linear Failure Rate  2          Degrading   1.9115   2.3197   2.7070   2.8697   3.0105   3.2424
Linear Failure Rate  2          Standby     1.9608   2.3314   2.7330   2.8851   3.0091   3.2146
Linear Failure Rate  4          Degrading   2.1348   2.7122   3.3642   3.6529   3.9385   4.3590
Linear Failure Rate  4          Standby     2.3008   2.7899   3.4307   3.6015   3.8588   4.1881
24Conclusion
- State-of-the-art CMOS technology enables
chip-level manycore processors
- The lifetime reliability of such large circuits is
a major concern
- We propose a comprehensive analytical model to
estimate the lifetime reliability of manycore
systems
- Some experimental results are shown to
demonstrate the effectiveness of the proposed
model
25Thank You for Your Attention!