On Modeling the Lifetime Reliability of Homogeneous Manycore Systems - PowerPoint PPT Presentation

On Modeling the Lifetime Reliability of Homogeneous Manycore Systems. Lin Huang and Qiang Xu, CUhk REliable computing laboratory (CURE), The Chinese University of Hong Kong

Transcript and Presenter's Notes


1
On Modeling the Lifetime Reliability of
Homogeneous Manycore Systems
  • Lin Huang and Qiang Xu
  • CUhk REliable computing laboratory (CURE)
  • The Chinese University of Hong Kong

2
Integrated Circuit (IC) Product Reliability
  • IC errors can be broadly classified into two
    categories
  • Soft errors
  • Do not fundamentally damage the circuits
  • Hard errors
  • Permanent once manifest
  • E.g., time-dependent dielectric breakdown (TDDB)
    in the gate oxides, electromigration (EM) and
    stress migration (SM) in the interconnects, and
    thermal cycling (TC)

3
Manycore Systems
  • State-of-the-art computing systems have started
    to employ multiple cores on a single die
  • General-purpose processors, multi-digital signal
    processor systems
  • Power-efficiency
  • Short time-to-market

Source: Intel
Source: Nvidia
4
Problem Formulation
  • To model the lifetime reliability of homogeneous
    manycore systems using a load-sharing
    nonrepairable k-out-of-n:G system with general
    failure distributions
  • Key features
  • k-out-of-n:G system: provides fault tolerance
  • Load-sharing: each embedded core carries only
    part of the load assigned by the operating system
  • Nonrepairable: embedded cores are integrated on a
    single silicon die
  • General failure distribution: embedded cores age
    in operation
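
As a point of reference for the k-out-of-n:G structure, the following is a minimal sketch of the textbook baseline in which cores fail independently, do not share load, and are each good with the same probability. It is not the model proposed in these slides, which additionally handles load sharing and aging; the function name and the example numbers are hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n cores are good.

    Assumes (for illustration only) that cores fail independently and that
    each core is good with the same probability p.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    # Sum the binomial probabilities of having exactly i good cores, i = k..n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: 4 cores, at least 2 must be good, each good w.p. 0.9.
print(k_out_of_n_reliability(2, 4, 0.9))
```

Adding redundancy (a larger n for the same k) raises this probability, which is the fault-tolerance benefit named in the key features.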

5
Queueing Model for Task Allocation
  • Embedded cores execute tasks independently and
    one core can perform at most one task at a time
  • Consider a manycore system composed of a set of
    identical embedded cores
  • The set comprises active cores, spare cores,
    and faulty cores

6
Queueing Model for Task Allocation
  • A general-purpose parallel processing system with
    a central queue and bulk arrivals is modeled as
    a queueing system
  • The probability that a certain active core is
    occupied by tasks (also called utilization) is
    computed as
  • Target system
  • Gracefully degrading systems
  • Standby redundant systems

7
Lifetime Reliability of Entire System
Gracefully Degrading System
  • A functioning manycore system may contain
    anywhere from k to n good cores
  • Let be the probability that the
    system has active cores at time
  • The system reliability can therefore be expressed
    as
  • Thus, the Mean Time to Failure (MTTF) of the
    system can be written as
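
The two relations described on this slide follow a standard pattern, sketched here in assumed notation (the symbols are illustrative, not taken from the slides): P_i(t) is the probability that the system has i active cores at time t, n is the total number of cores, and k is the minimum number of good cores needed for the system to function.

```latex
% System reliability: the system works whenever it has between k and n active cores.
R_{\mathrm{sys}}(t) = \sum_{i=k}^{n} P_i(t)

% Mean time to failure: the integral of the reliability function over all time.
\mathrm{MTTF} = \int_{0}^{\infty} R_{\mathrm{sys}}(t)\, dt
```

The second relation is the general identity between the expected lifetime of a nonrepairable system and its reliability function.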

8
Lifetime Reliability of Entire System
Gracefully Degrading System
  • To determine
  • Conditional probability
  • For any
  • Conditional probability
  • The remaining question is how to compute

9
Behavior of Single Processor Core
  • States of cores
  • Spare mode: cold standby
  • Active mode
  • Processing state
  • Wait state: warm standby
  • The same shape but different scale parameter
  • E.g.,

10
Lifetime Reliability of A Single Core
Gracefully Degrading System
  • Define a core's accumulated time in a certain
    state at a given time as how long the core has
    spent in that state up to that time
  • Calculation

11
Lifetime Reliability of A Single Core
Gracefully Degrading System
  • Theorem 1: Suppose a manycore system with the
    gracefully degrading scheme has experienced a
    number of core failures, listed in order of
    occurrence time. For any core that has survived
    until a given time, the theorem gives
  • its accumulated time in the processing state up
    to that time
  • its accumulated time as warm standby up to that time

12
Lifetime Reliability of A Single Core
Gracefully Degrading System
  • Recall that the reliability functions in the wait
    and processing states have the same shape but
    different scale parameters
  • General reliability function and its abbreviated
    form
  • Reliability function in the processing state
  • Reliability function in the wait state
  • Relationships among these functions

13
Lifetime Reliability of A Single Core
Gracefully Degrading System
  • A subdivision of the time
  • By the continuity of reliability function, we have

[Figure: timeline subdivided into processing and wait intervals, annotated with the accumulated time in the processing state and the accumulated time in the wait state]
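
One way to read the accumulated-time idea is as a cumulative-exposure calculation: because the processing and wait states share a shape and differ only in scale, time spent in each state can be converted to a common equivalent age, which keeps the reliability function continuous at every state switch. The sketch below assumes a Weibull lifetime and this cumulative-exposure reading; the function names and parameter values are hypothetical, and the slides' exact formulas are not reproduced here.

```python
import math


def core_survival(t_processing, t_wait, scale_processing, scale_wait, shape):
    """Survival probability of a core given its accumulated time in each state.

    Assumption: Weibull lifetime with a shared shape parameter; each state
    contributes (time in state) / (scale of that state) to an equivalent age.
    """
    equivalent_age = t_processing / scale_processing + t_wait / scale_wait
    return math.exp(-(equivalent_age ** shape))


def conditional_survival(earlier, later, scale_processing, scale_wait, shape):
    """P(core alive at `later` | core alive at `earlier`).

    `earlier` and `later` are (t_processing, t_wait) pairs of accumulated times.
    """
    return (core_survival(*later, scale_processing, scale_wait, shape)
            / core_survival(*earlier, scale_processing, scale_wait, shape))


# Hypothetical numbers: the wait state ages the core more slowly (larger scale).
p = conditional_survival((1.0, 0.5), (1.6, 0.9),
                         scale_processing=3.0, scale_wait=9.0, shape=2.0)
print(p)
```

The conditional form is the ratio of two survival probabilities, which is the general definition of surviving to a later time given survival to an earlier one.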
14
Lifetime Reliability of A Single Core
Gracefully Degrading System
  • Theorem 2: Given a gracefully degrading manycore
    system that has experienced a number of core
    failures at known occurrence times, the
    probability that a certain core survives at a
    later time, provided that it has survived until
    an earlier time, is given by
  • where

15
Lifetime Reliability of Entire System
Standby Redundant System
  • A standby redundant system is functioning if it
    contains at least k good cores, of which k are
    configured as active ones and the remaining are
    spares
  • To determine
  • Again, the key point is to compute

16
Lifetime Reliability of A Single Core
Standby Redundant System
  • Define a core's birth time as the time point
    when it is configured as an active one
  • Theorem 3: In a standby redundant manycore
    system, for any core with a given birth time that
    has survived until a given time, the theorem gives
  • its accumulated time in the processing state up
    to that time
  • its accumulated time as warm standby up to that time

17
Lifetime Reliability of A Single Core
Standby Redundant System
  • Theorem 4: In a manycore system with the standby
    redundant scheme, the probability that a certain
    core with a given birth time survives at a given
    time is given by
  • where

18
Experimental Setup
  • Lifetime distributions
  • Exponential
  • Weibull
  • Linear failure rate
  • System parameters
  • Consider a manycore system consisting of cores
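
The three lifetime distributions named above have standard reliability functions, sketched below. The parameter names (`rate`, `scale`, `shape`, `a`, `b`) are illustrative assumptions, since the parameter values used in the experiments are not given in this transcript.

```python
import math


def exponential_reliability(t, rate):
    """Constant failure rate: R(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def weibull_reliability(t, scale, shape):
    """Weibull: R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def linear_failure_rate_reliability(t, a, b):
    """Failure rate h(t) = a + b * t, so R(t) = exp(-(a * t + b * t**2 / 2))."""
    return math.exp(-(a * t + 0.5 * b * t * t))


# Hypothetical parameters, for illustration only.
for t in (0.0, 1.0, 2.0):
    print(t,
          exponential_reliability(t, rate=0.5),
          weibull_reliability(t, scale=2.0, shape=2.0),
          linear_failure_rate_reliability(t, a=0.1, b=0.4))
```

The exponential case is the special case of both others (Weibull with shape 1, linear failure rate with b = 0), which is why it cannot represent cores that age in operation.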

19
Misleading Results Caused by the Exponential Assumption
Sojourn time (years) in each failure state, and expected lifetime of the system
Redundancy  Scheme     0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected Lifetime
0           -          0.2188     -          -          -          -          0.2188
1           Degrading  0.2121     0.2188     -          -          -          0.4309
1           Standby    0.2188     0.2188     -          -          -          0.4376
2           Degrading  0.2059     0.2121     0.2188     -          -          0.6368
2           Standby    0.2188     0.2188     0.2188     -          -          0.6564
3           Degrading  0.2000     0.2059     0.2121     0.2188     -          0.8368
3           Standby    0.2188     0.2188     0.2188     0.2188     -          0.8752
4           Degrading  0.1944     0.2000     0.2059     0.2121     0.2188     1.0312
4           Standby    0.2188     0.2188     0.2188     0.2188     0.2188     1.0940
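
In each row of the table, the final value equals the sum of that row's sojourn times, that is, the expected lifetime is the total time spent across all failure states before the system fails. The short check below confirms this arithmetic using the numbers copied from the table; the variable and function names are illustrative.

```python
# (redundancy, scheme) -> (sojourn times in years, reported expected lifetime)
TABLE = {
    (0, None): ([0.2188], 0.2188),
    (1, "Degrading"): ([0.2121, 0.2188], 0.4309),
    (1, "Standby"): ([0.2188, 0.2188], 0.4376),
    (2, "Degrading"): ([0.2059, 0.2121, 0.2188], 0.6368),
    (2, "Standby"): ([0.2188, 0.2188, 0.2188], 0.6564),
    (3, "Degrading"): ([0.2000, 0.2059, 0.2121, 0.2188], 0.8368),
    (3, "Standby"): ([0.2188, 0.2188, 0.2188, 0.2188], 0.8752),
    (4, "Degrading"): ([0.1944, 0.2000, 0.2059, 0.2121, 0.2188], 1.0312),
    (4, "Standby"): ([0.2188, 0.2188, 0.2188, 0.2188, 0.2188], 1.0940),
}


def max_row_discrepancy(table):
    """Largest gap between a row's summed sojourn times and its reported total."""
    return max(abs(sum(sojourn) - total) for sojourn, total in table.values())


print(f"largest discrepancy: {max_row_discrepancy(TABLE):.1e} years")
```

Under the exponential assumption every standby row shows the same sojourn time in every state, so each extra spare adds the same increment to the expected lifetime.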
20
Lifetime Reliability for Non-Exponential Lifetime Distribution
[Figure: (a) Weibull distribution; (b) linear failure rate distribution]
21
Detailed Results for Gracefully Degrading System
Sojourn time (years) in each failure state
Distribution         Redundancy  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Total
Weibull              0           2.2039     -          -          -          -          2.2039
Weibull              1           2.2153     0.5573     -          -          -          2.7726
Weibull              2           2.2260     0.5600     0.3055     -          -          3.0915
Weibull              3           2.2359     0.5626     0.3142     0.1040     -          3.2167
Weibull              4           2.2452     0.5649     0.2988     0.0955     0.0820     3.2864
Linear Failure Rate  0           1.8572     -          -          -          -          1.8572
Linear Failure Rate  1           1.8463     1.1367     -          -          -          2.9830
Linear Failure Rate  2           1.8354     1.1325     0.8926     -          -          3.8605
Linear Failure Rate  3           1.8243     1.1282     0.8798     0.6941     -          4.5264
Linear Failure Rate  4           1.8133     1.1237     0.8762     0.7055     0.6269     5.1456
22
The Impact of Workload
23
Comparison Between Gracefully Degrading System
and Standby Redundant System
                                            Hot      Warm     Warm     Warm     Warm     Cold
Distribution         Redundancy  Scheme     Standby  Standby  Standby  Standby  Standby  Standby
Weibull              2           Degrading  1.5039   1.8232   2.1497   2.2930   2.4265   2.6258
Weibull              2           Standby    1.5314   1.8227   2.1133   2.2488   2.3484   2.5309
Weibull              4           Degrading  1.5046   1.8521   2.2305   2.4432   2.5771   2.8376
Weibull              4           Standby    1.5577   1.8545   2.1715   2.3103   2.4266   2.6261
Linear Failure Rate  2           Degrading  1.9115   2.3197   2.7070   2.8697   3.0105   3.2424
Linear Failure Rate  2           Standby    1.9608   2.3314   2.7330   2.8851   3.0091   3.2146
Linear Failure Rate  4           Degrading  2.1348   2.7122   3.3642   3.6529   3.9385   4.3590
Linear Failure Rate  4           Standby    2.3008   2.7899   3.4307   3.6015   3.8588   4.1881
24
Conclusion
  • State-of-the-art CMOS technology enables
    chip-level manycore processors
  • The lifetime reliability of such large circuits is
    a major concern
  • We propose a comprehensive analytical model to
    estimate the lifetime reliability of manycore
    systems
  • Some experimental results are shown to
    demonstrate the effectiveness of the proposed
    model

25
Thank You for Your Attention!