Extending Amdahl's Law in the Multicore Era


Title: Extending Amdahl's Law in the Multicore Era


1
Extending Amdahl's Law in the Multicore Era
  • Erlin Yao, Yungang Bao, Guangming Tan and Mingyu
    Chen
  • Institute of Computing Technology, Chinese
    Academy of Sciences
  • yaoerlin_at_gmail.com, baoyg, tgm, cmy_at_ncic.ac.cn

2
A Brief Intro to ICT, CAS
ICT has developed the Loongson CPU.
ICT has built the fastest HPC system in China, the Dawning
5000, which delivers 233.5 TFlops and ranks 10th in the
Top500.
3
Outline
  • I. Background and Related Works
  • II. Model of Multicore Scalability
  • III. Symmetrical Multicore Chips
  • IV. Asymmetrical Multicore Chips
  • V. Dynamic Multicore Chips
  • VI. Conclusion and Future Work

4
We are in the Multi-Core Era
  • Mainstream market has already been dominated by
    multicore
  • Intel: 2-core Core Duo, 4-core i7
  • AMD: 2-core Athlon, 4-core Opteron
  • IBM: 2-core POWER6, 9-core Cell
  • Sun: 8-core T1/T2

5
Many-Core is coming
  • Some processor vendors have announced or released
    their manycore processors
  • Tilera: 64-core
  • Intel: 80-core
  • GPGPU: 100x-core

6
Revisiting Amdahl's Law in the Multi/Many-Core Era
  • Assume that a fraction f of a program's execution
    time is infinitely parallelizable with no
    scheduling overhead, while the remaining
    fraction, 1 - f, is totally sequential, and that p
    processors are used to accelerate the parallel
    fraction.
  • This is fixed-size speedup: the amount of work to be
    executed is independent of the number of
    processors.
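
Under these assumptions, Amdahl's law gives the speedup on p processors in its standard form (the name Speedup_Amdahl is notation chosen here for illustration):

```latex
\mathrm{Speedup}_{\mathrm{Amdahl}}(f, p) = \frac{1}{(1 - f) + \dfrac{f}{p}}
```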

7
Implications of Amdahl's Law
  • Despite its simplicity, Amdahl's law applies
    broadly and gives important insights, such as:
  • (i) Attack the common case: when f is small,
    optimization will have little effect.
  • (ii) The aspects you ignore also limit speedup:
    even if p approaches infinity, speedup is bounded
    by 1/(1-f).
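
Both insights can be checked numerically. The following is a minimal Python sketch using the standard Amdahl formula 1/((1-f) + f/p); the function names and the sample values of f and p are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Classic Amdahl speedup: fraction f runs on p processors, 1 - f stays sequential."""
    return 1.0 / ((1.0 - f) + f / p)


def amdahl_limit(f: float) -> float:
    """Upper bound on the speedup as p approaches infinity."""
    return 1.0 / (1.0 - f)


if __name__ == "__main__":
    # Illustrative values only: a small and a large parallel fraction.
    for f in (0.1, 0.9):
        for p in (4, 64, 1024):
            print(f"f={f}, p={p}: speedup={amdahl_speedup(f, p):.3f}")
        print(f"f={f}: limit as p grows = {amdahl_limit(f):.3f}")
```

With a small f the speedup barely moves as p grows, and for any f it stays below 1/(1-f).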

8
Mark Hill et al.'s Insights
  • Hill and Marty apply Amdahl's law to multicore
    hardware by constructing a cost model for the
    number and performance of cores on one chip.
  • ⇒ Obtaining optimal multicore performance
    requires further research both in extracting more
    parallelism and in making sequential cores
    faster.
  • Woo and Lee have extended Hill's work by taking
    power and energy into account.

9
Motivation of Our Work
  • The revised Amdahl's law model provides a better
    understanding of multicore scalability.
  • However, there is little work on theoretical
    analysis.
  • This paper presents our investigations on
    theoretical analysis of multicore scalability and
    attempts to find the optimal results under
    different conditions.

10
Model of Multicore Scalability
  • We adopt the same cost model for multicore
    hardware proposed by Hill and Marty, which
    includes two assumptions:
  • First, assume that a multicore chip of given size
    and technology generation can contain at most n
    base core equivalents (BCEs).
  • Second, assume that an individual core with more
    resources (r BCEs) can achieve better sequential
    performance:
  • 1 < perf(r) < r
  • The architecture of multicore chips can be
    classified into three types:
  • Symmetric
  • Asymmetric
  • Dynamic

11
Model-Symmetrical
  • A symmetric multicore chip requires that all its
    cores have the same cost.
  • Example: given 16 BCEs,
  • r = 8 ⇒ 2 cores, 8 BCEs/core
  • r = 4 ⇒ 4 cores, 4 BCEs/core
  • Given the resource budget of n BCEs, we have n/r
    cores, each with r BCEs. The performance of each
    core is perf(r). Then we get
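
In Hill and Marty's cost model, which is adopted here, the symmetric speedup is usually written as below: the sequential fraction runs on one core at perf(r), and the parallel fraction runs on all n/r cores. The name Speedup_sym is notation chosen for this sketch.

```latex
\mathrm{Speedup}_{\mathrm{sym}}(f, n, r)
  = \frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f \cdot r}{\mathrm{perf}(r) \cdot n}}
```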

12
Model-Asymmetrical
  • In an asymmetric multicore chip, several cores
    are more powerful than the others.
  • Example: given 16 BCEs,
  • 1 four-BCE core and 12 base cores.
  • 1 six-BCE core and 10 base cores.
  • Given the resource budget of n BCEs, we have
    1 + n - r cores: one larger core (with r BCEs)
    and n - r base cores (with 1 BCE each). Then we get
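
In Hill and Marty's model the asymmetric speedup is usually written as below: the sequential fraction runs on the large core at perf(r), and the parallel fraction uses the large core together with the n - r base cores. The name Speedup_asym is notation chosen for this sketch.

```latex
\mathrm{Speedup}_{\mathrm{asym}}(f, n, r)
  = \frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
```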

13
Model-Dynamic
  • A dynamic multicore chip can dynamically combine
    up to r cores into one core in order to boost
    sequential performance.
  • In sequential mode, it can execute with
    performance of perf(r) when the dynamic
    techniques use r BCEs.
  • In parallel mode, it can obtain performance of n
    using all base cores in parallel.
  • Then, we get
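
Combining the two modes described above gives the dynamic speedup in the form commonly used in Hill and Marty's model: sequential work runs at perf(r) and parallel work runs at n. The name Speedup_dyn is notation chosen for this sketch.

```latex
\mathrm{Speedup}_{\mathrm{dyn}}(f, n, r)
  = \frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f}{n}}
```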

14
Symmetrical Multicore Chips
  • For fixed n and r, speedup is an increasing function
    of f.
  • For fixed f and r, speedup is an increasing function
    of n.
  • ⇒ Increasing both the parallel fraction (f) and
    the number of BCEs (n) can improve the
    speedup of a symmetric multicore chip.
  • For fixed f and n, we have the following theorem:

15
Symmetrical Multicore Chips
  • For any fixed f and c (where perf(r) = r^c):
  • if f < c, the maximum speedup is achieved at
    r = n.
  • if f > c and n is not big, the maximum speedup is
    achieved at r = 1.
  • if f > c and n is big enough, then to obtain optimal
    multicore performance, the resources of several
    BCEs (1 < r < n) should be dedicated to each core,
    so that individual cores offer reasonable
    performance.
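
The three cases can be reproduced with a short Python sketch. It assumes perf(r) = r^c and the Hill-Marty symmetric speedup, and it treats r as a continuous variable. The closed form in `stationary_r` comes from setting the derivative of the speedup with respect to r to zero; it is a derivation made for this sketch, not a formula quoted from the slides, and all function names are hypothetical.

```python
def perf(r: float, c: float) -> float:
    """Assumed performance model perf(r) = r**c with 0 < c < 1."""
    return r ** c


def symmetric_speedup(f: float, n: float, r: float, c: float) -> float:
    """Hill-Marty symmetric speedup: n/r cores, each built from r BCEs."""
    return 1.0 / ((1.0 - f) / perf(r, c) + f * r / (perf(r, c) * n))


def stationary_r(f: float, n: float, c: float) -> float:
    """Point where d(speedup)/dr = 0 (continuous relaxation, derived for this sketch)."""
    return c * (1.0 - f) * n / (f * (1.0 - c))


def best_r(f: float, n: float, c: float) -> float:
    """Clamp the stationary point to the feasible range 1 <= r <= n."""
    return min(max(stationary_r(f, n, c), 1.0), n)


if __name__ == "__main__":
    # Illustrative parameter choices covering the three cases of the theorem.
    for f, n, c in ((0.3, 256, 0.5), (0.9, 4, 0.5), (0.9, 256, 0.5)):
        r = best_r(f, n, c)
        print(f"f={f}, n={n}, c={c}: best r = {r:.2f}, "
              f"speedup = {symmetric_speedup(f, n, r, c):.2f}")
```

The stationary point is below n exactly when f > c, and above 1 exactly when n exceeds f(1-c)/(c(1-f)), which is one way to read "n is big enough" under these assumptions.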

16
Symmetrical Multicore Chips
  • If n is big enough, will the maximum speedup
    always be achieved between the extremes for any
    perf(x) < x?
  • Counterexamples:
  • (i) perf(x) = kx, for any 0 < k < 1
  • (ii) perf(x) = x^c, for any f < c < 1.
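
Both counterexamples can be checked by brute force. The sketch below assumes the Hill-Marty symmetric speedup and searches every integer r from 1 to n, ignoring the requirement that r divide n; the helper names and the sample values of f, k, c and n are illustrative assumptions.

```python
from typing import Callable


def symmetric_speedup(f: float, n: int, r: int, perf: Callable[[float], float]) -> float:
    """Hill-Marty symmetric speedup for an arbitrary perf function."""
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))


def argmax_r(f: float, n: int, perf: Callable[[float], float]) -> int:
    """Integer r in 1..n that maximizes the symmetric speedup."""
    return max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r, perf))


if __name__ == "__main__":
    n = 1024
    # (i) linear perf(x) = kx: the optimum sits at the extreme r = n.
    print(argmax_r(0.9, n, lambda x: 0.5 * x))
    # (ii) perf(x) = x^c with f < c: the optimum also sits at r = n.
    print(argmax_r(0.5, n, lambda x: x ** 0.8))
    # For contrast, perf(x) = x^c with f > c gives an interior optimum.
    print(argmax_r(0.9, n, lambda x: x ** 0.5))
```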

17
Asymmetrical Multicore Chips
  • Similarly, increasing both the parallel fraction
    (f) and the number of BCEs (n) can improve the
    speedup of an asymmetric multicore chip.
  • For fixed f and n, we have the following theorem:

18
Asymmetrical Multicore Chips
  • If f > c and n is not big, the maximum speedup is
    achieved at r = 1.
  • If f < c and n is not big, the maximum speedup is
    achieved at r = n.
  • For any fixed f and c, if n is big enough, the
    maximum speedup is achieved at some r0 with 1 < r0 < n.
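
A numerical search illustrates these cases. The sketch below assumes perf(r) = r^c and the Hill-Marty asymmetric speedup, and scans integer values of r; the function names and the parameter values are illustrative assumptions rather than values from the slides.

```python
def asymmetric_speedup(f: float, n: int, r: int, c: float) -> float:
    """Hill-Marty asymmetric speedup: one r-BCE core plus n - r base cores."""
    perf_r = r ** c  # assumed performance model perf(r) = r**c
    return 1.0 / ((1.0 - f) / perf_r + f / (perf_r + n - r))


def best_r_asymmetric(f: float, n: int, c: float) -> int:
    """Integer r in 1..n that maximizes the asymmetric speedup."""
    return max(range(1, n + 1), key=lambda r: asymmetric_speedup(f, n, r, c))


if __name__ == "__main__":
    c = 0.5
    for f, n in ((0.9, 4), (0.3, 4), (0.9, 1024), (0.3, 1024)):
        r0 = best_r_asymmetric(f, n, c)
        print(f"f={f}, n={n}: r0={r0}, "
              f"speedup={asymmetric_speedup(f, n, r0, c):.2f}")
```

With these sample values, the small-n cases land on r = 1 or r = n depending on whether f is above or below c, while n = 1024 gives an interior r0 in both cases.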

19
Asymmetrical Multicore Chips
  • Note that the optimal r0 in Theorem 2 cannot be
    solved analytically.
  • r0 grows linearly with n, and if n is big enough, r0
    can approach n arbitrarily closely.

20
Asymmetrical Multicore Chips
  • If n is big enough, will the maximum speedup
    always be achieved between the extremes for any
    perf(x) < x?
  • Counterexample:
  • perf(x) = kx, for any f < k < 1.
  • For saturated functions,
  • like p(x) = x^c or p(x) = k x^c + m x^c', where c, c' < 1.

21
Asymmetrical Multicore Chips
  • Based on the simplistic assumptions of Amdahl's
    law, it makes most sense to devote extra
    resources to increasing only one core's capability.
    In fact, we have the following theorem:
  • Although the architecture of an asymmetric multicore
    chip using one large core and many base cores was
    originally assumed for simplicity, it is indeed
    the optimal architecture in the sense of speedup.

22
Dynamic Multicore Chips
  • We should increase both f and n to enhance the
    speedup of a dynamic multicore chip.
  • For fixed f and n,
  • if perf(r) is an increasing function, speedup is
    also an increasing function of r
  • ⇒ the maximum speedup is always achieved at
    r = n.
  • ⇒ Dynamic multicore chips can offer potential
    speedups that are never worse, and can be greater, than
    those of symmetric or asymmetric multicore chips
    with identical perf(r) functions.
  • So researchers should continue to investigate
    methods that approximate a dynamic multicore chip.
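
The ordering of the three designs can be checked with a small Python sketch. It assumes the Hill-Marty speedup formulas and perf(r) = r^c with an illustrative c = 0.5, and compares the best speedup of each design over integer r; all names and sample values are hypothetical.

```python
def perf(r: float, c: float = 0.5) -> float:
    """Assumed per-core performance perf(r) = r**c (c is illustrative)."""
    return r ** c


def symmetric(f: float, n: int, r: int, c: float = 0.5) -> float:
    """n/r identical cores of r BCEs each."""
    return 1.0 / ((1.0 - f) / perf(r, c) + f * r / (perf(r, c) * n))


def asymmetric(f: float, n: int, r: int, c: float = 0.5) -> float:
    """One r-BCE core plus n - r base cores."""
    return 1.0 / ((1.0 - f) / perf(r, c) + f / (perf(r, c) + n - r))


def dynamic(f: float, n: int, r: int, c: float = 0.5) -> float:
    """Sequential mode at perf(r), parallel mode at n."""
    return 1.0 / ((1.0 - f) / perf(r, c) + f / n)


def best(model, f: float, n: int, c: float = 0.5) -> float:
    """Best speedup of a model over integer r in 1..n."""
    return max(model(f, n, r, c) for r in range(1, n + 1))


if __name__ == "__main__":
    n = 256
    for f in (0.5, 0.9, 0.99):
        print(f"f={f}: symmetric={best(symmetric, f, n):.2f}, "
              f"asymmetric={best(asymmetric, f, n):.2f}, "
              f"dynamic={best(dynamic, f, n):.2f}")
```

Because perf(r) is at most r, the parallel term f/n of the dynamic chip is never larger than the corresponding term of the other two designs at the same r, which is why the dynamic speedup is never worse.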

23
Potentials of Maximum Speedups
  • Recall that in Amdahl's law, even if the
    number of processors approaches infinity, the
    speedup is bounded by 1/(1-f).
  • Here, increasing n can improve the speedup
    continuously. Under the assumption of perf(r) =
    r^c, when n approaches infinity, the speedup can
    also approach infinity even if the performance
    index c is small.
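
One way to see this, sketched here under the Hill-Marty symmetric formula with perf(r) = r^c, is to spend all n BCEs on a single core (r = n). The speedup then equals n^c, which grows without bound for any c > 0:

```latex
\mathrm{Speedup}_{\mathrm{sym}}(f, n, r = n)
  = \frac{1}{\dfrac{1 - f}{n^{c}} + \dfrac{f \cdot n}{n^{c} \cdot n}}
  = n^{c} \longrightarrow \infty
  \quad (n \to \infty,\ c > 0)
```

The asymmetric formula gives the same value at r = n, and the dynamic formula gives at least as much, so the maximum speedup of all three designs is at least n^c under these assumptions.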

24
Implications and Results
  • A theoretical analysis of multicore scalability
    is presented, and quantitative conditions are
    given to determine how to obtain optimal
    multicore performance.
  • The theorems and corollary provide computer
    architects with a better understanding of
    multicore design types, enabling them to make
    more informed tradeoffs.
  • However, our precise quantitative results are
    suspect because the real world is much more
    complex. The model considered here ignores many
    important structures.
  • This theoretical analysis attempts to provide
    insights on future work.

25
Future Work
  • In real applications, the parallel fraction f cannot
    be infinitely parallelizable. The degree of
    parallelism can be bounded by some constant d or
    even be random in some circumstances.
  • Introducing practical structures, such as the memory
    hierarchy, shared caches, etc.
  • More cores might allow more parallelism for
    larger problem sizes. Fixed-time speedup, as in
    Gustafson's law, should be considered.

26
Acknowledgements
  • We would like to thank Professor Mark Hill for
    his valuable comments and suggestions.
  • We also appreciate the help of Dr. Mark Squillante
    and the arrangements made by the MAMA organizers for
    this video presentation.

27
Thanks
  • Questions and comments are welcome