Title: Extending Amdahl's Law in the Multicore Era
1. Extending Amdahl's Law in the Multicore Era
- Erlin Yao, Yungang Bao, Guangming Tan and Mingyu Chen
- Institute of Computing Technology, Chinese Academy of Sciences
- yaoerlin_at_gmail.com, baoyg, tgm, cmy_at_ncic.ac.cn
2. A Brief Intro of ICT, CAS
- ICT has developed the Loongson CPU.
- ICT has built Dawning 5000, the fastest HPC system in China, which delivers 233.5 TFlops and ranks 10th in the Top500.
3. Outline
- I. Background and Related Works
- II. Model of Multicore Scalability
- III. Symmetrical Multicore Chips
- IV. Asymmetrical Multicore Chips
- V. Dynamic Multicore Chips
- VI. Conclusion and Future Work
4. We Are in the Multi-Core Era
- The mainstream market is already dominated by multicore processors:
  - Intel: 2-core Core Duo, 4-core i7
  - AMD: 2-core Athlon, 4-core Opteron
  - IBM: 2-core POWER6, 9-core Cell
  - Sun: 8-core T1/T2
5. Many-Core Is Coming
- Some processor vendors have announced or released manycore processors:
  - Tilera: 64 cores
  - Intel: 80 cores
  - GPGPU: hundreds of cores
6. Revisiting Amdahl's Law in the Multi/Many-Core Era
- Assume that a fraction f of a program's execution time is infinitely parallelizable with no scheduling overhead, while the remaining fraction, 1 - f, is totally sequential, and that p processors are used to accelerate the parallel fraction.
- Fixed-size speedup: the amount of work to be executed is independent of the number of processors, so
  $$\mathrm{Speedup}(f, p) = \frac{1}{(1-f) + \frac{f}{p}}$$
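The fixed-size speedup described above is Amdahl's law, Speedup(f, p) = 1 / ((1 - f) + f/p). A minimal Python sketch follows; the function name `amdahl_speedup` and the sample values are illustrative choices, not taken from the slides.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Fixed-size speedup for parallel fraction f on p processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    # The sequential part (1 - f) is not accelerated;
    # the parallel part f is divided evenly across p processors.
    return 1.0 / ((1.0 - f) + f / p)


if __name__ == "__main__":
    for p in (1, 4, 16, 64):
        print(f"p={p:3d}  speedup={amdahl_speedup(0.9, p):.3f}")
```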
7. Implications of Amdahl's Law
- Despite its simplicity, Amdahl's law applies broadly and gives important insights such as:
  - (i) Attack the common case: when f is small, optimization will have little effect.
  - (ii) The aspects you ignore also limit speedup: even if p approaches infinity, speedup is bounded by 1/(1-f).
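Both insights can be checked numerically. The sketch below evaluates Amdahl's formula with a very large processor count and compares it against the limit 1/(1 - f); the helper names and the sample values of f are assumptions made for illustration.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Amdahl's law: speedup for parallel fraction f on p processors."""
    return 1.0 / ((1.0 - f) + f / p)


def amdahl_limit(f: float) -> float:
    """Upper bound on the speedup as p grows without limit (requires f < 1)."""
    return 1.0 / (1.0 - f)


if __name__ == "__main__":
    # A small f leaves little to gain; a large f is still capped by 1/(1 - f).
    for f in (0.1, 0.5, 0.99):
        print(f"f={f:4.2f}  p=10^6: {amdahl_speedup(f, 1e6):8.3f}"
              f"  limit: {amdahl_limit(f):8.3f}")
```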
8. Mark Hill et al.'s Insights
- Hill and Marty apply Amdahl's law to multicore hardware by constructing a cost model for the number and performance of cores on one chip.
  - → Obtaining optimal multicore performance requires further research both in extracting more parallelism and in making sequential cores faster.
- Woo and Lee have extended Hill's work by taking power and energy into account.
9. Motivation of Our Work
- The revised Amdahl's law model provides a better understanding of multicore scalability.
- However, there has been little theoretical analysis of this model.
- This paper presents our theoretical analysis of multicore scalability and attempts to find the optimal results under different conditions.
10. Model of Multicore Scalability
- We adopt the cost model for multicore hardware proposed by Hill and Marty, which includes two assumptions:
  - First, assume that a multicore chip of given size and technology generation can contain at most n base core equivalents (BCEs).
  - Second, assume that an individual core with more resources (r BCEs) can achieve better sequential performance: $1 < \mathrm{perf}(r) < r$.
- The architecture of multicore chips can be classified into three types:
  - Symmetric
  - Asymmetric
  - Dynamic
11. Model-Symmetrical
- A symmetric multicore chip requires that all its cores have the same cost.
- Example: given 16 BCEs,
  - r = 8 → 2 cores, 8 BCEs/core
  - r = 4 → 4 cores, 4 BCEs/core
- Given the resource budget of n BCEs, we have n/r cores, each with r BCEs. The performance of each core is perf(r). Then we get
  $$\mathrm{Speedup}_{\mathrm{symmetric}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f \cdot r}{\mathrm{perf}(r) \cdot n}}$$
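A minimal sketch of the symmetric model, assuming the Hill and Marty speedup formula 1 / ((1 - f)/perf(r) + f·r/(perf(r)·n)). The default perf(r) = sqrt(r), the function name, and the sample f = 0.9 are illustrative assumptions, not values from the slides.

```python
import math
from typing import Callable


def symmetric_speedup(f: float, n: int, r: int,
                      perf: Callable[[float], float] = math.sqrt) -> float:
    """Speedup of a chip with n/r identical cores of r BCEs each."""
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    seq_time = (1.0 - f) / perf(r)    # sequential part runs on one core
    par_time = f * r / (perf(r) * n)  # parallel part runs on all n/r cores
    return 1.0 / (seq_time + par_time)


if __name__ == "__main__":
    n = 16
    for r in (1, 2, 4, 8, 16):
        print(f"r={r:2d}  cores={n // r:2d}  "
              f"speedup={symmetric_speedup(0.9, n, r):.3f}")
```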
12. Model-Asymmetrical
- In an asymmetric multicore chip, several cores are more powerful than the others.
- Example: given 16 BCEs,
  - 1 four-BCE core and 12 base cores
  - 1 six-BCE core and 10 base cores
- Given the resource budget of n BCEs, we have 1 + n - r cores: one larger core (with r BCEs) and n - r base cores (with 1 BCE each). Then we get
  $$\mathrm{Speedup}_{\mathrm{asymmetric}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f}{\mathrm{perf}(r) + n - r}}$$
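A minimal sketch of the asymmetric model, assuming the Hill and Marty formula 1 / ((1 - f)/perf(r) + f/(perf(r) + n - r)): the sequential part runs on the large core alone, and the parallel part uses the large core together with the n - r base cores. The default perf(r) = sqrt(r) and the function name are assumptions for illustration.

```python
import math
from typing import Callable


def asymmetric_speedup(f: float, n: int, r: int,
                       perf: Callable[[float], float] = math.sqrt) -> float:
    """Speedup of a chip with one r-BCE core plus n - r base cores."""
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    seq_time = (1.0 - f) / perf(r)    # large core only
    par_time = f / (perf(r) + n - r)  # large core plus all base cores
    return 1.0 / (seq_time + par_time)


if __name__ == "__main__":
    # The two 16-BCE examples: one 4-BCE core + 12 base cores,
    # and one 6-BCE core + 10 base cores.
    for r in (4, 6):
        print(f"r={r}  base cores={16 - r}  "
              f"speedup={asymmetric_speedup(0.9, 16, r):.3f}")
```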
13. Model-Dynamic
- A dynamic multicore chip can dynamically combine up to r cores into one core in order to boost sequential performance.
  - In sequential mode, it can execute with performance perf(r) when the dynamic techniques use r BCEs.
  - In parallel mode, it can obtain performance n using all base cores in parallel.
- Then we get
  $$\mathrm{Speedup}_{\mathrm{dynamic}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f}{n}}$$
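A minimal sketch of the dynamic model, assuming the Hill and Marty formula 1 / ((1 - f)/perf(r) + f/n). The default perf(r) = sqrt(r) and the function name are assumptions for illustration.

```python
import math
from typing import Callable


def dynamic_speedup(f: float, n: int, r: int,
                    perf: Callable[[float], float] = math.sqrt) -> float:
    """Speedup of a chip that fuses r BCEs for sequential code
    and uses all n base cores for parallel code."""
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    seq_time = (1.0 - f) / perf(r)  # sequential mode: one fused core
    par_time = f / n                # parallel mode: n base cores
    return 1.0 / (seq_time + par_time)


if __name__ == "__main__":
    for r in (1, 4, 16):
        print(f"r={r:2d}  speedup={dynamic_speedup(0.9, 16, r):.3f}")
```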
14. Symmetrical Multicore Chips
- For fixed n and r, speedup is an increasing function of f.
- For fixed f and r, speedup is an increasing function of n.
  - → Increasing both the parallel fraction (f) and the number of base cores (n) can improve the speedup of a symmetric multicore chip.
- For fixed f and n, we have the following theorem.
15. Symmetrical Multicore Chips
- For any fixed f and c, where c is the performance index in $\mathrm{perf}(r) = r^c$:
  - if $f < c$, the maximum speedup is achieved at $r = n$;
  - if $f > c$ and n is not big, the maximum speedup is achieved at $r = 1$;
  - if $f > c$ and n is big enough, then to obtain optimal multicore performance, a share of the BCEs strictly between 1 and n should be dedicated to each core, intended to offer reasonable individual core performance.
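The three cases can be explored numerically. The sketch below assumes perf(r) = r^c with 0 < c < 1 and the Hill and Marty symmetric speedup formula. The stationary point r* = c(1 - f)n / ((1 - c)f) is derived here by setting the derivative of the denominator to zero; it is an illustration under these assumptions, not a formula quoted from the theorem. All function names are hypothetical.

```python
def symmetric_speedup(f: float, n: int, r: float, c: float) -> float:
    """Symmetric speedup with perf(r) = r**c."""
    perf = r ** c
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n))


def stationary_r(f: float, n: int, c: float) -> float:
    """Stationary point of the continuous problem, clipped to [1, n].

    Requires 0 < f < 1 and 0 < c < 1.
    """
    r_star = c * (1.0 - f) * n / ((1.0 - c) * f)
    return min(max(r_star, 1.0), float(n))


def best_integer_r(f: float, n: int, c: float) -> int:
    """Brute-force search over integer core sizes 1..n."""
    return max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r, c))


if __name__ == "__main__":
    for f, n in ((0.3, 64), (0.9, 4), (0.9, 256)):
        print(f"f={f} n={n} c=0.5  best r={best_integer_r(f, n, 0.5)}  "
              f"stationary r={stationary_r(f, n, 0.5):.2f}")
```

With c = 0.5, the three sample settings fall into the three cases in turn: f < c, f > c with small n, and f > c with large n.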
16. Symmetrical Multicore Chips
- If n is big enough, will the maximum speedup always be achieved between the extremes for any $\mathrm{perf}(x) < x$?
- Counterexamples:
  - (i) $\mathrm{perf}(x) = kx$, for any $0 < k < 1$
  - (ii) $\mathrm{perf}(x) = x^c$, for any $f < c < 1$
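Both counterexamples can be checked by brute force. The sketch below assumes the Hill and Marty symmetric speedup formula and searches all integer core sizes; the sample values f = 0.6, k = 0.5, c = 0.8 and n = 1024 are chosen here for illustration.

```python
from typing import Callable


def symmetric_speedup(f: float, n: int, r: int,
                      perf: Callable[[float], float]) -> float:
    """Symmetric speedup for an arbitrary perf function."""
    p = perf(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def best_integer_r(f: float, n: int,
                   perf: Callable[[float], float]) -> int:
    """Integer core size in 1..n that maximizes the speedup."""
    return max(range(1, n + 1),
               key=lambda r: symmetric_speedup(f, n, r, perf))


if __name__ == "__main__":
    f, n = 0.6, 1024
    # (i) linear perf with 0 < k < 1; (ii) power-law perf with f < c < 1.
    print("perf(x) = 0.5x  ->", best_integer_r(f, n, lambda x: 0.5 * x))
    print("perf(x) = x^0.8 ->", best_integer_r(f, n, lambda x: x ** 0.8))
```

In both cases the search returns r = n, so the optimum sits at an extreme even though n is large.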
17. Asymmetrical Multicore Chips
- Similarly, increasing both the parallel fraction (f) and the number of BCEs (n) can improve the speedup of an asymmetric multicore chip.
- For fixed f and n, we have the following theorem.
18. Asymmetrical Multicore Chips
- If $f > c$ and n is not big, the maximum speedup is achieved at $r = 1$.
- If $f < c$ and n is not big, the maximum speedup is achieved at $r = n$.
- For any fixed f and c, if n is big enough, the maximum speedup is achieved at some $r_0$ with $1 < r_0 < n$.
19. Asymmetrical Multicore Chips
- Note that the optimal $r_0$ in Theorem 2 cannot be solved analytically.
- $r_0$ is linear in n, and if n is big enough, $r_0$ approaches n arbitrarily closely.
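Because r0 has no closed form, it can be located numerically. The sketch below assumes perf(r) = r^c and the Hill and Marty asymmetric speedup formula, and does a brute-force search over integer sizes of the large core; the function names and sample parameters are illustrative assumptions.

```python
def asymmetric_speedup(f: float, n: int, r: int, c: float) -> float:
    """Asymmetric speedup with perf(r) = r**c."""
    perf = r ** c
    return 1.0 / ((1.0 - f) / perf + f / (perf + n - r))


def best_integer_r(f: float, n: int, c: float) -> int:
    """Integer size of the large core (1..n) that maximizes the speedup."""
    return max(range(1, n + 1),
               key=lambda r: asymmetric_speedup(f, n, r, c))


if __name__ == "__main__":
    c = 0.5
    print("f > c, small n:", best_integer_r(0.9, 2, c))
    print("f < c, small n:", best_integer_r(0.3, 2, c))
    for n in (256, 4096):
        r0 = best_integer_r(0.9, n, c)
        print(f"n={n}: r0={r0}, r0/n={r0 / n:.3f}")
```

For these sample values the optimum is interior once n is large, and the ratio r0/n grows with n.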
20. Asymmetrical Multicore Chips
- If n is big enough, will the maximum speedup always be achieved between the extremes for any $\mathrm{perf}(x) < x$?
- Counterexample:
  - $\mathrm{perf}(x) = kx$, for any $f < k < 1$
- For saturated functions,
  - like $p(x) = x^c$ or $p(x) = kx^c + mx^{c'}$, where $c, c' < 1$.
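The linear counterexample can be checked by brute force. The sketch below assumes the Hill and Marty asymmetric speedup formula; the sample values f = 0.6, k = 0.8 (so that f < k) and n = 1024 are chosen here for illustration, with k = 0.3 shown as a contrast case where k < f.

```python
from typing import Callable


def asymmetric_speedup(f: float, n: int, r: int,
                       perf: Callable[[float], float]) -> float:
    """Asymmetric speedup for an arbitrary perf function."""
    p = perf(r)
    return 1.0 / ((1.0 - f) / p + f / (p + n - r))


def best_integer_r(f: float, n: int,
                   perf: Callable[[float], float]) -> int:
    """Integer size of the large core (1..n) that maximizes the speedup."""
    return max(range(1, n + 1),
               key=lambda r: asymmetric_speedup(f, n, r, perf))


if __name__ == "__main__":
    f, n = 0.6, 1024
    print("k = 0.8 (f < k):", best_integer_r(f, n, lambda x: 0.8 * x))
    print("k = 0.3 (k < f):", best_integer_r(f, n, lambda x: 0.3 * x))
```

With f < k the best choice is r = n even for large n, while with k < f the optimum moves to the interior.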
21. Asymmetrical Multicore Chips
- Based on the simplistic assumptions of Amdahl's law, it makes most sense to devote extra resources to increasing only one core's capability. In fact, we have the following theorem.
- Although the asymmetric multicore architecture with one large core and many base cores was originally assumed for simplicity, it is indeed the optimal architecture in terms of speedup.
22. Dynamic Multicore Chips
- We should increase both f and n to enhance the speedup of a dynamic multicore chip.
- For fixed f and n,
  - if perf(r) is an increasing function, speedup is also an increasing function of r
  - → the maximum speedup is always achieved at $r = n$.
  - → Dynamic multicore chips can offer potential speedups that are greater than, and never worse than, those of symmetric or asymmetric multicore chips with identical perf(r) functions.
- So researchers should continue to investigate methods that approximate a dynamic multicore chip.
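The comparison across the three chip types can be sketched as follows, assuming the Hill and Marty speedup formulas for each type and, for illustration only, perf(r) = sqrt(r). The function names and sample parameters are assumptions made here.

```python
import math
from typing import Callable

Perf = Callable[[float], float]


def symmetric_speedup(f: float, n: int, r: int, perf: Perf) -> float:
    p = perf(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def asymmetric_speedup(f: float, n: int, r: int, perf: Perf) -> float:
    p = perf(r)
    return 1.0 / ((1.0 - f) / p + f / (p + n - r))


def dynamic_speedup(f: float, n: int, r: int, perf: Perf) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f / n)


if __name__ == "__main__":
    f, n = 0.9, 64
    for r in (1, 4, 16, 64):
        print(f"r={r:2d}"
              f"  sym={symmetric_speedup(f, n, r, math.sqrt):7.3f}"
              f"  asym={asymmetric_speedup(f, n, r, math.sqrt):7.3f}"
              f"  dyn={dynamic_speedup(f, n, r, math.sqrt):7.3f}")
```

The three models share the sequential term and differ only in the parallel term, which is why the dynamic chip is never worse whenever perf(r) <= r.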
23. Potentials of Maximum Speedups
- Recall that in Amdahl's law, even if the number of processors approaches infinity, the speedup is bounded by 1/(1-f).
- Here, increasing n keeps improving the speedup. Under the assumption of $\mathrm{perf}(r) = r^c$, when n approaches infinity, the speedup can also approach infinity even if the performance index c is small.
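A small numeric illustration of this unbounded growth, assuming perf(r) = r^c and, as one concrete case, the Hill and Marty dynamic speedup formula evaluated at r = n. The function name and the sample values f = 0.5 and c = 0.1 are assumptions chosen here.

```python
def dynamic_speedup_at_full_fusion(f: float, n: int, c: float) -> float:
    """Dynamic-chip speedup with perf(r) = r**c and r = n."""
    return 1.0 / ((1.0 - f) / n ** c + f / n)


if __name__ == "__main__":
    f, c = 0.5, 0.1
    print("Amdahl bound 1/(1-f):", 1.0 / (1.0 - f))
    for exponent in (10, 20, 40):
        n = 2 ** exponent
        print(f"n=2^{exponent}: "
              f"speedup={dynamic_speedup_at_full_fusion(f, n, c):.3f}")
```

Even with c as small as 0.1, the speedup passes the classical bound 1/(1 - f) and continues to grow with n, although slowly.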
24. Implications and Results
- We present a theoretical analysis of multicore scalability and give quantitative conditions that determine how to obtain optimal multicore performance.
- The theorems and corollary provide computer architects with a better understanding of multicore design types, enabling them to make more informed tradeoffs.
- However, our precise quantitative results are suspect because the real world is much more complex. The model considered here ignores many important structures.
- This theoretical analysis attempts to provide insights for future work.
25. Future Work
- In real applications, the parallel fraction f cannot be infinitely parallelizable. The degree of parallelism can be bounded by some constant d or even be random in some circumstances.
- Introducing practical structures, such as the memory hierarchy, shared caches, etc.
- More cores might allow more parallelism for larger problem sizes. Fixed-time speedup, as in Gustafson's law, should be considered.
26. Acknowledgements
- We would like to thank Professor Mark Hill for his valuable comments and suggestions.
- We also appreciate the help of Dr. Mark Squillante and the arrangements made by the MAMA organizers for this video presentation.
27. Thanks
- Questions and comments are welcome.