APC523/AST523 Scientific Computation in Astrophysics

Provided by: tri5104
1
APC523/AST523: Scientific Computation in
Astrophysics
  • Lecture 2
  • Computer Architecture

2
Is it really necessary to study computer
architecture (CA) to be proficient in scientific
computation?
After all, you don't need to know how an internal
combustion engine works to drive a car.
3
If you plan to run only canned software packages,
then you probably do not need to know anything
about CA (and you probably shouldn't be taking
this course!)
If you plan to write efficient code on modern
parallel processors, you have to understand how
those processors work.
4
Current trends in CA.
  • Desktop systems
      • driven by price/performance
  • Servers
      • driven by reliability, scalability
  • Embedded processors
      • driven by price/power consumption

We need to focus on desktop systems only.
5
Measuring Performance
Price/performance is the key design issue for
scientific computation. (Caveat: power
consumption is an important issue for large
clusters.)
Execution time: time between the start and end
of an event.
Throughput: total work done in a given amount
of time.
We are more interested in execution time than
throughput.
CPU time: time the CPU is computing.
Wall-clock time: total execution time, including
CPU time, I/O, OS overhead, everything.
The quantum of time on a computer is the clock
period.
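The distinction between CPU time and wall-clock time can be sketched in C. This is a minimal illustration, not part of the original slides: the helper names `cpu_seconds`, `wall_seconds` and `busy_work` are hypothetical, and the wall-clock helper assumes a C11 library that provides `timespec_get`.

```c
#include <time.h>

/* CPU time consumed by this process so far, in seconds (standard C clock()). */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Wall-clock time in seconds since the epoch (assumes C11 timespec_get). */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical workload to be timed: sums the integers 1..n. */
double busy_work(long n)
{
    double sum = 0.0;
    for (long k = 1; k <= n; k++)
        sum += (double)k;
    return sum;
}
```

To time a region, record both clocks before and after it and subtract. A wall-clock difference much larger than the CPU-time difference suggests the program spent its time waiting (I/O, OS overhead) rather than computing.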
6
Measuring performance
Benchmark: model program with which to measure
performance.
Real applications: best choice for a given user,
but how to weight the importance of different
applications? -> benchmark suites.
Kernel: small, compute-intensive piece of a real
program used as a benchmark, e.g. Linpack.
7
We've all been spoiled by Moore's Law
"The transistor density in integrated circuits
will be doubled every two years." Prediction
by Gordon Moore, 1965 (his original estimate was
a doubling every year; he revised it to every two
years in 1975).
Since performance scales with transistor density,
Moore's Law has been interpreted as a prediction
about performance as well. Actually, performance
doubles about every 18 months. Shows no sign of
ending.
Imagine if astronomical observatories doubled
their capabilities every 18 months, all with no
financing from the NSF!
But we are forced to use mass-produced commodity
processors that were not designed for scientific
computation.
8
Amdahl's Law
Defines the speedup that can be gained by a
particular performance enhancement.
Let a = fraction of program that can use the
enhancement, S = speedup of entire code,
S_a = speedup of enhanced portion of code by
itself. Then S = 1 / ((1 - a) + a/S_a).
E.g. suppose your program takes 20 secs to
execute. You reduce the execution time of some
portion of it from 10 secs to 5 secs. Then
a = 0.5, S_a = 2, S = 4/3 (20 secs reduced to
15 secs).
Amdahl's law expresses the law of diminishing
returns. Overall program performance is limited
by the slowest step. Improving performance of one
part may not lead to much improvement overall.
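As a hedged illustration, Amdahl's law in its usual form, S = 1 / ((1 - a) + a/S_a), can be written as a small C function. The function name `amdahl_speedup` is an assumption made for this sketch.

```c
/* Overall speedup S from Amdahl's law.
 *   a   : fraction of the run time that can use the enhancement (0..1)
 *   s_a : speedup of the enhanced portion by itself (must be > 0)
 */
double amdahl_speedup(double a, double s_a)
{
    return 1.0 / ((1.0 - a) + a / s_a);
}
```

For the example on the slide, `amdahl_speedup(0.5, 2.0)` gives 4/3. Even if the enhanced half ran infinitely fast, the overall speedup could never exceed 2, which is the diminishing-returns point.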
9
Top 500 list: www.top500.org
Since 1993, the Linpack benchmark has been used
to measure the performance of machines worldwide.
Every six months, a list of the 500 best
performing machines is released.
Of course, there is much competition to be at the
top of the list, but the utility of the list is
that it reveals important trends in the
architecture of high-performance computers.
10
Performance increase since 1993 is mind-boggling
(Figure panels: 1993, 2005)
12
Scalar processors completely dominate the list
today
13
Clusters beginning to dominate
14
Growing dominance by machines with 256 or more
processors
15
The story of mario: a lesson in the rapid pace of
progress
  • Cray C90, 16 Gflops, 4 GB main memory, 130 GB
    disks
  • Installed at Pittsburgh Supercomputing Center
    (PSC) in 1993
  • List price $35M
  • By 1996, Cray T3E at PSC outperformed mario by
    a factor of 10
  • Decommissioned in 1999
  • Sold on eBay in 2000 for $50k as living room
    furniture
  • Today, a quad Opteron with more memory and disk
    space is $5k

16
Basic components of a computer
  • Processor
  • Memory
  • Communication channels
  • Input/output (I/O) channels

17
1. Processor
The brains of the computer: performs operations
on data according to instructions controlled by
the programmer.
Generally, the programmer writes an algorithm in
a high-level language (e.g. C, C++, F90). The
language is translated into instructions by a
compiler (producing an executable).
Interpreted languages (e.g. Java) are also
popular. These are translated into instructions
at runtime. More flexible and easier to code, but
generally much slower, and they may have
unreliable floating-point arithmetic. Probably a
bad idea to use Java for large-scale scientific
computations.
18
The instruction set
Each processor family has its own specific
instruction set.
Increasingly complex instruction sets were
developed up to the 1980s (e.g. VAX, x86).
In the mid-80s, reduced instruction set computers
(RISC) were introduced (e.g. MIPS R4000). By
focusing on a smaller set of simple instructions,
more sophisticated hardware optimization
strategies could be implemented.
Today, almost all processors are RISC. The x86
instruction set only survives to retain binary
compatibility.
19
Basic architecture
Virtually all processors use a
register-to-register architecture. All data
processed by the CPU enters and leaves via
registers (usually 64 bits in size).
C = A + B becomes
    Load  R1, A
    Load  R2, B
    Add   R3, R1, R2
    Store R3, C
Operands: most 32-bit processors support 8-, 16-
and 32-bit integer operands, and 32- and 64-bit
floating-point arithmetic. 64-bit architectures
support 64-bit integers as well.
See HP Appendix G on the web to see how
floating-point arithmetic actually works.
20
Pipelining: most important RISC optimization
Often the same sequence of instructions is
repeated many times (loops). Can optimize by
designing the processor to overlap different
steps in the sequence, like a pipeline.

Clock cycle        1   2   3   4   5   6   7   8   9
instruction i      IF  ID  EX  ME  WB
instruction i+1        IF  ID  EX  ME  WB
instruction i+2            IF  ID  EX  ME  WB
instruction i+3                IF  ID  EX  ME  WB
instruction i+4                    IF  ID  EX  ME  WB

IF = instruction fetch, ID = instruction decode,
EX = execute, ME = memory reference,
WB = write back
The pipeline in the example takes 9 clock cycles
to complete 5 instructions. An un-pipelined
processor would take 5 x 5 = 25 cycles (5
instructions at 5 one-cycle stages each).
Pipelining is an example of instruction level
parallelism (ILP).
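The cycle counts for an ideal pipeline can be checked with a short calculation. The sketch below assumes every stage takes exactly one clock cycle and that there are no stalls; the function names are hypothetical. Under those assumptions a pipeline of k stages finishes n instructions in k + (n - 1) cycles, while an un-pipelined processor needs n x k.

```c
/* Cycles for n_instr instructions on an ideal n_stages-deep pipeline:
 * the first instruction takes n_stages cycles, and each later one
 * completes one cycle after its predecessor. No stalls are modelled. */
long pipelined_cycles(long n_instr, long n_stages)
{
    return (n_instr > 0) ? n_stages + (n_instr - 1) : 0;
}

/* Cycles when each instruction must finish all stages before the next starts. */
long unpipelined_cycles(long n_instr, long n_stages)
{
    return n_instr * n_stages;
}
```

With 5 instructions and the 5 stages IF, ID, EX, ME, WB, `pipelined_cycles(5, 5)` is 9 and `unpipelined_cycles(5, 5)` is 25. For long loops the ratio approaches the pipeline depth.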
21
Hazards to pipelining
  • Data hazards: data for instruction i+1 depends
    on data produced by instruction i
  • Control hazards: pipeline contains a conditional
    branch
  • Most processors limit branching penalties by a
    variety of techniques: branch prediction,
    predicted-not-taken, etc.
  • Lessons for programmer:
      • Isolate recursion formulae from other work,
        since they will interrupt pipelining of other
        instructions.
      • Avoid conditional branches in pipelined code
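The advice to isolate recursion formulae can be illustrated with a small C sketch. The function names and the particular arithmetic are assumptions chosen for illustration. In `fused`, the recurrence on `x` (a data hazard: each iteration needs the previous result) shares a loop with work on `y` that has no such dependence. In `split`, the recurrence gets its own loop, leaving a second loop whose iterations are fully independent.

```c
#include <stddef.h>

/* Fused: the recurrence on x[] shares a loop with independent work on y[]. */
void fused(double *x, double *y, const double *z, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        x[i] = x[i - 1] + z[i];   /* needs the result of the previous iteration */
        y[i] = 2.0 * z[i];        /* independent of all other iterations */
    }
}

/* Split: the recurrence is isolated, so the second loop has no data hazard. */
void split(double *x, double *y, const double *z, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] = x[i - 1] + z[i];
    for (size_t i = 1; i < n; i++)
        y[i] = 2.0 * z[i];
}
```

Both versions compute the same values. Whether the split form is actually faster depends on the processor and compiler, so it is worth measuring rather than assuming.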

22
The role of compilers in CA
  • Today, most code is written in one of a small
    number of languages. Hardware is now designed to
    optimize instructions produced by compilers of
    those languages.
  • Compilers can:
      • Integrate procedures into calling code
      • Eliminate common sub-expressions (do algebra!)
      • Eliminate unnecessary temporary variables
        (reduces loads/stores)
      • Change order of instructions (e.g. move code
        outside loop)
      • Pipeline
      • Optimize register allocation
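Two of the transformations listed above, common sub-expression elimination and moving code outside a loop, can be shown by hand in C. This is only a sketch of the kind of rewrite an optimizer may perform; the function names are hypothetical, and what a given compiler actually does depends on the compiler and its flags.

```c
#include <stddef.h>

/* As a programmer might write it: a*b is recomputed twice per iteration. */
void scale_naive(double *out, const double *in, size_t n, double a, double b)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (a * b) * in[i] + (a * b);
}

/* Hand-optimized equivalent: the common sub-expression a*b is computed
 * once, outside the loop, since it does not change between iterations. */
void scale_hoisted(double *out, const double *in, size_t n, double a, double b)
{
    const double ab = a * b;
    for (size_t i = 0; i < n; i++)
        out[i] = ab * in[i] + ab;
}
```

Writing the hoisted form by hand is rarely necessary with optimization enabled, but knowing what the compiler looks for helps when reading generated code or when a loop resists optimization.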

23
2. Memory
Stores both data and instructions. Organized
into bits, bytes, words, and double words. Size
of words is now standardized (32 bits). Byte
order is still not standardized. Two
possibilities:
Little endian: stores leading byte last (little
end), 76543210 (e.g. Intel)
Big endian: stores leading byte first (big
end), 01234567 (e.g. Sparc, PowerPC)
Data transferred between architectures with
different byte order (e.g. for visualization)
must be byte-swapped.
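A minimal C sketch of the two operations this implies: detecting the byte order of the machine the code runs on, and byte-swapping a 32-bit word. The function names `is_little_endian` and `swap32` are assumptions for this example.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 on a big-endian one,
 * by looking at which byte of the integer 1 is stored first. */
int is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Reverses the byte order of a 32-bit word. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}
```

For example, `swap32(0x01234567)` yields `0x67452301`. Applying the swap twice returns the original value, so the same routine converts in either direction.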
24
Memory design
Dynamic Random Access Memory (DRAM): bits stored
in a 2D array, accessed by rows and columns
(reduces number of address pins). Typical access
time: 100 ns.
DRAM must be refreshed; typically 5% of reads
have to wait for a refresh to finish. Reading
destroys data in DRAM, so it must be re-written
after a read. Both introduce latency.
DRAM comes on Dual Inline Memory Modules
(DIMMs). Since 1998, memory on DIMMs doubles
every 2 yrs. This is slower than Moore's Law and
is leading to a memory/processor performance
mismatch.
Synchronous DRAM (SDRAM): contains a clock that
synchronizes with the CPU to increase memory
bandwidth. Memory bus operates at 100-150 MHz,
usually with 8-byte (64-bit) wide channels, which
means 800-1200 MB/s.
Double Data Rate (DDR) SDRAM now available: two
transfers (2 bits per data line) each clock cycle.
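The bandwidth figures follow from a simple product: clock rate times bus width times transfers per clock. The sketch below assumes an 8-byte (64-bit) wide data bus, the usual width of an SDRAM DIMM, and reports peak (not sustained) bandwidth; the function name is hypothetical.

```c
/* Peak memory bandwidth in MB/s (10^6 bytes per second).
 *   clock_mhz           : bus clock in MHz
 *   bus_bytes           : width of the data bus in bytes (8 assumed for a DIMM)
 *   transfers_per_clock : 1 for SDRAM, 2 for DDR SDRAM
 */
double peak_bandwidth_mb_s(double clock_mhz, int bus_bytes, int transfers_per_clock)
{
    return clock_mhz * bus_bytes * transfers_per_clock;
}
```

`peak_bandwidth_mb_s(100.0, 8, 1)` gives 800 MB/s and `peak_bandwidth_mb_s(150.0, 8, 1)` gives 1200 MB/s, matching the range quoted for SDRAM; the same 100 MHz bus with DDR gives 1600 MB/s.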
25
Hierarchical Memory
  • Ideally, entire memory system would be built
    using the fastest chips possible. Not practical.
  • Instead, exploit the principle of locality. Most
    programs access adjacent memory locations
    sequentially:
      • location M at time t -> location M+1 at time
        t+1.
  • Design solution: hierarchical memory
      • Memory closest to processor uses fastest chips
        (cache)
      • Main memory built from DDR SDRAM
      • Additional memory can be built from disks
        (virtual memory)
  • Usually cache is subdivided into several levels
    (L1, L2, L3).
  • Data is transferred between levels in blocks
      • Between cache and main memory: cache line
      • Between main and virtual memory: page

26
How does hierarchical memory work?
If the processor needs the item at address A, but
it is not in cache, the memory system moves a
cache line containing A (and A+1, A+2, etc.) from
main memory into cache. Then, if the processor
needs A+1 on the next cycle, it is already in
cache.
If the memory location needed by the processor is
in cache: cache hit.
If the memory location needed by the processor is
not in cache: cache miss.
The fraction of requests which are hits is the
hit rate.
Goal of CA: design cache to maximize hit rate of
a typical program by optimizing cache size, cache
line size, and using prefetch and non-blocking
cache.
Goal of programmer: write code to maximize hit
rate.
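A toy simulation can make the hit rate concrete. The sketch below models a direct-mapped cache, which is one simple organization chosen here for illustration and not something the slides specify. The line size (8 words), the number of lines (64) and the function name `hit_rate` are all assumptions.

```c
#include <stddef.h>

#define LINE_WORDS 8    /* words per cache line (assumed) */
#define NUM_LINES  64   /* number of lines in the cache (assumed) */

/* Simulates n_accesses word reads at addresses 0, stride, 2*stride, ...
 * through a direct-mapped cache and returns the fraction that hit. */
double hit_rate(size_t n_accesses, size_t stride)
{
    long tags[NUM_LINES];
    size_t hits = 0;

    for (size_t k = 0; k < NUM_LINES; k++)
        tags[k] = -1;                       /* cache starts empty */

    for (size_t k = 0; k < n_accesses; k++) {
        size_t addr  = k * stride;
        size_t block = addr / LINE_WORDS;   /* which cache line of memory */
        size_t slot  = block % NUM_LINES;   /* where that line must live */
        if (tags[slot] == (long)block)
            hits++;                         /* line already resident: hit */
        else
            tags[slot] = (long)block;       /* miss: load the whole line */
    }
    return n_accesses ? (double)hits / (double)n_accesses : 0.0;
}
```

With these parameters, sequential access (`stride` 1) misses once per line and hits on the other seven words, a hit rate of 0.875. A stride of 8 touches a new line on every access, so the hit rate drops to 0.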
27
Effective access time for hierarchical memory is
t_eff = H t_cache + (1 - H) t_main
t_eff = effective access time
t_cache = access time of cache
t_main = access time of main memory
H = hit rate
Suppose t_cache = 10 ns, t_main = 100 ns, H = 98%:
t_eff = (0.98)(10) + (1 - 0.98)(100) = 11.8 ns
Almost as fast as cache!
In reality, t_main can vary greatly depending on
latency, OS, etc.
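The effective access time is a weighted average, t_eff = H t_cache + (1 - H) t_main, and is easy to evaluate for other hit rates. A minimal C sketch follows; the function name is an assumption.

```c
/* Effective access time of a two-level memory hierarchy.
 *   hit_rate : fraction of requests served from cache (0..1)
 *   t_cache  : access time of the cache
 *   t_main   : access time of main memory (same units as t_cache)
 */
double effective_access_time(double hit_rate, double t_cache, double t_main)
{
    return hit_rate * t_cache + (1.0 - hit_rate) * t_main;
}
```

With a 10 ns cache and 100 ns main memory, a 98% hit rate gives 11.8 ns, while a 90% hit rate already gives 19 ns, nearly twice the cache access time. Small losses in hit rate are expensive.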
28
Write code that maximizes cache hits.
For example, always access data contiguously.
Order loops so the inner loop is over neighboring
data elements. Avoid stride not equal to one.

    /* The operator between b and c was lost in
       this transcript; a product is assumed. */
    for (i=0; i<100; i++)
      for (j=0; j<100; j++)      /* BAD */
        a[j][i] = b[j][i]*c[j][i];

    for (i=0; i<100; i++)
      for (j=0; j<100; j++)      /* GOOD */
        a[i][j] = b[i][j]*c[i][j];

Note: exactly the OPPOSITE ordering is necessary
in FORTRAN.
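A self-contained version of the two loop orderings is sketched below so they can be compiled and timed. The array size `NSIZE`, the function names, and the use of an element-wise product are assumptions made for this example. C stores arrays in row-major order, so the last index is the one that is contiguous in memory.

```c
#define NSIZE 100   /* array dimension (assumed, matching the 100 in the loops) */

/* GOOD: inner loop runs over the last index, so memory is read with stride 1. */
void multiply_good(double a[NSIZE][NSIZE], double b[NSIZE][NSIZE],
                   double c[NSIZE][NSIZE])
{
    for (int i = 0; i < NSIZE; i++)
        for (int j = 0; j < NSIZE; j++)
            a[i][j] = b[i][j] * c[i][j];
}

/* BAD: inner loop runs over the first index, so successive accesses
 * are NSIZE elements apart in memory. */
void multiply_bad(double a[NSIZE][NSIZE], double b[NSIZE][NSIZE],
                  double c[NSIZE][NSIZE])
{
    for (int i = 0; i < NSIZE; i++)
        for (int j = 0; j < NSIZE; j++)
            a[j][i] = b[j][i] * c[j][i];
}
```

Both functions produce identical results; only the order in which memory is touched differs. At this small size the arrays may fit entirely in cache, so the timing difference is likely to show up clearly only for much larger arrays.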
29
The role of compilers in memory organization
  • Compilers organize memory into:
      • stack - local variables
      • heap - dynamic variables addressed by pointers
      • global data space - statically declared global
        variables/constants
  • Register allocation by compiler impossible for
    heap.
  • If there are multiple ways to reference the
    address of a variable, they are said to be
    aliased, and the variable cannot be allocated to
    a register.
      • p = &a;     /* gets address of a */
      • a = 2;      /* assigns to a directly */
      • *p = 1;     /* uses p to assign to a */
      • c = a + b;  /* accesses a */
  • Cannot allocate a to register.
  • Moral for programmer: use pointer references
    sparingly!
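The aliasing sequence can be written out as a complete C function. This is a sketch: the function name `alias_demo` and the use of a sum in the last statement are assumptions. It shows why a value of `a` held in a register before the store through `p` would be stale.

```c
/* Returns a + b after a has been modified through the alias p. */
int alias_demo(int b)
{
    int a;
    int *p = &a;    /* p gets the address of a: a is now aliased */
    a = 2;          /* assigns to a directly */
    *p = 1;         /* assigns to a through the pointer */
    return a + b;   /* a is 1 here, not 2: a copy of a kept in a register
                       before the store through p would give the wrong sum */
}
```

`alias_demo(3)` returns 4. In a function this small a modern compiler can usually prove what `p` points to and optimize anyway; the problem arises when pointers arrive as function arguments or come from the heap, where the compiler generally has to assume the worst.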

30
Interleaved memory
To increase bandwidth to memory, and reduce the
effect of latency, can divide memory into N
banks. Distribute data across banks with
successive addresses in successive banks.

    bank 0   bank 1   ...   bank N-1
      1        2      ...     N
     N+1      N+2     ...     2N

Provided data is accessed with stride not equal
to N (or a multiple of N), successive memory
references are to different banks.
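The mapping from address to bank is just the address modulo the number of banks. The sketch below uses 0-based addresses for simplicity (the table on the slide counts from 1); the function names and the `MAX_BANKS` limit are assumptions.

```c
#include <stddef.h>

#define MAX_BANKS 64   /* upper limit on banks handled by this sketch (assumed) */

/* Bank holding a given 0-based address; n_banks must be greater than 0. */
size_t bank_of(size_t address, size_t n_banks)
{
    return address % n_banks;
}

/* Number of distinct banks touched by n_refs references at addresses
 * 0, stride, 2*stride, ...  Returns 0 if n_banks is out of range. */
size_t banks_touched(size_t n_refs, size_t stride, size_t n_banks)
{
    int seen[MAX_BANKS] = {0};
    size_t count = 0;

    if (n_banks == 0 || n_banks > MAX_BANKS)
        return 0;

    for (size_t k = 0; k < n_refs; k++) {
        size_t bank = bank_of(k * stride, n_banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    return count;
}
```

With 8 banks, 8 references at stride 1 touch all 8 banks, while stride 8 sends every reference to bank 0. Stride 4 touches only 2 banks, which shows that any stride sharing a common factor with N reduces the number of banks in use, not just a stride of exactly N.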