Lecture 1 Overview of Computer Architecture

Transcript and Presenter's Notes
1
Lecture 1: Overview of Computer Architecture

CSCE 513 Computer Architecture
  • Topics
  • Overview
  • Readings: Chapter 1

August 18, 2011
2
Course Pragmatics
  • Syllabus
  • Instructor: Manton Matthews
  • Teaching Assistant: Mr. Bud (Jet) Cut
  • Website: http://www.cse.sc.edu/matthews/Courses/513/index.html
  • Text
  • Computer Architecture: A Quantitative Approach,
    4th ed., John L. Hennessy and David A.
    Patterson, Morgan Kaufmann, 2006
  • Important Dates
  • Academic Integrity

3
Overview
  • New
  • Syllabus
  • What you should know!
  • What you will learn (Course Overview)
  • Instruction Set Design
  • Pipelining (Appendix A)
  • Instruction level parallelism
  • Memory Hierarchy
  • Multiprocessors
  • Why you should learn this

4
What is Computer Architecture?
  • Computer architecture comprises those aspects of
    the instruction set available to programmers,
    independent of the hardware on which the
    instruction set is implemented.
  • The term "computer architecture" was first used in
    1964 by Gene Amdahl, Gerrit A. Blaauw, and
    Frederick P. Brooks, Jr., the designers of the IBM
    System/360.
  • The IBM System/360 was a family of computers, all
    with the same architecture but with a variety of
    organizations (implementations).

5
What you should know
  • http://en.wikipedia.org/wiki/Intel_4004 (1971)
  • Steps in Execution
  • Load Instruction
  • Decode
  • Execute
  • Memory access
  • Write back

6
Crossroads: Conventional Wisdom in Comp. Arch
  • Old Conventional Wisdom: Power is free,
    transistors expensive
  • New Conventional Wisdom: "Power wall": power is
    expensive, transistors free (can put more on a chip
    than you can afford to turn on)
  • Old CW: Sufficiently increase Instruction Level
    Parallelism via compilers and innovation
    (out-of-order, speculation, VLIW, ...)
  • New CW: "ILP wall": law of diminishing returns on
    more HW for ILP
  • Old CW: Multiplies are slow, memory access is
    fast
  • New CW: "Memory wall": memory slow, multiplies
    fast (200 clock cycles to DRAM memory, 4 clocks
    for a multiply)
  • Old CW: Uniprocessor performance 2X / 1.5 yrs
  • New CW: Power Wall + ILP Wall + Memory Wall =
    Brick Wall
  • Uniprocessor performance now 2X / 5(?) yrs
  • ⇒ Sea change in chip design: multiple cores
    (2X processors per chip / 2 years)
  • More, simpler processors are more power efficient

7
Computer Arch.: A Quantitative Approach
  • Hennessy and Patterson
  • Patterson: UC Berkeley
  • Hennessy: Stanford
  • Preface: Bill Joy of Sun Microsystems
  • Evolution of editions
  • Almost universally used for graduate courses in
    architecture
  • Pipelines moved to Appendix A??
  • Path through: Chapter 1 ⇒ Appendix A ⇒ Chapter 2

8
CAQA (Hennessy & Patterson) Chapter 1, Figure 1.1
9
Trends in Microprocessor Performance
10
Memory Cost Trends
11
Moore's Law
  • Gordon Moore, one of the founders of Intel
  • In 1965 he predicted the doubling of the number
    of transistors per chip every couple of years
    for the next ten years
  • http://www.intel.com/research/silicon/mooreslaw.htm
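The doubling rule above can be sketched as simple compound growth. The baseline below reuses the 4004's 2312 transistors quoted later in these slides; the exact two-year doubling period is an illustrative assumption, not a law:

```python
# Sketch of Moore's-law growth: one doubling per period.
# Baseline: Intel 4004 (1971), 2312 transistors (per these slides).

def transistors(year, base_year=1971, base_count=2312, doubling_years=2):
    """Project a transistor count by compounding doublings."""
    periods = (year - base_year) / doubling_years
    return base_count * 2 ** periods

print(round(transistors(1973)))  # one doubling: 4624
```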

12
Sea Change in Chip Design
  • Intel 4004 (1971): 4-bit processor, 2312
    transistors, 0.4 MHz, 10 micron PMOS, 11 mm²
    chip
  • RISC II (1983): 32-bit, 5-stage pipeline, 40,760
    transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
  • A 125 mm² chip in 0.065 micron CMOS = 2312 RISC
    IIs + FPU + Icache + Dcache
  • RISC II shrinks to 0.02 mm² at 65 nm
  • Caches via DRAM or 1-transistor SRAM
    (www.t-ram.com)?
  • Proximity Communication via capacitive coupling
    at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
  • Processor is the new transistor?

13
ISA Example: MIPS / IA-32
14
Main Memory
  • DRAM (dynamic RAM): one transistor/capacitor per
    bit
  • SRAM (static RAM): four to six transistors per bit
  • DRAM density increases approx. 50% per year
  • DRAM cycle time decreases slowly (DRAMs have
    destructive read-out, like old core memories, and
    the data row must be rewritten after each read)
  • DRAM must be refreshed every 2-8 ms
  • Memory bandwidth improves at about twice the rate
    that cycle time does, due to improvements in
    signaling conventions and bus width

15
Price of Pentiums
16
Pentium IV
17
The world's fastest¹, smartest PC CPU
  • Intel Core i7-980X processor Extreme Edition
  • The Intel Core i7 processor Extreme Edition is
    the perfect engine for power users who demand
    unparalleled performance and unlimited digital
    creativity. Experience Intel's fastest¹, smartest
    PC processor. You'll get maximum PC power for
    whatever you do, thanks to the combination of
    smart features like Intel Turbo Boost
    Technology³ and Intel Hyper-Threading
    Technology, which together activate full
    processing power exactly where and when you need
    it.
  • With 6 physical and 12 logical cores, 12MB Intel
    Smart Cache (L3 cache), 32 nm, second generation
    Hi-K metal gate process processor core, it's no
    surprise the Intel Core i7 processor Extreme
    Edition is the world's fastest¹, smartest PC
    processor.

18
(No Transcript)
19
IC Wafer: 117 AMD Opterons (Fig 1.12)
20
Cost of ICs
  • Cost of IC = (Cost of die + cost of testing die +
    cost of packaging and final test) / (Final test
    yield)
  • Cost of die = Cost of wafer / (Dies per wafer ×
    die yield)
  • Dies per wafer is the wafer area divided by the
    die area, less the dies along the edge:
  • Dies per wafer = (wafer area) / (die area) - (wafer
    circumference) / (die diagonal)
  • Die yield = (Wafer yield) × ( 1 + (defects per
    unit area × die area) / alpha ) ^ (-alpha)
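The formulas above translate directly into a short Python sketch. The sample values used below (30 cm wafer, 1 cm² die, 0.4 defects/cm², alpha = 4, $5000 wafer) are illustrative assumptions, not numbers from the slides:

```python
import math

# Sketch of the slides' IC cost model.

def dies_per_wafer(wafer_diameter, die_area):
    """Wafer area over die area, minus edge loss (slide's approximation)."""
    wafer_area = math.pi * (wafer_diameter / 2) ** 2
    wafer_circumference = math.pi * wafer_diameter
    die_diagonal = math.sqrt(2 * die_area)
    return wafer_area / die_area - wafer_circumference / die_diagonal

def die_yield(wafer_yield, defects_per_area, die_area, alpha=4.0):
    """Die yield = wafer yield * (1 + defect density * area / alpha)^-alpha."""
    return wafer_yield * (1 + defects_per_area * die_area / alpha) ** (-alpha)

def cost_of_die(wafer_cost, wafer_diameter, die_area,
                wafer_yield=1.0, defects_per_area=0.4, alpha=4.0):
    good_dies = (dies_per_wafer(wafer_diameter, die_area)
                 * die_yield(wafer_yield, defects_per_area, die_area, alpha))
    return wafer_cost / good_dies

print(cost_of_die(5000, 30, 1))  # cost per good 1 cm^2 die
```

Note how yield falls off as a power law in die area: doubling the die area more than doubles the cost per good die.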

21
Case Study on Design
  • "Intel muted ambitious Pentium 4 design," Anthony
    Cataldo, EE Times, Dec. 14, 2000.
  • Willamette shipped at 217 mm2 at 0.18 micron
    feature size (217 mm2 was size of Pentium Pro)
  • had to reduce L1 data cache to 8 KB (cmp. to
    Athlon 128 KB)
  • had to bit compress the trace cache (no L1
    instruction cache)
  • had to omit an extra floating-point unit ("The
    upshot a was five per cent hit on performance,
    but the floating point real estate was squeezed
    to less than half its former size." Darrell
    Boggs)
  • due to expense had to omit a 1 MB L3 cache, which
    would have been on another chip but packaged with
    the processor in a cartridge

22
Markets for Processors
  • desktop (personal computer and workstation) --
    price/performance
  • server -- provide high availability, good
    scalability, and maximum throughput (transactions
    per minute, web pages served per second, or file
    transfer measures)
  • embedded systems -- minimize price, memory size,
    and power

23
Component Costs for a $1000 PC
24
Performance Measures
  • Response time (latency) -- time between start and
    completion
  • Throughput (bandwidth) -- rate -- work done per
    unit time
  • Speedup -- B is n times faster than A
  • means n = exec_time_A / exec_time_B = rate_B / rate_A
  • Other important measures
  • power (impacts battery life, cooling, packaging)
  • RAS (reliability, availability, and
    serviceability)
  • scalability (ability to scale up processors,
    memories, and I/O)

25
Measuring Performance
  • Time is the measure of computer performance
  • Elapsed time = program execution + I/O wait --
    important to the user
  • Execution time = user time + system time (but OS
    self-measurement may be inaccurate)
  • CPU performance = user time on an unloaded system --
    important to the architect
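The elapsed-versus-CPU-time distinction above can be observed with Python's two standard-library clocks: `time.perf_counter` measures wall-clock (elapsed) time, while `time.process_time` measures CPU time for this process. A sleep stands in for I/O wait, so it shows up only in the elapsed figure. A minimal sketch:

```python
import time

# Elapsed (wall) time vs. CPU time: a sleep adds wall time
# but consumes almost no CPU time, like waiting on I/O.

def measure(fn):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

wall, cpu = measure(lambda: time.sleep(0.2))
print(f"elapsed {wall:.2f}s, CPU {cpu:.2f}s")  # CPU time stays near zero
```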

26
Real Performance
  • Benchmark suites
  • Performance is the result of executing a workload
    on a configuration
  • Workload = program + input
  • Configuration = CPU + cache + memory + I/O + OS +
    compiler + optimizations
  • compiler optimizations can make a huge
    difference!

27
Benchmark Suites
  • Whetstone (1976) -- designed to simulate
    arithmetic-intensive scientific programs.
  • Dhrystone (1984) -- designed to simulate systems
    programming applications. Structure, pointer, and
    string operations are based on observed
    frequencies, as well as types of operand access
    (global, local, parameter, and constant).
  • PC benchmarks aimed at simulating real
    environments
  • Business Winstone: Navigator + office apps
  • CC Winstone: content-creation apps
  • Winbench

28
Comparing Performance
  • Total execution time (implies an equal mix in the
    workload)
  • Just add up the times
  • Arithmetic average of execution time
  • To get a more accurate picture, compute the average
    of several runs of a program
  • Weighted execution time (weighted arithmetic
    mean)
  • Program P1 makes up 25% of the workload
    (estimated), P2 75%; then use the weighted average
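The weighted arithmetic mean above is a one-liner; the per-program times in the example call are hypothetical, while the 0.25/0.75 weights are the workload fractions from the slide:

```python
# Weighted arithmetic mean of execution times: each program's time
# is weighted by its estimated fraction of the workload.

def weighted_mean(times, weights):
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 1
    return sum(t * w for t, w in zip(times, weights))

# Hypothetical times: P1 takes 10 s (25% of workload), P2 takes 2 s (75%).
print(weighted_mean([10.0, 2.0], [0.25, 0.75]))  # 0.25*10 + 0.75*2 = 4.0
```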

29
Comparing Performance cont.
  • Normalized execution time or speedup (normalize
    relative to a reference machine and take the
    average)
  • SPEC benchmarks (base time: a SPARCstation)
  • Arithmetic mean is sensitive to the reference
    machine choice
  • Geometric mean is consistent but cannot predict
    execution time
  • Nth root of the product of execution-time ratios
  • Combining samples

30
(No Transcript)
31
Improve Performance by
  • changing the
  • algorithm
  • data structures
  • programming language
  • compiler
  • compiler optimization flags
  • OS parameters
  • improving locality of memory or I/O accesses
  • overlapping I/O
  • on multiprocessors, you can improve performance
    by avoiding cache coherency problems (e.g., false
    sharing) and synchronization problems

32
Amdahl's Law
  • Speedup =
  • (performance of entire task using the
    enhancement)
  • / (performance of entire task not using the
    enhancement)
  • Alternatively,
  • Speedup =
  • (execution time without enhancement) /
    (execution time with enhancement)

33
Performance Measures
  • Response time (latency) -- time between start and
    completion
  • Throughput (bandwidth) -- rate -- work done per
    unit time
  • Speedup =
  • (execution time without enhancement) / (execution
    time with enhancement)
  • = (time w/o enhancement) / (time with enhancement)
  • Processor speed, e.g. 1 GHz
  • When does it matter?
  • When does it not?

34
MIPS and MFLOPS
  • MIPS (millions of instructions per second)
  • = (instruction count) / (execution time × 10^6)
  • Problem 1: depends on the instruction set (ISA)
  • Problem 2: varies with different programs on the
    same machine
  • MFLOPS (mega-FLOPS, where a FLOP is a floating
    point operation)
  • = (floating point operation count) / (execution
    time × 10^6)
  • Problem 1: depends on the instruction set (ISA)
  • Problem 2: varies with different programs on the
    same machine
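Both definitions above reduce to one division; the instruction counts and execution time below are made-up illustrations, not measurements from the slides:

```python
# MIPS and MFLOPS as defined on the slide.

def mips(instruction_count, exec_time_s):
    return instruction_count / (exec_time_s * 1e6)

def mflops(fp_op_count, exec_time_s):
    return fp_op_count / (exec_time_s * 1e6)

# Hypothetical run: 2 billion instructions, 800 million FP ops, 4 seconds.
print(mips(2_000_000_000, 4.0))   # 500.0 MIPS
print(mflops(800_000_000, 4.0))   # 200.0 MFLOPS
```

Note the slide's caveat in action: the same MIPS figure can come from very different instruction sets, so it says nothing about relative work done per instruction.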

35
Comparing Performance (fig 1.15)
Comparing two programs executing on three
machines (execution times in seconds):

            Computer A   Computer B   Computer C
Program 1        1           10           20
Program 2     1000          100           20

"Faster than" relationships: A is 10 times
faster than B on program 1; B is 10 times
faster than A on program 2; C is 50 times
faster than A on program 2; in all, 3 × 2 = 6
such comparisons ((3 choose 2) pairs of computers
× 2 programs). So what is the relative performance
of these machines???
36
fig 1.15 Total Execution Times
Comparing two programs executing on three
machines.
So now what is the relative performance of
these machines??? Total time A = 1 + 1000 = 1001;
total time B = 10 + 100 = 110; B is 1001/110 = 9.1
times as fast as A (arithmetic mean of execution
time).
37
Weighted Execution Times fig 1.15
Now assume that we know that P1 will run 90%, and
P2 10%, of the time. So now what is the relative
performance of these machines? time_A = .9 × 1 +
.1 × 1000 = 100.9; time_B = .9 × 10 + .1 × 100 =
19. Relative performance A to B = 100.9/19 = 5.31
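The arithmetic can be checked in two lines, using the per-program times behind the slide's numbers (A: 1 s and 1000 s; B: 10 s and 100 s) and the 90%/10% weights:

```python
# Weighted execution times with a 90%/10% program mix.
time_a = 0.9 * 1 + 0.1 * 1000   # 100.9
time_b = 0.9 * 10 + 0.1 * 100   # 19.0
print(time_a, time_b, time_a / time_b)  # A is ~5.31x slower than B
```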
38
Geometric Means
  • Compare ratios of performance to a standard
  • Using A as the standard:
  • program 1: B ratio = 10/1 = 10; C ratio =
    20/1 = 20
  • program 2: B ratio = 100/1000 = .1; C ratio =
    20/1000 = .02
  • B is twice as fast as C using A as the standard
  • Using B as the standard:
  • program 1: A ratio = 1/10 = .1; C ratio = 20/10 = 2
  • program 2: A ratio = 1000/100 = 10; C ratio =
    20/100 = .2
  • So now compare the A and B ratios to each other:
    you get the same 10 and .1, so the comparison
    comes out the same whichever machine is the
    standard.
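The normalization above can be checked with a short geometric-mean helper (the Nth root of the product of ratios, as defined earlier in these slides); the ratios passed in are the ones computed with A as the standard:

```python
import math

# Geometric mean of execution-time ratios relative to a reference machine.

def geomean(ratios):
    return math.prod(ratios) ** (1 / len(ratios))

# Ratios to machine A: B took 10x and 0.1x A's time; C took 20x and 0.02x.
print(geomean([10, 0.1]))    # 1.0: by this measure B matches A overall
print(geomean([20, 0.02]))   # ~0.63
```

This illustrates the slide's point: the geometric mean ranks machines consistently regardless of which one is chosen as the standard, but the resulting number does not predict any actual execution time.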

39
Geometric Means fig 1.17
  • Measure performance ratios to a standard machine

40
Amdahl's Law revisited
  • Speedup =
  • (execution time without enhancement) / (execution
    time with enhancement)
  • = (time without) / (time with) = T_w/o / T_with
  • Notes
  • The enhancement will be used only a portion of
    the time.
  • If it will be rarely used, then why bother trying
    to improve it?
  • Focus on the improvements that have the highest
    fraction of use time, denoted Fraction_enhanced.
  • Note: Fraction_enhanced is always less than 1.
  • Then

41
Amdahl's Law with Fractional Use Factor
  • ExecTime_new =
  • ExecTime_old × ( (1 - Frac_enhanced) +
    (Frac_enhanced) / (Speedup_enhanced) )
  • Speedup_overall = (ExecTime_old) / (ExecTime_new)
  • = 1 / ( (1 - Frac_enhanced) + (Frac_enhanced) /
    (Speedup_enhanced) )

42
Amdahl's Law with Fractional Use Factor
  • Example: Suppose we are considering an
    enhancement to a web server. The enhanced CPU is
    10 times faster on computation but the same speed
    on I/O. Suppose also that 60% of the time is spent
    waiting on I/O.
  • Frac_enhanced = .4
  • Speedup_enhanced = 10
  • Speedup_overall =
  • 1 / ( (1 - Frac_enhanced) + (Frac_enhanced) /
    (Speedup_enhanced) ) = 1 / (.6 + .04) = 1.5625
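The web-server example works out in one line with the fractional-use form of Amdahl's Law: 40% of the time is sped up 10x, and the 60% I/O wait is untouched:

```python
# Amdahl's Law with a fractional use factor.

def amdahl_speedup(frac_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - frac_enhanced) + frac_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))  # 1/(0.6 + 0.04) = 1.5625
```

Even with a 10x faster CPU, overall speedup is only about 1.56x, which is the slide's point about focusing on the fraction of time an enhancement actually gets used.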

43
Graphics: Square Root Enhancement (p. 42)
44
CPU Performance Equation
  • Almost all computers use a clock running at a
    fixed rate.
  • Clock rate, e.g. 1 GHz (clock cycle time =
    1 / clock rate)
  • CPUtime = CPUclockCyclesForProgram ×
    ClockCycleTime
  • = CPUclockCyclesForProgram / ClockRate
  • Instruction Count (IC)
  • CPI = CPUclockCyclesForProgram / InstructionCount
  • CPUtime = IC × ClockCycleTime ×
    CyclesPerInstruction

45
CPU Performance Equation
  • CPUtime = IC × ClockCycleTime ×
    CyclesPerInstruction
  • With per-class instruction counts and CPIs:
  • CPUtime = ( Σ over i of IC_i × CPI_i ) ×
    ClockCycleTime
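The per-class form of the equation can be sketched as below; the instruction mix and CPI values are invented for illustration, not taken from the slides:

```python
# CPU performance equation with per-class CPIs:
# CPUtime = (sum of IC_i * CPI_i) / clock rate.

def cpu_time(inst_counts, cpis, clock_rate_hz):
    total_cycles = sum(ic * cpi for ic, cpi in zip(inst_counts, cpis))
    return total_cycles / clock_rate_hz

# Hypothetical mix at 1 GHz: 1e9 ALU ops (CPI 1), 3e8 loads (CPI 2),
# 1e8 branches (CPI 3) -> 1.9e9 cycles -> 1.9 seconds.
print(cpu_time([1e9, 3e8, 1e8], [1, 2, 3], 1e9))  # 1.9
```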

46
Principle of Locality
  • Rule of thumb:
  • A program spends 90% of its execution time in
    only 10% of the code.
  • So what do you try to optimize?
  • Locality of memory references
  • Temporal locality: recently accessed items tend
    to be accessed again soon
  • Spatial locality: items near a recent access tend
    to be accessed soon
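A toy cache model (an illustration, not from the slides) shows why caches exploit the locality described above: a sequential sweep touches each 16-byte block once and then hits on the other 15 addresses in it:

```python
# Toy direct-mapped cache: spatial locality turns a sequential
# address sweep into mostly hits.

class DirectMappedCache:
    def __init__(self, num_lines=64, block_size=16):
        self.num_lines, self.block_size = num_lines, block_size
        self.tags = [None] * num_lines   # block number cached in each line
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // self.block_size
        index = block % self.num_lines
        if self.tags[index] == block:
            self.hits += 1
        else:                            # miss: fill the line
            self.tags[index] = block
            self.misses += 1

cache = DirectMappedCache()
for addr in range(256):          # sequential sweep over 16 blocks
    cache.access(addr)
print(cache.hits, cache.misses)  # 240 16
```

Each of the 16 blocks misses once and then hits 15 times; re-running the same sweep would be all hits (temporal locality), since all 16 blocks stay resident.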

47
Taking Advantage of Parallelism
  • Logic parallelism: carry-lookahead adder
  • Word parallelism: SIMD
  • Instruction pipelining: overlap fetch and
    execute
  • Multithreading: executing independent instructions
    at the same time
  • Speculative execution

48
Hardware Description Languages
  • ABEL
  • Verilog
  • VHDL (VHSIC Hardware Description Language)

49
VHDL Specifications
  • VHDL specifications
  • Entity declaration: interface (inputs/outputs)
  • Architecture definition: specifies the internal
    operation
  • Approaches to specifying architecture
  • Structural: specification connects components
  • Dataflow: design elements specify the flow of data
  • Behavioral: design elements program the
    behavior

50
Homework Set 1
  • 1.2
  • 1.7
  • 1.10
  • 1.14