IBM 650, perhaps the first

About This Presentation


Slides: 36

Provided by: Montek5

Learn more at: http://www.cs.unc.edu


Transcript and Presenter's Notes


1
IBM 650, perhaps the first mass-produced
computer. IBM announced it in 1953 and expected
to sell only 50 units or so. By the time it was
discontinued in 1962, almost 2000 had been
sold. The 650 first used a magnetic drum memory
(top right), which stored 10,000 or 20,000
digits (1,000 or 2,000 words). Later a disk was
introduced (right) with 50 platters and a
capacity of 6 million digits.
2
COMP 740: Computer Architecture and Implementation

Montek Singh
Thu, Mar 19, 2009
Topic: Case Studies
(Pentium IV and Itanium)

3
Pentium 4 - Look at features
4
Pentium 4 - Fetch
Up to 3 IA-32 instructions/clk
5
Pentium 4 - Fetch
Can generate up to 6 uops/clk
6
Pentium 4 - uops
Trace cache, temporal coherence instead of spatial
7
Pentium 4 - 2 BTBs
Another BTB for uops
8
Pentium 4 - Out of Order
Register renaming instead of ROB, 128 total regs
for renaming
9
Pentium 4 - Issue and Commit
Three uops can be dispatched to the queues each
clock, up to six can issue to functional units,
and three can commit
10
Pentium 4 - Pipelining

Called Netburst
Minimum clocks to transit pipeline was initially 21
1.5 GHz clock
Compare to PIII, 11 clocks
Later P4s up to 31 clocks
Clock rate bumped to over 3 GHz

11
Pentium 4 - Caches
18-cycle latency, 108 GB/s throughput
4-cycle integer load latency, 12 for FP
Up to 8 outstanding
12
Performance

Pentium 4 672, Prescott (2005)
3.8 GHz clock
800 MHz system bus
667 MHz DDR2 DRAMs
Because of power dissipation, Intel has switched
from Netburst to the Core (or Core 2)
microarchitecture (Pentium M uses this)
Shallower pipeline, wider issue

13
Branch Mispredictions/1000 inst.
Average 186 branches/1000
Average 48 branches/1000 for FP benchmarks
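The two metrics on this slide (mispredictions per 1000 instructions and branches per 1000 instructions) can be combined into a per-branch misprediction rate and a rough CPI cost. The sketch below is illustrative only: the 186 branches per 1000 instructions comes from the slide, while the misprediction count and the 20-cycle penalty are assumed values, not measurements.

```python
def misprediction_rate(mispredicts_per_kinst, branches_per_kinst):
    """Fraction of executed branches that were mispredicted."""
    if branches_per_kinst <= 0:
        raise ValueError("branches_per_kinst must be positive")
    return mispredicts_per_kinst / branches_per_kinst


def branch_stall_cpi(mispredicts_per_kinst, penalty_cycles):
    """Extra cycles per instruction lost to branch mispredictions."""
    return mispredicts_per_kinst / 1000.0 * penalty_cycles


# 186 branches/1000 instructions is from the slide.
# 10 mispredictions/1000 and a 20-cycle penalty are hypothetical.
rate = misprediction_rate(mispredicts_per_kinst=10.0, branches_per_kinst=186.0)
extra_cpi = branch_stall_cpi(mispredicts_per_kinst=10.0, penalty_cycles=20.0)
print(f"misprediction rate: {rate:.1%}, added CPI: {extra_cpi:.2f}")
```

The point of the second function is that a deep pipeline multiplies every misprediction by a large penalty, so even a small count per 1000 instructions shows up in CPI.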
14
Cache Misses / 1000 instructions
15
Effective CPI

mcf (combinatorial optimization) and swim (fluid
dynamics) have high miss rates

16
P4 vs. Opteron

Opteron has lower CPI on average
But deep pipeline of P4 enables higher clock rate

17
SPEC Ratio Comparison

2.8 GHz Opteron vs. 3.8 GHz P4
Opteron outperforms P4 on most
Pipeline stalls slowing P4 (also perhaps memory
system)
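The trade-off between CPI and clock rate on the last two slides follows from CPU time = instruction count x CPI / clock rate. A minimal sketch, assuming both processors execute the same instruction count; the clock rates are from the slide, the CPI values are hypothetical:

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """Classic CPU time equation: IC * CPI / clock rate."""
    return instruction_count * cpi / clock_hz


def break_even_cpi_ratio(fast_clock_hz, slow_clock_hz):
    """CPI ratio (fast-clock CPU / slow-clock CPU) at which both take equal time."""
    return fast_clock_hz / slow_clock_hz


P4_CLOCK_HZ = 3.8e9        # from the slide
OPTERON_CLOCK_HZ = 2.8e9   # from the slide

ratio = break_even_cpi_ratio(P4_CLOCK_HZ, OPTERON_CLOCK_HZ)
print(f"P4 loses once its CPI exceeds {ratio:.2f}x the Opteron CPI")

# Hypothetical CPIs, chosen only to illustrate the comparison.
INSTRUCTIONS = 1e9
t_opteron = cpu_time_seconds(INSTRUCTIONS, cpi=1.0, clock_hz=OPTERON_CLOCK_HZ)
t_p4 = cpu_time_seconds(INSTRUCTIONS, cpi=1.5, clock_hz=P4_CLOCK_HZ)
print(f"Opteron: {t_opteron:.3f} s, P4: {t_p4:.3f} s")
```

With these assumed CPIs the P4's clock advantage is not enough to offset its stalls, which is the qualitative result the slide reports.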

18
Perspective

Interest in multiple-issue because wanted to
improve performance without affecting
uniprocessor programming model
Taking advantage of ILP is conceptually simple,
but design problems are amazingly complex in
practice
Processors of last 5 years (Pentium 4, IBM Power
5, AMD Opteron) have the same basic structure and
similar sustained issue rates (3 to 4
instructions per clock) as the 1st dynamically
scheduled, multiple-issue processors announced in
1995
Clocks 10 to 20X faster, caches 4 to 8X bigger, 2
to 4X as many renaming registers, and 2X as many
load-store units => performance 8 to 16X
Peak v. delivered performance gap increasing

19
VLIW and EPIC
20
Review: Very Long Instruction Word (VLIW)

Each instruction has explicit coding for
multiple operations
In IA-64, grouping called a packet
In Transmeta, grouping called a molecule
(with atoms as ops)
Trade off instruction space for simple decoding
The long instruction word has room for many
operations
By definition, all the operations the compiler
puts in the long instruction word can execute in
parallel
E.g., 2 integer operations, 2 FP operations, 2
memory references, 1 branch
16 to 24 bits per field => 7*16 or 112 bits to
7*24 or 168 bits wide
Need very sophisticated compiling technique
that schedules across several branches

21
Recall: Unrolled Loop for Single-Issue
 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16     ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
L.D to ADD.D: 1 Cycle; ADD.D to S.D: 2 Cycles
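At source level, the assembly above appears to be the loop x[i] = x[i] + s (s held in F2) unrolled four times. The following sketch shows the same transformation in Python; the function names are hypothetical, and it assumes the array length is a multiple of 4, as the assembly does.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """Same loop unrolled 4 times: one loop-overhead step per 4 elements."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4 (no cleanup loop here)")
    for i in range(0, len(x), 4):
        # Four independent bodies; a scheduler is free to interleave them.
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = list(a)
add_scalar_rolled(a, 0.5)
add_scalar_unrolled4(b, 0.5)
print(a == b)
```

Unrolling pays off in the schedule because the four bodies use different registers, so the loads, adds, and stores can be reordered to hide the L.D and ADD.D latencies.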
22
Review: Loop Unrolling in VLIW

Memory ref 1      Memory ref 2      FP operation 1     FP op. 2           Int. op/branch   Clock
LD F0,0(R1)       LD F6,-8(R1)                                                             1
LD F10,-16(R1)    LD F14,-24(R1)                                                           2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                       3
LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                     4
                                    ADDD F20,F18,F2    ADDD F24,F22,F2                     5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                        6
SD -16(R1),F12    SD -24(R1),F16                                                           7
SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,56    8
SD 8(R1),F28                                                              BNEZ R1,LOOP     9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per
iteration (1.8X)
Average 2.5 ops per clock, 50% efficiency
Need more registers in VLIW (15 instead of 6)
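The utilization figures quoted above can be checked by counting the operations in each long instruction word of the schedule. This is a small sketch that records only the opcode in each occupied slot; the 5 slots per word match the five operation columns of the table.

```python
SLOTS_PER_WORD = 5   # 2 memory refs, 2 FP ops, 1 integer op/branch
ITERATIONS = 7       # loop unrolled 7 times

# One entry per clock; each entry lists the slots actually filled.
schedule = [
    ["LD", "LD"],
    ["LD", "LD"],
    ["LD", "LD", "ADDD", "ADDD"],
    ["LD", "ADDD", "ADDD"],
    ["ADDD", "ADDD"],
    ["SD", "SD", "ADDD"],
    ["SD", "SD"],
    ["SD", "SD", "SUBI"],
    ["SD", "BNEZ"],
]

clocks = len(schedule)
total_ops = sum(len(word) for word in schedule)
ops_per_clock = total_ops / clocks
efficiency = total_ops / (clocks * SLOTS_PER_WORD)
clocks_per_iteration = clocks / ITERATIONS

print(f"{total_ops} ops in {clocks} clocks")
print(f"ops per clock: {ops_per_clock:.2f}")
print(f"slot utilization: {efficiency:.0%}")
print(f"clocks per iteration: {clocks_per_iteration:.2f}")
```

Roughly half of the slots are empty, which is the wasted encoding space that the later slide on first-generation VLIW problems refers to.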

23
Early VLIW

Floating-Point Systems AP-120B
Multiflow
7, 14, or 28 instructions per word (512- to 1024-bit
instruction)
Cydrome
7 operations in a 256-bit instruction

24
Problems with 1st Generation VLIW

Increase in code size
generating enough operations in a straight-line
code fragment requires ambitiously unrolling
loops
whenever VLIW instructions are not full, unused
functional units translate to wasted bits in
instruction encoding
Operated in lock-step; no hazard detection HW
a stall in any functional unit pipeline (or cache
miss) caused entire processor to stall, since all
functional units must be kept synchronized
Compiler might predict functional units, but
caches hard to predict
Binary code compatibility
Pure VLIW => different numbers of functional
units and unit latencies require different
versions of the code

25
Intel/HP IA-64: Explicitly Parallel Instruction
Computer (EPIC)

IA-64 instruction set architecture
Hardware checks dependencies, so not fully static
Predicated execution (select 1 out of 64 1-bit
flags) => 40% fewer mispredictions?
(see notes from last class)
    BNEZ R1,L
    ADDU R2,R3,R0
L:
becomes
    CMOVZ R2,R3,R1
128 64-bit integer regs + 128 82-bit FP regs!
Not separate register files per functional unit
as in old VLIW
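The BNEZ/ADDU sequence and its CMOVZ replacement above compute the same result; the conditional move simply removes the branch. A minimal sketch of the two semantics, with hypothetical function names and with R0 taken to be the MIPS zero register:

```python
def with_branch(r1, r2, r3):
    """BNEZ R1,L ; ADDU R2,R3,R0 ; L:  -> returns the final value of R2."""
    if r1 != 0:          # BNEZ R1,L: skip the move when R1 is non-zero
        return r2
    return r3 + 0        # ADDU R2,R3,R0 with R0 == 0, i.e. R2 = R3


def with_cmovz(r1, r2, r3):
    """CMOVZ R2,R3,R1: R2 = R3 if R1 == 0, otherwise R2 is unchanged."""
    return r3 if r1 == 0 else r2


for r1 in (0, 1, -5):
    print(r1, with_branch(r1, 10, 20), with_cmovz(r1, 10, 20))
```

Because the predicated form has no control transfer, there is nothing for the branch predictor to get wrong, which is where the claimed reduction in mispredictions would come from.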

26
IA-64 Predication more general

Instead of just a move, IA-64 has 64 1-bit
predicate registers
Each instruction is associated with a register (a
6-bit field in instruction)
The instruction that sets a predicate register can
perform a Boolean operation (e.g., an AND of 2
comparisons)
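To make the mechanism concrete, here is a toy model of predicated execution: 64 one-bit predicate registers, and every instruction names one of them through a 6-bit field (2^6 = 64). The class and method names are hypothetical, and the model is a sketch, not the IA-64 encoding. It does rely on one IA-64 fact, that predicate register p0 always reads as true.

```python
class PredicatedMachine:
    NUM_PREDICATES = 64  # selected by a 6-bit field in each instruction

    def __init__(self):
        self.pred = [False] * self.NUM_PREDICATES
        self.pred[0] = True   # p0 always reads as true
        self.regs = {}

    def set_pred_and(self, p_dst, cond_a, cond_b):
        """Set a predicate to the AND of two comparison results."""
        if not 1 <= p_dst < self.NUM_PREDICATES:
            raise ValueError("p_dst must be in 1..63")
        self.pred[p_dst] = bool(cond_a and cond_b)

    def execute(self, qp, dest, value):
        """Write dest only if the qualifying predicate is true."""
        if not 0 <= qp < self.NUM_PREDICATES:
            raise ValueError("qp must fit in 6 bits")
        if self.pred[qp]:
            self.regs[dest] = value
            return True
        return False          # instruction is nullified


m = PredicatedMachine()
x = 7
m.set_pred_and(1, x > 0, x < 10)    # true
m.set_pred_and(2, x > 0, x > 100)   # false
m.execute(1, "r2", 111)             # takes effect
m.execute(2, "r3", 222)             # nullified
m.execute(0, "r4", 333)             # unconditional via p0
print(m.regs)
```

Unlike a conditional move, any instruction can be guarded this way, so whole if/else bodies can be converted to straight-line code.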

27
Instruction Format

Not all slots used for a particular instruction

28
Instruction Group

An instruction group has no register dependencies
within it
All instructions in group can be executed in
parallel
If there's enough hardware
Group can be arbitrarily long
Compiler determines groups
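One way to picture how a compiler might form such groups is a greedy pass over an already ordered instruction list: start a new group whenever the next instruction reads or writes a register written earlier in the current group. This is only a sketch under that simplified rule; the `Instr` type and `form_groups` function are hypothetical, and a real IA-64 compiler also reorders instructions and handles memory dependences.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def form_groups(instrs):
    """Greedily split a sequence into groups free of register dependencies."""
    groups = []
    current = []
    written = set()           # registers written so far in the current group
    for ins in instrs:
        raw = any(src in written for src in ins.srcs)
        waw = ins.dest is not None and ins.dest in written
        if (raw or waw) and current:
            groups.append(current)   # a stop goes here
            current = []
            written = set()
        current.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("LD", "F0", ("R1",)),
    Instr("LD", "F6", ("R1",)),
    Instr("ADDD", "F4", ("F0", "F2")),
    Instr("ADDD", "F8", ("F6", "F2")),
    Instr("SD", None, ("R1", "F4")),
    Instr("SD", None, ("R1", "F8")),
]
for number, group in enumerate(form_groups(program), start=1):
    print(number, [ins.op for ins in group])
```

The group length is unbounded in this model, matching the slide: how many instructions of a group actually run in parallel depends on the hardware, not on the grouping.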

29
Bundles

128 bits
Template (5 bits) and 3 slots
Bars are stops between groups
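The sizes on this slide imply 41-bit slots, since (128 - 5) / 3 = 41. The sketch below packs and unpacks a bundle as a single 128-bit integer. It assumes the template sits in the low 5 bits with the three slots above it; the function names are hypothetical, and the stop information carried by the template is not modeled.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle has exactly 3 slots")
    word = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: returns (template, (slot0, slot1, slot2))."""
    template = word & TEMPLATE_MASK
    slots = tuple(
        (word >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(SLOTS_PER_BUNDLE)
    )
    return template, slots


bundle = pack_bundle(0x10, (1, 2, SLOT_MASK))
print(bundle.bit_length() <= 128, unpack_bundle(bundle))
```

Fixing the bundle at 128 bits keeps fetch and decode simple, while the template tells the hardware which unit type each slot needs and where groups end.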

30
Example

Coded with MIPS assembly ops
Loop: x[i] = x[i] + s
Unrolled 7 times

31
Minimize Instruction Size
32
Minimize Execution Time

Assuming enough functional units

33
Implementations

Itanium was first implementation (2001)
6-wide, 10-stage pipeline at 800 MHz on 0.18 µ
process
Itanium 2 was 2nd implementation (2005)
6-wide, 8-stage pipeline at 1.6 GHz on 0.13 µ
process
Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D,
9216 KB L3

34
Performance
SPECfp
SPECint

Itanium biggest and most power hungry

35
References

Appendix G includes more compiler techniques that
take advantage of VLIW
Much of the development of VLIW compiler
technology came from the group led by Fisher at
Yale
They founded Multiflow
