Title: IBM 650, perhaps the first
1IBM 650, perhaps the first mass-produced
computer. IBM announced it in 1953 and expected
to sell only 50 units or so. By the time it was
discontinued in 1962, almost 2000 had been
sold. The 650 first used a magnetic drum memory
(top right), which stored 10,000 or 20,000
digits (1,000 or 2,000 words). Later a disk was
introduced (right) with 50 platters and a
capacity of 6 million digits.
2COMP 740: Computer Architecture and Implementation
- Montek Singh
- Thu, Mar 19, 2009
- Topic: Case Studies (Pentium 4 and Itanium)
3Pentium 4 - Look at features
4Pentium 4 - Fetch
Up to 3 IA-32 instructions/clk
5Pentium 4 - Fetch
Can generate up to 6 uops/clk
6Pentium 4 - uops
Trace cache, temporal coherence instead of spatial
7Pentium 4 - 2 BTBs
Another BTB for uops
8Pentium 4 - Out of Order
Register renaming instead of ROB, 128 total regs
for renaming
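Renaming through a pool of physical registers can be sketched as below. This is a toy illustration, not the Pentium 4's actual mechanism: the `Renamer` class, its free-list policy, and the 8 architectural registers are assumptions made for the example. Only the pool size of 128 comes from the slide. Returning registers to the free list at commit is omitted.

```python
from collections import deque


class Renamer:
    """Toy register renamer (hypothetical): architectural -> physical."""

    def __init__(self, num_arch=8, num_phys=128):
        # Initially architectural register i lives in physical register i.
        self.map = {f"r{i}": i for i in range(num_arch)}
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (phys_dest, phys_srcs)."""
        # Sources read the current mapping, so true (RAW) dependences stay.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: stall")
        # A fresh destination removes WAR and WAW hazards on the name.
        phys_dest = self.free.popleft()
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


r = Renamer()
first = r.rename("r1", ["r2", "r3"])   # r1 = r2 op r3
second = r.rename("r4", ["r1"])        # reads the renamed r1 (RAW kept)
third = r.rename("r1", ["r5"])         # rewrites r1 (WAW removed)
print(first, second, third)
```

The two writes to `r1` land in different physical registers, so the third instruction does not have to wait for the second to read the old value.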
9Pentium 4 - Issue and Commit
Three uops can be dispatched to the queues each
clock, up to 6 can issue to functional units, and three can commit
10Pentium 4 - Pipelining
- Called Netburst
- Minimum clocks to transit pipeline was initially
21, at a 1.5 GHz clock
- Compare to PIII, 11 clocks
- Later P4s up to 31 clocks
- Clock rate bumped to over 3 GHz
11Pentium 4 - Caches
- 18-cycle latency, 108 GB/s throughput
- 4-cycle integer load latency, 12 for FP
- Up to 8 outstanding
12Performance
- Pentium 4 672, Prescott (2005)
- 3.8 GHz clock
- 800 MHz system bus
- 667 MHz DDR2 DRAMs
- Because of power dissipation, Intel has switched
from Netburst to the Core (or Core 2)
microarchitecture (Pentium M uses this)
- Shallower pipeline, wider issue
13Branch Mispredictions/1000 inst.
Average 186 branches/1000
Average 48 branches/1000 for FP benchmarks
14Cache Misses / 1000 instructions
15Effective CPI
- mcf (combinatorial optimization) and swim (fluid
dynamics) have high miss rates
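One common way to see how misprediction and miss rates per 1000 instructions feed into effective CPI is to add each event's stall cycles to a base CPI. The sketch below assumes that simple additive model. The function name `effective_cpi` and every number in the example are hypothetical and are not measurements from these slides.

```python
def effective_cpi(base_cpi, events):
    """Additive stall model (an assumption, not the slides' method).

    events: iterable of (events_per_1000_instructions, penalty_cycles).
    """
    stall_cpi = sum(rate / 1000.0 * penalty for rate, penalty in events)
    return base_cpi + stall_cpi


# Hypothetical workload: all values below are made up for illustration.
cpi = effective_cpi(
    base_cpi=0.5,
    events=[
        (10, 20),    # 10 branch mispredictions per 1000 instr., 20 cycles each
        (5, 100),    # 5 cache misses per 1000 instr., 100 cycles each
    ],
)
print(round(cpi, 2))
```

Under this model a benchmark with a high miss rate sees its CPI dominated by the memory term, which matches the slide's point about mcf and swim.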
16P4 vs. Opteron
- Opteron has lower CPI on average
- But deep pipeline of P4 enables higher clock rate
17SPEC Ratio Comparison
- 2.8 GHz Opteron vs. 3.8 GHz P4
- Opteron outperforms P4 on most
- Pipeline stalls slowing P4 (also perhaps memory
system)
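The trade-off between clock rate and CPI can be made concrete with time per instruction = CPI / clock rate. The clock rates below are the ones on the slide. The CPI values are hypothetical placeholders, since the measured CPIs were shown only in the charts.

```python
def time_per_instruction_ns(cpi, clock_ghz):
    """Average time per instruction in nanoseconds."""
    return cpi / clock_ghz


P4_GHZ = 3.8        # from the slide
OPTERON_GHZ = 2.8   # from the slide

# The P4 wins only if CPI_P4 / CPI_Opteron is below the clock-rate ratio.
break_even_ratio = P4_GHZ / OPTERON_GHZ

# Hypothetical CPIs, chosen only to illustrate the comparison.
opteron_cpi = 1.0
p4_cpi = 1.5

p4_ns = time_per_instruction_ns(p4_cpi, P4_GHZ)
opteron_ns = time_per_instruction_ns(opteron_cpi, OPTERON_GHZ)
print(round(break_even_ratio, 2), round(p4_ns, 3), round(opteron_ns, 3))
```

With these placeholder CPIs the P4's CPI is 1.5 times the Opteron's, which exceeds the clock-rate ratio of about 1.36, so the Opteron comes out ahead despite the slower clock.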
18Perspective
- Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model
- Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice
- Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995
  - Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units => performance 8 to 16X
- Peak v. delivered performance gap increasing
19VLIW and EPIC
20Review: Very Long Instruction Word (VLIW)
- Each instruction has explicit coding for multiple operations
  - In IA-64, grouping called a packet
  - In Transmeta, grouping called a molecule (with atoms as ops)
- Trade off instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word can execute in parallel
  - E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
  - 16 to 24 bits per field => 7 x 16 = 112 bits to 7 x 24 = 168 bits wide
- Need very sophisticated compiling technique that schedules across several branches
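The instruction-width arithmetic above can be checked in a few lines. The dictionary name `slots` is just for this example. The slot mix and the 16 to 24 bits per field are the figures from the slide.

```python
# Operation slots in the example long instruction word (from the slide).
slots = {"integer": 2, "fp": 2, "memory": 2, "branch": 1}

ops_per_word = sum(slots.values())

# Width of the whole word for the narrowest and widest field encodings.
widths = [ops_per_word * bits_per_field for bits_per_field in (16, 24)]

print(ops_per_word, widths)  # 7 [112, 168]
```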
21Recall: Unrolled Loop for Single-Issue
 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: L.D to ADD.D 1 cycle, ADD.D to S.D 2 cycles
22Review: Loop Unrolling in VLIW
Clock  Memory ref. 1    Memory ref. 2    FP operation 1    FP operation 2    Int. op/branch
1      LD F0,0(R1)      LD F6,-8(R1)     -                 -                 -
2      LD F10,-16(R1)   LD F14,-24(R1)   -                 -                 -
3      LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2     -
4      LD F26,-48(R1)   -                ADDD F12,F10,F2   ADDD F16,F14,F2   -
5      -                -                ADDD F20,F18,F2   ADDD F24,F22,F2   -
6      SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2   -                 -
7      SD -16(R1),F12   SD -24(R1),F16   -                 -                 -
8      SD -32(R1),F20   SD -40(R1),F24   -                 -                 SUBI R1,R1,56
9      SD 8(R1),F28     -                -                 -                 BNEZ R1,LOOP
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average 2.5 ops per clock, 50% efficiency
- Need more registers in VLIW (15 instead of 6)
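The per-iteration and efficiency figures can be recomputed directly from the schedule. The sketch below encodes the 9-clock, 5-slot schedule by opcode only (operands omitted) and counts filled slots. The variable names are assumptions made for the example.

```python
# One tuple per clock: (mem 1, mem 2, FP 1, FP 2, int/branch); None = empty.
schedule = [
    ("LD", "LD", None,   None,   None),
    ("LD", "LD", None,   None,   None),
    ("LD", "LD", "ADDD", "ADDD", None),
    ("LD", None, "ADDD", "ADDD", None),
    (None, None, "ADDD", "ADDD", None),
    ("SD", "SD", "ADDD", None,   None),
    ("SD", "SD", None,   None,   None),
    ("SD", "SD", None,   None,   "SUBI"),
    ("SD", None, None,   None,   "BNEZ"),
]
iterations = 7  # the loop body was unrolled 7 times

clocks = len(schedule)
total_ops = sum(op is not None for row in schedule for op in row)
total_slots = clocks * len(schedule[0])

clocks_per_iteration = clocks / iterations
ops_per_clock = total_ops / clocks
efficiency = total_ops / total_slots

print(total_ops, round(clocks_per_iteration, 2),
      round(ops_per_clock, 2), round(efficiency, 2))
```

The counts come out at 23 operations in 45 slots, which is where the roughly 2.5 operations per clock and roughly 50% slot utilization on the slide come from.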
23Early VLIW
- Floating-Point Systems AP-120B
- Multiflow
  - 7, 14, or 28 operations per instruction word (512- or 1024-bit instruction)
- Cydrome
  - 7 operations in a 256-bit instruction
24Problems with 1st Generation VLIW
- Increase in code size
  - generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  - whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding
- Operated in lock-step; no hazard detection HW
  - a stall in any functional unit pipeline (or cache miss) caused entire processor to stall, since all functional units must be kept synchronized
  - Compiler might predict functional units, but caches hard to predict
- Binary code compatibility
  - Pure VLIW => different numbers of functional units and unit latencies require different versions of the code
25Intel/HP IA-64 Explicitly Parallel Instruction
Computer (EPIC)
- IA-64 instruction set architecture
- Hardware checks dependencies, so not fully static
- Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions? (see notes from last class)
      BNEZ  R1,L
      ADDU  R2,R3,R0
  L:
  becomes
      CMOVZ R2,R3,R1
- 128 64-bit integer regs + 128 82-bit FP regs
  - Not separate register files per functional unit as in old VLIW
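The equivalence between the branch-and-move sequence and the conditional move can be checked with a small model. This is a sketch of the semantics only. The two function names are made up for the example, and registers are modeled as plain Python integers.

```python
def branch_version(r1, r2, r3):
    """BNEZ R1,L ; ADDU R2,R3,R0 ; L:  (returns the final value of R2)."""
    if r1 != 0:
        return r2      # branch taken: the move is skipped, R2 unchanged
    return r3 + 0      # ADDU R2,R3,R0 copies R3 into R2 (R0 is zero)


def cmovz_version(r1, r2, r3):
    """CMOVZ R2,R3,R1: R2 <- R3 only if R1 == 0, with no branch."""
    return r3 if r1 == 0 else r2


# Both forms leave the same value in R2 for every tested input.
same = all(
    branch_version(r1, 10, 20) == cmovz_version(r1, 10, 20)
    for r1 in range(-3, 4)
)
print(same)  # True
```

The conditional move turns a control dependence into a data dependence, so there is no branch left to mispredict.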
26IA-64 - Predication more general
- Instead of just a move, IA-64 has 64 1-bit predicate registers
- Each instruction is associated with a register (a 6-bit field in instruction)
- Set instruction for predicate reg can perform Boolean (e.g., an AND of 2 comparisons)
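A minimal model of predicate registers guarding ordinary instructions might look like the following. The class `PredicateFile`, the helper `predicated_add`, and the register values are hypothetical. The 64 one-bit predicates and the 6-bit selector come from the slide, and the sketch relies on the IA-64 convention that predicate 0 always reads as true.

```python
class PredicateFile:
    """Toy model of 64 one-bit predicate registers (hypothetical)."""

    def __init__(self):
        self.bits = [False] * 64
        self.bits[0] = True  # p0 always reads as true in IA-64

    def set(self, index, value):
        # A 6-bit field selects p0..p63; p0 is treated as read-only here.
        if not 0 < index < 64:
            raise ValueError("predicate index must be 1..63")
        self.bits[index] = bool(value)


def predicated_add(preds, qp, regs, dest, a, b):
    """(qp) add dest = a + b: takes effect only if predicate qp is true."""
    if preds.bits[qp]:
        regs[dest] = regs[a] + regs[b]


preds = PredicateFile()
regs = {"r1": 5, "r2": 7, "r3": 0, "r4": 0}

# One compare-style set can combine two comparisons with AND.
preds.set(1, regs["r1"] > 0 and regs["r2"] > regs["r1"])
preds.set(2, regs["r1"] > 10)

predicated_add(preds, 1, regs, "r3", "r1", "r2")  # p1 true: executes
predicated_add(preds, 2, regs, "r4", "r1", "r2")  # p2 false: nullified
print(regs["r3"], regs["r4"])  # 12 0
```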
27Instruction Format
- Not all slots used for a particular instruction
28Instruction Group
- An instruction group has no register dependencies within it
- All instructions in group can be executed in parallel
  - If there's enough hardware
- Group can be arbitrarily long
- Compiler determines groups
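How a compiler might cut a straight-line sequence into instruction groups can be sketched with a greedy in-order pass. This is a simplification under stated assumptions: instructions are modeled as hypothetical `(dest, sources)` tuples, only register read-after-write and write-after-write dependences within a group are checked, and no reordering is attempted as a real compiler would.

```python
def form_groups(instrs):
    """Split instrs, a list of (dest, [srcs]), into dependence-free groups."""
    groups, current, written = [], [], set()
    for dest, srcs in instrs:
        # A read or write of a register already written in this group
        # means the instruction cannot share the group: insert a stop.
        if dest in written or any(s in written for s in srcs):
            groups.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r5", "r6"]),
    ("r7", ["r1", "r4"]),   # reads r1 and r4: must start a new group
    ("r8", ["r2", "r5"]),
]
groups = form_groups(program)
print([len(g) for g in groups])  # [2, 2]
```

Group length is not tied to the hardware width: a group only promises that its members are independent, and the processor executes as many per clock as it has resources for.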
29Bundles
- 128 bits
- Template (5 bits) and 3 slots
- Bars are stops between groups
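The bundle layout implies (128 - 5) / 3 = 41 bits per slot. Packing and unpacking such a bundle can be sketched as below. The function names are invented for the example, and placing the template in the low-order 5 bits with slots 0 to 2 above it is an assumption of this sketch rather than something stated on the slide.

```python
TEMPLATE_BITS = 5
NUM_SLOTS = 3
SLOT_BITS = (128 - TEMPLATE_BITS) // NUM_SLOTS  # 41 bits per slot


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit int."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != NUM_SLOTS:
        raise ValueError("a bundle has exactly 3 slots")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: returns (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(NUM_SLOTS)]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, 3])
print(unpack_bundle(bundle))  # (16, [1, 2, 3])
```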
30Example
- Coded with MIPS assembly ops
- Loop: x[i] = x[i] + s
- Unrolled 7 times
31Minimize Instruction Size
32Minimize Execution Time
- Assuming enough functional units
33Implementations
- Itanium was first implementation (2001)
  - 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
- Itanium 2 was 2nd implementation (2005)
  - 6-wide, 8-stage pipeline at 1.6 GHz on 0.13 µ process
  - Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3
34Performance
(Charts: SPECfp and SPECint)
- Itanium biggest and most power hungry
35References
- Appendix G includes more compiler techniques that take advantage of VLIW
- The group that led much of the development of VLIW compiler technology was led by Fisher at Yale
  - They founded Multiflow