Transcript and Presenter's Notes

Title: IBM 650, perhaps the first


1
IBM 650, perhaps the first mass produced
computer. IBM announced it in 1953, and expected
to sell only 50 units or so. By the time it was
discontinued in 1962, almost 2000 had been
sold. The 650 first used a magnetic drum memory
(top right), which stored 10,000 or 20,000
digits (1,000 or 2,000 words). Later a disk was
introduced (right) with 50 platters and a
capacity of 6 million digits.
2
COMP 740: Computer Architecture and Implementation
  • Montek Singh
  • Thu, Mar 19, 2009
  • Topic: Case Studies
  • (Pentium IV and Itanium)

3
Pentium 4: Look at features
4
Pentium 4 - Fetch
Up to 3 IA-32 instructions/clk
5
Pentium 4 - Fetch
Can generate up to 6 uops/clk
6
Pentium 4 - uops
Trace cache, temporal coherence instead of spatial
7
Pentium 4: 2 BTBs
Another BTB for uops
8
Pentium 4: Out of Order
Register renaming instead of ROB, 128 total regs
for renaming
9
Pentium 4: Issue and Commit
Three uops can be dispatched to the queues each
clock, up to 6 to the functional units, and three can commit
10
Pentium 4: Pipelining
  • Called NetBurst
  • Minimum clocks to transit pipeline was initially
    21
  • 1.5 GHz clock
  • Compare to PIII, 11 clocks
  • Later P4s up to 31 clocks
  • Clock rate bumped to over 3 GHz

11
Pentium 4: Caches
18-cycle latency, 108 GB/s throughput
4-cycle integer load latency, 12 cycles for FP; up to 8
outstanding misses
12
Performance
  • Pentium 4 672, Prescott (2005)
  • 3.8 GHz clock
  • 800 MHz system bus
  • 667 MHz DDR2 DRAMs
  • Because of power dissipation, Intel has switched
    from Netburst to the Core (or Core 2)
    microarchitecture (Pentium M uses this)
  • Shallower pipeline, wider issue

13
Branch Mispredictions/1000 inst.
Average 186 branches/1000 instructions for integer benchmarks
Average 48 branches/1000 for FP benchmarks
14
Cache Misses / 1000 instructions
15
Effective CPI
  • mcf (combinatorial optimization) and swim (fluid
    dynamics) have high miss rates, which drives up
    their effective CPI (see the relation below)
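As a reminder, the standard textbook relation (not shown on the slide) for how misses inflate CPI is:

$$\mathrm{CPI}_{\text{effective}} = \mathrm{CPI}_{\text{base}} + \frac{\text{misses}}{\text{instruction}} \times \text{miss penalty (clocks)}$$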

16
P4 vs. Opteron
  • Opteron has lower CPI on average
  • But deep pipeline of P4 enables higher clock rate

17
SPEC Ratio Comparison
  • 2.8 GHz Opteron vs. 3.8 GHz P4
  • Opteron outperforms P4 on most benchmarks
  • Pipeline stalls slow the P4 (perhaps the memory
    system does too)

18
Perspective
  • Interest in multiple-issue because wanted to
    improve performance without affecting
    uniprocessor programming model
  • Taking advantage of ILP is conceptually simple,
    but design problems are amazingly complex in
    practice
  • Processors of last 5 years (Pentium 4, IBM Power
    5, AMD Opteron) have the same basic structure and
    similar sustained issue rates (3 to 4
    instructions per clock) as the 1st dynamically
    scheduled, multiple-issue processors announced in
    1995
  • Clocks 10 to 20X faster, caches 4 to 8X bigger, 2
    to 4X as many renaming registers, and 2X as many
    load-store units ⇒ performance 8 to 16X
  • Peak v. delivered performance gap increasing

19
VLIW and EPIC
20
Review: Very Long Instruction Word (VLIW)
  • Each instruction has explicit coding for
    multiple operations
  • In IA-64, grouping called a packet
  • In Transmeta, grouping called a molecule
    (with atoms as ops)
  • Trade off instruction space for simple decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word can execute in
    parallel
  • E.g., 2 integer operations, 2 FP operations, 2
    memory references, 1 branch
  • 16 to 24 bits per field ⇒ 7×16 or 112 bits to
    7×24 or 168 bits wide (see the sketch after this
    list)
  • Need very sophisticated compiling technique that
    schedules across several branches
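To make the packing concrete, here is a minimal C sketch of a hypothetical 7-slot VLIW word with the operation mix named above; the struct name and field widths are illustrative only, not any real ISA.

    #include <stdint.h>

    /* Hypothetical 7-slot VLIW word: 2 integer ops, 2 FP ops,
       2 memory references, 1 branch.  With 16- to 24-bit fields
       the packed word would be 112 to 168 bits wide; uint32_t
       slots are used here only for simplicity. */
    typedef struct {
        uint32_t int_op[2];   /* integer operation slots        */
        uint32_t fp_op[2];    /* floating-point operation slots */
        uint32_t mem_ref[2];  /* memory reference slots         */
        uint32_t branch;      /* branch slot                    */
    } vliw_word;              /* all 7 slots issue in parallel  */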

21
Recall: Unrolled Loop for Single-Issue
1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: L.D to ADD.D 1 cycle; ADD.D to S.D 2 cycles
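For reference, a minimal C sketch of the source loop this code implements (the running x[i] = x[i] + s example) and its 4-way unrolled form; the function names and the assumption that n is a multiple of 4 are mine, not the slide's.

    /* Rolled form: one load, add, store per element. */
    void add_scalar(double *x, double s, long n)
    {
        for (long i = n - 1; i >= 0; i--)
            x[i] = x[i] + s;
    }

    /* Unrolled 4x, mirroring the 14-instruction listing above;
       assumes n is a multiple of 4. */
    void add_scalar_unrolled(double *x, double s, long n)
    {
        for (long i = n - 1; i >= 0; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }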
22
Review: Loop Unrolling in VLIW

Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op/branch   Clock
LD F0,0(R1)          LD F6,-8(R1)                                                                1
LD F10,-16(R1)       LD F14,-24(R1)                                                              2
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                       3
LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                     4
                                          ADDD F20,F18,F2    ADDD F24,F22,F2                     5
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                        6
SD -16(R1),F12       SD -24(R1),F16                                                              7
SD -32(R1),F20       SD -40(R1),F24                                           SUBI R1,R1,48      8
SD -0(R1),F28                                                                 BNEZ R1,LOOP       9

  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration (1.8X)
  • Average 2.5 ops per clock, 50% efficiency
    (checked below)
  • Need more registers in VLIW (15 instead of 6)
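A quick check of those averages (my count from the table above, not the slide's): the schedule contains 7 loads, 7 adds, 7 stores, and 2 integer/branch ops, i.e. 23 operations in 9 clocks over 5 issue slots:

$$\frac{23\ \text{ops}}{9\ \text{clocks}} \approx 2.5\ \text{ops/clock}, \qquad \frac{23\ \text{ops}}{9 \times 5\ \text{slots}} \approx 51\% \approx 50\%\ \text{efficiency}$$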

23
Early VLIW
  • Floating-Point Systems AP-120B
  • Multiflow
  • 7, 14, or 28 instructions per word (512–1024-bit
    instruction)
  • Cydrome
  • 7 operations in a 256-bit instruction

24
Problems with 1st Generation VLIW
  • Increase in code size
  • generating enough operations in a straight-line
    code fragment requires ambitiously unrolling
    loops
  • whenever VLIW instructions are not full, unused
    functional units translate to wasted bits in
    instruction encoding
  • Operated in lock-step; no hazard detection HW
  • a stall in any functional unit pipeline (or cache
    miss) caused entire processor to stall, since all
    functional units must be kept synchronized
  • Compiler might predict function units, but
    caches are hard to predict
  • Binary code compatibility
  • Pure VLIW ⇒ different numbers of functional
    units and unit latencies require different
    versions of the code

25
Intel/HP IA-64 Explicitly Parallel Instruction
Computer (EPIC)
  • IA-64 instruction set architecture
  • Hardware checks dependencies, so not fully static
  • Predicated execution (select 1 out of 64 1-bit
    flags) ⇒ 40% fewer mispredictions?
  • (see notes from last class)
  • Example:
        BNEZ  R1,L
        ADDU  R2,R3,R0
    L:
  • becomes CMOVZ R2,R3,R1 (see the C sketch after
    this list)
  • 128 64-bit integer regs, 128 82-bit FP regs
  • Not separate register files per functional unit
    as in old VLIW
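A minimal C sketch of what that if-conversion computes (function and variable names are mine): the branch-and-move sequence and the single CMOVZ both implement a conditional assignment.

    /* Equivalent of the BNEZ/ADDU sequence: if R1 == 0, copy R3 into R2.
       CMOVZ R2,R3,R1 does this in one branch-free instruction; a compiler
       will typically emit a conditional move for this C as well. */
    long conditional_copy(long r2, long r3, long r1)
    {
        if (r1 == 0)
            r2 = r3;
        return r2;
    }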

26
IA-64 Predication: more general
  • Instead of just a move, IA-64 has 64 1-bit
    predicate registers
  • Each instruction is associated with a register (a
    6-bit field in instruction)
  • Set instruction for predicate reg can perform a
    Boolean operation (e.g., an AND of 2 comparisons);
    see the sketch below
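A hedged C sketch of that idea (the names and the particular comparisons are illustrative): the predicate is the AND of two comparisons, and it guards the update. On IA-64 the compare would set a 1-bit predicate register named by the guarded instruction's 6-bit field; here it is only modeled in plain C.

    /* Predicate = AND of two comparisons; the store executes
       only when the predicate is true (no separate branch needed
       when the hardware supports predicated instructions). */
    void guarded_store(long a, long b, long *x, long y)
    {
        int p = (a > 0) && (b != 0);  /* combined predicate */
        if (p)
            *x = y;                   /* guarded (predicated) update */
    }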

27
Instruction Format
  • Not all slots used for a particular instruction

28
Instruction Group
  • An instruction group has no register dependencies
    within it
  • All instructions in group can be executed in
    parallel
  • If there's enough hardware
  • Group can be arbitrarily long
  • Compiler determines groups

29
Bundles
  • 128 bits
  • Template (5 bits) and 3 slots
  • Bars are stops between groups
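The widths add up as follows (the 41-bit slot size is from the IA-64 definition, not stated on the slide):

$$5\ \text{(template)} + 3 \times 41\ \text{(instruction slots)} = 128\ \text{bits}$$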

30
Example
  • Coded with MIPS assembly ops
  • Loop: x[i] = x[i] + s
  • Unrolled 7 times

31
Minimize Instruction Size
32
Minimize Execution Time
  • Assuming enough functional units

33
Implementations
  • Itanium was first implementation (2001)
  • 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm
    process
  • Itanium 2 was 2nd implementation (2005)
  • 6-wide, 8-stage pipeline at 1.6 GHz on a 0.13 µm
    process
  • Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D,
    9216 KB L3

34
Performance
SPECfp and SPECint comparison charts
  • Itanium biggest and most power hungry

35
References
  • Appendix G includes more compiler techniques that
    take advantage of VLIW
  • Much of the development of VLIW compiler
    technology was led by Fisher's group at Yale
  • They founded Multiflow