High Performance Computing Group - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: High Performance Computing Group


1
High Performance Computing Group
  • Prof. Mateo Valero
  • http://www.people.ac.upc.es/mateo
  • Computer Architecture Department
  • Universitat Politècnica de Catalunya

2
Universitat Politècnica de Catalunya
  • Created in 1971
  • 38 departments and 6 research institutes
  • 9 schools and 6 technical colleges
  • 2,240 professors, 30,443 students, and 1,221
    administrative staff
  • More than 250 research fields

3
Computer Architecture Department
  • Created in 1978 by Prof. Tomas Lang
  • 48 tenured professors (13 on microarchitecture
    and instruction scheduling)
  • 21 full-time assistants (13 on microarchitecture
    and instruction scheduling)
  • 13 part-time assistants
  • 48 PhD fellowships (15 on microarchitecture and
    instruction scheduling)
  • 14 Staff Members
  • 8 Administrative
  • 6 System administration

4
Computer Architecture Department
  • Main Research Groups
  • VLSI Systems Design
  • Broadband Integrated Communication Systems
  • Distributed Systems Architecture
  • High Performance Computing

5
High Performance Computing Group
  • Superscalar processors
  • Register File
  • Cache Memory
  • Branch Prediction
  • Data Value Prediction
  • Fetch Mechanisms
  • Data reuse
  • Clustered Microarchitectures
  • Power-Aware Architectures
  • Vector architectures
  • Efficient Access to Vectors
  • Advanced Vector Architectures
  • Vector Microprocessors
  • Multimedia Vector Processors
  • VLIW processors
  • Register File Use and Organization
  • Software Pipelining
  • Wide Architectures
  • Clustered Architectures
  • Compilers
  • Operating Systems
  • Algorithms Applications
  • Computer Architecture
  • Multithreaded processors
  • Multithreaded Vector Processor
  • Simultaneous Multithreaded Vector Processor
  • Speculative Multithreaded Scalar Processors
  • Clustered Speculative Multithreaded Processors
  • Multithreaded-Decoupled Architectures
  • Distant Parallelism

6
R&D Projects on Parallel Software
[Timeline figure, 1992-2001: projects Supernode II, Dimemas & Paraver, Parmat, Sloegat, Identify, Promenvir, BMW, ST-ORM, Bonanova, DDT and Apparc, grouped under Tools, Parallelization, Metacomputing, Data Base and System]
7
Selected Publications (last 6 years)
  • Conferences
  • 6 - ISCA
  • 14 - MICRO
  • 13 - HPCA
  • 12 - PACT
  • 33 - ICS
  • 1 - PLDI
  • Technical Journals
  • IEEE - TC
  • IEEE - Micro
  • IEEE - Computer
  • IEEE - TPDS
  • Supercomputing Journal

8
Some of the seminar guests
  • Krste Asanovic (MIT)
  • Venkata Krishnan (Compaq-DEC)
  • Trevor Mudge (U. Michigan)
  • Jim E. Smith (U. Wisconsin)
  • Luiz A. Barroso (WRL)
  • Josh Fisher (HP Labs)
  • Michel Dubois (USC)
  • Ronny Ronen (Intel, Haifa)
  • Josep Torrellas (UIUC)
  • Per Stenstrom (U. Gothenburg)
  • Wen-mei Hwu (UIUC)
  • Jim Dehnert (Transmeta)
  • Fred Pollack (Intel)
  • Sanjay Patel (UIUC)
  • Daniel Tabak (George Mason U.)
  • Walid Najjar (Riverside)
  • Paolo Faraboschi (HP Labs)
  • Eduardo Sánchez (EPFL)
  • Guri Sohi (U. Wisconsin)
  • Miron Livny (U. Wisconsin)
  • Tomas Sterling (NASA JPL)
  • Maurice V. Wilkes (AT&T Labs)
  • Theo Ungerer (Karlsruhe)
  • Mario Nemirovsky (Xstreamlogic)
  • Gordon Bell (Microsoft)
  • Timothy Pinkston (U.S.C.)
  • Roberto Moreno (ULPGC)
  • Kazuki Joe (Nara Women U.)
  • Alex Veidenbaum (Irvine)
  • G.R. Gao (U. Delaware)
  • Ricardo Baeza (U.de Chile,Santiago)
  • Gabby M. Silberman (CAS-IBM)
  • Sally A. McKee (U. Utah)
  • Evelyn Duesterwald (HP-Labs)
  • Yale Patt (Austin)
  • Burton Smith (Tera)
  • Doug Carmean (Intel, Oregon)

9
Industrial Relationships
  • Compaq
  • Sabbaticals
  • Roger Espasa (VSSAD)
  • Toni Juan (VSSAD)
  • Marta Jimenez (VSSAD)
  • Interns
  • Jesus Corbal (VSSAD)
  • Alex Ramirez (WRL)
  • Partnerships
  • BSSAD
  • HP
  • Sabbaticals
  • Josep Llosa (Cambridge)
  • Interns
  • Daniel Ortega
  • Javier Zalamea
  • Partnerships
  • Software Prefetching
  • Two-Level Register File
  • IBM
  • Interns
  • Xavi Serrano (CAS)
  • Daniel Jimenez (CAS)
  • Partnerships
  • Supercomputing (CIRI)
  • Low Power
  • Databases
  • Binary Translation
  • Intel
  • Interns
  • Adrian Cristal (Haifa)
  • Alex Ramirez (MRL)
  • Pedro Marcuello (MRL)
  • Partnerships
  • Semantic Gap
  • Smart Registers
  • Memory Architecture for Multithreaded Processors
  • Speculative Vector Processors

10
Superscalar Processors
  • Register File
  • Cache Memory
  • Branch Prediction
  • Data Value Prediction
  • Fetch Mechanisms
  • Data reuse
  • Clustered Microarchitectures
  • Power-Aware Architectures

11
Register File
  • Virtual-Physical Registers (HiPC97, HPCA-98,
    MICRO-99)
  • Register File cache (ISCA-00)

12
Virtual-Physical Registers
  • Motivation
  • Conventional renaming scheme
  • Virtual-Physical Registers

[Figure: Icache, Decode/Rename and Commit pipeline stages, contrasting how long a physical register stays allocated ("register used" vs. "register unused") under conventional renaming and under virtual-physical registers]
13
Performance and Number of Registers
SpecFp95
SpecInt95
14
Register Requirements
15
Register File Cache
  • Organization (latency sketch below)
  • Bank 1 (Register File)
  • All registers (128)
  • 2-cycle latency
  • Bank 2 (Reg. File Cache)
  • Register subset (16)
  • 1-cycle latency
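
A rough sketch of the access policy this organization implies (an illustration under assumptions, not the ISCA-00 mechanism): a per-register bit tracks whether a value currently has a copy in the RFC, and a read pays 1 or 2 cycles accordingly. The tracking structure is invented for the example.

#include <stdbool.h>

#define NUM_REGS 128   /* Bank 1: full register file          */
#define RFC_SIZE 16    /* Bank 2: register file cache subset  */

static bool in_rfc[NUM_REGS];  /* does the register have an RFC copy? */

/* Read latency in cycles for one register operand. */
int register_read_latency(unsigned reg)
{
    if (in_rfc[reg])
        return 1;   /* Bank 2 (RFC): 16 registers, 1-cycle latency */
    return 2;       /* Bank 1: all 128 registers, 2-cycle latency  */
}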

16
Speed for Different RFC Architectures
SpecInt95
17
Compiler Directed Renaming Motivation
  • Binding Prefetch is very costly in terms of
    logical registers
  • Advancing one load implies using a logical
    register
  • ... limited logical registers but unlimited (?!)
    physical registers
  • Non-binding prefetch needs another instruction to
    finally load the data from L1
  • Binding one piece of data in a line implies that
    the other pieces are not bound
  • Data is very near but not in register file
  • Compiler knows what pieces of data of the line
    are going to be needed
  • How can the compiler tell the hardware all this?

18
Compiler Directed Renaming
19
Results
Speed-up (%)
ICS-2001
20
Cache Memory
  • Locality sensitive cache memories
  • Hardware managed (ICS-95)
  • Software managed (PACT-97)
  • Pseudo-random cache memories
  • Evaluation (ICS-97)
  • Implementation issues (MICRO-97, IEEE-TC 99)
  • Cache Design and Technology Interaction
  • Difference-bit cache (ISCA-96)
  • Data caches for superscalar processors (ICS-97)
  • Reducing TLB power requirements (ISLPED-97)
  • Software Data Prefetching (ICS-96)
  • Locality Analysis (CPC-00, ISPASS-00, Europar-00)

21
Locality sensitive cache memories
  • Multi-module cache
  • Selective cache
  • Hardware / Software management

ICS 95, PACT 97, ICS 99
[Figure: each memory request is steered by its predicted locality type to a Spatial Cache or a Temporal Cache; hits (hitS/hitT) return data to the CPU, misses send a request to the L2 cache, only-temporal data is kept out of the spatial cache, and data may bypass both]
22
Pseudo-random cache memories
  • Conflict misses are dominant in many applications
  • Pseudo-random placement (index sketch below)
  • Bitwise XOR
  • Polynomial mapping
  • Critical-path impact
  • Line prediction based on address prediction

ICS 97, MICRO 97, IEEE TC 99
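
A minimal sketch of the bitwise-XOR placement idea, assuming an illustrative cache of 256 sets with 32-byte lines (the parameters are not from the slides):

#include <stdint.h>

#define LINE_BITS 5   /* 32-byte lines */
#define SET_BITS  8   /* 256 sets      */
#define SET_MASK  ((1u << SET_BITS) - 1u)

/* Conventional modulo placement: low-order line-address bits. */
unsigned set_index_modulo(uint32_t addr)
{
    return (addr >> LINE_BITS) & SET_MASK;
}

/* XOR placement: fold higher-order bits onto the index so strided
 * address sequences spread over the sets instead of colliding. */
unsigned set_index_xor(uint32_t addr)
{
    uint32_t line = addr >> LINE_BITS;
    return (line ^ (line >> SET_BITS)) & SET_MASK;
}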
23
Branch Prediction
  • Dynamic History-Length Fitting (ISCA-98)
  • Early Branch Resolution (MICRO-98)
  • Through Value Prediction (PACT-99)
  • Compiler Support (CAECW-00, PACT-00, Europar-01)

24
Best history length?
  • Strong dependence on history length
  • go: from 0.19 to 0.27
  • li: from 0.04 to 0.13
  • Different behaviour
  • go: best with 3 history bits
  • li: best with 10 history bits
  • The best length for one is bad for the other

25
Early Branch Resolution
  • Branch window fed with Branch Flow from main
    window
  • Data Inputs to Branch Flow are predicted
  • KEY: predict K iterations ahead!
  • Special register renaming for Branch Flow
  • Branch flow is executed on shared functional
    units
  • Result of branch is fed back to fetch engine as a
    prediction

[Figure: the anticipated branch outcome is fed back to the fetch engine, computed on value-predicted inputs]
26
Branch Prediction Through Value Prediction
ld r1, 8(r0)
bne r1,target
  • Approach (sketched below)
  • When a branch is fetched
  • The inputs are predicted
  • The outcome is computed with predicted inputs
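
A sketch of the idea for the ld/bne pair above: when the branch is fetched, its input register is value-predicted and the outcome is computed from the prediction instead of waiting for the load. predict_value() is an assumed interface standing in for any value predictor.

#include <stdbool.h>
#include <stdint.h>

extern int64_t predict_value(unsigned reg);  /* hypothetical predictor */

/* bne r1, target  ->  predicted taken if predicted r1 != 0 */
bool predict_bne_outcome(unsigned r1)
{
    int64_t v = predict_value(r1);  /* speculative input value      */
    return v != 0;                  /* condition on predicted value */
}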

27
Performance
  • 11% speedup for an 8KB predictor

28
The agbias Predictor (II)
  • Update only the selected predictor
  • Update the BHR only for not strongly biased
    branches
[Figure: two agree+profile predictor components, one for strongly biased and one for not strongly biased branches, sharing the BHR; a static selector based on the profiled branch bias chooses between them]
Europar-01
29
The agbias Predictor
Europar-01
30
Data Speculation
  • Value Prediction (ICS-98)
  • Memory Dependence Prediction (ICS-97, PACT-98)
  • Cost-effective Implementations (PACT-98, PACT-99)
  • Value Prediction for Speculative Multithreaded
    (MICRO-99)

31
Data Value Speculation
  • Data value speculation
  • Address Prediction and data prefetching (APDP)
  • Addresses more predictable than values
  • Baseline processor with total disambiguation
  • 32-entry inst. window, 8KB Dcache
  • T.Dis: total disambiguation
  • P.Dis: partial disambiguation

32
Fetch Mechanisms
  • Commercial Applications (ICPP-99)
  • Software Trace Cache (ICS-99)
  • Selective Trace Storage (HPCA-00)

33
The Fetch Engine
[Figure: fetch engine: the fetch address (from fetch or commit) probes the Branch Target Buffer, Multiple Branch Predictor and Return Address Stack; the Core Fetch Unit and Fill Unit produce the next fetch address]
34
Software Trace Cache
  • Optimize the code layout to improve fetch
  • Use profile information
  • Build traces at compile time (sketched below)
  • Map traces for optimum I-cache performance

[Figure: basic blocks A-F reordered so that the profiled hot path is laid out sequentially]
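
A hedged sketch of greedy trace construction from edge profiles, in the spirit of the technique above (the CFG representation is invented for the example): starting from a seed basic block, keep appending the hottest not-yet-placed successor.

#include <stdbool.h>
#include <stddef.h>

struct block {
    struct block *succ[2];   /* fall-through / taken successors */
    unsigned long count[2];  /* profiled edge execution counts  */
    bool placed;             /* already laid out in some trace  */
};

/* Fills trace[] starting at seed; returns the trace length. */
size_t build_trace(struct block *seed, struct block **trace, size_t max)
{
    size_t n = 0;
    struct block *b = seed;
    while (b != NULL && !b->placed && n < max) {
        b->placed = true;
        trace[n++] = b;
        /* follow the hotter outgoing edge */
        b = (b->count[1] > b->count[0]) ? b->succ[1] : b->succ[0];
    }
    return n;
}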
35
Selective Trace Storage
  • Compiler-built traces need not be stored in the
    trace cache
  • Filter traces in the fill unit
  • Store only traces containing taken branches

[Figure: the blue (redundant) trace is present in both the I-cache and the trace cache through the Fill Unit; the red trace components must be fetched in two cycles]
36
Fetch Performance
Large cost reductions: 1/2 to 1/4, or even 1/16, vs.
a non-optimized trace cache
37
Effect on Branch Prediction
  • Changing branch direction
  • Changes negative interference to positive
  • Many history values not used
  • Values with many taken branches
  • Simple predictors
  • Suffer heavy negative interference
  • Positive effect dominates
  • De-aliased predictors
  • Already remove negative interference
  • Only the negative effect remains

[Figure: code reordering changes branch directions between taken and not taken, turning negative table interference into positive interference and leaving some BHR values unused]
Ramirez, Larriba-Pey & Valero, "The Effect of Code
Reordering on Branch Prediction Accuracy," PACT 2000
38
IPC Results
  • Agree works better for baseline layout
  • Gshare works better for STC layout
  • But STC layout always works better than baseline
    layout

16KB instruction cache
64KB instruction cache
Navarro, Ramirez, Larriba-Pey & Valero, "On the
Performance of Fetch Engines Running DSS Workloads,"
EuroPar 2000
39
Data Reuse
  • Instruction-level reuse
  • Redundant computation buffer
  • Redundant stores
  • Trace-level reuse
  • Fuzzy Region Reuse

ICS 99, ICPP 99, HPCN 99
40
Redundant Computation Buffer
[Figure: redundant computation buffer indexed by PC, returning a value]
41
Speedup (200 KB)
[Chart: speedups between 1.00 and 1.25]
42
Clustered Microarchitectures
  • Instruction distribution to local resources
  • Objective: maximize instruction throughput
  • Approach
  • Minimize instruction latencies
  • Minimize inter-cluster communication
  • Hide memory latency
  • Maximize workload balance
  • Reduce control/data hazards
  • Main contributions
  • Several dynamic steering mechanisms
  • Value prediction scheme to reduce wire delays

PACT 99, HPCA 00, MICRO 00
43
Dynamic Partitioning
44
Reducing Wire Delays through Value Prediction
[Figure: a value predictor sends the predicted value to the consumer cluster before the producer computes it; the producer validates the prediction and re-sends the value only if mispredicted] (MICRO-00)
45
IPCR
46
Power-Aware Architectures
  • Low power out-of-order issue logic (WCED-00,
    ISCA-01)
  • Gating off wake up activity
  • Dynamic resizing of the instruction window
  • Subscalar microarchitecture (MICRO-33)
  • Very low power pipelines using significance
    compression
  • Fuzzy Region Reuse (ICS-01)
  • Skip large code blocks
  • Power saving and performance increase

47
Significance Compression
[Figure: conventional full-width values vs. the significance-compression approach, where bytes that are pure sign extension are dropped and tagged; the byte-count test is sketched below]
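
An illustration of the significance test such a scheme relies on (a sketch, not the MICRO-33 hardware): count how many low-order bytes of a 32-bit value are significant; the remaining bytes are pure sign extension and need not be carried.

#include <stdint.h>

/* Returns 1..4: the number of low-order bytes of v that must be
 * kept; all higher bytes equal the sign extension of those bytes. */
int significant_bytes(int32_t v)
{
    int n = 4;
    while (n > 1) {
        int bits = 8 * (n - 1);
        int32_t low  = v & (int32_t)((1u << bits) - 1u);
        int32_t sign = (int32_t)1 << (bits - 1);
        int32_t ext  = (low ^ sign) - sign;  /* sign-extend low bytes */
        if (ext != v)
            break;   /* byte n is significant */
        n--;
    }
    return n;
}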
48
Power Savings
  • Summary power savings

49
Low Power Pipelines
  • Byte-serial implementation

[Figure: byte-serial pipeline datapath: PC add, I-cache tag compare, byte-sliced I-cache (byte 0, byte 1, bytes 2/3), register file, ALU, D-cache tag compare and data cache, and writeback, with sign-extension logic between byte slices]
50
Low Power Pipelines
  • Byte-wide pipeline performance
[Chart: values 79, 23.6, 6 and 2.5]
51
Synthesis vs. Storage
  • Extremely different nature of media data types
  • New paradigms of computation
  • Maximum performance achieved with a trade-off
    between
  • Synthesis (computation)
  • Storage (memorization)
  • Data error tolerance
  • Present in multimedia applications
  • Not found in other integer or scientific
    applications

ICS-01
52
Fuzzy Instruction Reuse
  • Perform tolerant instruction/region reuse
  • Skip large code blocks
  • Power savings
  • Performance increase
  • Embedded processors for
  • Image compression / decompression
  • 3D processing

[Figure: fuzzy reuse: an input D2 close to a previously seen D2' reuses the stored result f(D2') as f(D2)]
ICS-01
53
Fuzzy Synthesis
  • Approximate results using linear functions
  • Use previous instances as inputs
  • Small quality degradations

OpenGL and Direct3D APIs, DCT, texture bilinear
filtering
[Figure: fuzzy synthesis: f(D2) is approximated from previous instances f(D0), f(D1)]
ICS-01
54
Long Experience on Vector Processors
  • Selected Papers
  • 2 - ISCA
  • 2 - MICRO
  • 3 - HPCA
  • 2 - PACT
  • 6 - ICS
  • 1 - SPAA
  • 1 - IEEE-TC
  • 1 - Micro Journal
  • 1 - Supercomputing Journal
  • ...
  • Topics
  • Efficient Access to Vectors
  • Advanced Vector Architectures
  • In memory computation
  • Stores renaming

55
Efficient Access to Vectors
  • Out-of-order access to vectors
  • Single-processor (PPL-92, ISCA-92, IEEE-TC
    95...)
  • Conflict-free access to power-of-two strides
    (ICS-92)
  • Out-of-order access in vector multiprocessors
  • Conflict-free access (PPL-94, CONPAR-94,
    ICS-94,ISCA-95)
  • Efficient access (IPPS-95,HiPC-95,ICS-96)
  • Command vector memory access (PACT-98)

56
Command Memory System
Command <@, Length, Stride, size>. Break commands
into bursts at the section controller (see the sketch
below).
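
A minimal sketch of how a section controller might expand such a command into bursts; the burst length and the struct layout are assumptions for the example, not the PACT-98 design.

#include <stdint.h>
#include <stdio.h>

#define BURST_ELEMS 8   /* elements per burst (assumed) */

struct command {
    uint64_t base;     /* @: starting address           */
    unsigned length;   /* number of vector elements     */
    unsigned stride;   /* stride between elements       */
    unsigned size;     /* element size in bytes         */
};

void issue_command(const struct command *c)
{
    for (unsigned done = 0; done < c->length; done += BURST_ELEMS) {
        unsigned n = c->length - done;
        if (n > BURST_ELEMS)
            n = BURST_ELEMS;
        uint64_t addr = c->base + (uint64_t)done * c->stride * c->size;
        printf("burst: addr=%#llx elems=%u\n",
               (unsigned long long)addr, n);  /* one burst request */
    }
}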
57
Command Memory System
  • Between 15% and 50% improvement (compared to a
    basic SDRAM system)
  • Same performance as ultra-fast SRAM with 2-4
    times fewer banks and commodity parts
  • 15-60 times cheaper than conventional vector
    memory systems

58
Advanced Vector Architectures
  • Vector Code Characterization
  • Decoupling Vector Architectures (HPCA-96,
    J.Superc.-99)
  • Out-of-order Vector Architectures (MICRO-96)
  • Multithreaded Vector Architectures (HPCA-97)
  • Simultaneous Multithreaded Vector Architectures
    (HICS-97, IEEE-MICRO J.-97)
  • Vector register-file organization (PACT-97)

59
Why Vectors for Multimedia?
  • SIMD architectures (longer vectors) deliver
    scalable performance without increasing complexity
  • Alleviate pressure on the fetch/decode unit
  • No need for larger window/issue queue sizes
  • Simpler register files (size and ports)
  • Simple cache ports delivering high bandwidth
  • No need for recompilation (strong point vs. MMX)
  • Inherently low power

60
Multimedia Vector Processors
  • Short Registers plus Vector Cache (ICS-99,
    SPAA-01)
  • MOM Architecture (SC-99, MICRO-99)
  • SMT-MOM for MPEG-4 (HPCA-01)

61
Vector PC Architecture
[Figure: Vector PC architecture: fetch and decode feeding INT, FP and load/store units, an I-cache, a data cache and a vector cache, with a RAMBUS controller driving four DRDRAM channels at 3.2 GB/s]
62
Vector + MMX = Matrix ISA
for (i = 1 to N)
  for (j = 1 to 4)
    for (k = 1 to 4)
      A[i][j][k] = b ⊕ C[i][j][k]
MICRO-32, Haifa
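
A scalar C rendering of the loop nest above, assuming the element-wise operation ⊕ is a saturating 16-bit add (the slide leaves the operator generic; sat_add16 and the array shapes are invented for the sketch). One MOM matrix instruction would cover a whole 4x4 inner block.

#include <stdint.h>

#define N 64   /* illustrative trip count */

static int16_t sat_add16(int16_t x, int16_t y)
{
    int32_t s = (int32_t)x + (int32_t)y;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

void matrix_op(int16_t A[N][4][4], int16_t b, const int16_t C[N][4][4])
{
    for (int i = 0; i < N; i++)          /* vector dimension       */
        for (int j = 0; j < 4; j++)      /* matrix rows            */
            for (int k = 0; k < 4; k++)  /* packed 16-bit elements */
                A[i][j][k] = sat_add16(b, C[i][j][k]);
}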
63
Matrix extensions for Multimedia
[Figure: MOM matrix extensions vs. MMX: an MMX register packs 16-bit elements A1-A4 into 64 bits, while a MOM register holds a matrix of such packed rows (elements A1-A16); operating against packed rows B1-B4 produces packed results C1-C16]
64
Matrix ISA Relative Performance
[Charts: inverse DCT transform, MPEG-2 motion estimation, and RGB-YCC color conversion]
65
Program Level Performance
66
The reduction problem
  • MMX-like ISAs have problems handling reductions

67
Multimedia Accumulators
  • 192-bit multimedia packed accumulators (MDMX,
    MIPS); an emulation sketch follows below
  • Advantages
  • doubles sub-word parallelism
  • high precision
  • Disadvantages
  • artificial recurrences
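
A software emulation sketch of a packed accumulator, with illustrative widths (64-bit lanes standing in for the 48-bit lanes of a 192-bit accumulator): products accumulate per lane at full precision, and the cross-lane reduction happens only once, after the loop.

#include <stdint.h>

#define LANES 4

struct packed_acc {
    int64_t lane[LANES];   /* wide accumulator lanes */
};

/* One accumulate "instruction": acc.lane[i] += a[i] * b[i] */
void mac_packed(struct packed_acc *acc,
                const int16_t a[LANES], const int16_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc->lane[i] += (int32_t)a[i] * b[i];
}

/* Final cross-lane reduction, outside the hot loop. */
int64_t reduce(const struct packed_acc *acc)
{
    int64_t sum = 0;
    for (int i = 0; i < LANES; i++)
        sum += acc->lane[i];
    return sum;
}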

68
Recurrences and Efficiency Degradation
69
Accumulators for MOM
Solves the recurrence problem. Powerful instructions:
matrix x vector or matrix SAD in a single instruction!
[Figure: a 192-bit accumulator with four 48-bit lanes; lane i holds the sum over j of a_ij x b_ij across the vector length Vl]
70
Benchmarks and Simulation Tools
  • Developed emulation libraries
  • MMX
  • 67 instructions emulated
  • 32 MMX registers (64 bits)
  • MDMX
  • 88 instructions emulated
  • 32 MDMX registers (64 bits), 4 MDMX accumulators
    (192 bits)
  • MOM
  • 121 instructions emulated
  • 16 MOM registers (16x64bits), 2 MOM accumulators
    (192 bits)
  • Added support to the JINKS simulator to
    correctly detect emulation function calls and
    translate to the emulated instruction (with the
    help of the ATOM tool)

71
Scalability
[Chart: relative speed-up of LTPPAR (from GSM encode) vs. number of physical SIMD registers; baseline is MMX with 36 registers]
72
Tolerance to Instruction Latencies
[Chart: execution cycles of LTPPAR (from GSM encode) vs. latency increase in cycles]
73
High-end media processors
  • Exploit both DLP and TLP
  • SMT vector processor
  • Matrix-oriented multimedia extensions (MOM)

74
Future Multimedia Workloads
MPEG7
MPEG4
75
SMT + SIMD ISAs
  • Natural way of exploiting the two main sources of
    media parallelism (TLP + DLP) for the next
    generation of media protocols (MPEG-4/MPEG-7)
  • The SMT paradigm helps vector execution
  • Minimizes the impact of Amdahl's law
  • Allows hiding vector execution under scalar
    execution
  • SIMD ISAs help the SMT paradigm
  • Alleviate fetch pressure
  • Allow a better pipeline dispatch balance
  • Increase latency tolerance (L1 vector bypass)

76
SMT µ-SIMD media processor
HPCA-01, Monterrey
[Figure: SMT µ-SIMD pipeline: 8 program counters and 8 rename tables (one per thread), shared instruction fetch and decode, integer, FP, SIMD and memory queues with their register files, instruction issue slots, a reorder buffer, L1 and L2 caches and RAMBUS memory]
77
Cache Hierarchy
78
SMT + SIMD ISAs bypassing L1
EIPC
79
Multithreaded Processors
  • Multithreaded Vector Processor (HPCA-97)
  • Simultaneous Multithreaded Vector Processor
    (IEEE-MICRO 97)
  • Speculative Multithreaded Scalar Processors
    (ICS-99)
  • Clustered Speculative Multithreaded Processors
    (MICRO-99)
  • Multithreaded-Decoupled Architectures (HPCA-99)
  • Distant Parallelism (ICS-99, PACT-99)

80
Speculative Multithreaded Processors
  • Multiple instruction windows
  • Non-contiguous
  • Interleaved
  • Inter-thread data speculation
  • Dependences
  • Values

[Figure: non-contiguous instruction windows W1, W2, W3 over the dynamic instruction stream]
81
Performance Potential
  • Speedup over single-thread execution

16 thread units
82
Speculative Multithreaded Processors
83
Clustered Speculative Multithreaded Processors
84
Multithreaded Decoupled Access/Execute Processors
  • Decoupling
  • Very effective at hiding memory latency
  • A natural approach for a distributed organization
  • Multithreading
  • Provides additional ILP

[Figure: several fetch, dispatch and rename front-ends feed instruction queues for the access processor (AP) and execute processor (EP); store address queues connect to the memory subsystem]
HPCA-99
85
Latency Tolerance
  • Decoupling hides most memory latency
  • IPC loss is much lower with decoupling
  • Multithreading hardly improves memory latency
    tolerance

86
Distant Parallelism
Program Structure
[Figure: parallelism levels in a program (basic block, trace, loop, procedure, distant), from those exploited by hardware up to those exploited by the compiler]
87
Example: Partial Parallelization
  • go
  • big loop with dependences
  • loop distribution
  • isolate recurrences
  • reduction of list operations

Coverage: 50%
88
Sequential vs. SMT
89
VLIW Architectures
  • High register requirements
  • Register-sensitive software pipelining
  • Register-constrained software pipelining
  • New VLIW organizations
  • Register file
  • Wide functional units
  • Clustering
90
VLIW Architectures
  • Register File Use and Organization (HPCA-95)
  • Software Pipelining (MICRO-96, MICRO-97, IEEE-TC,
    PACT-98, PLDI-00, ICPP-00, ISSS-00, MICRO-00)
  • Software Prefetching (MICRO-97)
  • Wide Architectures (ICS-97, ICS-98, MICRO-98)
  • Clustered Architectures (HPCA-99)
  • Two-Level Register File Organization (MICRO-32)

91
Static register requirements
92
Dynamic register requirements
93
Register-sensitive SP
  • Objectives
  • Throughput: optimal schedules
  • Minimum register requirements
  • Fast scheduling time
  • Proposed techniques
  • HRMS: Hypernode Reduction Modulo Scheduling
    (MICRO-95)
  • SMS: Swing Modulo Scheduling (PACT-96)

94
HRMS/SMS
  • Static priority function to pre-order nodes
  • HRMS: hypernode reduction
  • SMS: swing around the critical path
  • Simple scheduling: bidirectional greedy modulo
    scheduling, starting at II = MII (see the sketch
    below)
  • MICRO-95, PACT-96
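
For reference, a sketch of the standard lower bound modulo schedulers start from, MII = max(ResMII, RecMII); the inputs are reduced to illustrative aggregates rather than the full resource and dependence analysis.

static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* ResMII: uses of the busiest resource over its number of units.
 * RecMII: worst dependence cycle, latency over loop-carried
 * distance (assumes distance >= 1). */
unsigned min_initiation_interval(unsigned busiest_resource_uses,
                                 unsigned units_of_that_resource,
                                 unsigned worst_cycle_latency,
                                 unsigned worst_cycle_distance)
{
    unsigned res_mii = ceil_div(busiest_resource_uses,
                                units_of_that_resource);
    unsigned rec_mii = ceil_div(worst_cycle_latency,
                                worst_cycle_distance);
    return res_mii > rec_mii ? res_mii : rec_mii;
}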

95
HRMS and SMS register requirements
96
Software Prefetching
  • Cache sensitive modulo scheduling
  • Interaction between software prefetching and
    software pipelining
  • Binding vs. non-binding prefetch (contrasted in
    the sketch below)
  • Based on
  • Data locality analysis
  • Dependence graph

[Chart: speedups up to 3.6 on an 8-way issue machine]
MICRO 97
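
A plain-C contrast of the two prefetch flavors (illustrative only; __builtin_prefetch is the GCC/Clang non-binding prefetch intrinsic, and the prefetch distance of 16 is arbitrary):

/* Non-binding: a hint, no architectural register is held. */
double sum_nonbinding(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16]);  /* may point past the end:
                                            prefetches never fault */
        s += a[i];
    }
    return s;
}

/* Binding: the advanced load puts the value in a register (here a
 * local), which is exactly what costs a logical register per load. */
double sum_binding(const double *a, int n)
{
    double s = 0.0;
    double next = (n > 0) ? a[0] : 0.0;  /* advanced load */
    for (int i = 0; i < n; i++) {
        double cur = next;
        if (i + 1 < n)
            next = a[i + 1];             /* load for next iteration */
        s += cur;
    }
    return s;
}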
97
Register-constrained SP
  • Objectives
  • Schedule loops even if the register requirements
    exceed the available registers
  • Reduce performance degradation (lost throughput
    and extra memory traffic)
  • Reduce compilation time
  • Proposed techniques
  • Spilling vs. increasing the II (MICRO-96)
  • New heuristics to add spill code, combining
    spilling and increasing the II (PLDI-00)
  • Iterative Modulo Scheduling with Integrated
    Register Spilling (MIRS), submitted to ICS-01

98
Scheduling environment
99
Heuristics for spill code
  • Performance evaluation (P4M2L4, all the loops)

[Charts: relative performance and memory traffic for the CC (critical cycle), QF (quantity factor) and TF (traffic factor) heuristics, spilling by variable or by use] (PLDI-00)
100
MIRS modulo scheduling and spilling
[Flowchart: start with II = MII, HRMS priority and a budget of nodes; repeatedly select a node and try to schedule it, force-ejecting conflicting nodes and inserting spill code when needed; if the budget is exhausted, increase II and restart; exit when no nodes remain]
101
MIRS modulo scheduling and spilling
102
MIRS modulo scheduling and spilling
103
MIRS evaluation
All loops
104
MIRS evaluation
Speed-up
105
New register file organizations
  • Register file requirements
  • Large register files
  • High bandwidth register files
  • Technological problems
  • Area, power consumption and access time grow
    with
  • Number of access ports, number of registers,
    number of bits per register

106
Monolithic register file
[Charts: cycle time, area (in units of λ² x 10^6) and power (W)]
  • Rixner et al.'s model (HPCA-00)
  • 0.18 µm technology
  • VLIW configurations GPxMyREGz, where x=6, y=2, 3
    and z=16, 32, 64, 128

107
Influence of the register file size
  • Memory ports
  • Memory latency

[Charts: execution cycles, memory traffic and execution time]
108
New register file organizations
  • Objective
  • Investigate the performance/area/power trade-off
  • Scheduling heuristics
  • Proposed Organizations
  • Sacks (CONPAR-94)
  • Non-consistent register file (HPCA-95)
  • Hierarchical register file (MICRO-00)

109
Sacks-based register file
CONPAR-94
110
Sack-based RF performance
111
Non-consistent register file
HPCA-95
112
Non-consistent RF performance
113
Hierarchical register file
[Figure: hierarchical register file: a small first-level file R1 feeds the functional units, a larger second-level file R2 is reached with LoadR/StoreR instructions, and the L1 cache is reached with ordinary loads and stores]
  • MICRO-00
114
Hierarchical RF design issues
  • R1 size: 16 registers
  • Between R1 and R2: 4 load and 2 store ports
  • R2 size: 64 registers
  • Total storage capacity: 80 registers

115
Hierarchical RF compilation issues
  • Two-step register allocation
  • Step 1: add LoadR and StoreR instructions;
    allocate registers in R1; insert spill code
    between R1 and R2
  • Step 2: allocate registers in R2; insert spill
    code between R2 and the L1 cache

116
Hierarchical RF performance
GP6M2REGx vs GP6M2TWO16, ideal memory
[Charts: execution cycles, memory traffic and execution time]
117
Hierarchical RF performance
GP6M3REGx vs GP6M2TWO16, ideal memory
[Charts: execution cycles, memory traffic and execution time]
118
Hierarchical RF performance
  • 32 KB L1 cache with a 32-byte line size
  • Lockup-free, allowing up to 8 pending memory
    accesses
  • Load latency is 2 cycles; store latency is 1
  • Three miss-latency values
  • low (10 hs)
  • medium (20 hs)
  • high (40 hs)
  • Binding prefetching

119
Hierarchical RF performance
[Charts: execution cycles and execution time]
120
Wide architectures
  • Objectives
  • Exploit Data Parallelism in loops
  • Optimize Performance/Cost
  • Papers
  • ICS-97, MICRO-98, ICS-98, ICPP-99

121
Wide architectures
ICS-97, ICS-98
122
Wide architectures design issues
[Chart: area] (MICRO-98)
123
Wide architectures design issues
[Chart: relative cycle time]
124
Wide architectures performance
MICRO-98
125
Wide architectures performance
Register constraints
126
Clustered architectures
  • Each cluster has a local register file and a set
    of functional units
  • Clusters interconnected by bidirectional ring of
    queues
  • Variables that span across clusters are allocated
    to communication queues
  • Techniques developed to allocate variables to
    queue register files
  • New scheduling technique developed
  • Distributed Modulo Scheduling (HPCA-99)

127
Ring Clustered architectures
HPCA-99
[Figure: Add, Copy and Mul operations communicating through the ring queues]
128
Clustered architectures performance
129
Clustered VLIW Architecture
ICPP 00, ISSS 00, MICRO-33
130
Scheduling Algorithm
  • Order instructions
  • Priority to nodes in recurrences
  • Avoid placing predecessors and successors before
    a node
  • For each instruction
  • Try all possible clusters (resource constraints)
  • Choose the cluster with the best profit in
    out-edges (sketched below)
  • If more than one, minimize register requirements
  • If a new subgraph starts, use another cluster
  • Schedule the node in the chosen cluster
  • Selective unrolling
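
An illustrative sketch of the cluster-selection step (all interfaces are invented for the example; the real scheduler's profit and register-pressure metrics are richer):

#include <limits.h>
#include <stdbool.h>

#define NCLUSTERS 4

extern bool has_free_slot(int cluster);          /* resource check */
extern int  comm_penalty(int node, int cluster); /* cut out-edges  */
extern int  reg_pressure(int node, int cluster); /* tie-breaker    */

int choose_cluster(int node)
{
    int best = -1, best_comm = INT_MAX, best_regs = INT_MAX;
    for (int c = 0; c < NCLUSTERS; c++) {
        if (!has_free_slot(c))
            continue;                    /* resource constraints */
        int comm = comm_penalty(node, c);
        int regs = reg_pressure(node, c);
        if (comm < best_comm ||
            (comm == best_comm && regs < best_regs)) {
            best = c;
            best_comm = comm;
            best_regs = regs;
        }
    }
    return best;   /* -1: no cluster has free resources */
}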

131
Results
2-cluster
4-cluster
132
Cache Miss Equations
  • Techniques that exploit intrinsic properties of
    CME
  • Important speed-up
  • Between 30-40% for SPECfp95
  • Sampling significantly reduces the computational
    requirements
  • Computing cost per program
  • Usually less than a minute
  • Never more than 5 minutes
  • Accurate
  • Error less than 0.2% for more than 50% of the
    loops
  • Never higher than 1%

Interact 00, CPC 00, ISPASS 00, Europar 00
133
MultiVLIW Processors
MICRO 33
134
RMCA Modulo Scheduler (I)
135
RMCA Modulo Scheduler (II)
136
Performance Results
4-cluster
[Charts: results for NMB/LMB combinations 1/1, 1/4, 2/1 and 2/4]
137
Clustered architectures interconnect
[Figure: clusters with local register files R1 sharing the L1 cache]
138
Clustered architectures RF
[Figure: clusters with local R1 register files backed by a shared R2 and the L1 cache]
139
Current work
  • Integrated scheduling and register spilling for
    clustered architectures
  • Not only performance: area- and power-oriented
    proposals
  • Hierarchical register file organization for
    clustered/wide VLIW architectures
  • Not only numerical applications: multimedia
    applications
  • Widening + Clustering

140
Compilers
  • Instruction-level scheduling
  • Linear loop transformations
  • Automatic data distribution and exploitation of
    data locality
  • Exploitation of multilevel parallelism in OpenMP

141
Operating Systems
  • Analysis and visualization tools
  • OS support to parallel applications
  • Parallel I/O
  • Efficient execution of parallel applications on
    multiprogrammed environments

142
Algorithms
  • Multilevel Block Algorithms
  • Parallel Algorithms for Sparse Matrices
  • Systematic Mapping of Algorithms onto
    Multicomputers

143
Projects (1 of 2)
  • Management and Technology Transfer
  • PCI-PACOS, PCI-II, CEPBA-TTN
  • Research and Development
  • COMANDOS-II, Supernode-II, EDS, Genesis, SHIPS,
    IDENTIFY, DDT, PROMENVIR, PERMPAR, PARMAT, SLOEGAT

144
Projects (2 of 2)
  • Basic and Long-term Research
  • SEPIA, APPARC, NANOS, MHAOTEU
  • Mobility of Researchers
  • HCM, PECO, TMR
  • Training
  • COMETT, PARANDES, PARALIN

145
Large Software Development
  • DDT
  • Data distribution tool
  • Dimemas & Paraver
  • Parallel performance prediction & analysis
  • Paros
  • Parallel Operating Systems
  • Ictineo
  • Fortran compiler
  • Dixie
  • Binary translation & instrumentation

146
Large Software Development
  • NANOS
  • OpenMP environment (Compiler, Visualization)
  • Resource manager minimizing control switches and
    migrations in large multiprocessors
  • PERMAS
  • Automatic parallelization at run time of 1.5
    Mlines of legacy code
  • ST-ORM
  • Metacomputing environment for stochastic analysis
    and optimization

147
Interdisciplinary Topics
  • Big group
  • Expertise in many areas
  • Critical mass favours production
  • Resource sharing
  • More external visibility
  • Cross fertilization of topics
  • Compilers vs. Architecture
  • Numerical Applications vs. Memory Hierarchy
  • etc.

148
HPC group seminar
  • A forum for learning and discussing
  • Started in fall 1998
  • 56 talks (as of January 2000)
  • 23 by HPC group members
  • 33 by guests
  • See our web site
  • http://www.ac.upc.es/HPCseminar

149
Summary
  • Large group
  • Experience in many topics
  • Very good students
  • A proven track record with past projects

150
Thank you