Title: High Performance Computing Group
1. High Performance Computing Group
- Prof. Mateo Valero
- http://www.people.ac.upc.es/mateo
- Computer Architecture Department
- Universitat Politècnica de Catalunya
2. Universitat Politècnica de Catalunya
- Created in 1971
- 38 departments and 6 research institutes
- 9 schools and 6 technical colleges
- 2,240 professors, 30,443 students, and 1,221 administrative staff
- More than 250 research fields
3. Computer Architecture Department
- Created in 1978 by Prof. Tomas Lang
- 48 Tenured professors (13 on µarch. and instruction sched.)
- 21 Full-time assistants (13 on µarch. and instruction sched.)
- 13 Part-time assistants
- 48 PhD Fellowships (15 on µarch. and instruction sched.)
- 14 Staff Members
- 8 Administrative
- 6 System administration
4. Computer Architecture Department
- Main Research Groups
- VLSI Systems Design
- Broadband Integrated Communication Systems
- Distributed Systems Architecture
- High Performance Computing
5. High Performance Computing Group
- Superscalar processors
  - Register File
  - Cache Memory
  - Branch Prediction
  - Data Value Prediction
  - Fetch Mechanisms
  - Data reuse
  - Clustered Microarchitectures
  - Power-Aware Architectures
- Vector architectures
  - Efficient Access to Vectors
  - Advanced Vector Architectures
  - Vector Microprocessors
  - Multimedia Vector Processors
- VLIW processors
  - Register File Use and Organization
  - Software Pipelining
  - Wide Architectures
  - Clustered Architectures
- Compilers
- Operating Systems
- Algorithms / Applications
- Computer Architecture
- Multithreaded processors
  - Multithreaded Vector Processor
  - Simultaneous Multithreaded Vector Processor
  - Speculative Multithreaded Scalar Processors
  - Clustered Speculative Multithreaded Processors
  - Multithreaded-Decoupled Architectures
  - Distant Parallelism
6. R&D Projects on Parallel Software
[Timeline chart, 1992-2001. Areas: tools, parallelization, metacomputing, database, system. Projects: Supernode II, Dimemas, Paraver, Parmat, Sloegat, Identify, Promenvir, BMW, ST-ORM, Bonanova, DDT, Apparc]
7. Selected Publications (last 6 years)
- Conferences
- 6 - ISCA
- 14 - MICRO
- 13 - HPCA
- 12 - PACT
- 33 - ICS
- 1 - PLDI
- Technical Journals
- IEEE - TC
- IEEE - Micro
- IEEE - Computer
- IEEE - TPDS
- Supercomputing Journal
8. Some of the seminar guests
- Krste Asanovic (MIT)
- Venkata Krishnan (Compaq-DEC)
- Trevor Mudge (U. Michigan)
- Jim E. Smith (U. Wisconsin)
- Luiz A. Barroso (WRL)
- Josh Fisher (HP Labs)
- Michel Dubois (USC)
- Ronny Ronen (Intel, Haifa)
- Josep Torrellas (UIUC)
- Per Stenstrom (U. Gothenburg)
- Wen-mei Hwu (UIUC)
- Jim Dehnert (Transmeta)
- Fred Pollack (Intel)
- Sanjay Patel (UIUC)
- Daniel Tabak (George Mason U.)
- Walid Najjar (Riverside)
- Paolo Faraboschi (HP Labs)
- Eduardo Sánchez (EPFL)
- Guri Sohi (U. Wisconsin)
- Miron Livny (U. Wisconsin)
- Tomas Sterling (NASA JPL)
- Maurice V. Wilkes (AT&T Labs)
- Theo Ungerer (Karlsruhe)
- Mario Nemirovsky (Xstreamlogic)
- Gordon Bell (Microsoft)
- Timothy Pinkston (U.S.C.)
- Roberto Moreno (ULPGC)
- Kazuki Joe (Nara Women U.)
- Alex Veidenbaum (Irvine)
- G.R. Gao (U. Delaware)
- Ricardo Baeza (U. de Chile, Santiago)
- Gabby M. Silberman (CAS-IBM)
- Sally A. McKee (U. Utah)
- Evelyn Duesterwald (HP-Labs)
- Yale Patt (Austin)
- Burton Smith (Tera)
- Doug Carmean (Intel, Oregon)
9. Industrial Relationships
- Compaq
  - Sabbaticals
    - Roger Espasa (VSSAD)
    - Toni Juan (VSSAD)
    - Marta Jimenez (VSSAD)
  - Interns
    - Jesus Corbal (VSSAD)
    - Alex Ramirez (WRL)
  - Partnerships
    - BSSAD
- HP
  - Sabbaticals
    - Josep Llosa (Cambridge)
  - Interns
    - Daniel Ortega
    - Javier Zalamea
  - Partnerships
    - Software Prefetching
    - Two-Level Register File
- IBM
  - Interns
    - Xavi Serrano (CAS)
    - Daniel Jimenez (CAS)
  - Partnerships
    - Supercomputing (CIRI)
    - Low Power
    - Databases
    - Binary Translation
- Intel
  - Interns
    - Adrian Cristal (Haifa)
    - Alex Ramirez (MRL)
    - Pedro Marcuello (MRL)
  - Partnerships
    - Semantic Gap
    - Smart Registers
    - Memory Architecture for Multithreaded Processors
    - Speculative Vector Processors
10. Superscalar Processors
- Register File
- Cache Memory
- Branch Prediction
- Data Value Prediction
- Fetch Mechanisms
- Data reuse
- Clustered Microarchitectures
- Power-Aware Architectures
11. Register File
- Virtual-Physical Registers (HiPC-97, HPCA-98, MICRO-99)
- Register File Cache (ISCA-00)
12. Virtual-Physical Registers
- Motivation
- Conventional renaming scheme
- Virtual-Physical Registers
[Pipeline diagram: I-cache, Decode/Rename, Commit; shows when a physical register is used vs. unused under conventional renaming and under virtual-physical registers]
13. Performance and Number of Registers
[Charts: performance vs. number of registers for SPECfp95 and SPECint95]
14. Register Requirements
15. Register File Cache
- Organization
  - Bank 1 (Register File)
    - All registers (128)
    - 2-cycle latency
  - Bank 2 (Reg. File Cache)
    - Register subset (16)
    - 1-cycle latency
16. Speed for Different RFC Architectures
[Chart: SPECint95 results]
17. Compiler Directed Renaming: Motivation
- Binding prefetch is very costly in terms of logical registers
  - Advancing one load implies using a logical register
  - ... limited logical registers but unlimited (?!) physical registers
- Non-binding prefetch needs another instruction to finally load the data from L1
- Binding one piece of data in a line implies that the other pieces are not bound
  - Data is very near but not in the register file
- The compiler knows which pieces of data in the line are going to be needed
- How can the compiler tell the hardware all this? (A sketch contrasting binding and non-binding prefetch follows.)
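To make the binding vs. non-binding distinction concrete, here is a minimal C sketch; the loop, array and prefetch distance are illustrative assumptions, not code from the slides, and __builtin_prefetch is the GCC/Clang non-binding prefetch intrinsic.

    #include <stddef.h>

    /* Binding prefetch: the load is hoisted into a temporary, so the value is
     * bound to a (logical) register early and occupies it until it is used. */
    double sum_binding(const double *a, size_t n)
    {
        if (n == 0)
            return 0.0;
        double sum = 0.0;
        double next = a[0];                 /* value bound to a register early */
        for (size_t i = 0; i < n; i++) {
            double cur = next;
            if (i + 1 < n)
                next = a[i + 1];            /* hoisted load ties up a register */
            sum += cur;
        }
        return sum;
    }

    /* Non-binding prefetch: the line is brought into the cache, but no register
     * is consumed; a later ordinary load still fetches the value from L1. */
    double sum_nonbinding(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);  /* hint only: needs an extra instruction */
            sum += a[i];                        /* separate load actually reads the data */
        }
        return sum;
    }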
18. Compiler Directed Renaming
19. Results
[Chart: speedup (%)]
ICS-2001
20. Cache Memory
- Locality sensitive cache memories
  - Hardware managed (ICS-95)
  - Software managed (PACT-97)
- Pseudo-random cache memories
  - Evaluation (ICS-97)
  - Implementation issues (MICRO-97, IEEE-TC-99)
- Cache Design and Technology Interaction
  - Difference-bit cache (ISCA-96)
  - Data caches for superscalar processors (ICS-97)
  - Reducing TLB power requirements (ISLPED-97)
- Software Data Prefetching (ICS-96)
- Locality Analysis (CPC-00, ISPASS-00, Europar-00)
21. Locality sensitive cache memories
- Multi-module cache
- Selective cache
- Hardware / Software management
ICS 95, PACT 97, ICS 99
[Block diagram: memory requests from the CPU are steered by locality type (temporal, spatial, or bypass) to a Temporal Cache or a Spatial Cache; misses send a request to the L2 cache, and data returns from L2]
22. Pseudo-random cache memories
- Conflict misses are dominant in many applications
- Pseudo-random placement (a bitwise-XOR indexing sketch follows this slide)
  - Bitwise XOR
  - Polynomial mapping
- Critical-path impact
- Line prediction based on address prediction
ICS 97, MICRO 97, IEEE TC 99
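As a concrete illustration of bitwise-XOR placement, a minimal C sketch; the number of sets, line size and the particular tag bits folded in are illustrative, not the published design.

    #include <stdint.h>

    #define SET_BITS   8u                      /* 256 sets: illustrative size  */
    #define NUM_SETS   (1u << SET_BITS)
    #define BLOCK_BITS 5u                      /* 32-byte lines: illustrative  */

    /* Conventional modulo placement: low-order index bits only. */
    static inline uint32_t set_modulo(uint32_t addr)
    {
        return (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    }

    /* Pseudo-random placement: XOR the index bits with higher-order (tag) bits,
     * so addresses that collide under modulo placement spread over different sets. */
    static inline uint32_t set_xor(uint32_t addr)
    {
        uint32_t idx = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
        uint32_t tag = (addr >> (BLOCK_BITS + SET_BITS)) & (NUM_SETS - 1);
        return idx ^ tag;
    }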
23. Branch Prediction
- Dynamic History-Length Fitting (ISCA-98)
- Early Branch Resolution (MICRO-98)
- Through Value Prediction (PACT-99)
- Compiler Support (CAECW-00, PACT-00, Europar-01)
24. Best history length?
- Strong dependence on history length
  - go: from 0.19 to 0.27
  - li: from 0.04 to 0.13
- Different behaviour
  - go: best with 3 history bits
  - li: best with 10 history bits
- The best length for one is bad for the other (a minimal gshare-style sketch parameterized by history length follows)
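The history-length sensitivity above can be seen in a gshare-style predictor where the history length is a parameter. This is a generic sketch, not the DHLF mechanism of the ISCA-98 paper; the table size and update policy are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_BITS 12                        /* 4K-entry pattern history table */
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];             /* 2-bit saturating counters      */
    static uint32_t ghr;                       /* global history register        */

    /* Index = branch PC XORed with the last hist_len history bits (gshare).
     * A short history suits branches like those in go; a long one suits li. */
    static uint32_t pht_index(uint32_t pc, unsigned hist_len)
    {
        uint32_t hist = ghr & ((1u << hist_len) - 1);   /* hist_len <= PHT_BITS */
        return ((pc >> 2) ^ hist) & (PHT_SIZE - 1);
    }

    bool predict(uint32_t pc, unsigned hist_len)
    {
        return pht[pht_index(pc, hist_len)] >= 2;       /* taken if counter is 2 or 3 */
    }

    void update(uint32_t pc, unsigned hist_len, bool taken)
    {
        uint8_t *c = &pht[pht_index(pc, hist_len)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = (ghr << 1) | (taken ? 1u : 0u);           /* shift in the outcome */
    }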
25. Early Branch Resolution
- Branch window fed with the branch flow from the main window
- Data inputs to the branch flow are predicted
  - KEY: predict K iterations ahead!
- Special register renaming for the branch flow
- The branch flow is executed on shared functional units
- The result of the branch is fed back to the fetch engine as a prediction
[Diagram labels: Anticipated Branch Outcome, Value Prediction]
26. Branch Prediction Through Value Prediction
    ld  r1, 8(r0)
    bne r1, target
- Approach
  - When a branch is fetched
    - Its inputs are predicted
    - The outcome is computed with the predicted inputs (see the sketch below)
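A minimal C sketch of the approach, assuming a simple last-value predictor for the branch's input register; the ld/bne pair above is from the slide, while the table size and helper names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define LVP_SIZE 1024                      /* last-value predictor entries  */
    static int32_t last_value[LVP_SIZE];       /* indexed by the producing load PC */

    /* Predict the input of "bne r1, target" by predicting the value that
     * "ld r1, 8(r0)" will return, then evaluate the branch condition early. */
    bool predict_bne(uint32_t load_pc)
    {
        int32_t predicted_r1 = last_value[(load_pc >> 2) % LVP_SIZE];
        return predicted_r1 != 0;              /* bne: taken if the value is non-zero */
    }

    /* When the load actually completes, train the predictor with the real value. */
    void train(uint32_t load_pc, int32_t actual_r1)
    {
        last_value[(load_pc >> 2) % LVP_SIZE] = actual_r1;
    }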
27. Performance
- 11% speedup for an 8KB predictor
28. The agbias Predictor (II)
[Diagram: a static selector, based on the profiled branch bias, chooses between an agree predictor for strongly biased branches and an agree predictor for not strongly biased branches; both share the BHR; only the selected predictor is updated, and the BHR is updated only for not strongly biased branches]
Europar-01
29. The agbias Predictor
Europar-01
30. Data Speculation
- Value Prediction (ICS-98)
- Memory Dependence Prediction (ICS-97, PACT-98)
- Cost-effective Implementations (PACT-98, PACT-99)
- Value Prediction for Speculative Multithreaded Processors (MICRO-99)
31. Data Value Speculation
- Data value speculation
- Address prediction and data prefetching (APDP)
  - Addresses are more predictable than values
- Baseline processor with total disambiguation
  - 32-entry inst. window, 8KB D-cache
  - T.Dis: total disambiguation
  - P.Dis: partial disambiguation
32. Fetch Mechanisms
- Commercial Applications (ICPP-99)
- Software Trace Cache (ICS-99)
- Selective Trace Storage (HPCA-00)
33. The Fetch Engine
[Block diagram: the fetch address indexes the Branch Target Buffer, Multiple Branch Predictor, Return Address Stack and Core Fetch Unit; a Fill Unit, fed from fetch or commit, supplies traces; on a hit the next fetch address is produced]
34. Software Trace Cache
- Optimize the code layout for optimized fetch
  - Use profile information
  - Build traces at compile time (see the trace-building sketch after this slide)
  - Map traces for optimum I-cache performance
[Diagram: example basic-block layouts (blocks A-F) before and after trace-based reordering]
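A rough C sketch of profile-guided trace building; the data layout and the greedy "follow the most likely successor" heuristic are illustrative assumptions, not the exact algorithm of the ICS-99 paper.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_BB   1024
    #define MAX_SUCC 2

    struct bb {                                /* one profiled basic block       */
        int      succ[MAX_SUCC];               /* successor ids, -1 if absent    */
        unsigned succ_count[MAX_SUCC];         /* profile counts per successor   */
        bool     placed;                       /* already emitted into a trace   */
    };

    /* Greedily grow one trace: start at a hot unplaced seed block and keep
     * following the most frequently taken successor until a placed block. */
    size_t build_trace(struct bb *cfg, int seed, int *trace, size_t max_len)
    {
        size_t len = 0;
        int cur = seed;
        while (cur >= 0 && !cfg[cur].placed && len < max_len) {
            cfg[cur].placed = true;
            trace[len++] = cur;                /* lay the block out contiguously */
            int best = -1;
            unsigned best_cnt = 0;
            for (int s = 0; s < MAX_SUCC; s++) {
                int nxt = cfg[cur].succ[s];
                if (nxt >= 0 && cfg[cur].succ_count[s] > best_cnt) {
                    best_cnt = cfg[cur].succ_count[s];
                    best = nxt;
                }
            }
            cur = best;                        /* follow the likely path         */
        }
        return len;
    }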
35. Selective Trace Storage
- Compiler-built traces need not be stored in the trace cache
- Filter traces in the fill unit
- Store only traces containing taken branches
[Diagram: a redundant (blue) trace present in both caches, vs. red trace components that are fetched in two cycles]
36. Fetch Performance
- Large cost reductions: 1/2 to 1/4th, or even 1/16, vs. a non-optimized trace cache
37. Effect on Branch Prediction
- Changing branch direction
  - Changes negative interference to positive
- Many history values not used
  - Values with many taken branches
- Simple predictors
  - Suffer heavy negative interference
  - The positive effect dominates
- De-aliased predictors
  - Already remove negative interference
  - Only the negative effect remains
[Diagram: change of branch direction (taken / not taken), table interference (negative / positive), used BHR values (used / not used)]
Ramirez, Larriba-Pey and Valero. The Effect of Code Reordering on Branch Prediction Accuracy. PACT 2000.
38. IPC Results
- Agree works better for the baseline layout
- Gshare works better for the STC layout
- But the STC layout always works better than the baseline layout
[Charts: 16KB and 64KB instruction caches]
Navarro, Ramirez, Larriba-Pey and Valero. On the performance of fetch engines running DSS workloads. EuroPAR 2000.
39. Data Reuse
- Instruction-level reuse
- Redundant computation buffer
- Redundant stores
- Trace-level reuse
- Fuzzy Region Reuse
ICS99, ICPP 99, HPCN 99
40. Redundant Computation Buffer
[Diagram: buffer indexed by PC, storing values]
41. Speedup (200 KB)
[Chart: speedups between 1.00 and 1.25]
42. Clustered Microarchitectures
- Instruction distribution to local resources
- Objective: maximize instruction throughput
- Approach
  - Minimize instruction latencies
  - Minimize inter-cluster communication
  - Hide memory latency
  - Maximize workload balance
  - Reduce control/data hazards
- Main contributions
  - Several dynamic steering mechanisms (a minimal steering sketch follows this slide)
  - Value prediction scheme to reduce wire delays
PACT 99, HPCA 00, MICRO 00
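A minimal C sketch of one plausible dynamic steering heuristic: send an instruction to the cluster that holds its operands, and break ties toward the least-loaded cluster. The rule and data structures are illustrative, not the specific mechanisms evaluated in the papers.

    #define NUM_CLUSTERS 4
    #define NUM_REGS     128

    static int reg_cluster[NUM_REGS];          /* cluster holding each renamed reg */
    static int cluster_load[NUM_CLUSTERS];     /* pending instructions per cluster */

    /* Pick a cluster for an instruction with up to two sources (-1 = none):
     * prefer a cluster that already holds a source (avoids inter-cluster copies),
     * otherwise pick the least-loaded cluster (keeps the workload balanced). */
    int steer(int src1, int src2, int dest)
    {
        int target = -1;
        if (src1 >= 0) target = reg_cluster[src1];
        if (src2 >= 0 && (target < 0 ||
                          cluster_load[reg_cluster[src2]] < cluster_load[target]))
            target = reg_cluster[src2];
        if (target < 0) {                      /* no register dependences: balance */
            target = 0;
            for (int c = 1; c < NUM_CLUSTERS; c++)
                if (cluster_load[c] < cluster_load[target])
                    target = c;
        }
        cluster_load[target]++;
        if (dest >= 0) reg_cluster[dest] = target;
        return target;
    }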
43. Dynamic Partitioning
44. Reducing Wire Delays through Value Prediction
[Diagram: a value predictor sends the value to the consumer before the producer computes it; the producer validates the prediction and sends the value only if it was mispredicted]
MICRO-00
45. IPCR
46. Power-Aware Architectures
- Low power out-of-order issue logic (WCED-00, ISCA-01)
  - Gating off wake-up activity
  - Dynamic resizing of the instruction window
- Subscalar microarchitecture (MICRO-33)
  - Very low power pipelines using significance compression
- Fuzzy Region Reuse (ICS-01)
  - Skip large code blocks
  - Power saving and performance increase
47. Significance Compression
[Diagram: conventional approach vs. the significance compression approach, where sign-extension bytes are not stored or moved]
(A sketch of significance detection follows.)
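A minimal sketch of the idea behind significance compression: only the bytes that are not pure sign extension need to be stored or moved. The 32-bit word and byte granularity follow the slide's "sign extension byte" note; the helper itself is an illustrative assumption.

    #include <stdint.h>

    /* Return how many low-order bytes of a 32-bit value are significant, i.e. how
     * many must be kept so that sign-extending them reproduces the full value.
     * Examples: 0xFFFFFFF5 -> 1, 0x00001234 -> 2, 0x80000000 -> 4.
     * Relies on arithmetic right shift of signed values (true on common targets). */
    unsigned significant_bytes(int32_t v)
    {
        for (unsigned n = 1; n < 4; n++) {
            int32_t kept = (int32_t)((uint32_t)v << (32 - 8 * n)) >> (32 - 8 * n);
            if (kept == v)
                return n;                  /* upper bytes are just sign extension */
        }
        return 4;                          /* all four bytes are significant      */
    }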
48. Power Savings
49. Low Power Pipelines
- Byte-serial implementation
[Pipeline diagram: PC ADD, I-cache tags and tag compare, I-cache byte slices (0, 1, 2/3), Register File, ALU, D-cache tags and tag compare, Data Cache, Writeback, with sign-extension logic between byte slices]
50. Low Power Pipelines
- Byte-wide pipelines: performance
[Chart data: 79, 23.6, 6, 2.5]
51. Synthesis vs. Storage
- Extremely different nature of media data types
- New paradigms of computation
- Maximum performance achieved with a trade-off between
  - Synthesis (computation)
  - Storage (memorization)
- Data error tolerance
  - Present in multimedia applications
  - Not found in other integer or scientific applications
ICS-01
52. Fuzzy Instruction Reuse
- Perform tolerant instruction/region reuse
  - Skip large code blocks
  - Power savings
  - Performance increase
- Embedded processors for
  - Image compression / decompression
  - 3D processing
[Diagram: an input D2' close to a previously seen D2 reuses the stored result f(D2) instead of recomputing]
ICS-01
53. Fuzzy Synthesis
- Approximate results using linear functions
  - Use previous instances as inputs
- Small quality degradations
- OpenGL, Direct3D APIs, DCT, texture bilinear filtering
[Diagram: for a new input D2, the result f(D2) is approximated from previous instances f(D0), f(D1)]
ICS-01
(A sketch of tolerant, fuzzy reuse follows.)
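A minimal C sketch of tolerant (fuzzy) reuse for an expensive function: if the new input is within a tolerance of a previously seen input, the stored result is reused instead of recomputing. The hash table, tolerance and the stand-in kernel are illustrative assumptions.

    #include <math.h>
    #include <stdbool.h>

    #define REUSE_ENTRIES 256
    #define TOLERANCE     0.01f                 /* acceptable input difference  */

    struct reuse_entry { float in, out; bool valid; };
    static struct reuse_entry reuse_table[REUSE_ENTRIES];

    /* Stand-in for a costly media kernel (e.g. a DCT or filtering step). */
    static float expensive_f(float x) { return sinf(x) * 0.5f + 0.5f; }

    /* Fuzzy reuse: accept a stored result whose input is "close enough", trading a
     * small, tolerable output error for skipping the whole computation. */
    float fuzzy_f(float x)
    {
        unsigned idx = (unsigned)(fabsf(x) * 31.0f) % REUSE_ENTRIES;  /* simple hash */
        struct reuse_entry *e = &reuse_table[idx];
        if (e->valid && fabsf(e->in - x) <= TOLERANCE)
            return e->out;                      /* reuse: computation skipped   */
        float y = expensive_f(x);               /* miss: compute and remember   */
        e->in = x; e->out = y; e->valid = true;
        return y;
    }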
54. Long Experience on Vector Processors
- Selected Papers
  - 2 - ISCA
  - 2 - MICRO
  - 3 - HPCA
  - 2 - PACT
  - 6 - ICS
  - 1 - SPAA
  - 1 - IEEE-TC
  - 1 - IEEE Micro Journal
  - 1 - Supercomputing Journal
  - ...
- Topics
  - Efficient Access to Vectors
  - Advanced Vector Architectures
  - In-memory computation
  - Store renaming
55. Efficient Access to Vectors
- Out-of-order access to vectors
  - Single-processor (PPL-92, ISCA-92, IEEE-TC 95, ...)
  - Conflict-free access to power-of-two strides (ICS-92)
- Out-of-order access in vector multiprocessors
  - Conflict-free access (PPL-94, CONPAR-94, ICS-94, ISCA-95)
  - Efficient access (IPPS-95, HiPC-95, ICS-96)
- Command vector memory access (PACT-98)
56. Command Memory System
- Command <@, Length, Stride, size>
- Break commands into bursts at the section controller
57. Command Memory System
- Between 15% and 50% improvement (compared to a basic SDRAM system)
- Same performance as ultra-fast SRAM with 2-4 times fewer banks and commodity parts
- 15-60 times cheaper than conventional vector memory systems
58. Advanced Vector Architectures
- Vector Code Characterization
- Decoupled Vector Architectures (HPCA-96, J. Supercomputing 99)
- Out-of-order Vector Architectures (MICRO-96)
- Multithreaded Vector Architectures (HPCA-97)
- Simultaneous Multithreaded Vector Architectures (HICS-97, IEEE Micro J. 97)
- Vector register-file organization (PACT-97)
59. Why Vectors for Multimedia?
- SIMD architectures (longer vectors) provide scalable performance without increasing complexity
- Alleviate pressure on the fetch/decode unit
  - No need for larger window/issue queue sizes
- Simpler register files (size and ports)
- Simple cache ports delivering high bandwidth
- No need for recompilation (strong point vs. MMX)
- Low power by nature
60. Multimedia Vector Processors
- Short Registers plus Vector Cache (ICS-99, SPAA-01)
- MOM Architecture (SC-99, MICRO-99)
- SMT-MOM for MPEG-4 (HPCA-01)
61. Vector PC Architecture
[Block diagram: fetch and decode, I-cache, FP / INT / load-store units, data cache and vector cache, RAMBUS controller to four DRDRAM channels at 3.2 GB/s]
62. Vector MMX Matrix ISA
    for (i = 1 to N)
      for (j = 1 to 4)
        for (k = 1 to 4)
          A[i][j][k] = b op C[i][j][k]
MICRO-32, Haifa
63. Matrix extensions for Multimedia
[Diagram: MOM vs. MMX register layouts; an MMX register packs four 16-bit subwords (bits 0-63, e.g. A1-A4, B1-B4, C1-C4), while a MOM matrix register holds a 4x4 block of 16-bit elements (A1-A16, C1-C16)]
64. Matrix ISA Relative Performance
[Charts: inverse DCT transform, MPEG-2 motion estimation, RGB-YCC color conversion]
65. Program-Level Performance
66. The reduction problem
- MMX-like ISAs have problems handling reductions
67. Multimedia Accumulators
- 192-bit multimedia packed accumulators (MDMX, MIPS)
- Advantages
  - Doubles sub-word parallelism
  - High precision
- Disadvantages
  - Artificial recurrences
68. Recurrences and Efficiency Degradation
69. Accumulators for MOM
- Solves the recurrence problem
- Powerful instructions: matrix x vector, matrix SAD in 1 instruction!
[Diagram: a 192-bit MOM accumulator holding four 48-bit partial sums, Σ_j a1j·b1j, Σ_j a2j·b2j, Σ_j a3j·b3j, Σ_j a4j·b4j, over a vector of length Vl]
(A scalar-equivalent sketch of the matrix SAD reduction follows.)
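For reference, the work that a single matrix-SAD style instruction would cover, written as plain C over a 4x4 block of 16-bit samples; the 4x4 shape follows the MOM layout, everything else is an illustrative scalar equivalent.

    #include <stdint.h>

    /* Sum of absolute differences over a 4x4 block of 16-bit samples: this whole
     * double loop is what one matrix-SAD instruction with a wide accumulator
     * performs, without the serializing add-to-one-register recurrence. */
    uint32_t sad_4x4(const uint16_t a[4][4], const uint16_t b[4][4])
    {
        uint32_t acc = 0;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                int32_t d = (int32_t)a[i][j] - (int32_t)b[i][j];
                acc += (uint32_t)(d < 0 ? -d : d);
            }
        return acc;
    }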
70. Benchmarks and Simulation Tools
- Developed emulation libraries
  - MMX
    - 67 instructions emulated
    - 32 MMX registers (64 bits)
  - MDMX
    - 88 instructions emulated
    - 32 MDMX registers (64 bits), 4 MDMX accumulators (192 bits)
  - MOM
    - 121 instructions emulated
    - 16 MOM registers (16x64 bits), 2 MOM accumulators (192 bits)
- Added support to the JINKS simulator to detect emulation function calls and translate them to the emulated instruction (with the help of the ATOM tool)
71. Scalability
[Chart: LTPPAR kernel (from GSM encode); relative speed-up vs. number of physical SIMD registers; baseline is MMX with 36 registers]
72. Tolerance to Instruction Latencies
[Chart: LTPPAR kernel (from GSM encode); execution cycles vs. latency increase in cycles]
73. High-end media processors
- Exploit both DLP and TLP
- SMT vector processor
- Matrix oriented multimedia extensions (MOM)
74. Future Multimedia Workloads
- MPEG-4
- MPEG-7
75. SMT + SIMD ISAs
- Natural way of exploiting the two main sources of media parallelism (TLP + DLP) for the next generation of media protocols (MPEG-4/MPEG-7)
- The SMT paradigm helps vector execution
  - Minimizes Amdahl's-law impact
  - Allows vector execution to be hidden under scalar execution
- SIMD ISAs help the SMT paradigm
  - Alleviate fetch pressure
  - Allow a better pipeline dispatch balance
  - Increase latency tolerance (L1 vector bypass)
76. SMT µ-SIMD media processor
HPCA-01, Monterrey
[Block diagram: instruction fetch and decode with 8 program counters (one per thread) and 8 rename tables (one per thread); integer, FP, SIMD and memory queues feeding instruction issue; INT, FP and SIMD register files; reorder buffer and execution pipelines; I-cache, L1 and L2 caches, and RAMBUS memory]
77. Cache Hierarchy
78. SMT + SIMD ISAs bypassing L1
[Chart: effective IPC (EIPC)]
79. Multithreaded Processors
- Multithreaded Vector Processor (HPCA-97)
- Simultaneous Multithreaded Vector Processor (IEEE Micro 97)
- Speculative Multithreaded Scalar Processors (ICS-99)
- Clustered Speculative Multithreaded Processors (MICRO-99)
- Multithreaded-Decoupled Architectures (HPCA-99)
- Distant Parallelism (ICS-99, PACT-99)
80. Speculative Multithreaded Processors
- Multiple instruction windows
  - Non-contiguous
  - Interleaved
- Inter-thread data speculation
  - Dependences
  - Values
[Diagram: windows W1-W3 over the dynamic instruction stream]
81. Performance Potential
- Speedup over single-thread execution
[Chart: 16 thread units]
82. Speculative Multithreaded Processors
83. Clustered Speculative Multithreaded Processors
84. Multithreaded Decoupled Access/Execute Processors
- Decoupling
  - Very effective at hiding memory latency
  - A natural approach for a distributed organization
- Multithreading
  - Provides additional ILP
[Block diagram: per-thread fetch, dispatch and rename feeding instruction queues for the access processor (AP) and execute processor (EP), with store address queues and the memory subsystem]
HPCA-99
85. Latency Tolerance
- Decoupling hides most memory latency
  - IPC loss is much lower with decoupling
- Multithreading hardly improves memory latency tolerance
86. Distant Parallelism
[Diagram: program structure at different granularities (basic block, trace, loop, procedure, distant), with compiler and hardware responsibilities]
87. Examples: Partial parallelisation
- Go
  - Big loop with dependencies
  - Loop distribution (a small example follows this slide)
    - Isolate recurrences
    - Reduction of list operations
- Coverage: 50%
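A small, generic C example of the loop-distribution idea: distribute a loop so that the recurrence is isolated from the parallel part. The arrays and the recurrence are illustrative, not code from go.

    #define N 1024

    /* Original loop: a serial recurrence on s[] is mixed with independent work
     * on a[], so the whole loop must run sequentially. */
    void fused(float *a, float *s, const float *b)
    {
        for (int i = 1; i < N; i++) {
            s[i] = s[i - 1] + b[i];   /* recurrence: forces serial execution */
            a[i] = b[i] * 2.0f;       /* independent work                    */
        }
    }

    /* After loop distribution: the recurrence is isolated in its own loop, and
     * the remaining loop is fully parallel (a candidate for another thread). */
    void distributed(float *a, float *s, const float *b)
    {
        for (int i = 1; i < N; i++)
            s[i] = s[i - 1] + b[i];   /* serial part, isolated               */

        for (int i = 1; i < N; i++)   /* parallel part                       */
            a[i] = b[i] * 2.0f;
    }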
88. Sequential vs. SMT
89. VLIW Architectures
[Overview diagram: high register requirements addressed by register-sensitive software pipelining, register-constrained software pipelining, and new VLIW organizations (register file, wide FUs, clustering)]
90. VLIW Architectures
- Register File Use and Organization (HPCA-95)
- Software Pipelining (MICRO-96, MICRO-97, IEEE-TC, PACT-98, PLDI-00, ICPP-00, ISSS-00, MICRO-00)
- Software Prefetching (MICRO-97)
- Wide Architectures (ICS-97, ICS-98, MICRO-98)
- Clustered Architectures (HPCA-99)
- Two-Level Register File Organization (MICRO-32)
91. Static register requirements
92. Dynamic register requirements
93. Register-sensitive SP
- Objectives
  - Throughput-optimal schedules
  - Minimum register requirements
  - Fast scheduling time
- Proposed techniques
  - HRMS: Hypernode Reduction Modulo Scheduling (MICRO-95)
  - SMS: Swing Modulo Scheduling (PACT-96)
94. HRMS/SMS
- Static priority function to pre-order nodes
  - Hypernode reduction
  - Swing reduction (critical path)
- Simple scheduling: bidirectional greedy modulo scheduling
95. HRMS and SMS register requirements
96. Software Prefetching
- Cache-sensitive modulo scheduling
- Interaction between software prefetching and software pipelining
- Binding vs. non-binding prefetch
- Based on
  - Data locality analysis
  - Dependence graph
[Chart: results for an 8-way issue machine]
MICRO 97
97. Register-constrained SP
- Objectives
  - Schedule loops even if register requirements exceed the available registers
  - Reduce performance degradation (throughput and memory traffic)
  - Reduce compilation time
- Proposed techniques
  - Spilling vs. increasing the II (MICRO-96)
  - New heuristics to add spill code, combining spilling with increasing the II (PLDI-00)
  - Iterative Modulo Scheduling with Integrated Register Spilling (MIRS), submitted to ICS-01
98. Scheduling environment
99. Heuristics for spill code
- Performance evaluation (P4M2L4, all the loops)
[Charts: relative performance and memory traffic for the Var and Use heuristics]
CC (critical cycle), QF (quantity factor) and TF (traffic factor); PLDI-00
100. MIRS: modulo scheduling and spilling
[Flowchart: start with II = MII and HRMS priority, budget = BR x nodes; repeatedly select a node, find a cycle, check and insert it, forcing ejection and inserting spill code when needed; restart with a larger II when required; exit when no nodes remain]
101. MIRS: modulo scheduling and spilling
102. MIRS: modulo scheduling and spilling
103. MIRS evaluation
[Chart: all loops]
104. MIRS evaluation
[Chart: speed-up]
105. New register file organizations
- Register file requirements
  - Large register files
  - High-bandwidth register files
- Technological problems
  - Area, power consumption and access time grow with the number of access ports, the number of registers, and the number of bits per register
106. Monolithic register file
[Charts: cycle time (hs), area (λ² x 10⁶), power (W)]
- Rixner's (et al.) model (HPCA-00)
- 0.18 µm technology
- VLIW configurations GPxMyREGz, where x = 6, y = 2, 3 and z = 16, 32, 64, 128
107. Influence of the register file size
- Memory ports
- Memory latency
[Charts: execution cycles, memory traffic, execution time]
108. New register file organizations
- Objective
  - Investigate the performance/area/power trade-off
  - Scheduling heuristics
- Proposed Organizations
  - Sacks (CONPAR-94)
  - Non-consistent register file (HPCA-95)
  - Hierarchical register file (MICRO-00)
109. Sacks-based register file
CONPAR-94
110. Sack-based RF performance
111. Non-consistent register file
HPCA-95
112. Non-consistent RF performance
113. Hierarchical register file
[Diagram: a conventional organization (R1 with Load/Store to the L1 cache) vs. the hierarchical organization (a small R1 backed by a larger R2, with LoadR/StoreR moving values between R1 and R2, and Load/Store between R1 and the L1 cache)]
114. Hierarchical RF design issues
- R1 size: 16 registers
- Between R1 and R2: 4 load and 2 store ports
- R2 size: 64 registers
- Total storage capacity: 80 registers
115. Hierarchical RF compilation issues
- Two-step register allocation
  - Step 1: add LoadR and StoreR instructions; register allocation in R1; insert spill code between R1 and R2
  - Step 2: register allocation in R2; insert spill code between R2 and the L1 cache
116. Hierarchical RF performance
- GP6M2REGx vs. GP6M2TWO16, ideal memory
[Charts: execution cycles, memory traffic, execution time]
117. Hierarchical RF performance
- GP6M3REGx vs. GP6M2TWO16, ideal memory
[Charts: execution cycles, memory traffic, execution time]
118. Hierarchical RF performance
- L1 cache of 32 KB with a 32-byte line size
- Lockup-free, allowing up to 8 pending memory accesses
- Load latency is 2 cycles, store is 1
- Three miss latency values
  - low (10 hs)
  - medium (20 hs)
  - high (40 hs)
- Binding prefetching
119. Hierarchical RF performance
[Charts: execution cycles, execution time]
120. Wide architectures
- Objectives
  - Exploit data parallelism in loops
  - Optimize performance/cost
- Papers
  - ICS-97
  - MICRO-98
  - ICS-98
  - ICPP-99
121. Wide architectures
ICS-97, ICS-98
122. Wide architectures: design issues
[Chart: area]
MICRO-98
123. Wide architectures: design issues
[Chart: relative cycle time]
124. Wide architectures: performance
MICRO-98
125. Wide architectures: performance
- Register constraints
126. Clustered architectures
- Each cluster has a local register file and a set of functional units
- Clusters are interconnected by a bidirectional ring of queues
- Variables that span clusters are allocated to communication queues
- Techniques developed to allocate variables to queue register files
- New scheduling technique developed
  - Distributed Modulo Scheduling (HPCA-99)
127. Ring Clustered architectures
HPCA-99
[Diagram: clusters on a ring executing Add, Copy and Mul operations]
128. Clustered architectures: performance
129. Clustered VLIW Architecture
ICPP-00, ISSS-00, MICRO-33
130. Scheduling Algorithm
- Order instructions
  - Priority to nodes in recurrences
  - Avoid scheduling a node's predecessors and successors before it
- For each instruction (see the sketch below)
  - Try all possible clusters (resource constraints)
  - Choose the cluster with the best profit in out-edges
    - If more than one, minimize register requirements
    - If a new subgraph, then another cluster
  - Schedule the node in the chosen cluster
- Selective unrolling
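A compact C sketch of the per-instruction cluster-selection step described above; the profit values, the toy scheduler state and the tie-breaking by register pressure are illustrative placeholders rather than the published algorithm.

    #include <limits.h>

    #define NUM_CLUSTERS 4
    #define MAX_NODES    256

    /* Toy scheduler state: free issue slots and register pressure per cluster,
     * and a per-node, per-cluster "profit in out-edges" matrix filled elsewhere. */
    static int free_slots[NUM_CLUSTERS];
    static int reg_pressure[NUM_CLUSTERS];
    static int profit[MAX_NODES][NUM_CLUSTERS];
    static int assigned_cluster[MAX_NODES];

    /* For one node: try every cluster with free resources, keep the one with the
     * best out-edge profit, and break ties by the lowest register pressure. */
    void schedule_node(int node)
    {
        int best = -1, best_profit = INT_MIN, best_regs = INT_MAX;
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (free_slots[c] == 0)             /* resource constraint           */
                continue;
            if (profit[node][c] > best_profit ||
                (profit[node][c] == best_profit && reg_pressure[c] < best_regs)) {
                best = c;
                best_profit = profit[node][c];
                best_regs = reg_pressure[c];
            }
        }
        if (best >= 0) {                        /* otherwise the caller bumps the II */
            assigned_cluster[node] = best;
            free_slots[best]--;
            reg_pressure[best]++;
        }
    }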
131. Results
[Charts: 2-cluster and 4-cluster configurations]
132. Cache Miss Equations
- Techniques that exploit intrinsic properties of CME
- Important speed-up
  - Between 30 and 40 for SPECfp95
- Sampling significantly reduces the computational requirements
- Computing cost per program
  - Usually less than a minute
  - Never more than 5 minutes
- Accurate
  - Error less than 0.2% for more than 50% of the loops
  - Never higher than 1%
Interact 00, CPC 00, ISPASS 00, Europar 00
133. MultiVLIW Processors
MICRO-33
134. RMCA Modulo Scheduler (I)
135. RMCA Modulo Scheduler (II)
136. Performance Results
[Chart: 4-cluster configurations with NMB = 1, 2 and LMB = 1, 4]
137. Clustered architectures: interconnect
[Diagram: per-cluster R1 register files connected to a shared L1 cache]
138. Clustered architectures: RF
[Diagram: per-cluster R1 register files backed by a shared R2 and the L1 cache]
139. Current work
- Integrated scheduling and register spilling for clustered architectures
- Not only performance: area- and power-consumption-oriented proposals
- Hierarchical register file organization for clustered/wide VLIW architectures
- Not only numerical applications: multimedia applications
- Widening + Clustering
140. Compilers
- Instruction-level scheduling
- Linear loop transformations
- Automatic data distribution and exploitation of data locality
- Exploitation of multilevel parallelism in OpenMP
141. Operating Systems
- Analysis and visualization tools
- OS support for parallel applications
- Parallel I/O
- Efficient execution of parallel applications in multiprogrammed environments
142. Algorithms
- Multilevel Block Algorithms
- Parallel Algorithms for Sparse Matrices
- Systematic Mapping of Algorithms onto Multicomputers
143. Projects (1 of 2)
- Management and Technology Transfer
  - PCI-PACOS, PCI-II, CEPBA-TTN
- Research and Development
  - COMANDOS-II, Supernode-II, EDS, Genesis, SHIPS, IDENTIFY, DDT, PROMENVIR, PERMPAR, PARMAT, SLOEGAT
144. Projects (2 of 2)
- Basic and Long-term Research
  - SEPIA, APPARC, NANOS, MHAOTEU
- Mobility of Researchers
  - HCM, PECO, TMR
- Training
  - COMETT, PARANDES, PARALIN
145. Large Software Development
- DDT
  - Data distribution tool
- Dimemas & Paraver
  - Parallel performance prediction & analysis
- Paros
  - Parallel operating systems
- Ictineo
  - Fortran compiler
- Dixie
  - Binary translation & instrumentation
146. Large Software Development
- NANOS
  - OpenMP environment (compiler, visualization)
  - Resource manager minimizing context switches and migrations in large multiprocessors
- PERMAS
  - Automatic parallelization at run time of 1.5 Mlines of legacy code
- ST-ORM
  - Metacomputing environment for stochastic analysis and optimization
147. Interdisciplinary Topics
- Big group
  - Expertise in many areas
  - Critical mass favours production
  - Resource sharing
  - More external visibility
- Cross-fertilization of topics
  - Compilers vs. Architecture
  - Numerical Applications vs. Memory Hierarchy
  - etc.
148. HPC group seminar
- A forum for learning and discussion
- Started in fall 1998
- 56 talks (as of January 2000)
  - 23 by HPC group members
  - 33 by guests
- See our web site
  - http://www.ac.upc.es/HPCseminar
149. Summary
- Large group
- Experience in many topics
- Very good students
- A proven track record with past projects
150. Thank you