Title: High Performance Computing Group
1. High Performance Computing Group
- Prof. Mateo Valero
- http://www.people.ac.upc.es/mateo
- Computer Architecture Department
- Universitat Politècnica de Catalunya
2. Universitat Politècnica de Catalunya
- Created in 1971
- 38 departments and 6 research institutes
- 9 schools and 6 technical colleges
- 2,240 professors, 30,443 students, and 1,221 administrative staff
- More than 250 research fields
3. Computer Architecture Department
- Created in 1978 by Prof. Tomas Lang
- 48 Tenured professors (13 on µarch. and instruction sched.)
- 21 Full-time assistants (13 on µarch. and instruction sched.)
- 13 Part-time assistants
- 48 PhD Fellowships (15 on µarch. and instruction sched.)
- 14 Staff Members
- 8 Administrative
- 6 System administration
4. Computer Architecture Department
- Main Research Groups
- VLSI Systems Design
- Broadband Integrated Communication Systems
- Distributed Systems Architecture
- High Performance Computing
5. High Performance Computing Group
- Superscalar processors
  - Register File
  - Cache Memory
  - Branch Prediction
  - Data Value Prediction
  - Fetch Mechanisms
  - Data reuse
  - Clustered Microarchitectures
  - Power-Aware Architectures
- Vector architectures
  - Efficient Access to Vectors
  - Advanced Vector Architectures
  - Vector Microprocessors
  - Multimedia Vector Processors
- VLIW processors
  - Register File Use and Organization
  - Software Pipelining
  - Wide Architectures
  - Clustered Architectures
- Compilers
- Operating Systems
- Algorithms / Applications
- Computer Architecture
- Multithreaded processors
  - Multithreaded Vector Processor
  - Simultaneous Multithreaded Vector Processor
  - Speculative Multithreaded Scalar Processors
  - Clustered Speculative Multithreaded Processors
  - Multithreaded-Decoupled Architectures
  - Distant Parallelism
6. R&D Projects on Parallel Software
[Timeline chart, 1992-2001. Areas: tools, parallelization, metacomputing, database, system. Projects: Supernode II, Dimemas, Paraver, Parmat, Sloegat, Identify, Promenvir, BMW, ST-ORM, Bonanova, DDT, Apparc]
7. Selected Publications (last 6 years)
- Conferences
- 6 - ISCA
- 14 - MICRO
- 13 - HPCA
- 12 - PACT
- 33 - ICS
- 1 - PLDI
- Technical Journals
- IEEE - TC
- IEEE - Micro
- IEEE - Computer
- IEEE - TPDS
- Supercomputing Journal
8. Some of the seminar guests
- Krste Asanovic (MIT)
- Venkata Krishnan (Compaq-DEC)
- Trevor Mudge (U. Michigan)
- Jim E. Smith (U. Wisconsin)
- Luiz A. Barroso (WRL)
- Josh Fisher (HP Labs)
- Michel Dubois (USC)
- Ronny Ronen (Intel, Haifa)
- Josep Torrellas (UIUC)
- Per Stenstrom (U. Gothenburg)
- Wen-mei Hwu (UIUC)
- Jim Dehnert (Transmeta)
- Fred Pollack (Intel)
- Sanjay Patel (UIUC)
- Daniel Tabak (George Mason U.)
- Walid Najjar (Riverside)
- Paolo Faraboschi (HP Labs)
- Eduardo Sánchez (EPFL)
- Guri Sohi (U. Wisconsin)
- Miron Livny (U. Wisconsin)
- Tomas Sterling (NASA JPL)
- Maurice V. Wilkes (AT&T Labs)
- Theo Ungerer (Karlsruhe)
- Mario Nemirovsky (Xstreamlogic)
- Gordon Bell (Microsoft)
- Timothy Pinkston (U.S.C.)
- Roberto Moreno (ULPGC)
- Kazuki Joe (Nara Women U.)
- Alex Veidenbaum (Irvine)
- G.R. Gao (U. Delaware)
- Ricardo Baeza (U. de Chile, Santiago)
- Gabby M. Silberman (CAS-IBM)
- Sally A. McKee (U. Utah)
- Evelyn Duesterwald (HP-Labs)
- Yale Patt (Austin)
- Burton Smith (Tera)
- Doug Carmean (Intel, Oregon)
9. Industrial Relationships
- Compaq
  - Sabbaticals
    - Roger Espasa (VSSAD)
    - Toni Juan (VSSAD)
    - Marta Jimenez (VSSAD)
  - Interns
    - Jesus Corbal (VSSAD)
    - Alex Ramirez (WRL)
  - Partnerships
    - BSSAD
- HP
  - Sabbaticals
    - Josep Llosa (Cambridge)
  - Interns
    - Daniel Ortega
    - Javier Zalamea
  - Partnerships
    - Software Prefetching
    - Two-Level Register File
- IBM
  - Interns
    - Xavi Serrano (CAS)
    - Daniel Jimenez (CAS)
  - Partnerships
    - Supercomputing (CIRI)
    - Low Power
    - Databases
    - Binary Translation
- Intel
  - Interns
    - Adrian Cristal (Haifa)
    - Alex Ramirez (MRL)
    - Pedro Marcuello (MRL)
  - Partnerships
    - Semantic Gap
    - Smart Registers
    - Memory Architecture for Multithreaded Processors
    - Speculative Vector Processors
10. Superscalar Processors
- Register File
- Cache Memory
- Branch Prediction
- Data Value Prediction
- Fetch Mechanisms
- Data reuse
- Clustered Microarchitectures
- Power-Aware Architectures
11. Register File
- Virtual-Physical Registers (HiPC-97, HPCA-98, MICRO-99)
- Register File Cache (ISCA-00)
12. Virtual-Physical Registers
- Motivation
- Conventional renaming scheme
- Virtual-Physical Registers
[Pipeline diagram: I-cache, Decode/Rename, Commit; shows when a physical register is used vs. unused under conventional renaming and under virtual-physical registers]
13. Performance and Number of Registers
[Charts: performance vs. number of registers for SPECfp95 and SPECint95]
14. Register Requirements
15. Register File Cache
- Organization
  - Bank 1 (Register File)
    - All registers (128)
    - 2-cycle latency
  - Bank 2 (Reg. File Cache)
    - Register subset (16)
    - 1-cycle latency
16. Speed for Different RFC Architectures
[Chart: SPECint95 results]
17. Compiler Directed Renaming: Motivation
- Binding prefetch is very costly in terms of logical registers
  - Advancing one load implies using a logical register
  - ... limited logical registers but unlimited (?!) physical registers
- Non-binding prefetch needs another instruction to finally load the data from L1
- Binding one piece of data in a line implies that the other pieces are not bound
  - Data is very near but not in the register file
- The compiler knows which pieces of data in the line are going to be needed
- How can the compiler tell the hardware all this? (A sketch contrasting binding and non-binding prefetch follows.)
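To make the binding vs. non-binding distinction concrete, here is a minimal C sketch; the loop, array and prefetch distance are illustrative assumptions, not code from the slides, and __builtin_prefetch is the GCC/Clang non-binding prefetch intrinsic.

    #include <stddef.h>

    /* Binding prefetch: the load is hoisted into a temporary, so the value is
     * bound to a (logical) register early and occupies it until it is used. */
    double sum_binding(const double *a, size_t n)
    {
        if (n == 0)
            return 0.0;
        double sum = 0.0;
        double next = a[0];                 /* value bound to a register early */
        for (size_t i = 0; i < n; i++) {
            double cur = next;
            if (i + 1 < n)
                next = a[i + 1];            /* hoisted load ties up a register */
            sum += cur;
        }
        return sum;
    }

    /* Non-binding prefetch: the line is brought into the cache, but no register
     * is consumed; a later ordinary load still fetches the value from L1. */
    double sum_nonbinding(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);  /* hint only: needs an extra instruction */
            sum += a[i];                        /* separate load actually reads the data */
        }
        return sum;
    }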
18. Compiler Directed Renaming
19. Results
[Chart: speedup (%)]
ICS-2001
20. Cache Memory
- Locality sensitive cache memories
  - Hardware managed (ICS-95)
  - Software managed (PACT-97)
- Pseudo-random cache memories
  - Evaluation (ICS-97)
  - Implementation issues (MICRO-97, IEEE-TC-99)
- Cache Design and Technology Interaction
  - Difference-bit cache (ISCA-96)
  - Data caches for superscalar processors (ICS-97)
  - Reducing TLB power requirements (ISLPED-97)
- Software Data Prefetching (ICS-96)
- Locality Analysis (CPC-00, ISPASS-00, Europar-00)
21. Locality sensitive cache memories
- Multi-module cache
- Selective cache
- Hardware / Software management
ICS 95, PACT 97, ICS 99
[Block diagram: memory requests from the CPU are steered by locality type (temporal, spatial, or bypass) to a Temporal Cache or a Spatial Cache; misses send a request to the L2 cache, and data returns from L2]
22. Pseudo-random cache memories
- Conflict misses are dominant in many applications
- Pseudo-random placement (a bitwise-XOR indexing sketch follows this slide)
  - Bitwise XOR
  - Polynomial mapping
- Critical-path impact
- Line prediction based on address prediction
ICS 97, MICRO 97, IEEE TC 99
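As a concrete illustration of bitwise-XOR placement, a minimal C sketch; the number of sets, line size and the particular tag bits folded in are illustrative, not the published design.

    #include <stdint.h>

    #define SET_BITS   8u                      /* 256 sets: illustrative size  */
    #define NUM_SETS   (1u << SET_BITS)
    #define BLOCK_BITS 5u                      /* 32-byte lines: illustrative  */

    /* Conventional modulo placement: low-order index bits only. */
    static inline uint32_t set_modulo(uint32_t addr)
    {
        return (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    }

    /* Pseudo-random placement: XOR the index bits with higher-order (tag) bits,
     * so addresses that collide under modulo placement spread over different sets. */
    static inline uint32_t set_xor(uint32_t addr)
    {
        uint32_t idx = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
        uint32_t tag = (addr >> (BLOCK_BITS + SET_BITS)) & (NUM_SETS - 1);
        return idx ^ tag;
    }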
23. Branch Prediction
- Dynamic History-Length Fitting (ISCA-98)
- Early Branch Resolution (MICRO-98)
- Through Value Prediction (PACT-99)
- Compiler Support (CAECW-00, PACT-00, Europar-01)
24. Best history length?
- Strong dependence on history length
  - go: from 0.19 to 0.27
  - li: from 0.04 to 0.13
- Different behaviour
  - go: best with 3 history bits
  - li: best with 10 history bits
- The best length for one is bad for the other (a minimal gshare-style sketch parameterized by history length follows)
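The history-length sensitivity above can be seen in a gshare-style predictor where the history length is a parameter. This is a generic sketch, not the DHLF mechanism of the ISCA-98 paper; the table size and update policy are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_BITS 12                        /* 4K-entry pattern history table */
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];             /* 2-bit saturating counters      */
    static uint32_t ghr;                       /* global history register        */

    /* Index = branch PC XORed with the last hist_len history bits (gshare).
     * A short history suits branches like those in go; a long one suits li. */
    static uint32_t pht_index(uint32_t pc, unsigned hist_len)
    {
        uint32_t hist = ghr & ((1u << hist_len) - 1);   /* hist_len <= PHT_BITS */
        return ((pc >> 2) ^ hist) & (PHT_SIZE - 1);
    }

    bool predict(uint32_t pc, unsigned hist_len)
    {
        return pht[pht_index(pc, hist_len)] >= 2;       /* taken if counter is 2 or 3 */
    }

    void update(uint32_t pc, unsigned hist_len, bool taken)
    {
        uint8_t *c = &pht[pht_index(pc, hist_len)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = (ghr << 1) | (taken ? 1u : 0u);           /* shift in the outcome */
    }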
25. Early Branch Resolution
- Branch window fed with the branch flow from the main window
- Data inputs to the branch flow are predicted
  - KEY: predict K iterations ahead!
- Special register renaming for the branch flow
- The branch flow is executed on shared functional units
- The result of the branch is fed back to the fetch engine as a prediction
[Diagram labels: Anticipated Branch Outcome, Value Prediction]
26. Branch Prediction Through Value Prediction
    ld  r1, 8(r0)
    bne r1, target
- Approach
  - When a branch is fetched
    - Its inputs are predicted
    - The outcome is computed with the predicted inputs (see the sketch below)
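A minimal C sketch of the approach, assuming a simple last-value predictor for the branch's input register; the ld/bne pair above is from the slide, while the table size and helper names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define LVP_SIZE 1024                      /* last-value predictor entries  */
    static int32_t last_value[LVP_SIZE];       /* indexed by the producing load PC */

    /* Predict the input of "bne r1, target" by predicting the value that
     * "ld r1, 8(r0)" will return, then evaluate the branch condition early. */
    bool predict_bne(uint32_t load_pc)
    {
        int32_t predicted_r1 = last_value[(load_pc >> 2) % LVP_SIZE];
        return predicted_r1 != 0;              /* bne: taken if the value is non-zero */
    }

    /* When the load actually completes, train the predictor with the real value. */
    void train(uint32_t load_pc, int32_t actual_r1)
    {
        last_value[(load_pc >> 2) % LVP_SIZE] = actual_r1;
    }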
27. Performance
- 11% speedup for an 8KB predictor
28. The agbias Predictor (II)
[Diagram: a static selector, based on the profiled branch bias, chooses between an agree predictor for strongly biased branches and an agree predictor for not strongly biased branches; both share the BHR; only the selected predictor is updated, and the BHR is updated only for not strongly biased branches]
Europar-01
29. The agbias Predictor
Europar-01
30. Data Speculation
- Value Prediction (ICS-98)
- Memory Dependence Prediction (ICS-97, PACT-98)
- Cost-effective Implementations (PACT-98, PACT-99)
- Value Prediction for Speculative Multithreaded Processors (MICRO-99)
31. Data Value Speculation
- Data value speculation
- Address prediction and data prefetching (APDP)
  - Addresses are more predictable than values
- Baseline processor with total disambiguation
  - 32-entry inst. window, 8KB D-cache
  - T.Dis: total disambiguation
  - P.Dis: partial disambiguation
32. Fetch Mechanisms
- Commercial Applications (ICPP-99)
- Software Trace Cache (ICS-99)
- Selective Trace Storage (HPCA-00)
33. The Fetch Engine
[Block diagram: the fetch address indexes the Branch Target Buffer, Multiple Branch Predictor, Return Address Stack and Core Fetch Unit; a Fill Unit, fed from fetch or commit, supplies traces; on a hit the next fetch address is produced]
34. Software Trace Cache
- Optimize the code layout for optimized fetch
  - Use profile information
  - Build traces at compile time (see the trace-building sketch after this slide)
  - Map traces for optimum I-cache performance
[Diagram: example basic-block layouts (blocks A-F) before and after trace-based reordering]
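A rough C sketch of profile-guided trace building; the data layout and the greedy "follow the most likely successor" heuristic are illustrative assumptions, not the exact algorithm of the ICS-99 paper.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_BB   1024
    #define MAX_SUCC 2

    struct bb {                                /* one profiled basic block       */
        int      succ[MAX_SUCC];               /* successor ids, -1 if absent    */
        unsigned succ_count[MAX_SUCC];         /* profile counts per successor   */
        bool     placed;                       /* already emitted into a trace   */
    };

    /* Greedily grow one trace: start at a hot unplaced seed block and keep
     * following the most frequently taken successor until a placed block. */
    size_t build_trace(struct bb *cfg, int seed, int *trace, size_t max_len)
    {
        size_t len = 0;
        int cur = seed;
        while (cur >= 0 && !cfg[cur].placed && len < max_len) {
            cfg[cur].placed = true;
            trace[len++] = cur;                /* lay the block out contiguously */
            int best = -1;
            unsigned best_cnt = 0;
            for (int s = 0; s < MAX_SUCC; s++) {
                int nxt = cfg[cur].succ[s];
                if (nxt >= 0 && cfg[cur].succ_count[s] > best_cnt) {
                    best_cnt = cfg[cur].succ_count[s];
                    best = nxt;
                }
            }
            cur = best;                        /* follow the likely path         */
        }
        return len;
    }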
35. Selective Trace Storage
- Compiler-built traces need not be stored in the trace cache
- Filter traces in the fill unit
- Store only traces containing taken branches
[Diagram: a redundant (blue) trace present in both caches, vs. red trace components that are fetched in two cycles]
36. Fetch Performance
- Large cost reductions: 1/2 to 1/4th, or even 1/16, vs. a non-optimized trace cache
37. Effect on Branch Prediction
- Changing branch direction
  - Changes negative interference to positive
- Many history values not used
  - Values with many taken branches
- Simple predictors
  - Suffer heavy negative interference
  - The positive effect dominates
- De-aliased predictors
  - Already remove negative interference
  - Only the negative effect remains
[Diagram: change of branch direction (taken / not taken), table interference (negative / positive), used BHR values (used / not used)]
Ramirez, Larriba-Pey and Valero. The Effect of Code Reordering on Branch Prediction Accuracy. PACT 2000.
38. IPC Results
- Agree works better for the baseline layout
- Gshare works better for the STC layout
- But the STC layout always works better than the baseline layout
[Charts: 16KB and 64KB instruction caches]
Navarro, Ramirez, Larriba-Pey and Valero. On the performance of fetch engines running DSS workloads. EuroPAR 2000.
39. Data Reuse
- Instruction-level reuse
- Redundant computation buffer
- Redundant stores
- Trace-level reuse
- Fuzzy Region Reuse
ICS99, ICPP 99, HPCN 99
40. Redundant Computation Buffer
[Diagram: buffer indexed by PC, storing values]
41. Speedup (200 KB)
[Chart: speedups between 1.00 and 1.25]
42. Clustered Microarchitectures
- Instruction distribution to local resources
- Objective: maximize instruction throughput
- Approach
  - Minimize instruction latencies
  - Minimize inter-cluster communication
  - Hide memory latency
  - Maximize workload balance
  - Reduce control/data hazards
- Main contributions
  - Several dynamic steering mechanisms (a minimal steering sketch follows this slide)
  - Value prediction scheme to reduce wire delays
PACT 99, HPCA 00, MICRO 00
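A minimal C sketch of one plausible dynamic steering heuristic: send an instruction to the cluster that holds its operands, and break ties toward the least-loaded cluster. The rule and data structures are illustrative, not the specific mechanisms evaluated in the papers.

    #define NUM_CLUSTERS 4
    #define NUM_REGS     128

    static int reg_cluster[NUM_REGS];          /* cluster holding each renamed reg */
    static int cluster_load[NUM_CLUSTERS];     /* pending instructions per cluster */

    /* Pick a cluster for an instruction with up to two sources (-1 = none):
     * prefer a cluster that already holds a source (avoids inter-cluster copies),
     * otherwise pick the least-loaded cluster (keeps the workload balanced). */
    int steer(int src1, int src2, int dest)
    {
        int target = -1;
        if (src1 >= 0) target = reg_cluster[src1];
        if (src2 >= 0 && (target < 0 ||
                          cluster_load[reg_cluster[src2]] < cluster_load[target]))
            target = reg_cluster[src2];
        if (target < 0) {                      /* no register dependences: balance */
            target = 0;
            for (int c = 1; c < NUM_CLUSTERS; c++)
                if (cluster_load[c] < cluster_load[target])
                    target = c;
        }
        cluster_load[target]++;
        if (dest >= 0) reg_cluster[dest] = target;
        return target;
    }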
43. Dynamic Partitioning
44. Reducing Wire Delays through Value Prediction
[Diagram: a value predictor sends the value to the consumer before the producer computes it; the producer validates the prediction and sends the value only if it was mispredicted]
MICRO-00
45. IPCR
46. Power-Aware Architectures
- Low power out-of-order issue logic (WCED-00, ISCA-01)
  - Gating off wake-up activity
  - Dynamic resizing of the instruction window
- Subscalar microarchitecture (MICRO-33)
  - Very low power pipelines using significance compression
- Fuzzy Region Reuse (ICS-01)
  - Skip large code blocks
  - Power saving and performance increase
47. Significance Compression
[Diagram: conventional approach vs. the significance compression approach, where sign-extension bytes are not stored or moved]
(A sketch of significance detection follows.)
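A minimal sketch of the idea behind significance compression: only the bytes that are not pure sign extension need to be stored or moved. The 32-bit word and byte granularity follow the slide's "sign extension byte" note; the helper itself is an illustrative assumption.

    #include <stdint.h>

    /* Return how many low-order bytes of a 32-bit value are significant, i.e. how
     * many must be kept so that sign-extending them reproduces the full value.
     * Examples: 0xFFFFFFF5 -> 1, 0x00001234 -> 2, 0x80000000 -> 4.
     * Relies on arithmetic right shift of signed values (true on common targets). */
    unsigned significant_bytes(int32_t v)
    {
        for (unsigned n = 1; n < 4; n++) {
            int32_t kept = (int32_t)((uint32_t)v << (32 - 8 * n)) >> (32 - 8 * n);
            if (kept == v)
                return n;                  /* upper bytes are just sign extension */
        }
        return 4;                          /* all four bytes are significant      */
    }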
48. Power Savings
49. Low Power Pipelines
- Byte-serial implementation
[Pipeline diagram: PC ADD, I-cache tags and tag compare, I-cache byte slices (0, 1, 2/3), Register File, ALU, D-cache tags and tag compare, Data Cache, Writeback, with sign-extension logic between byte slices]
50. Low Power Pipelines
- Byte-wide pipelines: performance
[Chart data: 79, 23.6, 6, 2.5]
51. Synthesis vs. Storage
- Extremely different nature of media data types
- New paradigms of computation
- Maximum performance achieved with a trade-off between
  - Synthesis (computation)
  - Storage (memorization)
- Data error tolerance
  - Present in multimedia applications
  - Not found in other integer or scientific applications
ICS-01
52. Fuzzy Instruction Reuse
- Perform tolerant instruction/region reuse
  - Skip large code blocks
  - Power savings
  - Performance increase
- Embedded processors for
  - Image compression / decompression
  - 3D processing
[Diagram: an input D2' close to a previously seen D2 reuses the stored result f(D2) instead of recomputing]
ICS-01
53. Fuzzy Synthesis
- Approximate results using linear functions
  - Use previous instances as inputs
- Small quality degradations
- OpenGL, Direct3D APIs, DCT, texture bilinear filtering
[Diagram: for a new input D2, the result f(D2) is approximated from previous instances f(D0), f(D1)]
ICS-01
(A sketch of tolerant, fuzzy reuse follows.)
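A minimal C sketch of tolerant (fuzzy) reuse for an expensive function: if the new input is within a tolerance of a previously seen input, the stored result is reused instead of recomputing. The hash table, tolerance and the stand-in kernel are illustrative assumptions.

    #include <math.h>
    #include <stdbool.h>

    #define REUSE_ENTRIES 256
    #define TOLERANCE     0.01f                 /* acceptable input difference  */

    struct reuse_entry { float in, out; bool valid; };
    static struct reuse_entry reuse_table[REUSE_ENTRIES];

    /* Stand-in for a costly media kernel (e.g. a DCT or filtering step). */
    static float expensive_f(float x) { return sinf(x) * 0.5f + 0.5f; }

    /* Fuzzy reuse: accept a stored result whose input is "close enough", trading a
     * small, tolerable output error for skipping the whole computation. */
    float fuzzy_f(float x)
    {
        unsigned idx = (unsigned)(fabsf(x) * 31.0f) % REUSE_ENTRIES;  /* simple hash */
        struct reuse_entry *e = &reuse_table[idx];
        if (e->valid && fabsf(e->in - x) <= TOLERANCE)
            return e->out;                      /* reuse: computation skipped   */
        float y = expensive_f(x);               /* miss: compute and remember   */
        e->in = x; e->out = y; e->valid = true;
        return y;
    }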
54. Long Experience on Vector Processors
- Selected Papers
  - 2 - ISCA
  - 2 - MICRO
  - 3 - HPCA
  - 2 - PACT
  - 6 - ICS
  - 1 - SPAA
  - 1 - IEEE-TC
  - 1 - IEEE Micro Journal
  - 1 - Supercomputing Journal
  - ...
- Topics
  - Efficient Access to Vectors
  - Advanced Vector Architectures
  - In-memory computation
  - Store renaming
55. Efficient Access to Vectors
- Out-of-order access to vectors
  - Single-processor (PPL-92, ISCA-92, IEEE-TC 95, ...)
  - Conflict-free access to power-of-two strides (ICS-92)
- Out-of-order access in vector multiprocessors
  - Conflict-free access (PPL-94, CONPAR-94, ICS-94, ISCA-95)
  - Efficient access (IPPS-95, HiPC-95, ICS-96)
- Command vector memory access (PACT-98)
56. Command Memory System
- Command <@, Length, Stride, size>
- Break commands into bursts at the section controller
57. Command Memory System
- Between 15% and 50% improvement (compared to a basic SDRAM system)
- Same performance as ultra-fast SRAM with 2-4 times fewer banks and commodity parts
- 15-60 times cheaper than conventional vector memory systems
58. Advanced Vector Architectures
- Vector Code Characterization
- Decoupled Vector Architectures (HPCA-96, J. Supercomputing 99)
- Out-of-order Vector Architectures (MICRO-96)
- Multithreaded Vector Architectures (HPCA-97)
- Simultaneous Multithreaded Vector Architectures (HICS-97, IEEE Micro J. 97)
- Vector register-file organization (PACT-97)
59. Why Vectors for Multimedia?
- SIMD architectures (longer vectors) provide scalable performance without increasing complexity
- Alleviate pressure on the fetch/decode unit
  - No need for larger window/issue queue sizes
- Simpler register files (size and ports)
- Simple cache ports delivering high bandwidth
- No need for recompilation (strong point vs. MMX)
- Low power by nature
60. Multimedia Vector Processors
- Short Registers plus Vector Cache (ICS-99, SPAA-01)
- MOM Architecture (SC-99, MICRO-99)
- SMT-MOM for MPEG-4 (HPCA-01)
61. Vector PC Architecture
[Block diagram: fetch and decode, I-cache, FP / INT / load-store units, data cache and vector cache, RAMBUS controller to four DRDRAM channels at 3.2 GB/s]
62. Vector MMX Matrix ISA
    for (i = 1 to N)
      for (j = 1 to 4)
        for (k = 1 to 4)
          A[i][j][k] = b op C[i][j][k]
MICRO-32, Haifa
63. Matrix extensions for Multimedia
[Diagram: MOM vs. MMX register layouts; an MMX register packs four 16-bit subwords (bits 0-63, e.g. A1-A4, B1-B4, C1-C4), while a MOM matrix register holds a 4x4 block of 16-bit elements (A1-A16, C1-C16)]
64. Matrix ISA Relative Performance
[Charts: inverse DCT transform, MPEG-2 motion estimation, RGB-YCC color conversion]
65. Program-Level Performance
66. The reduction problem
- MMX-like ISAs have problems handling reductions
67. Multimedia Accumulators
- 192-bit multimedia packed accumulators (MDMX, MIPS)
- Advantages
  - Doubles sub-word parallelism
  - High precision
- Disadvantages
  - Artificial recurrences
68. Recurrences and Efficiency Degradation
69. Accumulators for MOM
- Solves the recurrence problem
- Powerful instructions: matrix x vector, matrix SAD in 1 instruction!
[Diagram: a 192-bit MOM accumulator holding four 48-bit partial sums, Σ_j a1j·b1j, Σ_j a2j·b2j, Σ_j a3j·b3j, Σ_j a4j·b4j, over a vector of length Vl]
(A scalar-equivalent sketch of the matrix SAD reduction follows.)
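For reference, the work that a single matrix-SAD style instruction would cover, written as plain C over a 4x4 block of 16-bit samples; the 4x4 shape follows the MOM layout, everything else is an illustrative scalar equivalent.

    #include <stdint.h>

    /* Sum of absolute differences over a 4x4 block of 16-bit samples: this whole
     * double loop is what one matrix-SAD instruction with a wide accumulator
     * performs, without the serializing add-to-one-register recurrence. */
    uint32_t sad_4x4(const uint16_t a[4][4], const uint16_t b[4][4])
    {
        uint32_t acc = 0;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                int32_t d = (int32_t)a[i][j] - (int32_t)b[i][j];
                acc += (uint32_t)(d < 0 ? -d : d);
            }
        return acc;
    }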
70. Benchmarks and Simulation Tools
- Developed emulation libraries
  - MMX
    - 67 instructions emulated
    - 32 MMX registers (64 bits)
  - MDMX
    - 88 instructions emulated
    - 32 MDMX registers (64 bits), 4 MDMX accumulators (192 bits)
  - MOM
    - 121 instructions emulated
    - 16 MOM registers (16x64 bits), 2 MOM accumulators (192 bits)
- Added support to the JINKS simulator to detect emulation function calls and translate them to the emulated instruction (with the help of the ATOM tool)
71. Scalability
[Chart: LTPPAR kernel (from GSM encode); relative speed-up vs. number of physical SIMD registers; baseline is MMX with 36 registers]
72. Tolerance to Instruction Latencies
[Chart: LTPPAR kernel (from GSM encode); execution cycles vs. latency increase in cycles]
73. High-end media processors
- Exploit both DLP and TLP
- SMT vector processor
- Matrix oriented multimedia extensions (MOM)
74. Future Multimedia Workloads
- MPEG-4
- MPEG-7
75. SMT + SIMD ISAs
- Natural way of exploiting the two main sources of media parallelism (TLP + DLP) for the next generation of media protocols (MPEG-4/MPEG-7)
- The SMT paradigm helps vector execution
  - Minimizes Amdahl's-law impact
  - Allows vector execution to be hidden under scalar execution
- SIMD ISAs help the SMT paradigm
  - Alleviate fetch pressure
  - Allow a better pipeline dispatch balance
  - Increase latency tolerance (L1 vector bypass)
76. SMT µ-SIMD media processor
HPCA-01, Monterrey
[Block diagram: instruction fetch and decode with 8 program counters (one per thread) and 8 rename tables (one per thread); integer, FP, SIMD and memory queues feeding instruction issue; INT, FP and SIMD register files; reorder buffer and execution pipelines; I-cache, L1 and L2 caches, and RAMBUS memory]
77. Cache Hierarchy
78. SMT + SIMD ISAs bypassing L1
[Chart: effective IPC (EIPC)]
79. Multithreaded Processors
- Multithreaded Vector Processor (HPCA-97)
- Simultaneous Multithreaded Vector Processor (IEEE Micro 97)
- Speculative Multithreaded Scalar Processors (ICS-99)
- Clustered Speculative Multithreaded Processors (MICRO-99)
- Multithreaded-Decoupled Architectures (HPCA-99)
- Distant Parallelism (ICS-99, PACT-99)
80. Speculative Multithreaded Processors
- Multiple instruction windows
  - Non-contiguous
  - Interleaved
- Inter-thread data speculation
  - Dependences
  - Values
[Diagram: windows W1-W3 over the dynamic instruction stream]
81. Performance Potential
- Speedup over single-thread execution
[Chart: 16 thread units]
82. Speculative Multithreaded Processors
83. Clustered Speculative Multithreaded Processors
84. Multithreaded Decoupled Access/Execute Processors
- Decoupling
  - Very effective at hiding memory latency
  - A natural approach for a distributed organization
- Multithreading
  - Provides additional ILP
[Block diagram: per-thread fetch, dispatch and rename feeding instruction queues for the access processor (AP) and execute processor (EP), with store address queues and the memory subsystem]
HPCA-99
85. Latency Tolerance
- Decoupling hides most memory latency
  - IPC loss is much lower with decoupling
- Multithreading hardly improves memory latency tolerance
86. Distant Parallelism
[Diagram: program structure at different granularities (basic block, trace, loop, procedure, distant), with compiler and hardware responsibilities]
87. Examples: Partial parallelisation
- Go
  - Big loop with dependencies
  - Loop distribution (a small example follows this slide)
    - Isolate recurrences
    - Reduction of list operations
- Coverage: 50%
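A small, generic C example of the loop-distribution idea: distribute a loop so that the recurrence is isolated from the parallel part. The arrays and the recurrence are illustrative, not code from go.

    #define N 1024

    /* Original loop: a serial recurrence on s[] is mixed with independent work
     * on a[], so the whole loop must run sequentially. */
    void fused(float *a, float *s, const float *b)
    {
        for (int i = 1; i < N; i++) {
            s[i] = s[i - 1] + b[i];   /* recurrence: forces serial execution */
            a[i] = b[i] * 2.0f;       /* independent work                    */
        }
    }

    /* After loop distribution: the recurrence is isolated in its own loop, and
     * the remaining loop is fully parallel (a candidate for another thread). */
    void distributed(float *a, float *s, const float *b)
    {
        for (int i = 1; i < N; i++)
            s[i] = s[i - 1] + b[i];   /* serial part, isolated               */

        for (int i = 1; i < N; i++)   /* parallel part                       */
            a[i] = b[i] * 2.0f;
    }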
88. Sequential vs. SMT
89. VLIW Architectures
[Overview diagram: high register requirements addressed by register-sensitive software pipelining, register-constrained software pipelining, and new VLIW organizations (register file, wide FUs, clustering)]
90. VLIW Architectures
- Register File Use and Organization (HPCA-95)
- Software Pipelining (MICRO-96, MICRO-97, IEEE-TC, PACT-98, PLDI-00, ICPP-00, ISSS-00, MICRO-00)
- Software Prefetching (MICRO-97)
- Wide Architectures (ICS-97, ICS-98, MICRO-98)
- Clustered Architectures (HPCA-99)
- Two-Level Register File Organization (MICRO-32)
91. Static register requirements
92. Dynamic register requirements
93. Register-sensitive SP
- Objectives
  - Throughput-optimal schedules
  - Minimum register requirements
  - Fast scheduling time
- Proposed techniques
  - HRMS: Hypernode Reduction Modulo Scheduling (MICRO-95)
  - SMS: Swing Modulo Scheduling (PACT-96)
94. HRMS/SMS
- Static priority function to pre-order nodes
  - Hypernode reduction
  - Swing reduction (critical path)
- Simple scheduling: bidirectional greedy modulo scheduling
95. HRMS and SMS register requirements
96. Software Prefetching
- Cache-sensitive modulo scheduling
- Interaction between software prefetching and software pipelining
- Binding vs. non-binding prefetch
- Based on
  - Data locality analysis
  - Dependence graph
[Chart: results for an 8-way issue machine]
MICRO 97
97. Register-constrained SP
- Objectives
  - Schedule loops even if register requirements exceed the available registers
  - Reduce performance degradation (throughput and memory traffic)
  - Reduce compilation time
- Proposed techniques
  - Spilling vs. increasing the II (MICRO-96)
  - New heuristics to add spill code, combining spilling with increasing the II (PLDI-00)
  - Iterative Modulo Scheduling with Integrated Register Spilling (MIRS), submitted to ICS-01
98. Scheduling environment
99. Heuristics for spill code
- Performance evaluation (P4M2L4, all the loops)
[Charts: relative performance and memory traffic for the Var and Use heuristics]
CC (critical cycle), QF (quantity factor) and TF (traffic factor); PLDI-00
100. MIRS: modulo scheduling and spilling
[Flowchart: start with II = MII and HRMS priority, budget = BR x nodes; repeatedly select a node, find a cycle, check and insert it, forcing ejection and inserting spill code when needed; restart with a larger II when required; exit when no nodes remain]
101. MIRS: modulo scheduling and spilling
102. MIRS: modulo scheduling and spilling
103. MIRS evaluation
[Chart: all loops]
104. MIRS evaluation
[Chart: speed-up]
105. New register file organizations
- Register file requirements
  - Large register files
  - High-bandwidth register files
- Technological problems
  - Area, power consumption and access time grow with the number of access ports, the number of registers, and the number of bits per register
106. Monolithic register file
[Charts: cycle time (hs), area (λ² x 10⁶), power (W)]
- Rixner's (et al.) model (HPCA-00)
- 0.18 µm technology
- VLIW configurations GPxMyREGz, where x = 6, y = 2, 3 and z = 16, 32, 64, 128
107. Influence of the register file size
- Memory ports
- Memory latency
[Charts: execution cycles, memory traffic, execution time]
108. New register file organizations
- Objective
  - Investigate the performance/area/power trade-off
  - Scheduling heuristics
- Proposed Organizations
  - Sacks (CONPAR-94)
  - Non-consistent register file (HPCA-95)
  - Hierarchical register file (MICRO-00)
109. Sacks-based register file
CONPAR-94
110. Sack-based RF performance
111. Non-consistent register file
HPCA-95
112. Non-consistent RF performance
113. Hierarchical register file
[Diagram: a conventional organization (R1 with Load/Store to the L1 cache) vs. the hierarchical organization (a small R1 backed by a larger R2, with LoadR/StoreR moving values between R1 and R2, and Load/Store between R1 and the L1 cache)]
114. Hierarchical RF design issues
- R1 size: 16 registers
- Between R1 and R2: 4 load and 2 store ports
- R2 size: 64 registers
- Total storage capacity: 80 registers
115. Hierarchical RF compilation issues
- Two-step register allocation
  - Step 1: add LoadR and StoreR instructions; register allocation in R1; insert spill code between R1 and R2
  - Step 2: register allocation in R2; insert spill code between R2 and the L1 cache
116. Hierarchical RF performance
- GP6M2REGx vs. GP6M2TWO16, ideal memory
[Charts: execution cycles, memory traffic, execution time]
117. Hierarchical RF performance
- GP6M3REGx vs. GP6M2TWO16, ideal memory
[Charts: execution cycles, memory traffic, execution time]
118. Hierarchical RF performance
- L1 cache of 32 KB with a 32-byte line size
- Lockup-free, allowing up to 8 pending memory accesses
- Load latency is 2 cycles, store is 1
- Three miss latency values
  - low (10 hs)
  - medium (20 hs)
  - high (40 hs)
- Binding prefetching
119. Hierarchical RF performance
[Charts: execution cycles, execution time]
120. Wide architectures
- Objectives
  - Exploit data parallelism in loops
  - Optimize performance/cost
- Papers
  - ICS-97
  - MICRO-98
  - ICS-98
  - ICPP-99
121. Wide architectures
ICS-97, ICS-98
122. Wide architectures: design issues
[Chart: area]
MICRO-98
123. Wide architectures: design issues
[Chart: relative cycle time]
124. Wide architectures: performance
MICRO-98
125. Wide architectures: performance
- Register constraints
126. Clustered architectures
- Each cluster has a local register file and a set of functional units
- Clusters are interconnected by a bidirectional ring of queues
- Variables that span clusters are allocated to communication queues
- Techniques developed to allocate variables to queue register files
- New scheduling technique developed
  - Distributed Modulo Scheduling (HPCA-99)
127. Ring Clustered architectures
HPCA-99
[Diagram: clusters on a ring executing Add, Copy and Mul operations]
128. Clustered architectures: performance
129. Clustered VLIW Architecture
ICPP-00, ISSS-00, MICRO-33
130. Scheduling Algorithm
- Order instructions
  - Priority to nodes in recurrences
  - Avoid scheduling a node's predecessors and successors before it
- For each instruction (see the sketch below)
  - Try all possible clusters (resource constraints)
  - Choose the cluster with the best profit in out-edges
    - If more than one, minimize register requirements
    - If a new subgraph, then another cluster
  - Schedule the node in the chosen cluster
- Selective unrolling
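A compact C sketch of the per-instruction cluster-selection step described above; the profit values, the toy scheduler state and the tie-breaking by register pressure are illustrative placeholders rather than the published algorithm.

    #include <limits.h>

    #define NUM_CLUSTERS 4
    #define MAX_NODES    256

    /* Toy scheduler state: free issue slots and register pressure per cluster,
     * and a per-node, per-cluster "profit in out-edges" matrix filled elsewhere. */
    static int free_slots[NUM_CLUSTERS];
    static int reg_pressure[NUM_CLUSTERS];
    static int profit[MAX_NODES][NUM_CLUSTERS];
    static int assigned_cluster[MAX_NODES];

    /* For one node: try every cluster with free resources, keep the one with the
     * best out-edge profit, and break ties by the lowest register pressure. */
    void schedule_node(int node)
    {
        int best = -1, best_profit = INT_MIN, best_regs = INT_MAX;
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (free_slots[c] == 0)             /* resource constraint           */
                continue;
            if (profit[node][c] > best_profit ||
                (profit[node][c] == best_profit && reg_pressure[c] < best_regs)) {
                best = c;
                best_profit = profit[node][c];
                best_regs = reg_pressure[c];
            }
        }
        if (best >= 0) {                        /* otherwise the caller bumps the II */
            assigned_cluster[node] = best;
            free_slots[best]--;
            reg_pressure[best]++;
        }
    }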
131. Results
[Charts: 2-cluster and 4-cluster configurations]
132. Cache Miss Equations
- Techniques that exploit intrinsic properties of CME
- Important speed-up
  - Between 30 and 40 for SPECfp95
- Sampling significantly reduces the computational requirements
- Computing cost per program
  - Usually less than a minute
  - Never more than 5 minutes
- Accurate
  - Error less than 0.2% for more than 50% of the loops
  - Never higher than 1%
Interact 00, CPC 00, ISPASS 00, Europar 00
133. MultiVLIW Processors
MICRO-33
134. RMCA Modulo Scheduler (I)
135. RMCA Modulo Scheduler (II)
136. Performance Results
[Chart: 4-cluster configurations with NMB = 1, 2 and LMB = 1, 4]
137. Clustered architectures: interconnect
[Diagram: per-cluster R1 register files connected to a shared L1 cache]
138. Clustered architectures: RF
[Diagram: per-cluster R1 register files backed by a shared R2 and the L1 cache]
139. Current work
- Integrated scheduling and register spilling for clustered architectures
- Not only performance: area- and power-consumption-oriented proposals
- Hierarchical register file organization for clustered/wide VLIW architectures
- Not only numerical applications: multimedia applications
- Widening + Clustering
140. Compilers
- Instruction-level scheduling
- Linear loop transformations
- Automatic data distribution and exploitation of data locality
- Exploitation of multilevel parallelism in OpenMP
141. Operating Systems
- Analysis and visualization tools
- OS support for parallel applications
- Parallel I/O
- Efficient execution of parallel applications in multiprogrammed environments
142. Algorithms
- Multilevel Block Algorithms
- Parallel Algorithms for Sparse Matrices
- Systematic Mapping of Algorithms onto Multicomputers
143. Projects (1 of 2)
- Management and Technology Transfer
  - PCI-PACOS, PCI-II, CEPBA-TTN
- Research and Development
  - COMANDOS-II, Supernode-II, EDS, Genesis, SHIPS, IDENTIFY, DDT, PROMENVIR, PERMPAR, PARMAT, SLOEGAT
144. Projects (2 of 2)
- Basic and Long-term Research
  - SEPIA, APPARC, NANOS, MHAOTEU
- Mobility of Researchers
  - HCM, PECO, TMR
- Training
  - COMETT, PARANDES, PARALIN
145. Large Software Development
- DDT
  - Data distribution tool
- Dimemas & Paraver
  - Parallel performance prediction & analysis
- Paros
  - Parallel operating systems
- Ictineo
  - Fortran compiler
- Dixie
  - Binary translation & instrumentation
146. Large Software Development
- NANOS
  - OpenMP environment (compiler, visualization)
  - Resource manager minimizing context switches and migrations in large multiprocessors
- PERMAS
  - Automatic parallelization at run time of 1.5 Mlines of legacy code
- ST-ORM
  - Metacomputing environment for stochastic analysis and optimization
147. Interdisciplinary Topics
- Big group
  - Expertise in many areas
  - Critical mass favours production
  - Resource sharing
  - More external visibility
- Cross-fertilization of topics
  - Compilers vs. Architecture
  - Numerical Applications vs. Memory Hierarchy
  - etc.
148. HPC group seminar
- A forum for learning and discussion
- Started in fall 1998
- 56 talks (as of January 2000)
  - 23 by HPC group members
  - 33 by guests
- See our web site
  - http://www.ac.upc.es/HPCseminar
149. Summary
- Large group
- Experience in many topics
- Very good students
- A proven track record with past projects
150. Thank you