Exploiting Vector Parallelism in Software Pipelined Loops

1
Exploiting Vector Parallelism in Software
Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
2
Multimedia Extensions
  • Short vector extensions in ILP processors
  • AltiVec, 3DNow!, SSE, etc.
  • Accelerate loops in multimedia and DSP codes
  • New designs have floating point support

3
Multimedia Extensions
  • Vector resources do not overwhelm the scalar
    resources
  • Scalar: 2 FP ops / cycle
  • Vector: 4 FP ops / cycle
  • Full vectorization may underutilize scalar
    resources
  • ILP techniques do not target vector resources
  • Need both

Courtesy of International Business Machines
Corporation. Unauthorized use not permitted.
4
Modulo Scheduling
for (i=0; i<N; i++) s = s + X[i] * Y[i];
Cycle Slot 1 Slot 2 Slot 3
1 LOAD LOAD
2 MULT
3 LOAD LOAD ADD
4 MULT
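
To make the benefit of overlapping iterations concrete, here is a minimal sketch of the usual cycle-count estimate for a modulo-scheduled loop. It assumes one reading of the schedule above: a new iteration starts every two cycles (II = 2) and each iteration spans three cycles (loads, then MULT, then ADD). The function name and parameters are illustrative, not taken from the slides.

```c
#include <assert.h>

/* Cycles needed to run n iterations of a software-pipelined loop.
   A new iteration starts every ii cycles; the final iteration still
   needs sched_len cycles to drain. Assumes no stalls. */
long pipelined_cycles(long n, long ii, long sched_len) {
    assert(n >= 0 && ii > 0 && sched_len > 0);
    if (n == 0)
        return 0;
    return (n - 1) * ii + sched_len;
}
```

Under those assumptions, `pipelined_cycles(100, 2, 3)` gives 201 cycles, compared with 300 if each three-cycle iteration had to finish before the next began.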

5
Traditional Vectorization
for (i=0; i<N; i+=2) S[i:i+1] = X[i:i+1] * Y[i:i+1];
for (i=0; i<N; i++) s = s + S[i];
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD
2 VLOAD
3 VMUL
4 VSTORE

Cycle Slot 1 Slot 2 Slot 3
1 LOAD
2 LOAD ADD
3
4
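
The two loops above can be sketched in portable C by emulating a 2-wide vector register with a small struct. This is only an illustration: `vec2`, `vload`, `vmul`, `vstore`, and `dot_distributed` are hypothetical names, and a real target would use its own vector instructions instead.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 2-element vector register (hypothetical). */
typedef struct { double lane[2]; } vec2;

static vec2 vload(const double *p) { vec2 v = {{p[0], p[1]}}; return v; }
static vec2 vmul(vec2 a, vec2 b) {
    vec2 r = {{a.lane[0] * b.lane[0], a.lane[1] * b.lane[1]}};
    return r;
}
static void vstore(double *p, vec2 v) { p[0] = v.lane[0]; p[1] = v.lane[1]; }

/* Loop distribution: a vector loop writes the products to tmp,
   then a separate scalar loop reduces them. n is assumed even and
   tmp must hold n doubles. */
double dot_distributed(const double *x, const double *y, double *tmp, size_t n) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)   /* VLOAD, VLOAD, VMUL, VSTORE */
        vstore(&tmp[i], vmul(vload(&x[i]), vload(&y[i])));
    double s = 0.0;
    for (size_t i = 0; i < n; i++)      /* LOAD, ADD */
        s = s + tmp[i];
    return s;
}
```

Note that the first loop uses only vector operations and the second only scalar ones, so each loop leaves one class of resources idle.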

6
Vectorization without Distribution
for (i=0; i<N; i+=2) {
  S = X[i:i+1] * Y[i:i+1];
  s = s + S[0];
  s = s + S[1];
}
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD
2 VLOAD
3 VMUL
4 VLOAD ADD
5 VLOAD ADD
6 VMUL
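
A minimal C sketch of the fused loop above, again emulating the 2-wide vector with a struct. The names `vec2`, `vload`, `vmul`, and `dot_fused` are hypothetical. Reading the lanes directly models the assumption that moving a value from a vector register to the scalar side is free.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 2-element vector register (hypothetical). */
typedef struct { double lane[2]; } vec2;

static vec2 vload(const double *p) { vec2 v = {{p[0], p[1]}}; return v; }
static vec2 vmul(vec2 a, vec2 b) {
    vec2 r = {{a.lane[0] * b.lane[0], a.lane[1] * b.lane[1]}};
    return r;
}

/* Vector multiply and scalar reduction kept in one loop body, so no
   temporary array is needed. n is assumed even. */
double dot_fused(const double *x, const double *y, size_t n) {
    assert(n % 2 == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        vec2 p = vmul(vload(&x[i]), vload(&y[i]));  /* VLOAD, VLOAD, VMUL */
        s = s + p.lane[0];                          /* ADD */
        s = s + p.lane[1];                          /* ADD */
    }
    return s;
}
```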
7
Selective Vectorization
for (i=0; i<N; i+=2) {
  S = X[i:i+1] * {Y[i], Y[i+1]};
  s = s + S[0];
  s = s + S[1];
}
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD LOAD
2 LOAD
3 VLOAD LOAD
4 VMUL LOAD
5 VLOAD LOAD ADD
6 VMUL LOAD ADD
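
The mixed loop above can be sketched in C as follows: X is fetched with one vector load while the two Y elements are fetched with scalar loads and then packed. All names (`vec2`, `vload`, `vpack`, `vmul`, `dot_selective`) are hypothetical. The `vpack` helper stands for the scalar-to-vector transfer, which this sketch treats as free.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 2-element vector register (hypothetical). */
typedef struct { double lane[2]; } vec2;

static vec2 vload(const double *p) { vec2 v = {{p[0], p[1]}}; return v; }
/* Scalar-to-vector transfer; a real machine needs explicit operations here. */
static vec2 vpack(double a, double b) { vec2 v = {{a, b}}; return v; }
static vec2 vmul(vec2 a, vec2 b) {
    vec2 r = {{a.lane[0] * b.lane[0], a.lane[1] * b.lane[1]}};
    return r;
}

/* Only some operations are vectorized, so memory, scalar, and vector
   units all have work in every iteration. n is assumed even. */
double dot_selective(const double *x, const double *y, size_t n) {
    assert(n % 2 == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        vec2 vx = vload(&x[i]);            /* VLOAD */
        double y0 = y[i];                  /* LOAD */
        double y1 = y[i + 1];              /* LOAD */
        vec2 p = vmul(vx, vpack(y0, y1));  /* VMUL */
        s = s + p.lane[0];                 /* ADD */
        s = s + p.lane[1];                 /* ADD */
    }
    return s;
}
```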
8
Complications
  • Complex scheduling requirements
  • Particularly in statically scheduled machines
  • Memory alignment
  • Example assumes no communication cost
  • In reality, explicit operations required
  • Often through memory
  • Reserve critical resources
  • Potential long latency
  • Performance improvement still possible

9
Tomcatv main loop (50)
10
Tomcatv (SpecFP 95)
Issue Width: 6
Memory Units: 2
ALUs: 4
FPUs: 2
Vector Units: 1
Vector Length: 2
1.7x Speedup over Modulo Scheduling
Technique                ALU  MEM  FPU  VEC
Modulo Scheduling          6   22   46    0
Full Vectorization         7   13    0   46
Selective Vectorization    7   27   19   27
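
One way to read the table, offered here as an assumption rather than something the slides state, is that each entry is the number of cycles a resource class is kept busy, so the largest entry in a row bounds the initiation interval. The sketch below applies that reading; the array and function names are illustrative.

```c
#include <assert.h>

enum { NUM_RES = 4 };  /* columns: ALU, MEM, FPU, VEC */

/* Rows copied from the Tomcatv table. */
const int modulo_sched[NUM_RES] = {6, 22, 46, 0};
const int full_vect[NUM_RES]    = {7, 13, 0, 46};
const int selective[NUM_RES]    = {7, 27, 19, 27};

/* Busiest resource class in a row (assumed to bound the II). */
int bottleneck(const int row[NUM_RES]) {
    int max = 0;
    for (int i = 0; i < NUM_RES; i++)
        if (row[i] > max)
            max = row[i];
    return max;
}

/* Ratio of the two bottlenecks: how much faster `other` is than `base`. */
double speedup(const int base[NUM_RES], const int other[NUM_RES]) {
    int denom = bottleneck(other);
    assert(denom > 0);
    return (double)bottleneck(base) / denom;
}
```

Under this reading, modulo scheduling and full vectorization are both limited to 46 (by the FPUs and the vector unit respectively), while selective vectorization is limited to 27, and 46 / 27 is about 1.7, which is consistent with the quoted speedup.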
11
Tomcatv (SpecFP 95)
12
Selective Vectorization
  • Balance computation among resources
  • Minimize II when loop is modulo scheduled
  • Carefully manage communication
  • Incorporate alignment information
  • Software pipelining hides latency
  • Adapt a 2-cluster partitioning heuristic
  • Fiduccia and Mattheyses, 1982
  • Kernighan and Lin, 1970

13
Selective Vectorization
[Figure labels: scalar, vector, cost]
14
Cost Function
  • Projected II due to resources (ResMII)
  • Bin-packing approach [Rau, MICRO '94]
  • With some modifications
  • Can ignore operation latency
  • Software pipelining hides latency
  • Vectorizable ops not on dependence cycles

for (i=0; i<N; i++) X[i+4] = X[i];
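
The idea of steering a partitioning heuristic with a resource-based cost can be sketched as below. This is a deliberately simplified stand-in, not the authors' algorithm: it uses plain ceiling division instead of bin-packing, a greedy hill climb instead of the full Kernighan-Lin or Fiduccia-Mattheyses procedure, and it ignores communication cost. The unit counts come from the evaluation slides; everything else (`res_mii`, `partition`, the operation kinds, and the rule that vector memory operations still occupy a memory unit) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum kind { FP, MEM };
enum { FPUS = 2, MEM_UNITS = 2, VEC_UNITS = 1, VLEN = 2 };

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Resource-constrained II estimate for VLEN original iterations.
   vec[i] != 0 means operation i is vectorized. A scalar operation is
   issued VLEN times, a vector operation once. */
int res_mii(const enum kind *ops, const int *vec, size_t n) {
    assert(ops != NULL && vec != NULL);
    int fpu = 0, mem = 0, vu = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == MEM)
            mem += vec[i] ? 1 : VLEN;  /* assumed: vector loads use a memory unit */
        else if (vec[i])
            vu += 1;
        else
            fpu += VLEN;
    }
    int ii = ceil_div(fpu, FPUS);
    int m = ceil_div(mem, MEM_UNITS);
    int v = ceil_div(vu, VEC_UNITS);
    if (m > ii) ii = m;
    if (v > ii) ii = v;
    return ii;
}

/* Greedy hill climb: repeatedly flip the one operation whose move
   between the scalar and vector sides lowers the cost the most.
   Stops when no single flip helps. Returns the final cost. */
int partition(const enum kind *ops, int *vec, size_t n) {
    int best = res_mii(ops, vec, n);
    for (;;) {
        size_t pick = n;
        int pick_cost = best;
        for (size_t i = 0; i < n; i++) {
            vec[i] = !vec[i];                 /* try the move */
            int c = res_mii(ops, vec, n);
            vec[i] = !vec[i];                 /* undo it */
            if (c < pick_cost) { pick_cost = c; pick = i; }
        }
        if (pick == n)
            return best;
        vec[pick] = !vec[pick];
        best = pick_cost;
    }
}
```

For a hypothetical loop body with six FP operations and two memory operations, both the all-scalar and the all-vector assignments cost 6 under this model, while the greedy search settles on vectorizing three FP operations for a cost of 3. Because the search stops at the first point where no single flip helps, it can stall on plateaus that the full heuristics are designed to escape.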
15
Evaluation
C or Fortran
  • SUIF front-end
  • Dependence analysis
  • Dataflow optimization
  • Trimaran back-end
  • Modulo scheduler
  • Register allocator
  • VLIW Simulator
  • Added vector ops

Simulation Binary
16
Evaluation
  • Operands communicated through memory
  • Software responsible for realignment

Issue Width: 6
Memory Units: 2
ALUs: 4
FPUs: 2
Vector Units: 1
Vector Length: 2
17
Evaluation
  • SpecFP 92, 95, 2000
  • Easier to extract dependence information
  • Detectable data parallelism
  • 64-bit data means vector length of 2
  • Considered amenable to vectorization and software pipelining (SWP)
  • Apply selective vectorization to DO loops
  • No control flow, no function calls
  • Fully simulate with training sets

18
Traditional Vectorization
19
Vectorization without Distribution
20
Vectorization Free Communication
21
Vectorization without Distribution
22
Selective Vectorization
23
Selective Vectorization
tomcatv
mgrid
su2cor
swim
24
Communication Support
  • Transfer through memory
  • Register to register copy
  • Uses fewer issue slots
  • Frees memory resources
  • Shared register file
  • Vector elements addressable in scalar ops
  • Requires no extra issue slots

25
Through Memory
tomcatv
mgrid
su2cor
swim
26
Reg to Reg Transfer Support
tomcatv
mgrid
su2cor
swim
27
Shared Register File
tomcatv
mgrid
su2cor
swim
28
Related Work
  • Traditional vectorization
  • Allen and Kennedy; Wolfe
  • Software Pipelining
  • Rau's iterative modulo scheduling
  • Clustered VLIW
  • Aleta MICRO-34, Codina PACT '01, Nystrom
    MICRO-31, Sanchez MICRO-33, Zalamea MICRO-34
  • Partitioning among clusters similar
  • Ours is also an instruction selection problem
  • No dedicated communication resources

29
Conclusion
  • Targeting all FUs improves performance
  • Selective vectorization
  • Vectorization better in the backend
  • Cost analysis more accurate
  • Software pipeline vectorized loops
  • Good idea anyway
  • Facilitates selective vectorization
  • Hides communication and alignment latency