Computer Architecture - PowerPoint PPT Presentation

Computer Architecture

Description:

Computer Architecture Lecture 7 Compiler Considerations and Optimizations – PowerPoint PPT presentation

Provided by: oma100
Transcript and Presenter's Notes

1
Computer Architecture
  • Lecture 7
  • Compiler Considerations and Optimizations

2
Structure of Recent Compilers
  • Front End
    Transforms the language to a common intermediate
    form.
    Note: Only a few companies make front ends for
    C++. The source code for a C++ front end is about
    30 times bigger than for C. Most front ends
    down-convert C++ to C before compilation.
  • High-Level Optimization
    High-level loop optimization and, for example,
    procedure in-lining. (Language dependent,
    machine independent.)
  • Global Optimization
    Global and local optimization and register
    allocation. (Small language dependence, small
    machine dependence.)
  • Code Generation
    Detailed instruction selection and
    machine-dependent optimization. (No language
    dependence, highly machine dependent.)
3
Compiler Prime Target
  • Program Correctness
  • Speed
  • Compilation Time?
  • Dividing the compiler into phases helps in
    writing a bug-free compiler

4
Optimizations
  • High-level
  • Local (Basic Block)
  • Global (across branches)
  • Register Allocation, Live Range Analysis
  • Processor Dependent

5
Optimization Names
  • Procedure Integration
  • Common Sub-expression Elimination / Dead Code
    Elimination
  • A = b + c   (dead code, eliminated: no
    subsequent use of b + c)
  • A = x + y
  • Similarly, a procedure that does not return a
    value and uses only local variables will be
    eliminated. (Test this in VC.)
  • Constant Propagation: a variable used as a
    constant is replaced by that constant.
    ("Constants aren't, variables won't": Osborn's
    Law)
  • Global Sub-expression Elimination
  • Copy Propagation (after a = b, a will be replaced
    by b)
  • Code Motion (code that does not change with the
    loop index will be moved out of the loop)
  • Induction Variable Elimination (A = A + 5 in a
    loop that runs n times will be replaced with
    A = A + 5 * n and moved out of the loop, if A
    is not used inside the loop)
  • Strength Reduction (multiply replaced with shift
    and add if possible; A*25 + B*25 will be
    replaced with (A + B)*25)
  • Pipeline Scheduling
  • Branch Optimization
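
A few of the optimizations named above can be illustrated at source level. The sketch below is only a hand-written before/after pair in Python, not compiler output; the function names `before` and `after` and the variables `scale`, `acc` and `total` are made up for the illustration.

```python
def before(a, b, n):
    """Straightforward loop, as a programmer might write it."""
    total = 0
    acc = 0
    for i in range(n):
        scale = a * 25 + b * 25    # loop-invariant: does not depend on i
        acc = acc + 5              # acc is not read inside the loop
        total = total + scale * i
    return total, acc


def after(a, b, n):
    """The same computation after hand-applying the optimizations."""
    scale = (a + b) * 25           # code motion, plus one multiply instead of two
    total = 0
    for i in range(n):
        total = total + scale * i
    acc = 5 * n                    # n updates of "acc = acc + 5" folded into one
    return total, acc
```

Both functions return the same pair for any integer inputs; the second one simply does less work per iteration.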

6
Problems with Pointers
  • A = 5
  • p = x + y
  • *p = 9 (only the programmer knows whether *p is
    A)
  • The compiler cannot assign A to a register
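
Why a possible alias blocks register allocation can be shown with a small model. This is a sketch under assumptions: memory is modelled as a Python list, addresses are list indices, and the function names are hypothetical. Keeping `A` in a "register" is modelled by copying it into a local variable before the store through `p`.

```python
def cached_in_register(mem, a_addr, p_addr):
    """Model of code that keeps A in a register across the store through p."""
    mem[a_addr] = 5          # A = 5
    reg = mem[a_addr]        # the compiler keeps A in a register
    mem[p_addr] = 9          # *p = 9
    return reg               # stale if p_addr == a_addr


def reloaded_from_memory(mem, a_addr, p_addr):
    """Model of code that reloads A from memory after the store through p."""
    mem[a_addr] = 5          # A = 5
    mem[p_addr] = 9          # *p = 9
    return mem[a_addr]       # always the current value of A
```

When `p_addr` differs from `a_addr` both versions agree; when the two addresses coincide, the register copy returns 5 while memory holds 9, which is why the compiler has to be conservative.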

7
Architecture Help
  • Provide orthogonality: the operations, the data
    types, the addressing modes, and the register
    functions should be orthogonal.
  • Simplify trade-offs between alternatives. (With
    caches and pipelining, trade-offs have become
    very complex.) For example, the most difficult
    one in a register-memory architecture: how many
    times must a variable be referenced before it is
    assigned a register?
  • Provide instructions that bind quantities known
    at compile time as constants.
  • Most SIMD kernels are hand-coded because there
    is no compiler support.

8
Hand-Coded vs. Compiler-Generated Code on TMS320C6203
(VLIW CPU) (reported May 2000)
EEMBC Telecom Kernel          Execution Time Ratio       Code Size Ratio
                              (Compiler/Hand-Written)    (Compiler/Hand-Written)
Convolution Encoder           44.0                       0.5
Fixed Point Complex FFT       13.5                       1.0
Viterbi GSM Decoder           13.0                       0.7
Fixed Point Bit Allocation     7.0                       1.4
Auto-correlation               1.8                       0.7
9
Basic Compiler Techniques
  • Basic Pipelining
  • Static Loop Unrolling
  • Example

Instruction Producing Result   Instruction Using Result   Latency (clock cycles)
FP ALU                         FP ALU                     3
FP ALU                         Store                      2
Load                           FP ALU                     1
FP Load                        FP Store                   0
10
Example (Contd)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNEQ   R1, R2, Loop

11
Example (Without Scheduling)
Loop: L.D    F0, 0(R1)
      stall                  (LUD)
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      stall
      BNEQ   R1, R2, Loop
      stall                  (successor flushed)
Total: 10 cc
12
Example (With Scheduling)
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, -8
      ADD.D  F4, F0, F2
      stall
      BNEQ   R1, R2, Loop
      S.D    F4, 8(R1)       (delay slot)
Total: 6 cc (3 for data, 3 overhead)
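
The cycle totals for the unscheduled and scheduled loops can be cross-checked with a very small in-order issue model. This is a simplified sketch, not a pipeline simulator: it uses the latency table from the Basic Compiler Techniques slide, assumes one extra stall between an integer op and a dependent branch (inferred from the stall shown between DADDUI and BNEQ), and charges one cycle for an unfilled branch delay slot. It reports only the total, not where each stall is drawn on the slides. The names `LATENCY`, `loop_cycles`, `UNSCHEDULED` and `SCHEDULED` are made up for the illustration.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,   # assumption: inferred from the DADDUI/BNEQ stall
}


def loop_cycles(instrs):
    """Return the cycles for one loop iteration, issuing in program order.

    Each instruction is (kind, destination register or None, source registers).
    """
    producers = {}   # register -> (kind of producing instruction, issue cycle)
    cycle = 0
    for kind, dest, srcs in instrs:
        issue = cycle + 1
        for reg in srcs:
            if reg in producers:
                pkind, pcycle = producers[reg]
                stall = LATENCY.get((pkind, kind), 0)
                issue = max(issue, pcycle + 1 + stall)
        if dest is not None:
            producers[dest] = (kind, issue)
        cycle = issue
    if instrs[-1][0] == "branch":
        cycle += 1   # unfilled branch delay slot
    return cycle


UNSCHEDULED = [
    ("load", "F0", ["R1"]),          # L.D    F0, 0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADD.D  F4, F0, F2
    ("store", None, ["F4", "R1"]),   # S.D    F4, 0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1, R1, -8
    ("branch", None, ["R1", "R2"]),  # BNEQ   R1, R2, Loop
]

SCHEDULED = [
    ("load", "F0", ["R1"]),          # L.D    F0, 0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1, R1, -8
    ("fpalu", "F4", ["F0", "F2"]),   # ADD.D  F4, F0, F2
    ("branch", None, ["R1", "R2"]),  # BNEQ   R1, R2, Loop
    ("store", None, ["F4", "R1"]),   # S.D    F4, 8(R1), in the delay slot
]

print(loop_cycles(UNSCHEDULED), loop_cycles(SCHEDULED))
```

Under these assumptions the model gives 10 cycles for the unscheduled loop and 6 for the scheduled one, matching the totals on the two slides.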
13
Example (Static Loop Unrolling 4 times)
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, -32
      S.D    F12, 16(R1)
      BNEQ   R1, R2, Loop
      S.D    F16, 8(R1)      (delay slot)
Total: 3.5 cc per element
  • Compiler Considerations
  • Use of delay slot
  • Loop level independence
  • Register Assignment
  • Proper Loop Adjustment
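
At source level, unrolling four times looks roughly like the sketch below. It is an illustration in Python, not what a compiler emits; the function names are hypothetical, and the clean-up loop is one possible way to handle the "proper loop adjustment" when the element count is not a multiple of four.

```python
def add_scalar(x, s):
    """Original loop: add s to every element, highest index first."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x


def add_scalar_unrolled4(x, s):
    """The same loop unrolled four times, with a clean-up loop."""
    i = len(x) - 1
    while i >= 3:
        # Four independent element updates per trip; loop overhead
        # (index update and branch) is paid once per four elements.
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    while i >= 0:
        # Clean-up for the 0 to 3 leftover elements.
        x[i] = x[i] + s
        i -= 1
    return x
```

The four updates inside the unrolled body do not depend on each other, which is what lets the scheduler interleave their loads, adds and stores.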

14
Example (Static Dual Issue, 1 Int and 1 FP per CC)
        Integer instruction      FP instruction
Loop:   L.D    F0, 0(R1)
        L.D    F6, -8(R1)
        L.D    F10, -16(R1)      ADD.D  F4, F0, F2
        L.D    F14, -24(R1)      ADD.D  F8, F6, F2
        L.D    F18, -32(R1)      ADD.D  F12, F10, F2
        S.D    F4, 0(R1)         ADD.D  F16, F14, F2
        S.D    F8, -8(R1)        ADD.D  F20, F18, F2
        S.D    F12, -16(R1)
        DADDUI R1, R1, -40
        S.D    F16, 16(R1)
        BNEQ   R1, R2, Loop
        S.D    F20, 8(R1)        (delay slot)
Total: 2.4 cc per element
15
VLIW
  • Compiler formats issue packets
  • Compiler ensures that dependencies are not
    present
  • 64 to 200-bit long instructions

16
Example (VLIW: 1 Int, 2 FP, 2 LD/ST per CC, 5 slots)
Mem 1 Slot          Mem 2 Slot          FP 1 Slot           FP 2 Slot           Int/Branch
L.D F0, 0(R1)       L.D F6, -8(R1)      -                   -                   -
L.D F10, -16(R1)    L.D F14, -24(R1)    -                   -                   -
L.D F18, -32(R1)    L.D F22, -40(R1)    ADD.D F4,F0,F2      ADD.D F8,F6,F2      -
L.D F26, -48(R1)    -                   ADD.D F12,F10,F2    ADD.D F16,F14,F2    -
-                   -                   ADD.D F20,F18,F2    ADD.D F24,F22,F2    -
S.D F4, 0(R1)       S.D F8, -8(R1)      ADD.D F28,F26,F2    -                   -
S.D F12, -16(R1)    S.D F16, -24(R1)    -                   -                   DADDUI R1,R1, -56
S.D F20, 24(R1)     S.D F24, 16(R1)     -                   -                   -
S.D F28, 8(R1)      -                   -                   -                   BNEQ R1,R2, Loop
1.29 cc per element; 23 slots used out of a
potential 45
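
The per-element figures quoted across the example slides can be cross-checked with a little arithmetic. In the sketch below, the cycle counts 14, 12 and 9 and the unroll factors 4, 5 and 7 are not stated outright on the slides; they are inferred from the listings and the quoted per-element figures, so treat them as assumptions. The helper names are made up.

```python
def cycles_per_element(cycles, elements):
    """Cycles spent per array element for one trip through the loop body."""
    return cycles / elements


def slot_utilisation(operations, cycles, slots_per_cycle):
    """Fraction of the available issue slots that hold an operation."""
    return operations / (cycles * slots_per_cycle)


# (cycles per loop trip, elements processed per loop trip)
SCHEDULES = {
    "without scheduling": (10, 1),
    "with scheduling": (6, 1),
    "unrolled 4 times": (14, 4),
    "static dual issue": (12, 5),
    "VLIW, 5 slots": (9, 7),
}

for name, (cycles, elements) in SCHEDULES.items():
    print(f"{name}: {cycles_per_element(cycles, elements):.2f} cc per element")

# 7 loads + 7 adds + 7 stores + DADDUI + BNEQ = 23 operations in 9 x 5 slots.
print(f"VLIW slot utilisation: {slot_utilisation(23, 9, 5):.0%}")
```

Roughly half of the VLIW issue slots stay empty even in this favourable loop, which is the point of the "23 out of 45" remark.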
17
Loop Level Parallelism
  • Loop-Carried Dependence
  • Data calculated in one loop iteration is required
    in the next iteration.
  • A Parallel Loop
  • for (i = 1000; i > 0; i = i - 1)
  •   x[i] = x[i] + s;

18
Example
  • for (i = 1; i <= 100; i = i + 1)
  •   A[i+1] = A[i] + C[i];
  •   B[i+1] = B[i] + A[i+1];
  • Dependences?
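
One way to see that this loop carries dependences is to run its iterations in a different order and compare the results. The sketch below assumes the loop body is A[i+1] = A[i] + C[i] followed by B[i+1] = B[i] + A[i+1], for i from 1 to 100; the initial array contents and the helper name `run` are arbitrary choices for the illustration.

```python
def run(order):
    """Execute the loop body once for each i in the given iteration order."""
    A = [1] * 102
    B = [0] * 102
    C = [1] * 102
    for i in order:
        A[i + 1] = A[i] + C[i]        # reads A[i], written by iteration i - 1
        B[i + 1] = B[i] + A[i + 1]    # reads B[i] from iteration i - 1,
                                      # and A[i+1] from this iteration
    return A, B


forward = run(range(1, 101))          # i = 1, 2, ..., 100
backward = run(range(100, 0, -1))     # i = 100, 99, ..., 1

print(forward == backward)
```

The two orders give different arrays, so the iterations cannot be reordered or run in parallel as written. Both statements read a value produced by the previous iteration, and the second statement also depends on the first within the same iteration.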

19
Example 2
  • Make the following loop parallel.
  • for (i = 1; i <= 100; i = i + 1)
  •   A[i] = A[i] + B[i];
  •   B[i+1] = C[i] + D[i];
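
One standard way to make this loop parallel is to peel the first and last statements out of the loop and swap the order of the two statements inside it, so that each iteration only uses values it computes itself. The sketch below assumes the body is A[i] = A[i] + B[i] followed by B[i+1] = C[i] + D[i], with i running from 1 to n inclusive; the function names are hypothetical and the arrays are indexed from 1, so they need n + 2 entries.

```python
def original(A, B, C, D, n):
    """The loop as given: iteration i reads B[i] written by iteration i - 1."""
    A, B = A[:], B[:]
    for i in range(1, n + 1):
        A[i] = A[i] + B[i]
        B[i + 1] = C[i] + D[i]
    return A, B


def transformed(A, B, C, D, n):
    """Same result, but no value crosses from one iteration to the next."""
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]                    # peeled first statement
    for i in range(1, n):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]    # uses B[i+1] from this iteration
    B[n + 1] = C[n] + D[n]                # peeled last statement
    return A, B
```

In the transformed version the only dependence is inside a single iteration, so the iterations can be executed in any order.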

20
The GCD Test
  • A loop stores to element a × j + b and later
    fetches from element c × k + d.
  • A sufficient test for the absence of a
    dependence: if a loop-carried dependence exists,
    then GCD(c, a) must divide (d - b) with no
    remainder.
  • for (i = 1; i <= 100; i = i + 1)
  •   x[2*i + 3] = x[2*i] * 5;
  • This test ignores loop bounds.
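
The test is short enough to write out. The sketch below is a direct transcription, assuming the store index has the form a*j + b and the load index the form c*k + d; the function name `gcd_test_may_depend` is made up. A result of `False` proves there is no dependence; a result of `True` only means a dependence cannot be ruled out, since the test ignores loop bounds.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """Return False when the GCD test proves the two accesses never overlap.

    Store index: a * j + b.  Load index: c * k + d.
    """
    g = gcd(a, c)
    if g == 0:
        # Both indices are constants; they overlap only if they are equal.
        return b == d
    return (d - b) % g == 0


# The loop on this slide: store to x[2*i + 3], load from x[2*i].
print(gcd_test_may_depend(2, 3, 2, 0))
```

For the loop above, a = 2, b = 3, c = 2 and d = 0, so GCD(c, a) = 2 and d - b = -3. Since 2 does not divide -3, no dependence is possible: the store always hits an odd index and the load an even one.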

21
Example 2
  • Use renaming to find ILP
  • for (i = 1; i <= 100; i = i + 1)
  •   Y[i] = X[i] / c1;
  •   X[i] = X[i] + c2;
  •   Z[i] = Y[i] + c3;
  •   Y[i] = c4 - Y[i] / c;
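
Renaming removes the anti and output dependences in this loop and leaves only the true ones. The sketch below assumes the four statements are Y[i] = X[i] / c1, X[i] = X[i] + c2, Z[i] = Y[i] + c3 and Y[i] = c4 - Y[i] / c; the operators and the constant `c` are a reading of the slide, and the temporary arrays `T` and `X1` are names introduced for the illustration.

```python
def original(X, c1, c2, c3, c4, c):
    """The loop as written: Y is written twice and X is overwritten."""
    X = X[:]
    n = len(X)
    Y = [0.0] * n
    Z = [0.0] * n
    for i in range(n):
        Y[i] = X[i] / c1        # S1
        X[i] = X[i] + c2        # S2: anti dependence on S1 (S1 reads X[i])
        Z[i] = Y[i] + c3        # S3: true dependence on S1
        Y[i] = c4 - Y[i] / c    # S4: output dependence on S1, anti on S3
    return X, Y, Z


def renamed(X, c1, c2, c3, c4, c):
    """After renaming: only the true dependences on T remain."""
    n = len(X)
    T = [0.0] * n
    X1 = [0.0] * n
    Y = [0.0] * n
    Z = [0.0] * n
    for i in range(n):
        T[i] = X[i] / c1        # Y renamed to T
        X1[i] = X[i] + c2       # X renamed to X1
        Z[i] = T[i] + c3
        Y[i] = c4 - T[i] / c
    return X1, Y, Z
```

After renaming, the second, third and fourth statements depend only on `T[i]` or on the original `X[i]`, so they can be issued in any order once `T[i]` is available. Code after the loop has to read `X1` wherever it previously read `X`.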

22
Other techniques
  • Addi R1, R2, 4
  • Addi R1, R1, 4
  • to
  • Addi R1, R2, 8 (copy propagation)
  • and
  • Add R1, R2, R3
  • Add R2, R1, R5
  • Add R7, R2, R8 (tree height reduction)
  • sum = sum + x[i]
  • sum = (sum + x1) + (x2 + x3) + (x4 + x5)
    (recurrence optimization)
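
The recurrence optimization can be sketched at source level as follows. This is an illustration, not compiler output: the function names are made up, the loop is assumed to be unrolled five times, and integers are used because regrouping floating-point additions can change the rounded result.

```python
def serial_sum(total, x):
    """One long dependence chain: every add waits for the previous one."""
    for value in x:
        total = total + value
    return total


def regrouped_sum(total, x):
    """Five elements per step, regrouped so the inner adds are independent."""
    x1, x2, x3, x4, x5 = x
    left = total + x1       # the only add that needs the running sum
    middle = x2 + x3        # independent of the running sum
    right = x4 + x5         # independent of the running sum
    return (left + middle) + right
```

In the serial version five adds form a chain of length five. In the regrouped version the first three adds are independent of each other, so the chain through the running sum is three adds long.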