Platform-based Design - PowerPoint PPT Presentation

Slides: 66

Provided by: abc774

Transcript and Presenter's Notes

Title: Platform-based Design

1
Platform-based Design

5kk70
2007

2
DSP
Programmable CPU
Programmable DSP
Application specific instruction set processor (ASIP)
Application specific processor
3
[Figure: efficiency (high to low) plotted against flexibility (low to high); from most efficient to most flexible: ASIC, ASIP, DSP, GP processor / FPGA]
4
Programmable CPU cores

introduction
architecture of the MIPS core
discussed as an example
pipelining
application examples
software issues
comparison between different CPU cores
towards application specific architectures
discussion

5
Introduction

rationale: general-purpose -> large market
consequence: often handcrafted design, optimised for clock rate
problem: fast changes in the IC process technology
examples (embedded):
MIPS (first one, licensing the instruction set architecture)
ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, also licensing the micro-architecture as hard or soft IP)
derivatives from general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC

6
Introduction
Instruction set architectures
implicit operands
explicit operands
7
Architecture of the MIPS core
Hennessy & Patterson
8
MIPS instruction formats (32 bits)
Hennessy & Patterson
op: operation of the instruction
rs, rt, rd: source and destination registers
shamt: shift amount
funct: operation of the instruction, part 2
imm: for program constants
addr: target address of a jump
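
The field list above can be made concrete with a small decoder. This is a minimal sketch, assuming the standard MIPS32 field positions (6-bit op, 5-bit rs/rt/rd/shamt, 6-bit funct, 16-bit imm, 26-bit addr); the function name `decode` and the example word are illustrative and not taken from the slides.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31..26, all formats
        "rs": (word >> 21) & 0x1F,     # bits 25..21, R- and I-type
        "rt": (word >> 16) & 0x1F,     # bits 20..16, R- and I-type
        "rd": (word >> 11) & 0x1F,     # bits 15..11, R-type only
        "shamt": (word >> 6) & 0x1F,   # bits 10..6,  R-type only
        "funct": word & 0x3F,          # bits 5..0,   R-type only
        "imm": word & 0xFFFF,          # bits 15..0,  I-type only
        "addr": word & 0x3FFFFFF,      # bits 25..0,  J-type only
    }

# Example word: add with rd = 8, rs = 9, rt = 10 (op = 0, funct = 0x20)
fields = decode(0x012A4020)
```

Which of the returned fields are meaningful depends on the format selected by op; the decoder simply extracts all of them.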
9
Example 1 R - type add instruction
Hennessy & Patterson
10
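Critical path R-type operation
Hennessy & Patterson
[Figure: single-cycle datapath. Clk and PC drive the instruction address into the Instruction Memory; the instruction supplies Rd, Rt, Rs (5 bits each) and Imm (16 bits) to a register file of 32 32-bit registers (ports Rw, Ra, Rb); Data Memory with data address, data in and data out (32 bits each).]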
11
Example 2 I-type load word
Hennessy & Patterson

lw rs, rt, imm16
mem[PC]
addr <- R[rs] + ext(imm16)
R[rt] <- mem[addr]
PC <- PC + 4

12
Example 3 I-type branch
Hennessy & Patterson

beq rs, rt, imm16
mem[PC]
cond <- R[rs] - R[rt]
if (cond == 0)
PC <- PC + 4 + (ext(imm16) * 4)
else
PC <- PC + 4
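
The register transfers for load word and branch can be executed directly. This is a minimal sketch, assuming that ext is sign extension of the 16-bit immediate and that memory is a dictionary from address to word; the `state` layout and the function names are illustrative assumptions.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def step_lw(state, rs, rt, imm16):
    """addr <- R[rs] + ext(imm16); R[rt] <- mem[addr]; PC <- PC + 4"""
    addr = state["R"][rs] + sign_extend16(imm16)
    state["R"][rt] = state["mem"][addr]
    state["PC"] += 4

def step_beq(state, rs, rt, imm16):
    """cond <- R[rs] - R[rt]; taken branch adds ext(imm16) * 4 to PC + 4"""
    cond = state["R"][rs] - state["R"][rt]
    if cond == 0:
        state["PC"] = state["PC"] + 4 + sign_extend16(imm16) * 4
    else:
        state["PC"] += 4

state = {"R": [0] * 32, "mem": {104: 42}, "PC": 0}
state["R"][1] = 100
step_lw(state, rs=1, rt=2, imm16=4)        # loads mem[104] into R[2]
pc_after_lw = state["PC"]
step_beq(state, rs=2, rt=2, imm16=0xFFFF)  # equal registers, offset -1 word
pc_after_taken = state["PC"]
state["R"][3] = 1
step_beq(state, rs=2, rt=3, imm16=8)       # registers differ, falls through
pc_after_not_taken = state["PC"]
```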

13
Example 3 I-type branch
Hennessy & Patterson
[Figure: datapath for the branch. Register file of 32 32-bit registers (ports Rw, Ra, Rb; 5-bit addresses from Rd, Rt, Rs) with Bus W, BusA and BusB (32 bits); Extender for Imm16; Zero output; Next Address Logic driving the PC to the Instruction Memory. Control signals: RegDst, RegWr, ALUctr, ALUSrc, ExtOp, Branch.]
14
Example 3 I-type branch
Hennessy & Patterson
[Figure: next address logic. 30-bit PC; Addr<31:2> and Addr<1:0> = 00 to the Instruction Memory; Instruction<31:0>; SignExt of Imm16 taken from Instruction<15:0>; selection controlled by Branch and Zero.]
15
Architecture of the MIPS core

problem: long critical path
defined by the slowest instruction (load)
solution?
pipelining
break the instruction into smaller steps
all steps have about the same critical path

16
Pipelining lw instructions
Hennessy & Patterson

      cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw    Ifetch    RF read   ALU       dmem      RF write
lw              Ifetch    RF read   ALU       dmem      RF write
lw                        Ifetch    RF read   ALU       dmem      RF write

One instruction enters the pipeline every clock cycle.
One instruction leaves the pipeline every clock cycle.
=> CPI = 1 (Cycles per Instruction)
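
The timing of the pipelined lw instructions can be reproduced with a short calculation. This is a minimal sketch assuming an ideal five-stage pipeline without stalls; the function names are illustrative.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]

def schedule(n_instructions, stages=STAGES):
    """For each instruction, map cycle number (starting at 1) to the stage it occupies."""
    return [
        {start + k + 1: stage for k, stage in enumerate(stages)}
        for start in range(n_instructions)
    ]

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """The first instruction needs n_stages cycles, every later one adds a single cycle."""
    return n_instructions + n_stages - 1

def cpi(n_instructions):
    """Average cycles per instruction, including the pipeline fill time."""
    return total_cycles(n_instructions) / n_instructions

three_loads = schedule(3)
```

For three loads this gives 7 cycles in total; for long instruction sequences the fill time is amortised and the CPI approaches 1.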

17
Pipelining lw instructions
[Figure: the five pipeline stages I, R, A, M, W, with the labels Instructions, Data and Current CPU cycle]
18
4 stages of R-type instruction
Hennessy & Patterson

            cycle 1   cycle 2   cycle 3   cycle 4
E.g. ADD    Ifetch    RF read   ALU       RF write
19
Pipelining lw and R-type instructions
Hennessy & Patterson

       cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw     Ifetch    RF read   ALU       dmem      RF write
add              Ifetch    RF read   ALU       RF write

Both instructions reach RF write in cycle 5.
20
Solution: stretch R-type to 5 stages
Hennessy & Patterson
Ifetch, RF read, ALU, dmem (dummy op, noop), RF write
21
[Figure: pipelined datapath with stages Ifetch, Reg/dec, exec, mem, wr. Blocks: Next PC, Prog mem, Rfile (Ra, Rb, Rw, BusA, BusB, Di), ext. for Imm16, Data mem (adr, Din, Dout). Control signals: RegWr, branch, flags, RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg. Source: Hennessy & Patterson]
22
Data dependencies R-type instructions
Hennessy & Patterson
[Figure: pipeline diagram of five successive R-type instructions that involve register R1]
23
Data dependencies R-type instructions
Hennessy & Patterson
[Figure: pipeline diagram of five successive R-type instructions that involve register R1]
Solution: bypasses
24
Bypasses
Hennessy & Patterson
[Figure: pipelined datapath with bypass paths; labels adr, Data mem]
25
Data dependencies load instruction
Hennessy & Patterson
[Figure: pipeline diagram of a load into R1 (lw) followed by instructions that involve R1]
26
Data dependencies load instruction
Hennessy & Patterson
[Figure: the same instruction sequence with bypasses]
Bypass is no solution for the instruction immediately following the load.
27
Data dependencies load instruction
Hennessy & Patterson
[Figure: the same instruction sequence with pipeline stages IM, RF, DM, RF and a stall after the load]
Solution: pipeline interlock detects a data hazard and stalls the pipeline until the hazard is cleared
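
The interlock decision can be sketched in a few lines. This is a minimal model, assuming a five-stage pipeline with full bypassing, so that only an instruction directly after a load that reads the loaded register costs one stall cycle; the tuple format (opcode, destination, sources) and the function name are illustrative assumptions.

```python
def count_stalls(program):
    """Count stall cycles inserted by a load-use interlock.

    program is a list of (opcode, destination register, source registers).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, destination, _ = previous
        _, _, sources = current
        # Hazard: the loaded value is only available after the mem stage,
        # too late for the exec stage of the very next instruction.
        if opcode == "lw" and destination in sources:
            stalls += 1
    return stalls

with_hazard = [
    ("lw", "R1", ("R2",)),
    ("add", "R3", ("R1", "R4")),   # uses R1 directly after the load: 1 stall
    ("sub", "R5", ("R1", "R3")),   # far enough away: served by bypasses
]
without_hazard = [
    ("lw", "R1", ("R2",)),
    ("add", "R3", ("R6", "R4")),   # independent instruction fills the slot
    ("sub", "R5", ("R1", "R3")),
]
```

Reordering the code so that an independent instruction follows the load removes the stall, which is why compilers try to fill this slot.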
28
Application examples (1)
29
Application examples (1)
19 instructions per tap!!
30
Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!! Very simple in hardware
31
Application examples (2)
Bit level operations: DES example
32
Application examples (2)
Bit level operations: A5 example (GSM encryption)
33
Application examples conclusions

CPUs offer flexibility, but
not efficient in performance
not efficient in code size
not efficient in power consumption

34
Power Consumption in microprocessors

Power consumption is (becoming) the limiting
factor in processor design
Solution in direction of
Hardware acceleration
Instruction Level Parallelism instead of clock
speed
Code size efficiency

source ISSCC2001, Patrick Gelsinger, Intel
35
Amdahl's law

Impact of an improvement on the execution time of a program depends on 2 parameters:
f: fraction of the original computation time that is affected by the improvement
s: speedup factor (local)
exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
speedup_overall = exec_time_old / exec_time_new = 1 / ((1 - f) + f / s)
if s >> 1 then speedup_overall ≈ 1 / (1 - f)
Example: 40% of the program can be executed 10x faster:
speedup_overall = 1 / (0.6 + 0.4 / 10) ≈ 1.56
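
The formula is easy to check numerically. This is a minimal sketch using the slide's symbols f and s; the function name and the argument checks are illustrative additions.

```python
def speedup_overall(f, s):
    """Amdahl's law: f is the affected fraction of the time, s the local speedup."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)

example = speedup_overall(0.4, 10)        # the example from the slide
upper_bound = speedup_overall(0.4, 1e12)  # s >> 1 approaches 1 / (1 - f)
```

Even with an arbitrarily large local speedup, accelerating 40% of the program cannot give more than 1 / 0.6, about 1.67 overall.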

36
Conclusions

Programmable CPU cores are important for the control parts of the application.
They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
"Keep it simple" heuristic (RISC vs. CISC):
Make frequent cases fast and rare cases correct.
Regular (orthogonal) instruction set.
No special features that match a high level language construct.
At least 16 registers to ease register allocation.
Embedded cores are often light cores, which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).

37
Programmable Digital Signal Processors

instruction level parallelism (ILP)
hardware support for loop control
attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
difficult to compare architectures
e.g. DIT, DIF, radix 2/4 FFT: loop unrolling, scaling, shuffling, initialisation can be included or forgotten
benchmarking: Berkeley Design Technology Inc. (BDTI) (compare to SPECint benchmarks for CPUs)

38
Outline

architectures for programmable DSPs
multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
examples: C6 and TM

39
Goal 1 cycle per iteration

position ACR (1 or 2)
adder/subtractor
extra pipelines
asymmetric inputs
multi-precision

Modifications
extra inputs/outputs

40
DSP data types

not every signal requires 32 bits
2 types of DSP: floating point and integer
advantage FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
disadvantage FP: cost (area, speed, power)
wanted: type of output of an operation = type of input (because both are stored in RAM)
no problem for FP, but for integer:
integer multiplication doubles the number of bits: n x n -> 2n
What about fractional numbers?

41
DSP data types

integer and fractional numbers are a special case of fixed point
fix<p,q> (ART Designer, SystemC)
p: total number of bits, q: number of bits after the binary point

Example: fix<8,3>, scale factor 1/8
bit weights:  -2^4   2^3   2^2   2^1   2^0   2^-1   2^-2   2^-3
bits:           1     1     1     0     1     1      0      1
value: -19/8 = -2.375
negative weight on the msb: 2's complement
quantization error
The same ALU handles fix<8,1>, fix<8,2>, fix<8,3>, ...
if q = 0 then integer, e.g. int<8,0>
if q = p-1 then fractional, e.g. int<8,7>
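
The fix<8,3> example can be checked with a small conversion routine. This is a minimal sketch, assuming p is the word length and q the number of fractional bits as in the example; the function name `fix_value` is illustrative.

```python
def fix_value(raw, p, q):
    """Interpret the p-bit pattern raw as a two's complement fix<p,q> number."""
    raw &= (1 << p) - 1          # keep only the p bits of the word
    if raw & (1 << (p - 1)):     # msb set: it carries the negative weight
        raw -= 1 << p
    return raw / (1 << q)        # apply the scale factor 2^-q

bits = 0b11101101
as_fix_8_3 = fix_value(bits, 8, 3)   # scale factor 1/8
as_integer = fix_value(bits, 8, 0)   # q = 0: plain integer
as_fraction = fix_value(bits, 8, 7)  # q = p-1: fractional number
```

The same bit pattern gives -2.375, -19 or -19/128 depending only on q, which is why one ALU can serve all of these types.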
42
DSP data types

continue (after multiplication) with the msb part only
represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
more efficient solution: continue with msb + lsb
sum-of-product operations generate accumulative noise at the 32nd vs. the 16th bit
Still overflow for addition: overflow bits
double precision accumulator
extra overflow bits
shift, round, truncate unit
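
A sum of products with a double precision accumulator can be sketched as follows. The widths are assumptions for illustration only: 16-bit fractional inputs (fix<16,15>), a 40-bit accumulator (32-bit products plus 8 overflow bits) and saturation of the final 16-bit result; the function name is hypothetical.

```python
def mac_fix16(coeffs, samples, acc_bits=40):
    """Sum of products of fix<16,15> values, accumulated at full product precision."""
    acc_min = -(1 << (acc_bits - 1))
    acc_max = (1 << (acc_bits - 1)) - 1
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x                       # 16 x 16 -> 32-bit product, msb and lsb kept
        assert acc_min <= acc <= acc_max   # the overflow bits must absorb the growth
    rounded = (acc + (1 << 14)) >> 15      # round, then shift from 30 to 15 fractional bits
    return max(-(1 << 15), min((1 << 15) - 1, rounded))  # saturate to 16 bits

half = 1 << 14                                # 0.5 in fix<16,15>
quarter = mac_fix16([half], [half])           # 0.5 * 0.5
saturated = mac_fix16([half] * 4, [half] * 4) # sum is 1.0, not representable
```

Rounding happens once, after the last addition, so the quantization noise enters at the lsb of the accumulator rather than at the 16th bit of every product.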

44
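Prog/data memory
[Figure: three memory organisations, each feeding an execution unit (EXU). Von Neumann (sequential): one combined prog/data memory. Harvard: separate prog mem. and data mem. Modified Harvard: prog mem., data mem. 1 and data mem. 2.]
Σ c(i) x(i)
Goal: 1 cycle per iteration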
45
RAM_A
RAM_B
MAC
46
time loop
filter loop i: Σ ci xi
1 cycle/tap?
How to update the delay line?
[Figure: 5-tap FIR filter with delay line x5, x4, x3, x2, x1 connected by four Z^-1 elements, coefficients c5, c4, c3, c2, c1 and output y]
47
Solution 2: indirect addressing

use of a pointer to mark the beginning of the delay line
update the pointer instead of moving the data
problem: trashing of the whole memory
solution: modulo addressing
need for a register to store the pointer
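
The pointer-based delay line can be sketched as a circular buffer. This is a minimal illustration in software of what the ACU does in hardware; the class and function names are hypothetical.

```python
class DelayLine:
    """Delay line whose beginning is marked by a pointer that wraps modulo the length."""

    def __init__(self, length):
        self.buf = [0] * length
        self.start = 0                    # the register that stores the pointer

    def push(self, sample):
        # Update the pointer instead of moving the data; the oldest sample is overwritten.
        self.start = (self.start - 1) % len(self.buf)
        self.buf[self.start] = sample

    def tap(self, i):
        # Sample delayed by i steps, read with modulo addressing.
        return self.buf[(self.start + i) % len(self.buf)]

def fir_step(delay, coeffs, sample):
    """Insert one input sample and compute one output sample of the filter."""
    delay.push(sample)
    return sum(c * delay.tap(i) for i, c in enumerate(coeffs))

coeffs = [1, 2, 3]
delay = DelayLine(len(coeffs))
impulse_response = [fir_step(delay, coeffs, x) for x in [1, 0, 0, 0]]
```

Feeding a unit impulse returns the coefficients in order, followed by zero once the impulse has left the delay line.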

48
ACU architecture and Instruction set
Registers: A, S

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo
16 = 10 000, 23 = 10 111 (upper bits: hold, lower bits: mask)
Modulo can be implemented as a mask operation if the size is 2^k
output to RAM
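
The mask operation can be written out for the example range 16 to 23. This is a minimal sketch, assuming the buffer is aligned to its size 2^k; the function name is illustrative.

```python
def modulo_step(addr, step, size):
    """Advance addr by step inside an aligned buffer of size 2^k using only masks."""
    assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
    mask = size - 1
    hold = addr & ~mask                  # upper bits select the buffer and are held
    wrapped = (addr + step) & mask       # lower bits wrap around within the buffer
    return hold | wrapped

after_last = modulo_step(23, 1, 8)    # 10 111 + 1 wraps to 10 000
before_first = modulo_step(16, -1, 8) # 10 000 - 1 wraps to 10 111
```

No comparison or division is needed, which is what makes this cheap enough to run in parallel with the data path.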
49
Addressing modes

register     ADD R4, R3        R[R4] <- R[R4] + R[R3]
immediate    ADD R4, 3         R[R4] <- R[R4] + 3
direct       ADD R4, (100)     R[R4] <- R[R4] + Mem[100]
indirect     ADD R4, (R3)      R[R4] <- R[R4] + Mem[R[R3]]
w. inc/dec   ADD R4, (R3)+     R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + 1
indexed      ADD R4, (R3+R2)   R[R4] <- R[R4] + Mem[R[R3]]; R[R3] <- R[R3] + R[R2]

Remarks
direct for static data
indirect for arrays
inc/dec for stepping through arrays, e.g. Σ x[n]
index for stepping through arrays, e.g. Σ x[2n]

50
Addressing modes: extra for DSP

8 ARs (address or auxiliary registers) available
extra indirect modes:
circular      ARn         post inc/dec by 1, circular
circular      ARn AR0     post inc/dec by AR0, circular
bit reverse   ARn AR0 B   post inc/dec by AR0, bit reversed
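
Bit-reversed addressing produces the access order needed by radix-2 FFTs. This is a minimal software sketch of the index sequence only; it does not model the AR0 increment of any particular DSP, and the function name is illustrative.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current lsb to the other end
        index >>= 1
    return result

# Access order for an 8-point FFT (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
```

Generating this order in the address unit saves a separate shuffling pass over the data.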

51
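[Figure: DSP controller and data path. Control side: PC with increment (1), Stack, Reset and Interrupt address inputs, Program Memory, IR, Control Bus. Data side: ACU_A and ACU_B with address registers AR_A and AR_B, RAM_A and RAM_B with data registers DR_A and DR_B, MAC, ALU, Rfile.]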
52
first solution
Σ c(i) x(i)
Not shown: coefficient RAM + ACU
[Figure: schedule of resources versus time (cc)]
6 clock cycles/sample
limit: pipelines in the controller
53
Loopfolding (software pipelining)
54
Loopfolding (software pipelining)
Σ c(i) x(i)
Pre- and postamble; 4 clock cycles/sample
55
hardware support for loop control
Σ c(i) x(i)
1 clock cycle/sample: repeat instruction and repeat block
56
TMS320C5000
[Figure: data path of the TMS320C5000. T register; multiplier (17 x 17) with sign control and fractional mode; adder (40); ALU (40); accumulators A (40) and B (40); barrel shifter; COMP; MSW/LSW select; TRN; TC; ZERO, SAT and ROUND; multiplexers selecting among the A, B, C, D, E, P and T inputs.]
57
Motorola 56K family
[Figure: block diagram of the Motorola 56K family.
Address ALU; X, Y and P address buses; external address switch; address bus (16 bits)
X memory: 256-by-24-bit RAM, 256-by-24-bit ROM
Y memory: 256-by-24-bit RAM, 256-by-24-bit ROM
Program memory: 2,048-by-24-bit ROM
X data, Y data, P data and global data buses; internal and external data-bus switches; data bus (24 bits)
Data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result
Program controller; clock; interrupt
On-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control
I/O ports (labels: 24 bits, 2 bits, 3 bits, 7 bits)]
58

R.E.A.L.
[Figure: R.E.A.L. DSP core. Program control unit; program memory (Z data); instruction decoder; 96-bit instructions; 16-bit buses for X data, Y data and Z data; operand registers X, Y0, Y1; two 16-by-16-bit multipliers producing P0 and P1, each followed by a scale unit; two 40-bit arithmetic-logic units; shift; four 40-bit accumulators; saturation and saturation/scale units.]
59
source
Front end: lexical analysis, syntax analysis, semantic analysis
Intermediate machine independent representation
Code generation: code selection, register allocation, scheduling
scheduling: 1 instr = // ops, order of instr
code
60
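Intermediate machine independent representation
[Figure: control flow graph of basic blocks BBi, BBj, BBk; data flow graph of one basic block with inputs a, b, c, d, intermediate values t1 (from a, b), t2 (from c, d), t3 (from t1, c) and output out (from t2, t3)]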

61
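Code selection example
ADSP Analog Devices
[Figure: ADSP data path with d memory and p memory. ALU with input registers ax, ay, af (operands x, y) and result register ar; MAC with input registers mx, my, mf (operands x, y) and result register mr.]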
62
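Example of code selection: covering of intermediate representation with RTPs
[Figure: the data flow graph (inputs a, b, c, d; values t1, t2, t3) covered with three numbered register-transfer patterns. Patterns shown: transfers between dmem/pmem and mx, my, ax, ay, mr; an ALU pattern on ar, ax, ay; MAC patterns on mr, mx, my; a move between ar and my.]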
63

Problems
local decisions which have a global impact
phase coupling example
ASAP schedule
maximal freedom for scheduling
code selection during scheduling
register allocation comes afterwards
can lead to infeasible solutions

64
phase coupling: discussion
It is very difficult and almost impossible to develop robust and efficient DSP compilers.
Current DSP practice: programming in assembler
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
Efficiency: exploit instruction level parallelism (ILP)
Compilation: systematic positioning of registers and regular interconnect
VLIW: Very Long Instruction Word
65