Title: Stream Compilation for Real-time Embedded Systems
1. Stream Compilation for Real-time Embedded Systems
- Yoonseo Choi, Yuan Lin, Nathan Chong, Scott Mahlke, and Trevor Mudge
- University of Michigan, ARM Ltd.
2. Stream Programming
- Programming style
- - Embedded domain: audio/video (H.264), wireless (WCDMA)
- - Mainstream: continuous query processing, search
- Stream
- - Collection of data records
- Kernels/Filters
- - Functions applied to streams
- - Inputs and outputs are streams
- Coarse-grain dataflow
- - Amenable to aggressive compiler optimizations [ASPLOS '02, '06; PLDI '03]
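As a minimal sketch of the stream/filter model above (not taken from the slides), Python generators can stand in for filters: each one consumes a stream of records and produces a stream. The filter names `scale` and `moving_sum` and the helper `run_pipeline` are hypothetical.

```python
from typing import Iterable, Iterator, List


def scale(stream: Iterable[int], factor: int) -> Iterator[int]:
    """Hypothetical filter: multiply every record by a constant."""
    for record in stream:
        yield record * factor


def moving_sum(stream: Iterable[int], window: int) -> Iterator[int]:
    """Hypothetical filter: sum over a sliding window of records."""
    buf: List[int] = []
    for record in stream:
        buf.append(record)
        if len(buf) > window:
            buf.pop(0)
        if len(buf) == window:
            yield sum(buf)


def run_pipeline(records: Iterable[int]) -> List[int]:
    # Coarse-grain dataflow: the output stream of one filter feeds the next.
    return list(moving_sum(scale(records, 2), window=2))
```

Because each filter only touches its input and output streams, the dependences between filters are explicit, which is what makes this style amenable to compiler scheduling.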
3. Compiling Stream Programs
[Figure: a compiler maps a stream program onto a multicore system; the open question is how]
- Coarse-grain software pipelining [PLDI '08]
- - Equal work distribution
- - Communication/computation overlap
- - Assumed an infinite amount of local memory
- Local storage constraints
- - Spilling to main memory, or an infeasible solution
- Latency constraints
- - Often found in stream programs
4. Target Architecture
- Target: Cell processor
- Cores with disjoint address spaces
- Explicit copy to access remote data
- DMA engine independent of PEs
5. Outline
- Review stream graph modular scheduling
- Memory-aware stream graph scheduling
- Latency-aware stream graph scheduling
- Experimental results
6. Processor Assignment: Maximizing Throughput
- Assigns each filter to a processor
- Partitioning problem: NP-hard
[Figure: filters with workloads W assigned to four processing elements, PE0 to PE3]
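The slide only states that the partitioning problem is NP-hard. As an illustration of the objective, the following sketch uses a greedy longest-processing-time-first heuristic, which is a swapped-in technique and not the method of the talk. The function names and the sample workloads are assumptions.

```python
import heapq
from typing import Dict


def assign_filters_lpt(workloads: Dict[str, int], num_pes: int) -> Dict[str, int]:
    """Greedy heuristic (not an exact solver): heaviest filter first,
    always placed on the currently least-loaded PE."""
    # Min-heap of (current load, PE index).
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    assignment = {}
    for name, work in sorted(workloads.items(), key=lambda kv: (-kv[1], kv[0])):
        load, pe = heapq.heappop(heap)
        assignment[name] = pe
        heapq.heappush(heap, (load + work, pe))
    return assignment


def initiation_interval(workloads: Dict[str, int],
                        assignment: Dict[str, int], num_pes: int) -> int:
    """The II of the pipeline is bounded below by the most heavily loaded PE."""
    loads = [0] * num_pes
    for name, pe in assignment.items():
        loads[pe] += workloads[name]
    return max(loads)
```

Balancing the per-PE load minimizes the II, which is the sense in which processor assignment maximizes throughput.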
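7. Forming Pipelines: Stage Assignment
- Assigns each filter to a pipeline stage
- Producer-consumer dependence, filters i and j on the same PE: $S_j \ge S_i$
- Filters i and j on different PEs, with a DMA in between: $S_{DMA} > S_i$ and $S_j = S_{DMA} + 1$
- Communication-computation overlap
- Stages are assigned by traversing the graph in dataflow order
[Figure: software pipeline with prologue, steady state of length II, and epilogue]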
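The stage rules above can be sketched as a single pass over the filters in dataflow (topological) order. This is a minimal illustration under the assumption that each DMA is placed in the stage immediately before its consumer; the function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_stages(order: List[str],
                  edges: List[Tuple[str, str]],
                  pe_of: Dict[str, int]):
    """order: filters in topological order; edges: (producer, consumer) pairs;
    pe_of: filter -> PE index. Returns (filter stages, DMA stages)."""
    preds = {f: [] for f in order}
    for src, dst in edges:
        preds[dst].append(src)
    stage = {f: 0 for f in order}
    dma_stage = {}
    for f in order:
        for p in preds[f]:
            # Same PE: S_j >= S_i. Different PEs: leave one stage for the DMA.
            gap = 0 if pe_of[p] == pe_of[f] else 2
            stage[f] = max(stage[f], stage[p] + gap)
        for p in preds[f]:
            if pe_of[p] != pe_of[f]:
                # S_j = S_DMA + 1, which also keeps S_DMA > S_i.
                dma_stage[(p, f)] = stage[f] - 1
    return stage, dma_stage
```

Putting the DMA in its own stage is what lets the transfer for one iteration overlap with computation for another.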
8. Excess Buffer Requirements
- Maximum throughput, but not feasible!
- Multiple buffering
[Figure: schedule with II = 50 on PE0 to PE3; buffer usage of local stores LS 0 to LS 3 compared against an LS size of 14]
9. Memory-aware Stream Graph Scheduling
- Phased approach for solving each step optimally
- Maximizes the usage of limited local storage
- Attempts to find more solutions without degrading performance
- Phases
- - Buffer requirement estimation using conservative stage assignment (polynomial); produces the memory requirement
- - Processor assignment under memory constraints (NP-hard); produces the best-so-far processor assignment
- - Stage optimization for reducing buffers/DMAs and stages (polynomial); produces the scheduling result
10. Buffer Usage Estimation Using Conservative Stage Assignment
- Variable: assignment of filter i to PE j
- Buffer usage of filter i
- Maximize throughput under memory constraints!
11. Memory-aware Processor Assignment
- Starts with the same filter workloads, the same local storage size, and the same processors
- Considers the buffer requirements per filter that will be allocated to the local storage of the assigned processor
- Generates a different processor assignment that fits into the LS
[Figure: processor assignment across PE0 to PE3]
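To illustrate how a local storage limit changes the assignment, here is a greedy sketch that only places a filter on a PE whose LS still has room for that filter's buffers. It is a heuristic stand-in, not the optimal formulation the slides refer to, and the function name, parameters, and sample numbers are assumptions.

```python
from typing import Dict, Optional


def assign_with_memory(workloads: Dict[str, int], buffers: Dict[str, int],
                       num_pes: int, ls_size: int) -> Optional[Dict[str, int]]:
    """Return filter -> PE, or None if this greedy heuristic finds no fit."""
    loads = [0] * num_pes   # accumulated workload per PE
    used = [0] * num_pes    # accumulated buffer usage per local store
    assignment = {}
    for name in sorted(workloads, key=lambda n: (-workloads[n], n)):
        # Only PEs whose local store can still hold this filter's buffers.
        candidates = [pe for pe in range(num_pes)
                      if used[pe] + buffers[name] <= ls_size]
        if not candidates:
            return None
        pe = min(candidates, key=lambda p: (loads[p], p))
        assignment[name] = pe
        loads[pe] += workloads[name]
        used[pe] += buffers[name]
    return assignment
```

A `None` result only means that this heuristic failed; an exact search could still find an assignment that fits.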
12. Reducing Overheads: Stage Optimization
- Always minimizes buffers/DMAs/stages from the given schedule
[Figure: initial stages S0 to S8 for filters A to F and the DMAs between them]
13. Latency-aware Stream Graph Scheduling
- Does not always maximize throughput
- Achieves the throughput that can match the given latencies
- Generates a schedule that satisfies the latency constraints using the fewest PEs
- Phases, given the latency constraints
- - Calculate the target throughput
- - Processor assignment for achieving the target throughput
14. Latency Constraints within a Stream Graph
[Figure: stream graph with filters A to F]
- $lat(A, C) = start(C) - completion(A)$
- $LAT = \{lat(A, C), lat(B, E)\}$
15. Latency-aware Stream Graph Scheduling
- Calculate the target throughput
- - Conservative stage assignment to make $(S_j - S_i + 1)$ a constant value
- - Calculates $a$, where $II \le a$ and $a = \min(lat(i, j) / (S_j - S_i + 1))$
- Processor assignment
- - Minimize the number of PEs while achieving $a$ (bin packing)
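The two steps above can be sketched as follows, assuming the stage span between constrained filters is $(S_j - S_i + 1)$. The packing step uses first-fit decreasing, a common bin-packing heuristic chosen here for illustration, so the PE count it returns is an estimate rather than a guaranteed minimum. All names and sample values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def target_ii(latencies: Dict[Tuple[str, str], float],
              stage: Dict[str, int]) -> float:
    """a = min over constrained pairs of lat(i, j) / (S_j - S_i + 1)."""
    return min(lat / (stage[j] - stage[i] + 1)
               for (i, j), lat in latencies.items())


def pes_first_fit(workloads: Dict[str, int], a: float) -> Optional[int]:
    """First-fit decreasing: each PE is a bin of capacity a.
    Returns None if a single filter already exceeds a."""
    bins = []
    for work in sorted(workloads.values(), reverse=True):
        if work > a:
            return None
        for k in range(len(bins)):
            if bins[k] + work <= a:
                bins[k] += work
                break
        else:
            bins.append(work)  # open a new PE
    return len(bins)
```

A tighter latency constraint lowers `a`, which shrinks the bins and therefore raises the number of PEs needed.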
16. Bounds on the Number of PEs
- $LB_{PE}$: solution from latency-aware scheduling
- $II_{best}$: best possible II, the largest workload among all filters
- $UB_{PE}$: solution from latency-aware scheduling when $a$ is substituted by $II_{best}$
- $LB_{PE} \le num(PE) \le UB_{PE}$
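17. Design Space Exploration: Memory and Latency
- Inputs
- - Maximum workload
- - Timing constraints
- - Memory constraints
- Calculate $UB_{PE}$ and $LB_{PE}$; set $UB_{PE} = \min(UB_{PE}, Available_{PE})$
- If $UB_{PE} < LB_{PE}$: no feasible solution
- Otherwise start with $P = LB_{PE}$ and run memory-aware scheduling
- - If a solution exists: solution found!
- - If not and $P = UB_{PE}$: no feasible solution
- - If not and $P < UB_{PE}$: set $P = P + 1$ and repeat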
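The exploration loop can be sketched as below. The callable `memory_aware_schedule` is a hypothetical stand-in for the memory-aware scheduler: it takes a PE count and returns a schedule, or `None` when nothing fits.

```python
from typing import Callable, Optional, Tuple


def explore(lb_pe: int, ub_pe: int, available_pe: int,
            memory_aware_schedule: Callable[[int], Optional[dict]]
            ) -> Optional[Tuple[int, dict]]:
    """Try P = LB_PE, LB_PE + 1, ..., UB_PE and return the first feasible
    (P, schedule), or None when there is no feasible solution."""
    ub_pe = min(ub_pe, available_pe)
    if ub_pe < lb_pe:
        return None
    for p in range(lb_pe, ub_pe + 1):
        schedule = memory_aware_schedule(p)
        if schedule is not None:
            return p, schedule
    return None
```

Starting from the lower bound means the first feasible schedule found also uses the fewest PEs within the explored range.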
18. Experimental Results
- Benchmarks: software-defined radio protocols
- - WCDMA: common 3G protocol
- - DVB: digital media broadcasting protocol
- - 4G: next-generation wireless protocol
- - 10 to 20 filters
- Platform
- - PS3: up to 6 SPEs
- Software
- - SPEX-C to C: SUIF
- - IBM Cell SDK 3.0
19. Scalability of Memory-aware Scheduling
[Figure: speedup versus number of PEs (1 to 6) for 4G, DVB, and WCDMA, comparing calculated II with measured execution time; panel annotations Ub 15, Ub 4, Ub 5]
- Synchronization cost
- Unhidden communication cost
- Imbalanced task set: tiny workloads smaller than a DMA
- Centralized DMAs
20. Memory-aware Stream Graph Scheduling
[Figure: number of PEs (1 to 6) versus LS size (32K, 64K, 128K, 256K, 512K, 1M) for 4G, DVB, and WCDMA]
- Sum of total data sizes: 4G 200KB, DVB 133KB, WCDMA 90KB
21. Conclusions
- Coarse-grain software pipelining of stream programs considering
- - memory constraints
- - latency constraints
- Performance summary
- - Up to 50 more scheduling solutions
- - Does not degrade the quality of the solutions
- Future directions
- - Modeling DMA costs, reducing synchronization costs
- - Getting uniform workload
23. Input language
- Input language: stylized C
- A kernel function

    func_a(int a[], int b[]) {
        int i, j;
        int dat;
        for (i = 0; i < counter; i++) {
            dat = b[i];
            dat = dat * dat + 10;  /* operators lost in extraction; shown for illustration only */
            a[i] = dat;
        }
    }

- Main function

    stream {  // enclosing the stream structure
        for (i = 0; i < 1000; i++) {
            func_a(aout, ain);
            func_b(bout, aout);
            func_c(cout, bout);
            func_d(ifout, cout);
            func_e(eout, ifout);
        }
    }
24. Parallelized Code on Cell
- PPU: threads send commands to the SPUs (three threads shown, each as below)

    void thread() {
        for (..) {
            if (s0) { doDMA(..); blockingRead(..); }
            if (s1) { runfilter(..); blockingRead(..); }
            barrier();
        }
    }

- SPU: command loop (three SPU programs shown, each as below)

    // kernel function definitions
    // kernel stub definitions
    // data buffer definitions
    while (1) {
        switch (cmd) {
            case runFilter: ..
            case DMA: ..
        }
        // send ACK to PPE
    }
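The command/acknowledge handshake between a PPU thread and an SPU can be simulated with ordinary threads and queues. This is only a sketch of the protocol shape, not Cell SDK code: the queues stand in for the command and ACK channels, the `"stop"` command is a hypothetical addition for clean shutdown, and the stage predicates and barrier from the slide are left out because only one worker is modelled.

```python
import queue
import threading
from typing import List


def spu_worker(commands: queue.Queue, acks: queue.Queue, log: List[str]) -> None:
    """Stand-in for the SPU loop: wait for a command, act on it, acknowledge."""
    while True:
        cmd = commands.get()
        if cmd == "stop":      # hypothetical shutdown command, not on the slide
            acks.put("stop")
            return
        log.append(cmd)        # placeholder for running a filter or a DMA
        acks.put(cmd)          # send ACK back to the controller


def ppu_thread(iterations: int) -> List[str]:
    """Stand-in for one PPU control thread driving one SPU worker."""
    commands, acks, log = queue.Queue(), queue.Queue(), []
    worker = threading.Thread(target=spu_worker, args=(commands, acks, log))
    worker.start()
    for _ in range(iterations):
        for cmd in ("DMA", "runFilter"):
            commands.put(cmd)
            acks.get()         # blocking read: wait for the ACK
    commands.put("stop")
    acks.get()
    worker.join()
    return log
```

Because the controller blocks on every ACK, the commands are processed strictly in the order they are issued.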