Stream Compilation for Real-time Embedded Systems - PowerPoint PPT Presentation

1
Stream Compilation for Real-time Embedded Systems
  • Yoonseo Choi, Yuan Lin, Nathan Chong, Scott
    Mahlke, and Trevor Mudge
  • University of Michigan
  • ARM Ltd.

2
Stream Programming
  • Programming style
    • Embedded domain: audio/video (H.264), wireless (WCDMA)
    • Mainstream: continuous query processing, search
  • Streams
    • Collections of data records
  • Kernels/Filters
    • Functions applied to streams
    • Input/output are streams
  • Coarse-grain dataflow
    • Amenable to aggressive compiler optimizations
      [ASPLOS '02, '06; PLDI '03]

3
Compiling Stream Programs

[Figure: a compiler maps a stream program onto a multicore system]

  • Coarse-grain software pipelining [PLDI '08]
    • Equal work distribution
    • Communication/computation overlap
    • Assumed an infinite amount of local memory
  • Local storage constraints
    • Spilling to main memory yields an infeasible solution
  • Latency constraints
    • Often found in stream programs

4
Target Architecture
  • Target: the Cell processor
  • Cores with disjoint address spaces
  • Explicit copy to access remote data
  • DMA engine independent of PEs

5
Outline
  • Review stream graph modulo scheduling
  • Memory-aware stream graph scheduling
  • Latency-aware stream graph scheduling
  • Experimental results

6
Processor Assignment: Maximizing Throughput
  • Assigns each filter to a processor
  • The partition problem is NP-hard

[Figure: filters with workloads (W) partitioned across four processing
elements, PE0-PE3]
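The processor-assignment step above is an instance of the NP-hard partition problem. A minimal sketch of one standard greedy heuristic (longest-processing-time first), not the paper's actual assignment algorithm; all names and workloads are illustrative:

```python
import heapq

def lpt_partition(workloads, num_pes):
    """Assign each filter (id -> work) to the currently least-loaded PE,
    visiting filters in decreasing order of workload (LPT heuristic)."""
    heap = [(0, pe) for pe in range(num_pes)]  # (accumulated load, PE id)
    heapq.heapify(heap)
    assignment = {}
    for fid, work in sorted(workloads.items(), key=lambda kv: -kv[1]):
        load, pe = heapq.heappop(heap)
        assignment[fid] = pe
        heapq.heappush(heap, (load + work, pe))
    return assignment

def initiation_interval(workloads, assignment, num_pes):
    """II of the steady state is bounded below by the most loaded PE."""
    loads = [0] * num_pes
    for fid, pe in assignment.items():
        loads[pe] += workloads[fid]
    return max(loads)
```

Balancing the per-PE loads minimizes the initiation interval (II), which is what "maximizing throughput" means for the steady state of the pipeline.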
7
Forming Pipelines: Stage Assignment
  • Assigns each filter to a pipeline stage
  • Producer-consumer dependence on the same PE: Sj >= Si
  • Across PEs, a DMA stage is inserted: S_DMA > Si and Sj >= S_DMA + 1
  • Enables communication-computation overlap
  • Stages are assigned by traversing the graph in dataflow order

[Figure: pipelined schedule with prologue, steady state of length II, and
epilogue across PE1, the DMA engine, and PE2]
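The dataflow-order traversal above can be sketched as follows. The two constraints (Sj >= Si on the same PE; a DMA stage at Si + 1, so Sj >= Si + 2 across PEs) come from the slide; the data structures and names are illustrative:

```python
def assign_stages(preds, assignment, topo_order):
    """preds: filter -> list of producers; assignment: filter -> PE.
    Visits filters in topological (dataflow) order and gives each one
    the earliest stage satisfying all producer-consumer constraints."""
    stage = {}
    for j in topo_order:
        s = 0
        for i in preds.get(j, []):
            if assignment[i] == assignment[j]:
                s = max(s, stage[i])       # same PE: Sj >= Si
            else:
                s = max(s, stage[i] + 2)   # DMA at Si + 1, Sj >= S_DMA + 1
        stage[j] = s
    return stage
```

Placing the DMA in its own stage is what lets communication for one iteration overlap with computation of another.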
8
Excess Buffer Requirements
  • Multiple buffering can make a schedule infeasible
  • Maximum throughput (II = 50), but not feasible: the buffers required
    on some PEs exceed the local store (LS) size of 14

[Figure: filter graph with per-edge data sizes mapped onto PE0-PE3; the
buffer requirement in each PE's local store (LS 0-3) exceeds the LS size
of 14]
9
Memory-aware Stream Graph Scheduling
  • Phased approach, solving each step optimally
  • Maximizes the usage of limited local storage
  • Attempts to find more solutions without degrading performance

Phases:
  1. Buffer requirement estimation using conservative stage assignment
     (polynomial) -> memory requirement
  2. Processor assignment under memory constraints (NP-hard)
     -> best-so-far processor assignment
  3. Stage optimization for reducing buffers/DMAs and stages (polynomial)
     -> scheduling result
10
Buffer Usage Estimation Using Conservative Stage Assignment
  • Decision variable: assignment of filter i to PE j
  • Estimated quantity: buffer usage of filter i
  • Objective: maximize throughput under memory constraints
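A sketch of the conservative estimate, under the assumption (following the multiple-buffering discussion) that every edge is charged as if it crossed PEs, so Sj - Si = 2 and each edge needs Sj - Si + 1 = 3 live copies of its data. Function and variable names are hypothetical, not the paper's formulation:

```python
def conservative_buffer_usage(out_edges, data_size):
    """out_edges: filter -> list of consumers; data_size: (i, j) -> bytes.
    Charges the producer of each edge the worst-case multiple-buffering
    cost: 3 copies per edge (conservative stage gap of 2)."""
    usage = {}
    for i, consumers in out_edges.items():
        usage[i] = sum(3 * data_size[(i, j)] for j in consumers)
    return usage
```

Being conservative before the processor assignment is known keeps the estimate independent of the assignment, so it can feed the memory-constrained assignment step directly.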
11
Memory-aware Processor Assignment
  • Starts with the same filter workloads, local storage sizes, and
    processors
  • Considers the buffer requirements per filter that will be allocated
    in the local storage of the assigned processor
  • Generates different processor assignments that fit into the LS

[Figure: alternative filter-to-PE mappings across PE0-PE3 that respect
the local storage limits]
12
Reducing Overheads: Stage Optimization
  • Always minimizes buffers, DMAs, and stages in the given schedule

[Figure: initial stage assignment S0-S8 for filters A-F with interleaved
DMA stages; stage optimization collapses stages that are not needed]
13
Latency-aware Stream Graph Scheduling
  • Latency constraints
  • Does not always maximize throughput
  • Achieves a throughput that matches the given latencies
  • Generates a schedule that satisfies the latency constraints using
    the least number of PEs

Phases:
  1. Calculate the target throughput
  2. Processor assignment for achieving the target throughput
14
Latency Constraints within a Stream Graph

[Figure: stream graph with filters A-F and latency constraints on the
pairs (A, C) and (B, E)]

  • lat(A, C) = start(C) - completion(A)
  • LAT = { lat(A, C), lat(B, E) }
15
Latency-aware Stream Graph Scheduling
  • Calculate the target throughput
    • Conservative stage assignment makes (Sj - Si + 1) a constant value
    • Calculate α, where II <= α and
      α = min over constraints of lat(i, j) / (Sj - Si + 1)
  • Processor assignment
    • Minimize the number of PEs while achieving α (bin-packing)
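The two steps above can be sketched as follows: α is derived from the latency constraints, and a first-fit-decreasing bin-packing heuristic stands in for the processor-assignment step (a standard heuristic, not necessarily the paper's; names are illustrative):

```python
def target_ii(latency_constraints, stage):
    """latency_constraints: (i, j) -> max allowed latency; stage: filter
    -> pipeline stage.  Each constraint lat(i, j) spans (Sj - Si + 1)
    stages of length II, so II <= lat(i, j) / (Sj - Si + 1); take the
    tightest bound as alpha."""
    return min(lat / (stage[j] - stage[i] + 1)
               for (i, j), lat in latency_constraints.items())

def min_pes_for_ii(workloads, alpha):
    """First-fit decreasing: pack filters onto the fewest PEs such that
    no PE's total workload exceeds the target II (alpha)."""
    bins = []  # accumulated workload per PE
    for w in sorted(workloads.values(), reverse=True):
        for k in range(len(bins)):
            if bins[k] + w <= alpha:
                bins[k] += w
                break
        else:
            bins.append(w)  # open a new PE
    return len(bins)
```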

16
Bounds on the Number of PEs
  • LB_PE: solution from latency-aware scheduling
  • II_best: the best possible II, i.e., the largest workload among all
    filters
  • UB_PE: solution from latency-aware scheduling when α is substituted
    by II_best
  • LB_PE <= num(PE) <= UB_PE

17
Design Space Exploration: Memory and Latency
  • Inputs
    • Maximum workload
    • Timing constraints
    • Memory constraints
  • Procedure
    1. Calculate UB_PE and LB_PE; set UB_PE = min(UB_PE, Available_PE)
    2. If UB_PE < LB_PE: no feasible solution
    3. Set P = LB_PE and run memory-aware scheduling with P PEs
    4. If a solution exists: solution found!
    5. Otherwise set P = P + 1; if P > UB_PE there is no feasible
       solution, else go to step 3
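The exploration loop above can be sketched as follows, with `try_schedule` standing in for the memory-aware scheduler (a hypothetical callback, not an API from the paper):

```python
def explore(lb_pe, ub_pe, available_pe, try_schedule):
    """Sweep PE counts from LB_PE up to UB_PE (clamped to the available
    PEs) and return the first count for which the memory-aware scheduler
    finds a feasible schedule, or None if none exists."""
    ub = min(ub_pe, available_pe)
    if ub < lb_pe:
        return None  # no feasible solution
    for p in range(lb_pe, ub + 1):
        sol = try_schedule(p)  # memory-aware scheduling with p PEs
        if sol is not None:
            return p, sol      # solution found
    return None  # no feasible solution within the PE budget
```

Starting from LB_PE means the first feasible count found is also the minimum, so the sweep never needs to continue past a success.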
18
Experimental Results
  • Benchmarks: software-defined radio protocols
    • WCDMA: a common 3G protocol
    • DVB: digital media broadcasting protocol
    • 4G: next-generation wireless protocol
    • 10 to 20 filters each
  • Platform
    • PS3: up to 6 SPEs
  • Software
    • SPEX-C to C (SUIF)
    • IBM Cell SDK 3.0

19
Scalability of Memory-aware Scheduling

[Figure: speedup vs. number of PEs (1-6) for 4G (Ub = 15), DVB (Ub = 4),
and WCDMA (Ub = 5), comparing the calculated II against measured
execution time]

  • Gap between calculated II and measured execution time
    • Synchronization cost
    • Unhidden communication cost
  • Imbalanced task sets: tiny workloads smaller than a DMA
  • Centralized DMAs
20
Memory-aware Stream Graph Scheduling

[Table: for each benchmark (4G, DVB, WCDMA), the combinations of PE count
(1-6) and local store size (32K-1M) for which a feasible schedule was
found; larger local stores admit solutions on more PE counts]

  • Sum of total data sizes: 4G 200KB, DVB 133KB, WCDMA 90KB
21
Conclusions
  • Coarse-grain software pipelining of stream programs considering
    • memory constraints
    • latency constraints
  • Performance summary
    • Up to 50% more scheduling solutions
    • Does not degrade the quality of the solutions
  • Future directions
    • Modeling DMA costs, reducing synchronization costs
    • Obtaining uniform workloads

22
  • Thank you!

23
Input language
  • Input language: stylized C

// A kernel function
void func_a(int a[], int b[]) {
  int i, j;
  int dat;
  for (i = 0; i < counter; i++) {
    dat = b[i];
    dat = dat * dat + 10;
    a[i] = dat;
  }
}

// Main function
stream {                      // enclosing the stream structure
  for (i = 0; i < 1000; i++) {
    func_a(aout, ain);
    func_b(bout, aout);
    func_c(cout, bout);
    func_d(ifout, cout);
    func_e(eout, ifout);
  }
}
24
Parallelized Code on Cell
  • Function offloading
  • The PPU sends commands; each SPU executes them and ACKs back to the
    PPE

PPU (one command thread per SPU):

void thread() {
  for (;;) {
    if (s0) { doDMA(..); blockingRead(..); }
    if (s1) { runFilter(..); blockingRead(..); }
    barrier();
  }
}

SPU (replicated on each SPE):

// kernel function definitions
// kernel stub definitions
// data buffer definitions
while (1) {
  switch (cmd) {
    case runFilter: ...
    case DMA: ...
  }
  // send ACK to PPE
}