Stream Compilation for Real-time Embedded Systems

About This Presentation

Description:

Stream Compilation for Real-time Embedded Systems. Yoonseo Choi, Yuan Lin, Nathan Chong ... Latency-aware Stream Graph Scheduling. Does not always maximize ...

Slides: 25

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes

1
Stream Compilation for Real-time Embedded Systems

Yoonseo Choi, Yuan Lin, Nathan Chong, Scott
Mahlke, and Trevor Mudge
University of Michigan,
ARM Ltd.

2
Stream Programming

Programming style
- Embedded domain: audio/video (H.264), wireless (WCDMA)
- Mainstream: continuous query processing, search
Stream
- Collection of data records
Kernels/Filters
- Functions applied to streams
- Input/Output are streams
Coarse-grain dataflow
- Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI'03]

3
Compiling Stream Programs
[Figure: a compiler maps a stream program onto a multicore system]

Coarse-grain software pipelining [PLDI'08]
- Equal work distribution
- Communication/computation overlap
- Assumed an infinite amount of local memory
Local storage constraints
- Spilling to main memory, infeasible solution
Latency constraints
- Often found in stream programs

4
Target Architecture

Target: Cell processor
Cores with disjoint address spaces
Explicit copy to access remote data
DMA engine independent of PEs

5
Outline

Review: stream graph modulo scheduling
Memory-aware stream graph scheduling
Latency-aware stream graph scheduling
Experimental results

6
Processor Assignment: Maximizing Throughput

Assigns each filter to a processor

Partition problem: NP-hard
[Figure: filters with workload W assigned to four processing elements, PE0 to PE3]
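The processor assignment step can be illustrated with a small sketch. Because the partition problem is NP-hard, the sketch below swaps in a greedy longest-processing-time heuristic; it is not the formulation used in the talk. The function name `assign_filters` and the workload values are hypothetical.

```python
import heapq

def assign_filters(workloads, num_pes):
    """Greedy heuristic (an assumption, not the talk's method):
    give the heaviest remaining filter to the least-loaded PE."""
    # Heap of (current load, PE index); the least-loaded PE is popped first.
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    assignment = {}
    for name, work in sorted(workloads.items(), key=lambda kv: kv[1], reverse=True):
        load, pe = heapq.heappop(heap)
        assignment[name] = pe
        heapq.heappush(heap, (load + work, pe))
    # The initiation interval (II) is bounded below by the most loaded PE.
    ii = max(load for load, _ in heap)
    return assignment, ii

# Hypothetical workloads for six filters on four PEs.
workloads = {"A": 40, "B": 30, "C": 20, "D": 10, "E": 10, "F": 10}
assignment, ii = assign_filters(workloads, 4)
print(assignment, ii)
```

A smaller maximum load means a smaller II and therefore higher throughput, which is why the heuristic balances the per-PE loads.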
7
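Forming Pipelines: Stage Assignment

Assigns each filter to a pipeline stage, traversing the graph in dataflow order

Producer i and consumer j on the same PE (producer-consumer dependence): $S_j \ge S_i$
Producer i and consumer j on different PEs (communication-computation overlap): $S_{DMA} > S_i$, $S_j = S_{DMA} + 1$
[Figure: pipelined schedule with prologue, steady state of length II, and epilogue]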
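A minimal sketch of stage assignment follows. It assumes the slide's constraints read S_j >= S_i when producer i and consumer j share a PE, and S_DMA > S_i with S_j = S_DMA + 1 when they do not. The function name `assign_stages`, the graph, and the PE mapping are hypothetical.

```python
def assign_stages(order, edges, pe_of):
    """order: filters in dataflow (topological) order.
    edges: list of (producer, consumer) pairs.
    pe_of: PE index of each filter."""
    stage = {f: 0 for f in order}
    position = {f: k for k, f in enumerate(order)}
    # Visit edges by producer position so a producer's stage is final
    # before any of its consumers is placed.
    for i, j in sorted(edges, key=lambda e: position[e[0]]):
        if pe_of[i] == pe_of[j]:
            stage[j] = max(stage[j], stage[i])      # S_j >= S_i
        else:
            stage[j] = max(stage[j], stage[i] + 2)  # room for one DMA stage
    # Place each DMA directly before its consumer: S_j = S_DMA + 1.
    dma_stage = {(i, j): stage[j] - 1 for i, j in edges if pe_of[i] != pe_of[j]}
    return stage, dma_stage

order = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
pe_of = {"A": 0, "B": 0, "C": 1, "D": 1}
stage, dma_stage = assign_stages(order, edges, pe_of)
print(stage, dma_stage)
```

Giving the DMA its own stage between producer and consumer is what lets the transfer overlap with computation in the steady state.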
8
Excess Buffer Requirements

Infeasible schedule

II = 50, LS size = 14
[Figure: filters assigned to PE0 to PE3, with the buffer usage of local stores LS 0 to LS 3]
Maximum throughput, but not feasible!
Multiple buffering
9
Memory-aware Stream Graph Scheduling
Phased approach for solving each step optimally
Maximizes the usage of limited local storage
Attempts to find more solutions without degrading performance

Phase 1: buffer requirement estimation using conservative stage assignment (polynomial); yields the memory requirement
Phase 2: processor assignment under memory constraints (NP-hard); yields the best-so-far processor assignment
Phase 3: stage optimization for reducing buffers/DMAs and stages (polynomial); yields the scheduling result
10
Buffer Usage Estimation Using Conservative Stage Assignment
Variable: filter i to PE j
Buffer usage of filter i
Maximize throughput under memory constraints!
11
Memory-aware Processor Assignment

Starts with the same filter workloads, local storage size, and processors
Considers the buffer requirements of each filter, which will be allocated in the local storage of the assigned processor
Generates a different processor assignment that fits into the LS

[Figure: filters assigned to PE0 to PE3]
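The idea of checking local storage while assigning filters can be sketched as follows. This is a greedy stand-in under stated assumptions, not the talk's solver: each filter carries a single hypothetical buffer size, and a filter may only go to a PE whose local store (LS) still has room. The function name `assign_with_memory` and all numbers are illustrative.

```python
def assign_with_memory(workloads, buffers, num_pes, ls_size):
    """Greedy sketch: heaviest filter first, onto the least-loaded PE
    whose local store can still hold the filter's buffers.
    Returns (assignment, II) or None when some filter fits nowhere."""
    load = [0] * num_pes   # accumulated workload per PE
    used = [0] * num_pes   # accumulated buffer usage per LS
    assignment = {}
    for name in sorted(workloads, key=workloads.get, reverse=True):
        fits = [pe for pe in range(num_pes) if used[pe] + buffers[name] <= ls_size]
        if not fits:
            return None
        pe = min(fits, key=lambda p: load[p])
        assignment[name] = pe
        load[pe] += workloads[name]
        used[pe] += buffers[name]
    return assignment, max(load)

# Hypothetical workloads and buffer sizes.
workloads = {"A": 40, "B": 30, "C": 20, "D": 10}
buffers = {"A": 8, "B": 8, "C": 6, "D": 6}
feasible = assign_with_memory(workloads, buffers, num_pes=2, ls_size=14)
too_small = assign_with_memory(workloads, buffers, num_pes=2, ls_size=12)
print(feasible, too_small)
```

With an LS size of 14 every filter finds a home; shrinking it to 12 leaves filter C without a PE, so the sketch reports no assignment.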
12
Reducing Overheads: Stage Optimization
Always minimizes buffers/DMAs/stages from the given schedule
[Figure: initial stages S0 to S8 holding filters A to F and the DMAs between them]
13
Latency-aware Stream Graph Scheduling
Does not always maximize throughput
Achieves the throughput that can match the given latencies
Generates a schedule that satisfies the latency constraints using the fewest PEs

Input: latency constraints
Step 1: calculate the target throughput
Step 2: processor assignment for achieving the target throughput
14
Latency Constraints within a Stream Graph
[Figure: stream graph with filters A to F]
$\mathrm{lat}(A, C) = \mathrm{start}(C) - \mathrm{completion}(A)$
$LAT = \{\mathrm{lat}(A, C), \mathrm{lat}(B, E)\}$
15
Latency-aware Stream Graph Scheduling

Calculate the target throughput
- Conservative stage assignment to make $(S_j - S_i + 1)$ a constant value
- Calculates $a$, where $II \le a$ and $a = \min(\mathrm{lat}(i, j) / (S_j - S_i + 1))$

Processor assignment
- Minimize the number of PEs while achieving $a$ (bin-packing)
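The two steps above can be sketched in a few lines. The sketch assumes the target reads a = min(lat(i, j) / (S_j - S_i + 1)) and uses first-fit decreasing as a simple bin-packing heuristic, which is a stand-in and may use more PEs than an exact solver. The function names, latency bounds, stages, and workloads are all hypothetical.

```python
def target_ii(latencies, stage):
    """latencies: {(i, j): latency bound}; stage: conservative stage per filter.
    Returns a, the largest II that still meets every latency bound."""
    return min(bound / (stage[j] - stage[i] + 1)
               for (i, j), bound in latencies.items())

def pack_filters(workloads, a):
    """First-fit decreasing bin packing with capacity a per PE.
    Returns the per-PE loads, or None if a single filter exceeds a."""
    if max(workloads.values()) > a:
        return None
    bins = []
    for work in sorted(workloads.values(), reverse=True):
        for k, load in enumerate(bins):
            if load + work <= a:
                bins[k] += work
                break
        else:
            bins.append(work)  # open a new PE
    return bins

latencies = {("A", "C"): 150, ("B", "E"): 240}
stage = {"A": 0, "B": 1, "C": 2, "E": 4}
a = target_ii(latencies, stage)
workloads = {"A": 40, "B": 30, "C": 20, "D": 10, "E": 10, "F": 10}
bins = pack_filters(workloads, a)
print(a, bins)
```

Each bin is one PE, and its capacity is the target II, so the number of bins is the number of PEs needed to sustain that throughput.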

16
Bounds on the Number of PEs

$LB_{PE}$: solution from latency-aware scheduling
$II_{best}$: best possible II, the largest workload among all filters
$UB_{PE}$: solution from latency-aware scheduling when $a$ is substituted by $II_{best}$
$LB_{PE} \le \mathrm{num}(PE) \le UB_{PE}$

17
Design Space Exploration: Memory and Latency

Inputs
Maximum workload
Timing constraints
Memory constraints

Calculate $UB_{PE}$ and $LB_{PE}$; $UB_{PE} = \min(UB_{PE}, Available_{PE})$
If $UB_{PE} < LB_{PE}$: no feasible solution
$P = LB_{PE}$
Run memory-aware scheduling with P PEs
If a solution exists: solution found!
Otherwise, if P has reached $UB_{PE}$: no feasible solution
Otherwise, $P = P + 1$ and rerun memory-aware scheduling
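The exploration loop can be sketched as below, assuming the flow is: clamp the upper bound to the available PEs, then try memory-aware scheduling with an increasing PE count until it succeeds. The `schedule` callback stands in for memory-aware scheduling and is hypothetical, as is the toy scheduler used to exercise it.

```python
def explore(lb_pe, ub_pe, available_pe, schedule):
    """schedule(p) returns a schedule for p PEs, or None if none fits.
    Returns (p, schedule) for the smallest feasible p, or None."""
    ub_pe = min(ub_pe, available_pe)
    if ub_pe < lb_pe:
        return None  # latency bound cannot be met with the available PEs
    for p in range(lb_pe, ub_pe + 1):
        result = schedule(p)
        if result is not None:
            return p, result
    return None  # memory constraints rule out every candidate

def toy_schedule(p):
    # Hypothetical stand-in: pretend the buffers only fit with 4 or more PEs.
    return "ok" if p >= 4 else None

found = explore(3, 6, 6, toy_schedule)
no_pes = explore(3, 6, 2, toy_schedule)
no_memory = explore(1, 3, 6, toy_schedule)
print(found, no_pes, no_memory)
```

Starting from the lower bound means the first schedule found also uses the fewest PEs.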
18
Experimental Results

Benchmarks: software-defined radio protocols
- WCDMA: common 3G protocol
- DVB: digital media broadcasting protocol
- 4G: next-generation wireless protocol
- 10 to 20 filters
Platform
- PS3: up to 6 SPEs
Software
- SPEX-C to C: SUIF
- IBM Cell SDK 3.0

19
Scalability of Memory-aware Scheduling
[Figure: speedup versus number of PEs (1 to 6) for 4G, DVB, and WCDMA, comparing calculated II with measured exec time; Ub values shown: 15, 4, 5]

Measured exec time
- Synchronization cost
- Unhidden communication cost

Imbalanced task set: tiny workloads smaller than DMA, centralized DMAs
20
Memory-aware Stream Graph Scheduling
[Figure: number of PEs (1 to 6) versus LS size (32K, 64K, 128K, 256K, 512K, 1M) for 4G, DVB, and WCDMA]
Sum of total data sizes: 4G 200KB, DVB 133KB, WCDMA 90KB
21
Conclusions

Coarse-grain software pipelining of stream programs considering
- memory constraints
- latency constraints
Performance summary
- Up to 50 more scheduling solutions
- Does not degrade the quality of the solutions
Future directions
- Modeling DMA costs, reducing synchronization costs
- Getting uniform workload

Thank you!

23
Input language

Input language: stylized C

A kernel function:

    func_a(int a[], int b[]) {
        int i, j;
        int dat;
        for (i = 0; i < counter; i++) {
            dat = b[i];
            dat = dat * dat + 10;  /* operators missing in the transcript; * and + are assumed */
            a[i] = dat;
        }
    }

Main function:

    stream {  // enclosing the stream structure
        for (i = 0; i < 1000; i++) {
            func_a(aout, ain);
            func_b(bout, aout);
            func_c(cout, bout);
            func_d(ifout, cout);
            func_e(eout, ifout);
        }
    }
24
Parallelized Code on Cell

Function offloading

PPU thread:

    void thread( ) {
        for (;;) {
            if (s0) {
                doDMA(..);
                blockingRead(..);
            }
            if (s1) {
                runfilter(..);
                blockingRead(..);
            }
            barrier( );
        }
    }

SPU program:

    // kernel function definitions
    // kernel stub definitions
    // data buffer definitions
    while (1) {
        switch (cmd) {
            case runFilter:
            case DMA:
        }
        // send ACK to PPE
    }

Commands flow from the PPU threads to the SPUs.
