Title: An Overview of Scheduling for SCORE
- Yury Markovskiy
- UC Berkeley
- BRASS Group
- DSA 2003
2. Outline
- SCORE model and the scheduler
- Target platform
- Basic operation
- Run-time resource binding
- Space of solutions for scheduling
- Evaluation of quality
- Cost
- Spectrum of solutions
- Temporal Graph Partitioning
- Buffer Allocation
- Improving Array Utilization
- Conclusion
3. SCORE Reconfigurable Hardware Model
- Paged FPGA
- Compute Page (CP)
- Fixed-size slice of reconfigurable hardware (e.g., 512 4-LUTs)
- Fixed number of I/O ports
- Stream interface with input queue
- Configurable Memory Block (CMB)
- Distributed, on-chip memory (e.g. 2 Mbit)
- Stream interface with input queue
- High-level interconnect
- Circuit switched with valid and back-pressure bits
- Microprocessor
- Run-time support and user code
4. SCORE Programming and Execution Model
- Computation is a graph of operators with dynamic I/O rates
- Operators are partitioned by the compiler into
- virtual compute pages
- memory segments
- Streams for communication
Example: JPEG Encode (11 pages, 5 segments)
5. SCORE Scheduler (1)
- Assume the device has 4 CPs and 16 CMBs
- Temporal Partitioning
6. SCORE Scheduler (2)
- Insert stitch buffers between temporal partitions
- Hold intermediate computation results
7. SCORE Scheduler (3)
- At run-time, reconfigure the array
- Swap individual pages and segments in and out
8. SCORE Scheduler (4)
- Do we need a run-time scheduler?
- Run-time scheduling
- Late binding of resources
- Benefits
- Automatic performance scaling
- Multiple applications sharing an array
- Extra burden: the scheduler
- Complex optimization with multiple simultaneous constraints (CPs, CMBs, and network) → an NP-hard problem
9. Scheduling Solutions (1)
- An example of a simple app's execution timeline
- How to evaluate scheduler quality?
- Total execution time
- Average array activity
- The fraction of cycles in which an average CP fires.
10. Scheduling Solutions (2)
- Scheduling Costs
- Time to compute schedule
- Efficient scheduling algorithms
- Off-line (static) scheduling
- Time to reconfigure the array
- High-quality schedules that minimize reconfiguration time and frequency.
11. Scheduling Solutions (3)
- Computing a schedule involves
- Temporal Partitioning/Sequence
- Resource Allocation
- Management of high-bandwidth CMB memory.
- Placement/Routing
- Timeslice sizing
- Length of time that a page is resident on the array.
- Cycle-level hardware timing
- Accomplished through h/w stream interfaces
- And page firing logic.
12. Scheduling Solutions (4)
- When should each scheduling step be performed?
- The solutions range in quality and cost
- Dynamic (at run time)
- High run-time overhead
- Responsive to application dynamics
- Static (at load time / install time)
- Low run-time overhead
- Quality depends on the ability to predict application behavior.
- Only correct (possibly conservative) schedules are acceptable.
13. Problems
- Scheduler overhead
- Sets the lower bound on response time
- Restricts applicability of the model if high.
- The run-time scheduler cannot react quickly to changes on the array.
- Reduction without semantic restrictions
- Operators with dynamic data-dependent behavior
- How small can we make the overhead?
- Compare to a reconfiguration time of 4-5 Kcycles.
- Scheduling quality
- On an idealized array → no overheads, only dataflow rate mismatches.
- On a realistic array → minimize reconfiguration and scheduling overheads.
14. Initial Scheduling Solution
- Fully Dynamic Scheduler
- Perform the scheduling operation each timeslice
15. Fully Dynamic Scheduler (1)
- Two types of overhead
- Scheduler (avg. 124 Kcycles)
- Reconfiguration by the array global controller (avg. 3.5 Kcycles)
- Average overhead per timeslice > 127 Kcycles
16. Fully Dynamic Scheduler (2)
- Total Execution Time
- Scheduler overhead is on average 36% of execution time
- Requires a large timeslice size: 250 Kcycles.
17. Quasi-Static Scheduler (1)
- Timeslice size
- Dynamically controlled by array hardware stall detection.
- Hardware continuously (or at small intervals) monitors array activity.
18. Quasi-Static Scheduler (2)
- A low-overhead scheduling solution
- Scheduler overhead (avg. 14 Kcycles)
- Reconfiguration (avg. 4Kcycles)
- 7x average reduction in overhead
19. Quasi-Static Scheduler (3)
- 4.5x average application speedup
- Reduction in overhead AND
- Improvement in scheduling quality
20. Temporal Graph Partitioning (1)
- Array activity determines application makespan
- Array activity: the fraction of cycles in which an average CP fires.
- SCORE virtual pages can have dynamic I/O rates
- In practice, most pages exhibit static long-term behavior.
- Collect I/O rates by profiling and use them for partitioning
- Example
- Each node has inherent (FSM-dependent) I/O rates
- Units for rates are tokens/fire (range between 0 and 1)
- If nodes A and B are co-resident, the expected activity:
- FA = 1 fire/cycle
- FB = 0.5 fires/cycle
21. Temporal Graph Partitioning (2)
- Continuing the example
- Add node C with a low input rate
- If nodes A, B, and C are co-resident, the expected activity:
- FA = 0.2 fires/cycle (formerly 1)
- FB = 0.1 fires/cycle (formerly 0.5)
- FC = 1 fire/cycle
- Synchronous Dataflow (SDF, Edward Lee) balance equations
- Efficiently solve for relative page firing rates
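Solving the balance equations can be sketched as follows; this is a minimal illustration with my own hypothetical pipeline and rates (not the profiled JPEG numbers), propagating f_consumer = f_producer * prod/cons across the graph:

```python
from collections import deque
from fractions import Fraction

def firing_rates(edges):
    """Solve the SDF balance equations f_p * prod = f_c * cons by
    propagating rates across a connected dataflow graph.
    edges: list of (producer, prod_rate, consumer, cons_rate)."""
    adj = {}
    for p, prod, c, cons in edges:
        adj.setdefault(p, []).append((c, Fraction(prod, cons)))
        adj.setdefault(c, []).append((p, Fraction(cons, prod)))
    start = edges[0][0]
    rates = {start: Fraction(1)}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, ratio in adj[u]:
            r = rates[u] * ratio
            if v in rates:
                # An inconsistent cycle means no valid steady-state schedule.
                assert rates[v] == r, "unbalanced rates"
            else:
                rates[v] = r
                queue.append(v)
    return rates

# Hypothetical pipeline A -> B -> C with rate mismatches: B fires half
# as often as A, and C once per five B firings.
print(firing_rates([("A", 1, "B", 2), ("B", 1, "C", 5)]))
```

Normalizing so the fastest page fires once per cycle gives the co-resident activities quoted on the slide.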
22. Temporal Partitioning Heuristics (3)
- Exhaustive Search
- Goal: maximize utilization, i.e., non-idle page-cycles
- How to cluster to reduce rate mismatches?
- Exhaustive search of feasible partitions for max total utilization
- Tried up to 30 pages (6 hours for a small array, minutes for a large array)
- Used as a reference point to evaluate other heuristics
- Min Cut
- FBB: Flow-Based, Balanced multi-way partitioning (Yang & Wong, ACM 1994)
- CP limit → area constraint
- CMB limit → I/O cut constraint; every user segment has an edge to the sink so it costs a CMB in the cut
- Topological
- Pre-cluster strongly connected components (SCCs)
- Pre-cluster pairs of clusters if it reduces the I/O of the pair
- Topological sort, partition at CP limit
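The Topological heuristic can be sketched as below, under stated assumptions (hypothetical page names, uniform page area of one CP, and without the pairwise I/O pre-clustering step): pre-cluster SCCs so cyclic operators stay co-resident, then cut a topological order of the condensed DAG at the CP limit.

```python
from graphlib import TopologicalSorter

def sccs(graph):
    """Map each node to an SCC representative (Kosaraju's algorithm).
    graph: {node: [successor, ...]} with every node present as a key."""
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in graph:
        if u not in seen:
            dfs(u)
    rev = {u: [] for u in graph}
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)
    comp, seen = {}, set()
    for u in reversed(order):
        if u in seen:
            continue
        stack = [u]
        seen.add(u)
        while stack:
            x = stack.pop()
            comp[x] = u  # u is this SCC's representative
            for y in rev[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return comp

def temporal_partitions(graph, cp_limit):
    """Greedily cut a topological order of SCC clusters at the CP limit."""
    comp = sccs(graph)
    clusters = {}
    for node, rep in comp.items():
        clusters.setdefault(rep, []).append(node)
    # Condensed DAG expressed as predecessor sets for TopologicalSorter.
    preds = {rep: set() for rep in clusters}
    for u, vs in graph.items():
        for v in vs:
            if comp[u] != comp[v]:
                preds[comp[v]].add(comp[u])
    parts, cur = [], []
    for rep in TopologicalSorter(preds).static_order():
        if cur and len(cur) + len(clusters[rep]) > cp_limit:
            parts.append(cur)
            cur = []
        cur = cur + clusters[rep]
    if cur:
        parts.append(cur)
    return parts

# Hypothetical 4-page graph with a feedback cycle A <-> B, on 2 CPs.
g = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
print(temporal_partitions(g, cp_limit=2))
```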
23. Partitioning Heuristics Results (4)
[Chart: heuristic utilization vs. hardware size (CP-CMB pairs)]
24. Partitioning Heuristics Results (5)
[Chart: heuristic utilization vs. hardware size (CP-CMB pairs)]
25. Buffer Allocation (1)
[Figure: pages A, B, C connected by stitch buffers X, Y, Z with relative sizes 1, 0.5, and 5, and profiled edge I/O rates]
- CMBs: distributed on-chip memory
- Limited resource
- Serialized transfers to/from primary memory
- Given a graph:
- Compute relative buffer sizes using collected I/O rates
- Stitch buffers store intermediate results
- Scale buffers by a proportionality constant Q
- X will be of size 1Q, Y 0.5Q, and Z 5Q
- Q constrains the amount of work performed in a timeslice before buffers empty or fill up, causing a stall.
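The scaling step can be sketched as below; `choose_q`, the 8-bit token width, and the budget of one 2 Mbit CMB are my assumptions, picking the largest Q for which all stitch buffers fit in the CMB memory set aside for them:

```python
def choose_q(rel_sizes, buffer_bits, token_bits=8):
    """Largest integer Q such that buffers of rel_size * Q tokens
    (token_bits each) fit within buffer_bits of CMB memory."""
    bits_per_q = sum(rel_sizes.values()) * token_bits
    return int(buffer_bits // bits_per_q)

# Relative sizes from the slide's example: X = 1Q, Y = 0.5Q, Z = 5Q.
rel = {"X": 1.0, "Y": 0.5, "Z": 5.0}
q = choose_q(rel, buffer_bits=2 * 2**20)          # e.g. one 2 Mbit CMB
sizes = {name: r * q for name, r in rel.items()}  # tokens per buffer
print(q, sizes)
```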
26. Buffer Allocation (2)
- CMB
- Controller limits the number of I/O ports
- Memory can be shared by several segments
- However, only one is active at a time.
- Also caches page configurations/state
- Choice of Q affects reconfiguration overhead
- Large Q → large buffers
- Reconfiguration costs are well amortized by processing a large number of data tokens each timeslice.
- Large off-chip traffic to primary memory (serialized)
- Reduction in available space for cached configurations/state.
- Small Q → small buffers
- Every buffer and configuration/state is cached on the array; virtually no off-chip traffic.
- Little work is done each timeslice; configuration overhead may dominate.
27. Buffer Allocation (3)
- Current work
- Algorithms to allocate memory buffers
- Preserve temporal locality between timeslices
- Minimize fragmentation
- An analytical model to predict application performance from Q
- Includes dataflow constraints
- Reconfiguration overhead / expected memory traffic
- Integration of buffer allocation into the existing quasi-static scheduler framework.
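As a toy stand-in for such a model (my own simplification, not the scheduler's actual model), amortized throughput per timeslice can be written as tokens processed over reconfiguration-plus-compute cycles; it rises with Q until the reconfiguration cost is fully amortized:

```python
def amortized_throughput(q, reconfig_cycles=4000, tokens_per_cycle=1.0):
    """Tokens per cycle once a timeslice's reconfiguration cost is
    amortized over the q tokens it processes."""
    compute_cycles = q / tokens_per_cycle
    return q / (reconfig_cycles + compute_cycles)

# Small Q: reconfiguration dominates; large Q: throughput approaches 1.
for q in (1000, 4000, 40000):
    print(q, round(amortized_throughput(q), 3))
```

The real model also has to account for the off-chip traffic and cache displacement that large Q causes, which this sketch omits.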
28. Improving Array Utilization (1)
- Currently, the entire array is reconfigured at once.
- Simple to manage.
- Schedule example
- Pages stall at different times
- Timeslice size is variable; however, we must wait for all pages to stall in order to reconfigure
- This dramatically reduces array utilization
29. Improving Array Utilization (2)
- Average CP activity does not exceed 35%.
- On large arrays, the application is limited by available parallelism.
- On small arrays, page I/O rate mismatches are to blame.
30. Improving Array Utilization (3)
- Match page I/O rates with stitch buffers inserted between co-resident pages.
- Cost: a high CMB:CP ratio
- Remove the rigid timeslice boundary
- Decrease the number of wasted cycles
- Attempt to run pages only when their execution time amortizes reconfiguration costs.
31. Improving Array Utilization (4)
- Complicates the static scheduler framework
- Overhead to execute the schedule may increase.
- Abandon the regular schedule structure
- Moving toward event-driven scheduling
- The run-time scheduler responds to stall alerts by scheduling the subsequent set of pages in a sequence.
- The static scheduler must pre-compute a dependency graph and encode it into the reconfiguration script.
- Scheduler overhead → the lower bound on the response time
- What is the attainable array utilization?
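A minimal sketch of the event-driven direction, with hypothetical page names and a hand-written successor map standing in for the pre-computed dependency graph:

```python
# Pre-computed dependency graph (hypothetical): stalled page -> pages to
# load next, as the static scheduler would encode in the script.
NEXT_AFTER = {"DCT": ["Quant"], "Quant": ["ZLE"], "ZLE": []}

def on_stall(resident, page, next_after=NEXT_AFTER):
    """React to a stall alert: evict the stalled page and schedule its
    successors, instead of waiting for the whole array to stall."""
    resident.discard(page)
    resident.update(next_after.get(page, []))
    return resident

print(on_stall({"DCT", "Quant"}, "DCT"))
```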
32. Conclusion
- Run-time scheduler
- Required for automatic scaling under hardware virtualization
- Run-time overhead → lower bound of response time
- Makes this model impractical for some apps
- Low-overhead run-time scheduling is possible
- Without semantic restrictions
- With higher (or comparable) scheduling quality.
- Overhead: 7x reduction; performance: 2-7x improvement.
- Temporal Graph Partitioning
- Performance on an idealized array is constrained by dataflow dependencies.
- Buffer Allocation
- Minimization of reconfiguration overhead / off-chip traffic.
- Improving Array Utilization
- Moving toward event-driven scheduling
34. Quasi-Static Scheduler (4)
- Tested applications
- Image de/compression, consisting of both dynamic- and static-rate operators
- All demonstrate similar speedups under the quasi-static scheduler.
- Performance improvements can be attributed to
- Reduced scheduler overhead
- Improved scheduling quality
- Global rather than local (BFS) view as in the dynamic scheduler
- Reduction of the lower bound on timeslice size
- Expands the space of apps well suited for execution under virtualized hardware
- Retained the powerful semantics of dynamic data-dependent dataflow