Title: An Overview of Scheduling for SCORE
- Yury Markovskiy
- UC Berkeley
- BRASS Group
- DSA 2003
2. Outline
- SCORE model and the scheduler
- Target platform
- Basic operation
- Run-time resource binding
- Space of solutions for scheduling
- Evaluation of quality
- Cost
- Spectrum of solutions
- Temporal Graph Partitioning
- Buffer Allocation
- Improving Array Utilization
- Conclusion
3. SCORE Reconfigurable Hardware Model
- Paged FPGA
- Compute Page (CP)
- Fixed-size slice of reconfigurable hardware (e.g., 512 4-LUTs)
- Fixed number of I/O ports
- Stream interface with input queue
- Configurable Memory Block (CMB)
- Distributed, on-chip memory (e.g. 2 Mbit)
- Stream interface with input queue
- High-level interconnect
- Circuit switched with valid and back-pressure bits
- Microprocessor
- Run-time support and user code
4. SCORE Programming and Execution Model
- Computation is a graph of operators with dynamic I/O rates
- Operators are partitioned by the compiler into
- virtual compute pages
- memory segments
- Streams for communication
Example: JPEG Encode (11 pages, 5 segments)
5. SCORE Scheduler (1)
- Assume the device has 4 CPs and 16 CMBs
- Temporal Partitioning
6. SCORE Scheduler (2)
- Insert stitch buffers between temporal partitions
- Hold intermediate computation results
7. SCORE Scheduler (3)
- At run-time, reconfigure the array
- Swap individual pages and segments in and out
8. SCORE Scheduler (4)
- Do we need a run-time scheduler?
- Run-time scheduling
- Late binding of resources
- Benefits
- Automatic performance scaling
- Multiple applications sharing an array
- Extra burden: the scheduler
- Complex optimization with multiple simultaneous constraints (CPs, CMBs, and network) → an NP-hard problem
9. Scheduling Solutions (1)
- An example of a simple app's execution timeline
- How to evaluate scheduler quality?
- Total execution time
- Average array activity
- The fraction of cycles in which an average CP fires.
10. Scheduling Solutions (2)
- Scheduling Costs
- Time to compute schedule
- Efficient scheduling algorithms
- Off-line (static) scheduling
- Time to reconfigure the array
- High-quality schedules that minimize reconfiguration time and frequency.
11. Scheduling Solutions (3)
- Computing a schedule involves
- Temporal Partitioning/Sequence
- Resource Allocation
- Management of high-bandwidth CMB memory.
- Placement/Routing
- Timeslice sizing
- Length of time that a page is resident on the array.
- Cycle-level hardware timing
- Accomplished through h/w stream interfaces
- And page firing logic.
12. Scheduling Solutions (4)
- When should each scheduling step be performed?
- The solutions range in quality and cost
- Dynamic (at run time)
- High run-time overhead
- Responsive to application dynamics
- Static (at load time / install time)
- Low run-time overhead
- Quality depends on the ability to predict application behavior.
- Only correct (possibly conservative) schedules are acceptable.
13. Problems
- Scheduler overhead
- Sets the lower bound on response time
- Restricts applicability of the model if high.
- The run-time scheduler cannot react quickly to changes on the array.
- Reduction without semantic restrictions
- Operators with dynamic data-dependent behavior
- How small can we make the overhead?
- Compare to a reconfiguration time of 4-5 Kcycles.
- Scheduling quality
- On an idealized array → no overheads, only dataflow rate mismatches.
- On a realistic array → minimize reconfiguration and scheduling overheads.
14. Initial Scheduling Solution
- Fully Dynamic Scheduler
- Perform the scheduling operation each timeslice
15. Fully Dynamic Scheduler (1)
- Two types of overhead
- Scheduler (avg. 124 Kcycles)
- Reconfiguration by the array global controller (avg. 3.5 Kcycles)
- Average overhead per timeslice > 127 Kcycles
16. Fully Dynamic Scheduler (2)
- Total Execution Time
- Scheduler overhead is on average 36% of execution time
- Requires a large timeslice size: 250 Kcycles.
17. Quasi-Static Scheduler (1)
- Timeslice size
- Dynamically controlled by array hardware stall detection.
- Hardware continuously (or at small intervals) monitors array activity.
18. Quasi-Static Scheduler (2)
- A low-overhead scheduling solution
- Scheduler overhead (avg. 14 Kcycles)
- Reconfiguration (avg. 4Kcycles)
- 7x average reduction in overhead
19. Quasi-Static Scheduler (3)
- 4.5x average application speedup
- Reduction in overhead AND
- Improvement in scheduling quality
20. Temporal Graph Partitioning (1)
- Array activity determines application makespan
- Array activity: the fraction of cycles in which an average CP fires.
- SCORE virtual pages can have dynamic I/O rates
- In practice, most pages exhibit static long-term behavior.
- Collect I/O rates by profiling and use them for partitioning
- Example
- Each node has inherent (FSM-dependent) I/O rates
- Units for rates are tokens/fire (range between 0 and 1)
- If nodes A and B are co-resident, the expected activity:
- FA = 1 fire/cycle
- FB = 0.5 fires/cycle
21. Temporal Graph Partitioning (2)
- Continuing the example
- Add node C with a low input rate
- If nodes A, B, and C are co-resident, the expected activity:
- FA = 0.2 fires/cycle (formerly 1)
- FB = 0.1 fires/cycle (formerly 0.5)
- FC = 1 fire/cycle
- Synchronous Dataflow (SDF, Edward Lee) balance equations
- Efficiently solve for relative page firing rates
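Solving the balance equations can be sketched as follows; this is a minimal illustration with my own hypothetical pipeline and rates (not the profiled JPEG numbers), propagating f_consumer = f_producer * prod/cons across the graph:

```python
from collections import deque
from fractions import Fraction

def firing_rates(edges):
    """Solve the SDF balance equations f_p * prod = f_c * cons by
    propagating rates across a connected dataflow graph.
    edges: list of (producer, prod_rate, consumer, cons_rate)."""
    adj = {}
    for p, prod, c, cons in edges:
        adj.setdefault(p, []).append((c, Fraction(prod, cons)))
        adj.setdefault(c, []).append((p, Fraction(cons, prod)))
    start = edges[0][0]
    rates = {start: Fraction(1)}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, ratio in adj[u]:
            r = rates[u] * ratio
            if v in rates:
                # An inconsistent cycle means no valid steady-state schedule.
                assert rates[v] == r, "unbalanced rates"
            else:
                rates[v] = r
                queue.append(v)
    return rates

# Hypothetical pipeline A -> B -> C with rate mismatches: B fires half
# as often as A, and C once per five B firings.
print(firing_rates([("A", 1, "B", 2), ("B", 1, "C", 5)]))
```

Normalizing so the fastest page fires once per cycle gives the co-resident activities quoted on the slide.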
22. Temporal Partitioning Heuristics (3)
- Exhaustive Search
- Goal: maximize utilization, i.e., non-idle page-cycles
- How to cluster to reduce rate mismatches?
- Exhaustive search of feasible partitions for max total utilization
- Tried up to 30 pages (6 hours for a small array, minutes for a large array)
- Used as a reference point to evaluate other heuristics
- Min Cut
- FBB: Flow-Based, Balanced multi-way partitioning (Yang & Wong, ACM 1994)
- CP limit → area constraint
- CMB limit → I/O cut constraint; every user segment has an edge to the sink so it costs a CMB in the cut
- Topological
- Pre-cluster strongly connected components (SCCs)
- Pre-cluster pairs of clusters if it reduces the I/O of the pair
- Topological sort, partition at CP limit
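The Topological heuristic can be sketched as below, under stated assumptions (hypothetical page names, uniform page area of one CP, and without the pairwise I/O pre-clustering step): pre-cluster SCCs so cyclic operators stay co-resident, then cut a topological order of the condensed DAG at the CP limit.

```python
from graphlib import TopologicalSorter

def sccs(graph):
    """Map each node to an SCC representative (Kosaraju's algorithm).
    graph: {node: [successor, ...]} with every node present as a key."""
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in graph:
        if u not in seen:
            dfs(u)
    rev = {u: [] for u in graph}
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)
    comp, seen = {}, set()
    for u in reversed(order):
        if u in seen:
            continue
        stack = [u]
        seen.add(u)
        while stack:
            x = stack.pop()
            comp[x] = u  # u is this SCC's representative
            for y in rev[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return comp

def temporal_partitions(graph, cp_limit):
    """Greedily cut a topological order of SCC clusters at the CP limit."""
    comp = sccs(graph)
    clusters = {}
    for node, rep in comp.items():
        clusters.setdefault(rep, []).append(node)
    # Condensed DAG expressed as predecessor sets for TopologicalSorter.
    preds = {rep: set() for rep in clusters}
    for u, vs in graph.items():
        for v in vs:
            if comp[u] != comp[v]:
                preds[comp[v]].add(comp[u])
    parts, cur = [], []
    for rep in TopologicalSorter(preds).static_order():
        if cur and len(cur) + len(clusters[rep]) > cp_limit:
            parts.append(cur)
            cur = []
        cur = cur + clusters[rep]
    if cur:
        parts.append(cur)
    return parts

# Hypothetical 4-page graph with a feedback cycle A <-> B, on 2 CPs.
g = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
print(temporal_partitions(g, cp_limit=2))
```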
23. Partitioning Heuristics Results (4)
[Chart: heuristic utilization vs. hardware size (CP-CMB pairs)]
24. Partitioning Heuristics Results (5)
[Chart: heuristic utilization vs. hardware size (CP-CMB pairs)]
25. Buffer Allocation (1)
[Figure: pages A, B, C connected by stitch buffers X, Y, Z with relative sizes 1, 0.5, and 5, and profiled edge I/O rates]
- CMBs: distributed on-chip memory
- Limited resource
- Serialized transfers to/from primary memory
- Given a graph:
- Compute relative buffer sizes using collected I/O rates
- Stitch buffers store intermediate results
- Scale buffers by a proportionality constant Q
- X will be of size 1Q, Y 0.5Q, and Z 5Q
- Q constrains the amount of work performed in a timeslice before buffers empty or fill up, causing a stall.
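The scaling step can be sketched as below; `choose_q`, the 8-bit token width, and the budget of one 2 Mbit CMB are my assumptions, picking the largest Q for which all stitch buffers fit in the CMB memory set aside for them:

```python
def choose_q(rel_sizes, buffer_bits, token_bits=8):
    """Largest integer Q such that buffers of rel_size * Q tokens
    (token_bits each) fit within buffer_bits of CMB memory."""
    bits_per_q = sum(rel_sizes.values()) * token_bits
    return int(buffer_bits // bits_per_q)

# Relative sizes from the slide's example: X = 1Q, Y = 0.5Q, Z = 5Q.
rel = {"X": 1.0, "Y": 0.5, "Z": 5.0}
q = choose_q(rel, buffer_bits=2 * 2**20)          # e.g. one 2 Mbit CMB
sizes = {name: r * q for name, r in rel.items()}  # tokens per buffer
print(q, sizes)
```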
26. Buffer Allocation (2)
- CMB
- Controller limits the number of I/O ports
- Memory can be shared by several segments
- However, only one is active at a time.
- Also caches page configurations/state
- Choice of Q affects reconfiguration overhead
- Large Q → large buffers
- Reconfiguration costs are well amortized by processing a large number of data tokens each timeslice.
- Large off-chip traffic to primary memory (serialized)
- Reduction in available space for cached configurations/state.
- Small Q → small buffers
- Every buffer and configuration/state is cached on the array; virtually no off-chip traffic.
- Little work is done each timeslice; configuration overhead may dominate.
27. Buffer Allocation (3)
- Current work
- Algorithms to allocate memory buffers
- Preserve temporal locality between timeslices
- Minimize fragmentation
- An analytical model to predict application performance from Q
- Includes dataflow constraints
- Reconfiguration overhead / expected memory traffic
- Integration of buffer allocation into the existing quasi-static scheduler framework.
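As a toy stand-in for such a model (my own simplification, not the scheduler's actual model), amortized throughput per timeslice can be written as tokens processed over reconfiguration-plus-compute cycles; it rises with Q until the reconfiguration cost is fully amortized:

```python
def amortized_throughput(q, reconfig_cycles=4000, tokens_per_cycle=1.0):
    """Tokens per cycle once a timeslice's reconfiguration cost is
    amortized over the q tokens it processes."""
    compute_cycles = q / tokens_per_cycle
    return q / (reconfig_cycles + compute_cycles)

# Small Q: reconfiguration dominates; large Q: throughput approaches 1.
for q in (1000, 4000, 40000):
    print(q, round(amortized_throughput(q), 3))
```

The real model also has to account for the off-chip traffic and cache displacement that large Q causes, which this sketch omits.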
28. Improving Array Utilization (1)
- Currently, the entire array is reconfigured at once.
- Simple to manage.
- Schedule example
- Pages stall at different times
- Timeslice size is variable; however, we must wait for all pages to stall in order to reconfigure
- This dramatically reduces array utilization
29. Improving Array Utilization (2)
- Average CP activity does not exceed 35%.
- On large arrays, the application is limited by available parallelism.
- On small arrays, page I/O rate mismatches are to blame.
30. Improving Array Utilization (3)
- Match page I/O rates with stitch buffers inserted between co-resident pages.
- Cost: a high CMB:CP ratio
- Remove the rigid timeslice boundary
- Decrease the number of wasted cycles
- Attempt to run pages only when their execution time amortizes reconfiguration costs.
31. Improving Array Utilization (4)
- Complicates the static scheduler framework
- Overhead to execute the schedule may increase.
- Abandon the regular schedule structure
- Moving toward event-driven scheduling
- The run-time scheduler responds to stall alerts by scheduling the subsequent set of pages in a sequence.
- The static scheduler must pre-compute a dependency graph and encode it into the reconfiguration script.
- Scheduler overhead → the lower bound on the response time
- What is the attainable array utilization?
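A minimal sketch of the event-driven direction, with hypothetical page names and a hand-written successor map standing in for the pre-computed dependency graph:

```python
# Pre-computed dependency graph (hypothetical): stalled page -> pages to
# load next, as the static scheduler would encode in the script.
NEXT_AFTER = {"DCT": ["Quant"], "Quant": ["ZLE"], "ZLE": []}

def on_stall(resident, page, next_after=NEXT_AFTER):
    """React to a stall alert: evict the stalled page and schedule its
    successors, instead of waiting for the whole array to stall."""
    resident.discard(page)
    resident.update(next_after.get(page, []))
    return resident

print(on_stall({"DCT", "Quant"}, "DCT"))
```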
32. Conclusion
- Run-time scheduler
- Required for automatic scaling under hardware virtualization
- Run-time overhead → lower bound of response time
- Makes this model impractical for some apps
- Low-overhead run-time scheduling is possible
- Without semantic restrictions
- With higher (or comparable) scheduling quality.
- Overhead: 7x reduction; performance: 2-7x improvement.
- Temporal Graph Partitioning
- Performance on an idealized array is constrained by dataflow dependencies.
- Buffer Allocation
- Minimization of reconfiguration overhead / off-chip traffic.
- Improving Array Utilization
- Moving toward event-driven scheduling
34. Quasi-Static Scheduler (4)
- Tested applications
- Image de/compression, consisting of both dynamic- and static-rate operators
- All demonstrate similar speedups under the quasi-static scheduler.
- Performance improvements can be attributed to
- Reduced scheduler overhead
- Improved scheduling quality
- Global rather than local (BFS) view as in the dynamic scheduler
- Reduction of the lower bound on timeslice size
- Expands the space of apps well suited for execution under virtualized hardware
- Retained the powerful semantics of dynamic data-dependent dataflow