1
An Overview of Scheduling for SCORE
  • Yury Markovskiy
  • UC Berkeley
  • BRASS Group
  • DSA 2003

2
Outline
  • SCORE model and the scheduler
  • Target platform
  • Basic operation
  • Run-time resource binding
  • Space of solutions for scheduling
  • Evaluation of quality
  • Cost
  • Spectrum of solutions
  • Temporal Graph Partitioning
  • Buffer Allocation
  • Improving Array Utilization
  • Conclusion

3
SCORE Reconfigurable Hardware Model
  • Paged FPGA
  • Compute Page (CP)
  • Fixed-size slice of reconfig. hardware (e.g.
    512 4-LUTs)
  • Fixed number of I/O ports
  • Stream interface with input queue
  • Configurable Memory Block (CMB)
  • Distributed, on-chip memory (e.g. 2 Mbit)
  • Stream interface with input queue
  • High-level interconnect
  • Circuit switched with valid/back-pressure bits
  • Microprocessor
  • Run-time support and user code

4
SCORE Programming and Execution Model
  • Computation is a graph of operators with dynamic
    I/O rates
  • Operators are partitioned by the compiler into
  • virtual compute pages
  • memory segments
  • Streams for communication

JPEG Encode: 11 pages, 5 segments
5
SCORE Scheduler (1)
  • Assume device has 4 CPs, 16 CMBs
  • Temporal Partitioning

JPEG Encode: 11 pages, 5 segments
6
SCORE Scheduler (2)
  • Insert stitch buffers between temporal
    partitions
  • Hold intermediate computation results

JPEG Encode: 11 pages, 5 segments
7
SCORE Scheduler (3)
  • At run-time, reconfigure the array
  • Swap individual pages and segments in and out

8
SCORE Scheduler (4)
  • Do we need a run-time scheduler?
  • Run-time scheduling
  • Late binding of resources
  • Benefits
  • Automatic performance scaling
  • Multiple applications sharing an array
  • Extra burden on the scheduler
  • Complex optimization with multiple simultaneous
    constraints (CPs, CMBs, and network) → NP-hard
    problem

9
Scheduling Solutions (1)
  • An example of simple app execution timeline
  • How to evaluate scheduler quality?
  • Total execution time
  • Average array activity
  • The fraction of cycles in which the average CP fires.
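As a quick illustration, the activity metric above can be computed directly from per-CP fire counts. A minimal sketch; the function name and inputs are hypothetical, not part of the SCORE simulator:

```python
def array_activity(fire_counts, total_cycles):
    """Average array activity: the fraction of cycles in which
    the average CP fires, across all CPs on the array."""
    if not fire_counts or total_cycles == 0:
        return 0.0
    return sum(fire_counts) / (len(fire_counts) * total_cycles)

# Example: 4 CPs observed over 1000 cycles
print(array_activity([800, 200, 400, 600], 1000))  # 0.5
```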

10
Scheduling Solutions (2)
  • Scheduling Costs
  • Time to compute schedule
  • Efficient scheduling algorithms
  • Off-line (static) scheduling
  • Time to reconfigure the array
  • High quality schedules that minimize
    reconfiguration time and frequency.

11
Scheduling Solutions (3)
  • Computing Schedule involves
  • Temporal Partitioning/Sequence
  • Resource Allocation
  • Management of high-bandwidth CMB memory.
  • Placement/Routing
  • Timeslice sizing
  • Length of time that a page is resident on the
    array.
  • Cycle-level hardware timing
  • Accomplished through h/w stream interfaces
  • And page firing logic.

12
Scheduling Solutions (4)
  • When should each scheduling step be performed?
  • The solutions range in quality and cost
  • Dynamic (at run time)
  • High run-time overhead
  • Responsive to application dynamics
  • Static (at load time / install time)
  • Low run-time overhead
  • Quality depends on the ability to predict
    application behavior.
  • Only correct (possibly conservative) schedules
    are acceptable.

13
Problems
  • Scheduler overhead
  • Sets the lower-bound on the response time
  • Restricts applicability of the model if high.
  • Run-time scheduler cannot react quickly to
    changes on the array.
  • Reduction without semantic restrictions
  • Operators with dynamic data-dependent behavior
  • How small can we make the overhead?
  • Compare to reconfiguration time of 4-5 Kcycles.
  • Scheduling quality
  • On an idealized array → no overheads, only dataflow
    rate mismatches.
  • On a realistic array → minimize reconfiguration and
    scheduling overheads.

14
Initial Scheduling Solution
  • Fully Dynamic Scheduler
  • Perform scheduling operation each timeslice

15
Fully Dynamic Scheduler (1)
  • Two types of overhead
  • Scheduler (avg. 124 Kcycles)
  • Reconfiguration by the array global controller (avg.
    3.5 Kcycles)
  • Average overhead per timeslice > 127 Kcycles

16
Fully Dynamic Scheduler (2)
  • Total Execution Time
  • Scheduler overhead is on average 36% of execution
    time
  • Requires a large timeslice size: 250 Kcycles.

17
Quasi-Static Scheduler (1)
  • Timeslice size
  • Dynamically controlled by array hardware stall
    detection.
  • Hardware continuously (or at small intervals)
    monitors array activity.

18
Quasi-Static Scheduler (2)
  • A low overhead scheduling solution
  • Scheduler overhead (avg. 14Kcycles)
  • Reconfiguration (avg. 4Kcycles)
  • 7x average reduction in overhead

19
Quasi-Static Scheduler (3)
  • 4.5x average application speedup
  • Reduction in overhead AND
  • Improvement in scheduling quality

20
Temporal Graph Partitioning (1)
  • Array Activity and Application Makespan
  • The fraction of cycles in which an average CP fires.
  • SCORE virtual pages can have dynamic I/O rates
  • In practice, most pages exhibit static long-term
    behavior.
  • Collect I/O rates by profiling and use for
    partitioning
  • Example
  • Each node has inherent (FSM-dependent) I/O rates
  • Units for rates are tokens/fire (ranging between 0
    and 1)
  • If nodes A and B are co-resident, the expected
    activity:
  • FA = 1 fire/cycle
  • FB = 0.5 fires/cycle

21
Temporal Graph Partitioning (2)
  • Continuing the example
  • Add node C with a low input rate
  • If nodes A, B, and C are co-resident, the expected
    activity:
  • FA = 0.2 fires/cycle (formerly 1)
  • FB = 0.1 fires/cycle (formerly 0.5)
  • FC = 1 fire/cycle
  • Synchronous Dataflow (SDF, Edward Lee) balance
    equations
  • Efficiently solve for relative page firing rates
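The balance-equation step can be sketched as a simple rate propagation over the graph. This is a minimal illustration, not the SCORE implementation; the edge rates follow the A/B/C example above (A produces 0.5 tokens/fire into B, which consumes 1 token/fire; B produces 1 token/fire into C, which consumes only 0.1):

```python
from fractions import Fraction

def firing_rates(start, edges):
    """Solve SDF balance equations f_u * p = f_v * c by propagating
    rates from a start node over a connected graph, then normalize
    so the fastest page fires at most once per cycle."""
    rates = {start: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for u, v, p, c in edges:  # u produces p tokens/fire; v consumes c
            if u in rates and v not in rates:
                rates[v] = rates[u] * p / c
                changed = True
            elif v in rates and u not in rates:
                rates[u] = rates[v] * c / p
                changed = True
    top = max(rates.values())
    return {n: r / top for n, r in rates.items()}

# A -> B: produce 0.5, consume 1;  B -> C: produce 1, consume 0.1
edges = [("A", "B", Fraction(1, 2), Fraction(1)),
         ("B", "C", Fraction(1), Fraction(1, 10))]
print(firing_rates("A", edges))
# FA = 1/5, FB = 1/10, FC = 1 -- matching the 0.2 / 0.1 / 1 figures above
```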

22
Temporal Partitioning Heuristics (3)
  • Exhaustive Search
  • Goal: maximize utilization, i.e. non-idle
    page-cycles
  • How to cluster to reduce rate mismatches?
  • Exhaustive search of feasible partitions for max
    total utilization
  • Tried up to 30 pages (6 hours for small array,
    minutes for large array)
  • Used as a reference point to evaluate other
    heuristics
  • Min Cut
  • FBB: Flow-Based, Balanced multi-way partitioning
    (Yang & Wong, ACM 1994)
  • CP limit → area constraint
  • CMB limit → I/O cut constraint: every user
    segment has an edge to the sink, so it costs a CMB in
    the cut
  • Topological
  • Pre-cluster strongly connected components (SCCs)
  • Pre-cluster pairs of clusters if it reduces the I/O
    of the pair
  • Topological sort, partition at CP limit
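The topological heuristic above can be sketched in a few lines, assuming the page graph has already been made acyclic by SCC pre-clustering. The function name and the greedy cut-at-limit rule are illustrative, not the actual SCORE code:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def topo_partition(pages, edges, cp_limit):
    """Topologically sort the (assumed acyclic) page graph, then cut
    the order into temporal partitions of at most cp_limit pages."""
    preds = {p: set() for p in pages}
    for u, v in edges:          # a stream flows from page u to page v
        preds[v].add(u)
    order = list(TopologicalSorter(preds).static_order())
    return [order[i:i + cp_limit] for i in range(0, len(order), cp_limit)]

# An 11-page pipeline (like the JPEG Encode example) on a 4-CP array
pages = list("ABCDEFGHIJK")
edges = list(zip(pages, pages[1:]))
print(topo_partition(pages, edges, 4))
# [['A', 'B', 'C', 'D'], ['E', 'F', 'G', 'H'], ['I', 'J', 'K']]
```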

23
Partitioning Heuristics Results (4)
[Chart: partitioning heuristic results vs. Hardware Size (CP-CMB Pairs)]
24
Partitioning Heuristics Results (5)
[Chart: partitioning heuristic results vs. Hardware Size (CP-CMB Pairs)]
25
Buffer Allocation (1)
[Figure: example pipeline of pages A → B → C with stitch buffers X, Y, Z
of relative sizes 1, 0.5, and 5, and per-edge I/O rates between 0.1 and 1]
  • CMBs: distributed on-chip memory
  • Limited resource
  • Serialized transfers to/from primary memory
  • Given a graph,
  • Compute relative buffer sizes using collected I/O
    rates
  • Stitch buffers store intermediate results
  • Scale buffers by a proportionality constant Q
  • X will be of size 1·Q, Y 0.5·Q, and Z 5·Q
  • Q constrains the amount of work performed in a
    timeslice before buffers empty or fill up,
    causing a stall.
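Under these assumptions, buffer sizing reduces to scaling the relative sizes by Q, and one coarse policy is to pick the largest Q whose buffers fit on-chip. A hypothetical sketch (names and the capacity model are illustrative; it ignores fragmentation and space for cached configurations):

```python
def buffer_sizes(relative, q):
    """Scale relative stitch-buffer sizes by the proportionality
    constant Q (amount of work per timeslice before a stall)."""
    return {name: rel * q for name, rel in relative.items()}

def largest_q(relative, capacity_tokens):
    """Largest integer Q such that all stitch buffers together fit
    in the given total on-chip CMB capacity (coarse model)."""
    return int(capacity_tokens // sum(relative.values()))

relative = {"X": 1, "Y": 0.5, "Z": 5}   # relative sizes from the figure
print(buffer_sizes(relative, 100))       # {'X': 100, 'Y': 50.0, 'Z': 500}
print(largest_q(relative, 65536))        # 10082
```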

26
Buffer Allocation (2)
  • CMB
  • Controller limits the number of I/O ports
  • Memory can be shared by several segments
  • However, only one is active at a time.
  • Also caches page configurations/state
  • Choice of Q affects reconfiguration overhead
  • Large Q → Large Buffers
  • Reconfiguration costs are well amortized by
    processing large number of data tokens each
    timeslice.
  • Large off-chip traffic to primary memory
    (sequentialized)
  • Reduction in available space for cached
    configurations/state.
  • Small Q → Small Buffers
  • Every buffer and configuration/state is cached on
    the array; virtually no off-chip traffic.
  • Little work is done each timeslice,
    configuration overhead may dominate.

27
Buffer Allocation (3)
  • Current work
  • Algorithms to allocate memory buffers
  • Preserve temporal locality between timeslices
  • Minimize fragmentation
  • An analytical model to predict application
    performance from Q.
  • Includes dataflow constraints
  • Reconfiguration Overhead/Expected Memory Traffic
  • Integration of buffer allocation into the
    existing Quasi-static scheduler framework.

28
Improving Array Utilization (1)
  • Currently, the entire array is reconfigured at
    once.
  • Simple to manage.
  • Schedule Example
  • Pages stall at different times
  • Timeslice size is variable; however, we must wait
    for all pages to stall in order to reconfigure
  • Dramatically reduces array utilization

29
Improving Array Utilization (2)
  • Average CP activity does not exceed 35%.
  • On large arrays, application is limited by
    available parallelism.
  • On small arrays, page I/O rate mismatches are to
    blame.

30
Improving Array Utilization (3)
  • Match page I/O rates with stitch buffers inserted
    between co-resident pages.
  • Cost: high CMB:CP ratio
  • Remove rigid timeslice boundary
  • Decrease the number of wasted cycles
  • Attempt to run pages only when their execution
    time amortizes reconfiguration costs.

31
Improving Array Utilization (4)
  • Complicates static scheduler framework
  • Overhead to execute the schedule may increase.
  • Abandon regular schedule structure
  • Moving toward event-driven scheduling
  • The run-time scheduler responds to stall alerts
    by scheduling the subsequent set of pages in a
    sequence.
  • Static scheduler must pre-compute a dependency
    graph and encode it into the reconfiguration
    script.
  • Scheduler overhead → the lower bound on the
    response time
  • What is the attainable array utilization?

32
Conclusion
  • Run-time scheduler
  • Required for automatic scaling under hardware
    virtualization
  • Run-time overhead → lower bound of response time
  • Makes this model impractical for some apps
  • Low overhead run-time scheduling is possible
  • Without semantic restrictions
  • With higher (or comparable) scheduling quality.
  • 7x reduction in overhead and 2-7x improvement in
    performance.
  • Temporal Graph Partitioning
  • Performance on an idealized array constrained by
    dataflow dependencies.
  • Buffer Allocation
  • Minimization of reconfiguration overhead/off-chip
    traffic.
  • Improving Array Utilization
  • Moving toward event-driven scheduling

34
Quasi-Static Scheduler (4)
  • Tested applications
  • Image de/compression apps consist of both dynamic-
    and static-rate operators.
  • All demonstrate similar speedups under
    Quasi-Static scheduler.
  • Performance improvements can be attributed to
  • Reduced scheduler overhead
  • Improved scheduling quality
  • Global rather than local (BFS) view as in dynamic
    scheduler
  • Reduction of the lower bound of timeslice size
  • Expands the space of apps well suited for
    execution under virtualized hardware
  • Retained powerful semantics of dynamic
    data-dependent dataflow