Automatically Tuning Task-Based Programs for Multi-core Processors



1
Automatically Tuning Task-Based Programs for
Multi-core Processors
  • Jin Zhou
  • Brian Demsky
  • Department of Electrical Engineering and Computer
    Science
  • University of California, Irvine

2
Motivation
  • Recent microprocessor trends
  • Number of cores increased rapidly
  • Architectures vary widely
  • Challenges for software development
  • Parallelization is now key for performance
  • Current parallel programming model: threads and locks
  • Hard to develop correct and efficient parallel
    software
  • Hard to adapt software to changes in architectures

3
Goals
  • Automatically generate parallel implementation
  • Automatically tune parallel implementation

4
Overview
[Tool-flow diagram]
  • Inputs: program, profile data, processor specification
  • Bamboo compiler stages: implementation generator, simulation-based evaluator, implementation optimizer, code generator
  • Intermediate results: candidate implementations, leading implementations, tuned implementations, optimized implementation
  • Output: optimized multi-core binary for the multi-core processor
5
Example
  • MonteCarlo Example
  • Partitions problem into several simulations
  • Executes the simulations in parallel
  • Aggregates results of all simulations

6
Bamboo Language
  • A hybrid language that combines data-flow and Java
  • Programs are composed of tasks
  • Tasks compose with dataflow-like semantics
  • Tasks contain Java-like object-oriented code
    internally
  • Programs cannot explicitly invoke tasks
  • Runtime automatically invokes tasks
  • Supports standard object-oriented constructs
    including methods and classes

7
Bamboo Language
  • Flags
  • Capture the current role (type state) of an object in the
    computation
  • Each flag captures an aspect of the object's
    state
  • Change as the object's role evolves in the program
  • Support orthogonal classifications of objects

8
  task startup(StartupObject s in initialstate) {
    Aggregator aggr = new Aggregator(s.args[0]){merge := true};
    for (int i = 0; i < 4; i++) {
      Simulator sim = new Simulator(aggr){run := true};
    }
    taskexit(s: initialstate := false);
  }

  task simulate(Simulator sim in run) {
    sim.runSimulate();
    taskexit(sim: run := false, submit := true);
  }

  task aggregate(Aggregator aggr in merge,
                 Simulator sim in submit) {
    boolean allprocessed = aggr.aggregateResult(sim);
    if (allprocessed)
      taskexit(aggr: merge := false, finished := true;
               sim: submit := false, finished := true);
    taskexit(sim: submit := false, finished := true);
  }

  class Simulator { flag run; flag submit; flag finished; ... }
  class Aggregator { flag merge; flag finished; }
9
Bamboo Program Execution
  • Global flagged object space
  • Runtime initialization creates a new StartupObject in the initialstate state
10
Bamboo Program Execution
  • The startup task executes on the StartupObject (initialstate state)
11
Bamboo Program Execution
  • The startup task sets the StartupObject to the finished state
  • It creates new objects: one Aggregator (merge state) and four Simulators (run state)
12
Bamboo Program Execution
  • Four simulate tasks execute on the four Simulators (run state)
13
Bamboo Program Execution
  • Each simulate task sets its Simulator to the submit state
14
Bamboo Program Execution
  • The aggregate task executes on the Aggregator (merge state) and the first Simulator (submit state)
15
Bamboo Program Execution
  • The aggregate task sets the first Simulator to the finished state
16
Bamboo Program Execution
  • The aggregate task executes on the Aggregator and the second Simulator
17
Bamboo Program Execution
  • The aggregate task sets the second Simulator to the finished state
18
Bamboo Program Execution
  • The aggregate task executes on the Aggregator and the third Simulator
19
Bamboo Program Execution
  • The aggregate task sets the third Simulator to the finished state
20
Bamboo Program Execution
  • The aggregate task executes on the Aggregator and the fourth Simulator
21
Bamboo Program Execution
  • The aggregate task sets the fourth Simulator to the finished state; with all results processed, the Aggregator also reaches the finished state
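
The execution walkthrough on slides 9 to 21 can be emulated in a few lines. The following is a minimal Python sketch, not the Bamboo runtime: the `Obj` class, the `find` helper, and the fixed task priority order are assumptions made for illustration, and tasks run one at a time here whereas the real runtime invokes them in parallel.

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    kind: str                                   # class name, e.g. "Simulator"
    flags: set = field(default_factory=set)     # flags currently set to true

def run_montecarlo(n_sims=4):
    """Emulate flag-driven task dispatch for the MonteCarlo example."""
    space = [Obj("StartupObject", {"initialstate"})]   # global flagged object space
    log, pending = [], n_sims

    def find(kind, flag):
        # First object of the given class whose flag is set, or None.
        return next((o for o in space if o.kind == kind and flag in o.flags), None)

    while True:
        s = find("StartupObject", "initialstate")
        running = find("Simulator", "run")
        aggr, done = find("Aggregator", "merge"), find("Simulator", "submit")
        if s is not None:                               # task startup
            space.append(Obj("Aggregator", {"merge"}))
            space.extend(Obj("Simulator", {"run"}) for _ in range(n_sims))
            s.flags.discard("initialstate")
            log.append("startup")
        elif running is not None:                       # task simulate
            running.flags.discard("run")
            running.flags.add("submit")
            log.append("simulate")
        elif aggr is not None and done is not None:     # task aggregate
            pending -= 1
            done.flags.discard("submit")
            done.flags.add("finished")
            if pending == 0:                            # allprocessed
                aggr.flags.discard("merge")
                aggr.flags.add("finished")
            log.append("aggregate")
        else:                                           # no task is enabled
            return log, space
```

Calling `run_montecarlo()` yields one startup, four simulate, and four aggregate invocations, after which no object satisfies any task guard and the loop stops.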
22
Implementation Generation
[Tool-flow diagram repeated from the Overview slide; its input is labeled Bamboo Program here.]
23
Implementation Generation
  • Dependence analysis: analyzes data dependences
    between tasks
  • Parallelism exploration: extracts potential
    parallelism
  • Mapping to cores: maps the program onto the real
    processor

24
Flag State Transition Graph (FSTG)
Simulator: run -> [simulate: 32 Mcyc, 100%] -> submit -> [aggregate: 2 Mcyc, 100%] -> finished
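
One way to hold such a graph in memory is a list of annotated transitions. This is a hedged sketch: the `Transition` type and the `cycles_along_chain` helper are illustrative names, not part of Bamboo, and the helper assumes every state has exactly one outgoing edge.

```python
from typing import NamedTuple

class Transition(NamedTuple):
    src: str         # flag state before the task runs
    task: str        # task that causes the transition
    mcycles: float   # profiled execution time in Mcyc
    prob: float      # profiled probability of taking this edge
    dst: str         # flag state after the task exits

# FSTG of the Simulator class, with the numbers shown on the slide
SIMULATOR_FSTG = [
    Transition("run", "simulate", 32, 1.00, "submit"),
    Transition("submit", "aggregate", 2, 1.00, "finished"),
]

def cycles_along_chain(fstg, start, goal):
    """Sum task costs from start to goal, assuming one outgoing edge per state."""
    total, state = 0, start
    while state != goal:
        (edge,) = [t for t in fstg if t.src == state]
        total += edge.mcycles
        state = edge.dst
    return total
```

For the Simulator, `cycles_along_chain(SIMULATOR_FSTG, "run", "finished")` adds the 32 Mcyc simulate step and the 2 Mcyc aggregate step.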
25
Combined Flag State Transition Graph (CFSTG)
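StartupObject: initialstate -> [startup: 3 Mcyc, 100%] -> finished
  New-object edges from startup (labeled with the number of new objects): 1 Aggregator (merge), 4 Simulators (run)
Aggregator: merge -> [aggregate: 2 Mcyc, 75%] -> merge; merge -> [aggregate: 2 Mcyc, 25%] -> finished
Simulator: run -> [simulate: 32 Mcyc, 100%] -> submit -> [aggregate: 2 Mcyc, 100%] -> finished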
26
Initial Mapping
[Diagram: the CFSTG of the previous slide (StartupObject, Aggregator, Simulator, with the same task annotations) enclosed in one core group.]
27
Preprocessing Phase
  • Identifies strongly connected components (SCC)
    and merges them into a single core group
  • Converts CFSTG into a tree of core groups by
    replicating core groups as necessary
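
The SCC-merging step can be sketched as follows. The slide does not say which SCC algorithm is used; Tarjan's algorithm is chosen here as an assumption, and the function names and the example graph in the usage note are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps a node to a list of successor nodes."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:              # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

def merge_into_core_groups(graph):
    """Collapse each SCC into one core group; return the groups and the edges between them."""
    groups = strongly_connected_components(graph)
    group_of = {v: g for g in groups for v in g}
    edges = {(group_of[v], group_of[w])
             for v in graph for w in graph[v]
             if group_of[v] != group_of[w]}
    return groups, edges
```

For a graph such as `{"A": ["B"], "B": ["A", "C"], "C": []}`, the cycle between A and B collapses into a single group with one edge to the group containing C.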

28
Data Locality Rule
  • Default rule
  • Maximize data locality to improve performance
  • Minimizes inter-core communications
  • Improves cache behavior

[Diagram: the CFSTG (StartupObject, Aggregator, Simulator with their task annotations) and the corresponding core-group tree, in which StartupObject creates 4 Simulator objects and 1 Aggregator object.]
29
Data Parallelization Rule
  • To explore potential data parallelism

[Diagram: the CFSTG and its core-group tree before and after the rule. Before: StartupObject creates 4 objects for one Simulator core group and 1 for the Aggregator. After: the Simulator core group is replicated four times, each copy receiving 1 object.]
30
Rate Matching Rule
  • If the producer executes multiple times in a
    cycle, how many consumers are required?
  • Match two rates to estimate the number of
    consumers
  • Peak new object creation rate
  • Object consumption rate

[Diagram: a Producer (init state, produce task) feeding a Consumer (run state); after rate matching, the Producer feeds several Consumer copies.]
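
A minimal sketch of matching the two rates, assuming the estimate is simply the creation rate divided by the per-consumer consumption rate, rounded up. The function name, its parameters, and that formula are assumptions for illustration; the slide only states that the two rates are matched.

```python
import math
from fractions import Fraction

def consumers_needed(new_objects_per_period, period_mcycles, consumer_mcycles_per_object):
    """Estimate how many consumer core groups keep up with the producer."""
    # Peak new-object creation rate (objects per Mcyc)
    creation_rate = Fraction(new_objects_per_period, period_mcycles)
    # Consumption rate of a single consumer (objects per Mcyc)
    consumption_rate = Fraction(1, consumer_mcycles_per_object)
    # Exact rational arithmetic avoids floating-point surprises before rounding up.
    return max(1, math.ceil(creation_rate / consumption_rate))
```

With hypothetical numbers, a producer that creates 4 objects every 32 Mcyc and a consumer that needs 32 Mcyc per object gives `consumers_needed(4, 32, 32) == 4`.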
31
Mapping to Processor
  • Extended CFSTG: StartupObject, Aggregator, and four Simulator core groups, with every new-object edge labeled 1
  • Constraint: limited cores (here Core 1 and Core 2)
  • Map CFSTG core groups to physical cores

32
Mapping to Cores
  • One possible mapping

[Diagram: Core 1 holds the StartupObject, Aggregator, and two Simulator core groups; Core 2 holds the other two Simulator core groups.]
33
Mapping to Cores
  • Isomorphic mappings have the same performance

[Diagram: two isomorphic mappings of the same core groups onto Core 1 and Core 2, shown side by side.]
  • Backtracking-based search to generate
    non-isomorphic implementations
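
The idea of skipping isomorphic mappings during a backtracking search can be sketched as below. This sketch assumes that all cores are identical, so two mappings are isomorphic when they differ only by renaming cores; it ignores replicated core groups that are themselves interchangeable, and it is not the generator used by the authors.

```python
def non_isomorphic_mappings(n_groups, n_cores):
    """Yield one core index per core group, unique up to renaming of cores."""
    mapping = []

    def backtrack(used):
        if len(mapping) == n_groups:
            yield tuple(mapping)
            return
        # Try every core already in use, plus at most one fresh core.
        # Opening cores in a fixed order prunes all renamed duplicates.
        for core in range(min(used + 1, n_cores)):
            mapping.append(core)
            yield from backtrack(max(used, core + 1))
            mapping.pop()

    yield from backtrack(0)
```

For 3 core groups and 2 cores this yields 4 mappings instead of the 8 produced by plain enumeration.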

34
Implementation Generation
[Tool-flow diagram repeated from the Overview slide; its input is labeled Bamboo Program here.]
35
Simulation-Based Evaluation
  • To select the best candidate implementation
  • High-level simulation
  • Does NOT actually execute the program
  • Constructs abstract execution trace with similar
    statistics
  • Compare the execution time or throughput and core
    usage

[Diagram: the high-level Simulator with two Cores and four Tasks.]
36
Simulation-Based Evaluation
  • Markov model
  • Built from profile data
  • For each task, it estimates:
  • The destination state
  • The execution time
  • A count of each type of new objects

[Diagram: the extended CFSTG annotated with profile data: startup 3 Mcyc (100%); simulate 32 Mcyc (100%); aggregate 2 Mcyc (75% and 25% for the Aggregator, 100% for the Simulator); every new-object edge labeled 1.]
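
A small sketch of such a Markov model, using the profile numbers from the slides. The `MODEL` layout and both function names are assumptions for illustration. Note that the startup entry lists 1 Aggregator and 4 Simulators as in the original program; in the extended CFSTG each replicated Simulator core group receives 1 object.

```python
import random

# (class, state, task) -> list of (probability, destination state, Mcyc, new objects)
MODEL = {
    ("StartupObject", "initialstate", "startup"):
        [(1.00, "finished", 3, {"Aggregator": 1, "Simulator": 4})],
    ("Simulator", "run", "simulate"): [(1.00, "submit", 32, {})],
    ("Simulator", "submit", "aggregate"): [(1.00, "finished", 2, {})],
    ("Aggregator", "merge", "aggregate"):
        [(0.75, "merge", 2, {}), (0.25, "finished", 2, {})],
}

def sample_outcome(key, rng):
    """Draw one (destination state, Mcyc, new objects) outcome for a task invocation."""
    r, acc = rng.random(), 0.0
    for prob, dest, mcycles, new_objects in MODEL[key]:
        acc += prob
        if r < acc:
            return dest, mcycles, new_objects
    return MODEL[key][-1][1:]     # guard against floating-point round-off

def aggregate_invocations(rng):
    """Count aggregate invocations until the Aggregator reaches finished."""
    state, count = "merge", 0
    while state != "finished":
        state, _, _ = sample_outcome(("Aggregator", "merge", "aggregate"), rng)
        count += 1
    return count
```

Because the Aggregator leaves the merge state with probability 25%, the average of `aggregate_invocations` over many draws is close to 4, which matches the four Simulators of the example.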
37
Simulated Execution Trace
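Timeline in Mcyc, reconstructed from the two-column trace diagram (core 0, core 1). Each entry lists the objects queued on a core as Class(count):
  • 0: core 0 holds StartupObject(1); startup task starts
  • 3: core 0 holds Aggregator(1), Simulator(4); transfer a Simulator to core 1; simulate task starts on core 0
  • 4: core 1 holds Simulator(1); simulate task starts on core 1
  • 35: core 0 holds Aggregator(1), Simulator(1), Simulator(2); simulate task starts on core 0
  • 36: core 1 holds Simulator(1); transfer a Simulator to core 0
  • 37: core 0 holds Aggregator(1), Simulator(2), Simulator(2)
  • 67: core 0 holds Aggregator(1), Simulator(3), Simulator(1); simulate task starts on core 0
  • 99: core 0 holds Aggregator(1), Simulator(4), that is, 1 Aggregator in the initial state and 4 Simulators in the submit state; aggregate task starts
  • 101: core 0 holds Aggregator(1), Simulator(3); aggregate task starts
  • 103: core 0 holds Aggregator(1), Simulator(2); aggregate task starts
  • 105: core 0 holds Aggregator(1), Simulator(1); aggregate task starts
  • 107: empty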
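
The timestamps in this trace can be reproduced without running any benchmark code. The sketch below is specific to the MonteCarlo example and rests on several assumptions read off the trace: the StartupObject and Aggregator live on a home core, moving a Simulator between cores costs 1 Mcyc, and a core runs its queued simulate tasks before any aggregate task. The function name and signature are hypothetical.

```python
def simulate_montecarlo(sim_cores, home=0, startup=3, simulate=32, aggregate=2, transfer=1):
    """Return (finish time, trace) for a mapping of Simulators to cores; times in Mcyc."""
    trace = [(home, "startup", 0, startup)]
    core_free = {home: startup}         # time at which each core becomes idle
    arrivals = []                       # when each submitted Simulator is back on the home core
    for core in sim_cores:
        hop = transfer if core != home else 0
        start = max(core_free.get(core, 0), startup + hop)
        end = start + simulate
        core_free[core] = end
        trace.append((core, "simulate", start, end))
        arrivals.append(end + hop)
    now = core_free[home]               # aggregates run after the home core's simulate tasks
    for arrival in sorted(arrivals):
        start = max(now, arrival)
        now = start + aggregate
        trace.append((home, "aggregate", start, now))
    return now, trace
```

With three Simulators on core 0 and one on core 1, `simulate_montecarlo([0, 0, 0, 1])` finishes at 107 Mcyc, as in the trace. A balanced mapping, `simulate_montecarlo([0, 0, 1, 1])`, finishes at 75 Mcyc under the same assumptions.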
38
Problem of Exhaustive Searching
Number of CFSTG Core Groups   Number of Cores   Number of Candidates
32                            16                > 6,000
64                            32                > 14,000,000
  • The search space expands quickly
  • Exhaustive search is not feasible for complicated
    applications

39
Random Search?
  • Very low chance to find the best implementation

[Chart: chance to find the best implementation]
40
Developer Optimization Process
  • Create an initial implementation
  • Evaluate it and identify performance bottlenecks
  • Heuristically develop new implementations to
    remove bottlenecks
  • Iteratively repeat evaluation and optimization

41
Directed Simulated Annealing (DSA)
  • Randomly generate candidate implementations
  • The high-level simulator evaluates the candidates and yields the leading candidate implementations
  • As-built critical path analysis identifies potential bottlenecks
  • The implementation generator produces new candidate implementations, which return to the simulator
  • Output: tuned candidate implementation
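
The loop can be illustrated with a toy version. In this sketch the cost function is the load of the busiest core, which stands in for the high-level simulator, and the "directed" step moves one task off the busiest core, which stands in for critical path analysis. All names, the cooling schedule, and the acceptance rule are assumptions, not the authors' implementation.

```python
import math
import random

def core_loads(mapping, durations, n_cores):
    load = [0] * n_cores
    for task, core in enumerate(mapping):
        load[core] += durations[task]
    return load

def directed_neighbor(mapping, durations, n_cores, rng):
    """Move one task from the busiest core (the bottleneck) to the idlest core."""
    load = core_loads(mapping, durations, n_cores)
    busiest, idlest = load.index(max(load)), load.index(min(load))
    movable = [t for t, c in enumerate(mapping) if c == busiest]
    neighbor = list(mapping)
    neighbor[rng.choice(movable)] = idlest
    return neighbor

def directed_simulated_annealing(durations, n_cores, steps=200, temp=10.0, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(n_cores) for _ in durations]    # random initial candidate
    cur_cost = max(core_loads(current, durations, n_cores))
    best, best_cost = current, cur_cost
    for _ in range(steps):
        cand = directed_neighbor(current, durations, n_cores, rng)
        cost = max(core_loads(cand, durations, n_cores))
        # Always accept improvements; accept regressions with a shrinking probability.
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
        if cur_cost < best_cost:
            best, best_cost = current, cur_cost
        temp = max(temp * 0.97, 1e-3)                        # cooling schedule
    return best, best_cost
```

For four 32 Mcyc tasks on two cores, `directed_simulated_annealing([32, 32, 32, 32], 2)` settles on a balanced mapping with a cost of 64.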
42
As-Built Critical Path (ABCP)
  • Provides post-mortem analysis; the technique comes from project management

[Diagram: the simulated execution trace from slide 37 (core 0 and core 1, times 0 to 107 Mcyc), shown next to the core-group mapping of StartupObject, Aggregator, and Simulators.]
43
As-Built Critical Path Analysis
  • Compute the time when a task invocation's data
    dependences are resolved

[Diagram: the simulated execution trace from slide 37, with each task annotated with the time its data dependences were resolved: startup task 0; simulate tasks 3; aggregate tasks 35, 101, 103, 105 (in execution order).]
44
Waiting Task Optimization
  • Waiting tasks
  • Tasks whose real invocation time is later than
    the time when all their data dependences are
    resolved
  • Delayed because of resource conflicts
  • Bottlenecks, remove them from ABCP
  • Optimization
  • Migrate waiting tasks to spare cores
  • Shorten the ABCP to improve performance
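
Finding waiting tasks amounts to comparing each task's start time with the time its data dependences were resolved. The sketch below uses the core 0 tasks of the example trace; the pairing of start times and resolution times is an interpretation of the flattened trace diagrams, and `waiting_tasks` is a hypothetical helper.

```python
# (task, start, end, dependences resolved), all in Mcyc, for the tasks on core 0
CORE0_TRACE = [
    ("startup",     0,   3,   0),
    ("simulate",    3,  35,   3),
    ("simulate",   35,  67,   3),
    ("simulate",   67,  99,   3),
    ("aggregate",  99, 101,  35),
    ("aggregate", 101, 103, 101),
    ("aggregate", 103, 105, 103),
    ("aggregate", 105, 107, 105),
]

def waiting_tasks(trace):
    """Return (task, start, delay) for tasks that started after their data was ready."""
    return [(task, start, start - ready)
            for task, start, _end, ready in trace
            if start > ready]
```

Here `waiting_tasks(CORE0_TRACE)` reports the second and third simulate tasks and the first aggregate task. These are the candidates for migration to a spare core.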

45
Critical Task Optimization
  • There may not exist spare cores to move waiting
    tasks to
  • Identify critical tasks: tasks that produce data
    that is consumed immediately
  • Attempt to execute critical tasks as early as
    possible
  • Migrate other tasks that block a critical
    task to other cores

[Diagram: excerpt of the simulated execution trace (core 0 and core 1, times 35 to 101 Mcyc) showing the simulate and aggregate tasks.]
46
Code Generator
[Tool-flow diagram repeated from the Overview slide, here with an Intermediate C code step listed between the Code Generator and the optimized multi-core binary.]
47
Evaluation
  • MIT RAW simulator
  • Cycle-accurate simulator configured for 16 cores
  • RAW chip: tiled chip, shared memory, on-chip
    network
  • Benchmarks
  • Series: Java Grande benchmark suite
  • MonteCarlo: Java Grande benchmark suite
  • FilterBank: StreamIt benchmark suite
  • Fractal

48
Speedups on 16 cores
  • Successfully generated implementations with good
    performance

Benchmark    1-Core Bamboo (10^6 cyc)   16-Core Bamboo (10^6 cyc)   Speedup over 1-Core Bamboo
Series       26.4                       1.8                         14.7
Fractal      38.4                       3.3                         11.6
MonteCarlo   191.7                      19.0                        10.1
FilterBank   91.2                       6.7                         13.6
49
Comparison to Hand-Written C Code
Benchmark    1-Core C (10^6 cyc)   1-Core Bamboo (10^6 cyc)   16-Core Bamboo (10^6 cyc)   Speedup over 1-Core C   Overhead of Bamboo (%)
Series       25.0                  26.4                       1.8                         13.9                    5.6
Fractal      36.2                  38.4                       3.3                         11.0                    6.1
MonteCarlo   138.8                 191.7                      19.0                        7.3                     38.1
FilterBank   71.1                  91.2                       6.7                         10.6                    28.3
  • Overhead of Bamboo
  • Small for Series and Fractal
  • Larger overhead for MonteCarlo and FilterBank
  • GCC cannot reorder instructions to fill
    floating-point delay slots for Bamboo
    implementations due to imprecise alias results
  • Easy to add alias information to facilitate the
    reordering
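
The derived columns follow directly from the cycle counts. The sketch below recomputes them; the formulas (speedup as a ratio of cycle counts, overhead as the relative slowdown of 1-core Bamboo against 1-core C) are inferred from the numbers in the tables, and the names are illustrative.

```python
# Cycle counts in 10^6 cycles: (1-core C, 1-core Bamboo, 16-core Bamboo)
RESULTS = {
    "Series":     (25.0, 26.4, 1.8),
    "Fractal":    (36.2, 38.4, 3.3),
    "MonteCarlo": (138.8, 191.7, 19.0),
    "FilterBank": (71.1, 91.2, 6.7),
}

def metrics(c1, bamboo1, bamboo16):
    return {
        "speedup_over_bamboo": round(bamboo1 / bamboo16, 1),
        "speedup_over_c": round(c1 / bamboo16, 1),
        "overhead_pct": round((bamboo1 - c1) / c1 * 100, 1),
    }
```

For example, `metrics(*RESULTS["MonteCarlo"])` gives a speedup of 10.1 over 1-core Bamboo, 7.3 over 1-core C, and an overhead of 38.1%.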

50
Comparison of Estimation and Real Execution
             1-Core Bamboo Binary                                 16-Core Bamboo Binary
Benchmark    Estimation (10^6 cyc)   Real (10^6 cyc)   Error (%)   Estimation (10^6 cyc)   Real (10^6 cyc)   Error (%)
Series       26.3                    26.4              0.38        1.7                     1.8               5.56
Fractal      38.4                    38.4              0           3.1                     3.3               6.06
MonteCarlo   191.0                   191.7             0.37        18.3                    19.0              3.68
FilterBank   91.2                    91.2              0           6.5                     6.7               2.99
  • The simulation estimations are close to the real
    execution time
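
The error column is consistent with the relative difference between estimated and real cycle counts, taken against the real value. A short sketch under that assumption (the function name is illustrative):

```python
def relative_error_pct(estimated, real):
    """Estimation error as a percentage of the real execution time."""
    return round(abs(estimated - real) / real * 100, 2)
```

For the 16-core Series binary, `relative_error_pct(1.7, 1.8)` gives 5.56.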

51
Optimality of Directed Simulated Annealing
52
Fractal
53
MonteCarlo
54
FilterBank
55
Generality of Synthesized Implementation
             Profile_original, Input_double                      Profile_double, Input_double
Benchmark    1-Core (10^6 cyc)   16-Core (10^6 cyc)   Speedup     16-Core (10^6 cyc)   Speedup
Series       54.2                3.6                  15.1        3.6                  15.1
Fractal      76.6                6.5                  11.8        6.5                  11.8
MonteCarlo   383.2               37.8                 10.1        35.7                 10.7
FilterBank   182.3               13.3                 13.7        13.3                 13.7
  • The speedups of both 16-core Bamboo versions are
    similar
  • Successfully generate a sophisticated
    implementation utilizing pipelining for MonteCarlo

56
Related Work
  • Data-flow and streaming languages
  • Bamboo relaxes typical restrictions in these
    models to permit:
  • Flexible mutation of data structures
  • Data structures of arbitrarily complex constructs
  • Bamboo supports applications that
    non-deterministically access data
  • Tuple-space languages: the compiler cannot
    automatically create multiple instantiations to
    utilize multiple cores
  • Self-tuning libraries: mostly address specific
    computations

57
Conclusion
  • We developed a new approach to automatically tune
    task-based programs for multi-core processors
  • Automatically generate parallel implementations
  • Automatically tune for a specific
    architecture
  • The approach was evaluated on the MIT RAW simulator
  • Successfully generated implementations with good
    performance
  • Successfully generated a sophisticated
    implementation utilizing pipelining
  • Can be extended to the broader context of
    traditional programming languages

58
Thank you!
59
Future Work
  • Apply our approach to non-simulated multi-core
    processors
  • Develop a more sophisticated processor
    specification
  • Explore a rich set of applications

60
Design Rationale
  • Why not dynamic scheduling?
  • Scales poorly as the number of cores increases
  • Our basic approach makes it easier to adapt to
    future changes in architectures

61
Tree Transform
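[Diagram: before the transform, Producer1 and Producer2 both feed one Consumer core group; after it, the Consumer is replicated so that each producer feeds its own copy. Edge labels: 4, 1, 1.]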
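
The replication step can be sketched for a single level of sharing. The edge-list representation, the `#n` naming of copies, and the counts in the usage note are assumptions for illustration; the sketch does not copy the subtree below a replicated node.

```python
def replicate_shared(edges):
    """edges: list of (parent, child, count). Give each parent its own copy of a shared child."""
    parents = {}
    for parent, child, _count in edges:
        parents.setdefault(child, []).append(parent)
    tree_edges = []
    for parent, child, count in edges:
        if len(parents[child]) > 1:
            # The child has several producers: create one numbered copy per producer.
            child = f"{child}#{parents[child].index(parent) + 1}"
        tree_edges.append((parent, child, count))
    return tree_edges
```

Given `[("Producer1", "Consumer", 4), ("Producer2", "Consumer", 1)]`, each producer ends up with its own Consumer copy, so every node has at most one parent.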
62
Tags
  • Motivation: consider a video processor example
  • Tags group objects together
  • Tags have types
  • Can create many instances of a tag type
  • Each instance defines a group
  • Can bind tag instances to objects
  • Tags can specify that task parameters must be in
    the same group