1. hArtes SW Partitioning overview
- Fabrizio Ferrandi
- Politecnico di Milano
- CASTNESS'08, 15-18 January 2008, Roma, Italy
2. Summary
- Objectives
- Task partitioning
- C based descriptions
- The PandA framework
- Preliminary results
- Conclusions
3. hArtes partitioning objectives
- An important step in the hArtes design space exploration process is the identification of a good partitioning of the given application:
  - identifying fragments of parallel code on which transformations and metric analysis can be performed
  - starting from C based applications coming from advanced audio and video systems that support the next generation of communication and entertainment facilities
  - targeting a combination of embedded processors, digital signal processing and reconfigurable hardware.
4. hArtes concurrency model
- hArtes application characteristics
  - Data intensive
  - Control intensive
- Parallelism granularity
  - Fine grain (basic blocks)
  - Coarse grain (tasks)
- Specifications
  - Sequential programs
  - Concurrency extracted
  - Explicit concurrency with split/join barriers
5. Detection of clusters/tasks
- Detection of clusters of operations connected by data exchanges
- Uses dataflow analysis based on the Program Dependence Graph (PDG), which combines control and data dependences
- Two strategies
  - Top-down: starting from high-level clusters of operations specified by the system designers (the set of functions), decompose each function into sub-tasks connected by minimal sets of data exchanges
  - Bottom-up: starting from minimum-size clusters of operations (i.e., the individual instructions of the C intermediate representation), cluster instructions along the heaviest communications
6. Task granularity
- The actual size of the tasks can be controlled by
  - replicating operations to obtain more parallelism
  - applying loop-based transformations: loop fusion/fission, loop unrolling, etc.
7. Task mapping: goal
- Goal: identify the best trade-off in terms of hardware and software tasks that satisfies the designer's constraints
- Parameters to be considered
  - Platform architecture
  - Performance required by the application
  - Reconfigurable area available
  - Profiling information, starting from performance costs evaluated on each specific platform component
8. Task mapping: how
- The evaluation of the allocation of the modules onto the different components of the platform is performed as a step-by-step process
- The focus is either
  - on traditional reinforcement learning algorithms based on dynamic programming, such as Q-learning and TD(lambda),
  - or on more advanced techniques that combine reinforcement learning and evolutionary computation, such as learning classifier systems, which provide more sophisticated generalization capabilities
9. Cost estimations and metrics
- Metric evaluation for
  - partitioning
  - mapping
- Input
  - C based annotated description
  - Application constraints: maximum task size, bandwidth, performance
  - Task decomposition (expressed as annotations or an external file)
  - Profiling information (expressed as annotations or an external file)
  - Target architecture constraints and description
- The tool extracts
  - Inter-process synchronization and communication
  - Inter-procedural control and data flow dependencies
- Output
  - C based annotated description
  - Control and data characterization of task exchanges
  - Behavior similarity
  - Closeness between tasks
  - Structural relationship between application and target architecture
10. Task partitioning interactions
11. C based descriptions
- C does not have any parallelism-related construct
- Each task is described by a C function
- A notation to express parallelism is required
- Many different notations are possible
  - Pragma based
  - Comment based
  - XML based
- The notation must be powerful enough to express that
  - two or more tasks can run in parallel (fork operation)
  - the execution flow must wait for the termination of one or more tasks (join operation)
12. Motivations for OpenMP
- OpenMP is a collection of pre-processor directives (pragmas) used to express parallelism in C programs
- Purposely created for a fork/join model
- It is an open and widely adopted standard
  - supported in the next release of the GNU/GCC compiler (gcc 4.2)
  - platform independent
  - mapping independent
  - easy functional verification of partitioned programs on host machines
13. Supported Constructs in Source Code
- We already support a large number of C constructs (e.g. nested structs, unions, pointer arithmetic, etc.)
- Selected a meaningful subset of the GNU/GCC Torture Testsuite, composed of 834 benchmarks. Currently we cover 828/834 benchmarks (99.3%)
- Covering means
  - parsing
  - building the internal representation
  - dumping back the C code
  - compiling the produced code
  - executing it without errors (most of the benchmarks contain self-checks that detect execution errors)
- Ongoing work to support varargs, computed goto, and irreducible loops
14The PandA framework
Integrated with GCC infrastructure The output of
the task partitioning is an OpenMP compliant C
code Group operations into tasks in an efficient
way, in order to meet performance
requirements Initial mapping on the platform
based on metrics under development
15. Preliminary results
- Results on the ADPCM and JPEG algorithms (1.4-1.7x maximum speedup)
- Correctly analyzed all the hArtes applications
16. Questions?
- Hopefully on Friday, directly to Fabrizio