Fabrizio Ferrandi - PowerPoint PPT Presentation


1
hArtes SW Partitioning overview

Fabrizio Ferrandi
Politecnico di Milano
CASTNESS'08 15-18 January 2008 - ROMA - Italy

2
Summary

Objectives
Task partitioning
C based descriptions
The PandA framework
Preliminary results
Conclusions

3
hArtes partitioning objectives

An important step in the hArtes design space
exploration process is the identification of a
good partitioning of the given application:
identifying fragments of parallel code on which
transformations and metric analysis can be
performed,
starting from C based applications coming from
advanced audio and video systems that support the
next generation of communication and
entertainment facilities,
targeting a combination of embedded processors,
digital signal processors and reconfigurable
hardware.

4
hArtes concurrency model

hArtes applications characteristics
Data intensive
Control intensive
Parallelism granularity
Fine grain (basic blocks)
Coarse grain (tasks)
Specifications
Sequential programs
Concurrency extracted
Explicit concurrency with split/join barriers

5
Detection of clusters/tasks

The detection of clusters of operations connected
by data exchanges
Uses dataflow analysis based on control and data
dependence graphs (Program Dependence Graph, PDG)
Two strategies
Top-down
starting from high-level clusters of operations
specified by the system designers (the set of
functions), decompose each function into sub-tasks
that are connected by minimal sets of data
exchanges.
Bottom-up
starting from minimum size clusters of operations
(i.e., the individual instructions as specified
by the C intermediate representation), cluster
instructions along the heavier communications.
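
As a rough illustration of the bottom-up strategy, the sketch below starts with one cluster per instruction and greedily merges clusters along the heaviest communication edges first, as long as a size bound is respected. The edge weights, the MAX_NODES limit, the size bound and all function names are assumptions made for this example; they are not taken from the hArtes tools.

```c
#include <stdlib.h>

#define MAX_NODES 64  /* hypothetical upper bound on the number of instructions */

/* A data exchange between two instructions, weighted by communication cost. */
typedef struct {
    int src;
    int dst;
    int weight;
} Edge;

static int parent[MAX_NODES];       /* union-find forest of clusters */
static int cluster_size[MAX_NODES]; /* valid only at cluster representatives */

/* Return the representative of the cluster containing x (path halving). */
static int find_cluster(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heavier edges come first. */
static int by_weight_desc(const void *a, const void *b) {
    const Edge *ea = a, *eb = b;
    return (eb->weight > ea->weight) - (eb->weight < ea->weight);
}

/* Greedy bottom-up clustering. Sorts edges in place and returns the
   number of clusters left, or -1 if n_nodes exceeds MAX_NODES. */
int cluster_bottom_up(int n_nodes, Edge *edges, int n_edges, int max_size) {
    if (n_nodes > MAX_NODES) return -1;
    int clusters = n_nodes;
    for (int i = 0; i < n_nodes; i++) {
        parent[i] = i;        /* every instruction starts as its own cluster */
        cluster_size[i] = 1;
    }
    qsort(edges, (size_t)n_edges, sizeof(Edge), by_weight_desc);
    for (int i = 0; i < n_edges; i++) {
        int a = find_cluster(edges[i].src);
        int b = find_cluster(edges[i].dst);
        /* Merge only distinct clusters that stay within the size bound. */
        if (a != b && cluster_size[a] + cluster_size[b] <= max_size) {
            parent[b] = a;
            cluster_size[a] += cluster_size[b];
            clusters--;
        }
    }
    return clusters;
}

/* Query helper: are two instructions in the same cluster? */
int same_cluster(int x, int y) {
    return find_cluster(x) == find_cluster(y);
}
```

The size bound stands in for the task granularity control discussed on the next slide; a real partitioner would likely weigh computation cost as well as communication.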

6
Task granularity

The actual size of the tasks can be controlled by
exploiting replication of the operations to
obtain more parallelism
exploiting loop based transformations
(fusion/fission, loop unrolling, ...)
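
Loop fission, one of the transformations named above, can be illustrated on a loop whose body holds two independent computations. Splitting it gives two smaller loops that could become separate tasks. The function and array names are invented for this illustration, and the output arrays are assumed not to overlap the input.

```c
#include <stddef.h>

/* Before fission: one loop, hence one larger candidate task. */
void scale_and_shift_fused(const int *in, int *scaled, int *shifted, size_t n) {
    for (size_t i = 0; i < n; i++) {
        scaled[i] = in[i] * 2;
        shifted[i] = in[i] + 1;
    }
}

/* After fission: two loops with no dependence between them, so each
   one can be treated as a smaller task and run in parallel. */
void scale_and_shift_split(const int *in, int *scaled, int *shifted, size_t n) {
    for (size_t i = 0; i < n; i++)
        scaled[i] = in[i] * 2;
    for (size_t i = 0; i < n; i++)
        shifted[i] = in[i] + 1;
}
```

Loop fusion is the inverse step: it merges the two loops back into one, trading parallelism for a coarser task.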

7
Task mapping goal

Goal: identify the best trade-off in terms of
hardware and software tasks to satisfy the
designers' constraints
Parameters to be considered
Platform architecture
Performance required by the application
Reconfigurable area available
Profiling information starting from performance
costs evaluated on each specific platform
component

8
Task mapping: how

The evaluation of the allocation of the modules
onto the different components of the platform is
performed as a step-by-step process.
Here the focus is either
on traditional reinforcement learning algorithms
based on dynamic programming, such as Q-Learning
and TD(lambda),
or on more advanced techniques that exploit both
reinforcement learning and evolutionary
computation, such as learning classifier systems,
which provide more sophisticated generalization
capabilities
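
For reference, the core of tabular Q-Learning is a single update rule, sketched below. How hArtes encodes states, actions and rewards is not stated in the slides; here a state is assumed to be a step of the mapping process and an action the choice of a platform component, and the table sizes and all names are hypothetical.

```c
#define N_STATES 4   /* hypothetical: steps of the mapping process */
#define N_ACTIONS 3  /* hypothetical: processor, DSP, reconfigurable hardware */

/* Value of the best action available in a given state. */
double best_q(double q[N_STATES][N_ACTIONS], int state) {
    double best = q[state][0];
    for (int a = 1; a < N_ACTIONS; a++) {
        if (q[state][a] > best)
            best = q[state][a];
    }
    return best;
}

/* One Q-Learning step:
   Q(s,a) += learning_rate * (reward + discount * max Q(s',.) - Q(s,a)) */
void q_update(double q[N_STATES][N_ACTIONS], int state, int action,
              double reward, int next_state,
              double learning_rate, double discount) {
    double target = reward + discount * best_q(q, next_state);
    q[state][action] += learning_rate * (target - q[state][action]);
}
```

In a mapping setting the reward would plausibly come from the cost estimations and profiling information described on the surrounding slides, for example a negative estimated execution time.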

9
Cost estimations and metrics

Metric evaluation for
partitioning
mapping
Input
C based annotated description
Application constraints: maximum size of task,
bandwidth, performance
Task decomposition (expressed as annotations or
external file)
Profiling information (expressed as annotations
or external file)
Target architecture constraints and description
Tool extracts
Inter-process synchronization and communication
Inter-procedural control and data flow dependency
Output
C based annotated description
Control and data characterization of task
exchanges
Behavior similarity
Closeness between tasks
Structural relationship between application and
target architecture

10
Task partitioning interactions

11
C based descriptions

C does not have any instruction for expressing
parallelism
Each task is described by a C function
A notation to express parallelism is required
Many different notations are possible
Pragma based
Comment based
XML based
The notation must be powerful enough to express
that
two or more tasks can run in parallel (fork
operations)
the execution flow must wait for the termination
of one or more tasks (join operation)

12
Motivations for OpenMP

OpenMP is a collection of pre-processor
directives (pragmas) used to express parallelism
in C programs
Purposely created for a fork/join model
It is an open and widely adopted standard
supported in the next release of the GNU GCC
compiler (GCC 4.2)
platform independent
mapping independent
easy functional verification of partitioned
programs on host machines
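
A minimal sketch of the fork/join model in OpenMP, using the standard parallel sections construct. The two task functions are placeholders invented for this example; the slides do not show the code that PandA actually emits.

```c
/* Two hypothetical tasks, each described by a C function. */
static int task_a(int x) { return x * x; }
static int task_b(int x) { return x + 10; }

int run_tasks(int x) {
    int ra = 0, rb = 0;

    /* Fork: each section may be executed by a different thread. */
    #pragma omp parallel sections
    {
        #pragma omp section
        ra = task_a(x);

        #pragma omp section
        rb = task_b(x);
    }
    /* Join: the parallel region ends with an implicit barrier, so both
       results are available here. */
    return ra + rb;
}
```

With GCC the pragmas take effect when compiling with -fopenmp. Without that flag they are ignored and the two tasks run one after the other with the same result, which is what makes functional verification on a host machine straightforward.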

13
Supported Constructs in Source Code

We already support a large number of C constructs
(e.g. nested struct, union, pointer arithmetic,
...)
Selected a meaningful subset of the GNU/GCC
Torture Testsuite composed of 834 benchmarks.
Currently, we cover 828/834 benchmarks (99.3%).
Covering means:
parsing
building internal representation
dumping back the C code
compiling the produced code
executing it without errors (most of the
benchmarks contain built-in checks that detect
execution errors)
Ongoing work to support var_args, computed goto,
and irreducible loops.
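
For readers unfamiliar with the constructs listed as ongoing work, the fragment below shows two of them in standard C: a variadic function and an irreducible loop, that is, a cycle in the control flow with more than one entry point (built here with goto). Both functions are invented for illustration and are not taken from the testsuite. Computed goto is a GNU C extension and is left out.

```c
#include <stdarg.h>

/* Variadic function: sums `count` int arguments. */
int sum_ints(int count, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/* Irreducible loop: the cycle top -> middle -> top can be entered
   either at `top` (fall-through) or at `middle` (via the first goto). */
int count_steps(int n, int enter_middle) {
    int steps = 0;
    if (enter_middle)
        goto middle;
top:
    steps++;
middle:
    n--;
    if (n > 0)
        goto top;
    return steps;
}
```

Because the cycle has two entries, it has no single loop header, which is why such loops need special handling in analyses built around natural loops.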

14
The PandA framework

Integrated with GCC infrastructure
The output of the task partitioning is an OpenMP
compliant C code
Group operations into tasks in an efficient way,
in order to meet performance requirements
Initial mapping on the platform based on metrics
under development

15
Preliminary results

ADPCM
JPEG
Results on ADPCM and JPEG algorithms (1.7-1.4 max
speedup)
Correctly analyzed all the hArtes applications

16
Questions?