Title: G1: Profiling Applications for SW/HW (PASH) Partitioning and Co-scheduling
1. G1: Profiling Applications for SW/HW (PASH) Partitioning and Co-scheduling
- Faculty: Dr. Tarek El-Ghazawi, Dr. Mohamed Taher
- Student Leader: Proshanta Saha
2. Motivation
- Exploit the synergy between the µP and the FPGA
- Lack of a formal co-design methodology for reconfigurable applications
- Current methods are ad hoc and often time-consuming
- Lack of tools for hardware/software co-design and analysis for reconfigurable computers
3. Objectives
- Propose algorithms for partitioning and co-scheduling for RC systems
- Create tools for HW/SW partitioning and co-scheduling of applications onto RC systems
  - Automatic
    - Quickly generate an accelerated solution
    - Assist compiler developers with algorithms for partitioning and co-scheduling
  - Semi-automatic
    - Explore what-if scenarios
    - Leave it to the end user to interactively decide on a good partition
4. Automatic Partition
(Flow diagram with components: Application Code, Execution Profiling, DAG Analysis, Objective Functions, Constraint Analysis, Co-Design, Execution.)
5. Semi-Automatic Partition
(Flow diagram with components: Application Code, Execution Profiling, DAG Analysis, Objective Functions, Constraint Analysis, Co-Design, Execution, What-if Scenarios, Visualization.)
6. Application Profiling
- Assumes HLL source code such as C
- Utilizes open-source profilers such as:
  - GNU Profiler (gprof): the standard tool; limited thread support
  - Qprof from HP Labs: thread and dynamic-library support
  - OProfile: system-level profiling
  - Other profilers
(A minimal gprof workflow is sketched below.)
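To make the profiling step concrete, here is a minimal C sketch of the gprof workflow; the function names and workload are invented for illustration, and the build/run commands appear in the header comment.

```c
/* hotspot.c - toy kernel for identifying hot functions with gprof.
 * Build with profiling instrumentation, run, then inspect the profile:
 *   gcc -pg -O2 hotspot.c -o hotspot
 *   ./hotspot            (writes gmon.out)
 *   gprof hotspot gmon.out
 */
#include <stdio.h>

/* Candidate for FPGA offload: dominates the flat profile. */
static double heavy_kernel(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            acc += (double)i * j / (i + j + 1);
    return acc;
}

/* Cheap bookkeeping that should stay on the microprocessor. */
static double light_setup(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += i;
    return acc;
}

int main(void) {
    printf("setup  = %f\n", light_setup(100000));
    printf("kernel = %f\n", heavy_kernel(2000));
    return 0;
}
```

A flat profile dominated by `heavy_kernel` would flag it as the hot zone to consider for hardware partitioning.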
7. Data Profiling
- Analyze the accuracy requirements of the application
- Determine the dynamic range required
- Decide on a suitable implementation precision
  - Fixed point
  - Floating point (single/double precision)
- Methods utilized (see the sketch below)
  - Full-precision tracking
  - Quantization
  - Truncation
  - Wrapping
  - Overflow
  - Saturated arithmetic
  - Rounded arithmetic
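As an illustration of two of the listed overflow behaviors, here is a small sketch of wrapping versus saturated addition in an assumed Q16.16 fixed-point format; the format choice and names are ours, not the tool's.

```c
/* q16_16.c - wrapping vs. saturated fixed-point addition in a Q16.16
 * format (16 integer bits, 16 fractional bits). Illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef int32_t q16_16;                /* value = raw / 65536.0 */

#define Q_ONE ((q16_16)1 << 16)

/* Wrapping add: overflow silently wraps around (two's complement). */
static q16_16 q_add_wrap(q16_16 a, q16_16 b) {
    return (q16_16)((uint32_t)a + (uint32_t)b);
}

/* Saturated add: overflow clamps to the representable extremes. */
static q16_16 q_add_sat(q16_16 a, q16_16 b) {
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (q16_16)s;
}

int main(void) {
    q16_16 big = INT32_MAX - Q_ONE;    /* near the top of the range */
    printf("wrap: %f\n", q_add_wrap(big, 2 * Q_ONE) / 65536.0);
    printf("sat : %f\n", q_add_sat(big, 2 * Q_ONE) / 65536.0);
    return 0;
}
```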
9. Getting the Application DAG
- Source code is run through a series of steps during compilation to annotate the directed acyclic graph (DAG)
- Intermediate format (IF) analysis using open-source tools such as:
  - Stanford University Intermediate Format (SUIF) compiler system
  - Machine SUIF from Harvard
- Extract the dependency graph from the IF analysis
  - Extract dependency analysis from the SUIF output
  - Use the dependency analysis to generate the DAG (a minimal sketch follows)
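Here is a toy sketch of the final step: turning a dependency edge list into a DAG and a topological order. The edge list is invented, not real SUIF output.

```c
/* dag.c - toy dependency-DAG construction and topological ordering,
 * standing in for the DAG-generation step driven by SUIF output. */
#include <stdio.h>

#define N 5                            /* basic blocks / tasks */

static const int edge[][2] = {         /* (from, to) dependencies */
    {0, 1}, {0, 2}, {1, 3}, {2, 3}, {3, 4}
};
static const int nedges = sizeof edge / sizeof edge[0];

int main(void) {
    int indeg[N] = {0}, order[N], head = 0, tail = 0;

    for (int e = 0; e < nedges; e++)
        indeg[edge[e][1]]++;
    for (int v = 0; v < N; v++)        /* roots enter the queue first */
        if (indeg[v] == 0) order[tail++] = v;
    while (head < tail) {              /* Kahn's algorithm */
        int v = order[head++];
        for (int e = 0; e < nedges; e++)
            if (edge[e][0] == v && --indeg[edge[e][1]] == 0)
                order[tail++] = edge[e][1];
    }
    printf("topological order:");
    for (int i = 0; i < tail; i++) printf(" %d", order[i]);
    printf("\n");
    return 0;
}
```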
10. DAG Generation
11. System Characterization
- Use benchmarks to measure throughput and identify bottlenecks:
  - Access to system memory (SM)
  - Access to RP local memory (LM)
  - Data transfer bandwidth and overhead
  - System overhead
12. System Characterization (continued)
- Routing overhead
- Core services overhead
- Reconfiguration time
- Resource constraints: CLBs, LUTs, FFs, multipliers, ...
(A simple host-side bandwidth microbenchmark is sketched below.)
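The sketch below is a minimal microbenchmark in the spirit of the characterization step. It times only host `memcpy`; characterizing µP-to-FPGA transfers would substitute the vendor's transfer call for the `memcpy`.

```c
/* bw.c - minimal host memory-bandwidth microbenchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t bytes = 64u << 20;          /* 64 MiB per copy */
    int reps = 10;
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) return 1;
    memset(src, 1, bytes);             /* touch pages before timing */
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("memcpy bandwidth: %.2f MB/s\n",
           (double)bytes * reps / sec / 1e6);
    free(src); free(dst);
    return 0;
}
```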
13. Visualization with an OTF Reader
- TAU ParaProf
- Vampir-NG
- KOJAK
14. Application Co-Design Analysis
- Cluster based on the critical path
- Examine hot zones (90/10 rule)
- Partition based on basic blocks: loops, branches
- Discover parallelism in the application
- Monitor communication requirements
  - Discover I/O bottlenecks
  - Analyze all data access patterns
  - Overlap computation and communication
  - Reduce the number of transfers
15. Co-Scheduling
- Leverage scheduling algorithms from heterogeneous computing (HC), embedded computing (EC), and early work on reconfigurable hardware (RH); see the sketch after this list
- Static scheduling
  - HC algorithms
  - RH algorithms
  - EC algorithms
  - Proposed algorithm
- Dynamic scheduling
  - HC algorithms
  - Proposed algorithms
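As a flavor of static list scheduling, here is a toy co-scheduler that greedily assigns each task to whichever resource (µP or FPGA) finishes it earliest. The cost table is invented for illustration, and this is not the project's proposed algorithm.

```c
/* cosched.c - toy static co-scheduler over a serial task chain:
 * each task goes to the resource with the earliest finish time. */
#include <stdio.h>

#define T 4
static const double cost[T][2] = {     /* [task][0]=µP, [1]=FPGA */
    {4.0, 1.0}, {2.0, 3.0}, {6.0, 1.5}, {1.0, 2.0}
};

int main(void) {
    double ready[2] = {0.0, 0.0};      /* when each resource frees up */
    double prev_finish = 0.0;          /* serial dependency chain */

    for (int t = 0; t < T; t++) {
        double best = -1.0; int where = 0;
        for (int r = 0; r < 2; r++) {
            double start = ready[r] > prev_finish ? ready[r] : prev_finish;
            double fin = start + cost[t][r];
            if (best < 0 || fin < best) { best = fin; where = r; }
        }
        ready[where] = best;
        prev_finish = best;            /* task t+1 depends on task t */
        printf("task %d -> %s (finish %.1f)\n",
               t, where ? "FPGA" : "uP", best);
    }
    return 0;
}
```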
16. Integrated Development Environment (IDE)
- Based on an open-source visual editor
- Extended with widgets and functionality for:
  - Setting objective functions
  - Profiling source code
  - Visualizing the application
  - Automatic partitioning
  - Selecting what-if scenarios
- The IDE will interface with the various tools:
  - Co-scheduling
  - Co-design analysis
  - Profiling tools
  - Visualization tools
17. Profiling and HW/SW Co-Scheduling
(Block diagram with components: Application Code, Execution Profiling, Application Profiling, DAG Analysis, System Characterization (resources, bus, memory), User Interface, Co-Scheduling.)
18. Profiling and HW/SW Co-Scheduling: Details of the Proposed Tool
- Functionality the tool will offer:
  - Application analysis
    - Application profiling
    - Data profiling
  - RC system analysis
    - Resource analysis
    - System overhead, architecture constraints
  - Co-design guidance
  - Automatic co-scheduling
  - Performance-impact analysis (e.g., speedup)
- Deliverables
  - Algorithms for partitioning and co-scheduling
  - A tool for profiling, partitioning, and co-scheduling applications
  - Case studies
  - Research papers
19. G2: Node Simulation and Architecture Studies (NSAS)
- Faculty: Dr. Tarek El-Ghazawi, Dr. Ivan Gonzalez, Dr. Sergio Lopez-Buedo
- Student Leader: Miaoqing Huang
20. Problems at the Reconfigurable Node Level
- Reconfigurable hardware is very fast, but end-to-end performance is often problematic
- The speed of transfers between the microprocessor and the FPGAs has been identified as a limiting factor
- The local memory architecture has a great impact on overall performance
- Transfers, and coherence between microprocessor memory and local FPGA memory, are handled by programmers
21. Goal
- Understand the issues with current architectures
- Provide the infrastructure to explore new architectures:
  - How the RP and the microprocessor should be interconnected
  - Memory hierarchy (microprocessor and RP memory)
  - RP local memory architectures
22. Objectives
- Build a simulation framework
  - Support for the µP, the communication infrastructure, and the FPGA
- Develop a compact benchmarking suite
  - Use the existing HPCC benchmarks as a starting point
  - Select a reduced set of applications to evaluate the main bottlenecks: communication bandwidth, memory throughput, etc.
- Conduct architectural exploration studies
23. Approach
- Each component of the node needs its own simulation and modeling tool
  - Microprocessor, FPGA, memories, buses, etc.
- Build a simulation framework
  - Integrate different widely used simulators into a unified environment
    - Processor simulators
    - Simulation tools for reconfigurable logic devices
  - Integrate third-party simulation tools for existing and/or new architecture components
  - Create models of reconfigurable devices, communication interfaces, and other components such as memories or peripherals
  - Architectural modeling tools
- Simplify the design and simulation of complex hardware/software architectures
24. Deliverables
- Analysis: select tools for the co-simulation framework
  - Architectural modeling and simulation tools
  - Single- and multiprocessor system simulators
  - Tools for simulating and modeling reconfigurable hardware
- Development: framework
  - Leverage previous work from open-source projects
  - Improve the integration, modeling, and simulation of processors, reconfigurable logic, and communication/interconnection mechanisms
- Proof of concept
  - Study the bottlenecks in actual hardware/software systems
  - Research new mechanisms to connect reconfigurable logic devices and standard processors
25. Deliverables
- Analysis
  - Architectural modeling and simulation tools
    - MILAN framework: multiple levels of granularity. Reference: http://milan.usc.edu/
    - Liberty: component-based modeling tool. Reference: http://liberty.cs.princeton.edu/Software/LSE/
26. Deliverables
- Analysis
  - Single- and multiprocessor system simulators
    - M5. Reference: http://m5.eecs.umich.edu/wiki/index.php/Main_Page
    - SimpleScalar. Reference: http://www.simplescalar.com/docs/hack_guide_v2.pdf
    - ArchC (Institute of Computing at the University of Campinas). Reference: http://www.archc.org/
27. Deliverables
- Analysis
  - Tools for simulating and modeling reconfigurable hardware
    - Reconfigurable logic modeling tools: SystemC, Simulink, etc.
    - Simulators: ModelSim, Active-HDL, etc.
28. Deliverables
(Diagram: the HPCC benchmarks and the simulation framework.)
29. G3: Hardware Architectural Virtualization (HAV) and Run-Time Systems for Application Portability
- Faculty: Dr. Tarek El-Ghazawi, Dr. Mohamed Taher, Dr. Sergio Lopez-Buedo
- Student Leader: Esam El-Araby
30. The HPRC Programmer's Nightmare
- Lack of application portability
  - Reconfigurable resources are managed using vendor-specific APIs
  - Applications are completely dependent on the underlying machine
  - Porting from one machine to another is hard
  - Adapting an application after a HW upgrade is also non-trivial
- Manual partitioning between HW and SW
  - Explicitly done at design time, so changes imply redesign and recoding, not just recompilation
  - HW design skills are required
  - Difficult to optimize the synergy between HW and SW
- Multi-tasking/multi-user operation not supported
  - There is no centralized system in charge of distributing the resources
  - Multi-tasking/multi-user support must be explicitly included in the applications
31. How to Solve All These Drawbacks?
- Problems:
  - Programmers need to explicitly manage the physical resources
  - Lack of support for multi-user/multi-tasking applications
  - No standardized API to access the reconfigurable HW
  - Explicit HW/SW partitioning in source code
  - Developers are required to design complex abstractions to make their applications portable
- Proposed answers: virtualization of hardware resources, plus a run-time system
32. Proposed Solution: Virtualization of Resources
- Hardware Architectural Virtualization (HAV)
  - Applications request virtual resources
  - The virtual resources are the processing elements that execute tasks
    - A task is an operation that processes data and produces results, e.g., FFT, DCT
    - The programmer only cares about tasks, not about device details
  - A run-time system maps virtual resources to physical resources, subject to the underlying constraints:
    - The physical resources (µP, FPGA, etc.) available in the system
    - Co-scheduler decisions
  - A library contains the implementations of the tasks for the different physical processors
    - Portable library: each task will have at least one SW (µP) and one HW (FPGA) implementation per architecture
    - Many HW implementations are possible (speed/area/power tradeoffs); a toy dispatch table is sketched below
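To illustrate the portable-library idea, here is a hedged sketch of a task table pairing a SW fallback with a HW bitstream per task. Every name in it (the struct, `fft_sw`, the bitstream paths) is hypothetical.

```c
/* tasklib.c - toy portable-library dispatch table: each task name maps
 * to a SW fallback and a HW bitstream path. Names and paths invented. */
#include <stdio.h>
#include <string.h>

typedef void (*sw_impl)(const float *in, float *out, int n);

static void fft_sw(const float *in, float *out, int n) { (void)in; (void)out; (void)n; }
static void dct_sw(const float *in, float *out, int n) { (void)in; (void)out; (void)n; }

struct task_entry {
    const char *name;        /* device-independent task name       */
    sw_impl     sw;          /* µP implementation (always present) */
    const char *bitstream;   /* one of possibly many HW variants   */
};

static const struct task_entry library[] = {
    { "FFT", fft_sw, "cores/fft_v1.bit" },
    { "DCT", dct_sw, "cores/dct_v1.bit" },
};

static const struct task_entry *lookup(const char *name) {
    for (size_t i = 0; i < sizeof library / sizeof library[0]; i++)
        if (strcmp(library[i].name, name) == 0)
            return &library[i];
    return NULL;
}

int main(void) {
    const struct task_entry *t = lookup("FFT");
    if (t)
        printf("task %s -> %s (SW fallback available)\n",
               t->name, t->bitstream);
    return 0;
}
```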
33. Benefits from Hardware Virtualization
- The programmer does not have to worry about HW details
  - Ratio of FPGAs to µPs
  - Number and location of RP memory banks
- Separates the processing tasks from the processors on which they will execute
- Enhances portability and device utilization
  - If no reconfigurable resources are available, HW tasks can be either queued for later execution or executed in SW
- Explicit HW/SW partitioning is no longer required
  - Abstraction based on device-independent tasks
  - There is no need to define the HW mapping at design time, which makes development significantly easier
34. The Two Sides of HW Virtualization
- Virtualize the reconfigurable processing resources
- Virtualize the resources used by the hardware cores
(Diagram: the software user's resources are virtualized by the OS; the hardware user's view virtualizes the FPGA's vendor-specific resources.)
35. Proposed Solution: Run-Time System
- The run-time system maps virtual resources to physical resources and provides:
  - Centralized management of resources
  - An abstraction layer
- Can be implemented at either kernel or user level
  - User-level daemons are easier to implement, but communicate via sockets or IPC primitives, with noticeable overhead
  - As a kernel module, asynchronous notifications become inexpensive
- Leverage previous work on the LUCITE LSF project
- Enables multi-tasking/multi-user support
- Enables application portability
36. Extended LSF: Networks of Reconfigurable Computers (NORCs)
- A job management system for networks of reconfigurable computers
  - Reconfigurable resources are expensive and underutilized
  - Many of these resources are available over the network
  - Need for a SW system to remotely schedule and monitor reconfigurable tasks
- An extension of LSF supporting several popular FPGA accelerator boards was implemented
- Parallel DES breaker example
  - 500x speedup (relative to a Pentium 4)
  - 95% utilization was demonstrated
  - New boards could be easily incorporated
37. Proposed Run-Time System: GOAST
- A run-time system in charge of offloading tasks to the processing resources (µP, FPGA, etc.)
- A centralized mechanism to enable transparent resource sharing and load balancing in multi-user, multi-tasking environments
- An interface to an external scheduler that will provide the benefits of µP-FPGA synergy
- Includes an application portability layer
  - A static API/ABI allows application portability across different environments without recompilation
- Dedicated configuration manager
  - Where the actual virtual-to-physical resource mapping occurs
  - Configures the physical devices (µP, FPGA, etc.) with the code/bitstream that implements the task mapped to them
38. GOAST: An Initial Vision
- Main features of the GOAST architecture
  - The core provides a handle-based API to programs (a hypothetical usage sketch follows this list)
  - Tasks are obtained from a portable library and loaded onto the physical processing resources by the configuration manager
  - Using an external scheduler allows different scheduling algorithms to be implemented
- Implementation details
  - Effort will focus on delivering a portable user-space service with low overhead
  - How to implement inexpensive asynchronous notifications
- Management interface
  - Provided by the core so that both advanced users and system administrators can manage resource usage and availability
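Since the GOAST API is not yet specified, the sketch below only suggests what a handle-based client might look like; every identifier (`goast_open`, `goast_submit`, `goast_wait`, `goast_close`) is invented, and stub bodies are included so it compiles standalone.

```c
/* goast_demo.c - hypothetical handle-based GOAST client. */
#include <stdio.h>

typedef int goast_handle;               /* opaque task handle */

/* Invented API surface, stubbed so the sketch runs. */
static goast_handle goast_open(const char *task) { (void)task; return 1; }
static int  goast_submit(goast_handle h, const void *in, void *out,
                         unsigned n) { (void)h; (void)in; (void)out; (void)n; return 0; }
static int  goast_wait(goast_handle h)  { (void)h; return 0; }
static void goast_close(goast_handle h) { (void)h; }

int main(void) {
    static float in[1024], out[1024];

    /* Request a virtual "FFT" resource; the run-time system decides
     * whether the task lands on a µP or an FPGA. */
    goast_handle h = goast_open("FFT");
    if (h < 0) return 1;
    if (goast_submit(h, in, out, 1024) == 0)   /* asynchronous submit */
        goast_wait(h);                         /* ... so wait for it  */
    goast_close(h);
    printf("task completed (stubbed)\n");
    return 0;
}
```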
39. Deliverables
- Conceptual study of hardware virtualization techniques
  - Alternatives, challenges, tradeoffs, etc.
- GOAST specification and API
  - Comprehensive analysis of challenges and scalability
- GOAST reference implementation (proof of concept)
  - User-level implementation
  - Standardized API to manage SW and HW virtual resources
  - Kernel-level implementation and support for device-independent tasks are future work
(Layer diagram: higher-level abstractions; device-independent task-based API; standardized API for SW and HW virtual resources; run-time system; virtualization layer; physical resources (µP, FPGA).)
40. Future Work / Synergies
- Further extend the GOAST API
  - Device-independent tasks
  - Seamless support for legacy code via a preprocessor
- Integrate the run-time system into the OS kernel
  - Reduce the cost of asynchronous notifications
- Use partial run-time reconfiguration to divide a physical FPGA into virtual devices
- Further improve virtualization using virtual memory management techniques
- GWU projects
  - G1 provides the co-scheduler
  - G5 provides the framework for developing the portable library of tasks, and an example (biomedical)
  - G2 will provide new architectures to explore the portability problem more comprehensively
- UF projects
  - F3 will provide further case studies to test the proposed run-time system
  - F4 will provide the necessary RTR background for the future work
41. G4: High-Level Languages Productivity (HLLP): an HPC Perspective
- Faculty: Dr. Tarek El-Ghazawi, Dr. Mohamed Taher
- Student Leader: Kun Xi
42. Background
- Many options exist for developing RC/FPGA applications
  - How are they really different?
  - Which one is optimal for a given project and a given set of developers?
- HDLs give the best results, but
  - Steep learning curve
  - Long development cycle
- HLLs are C-like and easy, but
  - Their limitations relative to C are unknown
  - They are very different from C, and from each other
- There are also graphical tools
  - They seem easy to use, but are there penalties or hidden costs?
- All differences and design issues are obscured by marketing literature
43. Goals and Objectives
- Understand the underlying differences among the available tools
- Guide the programmer in choosing the right language
  - Make an intelligent selection among existing HLL tools based on their features, the applications, and the programmers' strengths
- Impact future HLL development for improved productivity and portability of applications
- Develop a formal methodology to
  - Understand the underlying differences among the available tools
  - Guide the programmer in choosing the right language to solve a given problem
44. Example Languages to Consider
- Classic HDLs
  - VHDL
  - Verilog
- Text-based HLLs
  - Impulse-C
  - Handel-C
  - Mitrion-C
  - ...
- Graphical tools
  - DSPLogic
(Slide annotations group these as imperative languages, a functional language, and a graphical/dataflow language.)
45. Typical Hardware Development Flow
46. Leverage Previous Work/Experience
- Leverage the evaluation methodology of the NORCs (Extended LSF) project
- RC HLL preliminary study with ARSC
- Leverage work with IBM and PSC on productivity under DARPA HPCS
48. Methodology for Language Evaluation
- Conceptual study aiming for a set of orthogonal HLL features
- Scoring system/metrics for understanding productivity
- Experimental study involving key applications
- Instrumentation package to support the experimental studies
  - Similar to SUMS
  - From the joint study by PSC, IBM, GWU, and DARPA on the productivity of programming languages, particularly X10, UPC, and MPI
49. HLL Programming Paradigms: Evaluation Metrics
- Ease of use
  - Acquisition time, i.e., learning and gaining experience
    - Depends on the type of paradigm being adopted
  - Development time
    - Depends on both the paradigm and the application being developed
  - Together, the acquisition time and the development time directly express the opposite of ease of use, i.e., difficulty of use (see the formula below)
    - Captures the effects of the programming model's explicitness
    - The more explicit the programming model, i.e., the more architectural details the user/developer must handle, the longer it takes to both acquire the language and develop applications
  - Prior user/developer experience can reduce both the acquisition and the development times
  - Normalized to the corresponding metric for a reference HDL
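One possible formalization of this normalization, with all symbols introduced here rather than taken from the slides ($T_{\mathrm{acq}}$ and $T_{\mathrm{dev}}$ are the measured acquisition and development times; "ref" denotes the reference HDL):

```latex
D_{\mathrm{HLL}} \;=\; \frac{T_{\mathrm{acq}} + T_{\mathrm{dev}}}
                            {T_{\mathrm{acq,ref}} + T_{\mathrm{dev,ref}}},
\qquad \text{ease of use} \;\propto\; \frac{1}{D_{\mathrm{HLL}}}
```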
50. HLL Programming Paradigms: Evaluation Metrics
- Efficiency of hardware generation
  - Combines each paradigm's ability to extract the maximum possible parallelism/performance at the lowest cost
  - Measured in terms of:
    - End-to-end throughput
    - Synthesized clock frequency
    - Resource usage (e.g., slice utilization)
  - Prior user/developer experience should be taken into consideration
    - The more experienced the user/developer, the higher the throughput and frequency, and the lower the resource usage, they can achieve with a given language
  - Compared against a reference conventional HDL approach, assuming the HDL approach is optimal
51. HLL Programming Paradigms: Evaluation Metrics
- Statistical considerations
  - The sample size of the experiment population should be as large as possible, to reach accurate results and conclusions with minimum variance
  - By sample size we mean the number of:
    - Users included in the experiment
    - Applications considered
    - Languages for each paradigm
    - Platforms used as testbeds
  - Our previous experiments involved:
    - Three independent users with different degrees of experience in the field
    - Four different applications
    - One language per paradigm
    - One supporting platform
  - Our sample size was:
    - Limited by the current status of the technology
    - Heavily dependent on the availability of all the languages on common platforms
52. Deliverables
- A conceptual study aiming for a set of orthogonal HLL features
- A formal methodology and a scoring system for understanding productivity as it relates to performance, ease of use, resource utilization, and power
- An experimental study involving a few key applications
- An instrumentation package to support the experimental studies
- Association of features and tools with application classes
- Study reports
- Research papers
53. G5: Library Portability and Acceleration Cores (LPAC): Computational Biology and Medical Imaging Case Studies
- Faculty: Dr. Tarek El-Ghazawi, Dr. Ivan Gonzalez
- Student Leader: Mohamed Abou-Ellail
54. Library Portability and Acceleration Cores (LPAC)
- Background and motivation
  - Building optimized libraries is critical for high-performance reconfigurable computing (HPRC)
  - Portability will allow for:
    - Increased productivity
    - Easier platform migration
- Objectives
  - Define the elements needed to increase the reusability and interoperability of FPGA functional cores across different vendor-specific platforms
  - Develop a framework for application development for HPRC
  - Show the applicability of the proposed methodology through computational biology and medical imaging case studies
55. Two Parts of Development: A Complete Framework for Portable Cores
- Software development
  - A unified framework to transfer data to and from the FPGA
- Hardware development
  - A virtualized memory space as viewed by the FPGA
  - A standardized communication interface between the FPGA and the virtualized memory
57. Library Developer's View: A Complete Framework for Portable Cores
- Portability is a multi-layered problem
- Computational layer
  - A set of hardware cores (e.g., DWT, DES, ...)
  - A generic reconfigurable processor (RP)
    - Related vendor specifics included
  - Source-level format
    - Compiled once on every platform
- Interface layer (a hypothetical host-side sketch follows this list)
  - Abstraction layer for the physical platform specifics
  - Interface to the physical resources, e.g., RP local memory, µP memory, and communication resources
  - Source or precompiled format
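Here is a hedged sketch of the kind of unified host-side interface layer the framework might expose; all names (`rp_alloc`, `rp_write`, `rp_read`) are hypothetical, and `malloc`/`memcpy` stand in for the vendor allocator and DMA calls a real implementation would wrap.

```c
/* rp_iface.c - sketch of a unified host-side interface layer. */
#include <stdlib.h>
#include <string.h>

typedef struct { void *base; size_t size; } rp_buf;  /* RP local memory */

/* Allocate space in the virtualized RP memory seen by the core. */
static rp_buf rp_alloc(size_t size) {
    rp_buf b = { malloc(size), size };  /* stand-in for vendor allocator */
    return b;
}

/* Copy host data into RP local memory (vendor DMA call in practice). */
static int rp_write(rp_buf dst, const void *src, size_t n) {
    if (!dst.base || n > dst.size) return -1;
    memcpy(dst.base, src, n);
    return 0;
}

/* Copy results back to the host. */
static int rp_read(void *dst, rp_buf src, size_t n) {
    if (!src.base || n > src.size) return -1;
    memcpy(dst, src.base, n);
    return 0;
}

int main(void) {
    float in[256] = {0}, out[256];
    rp_buf dev = rp_alloc(sizeof in);
    rp_write(dev, in, sizeof in);       /* host -> RP local memory */
    /* ... launch the portable core against 'dev' here ...         */
    rp_read(out, dev, sizeof out);      /* RP local memory -> host */
    free(dev.base);
    return 0;
}
```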
58. Library Developer's View: A Complete Framework for Portable Cores (diagram)
59. Library User's View: A Complete Framework for Portable Cores
- Portability is a single-layered problem
- Computational layer
  - A set of hardware cores targeting a generic reconfigurable processor (RP)
  - Unified virtual memory model
(Diagram labels: CPU, FPGA, Vendor-Specific Resources, Hardware User, Virtual Space.)
60. Leverage Previous Experience in Multi-Platform Libraries: A Complete Framework for Portable Cores
- Cryptography, image processing, sorting, bioinformatics, etc.
- GWU/GMU/USC Carte-C library development
61. Case Studies
62. Medical Imaging and Bioinformatics
- Medical imaging and bioinformatics applications
  - Most algorithms in this area are computationally intensive
  - They require real-time acquisition and processing capabilities
  - Traditionally these applications are parallelized across a cluster
    - The performance, efficiency, and costs of such systems have proved impractical for these classes of applications
  - HPRC has proved a good candidate for similarly compute-intensive applications
    - Cryptography, image processing, etc.
- HPRC's potential
  - Performance improvement for medical/bioinformatics applications
  - While maintaining the flexibility of conventional systems
63. Medical Imaging: Candidate Applications
- Image reconstruction
  - An important requirement for many medical imaging systems
    - Computed tomography (CT)
    - Positron emission tomography (PET)
    - Magnetic resonance imaging (MRI)
    - X-ray
    - Ultrasonography
  - Reconstruction's computational requirements are growing at a tremendous rate
    - New CT scanners record data for thousands of slices per scan
    - Hundreds of projections per slice are recorded
    - Reconstruction must be performed in as short a time as possible
- 3D image modeling
  - Acquiring digital samples of objects distributed in three-dimensional space
  - Processing all the dimensions congruently to construct a 3D image
64. Medical Imaging: Candidate Applications
- Image registration
  - Nature of the transformation
    - Rigid: concentrates mainly on rotation, scaling, and translation between the two images
    - Non-rigid (elastic): takes local deformations into consideration
  - Modality
    - Mono-modal: one type of image involved, e.g., CT to CT, MRI to MRI
    - Multi-modal: registering the same image acquired from different systems, e.g., CT to MRI, MRI to PET
  - Uses
    - Cancer screening, diagnosis, guided treatment, etc.
  - Issues
    - Large search space: exhaustive search, iterative refinement, ...
    - Different similarity measures, with different accuracy and feature emphasis
65. Related Experience in Image Processing
- Automatic image registration
  - SRC-6, 2 chips (4 engines)
  - 4x speedup over µP implementations (Intel Xeon P4, 2.8 GHz)
  - MAP-C implementation
  - Floating-point arithmetic (single precision)
- Wavelet-based hyperspectral dimension reduction
  - SRC-6, 1 chip (1 engine)
  - 32x speedup over µP implementations (Intel Xeon P4, 1.8 GHz)
- Automatic Cloud Cover Assessment (ACCA)
  - SRC-6, 1 chip (8 engines)
  - 16x speedup over the previous hardware implementation
  - 28x speedup over µP implementations (Intel Xeon P4, 2.8 GHz)
  - Accuracy for Landsat-7 images
    - Approximation error: 0.1028 (0.9102 over water)
    - Pass two
  - MAP-C implementation
  - Floating-point arithmetic (single precision)
66. Bioinformatics: Description and Goals
- A cornerstone of the field of molecular biology
- Topics
  - Sequence alignment (DNA or protein)
  - Gene finding: algorithmically identifying biologically functional regions of the genome
  - Computational evolutionary biology
    - Identifying the origin and descent of species, as well as their change and diversity over time
    - Building phylogenetic trees to illustrate the relationships among various species
- Most bioinformatics applications lend themselves to reconfigurable computers due to their compute-intensive nature
67. Topics and Algorithms
(Reference: Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms, A Bradford Book, The MIT Press, Cambridge, Massachusetts, 2004.)
68. Related Experience in Bioinformatics
- Pairwise sequence alignment
  - Implementation of the Smith-Waterman algorithm
  - Based on dynamic programming
  - Performs local sequence alignment, that is, identifies similar regions between two nucleotide or protein sequences (a minimal scoring kernel is sketched below)
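As a software reference for the dynamic-programming recurrence that maps well to FPGAs, here is a minimal Smith-Waterman scoring kernel. The scoring scheme (match +2, mismatch -1, gap -1) is a common textbook choice, not taken from the slides.

```c
/* sw.c - minimal Smith-Waterman local-alignment scoring kernel. */
#include <stdio.h>
#include <string.h>

#define MAXLEN 64

static int max4(int a, int b, int c, int d) {
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

static int smith_waterman(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    int H[MAXLEN + 1][MAXLEN + 1] = {{0}};
    int best = 0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int sub = (s[i - 1] == t[j - 1]) ? 2 : -1;
            /* Local alignment: cell scores are clamped at zero. */
            H[i][j] = max4(0,
                           H[i - 1][j - 1] + sub,   /* match/mismatch */
                           H[i - 1][j] - 1,         /* gap in t       */
                           H[i][j - 1] - 1);        /* gap in s       */
            if (H[i][j] > best) best = H[i][j];
        }
    }
    return best;
}

int main(void) {
    printf("score = %d\n", smith_waterman("ACACACTA", "AGCACACA"));
    return 0;
}
```

The anti-diagonal independence of the `H` matrix is what makes this recurrence attractive for systolic FPGA implementations.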
69. Deliverables
- A framework for the development of hardware-portable libraries
  - Interfaces, data management, ...
- A set of tools for the automation of interface generation
  - Source-level distribution
- Reference implementations
  - Computational biology library
  - Medical imaging library