Title: G1: Profiling Applications for SW/HW (PASH) Partitioning and Co-scheduling
1. G1: Profiling Applications for SW/HW (PASH) Partitioning and Co-scheduling
- Faculty: Dr. Tarek El-Ghazawi, Dr. Mohamed Taher
- Student Leader: Proshanta Saha
2. Motivation
- Exploit the synergy between the µP and the FPGA
- Lack of a formal co-design methodology for reconfigurable applications
- Current methods are ad hoc and often time-consuming
- Lack of tools for hardware/software co-design and analysis for reconfigurable computers
3. Objectives
- Propose algorithms for partitioning and co-scheduling for RC systems
- Create tools for HW/SW partitioning and co-scheduling of applications onto RC systems
  - Automatic
    - Quickly generate an accelerated solution
    - Assist compiler developers with algorithms for partitioning and co-scheduling
  - Semi-automatic
    - Explore what-if scenarios
    - Leave it to the end user to interactively decide on a good partition
4. Automatic Partition
(Flow diagram with components: Application Code, Execution Profiling, DAG Analysis, Objective Functions, Constraint Analysis, Co-Design, Execution.)
5. Semi-Automatic Partition
(Flow diagram with components: Application Code, Execution Profiling, DAG Analysis, Objective Functions, Constraint Analysis, Co-Design, Execution, What-if Scenarios, Visualization.)
6. Application Profiling
- Assumes HLL source code such as C
- Utilizes open-source profilers such as:
  - GNU Profiler (gprof): the standard tool; limited thread support
  - Qprof from HP Labs: thread and dynamic-library support
  - OProfile: system-level profiling
  - Other profilers
(A minimal gprof workflow is sketched below.)
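To make the profiling step concrete, here is a minimal C sketch of the gprof workflow; the function names and workload are invented for illustration, and the build/run commands appear in the header comment.

```c
/* hotspot.c - toy kernel for identifying hot functions with gprof.
 * Build with profiling instrumentation, run, then inspect the profile:
 *   gcc -pg -O2 hotspot.c -o hotspot
 *   ./hotspot            (writes gmon.out)
 *   gprof hotspot gmon.out
 */
#include <stdio.h>

/* Candidate for FPGA offload: dominates the flat profile. */
static double heavy_kernel(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            acc += (double)i * j / (i + j + 1);
    return acc;
}

/* Cheap bookkeeping that should stay on the microprocessor. */
static double light_setup(int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += i;
    return acc;
}

int main(void) {
    printf("setup  = %f\n", light_setup(100000));
    printf("kernel = %f\n", heavy_kernel(2000));
    return 0;
}
```

A flat profile dominated by `heavy_kernel` would flag it as the hot zone to consider for hardware partitioning.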
7. Data Profiling
- Analyze the accuracy requirements of the application
- Determine the dynamic range required
- Decide on a suitable implementation precision
  - Fixed point
  - Floating point (single/double precision)
- Methods utilized (see the sketch below)
  - Full-precision tracking
  - Quantization
  - Truncation
  - Wrapping
  - Overflow
  - Saturated arithmetic
  - Rounded arithmetic
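As an illustration of two of the listed overflow behaviors, here is a small sketch of wrapping versus saturated addition in an assumed Q16.16 fixed-point format; the format choice and names are ours, not the tool's.

```c
/* q16_16.c - wrapping vs. saturated fixed-point addition in a Q16.16
 * format (16 integer bits, 16 fractional bits). Illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef int32_t q16_16;                /* value = raw / 65536.0 */

#define Q_ONE ((q16_16)1 << 16)

/* Wrapping add: overflow silently wraps around (two's complement). */
static q16_16 q_add_wrap(q16_16 a, q16_16 b) {
    return (q16_16)((uint32_t)a + (uint32_t)b);
}

/* Saturated add: overflow clamps to the representable extremes. */
static q16_16 q_add_sat(q16_16 a, q16_16 b) {
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (q16_16)s;
}

int main(void) {
    q16_16 big = INT32_MAX - Q_ONE;    /* near the top of the range */
    printf("wrap: %f\n", q_add_wrap(big, 2 * Q_ONE) / 65536.0);
    printf("sat : %f\n", q_add_sat(big, 2 * Q_ONE) / 65536.0);
    return 0;
}
```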
9. Getting the Application DAG
- Source code is run through a series of steps during compilation to annotate the directed acyclic graph (DAG)
- Intermediate format (IF) analysis using open-source tools such as:
  - Stanford University Intermediate Format (SUIF) compiler system
  - Machine SUIF from Harvard
- Extract the dependency graph from the IF analysis
  - Extract dependency analysis from the SUIF output
  - Use the dependency analysis to generate the DAG (a minimal sketch follows)
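Here is a toy sketch of the final step: turning a dependency edge list into a DAG and a topological order. The edge list is invented, not real SUIF output.

```c
/* dag.c - toy dependency-DAG construction and topological ordering,
 * standing in for the DAG-generation step driven by SUIF output. */
#include <stdio.h>

#define N 5                            /* basic blocks / tasks */

static const int edge[][2] = {         /* (from, to) dependencies */
    {0, 1}, {0, 2}, {1, 3}, {2, 3}, {3, 4}
};
static const int nedges = sizeof edge / sizeof edge[0];

int main(void) {
    int indeg[N] = {0}, order[N], head = 0, tail = 0;

    for (int e = 0; e < nedges; e++)
        indeg[edge[e][1]]++;
    for (int v = 0; v < N; v++)        /* roots enter the queue first */
        if (indeg[v] == 0) order[tail++] = v;
    while (head < tail) {              /* Kahn's algorithm */
        int v = order[head++];
        for (int e = 0; e < nedges; e++)
            if (edge[e][0] == v && --indeg[edge[e][1]] == 0)
                order[tail++] = edge[e][1];
    }
    printf("topological order:");
    for (int i = 0; i < tail; i++) printf(" %d", order[i]);
    printf("\n");
    return 0;
}
```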
10. DAG Generation
11. System Characterization
- Use benchmarks to measure throughput and identify bottlenecks:
  - Access to system memory (SM)
  - Access to RP local memory (LM)
  - Data transfer bandwidth and overhead
  - System overhead
12. System Characterization (continued)
- Routing overhead
- Core services overhead
- Reconfiguration time
- Resource constraints: CLBs, LUTs, FFs, multipliers, ...
(A simple host-side bandwidth microbenchmark is sketched below.)
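The sketch below is a minimal microbenchmark in the spirit of the characterization step. It times only host `memcpy`; characterizing µP-to-FPGA transfers would substitute the vendor's transfer call for the `memcpy`.

```c
/* bw.c - minimal host memory-bandwidth microbenchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t bytes = 64u << 20;          /* 64 MiB per copy */
    int reps = 10;
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) return 1;
    memset(src, 1, bytes);             /* touch pages before timing */
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("memcpy bandwidth: %.2f MB/s\n",
           (double)bytes * reps / sec / 1e6);
    free(src); free(dst);
    return 0;
}
```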
13. Visualization with an OTF Reader
- TAU ParaProf
- Vampir-NG
- KOJAK
14. Application Co-Design Analysis
- Cluster based on the critical path
- Examine hot zones (90/10 rule)
- Partition based on basic blocks: loops, branches
- Discover parallelism in the application
- Monitor communication requirements
  - Discover I/O bottlenecks
  - Analyze all data access patterns
  - Overlap computation and communication
  - Reduce the number of transfers
15. Co-Scheduling
- Leverage scheduling algorithms from heterogeneous computing (HC), embedded computing (EC), and early work on reconfigurable hardware (RH); see the sketch after this list
- Static scheduling
  - HC algorithms
  - RH algorithms
  - EC algorithms
  - Proposed algorithm
- Dynamic scheduling
  - HC algorithms
  - Proposed algorithms
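As a flavor of static list scheduling, here is a toy co-scheduler that greedily assigns each task to whichever resource (µP or FPGA) finishes it earliest. The cost table is invented for illustration, and this is not the project's proposed algorithm.

```c
/* cosched.c - toy static co-scheduler over a serial task chain:
 * each task goes to the resource with the earliest finish time. */
#include <stdio.h>

#define T 4
static const double cost[T][2] = {     /* [task][0]=µP, [1]=FPGA */
    {4.0, 1.0}, {2.0, 3.0}, {6.0, 1.5}, {1.0, 2.0}
};

int main(void) {
    double ready[2] = {0.0, 0.0};      /* when each resource frees up */
    double prev_finish = 0.0;          /* serial dependency chain */

    for (int t = 0; t < T; t++) {
        double best = -1.0; int where = 0;
        for (int r = 0; r < 2; r++) {
            double start = ready[r] > prev_finish ? ready[r] : prev_finish;
            double fin = start + cost[t][r];
            if (best < 0 || fin < best) { best = fin; where = r; }
        }
        ready[where] = best;
        prev_finish = best;            /* task t+1 depends on task t */
        printf("task %d -> %s (finish %.1f)\n",
               t, where ? "FPGA" : "uP", best);
    }
    return 0;
}
```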
16. Integrated Development Environment (IDE)
- Based on an open-source visual editor
- Extended with widgets and functionality for:
  - Setting objective functions
  - Profiling source code
  - Visualizing the application
  - Automatic partitioning
  - Selecting what-if scenarios
- The IDE will interface with the various tools:
  - Co-scheduling
  - Co-design analysis
  - Profiling tools
  - Visualization tools
17. Profiling and HW/SW Co-Scheduling
(Block diagram with components: Application Code, Execution Profiling, Application Profiling, DAG Analysis, System Characterization (resources, bus, memory), User Interface, Co-Scheduling.)
18. Profiling and HW/SW Co-Scheduling: Details of the Proposed Tool
- Functionality the tool will offer:
  - Application analysis
    - Application profiling
    - Data profiling
  - RC system analysis
    - Resource analysis
    - System overhead, architecture constraints
  - Co-design guidance
  - Automatic co-scheduling
  - Performance-impact analysis (e.g., speedup)
- Deliverables
  - Algorithms for partitioning and co-scheduling
  - A tool for profiling, partitioning, and co-scheduling applications
  - Case studies
  - Research papers
19. G2: Node Simulation and Architecture Studies (NSAS)
- Faculty: Dr. Tarek El-Ghazawi, Dr. Ivan Gonzalez, Dr. Sergio Lopez-Buedo
- Student Leader: Miaoqing Huang
20. Problems at the Reconfigurable Node Level
- Reconfigurable hardware is very fast, but end-to-end performance is often problematic
- The speed of transfers between the microprocessor and the FPGAs has been identified as a limiting factor
- The local memory architecture has a great impact on overall performance
- Transfers, and coherence between microprocessor memory and local FPGA memory, are handled by programmers
21. Goal
- Understand the issues with current architectures
- Provide the infrastructure to explore new architectures:
  - How the RP and the microprocessor should be interconnected
  - Memory hierarchy (microprocessor and RP memory)
  - RP local memory architectures
22. Objectives
- Build a simulation framework
  - Support for the µP, the communication infrastructure, and the FPGA
- Develop a compact benchmarking suite
  - Use the existing HPCC benchmarks as a starting point
  - Select a reduced set of applications to evaluate the main bottlenecks: communication bandwidth, memory throughput, etc.
- Conduct architectural exploration studies
23. Approach
- Each component of the node needs its own simulation and modeling tool
  - Microprocessor, FPGA, memories, buses, etc.
- Build a simulation framework
  - Integrate different widely used simulators into a unified environment
    - Processor simulators
    - Simulation tools for reconfigurable logic devices
  - Integrate third-party simulation tools for existing and/or new architecture components
  - Create models of reconfigurable devices, communication interfaces, and other components such as memories or peripherals
  - Architectural modeling tools
- Simplify the design and simulation of complex hardware/software architectures
24. Deliverables
- Analysis: select tools for the co-simulation framework
  - Architectural modeling and simulation tools
  - Single- and multiprocessor system simulators
  - Tools for simulating and modeling reconfigurable hardware
- Development: framework
  - Leverage previous work from open-source projects
  - Improve the integration, modeling, and simulation of processors, reconfigurable logic, and communication/interconnection mechanisms
- Proof of concept
  - Study the bottlenecks in actual hardware/software systems
  - Research new mechanisms to connect reconfigurable logic devices and standard processors
25. Deliverables
- Analysis
  - Architectural modeling and simulation tools
    - MILAN framework: multiple levels of granularity. Reference: http://milan.usc.edu/
    - Liberty: component-based modeling tool. Reference: http://liberty.cs.princeton.edu/Software/LSE/
26. Deliverables
- Analysis
  - Single- and multiprocessor system simulators
    - M5. Reference: http://m5.eecs.umich.edu/wiki/index.php/Main_Page
    - SimpleScalar. Reference: http://www.simplescalar.com/docs/hack_guide_v2.pdf
    - ArchC (Institute of Computing at the University of Campinas). Reference: http://www.archc.org/
27. Deliverables
- Analysis
  - Tools for simulating and modeling reconfigurable hardware
    - Reconfigurable logic modeling tools: SystemC, Simulink, etc.
    - Simulators: ModelSim, Active-HDL, etc.
28. Deliverables
(Diagram: the HPCC benchmarks and the simulation framework.)
29. G3: Hardware Architectural Virtualization (HAV) and Run-Time Systems for Application Portability
- Faculty: Dr. Tarek El-Ghazawi, Dr. Mohamed Taher, Dr. Sergio Lopez-Buedo
- Student Leader: Esam El-Araby
30. The HPRC Programmer's Nightmare
- Lack of application portability
  - Reconfigurable resources are managed using vendor-specific APIs
  - Applications are completely dependent on the underlying machine
  - Porting from one machine to another is hard
  - Adapting an application after a HW upgrade is also non-trivial
- Manual partitioning between HW and SW
  - Explicitly done at design time, so changes imply redesign and recoding, not just recompilation
  - HW design skills are required
  - Difficult to optimize the synergy between HW and SW
- Multi-tasking/multi-user operation not supported
  - There is no centralized system in charge of distributing the resources
  - Multi-tasking/multi-user support must be explicitly included in the applications
31. How to Solve All These Drawbacks?
- Problems:
  - Programmers need to explicitly manage the physical resources
  - Lack of support for multi-user/multi-tasking applications
  - No standardized API to access the reconfigurable HW
  - Explicit HW/SW partitioning in source code
  - Developers are required to design complex abstractions to make their applications portable
- Proposed answers: virtualization of hardware resources, plus a run-time system
32. Proposed Solution: Virtualization of Resources
- Hardware Architectural Virtualization (HAV)
  - Applications request virtual resources
  - The virtual resources are the processing elements that execute tasks
    - A task is an operation that processes data and produces results, e.g., FFT, DCT
    - The programmer only cares about tasks, not about device details
  - A run-time system maps virtual resources to physical resources, subject to the underlying constraints:
    - The physical resources (µP, FPGA, etc.) available in the system
    - Co-scheduler decisions
  - A library contains the implementations of the tasks for the different physical processors
    - Portable library: each task will have at least one SW (µP) and one HW (FPGA) implementation per architecture
    - Many HW implementations are possible (speed/area/power tradeoffs); a toy dispatch table is sketched below
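To illustrate the portable-library idea, here is a hedged sketch of a task table pairing a SW fallback with a HW bitstream per task. Every name in it (the struct, `fft_sw`, the bitstream paths) is hypothetical.

```c
/* tasklib.c - toy portable-library dispatch table: each task name maps
 * to a SW fallback and a HW bitstream path. Names and paths invented. */
#include <stdio.h>
#include <string.h>

typedef void (*sw_impl)(const float *in, float *out, int n);

static void fft_sw(const float *in, float *out, int n) { (void)in; (void)out; (void)n; }
static void dct_sw(const float *in, float *out, int n) { (void)in; (void)out; (void)n; }

struct task_entry {
    const char *name;        /* device-independent task name       */
    sw_impl     sw;          /* µP implementation (always present) */
    const char *bitstream;   /* one of possibly many HW variants   */
};

static const struct task_entry library[] = {
    { "FFT", fft_sw, "cores/fft_v1.bit" },
    { "DCT", dct_sw, "cores/dct_v1.bit" },
};

static const struct task_entry *lookup(const char *name) {
    for (size_t i = 0; i < sizeof library / sizeof library[0]; i++)
        if (strcmp(library[i].name, name) == 0)
            return &library[i];
    return NULL;
}

int main(void) {
    const struct task_entry *t = lookup("FFT");
    if (t)
        printf("task %s -> %s (SW fallback available)\n",
               t->name, t->bitstream);
    return 0;
}
```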
33. Benefits from Hardware Virtualization
- The programmer does not have to worry about HW details
  - Ratio of FPGAs to µPs
  - Number and location of RP memory banks
- Separates the processing tasks from the processors on which they will execute
- Enhances portability and device utilization
  - If no reconfigurable resources are available, HW tasks can be either queued for later execution or executed in SW
- Explicit HW/SW partitioning is no longer required
  - Abstraction based on device-independent tasks
  - There is no need to define the HW mapping at design time, which makes development significantly easier
34. The Two Sides of HW Virtualization
- Virtualize the reconfigurable processing resources
- Virtualize the resources used by the hardware cores
(Diagram: the software user's resources are virtualized by the OS; the hardware user's view virtualizes the FPGA's vendor-specific resources.)
35. Proposed Solution: Run-Time System
- The run-time system maps virtual resources to physical resources and provides:
  - Centralized management of resources
  - An abstraction layer
- Can be implemented at either kernel or user level
  - User-level daemons are easier to implement, but communicate via sockets or IPC primitives, with noticeable overhead
  - As a kernel module, asynchronous notifications become inexpensive
- Leverage previous work on the LUCITE LSF project
- Enables multi-tasking/multi-user support
- Enables application portability
36. Extended LSF: Networks of Reconfigurable Computers (NORCs)
- A job management system for networks of reconfigurable computers
  - Reconfigurable resources are expensive and underutilized
  - Many of these resources are available over the network
  - Need for a SW system to remotely schedule and monitor reconfigurable tasks
- An extension of LSF supporting several popular FPGA accelerator boards was implemented
- Parallel DES breaker example
  - 500x speedup (relative to a Pentium 4)
  - 95% utilization was demonstrated
  - New boards could be easily incorporated
37. Proposed Run-Time System: GOAST
- A run-time system in charge of offloading tasks to the processing resources (µP, FPGA, etc.)
- A centralized mechanism to enable transparent resource sharing and load balancing in multi-user, multi-tasking environments
- An interface to an external scheduler that will provide the benefits of µP-FPGA synergy
- Includes an application portability layer
  - A static API/ABI allows application portability across different environments without recompilation
- Dedicated configuration manager
  - Where the actual virtual-to-physical resource mapping occurs
  - Configures the physical devices (µP, FPGA, etc.) with the code/bitstream that implements the task mapped to them
38. GOAST: An Initial Vision
- Main features of the GOAST architecture
  - The core provides a handle-based API to programs (a hypothetical usage sketch follows this list)
  - Tasks are obtained from a portable library and loaded onto the physical processing resources by the configuration manager
  - Using an external scheduler allows different scheduling algorithms to be implemented
- Implementation details
  - Effort will focus on delivering a portable user-space service with low overhead
  - How to implement inexpensive asynchronous notifications
- Management interface
  - Provided by the core so that both advanced users and system administrators can manage resource usage and availability
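Since the GOAST API is not yet specified, the sketch below only suggests what a handle-based client might look like; every identifier (`goast_open`, `goast_submit`, `goast_wait`, `goast_close`) is invented, and stub bodies are included so it compiles standalone.

```c
/* goast_demo.c - hypothetical handle-based GOAST client. */
#include <stdio.h>

typedef int goast_handle;               /* opaque task handle */

/* Invented API surface, stubbed so the sketch runs. */
static goast_handle goast_open(const char *task) { (void)task; return 1; }
static int  goast_submit(goast_handle h, const void *in, void *out,
                         unsigned n) { (void)h; (void)in; (void)out; (void)n; return 0; }
static int  goast_wait(goast_handle h)  { (void)h; return 0; }
static void goast_close(goast_handle h) { (void)h; }

int main(void) {
    static float in[1024], out[1024];

    /* Request a virtual "FFT" resource; the run-time system decides
     * whether the task lands on a µP or an FPGA. */
    goast_handle h = goast_open("FFT");
    if (h < 0) return 1;
    if (goast_submit(h, in, out, 1024) == 0)   /* asynchronous submit */
        goast_wait(h);                         /* ... so wait for it  */
    goast_close(h);
    printf("task completed (stubbed)\n");
    return 0;
}
```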
39. Deliverables
- Conceptual study of hardware virtualization techniques
  - Alternatives, challenges, tradeoffs, etc.
- GOAST specification and API
  - Comprehensive analysis of challenges and scalability
- GOAST reference implementation (proof of concept)
  - User-level implementation
  - Standardized API to manage SW and HW virtual resources
  - Kernel-level implementation and support for device-independent tasks are future work
(Layer diagram: higher-level abstractions; device-independent task-based API; standardized API for SW and HW virtual resources; run-time system; virtualization layer; physical resources (µP, FPGA).)
40. Future Work / Synergies
- Further extend the GOAST API
  - Device-independent tasks
  - Seamless support for legacy code via a preprocessor
- Integrate the run-time system into the OS kernel
  - Reduce the cost of asynchronous notifications
- Use partial run-time reconfiguration to divide a physical FPGA into virtual devices
- Further improve virtualization using virtual memory management techniques
- GWU projects
  - G1 provides the co-scheduler
  - G5 provides the framework for developing the portable library of tasks, and an example (biomedical)
  - G2 will provide new architectures to explore the portability problem more comprehensively
- UF projects
  - F3 will provide further case studies to test the proposed run-time system
  - F4 will provide the necessary RTR background for the future work
41. G4: High-Level Languages Productivity (HLLP): an HPC Perspective
- Faculty: Dr. Tarek El-Ghazawi, Dr. Mohamed Taher
- Student Leader: Kun Xi
42. Background
- Many options exist for developing RC/FPGA applications
  - How are they really different?
  - Which one is optimal for a given project and a given set of developers?
- HDLs give the best results, but
  - Steep learning curve
  - Long development cycle
- HLLs are C-like and easy, but
  - Their limitations relative to C are unknown
  - They are very different from C, and from each other
- There are also graphical tools
  - They seem easy to use, but are there penalties or hidden costs?
- All differences and design issues are obscured by marketing literature
43. Goals and Objectives
- Understand the underlying differences among the available tools
- Guide the programmer in choosing the right language
  - Make an intelligent selection among existing HLL tools based on their features, the applications, and the programmers' strengths
- Impact future HLL development for improved productivity and portability of applications
- Develop a formal methodology to
  - Understand the underlying differences among the available tools
  - Guide the programmer in choosing the right language to solve a given problem
44. Example Languages to Consider
- Classic HDLs
  - VHDL
  - Verilog
- Text-based HLLs
  - Impulse-C
  - Handel-C
  - Mitrion-C
  - ...
- Graphical tools
  - DSPLogic
(Slide annotations group these as imperative languages, a functional language, and a graphical/dataflow language.)
45. Typical Hardware Development Flow
46. Leverage Previous Work/Experience
- Leverage the evaluation methodology of the NORCs (Extended LSF) project
- RC HLL preliminary study with ARSC
- Leverage work with IBM and PSC on productivity under DARPA HPCS
48. Methodology for Language Evaluation
- Conceptual study aiming for a set of orthogonal HLL features
- Scoring system/metrics for understanding productivity
- Experimental study involving key applications
- Instrumentation package to support the experimental studies
  - Similar to SUMS
  - From the joint study by PSC, IBM, GWU, and DARPA on the productivity of programming languages, particularly X10, UPC, and MPI
49. HLL Programming Paradigms: Evaluation Metrics
- Ease of use
  - Acquisition time, i.e., learning and gaining experience
    - Depends on the type of paradigm being adopted
  - Development time
    - Depends on both the paradigm and the application being developed
  - Together, the acquisition time and the development time directly express the opposite of ease of use, i.e., difficulty of use (see the formula below)
    - Captures the effects of the programming model's explicitness
    - The more explicit the programming model, i.e., the more architectural details the user/developer must handle, the longer it takes to both acquire the language and develop applications
  - Prior user/developer experience can reduce both the acquisition and the development times
  - Normalized to the corresponding metric for a reference HDL
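One possible formalization of this normalization, with all symbols introduced here rather than taken from the slides ($T_{\mathrm{acq}}$ and $T_{\mathrm{dev}}$ are the measured acquisition and development times; "ref" denotes the reference HDL):

```latex
D_{\mathrm{HLL}} \;=\; \frac{T_{\mathrm{acq}} + T_{\mathrm{dev}}}
                            {T_{\mathrm{acq,ref}} + T_{\mathrm{dev,ref}}},
\qquad \text{ease of use} \;\propto\; \frac{1}{D_{\mathrm{HLL}}}
```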
50. HLL Programming Paradigms: Evaluation Metrics
- Efficiency of hardware generation
  - Combines each paradigm's ability to extract the maximum possible parallelism/performance at the lowest cost
  - Measured in terms of:
    - End-to-end throughput
    - Synthesized clock frequency
    - Resource usage (e.g., slice utilization)
  - Prior user/developer experience should be taken into consideration
    - The more experienced the user/developer, the higher the throughput and frequency, and the lower the resource usage, they can achieve with a given language
  - Compared against a reference conventional HDL approach, assuming the HDL approach is optimal
51. HLL Programming Paradigms: Evaluation Metrics
- Statistical considerations
  - The sample size of the experiment population should be as large as possible, to reach accurate results and conclusions with minimum variance
  - By sample size we mean the number of:
    - Users included in the experiment
    - Applications considered
    - Languages for each paradigm
    - Platforms used as testbeds
  - Our previous experiments involved:
    - Three independent users with different degrees of experience in the field
    - Four different applications
    - One language per paradigm
    - One supporting platform
  - Our sample size was:
    - Limited by the current status of the technology
    - Heavily dependent on the availability of all the languages on common platforms
52. Deliverables
- A conceptual study aiming for a set of orthogonal HLL features
- A formal methodology and a scoring system for understanding productivity as it relates to performance, ease of use, resource utilization, and power
- An experimental study involving a few key applications
- An instrumentation package to support the experimental studies
- Association of features and tools with application classes
- Study reports
- Research papers
53. G5: Library Portability and Acceleration Cores (LPAC): Computational Biology and Medical Imaging Case Studies
- Faculty: Dr. Tarek El-Ghazawi, Dr. Ivan Gonzalez
- Student Leader: Mohamed Abou-Ellail
54. Library Portability and Acceleration Cores (LPAC)
- Background and motivation
  - Building optimized libraries is critical for high-performance reconfigurable computing (HPRC)
  - Portability will allow for:
    - Increased productivity
    - Easier platform migration
- Objectives
  - Define the elements needed to increase the reusability and interoperability of FPGA functional cores across different vendor-specific platforms
  - Develop a framework for application development for HPRC
  - Show the applicability of the proposed methodology through computational biology and medical imaging case studies
55. Two Parts of Development: A Complete Framework for Portable Cores
- Software development
  - A unified framework to transfer data to and from the FPGA
- Hardware development
  - A virtualized memory space as viewed by the FPGA
  - A standardized communication interface between the FPGA and the virtualized memory
57. Library Developer's View: A Complete Framework for Portable Cores
- Portability is a multi-layered problem
- Computational layer
  - A set of hardware cores (e.g., DWT, DES, ...)
  - A generic reconfigurable processor (RP)
    - Related vendor specifics included
  - Source-level format
    - Compiled once on every platform
- Interface layer (a hypothetical host-side sketch follows this list)
  - Abstraction layer for the physical platform specifics
  - Interface to the physical resources, e.g., RP local memory, µP memory, and communication resources
  - Source or precompiled format
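Here is a hedged sketch of the kind of unified host-side interface layer the framework might expose; all names (`rp_alloc`, `rp_write`, `rp_read`) are hypothetical, and `malloc`/`memcpy` stand in for the vendor allocator and DMA calls a real implementation would wrap.

```c
/* rp_iface.c - sketch of a unified host-side interface layer. */
#include <stdlib.h>
#include <string.h>

typedef struct { void *base; size_t size; } rp_buf;  /* RP local memory */

/* Allocate space in the virtualized RP memory seen by the core. */
static rp_buf rp_alloc(size_t size) {
    rp_buf b = { malloc(size), size };  /* stand-in for vendor allocator */
    return b;
}

/* Copy host data into RP local memory (vendor DMA call in practice). */
static int rp_write(rp_buf dst, const void *src, size_t n) {
    if (!dst.base || n > dst.size) return -1;
    memcpy(dst.base, src, n);
    return 0;
}

/* Copy results back to the host. */
static int rp_read(void *dst, rp_buf src, size_t n) {
    if (!src.base || n > src.size) return -1;
    memcpy(dst, src.base, n);
    return 0;
}

int main(void) {
    float in[256] = {0}, out[256];
    rp_buf dev = rp_alloc(sizeof in);
    rp_write(dev, in, sizeof in);       /* host -> RP local memory */
    /* ... launch the portable core against 'dev' here ...         */
    rp_read(out, dev, sizeof out);      /* RP local memory -> host */
    free(dev.base);
    return 0;
}
```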
58. Library Developer's View: A Complete Framework for Portable Cores (diagram)
59. Library User's View: A Complete Framework for Portable Cores
- Portability is a single-layered problem
- Computational layer
  - A set of hardware cores targeting a generic reconfigurable processor (RP)
  - Unified virtual memory model
(Diagram labels: CPU, FPGA, Vendor-Specific Resources, Hardware User, Virtual Space.)
60. Leverage Previous Experience in Multi-Platform Libraries: A Complete Framework for Portable Cores
- Cryptography, image processing, sorting, bioinformatics, etc.
- GWU/GMU/USC Carte-C library development
61. Case Studies
62. Medical Imaging and Bioinformatics
- Medical imaging and bioinformatics applications
  - Most algorithms in this area are computationally intensive
  - They require real-time acquisition and processing capabilities
  - Traditionally these applications are parallelized across a cluster
    - The performance, efficiency, and costs of such systems have proved impractical for these classes of applications
  - HPRC has proved a good candidate for similarly compute-intensive applications
    - Cryptography, image processing, etc.
- HPRC's potential
  - Performance improvement for medical/bioinformatics applications
  - While maintaining the flexibility of conventional systems
63. Medical Imaging: Candidate Applications
- Image reconstruction
  - An important requirement for many medical imaging systems
    - Computed tomography (CT)
    - Positron emission tomography (PET)
    - Magnetic resonance imaging (MRI)
    - X-ray
    - Ultrasonography
  - Reconstruction's computational requirements are growing at a tremendous rate
    - New CT scanners record data for thousands of slices per scan
    - Hundreds of projections per slice are recorded
    - Reconstruction must be performed in as short a time as possible
- 3D image modeling
  - Acquiring digital samples of objects distributed in three-dimensional space
  - Processing all the dimensions congruently to construct a 3D image
64. Medical Imaging: Candidate Applications
- Image registration
  - Nature of the transformation
    - Rigid: concentrates mainly on rotation, scaling, and translation between the two images
    - Non-rigid (elastic): takes local deformations into consideration
  - Modality
    - Mono-modal: one type of image involved, e.g., CT to CT, MRI to MRI
    - Multi-modal: registering the same image acquired from different systems, e.g., CT to MRI, MRI to PET
  - Uses
    - Cancer screening, diagnosis, guided treatment, etc.
  - Issues
    - Large search space: exhaustive search, iterative refinement, ...
    - Different similarity measures, with different accuracy and feature emphasis
65. Related Experience in Image Processing
- Automatic image registration
  - SRC-6, 2 chips (4 engines)
  - 4x speedup over µP implementations (Intel Xeon P4, 2.8 GHz)
  - MAP-C implementation
  - Floating-point arithmetic (single precision)
- Wavelet-based hyperspectral dimension reduction
  - SRC-6, 1 chip (1 engine)
  - 32x speedup over µP implementations (Intel Xeon P4, 1.8 GHz)
- Automatic Cloud Cover Assessment (ACCA)
  - SRC-6, 1 chip (8 engines)
  - 16x speedup over the previous hardware implementation
  - 28x speedup over µP implementations (Intel Xeon P4, 2.8 GHz)
  - Accuracy for Landsat-7 images
    - Approximation error: 0.1028 (0.9102 over water)
    - Pass two
  - MAP-C implementation
  - Floating-point arithmetic (single precision)
66. Bioinformatics: Description and Goals
- A cornerstone of the field of molecular biology
- Topics
  - Sequence alignment (DNA or protein)
  - Gene finding: algorithmically identifying biologically functional regions of the genome
  - Computational evolutionary biology
    - Identifying the origin and descent of species, as well as their change and diversity over time
    - Building phylogenetic trees to illustrate the relationships among various species
- Most bioinformatics applications lend themselves to reconfigurable computers due to their compute-intensive nature
67. Topics and Algorithms
(Reference: Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms, A Bradford Book, The MIT Press, Cambridge, Massachusetts, 2004.)
68. Related Experience in Bioinformatics
- Pairwise sequence alignment
  - Implementation of the Smith-Waterman algorithm
  - Based on dynamic programming
  - Performs local sequence alignment, that is, identifies similar regions between two nucleotide or protein sequences (a minimal scoring kernel is sketched below)
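As a software reference for the dynamic-programming recurrence that maps well to FPGAs, here is a minimal Smith-Waterman scoring kernel. The scoring scheme (match +2, mismatch -1, gap -1) is a common textbook choice, not taken from the slides.

```c
/* sw.c - minimal Smith-Waterman local-alignment scoring kernel. */
#include <stdio.h>
#include <string.h>

#define MAXLEN 64

static int max4(int a, int b, int c, int d) {
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

static int smith_waterman(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    int H[MAXLEN + 1][MAXLEN + 1] = {{0}};
    int best = 0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int sub = (s[i - 1] == t[j - 1]) ? 2 : -1;
            /* Local alignment: cell scores are clamped at zero. */
            H[i][j] = max4(0,
                           H[i - 1][j - 1] + sub,   /* match/mismatch */
                           H[i - 1][j] - 1,         /* gap in t       */
                           H[i][j - 1] - 1);        /* gap in s       */
            if (H[i][j] > best) best = H[i][j];
        }
    }
    return best;
}

int main(void) {
    printf("score = %d\n", smith_waterman("ACACACTA", "AGCACACA"));
    return 0;
}
```

The anti-diagonal independence of the `H` matrix is what makes this recurrence attractive for systolic FPGA implementations.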
69. Deliverables
- A framework for the development of hardware-portable libraries
  - Interfaces, data management, ...
- A set of tools for the automation of interface generation
  - Source-level distribution
- Reference implementations
  - Computational biology library
  - Medical imaging library