AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS


1
AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO
ADAPTIVE COMPUTING SYSTEMS
  • MAPLD-99
  • Laurel, MD
  • September 29, 1999
  • Senthil Natarajan, Ben Levine, Chandra Tan, Danny
    Newport and Don Bouldin
  • Electrical and Computer Engineering
  • University of Tennessee
  • Knoxville, TN 37996-2100
  • TEL (423)-974-5444
  • FAX (423)-974-8245
  • dbouldin_at_utk.edu

2
INITIAL DESIGN CAPTURE AND ALGORITHM
VERIFICATION USING KHOROS
3
KHOROS/CANTATA IS A VISUAL PROGRAMMING LANGUAGE
FOR PROTOTYPING ALGORITHMS
4
ADAPTIVE COMPUTING SYSTEMS CONSIST OF ACCELERATOR
BOARDS OF FPGAS
5
CURRENT STATE-OF-THE-ART
KHOROS
MISSING LINK
PARTITIONING ONTO MULTIPLE FPGAS
MISSING LINK
6
CHAMPION WILL AUTOMATICALLY MAP KHOROS DESIGNS
ONTO ADAPTIVE COMPUTING SYSTEMS
7
CHAMPION WILL IMPROVE PRODUCTIVITY
  • GOAL: Automate the mapping of
    Khoros-based applications onto adaptive
    computing systems to improve designer
    productivity by 100x.
  • IMPACT:
  • More application designers will be able to
    achieve higher quality implementations in less
    time.
  • Adaptive computing systems will be utilized
    more effectively and by a wider audience.

[Figure: "Manual Mapping Onto An Adaptive Computing System": KHOROS to ACS, plotted against TIME (WEEKS).]
[Figure: "Champion Will Improve Productivity By Using Estimation and Automatic Mapping of Precompiled Library Primitives": KHOROS, ESTIMATION, ACS, plotted against TIME (WEEKS).]
8
OUTLINE OF THIS PRESENTATION
  • Application Development Flow
  • Library Development and Verification
  • Manual Implementation
  • ATR Executions on ACS
  • Automated Partitioning Algorithms
  • Lessons Learned and Future Plans

9
APPLICATION DEVELOPMENT FLOW
[Flow diagram labels: APPLICATION; KHOROS/CANTATA; DATA WIDTH MATCHING, SYNCHRONIZATION; PARTITIONING; Precompiled Libraries; SYNTHESIS, PLACEMENT/ROUTING; Destination Hardware Architecture; ADAPTIVE COMPUTING SYSTEM]
10
KHOROS/CANTATA IMPLEMENTATION: TOP LEVEL
11
KHOROS/CANTATA IMPLEMENTATION --> FIND TARGETS
12
KHOROS/CANTATA IMPLEMENTATION --> MARK FRAME
PIXELS
13
ALGORITHM STRUCTURE
Find Targets and Label Image: The target pixel
map is used to identify square regions that
are considered to contain targets. Each target
region is then masked off (it is assumed that
there is only one target per region). The target
region location is then used to draw a frame that
identifies the target in the output image.
This is repeated six times.
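
The loop described above can be sketched in software as follows. This is only an illustration under stated assumptions, not the Khoros or Champion implementation: the function names, the region size, the raster-order search for the first target pixel, and the choice to anchor the square region at that pixel are all hypothetical.

```python
def find_targets(target_map, region=3, max_targets=6):
    """Return the top-left corner (row, col) of each framed region.

    target_map: list of lists of 0/1 flags (1 = target pixel); not modified.
    region: side length of the square masked per target (assumed value).
    max_targets: the slide repeats the step six times.
    """
    work = [row[:] for row in target_map]  # work on a copy of the map
    rows, cols = len(work), len(work[0])
    frames = []
    for _ in range(max_targets):
        # Find the first remaining target pixel in raster order.
        first = next(((r, c) for r in range(rows)
                      for c in range(cols) if work[r][c]), None)
        if first is None:
            break  # no targets left
        r0, c0 = first
        frames.append((r0, c0))
        # Mask off the square region: one target per region is assumed.
        for r in range(r0, min(r0 + region, rows)):
            for c in range(c0, min(c0 + region, cols)):
                work[r][c] = 0
    return frames


def draw_frame(image, r0, c0, region=3, value=255):
    """Set the border pixels of the square region to `value` (in place)."""
    rows, cols = len(image), len(image[0])
    r1 = min(r0 + region, rows) - 1
    c1 = min(c0 + region, cols) - 1
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            if r in (r0, r1) or c in (c0, c1):
                image[r][c] = value
    return image
```

Calling find_targets on a target pixel map and then draw_frame once per returned corner mirrors the "find, mask, frame" sequence; a hardware version would stream pixels rather than index a stored array.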
14
DEVELOP AND PRECOMPILE LIBRARY CELLS
Test Inputs
Responses
KHOROS--C Floating Point
KHOROS--C Fixed Point
VHDL
Each Library Primitive Will Be Developed at Each
Level, Verified, and Characterized.
FPGA
15
KHOROS AND CHAMPION LIBRARY CELLS
  • Champion Cells:
  • Each hardware cell has only one specific
    function and one data type.
  • Hardware cells are parametrized to correspond to
    the desired data bit widths.
  • Data is transferred between hardware cells
    sequentially, one pixel per clock cycle.
  • Data arrival at hardware cells must be
    synchronized; Champion does this by inserting
    delay elements.
  • Khoros Traditional Cells:
  • Some Khoros cells have multiple functions for
    the user to select.
  • A single cell can handle all input dimension
    sizes.
  • Cells can handle inputs of any data type.
  • Data passed between cells is stored on the host
    CPU system as temp files.
  • Khoros handles data movement between cells. Each
    cell begins its execution only after all its
    inputs have been written onto the host file
    system.


16
DATA BIT-WIDTHS MUST BE MATCHED
[Figure: task graph from IN to OUT. Cells shown: CONVSTREAM_8_256_256, RIGHT SHIFT 3, ADD, ADD_8 (several), ADD_9 (two), ADD_10, CLIP HIGH 255.]
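
A minimal sketch of what bit-width matching could look like for an adder, assuming the naming pattern suggested by the cell names in this presentation (ADD_8, ADD_9, pad_8_9). The helper names, the rule "pick the adder for the widest input", and the one-extra-bit output width are assumptions made for illustration, not Champion's documented behavior.

```python
def match_inputs(input_widths, cell_width):
    """Return one adapter cell name (or None) per input.

    Narrower inputs get a pad cell named pad_<from>_<to>, following the
    pattern of cell names such as pad_8_9. Wider inputs are rejected here
    because the caller is expected to size the cell for the widest input.
    """
    adapters = []
    for width in input_widths:
        if width == cell_width:
            adapters.append(None)  # already matched
        elif width < cell_width:
            adapters.append(f"pad_{width}_{cell_width}")
        else:
            raise ValueError(f"input of {width} bits exceeds cell width {cell_width}")
    return adapters


def adder_for(input_widths):
    """Pick an adder cell for the widest input and list the pads it needs.

    Returns (cell_name, adapters, output_width). The output width assumes
    one extra bit for the carry.
    """
    width = max(input_widths)
    return f"add_{width}", match_inputs(input_widths, width), width + 1
```

For an 8-bit and a 9-bit input, adder_for([8, 9]) selects add_9, pads the 8-bit input with pad_8_9, and reports a 10-bit result.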
17
DATA MUST BE SYNCHRONIZED DUE TO DIFFERENT PATH
DELAYS
Data synchronization error! Input times are not
equal.
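[Figure: task graph from IN (T = 0) to OUT, annotated with L and T values (apparently cell latency and data arrival time). Cells shown: PAD_HIGH_8_11 (L = 0), CONVSTREAM_8_256_256 (L = 257), RIGHT_SHIFT_12_ 3, ADD_11, ADD_8 (L = 1, several), ADD_9 (L = 1, two), ADD_10 (L = 1), CLIP_HIGH_12_ 255, TRUNCATE_HIGH_12_8. T values shown: 257, 258, 259, 260. Numeric edge labels shown: 8, 11, 12.]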
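
The synchronization problem can be sketched as a longest-path computation over the task graph: each cell starts when its latest input arrives, and every earlier input needs a delay element to make up the difference. This is a generic illustration, assuming cells are supplied in topological order. The function name and data layout are hypothetical, and the sketch is not taken from Champion itself.

```python
def synchronize(cells, inputs_of, latency):
    """Compute output times and the delays needed to align cell inputs.

    cells: cell names in topological order (sources first).
    inputs_of: dict mapping a cell to the list of cells feeding it.
    latency: dict mapping a cell to its latency in clock cycles.
    Returns (ready, delays): ready[cell] is the cycle at which the cell's
    output is valid; delays[(src, dst)] is the number of delay cycles to
    insert on that edge (only edges that need a delay are listed).
    """
    ready = {}
    delays = {}
    for cell in cells:
        sources = inputs_of.get(cell, [])
        # A cell can start only when its slowest input has arrived.
        start = max((ready[s] for s in sources), default=0)
        for s in sources:
            if ready[s] < start:
                delays[(s, cell)] = start - ready[s]
        ready[cell] = start + latency[cell]
    return ready, delays


# Example using the latencies 257 and 0 shown on the slide. The wiring and
# the latency of 1 for ADD_11 are assumptions for illustration.
cells = ["IN", "CONVSTREAM_8_256_256", "PAD_HIGH_8_11", "ADD_11"]
inputs_of = {
    "CONVSTREAM_8_256_256": ["IN"],
    "PAD_HIGH_8_11": ["IN"],
    "ADD_11": ["CONVSTREAM_8_256_256", "PAD_HIGH_8_11"],
}
latency = {"IN": 0, "CONVSTREAM_8_256_256": 257, "PAD_HIGH_8_11": 0, "ADD_11": 1}
ready, delays = synchronize(cells, inputs_of, latency)
```

In this example the padded path reaches the adder 257 cycles before the convolution path, so a 257-cycle delay is needed on the edge from PAD_HIGH_8_11 to ADD_11.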
18
ORIGINAL KHOROS TASK GRAPH
[Figure: original Khoros task graph. Nodes shown: RAM_Read_pf4_var_8, Sobel_8_8_256_256, Lowpass_8_8_256_256, START_Mean_SD (two), shift_left_9_1, shift_left_10_2, add_9, add_10, and_1, Lowpass_1_4_256_256 (two), gte_4_4 (two), gte_8 (two), MITR. Each node carries S and D annotations (for example S404 and D262 on Sobel_8_8_256_256) and a type letter (R, M or A); edges are labeled with bit-widths from 1 to 11.]
19
HARDWARE TASK GRAPH WITH DATA BIT-WIDTH MATCHED
AND SYNCHRONIZED
[Figure: hardware task graph after bit-width matching and synchronization. In addition to the nodes of the original graph (RAM_Read_pf4_var_8, Sobel_8_8_256_256, Lowpass_8_8_256_256, START_Mean_SD, shift_left_9_1, shift_left_10_2, add_9, add_10, and_1, Lowpass_1_4_256_256, gte_4_4, gte_8, MITR), the graph now contains RAM_buffer_pf4_8 (two, annotated S56 and D16S), pad_8_9 (two), pad_8_10 (two), clip_high_10_8, clip_high_11_8, trunc_high_10_8 and trunc_high_11_8. Nodes carry S and D annotations; edges are labeled with bit-widths from 1 to 11.]
20
OUR WILDFORCE ACS USED AS A LINEAR ARRAY
[Figure: WildForce board block diagram. Labels shown: PCI Interface, Local Bus, 32, 36-bit Data Path, Crossbar, PE0, PE1, PE2, PE3, PE4.]
21
PARTITION EARLY INSTEAD OF LATE TO SHORTEN THE
HARDWARE MAPPING TIME
EARLY
[Diagram labels: Design Input in Khoros; Workspace to Netlist; Precompiled Library Cells; K-way partitioning, Global Place & Route; partitions P1, P2, P3, Pk, each with Merge and Place & Route; SUCCESS]
  • Coarser granularity -> smaller netlist.
  • Hierarchical and functional flow information is
    preserved.
  • Timing synchronization is greatly facilitated.
  • Less resource utilization.

LATE
[Diagram labels: VHDL; Optimizer; Flatten; K-way partitioning, Global Place & Route; partitions P1, P2, P3, Pk, each with Place & Route; Hardware Configuration; SUCCESS]
  • Finer granularity -> larger netlist.
  • Functional and algorithmic flow of the design
    is lost.
  • Timing synchronization can be a problem.
  • More resource utilization.
  • The resulting subcircuits are more likely to be
    placeable and routable.
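
To make the idea of partitioning at the coarse, precompiled-cell level concrete, here is a greedy sketch that fills the FPGAs of a linear array one after another. It is not the partitioning algorithm used by Champion, which this transcript does not describe. The function name, the cell sizes and the capacity are hypothetical, and the cells are assumed to be listed in topological (data-flow) order.

```python
def partition_linear(cells, size, capacity, k):
    """Greedily split an ordered cell list into at most k contiguous parts.

    cells: cell names in data-flow order, so part i feeds part i + 1.
    size: dict mapping a cell to its resource estimate (same units as capacity).
    capacity: resources available on one FPGA (assumed equal for all FPGAs).
    k: number of FPGAs in the linear array.
    """
    parts = [[]]
    used = 0
    for cell in cells:
        if size[cell] > capacity:
            raise ValueError(f"{cell} does not fit on a single FPGA")
        if used + size[cell] > capacity:
            parts.append([])  # current FPGA is full, move to the next one
            used = 0
        parts[-1].append(cell)
        used += size[cell]
    if len(parts) > k:
        raise ValueError("design needs more than k FPGAs (or a reconfiguration)")
    return parts


# Hypothetical sizes: four cells, 600 units per FPGA, three FPGAs available.
sizes = {"a": 300, "b": 300, "c": 200, "d": 400}
parts = partition_linear(["a", "b", "c", "d"], sizes, capacity=600, k=3)
```

Because each precompiled cell is placed as a unit, the list being partitioned is short, which is the "coarser granularity, smaller netlist" advantage listed under EARLY.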
22
MULTI-FPGA PARTITIONING
23
TIMING RESULTS FOR ATR ON OUR WILDFORCE
  • OUR WILDFORCE ACS IS 156X FASTER THAN
    KHOROS/CPU NOW.
  • IF WE HAVE SUFFICIENT LOGIC AND MEMORY SUCH
    THAT NO RECONFIGURATIONS ARE NEEDED, THE ACS
    COULD BE 667X FASTER.
  • IF FULLY PIPELINED, THE ACS COULD BE 32,000X
    FASTER.

[Bar chart: execution time by component; horizontal axis from 0 to 6000, units not given in the transcript]
Data Processing: 33
Data Transfer: 34
Host Code: 1544
Reconfiguration: 5159
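
A back-of-the-envelope check of these figures, assuming the four components are in the same time units and add up to the total ACS run time. The model below (the speedup scales inversely with the remaining time) is an assumption; the slide does not say how its 667X and 32,000X figures were derived, so a small difference is to be expected.

```python
# Time per component as shown on the chart (units not stated).
times = {
    "data_processing": 33,
    "data_transfer": 34,
    "host_code": 1544,
    "reconfiguration": 5159,
}

total = sum(times.values())
reconfig_share = times["reconfiguration"] / total  # fraction spent reconfiguring


def implied_speedup(current_speedup, current_time, remaining_time):
    """Speedup over the CPU if the ACS run time dropped to remaining_time."""
    return current_speedup * current_time / remaining_time


# Reported speedup today: 156x over Khoros on the CPU.
no_reconfig = implied_speedup(156, total, total - times["reconfiguration"])
# Assumption: "fully pipelined" leaves only the data processing time.
pipelined = implied_speedup(156, total, times["data_processing"])
```

Under these assumptions reconfiguration accounts for roughly three quarters of the run time, removing it gives a speedup in the mid 600s (close to, though not exactly, the 667X on the slide), and keeping only the data processing time gives roughly 32,000X.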
24
PARTITIONING - 1st BOARD CONFIGURATION PHASE
[Figure: first board configuration, mapped onto CPE0, PE1, PE2, PE3 and PE4. Blocks shown: Input Image, Sobel Filter, Low-Pass Filter, Compute Edge Stats, Check Edge Stats, Compute Intensity Stats, Check Intensity Stats, AND, Low-Pass Filter Check > 4 (two), Mask Invalid Target Region, Blank Frame Map, Find First Target Pixel, Mask Target Pixels, Mark Frame Pixels, Write to RAM - A, RAM (three). Numeric annotations shown: 4, 11, 72, 500, 548, 554, 1296.]
25
PARTITIONING - 2nd BOARD CONFIGURATION PHASE
[Figure: second board configuration, mapped onto CPE0, PE1, PE2, PE3 and PE4. Blocks shown: Read from RAM - A, Find First Target Pixel (three), Mask Target Pixels (three), Mark Frame Pixels (three), Write to RAM - B, RAM (three). Numeric annotations shown: 4, 5, 53, 72, 500.]
26
PARTITIONING - 3rd BOARD CONFIGURATION PHASE
[Figure: third board configuration, mapped onto CPE0, PE1, PE2, PE3 and PE4. Blocks shown: Read from RAM - B, Find First Target Pixel (two), Mask Target Pixels (two), Mark Frame Pixels (two), Write to RAM - C, RAM (two). Numeric annotations shown: 0, 4, 5, 53, 72, 500.]
27
PARTITIONING - 4th BOARD CONFIGURATION PHASE
[Figure: fourth board configuration, mapped onto CPE0, PE1, PE2, PE3 and PE4. Blocks shown: Read from RAM - C, Input Image, Find Max Intensity, Combine Image and Frames, Output Image, RAM (two). Numeric annotations shown: 4, 11, 53, 72, 75, 90, 119.]
28
PRODUCTIVITY IMPROVEMENT IS 100X (250 hours
manually vs. 2.5 hours automatically)
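[Figure: design flow compared for automatic and manual mapping over time. Labels shown: Application; Khoros; Partitioning Suite; Data Matching, Data Synchronization; WSP2NETLIST; NETLIST2STV; Synthesis/Place & Route; ACS; Automatic; Manual; time.]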
29
LESSONS LEARNED
  • Learned that the translation from KHOROS to
    hardware is complicated by several factors,
    including:
  • Differences in the way blocks of data are passed
    from operator to operator.
  • Parameters for data bit-widths must be specified
    for each cell.
  • Difference between data-driven KHOROS cells and
    clock-driven hardware cells creates a need for
    data synchronization.
  • Determined that reconfiguration time was the
    major obstacle to achieving high performance, and
    that RAM access conflicts required more
    reconfigurations than would be otherwise
    necessary.
  • Learned that manual implementation of KHOROS
    applications on WildForce is very time-consuming
    and tedious (250 hours).
  • Thus, great potential exists for making a
    significant (100x) improvement in productivity
    via automation.

30
SCHEDULE AND MILESTONES
  • May 98: Demonstrated the manual mapping of a
    simple KHOROS network on a Xilinx-based ACS
    (EVC-1). We also validated our method for
    library development at the KHOROS, VHDL and
    FPGA levels.
  • Sep 98: Demonstrated the manual mapping of a
    more complex KHOROS network on a Xilinx-based
    ACS (Wildforce).
  • Mar 99: Demonstrated the manual mapping of a
    complex KHOROS network with some automated FPGA
    partitioning on the Wildforce.
  • Sep 99: Automated additional portions of the
    application development flow.
  • Jan 00: Will demonstrate the Army Night Vision
    Lab challenge problem with automatic mapping
    onto the Wildforce.
  • Mar 00: Will demonstrate two additional
    challenge problems (e.g. Face Detection and
    Image Backprojection) on the Wildforce.
  • Sep 00: Will demonstrate all three challenge
    problems on two additional ACS platforms (e.g.
    Altera-based ACS and latest Xilinx-Virtex ACS).
31
CHAMPION: A SOFTWARE DESIGN ENVIRONMENT FOR
ADAPTIVE COMPUTING SYSTEMS