Approximation Algorithm for Data Mapping on Block Multithreaded Network Processor Architectures

1
Approximation Algorithm for Data Mapping on Block
Multi-threaded Network Processor Architectures
  • Chris Ostler and Karam S. Chatha
  • Department of Computer Science and Engineering
  • Arizona State University
  • Tempe, AZ.

2
Intel IXP 2400 Processor
  • Eight independent micro-engines
    • Support for 8 threads
    • Block multi-threaded
    • Fast context switch
  • Available data memory
    • 2.5 KB local memory
    • 16 KB scratchpad
    • Off-chip SRAM
    • Off-chip DRAM
  • Complex architecture that is challenging to
    program in the absence of system-level design
    tools

3
Overall Design Flow
Application specification
MPSoC specification
Application profiling
Pragma specification
Process mapping (ASPDAC07)
Multi-threading aware data mapping
Code generation
Other research: generalized application mapping (Shirazi et al.),
system-level design (De Micheli et al.), NP-centric (Plishker et al.,
Chen et al.)

4
Block Multi-threading
  • Single thread completes once every 4800 cc
  • Throughput: 1 completion per 4800 cc
  • Multiple (8) threads complete once every 600 cc
  • Throughput: 1 completion per 600 cc
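
The arithmetic behind these two figures can be sketched with a simplified model. The sketch assumes that each thread alternates between compute cycles (which occupy the micro-engine) and memory stall cycles (which other threads can overlap). The function name and the 600/4200 split of the 4800 cc are illustrative assumptions, not values stated on the slide.

```python
def completion_period(compute_cc, stall_cc, threads):
    """Cycles between completions on one block multi-threaded engine.

    Simplified model (an assumption): only one thread computes at a time,
    while memory stalls of one thread overlap with the work of the others.
    """
    per_thread_total = compute_cc + stall_cc
    # The engine can never finish faster than one thread's compute time,
    # and n threads can at best divide the total time by n.
    return max(compute_cc, per_thread_total / threads)


# Hypothetical split of the 4800 cc into 600 cc compute and 4200 cc stall.
single = completion_period(600, 4200, threads=1)
eight = completion_period(600, 4200, threads=8)
```

Under this model, adding threads beyond the point where stalls are fully hidden gives no further gain, because the period is then bounded by the compute time alone.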

5
Multi-threading aware data mapping
  • Local memory has 3 cc access latency (throughput 1/900)
  • Scratch pad has 60 cc access latency (throughput 1/825)
  • Non-local memory access not completely amortized

6
Multi-threading aware data mapping
  • Data items in local memory and scratch pad (throughput 1/675)
  • Non-local memory access completely amortized
  • Performance improvements due to multi-threading
    strongly dependent on data mapping
  • Memory capacity constraints increase the problem
    complexity


7
Problem Definition
  • Given
    • Process with execution time tp and set D of data items
      characterized by size and number of accesses
    • Set K of possible multi-threading configurations
    • Set M of data memories characterized by latencies, capacities
      and types (local / non-local)
  • Objective
    • Maximize process throughput by obtaining a mapping of data items
      to memories and selecting a multi-threading configuration
  • Such that
    • Data memory capacities are not violated
  • Problem can be shown to be NP-complete

8
Algorithm overview
  • for each possible number of threads (step 1)
      for every data partition into local/non-local memory (step 2)
        perform data assignment to memories (step 3)
        determine solution throughput
        save solution with best throughput
      end for
    end for
  • Step 1 is polynomial in problem size
  • Step 2 is exponential in problem size
    • Data item partitioning problem
  • Step 3 is exponential in problem size
    • Memory assignment problem
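
The exhaustive loop above can be sketched as follows. This is a brute-force illustration, not the approximation algorithm from the talk. It assumes a single non-local memory with unlimited capacity (so step 3 is trivial), local accesses that stall the engine, non-local accesses that are overlapped by other threads, and unique item names. The function name, tuple layout, and throughput model are all assumptions.

```python
from itertools import combinations


def best_mapping(items, t_p, thread_options, local_cap, local_lat, remote_lat):
    """Exhaustive search; items are (name, size, accesses) tuples.

    Returns (period_cc, threads, names_in_local_memory) with the smallest
    completion period, i.e. the highest throughput under the assumed model.
    """
    best = None
    for k in thread_options:                          # step 1: thread counts
        for r in range(len(items) + 1):               # step 2: every partition
            for local in combinations(items, r):
                if sum(size for _, size, _ in local) > local_cap:
                    continue                          # capacity violated
                remote = [it for it in items if it not in local]
                # Local accesses occupy the engine; remote ones can be hidden.
                busy = t_p + local_lat * sum(a for _, _, a in local)
                wait = remote_lat * sum(a for _, _, a in remote)
                period = max(busy, (busy + wait) / k)
                if best is None or period < best[0]:
                    best = (period, k, sorted(n for n, _, _ in local))
    return best


# Hypothetical instance: two 4-byte items, 4 bytes of local memory.
result = best_mapping([("x", 4, 10), ("y", 4, 20)], t_p=100,
                      thread_options=[1, 2], local_cap=4,
                      local_lat=3, remote_lat=60)
```

The inner loop visits all 2^|D| partitions, which is the exponential cost the slide attributes to step 2.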

9
Non-local memory assignment
  • Sort data items by ratio of accesses to size
  • Assign to fastest available non-local memory until capacity is
    exceeded
  • Resolve capacity violation by moving either
    • the violating item to the slowest memory, or
    • all other items to the slowest memory
  • Resulting solution has throughput at least ½ times optimal
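
A minimal sketch of this greedy step, assuming only two non-local memories: a fast one with limited capacity and a slow one with unlimited capacity. Choosing between the two repair options by total access latency is an assumption made for illustration; the slide does not state the selection criterion. The names `Item`, `greedy_assign`, and `total_latency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    size: int       # bytes
    accesses: int   # accesses per process execution


def total_latency(fast_items, slow_items, fast_lat, slow_lat):
    """Sum of access latencies in cycles (assumed cost measure)."""
    return (fast_lat * sum(i.accesses for i in fast_items)
            + slow_lat * sum(i.accesses for i in slow_items))


def greedy_assign(items, fast_cap, fast_lat, slow_lat):
    """Return (fast_items, slow_items) following the greedy rule."""
    order = sorted(items, key=lambda i: i.accesses / i.size, reverse=True)
    fast, used = [], 0
    for idx, item in enumerate(order):
        if used + item.size <= fast_cap:
            fast.append(item)
            used += item.size
            continue
        # Capacity violation: compare the two repair options.
        rest = order[idx + 1:]
        keep_others = (fast, [item] + rest)       # violating item goes slow
        if item.size > fast_cap:
            return keep_others                    # item can never fit
        keep_violator = ([item], fast + rest)     # all other items go slow
        return min(keep_others, keep_violator,
                   key=lambda o: total_latency(o[0], o[1], fast_lat, slow_lat))
    return fast, []


# Hypothetical data; the 90 cc slow-memory latency is an assumed value.
demo = [Item("A", 4, 40), Item("B", 8, 72), Item("C", 2, 4)]
fast_part, slow_part = greedy_assign(demo, fast_cap=10, fast_lat=60, slow_lat=90)
```

Comparing the greedy prefix against the single violating item is the same device used in the classic ½-approximation for knapsack, which is consistent with the ½ bound quoted above.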

10
Data item classification
  • Create equivalence classes by scaling and rounding
  • Number of classes
  • log(amax/amin) × log(smax/smin)

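[Figure: grid of item classes. Accesses axis with class boundaries 1, (1+ε), …, (1+ε)^i, …, spanning log(amax/amin) classes. Sizes axis with class boundaries 1, (1+ε), …, (1+ε)^i, …, spanning log(smax/smin) classes.]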
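
The scaling-and-rounding idea can be sketched as geometric bucketing. The sketch assumes that each value is scaled by the minimum and rounded down to a power of (1+ε), so a class is a pair of exponents. The function names and the tuple layout are illustrative assumptions.

```python
import math
from collections import defaultdict


def class_index(value, minimum, eps):
    """Exponent i such that (1+eps)^i <= value/minimum < (1+eps)^(i+1)."""
    return math.floor(math.log(value / minimum) / math.log(1.0 + eps))


def classify(items, eps):
    """Group (name, size, accesses) tuples into (size class, access class)."""
    s_min = min(size for _, size, _ in items)
    a_min = min(acc for _, _, acc in items)
    classes = defaultdict(list)
    for name, size, acc in items:
        key = (class_index(size, s_min, eps), class_index(acc, a_min, eps))
        classes[key].append(name)
    return dict(classes)


# Hypothetical items; eps = 1.0 means each class spans a factor of two.
groups = classify([("a", 16, 10), ("b", 24, 12), ("c", 100, 70)], eps=1.0)
```

Items in the same class differ by less than a factor of (1+ε) in both size and access count, which is what allows them to be treated as interchangeable during partitioning. Values lying exactly on a class boundary may land on either side because of floating-point rounding in the logarithm.
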
11
Data item partitioning
  • Number of possible partitions is pseudo-polynomial in problem size
  • Throughput at least 1/(1+ε) times optimal throughput
  • Memory usage no more than (1+ε) times the available memory
  • Overall
  • Throughput at least 1/(2(1+ε)) times optimal
  • Memory usage no more than (1+ε) times the available memory
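
The overall bound is the product of the two stage guarantees: the ½ factor from non-local memory assignment and the 1/(1+ε) factor from partitioning. A small sketch that evaluates both factors for a given ε follows; the function name is an illustrative assumption.

```python
def guarantees(eps):
    """Return (throughput factor, memory factor) for a given eps."""
    assignment_factor = 0.5                  # non-local memory assignment
    partition_factor = 1.0 / (1.0 + eps)     # data item partitioning
    throughput_factor = assignment_factor * partition_factor
    memory_factor = 1.0 + eps                # allowed memory usage
    return throughput_factor, memory_factor


# Example: eps = 0.25 gives a throughput factor of 0.4 and a memory factor of 1.25.
tf, mf = guarantees(0.25)
```

Smaller ε tightens both factors, at the price of more classes and therefore more partitions to enumerate.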

12
Experimental results
13
Conclusion
  • Described the multi-threading aware data mapping
    problem
  • Approximation algorithm
  • Throughput no less than 1/(2(1+ε)) times optimal
  • Utilizes no more than (1+ε) times memory
  • Experimental results show speedup of up to 130% in comparison to
    naïve mapping, and are within 80% of optimal solution