Title: Approximation Algorithm for Data Mapping on Block Multithreaded Network Processor Architectures
1. Approximation Algorithm for Data Mapping on Block Multi-threaded Network Processor Architectures
- Chris Ostler and Karam S. Chatha
- Department of Computer Science and Engineering
- Arizona State University
- Tempe, AZ.
2. Intel IXP 2400 Processor
- Eight independent micro-engines
  - Support for 8 threads
  - Block multi-threaded
  - Fast context switch
- Available data memory
  - 2.5 KB local memory
  - 16 KB scratchpad
  - Off-chip SRAM
  - Off-chip DRAM
- Complex architecture that is challenging to program in the absence of system-level design tools
3. Overall Design Flow
- Application specification
- MPSoC specification
- Application profiling
- Pragma specification
- Process mapping (ASPDAC07)
- Multi-threading aware data mapping
- Code generation
- Other research: generalized application mapping (Shirazi et al.), system-level design (De Micheli et al.), NP-centric (Plishker et al., Chen et al.)
4. Block Multi-threading
- A single thread completes once every 4800 clock cycles (cc)
  - Throughput: 1/4800 cc
- With multiple (8) threads, one thread completes every 600 cc
  - Throughput: 1/600 cc
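The 8x gain is consistent with a simple overlap model, sketched here in Python. The split of the 4800 cc into 600 cc of computation and 4200 cc of memory stall is an assumption chosen to reproduce the slide's numbers, and the function name is illustrative.

```python
def cycles_per_completion(n_threads, busy_cc, stall_cc):
    """Average cycles between thread completions on one micro-engine.

    Simplified model (an assumption, not taken from the slides): each
    thread keeps the engine busy for busy_cc cycles and waits stall_cc
    cycles on memory, during which other threads may run.
    """
    # One round of n threads needs at least n * busy_cc cycles (engine
    # bound) and at least busy_cc + stall_cc cycles (latency bound).
    round_cc = max(n_threads * busy_cc, busy_cc + stall_cc)
    return round_cc / n_threads


BUSY_CC, STALL_CC = 600, 4200  # assumed split of the 4800 cc

print(cycles_per_completion(1, BUSY_CC, STALL_CC))  # 4800.0
print(cycles_per_completion(8, BUSY_CC, STALL_CC))  # 600.0
```

Under this model, eight threads hide the stall completely because 8 x 600 cc covers one thread's 4800 cc.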
5. Multi-threading aware data mapping
- Local memory has 3 cc access latency (throughput 1/900 cc)
- Scratchpad has 60 cc access latency (throughput 1/825 cc)
  - Non-local memory access not completely amortized
[Figure: thread execution timeline; surviving labels: 4200 and 1800]
6. Multi-threading aware data mapping
- Data items in local memory and scratchpad (throughput 1/675 cc)
  - Non-local memory access completely amortized
- Performance improvements due to multi-threading strongly depend on data mapping
- Memory capacity constraints increase the problem complexity
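The three throughput figures (1/900, 1/825 and 1/675) can be reproduced with a small model, sketched here in Python. The 600 cc of computation, the 100 data accesses, and the 25/75 split are assumptions picked because they match the slide numbers; they are not stated in the slides. The model also assumes that local accesses stall the engine while non-local accesses trigger a context switch.

```python
def cycles_per_completion(n_threads, compute_cc, accesses):
    """accesses: list of (count, latency_cc, is_local) tuples."""
    # Assumption: local accesses add to the time the thread occupies the
    # engine; non-local accesses are waited on while other threads run.
    busy = compute_cc + sum(c * lat for c, lat, local in accesses if local)
    stall = sum(c * lat for c, lat, local in accesses if not local)
    return max(n_threads * busy, busy + stall) / n_threads


COMPUTE_CC = 600          # assumed computation per thread
LOCAL_CC, SCRATCH_CC = 3, 60  # latencies quoted on the slides

all_local = cycles_per_completion(8, COMPUTE_CC, [(100, LOCAL_CC, True)])
all_scratch = cycles_per_completion(8, COMPUTE_CC, [(100, SCRATCH_CC, False)])
mixed = cycles_per_completion(
    8, COMPUTE_CC, [(25, LOCAL_CC, True), (75, SCRATCH_CC, False)]
)

print(all_local, all_scratch, mixed)  # 900.0 825.0 675.0
```

Under the same assumptions, the all-scratchpad case has 6000 cc of memory latency per thread, of which 7 x 600 = 4200 cc is hidden by the other threads and 1800 cc remains exposed, which may be what the 4200 and 1800 labels refer to.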
7. Problem Definition
- Given
  - Process with
    - execution time tp and
    - set D of data items characterized by size and number of accesses
  - Set K of possible multi-threading configurations
  - Set M of data memories characterized by
    - latencies, capacities and types (local / non-local)
- Objective
  - Maximize process throughput by
    - obtaining a mapping of data items to memories and
    - selecting a multi-threading configuration
- Such that
  - Data memory capacities are not violated
- Problem can be shown to be NP-complete
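One way to hold a problem instance and check the capacity constraint is sketched here in Python. The class names, field names, and the example data items are hypothetical; the memory sizes and latencies are the local memory and scratchpad values quoted on the earlier slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    name: str
    size: int      # bytes
    accesses: int  # accesses per process execution


@dataclass(frozen=True)
class Memory:
    name: str
    latency_cc: int
    capacity: int  # bytes
    is_local: bool


def respects_capacities(mapping, items, memories):
    """mapping: dict from data item name to memory name."""
    used = {m.name: 0 for m in memories}
    for item in items:
        used[mapping[item.name]] += item.size
    return all(used[m.name] <= m.capacity for m in memories)


memories = [
    Memory("local", 3, 2560, True),       # 2.5 KB local memory
    Memory("scratch", 60, 16384, False),  # 16 KB scratchpad
]
items = [DataItem("table", 2000, 25), DataItem("queue", 4000, 75)]  # made up

ok = respects_capacities({"table": "local", "queue": "scratch"}, items, memories)
bad = respects_capacities({"table": "local", "queue": "local"}, items, memories)
print(ok, bad)  # True False
```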
8. Algorithm overview
  1. for each possible number of threads
  2.   for every data partition into local/non-local memory
  3.     perform data assignment to memories
         determine solution throughput
         save solution with best throughput
       end for
     end for
- Step 1 is polynomial in problem size
- Step 2 is exponential in problem size
  - Data items partitioning problem
- Step 3 is exponential in problem size
  - Memory assignment problem
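A brute-force version of this loop structure is sketched here in Python. It collapses steps 2 and 3 into one enumeration of all item-to-memory mappings, which makes the exponential cost visible. The throughput model, the tuple layouts, and the example values are assumptions for illustration, not the authors' implementation.

```python
from itertools import product


def cycles_per_completion(n_threads, compute_cc, mapping, items, memories):
    # Assumed model: local accesses occupy the engine, non-local accesses
    # are overlapped with the execution of other threads.
    busy, stall = compute_cc, 0
    for name, (_size, accesses) in items.items():
        latency, _capacity, is_local = memories[mapping[name]]
        if is_local:
            busy += accesses * latency
        else:
            stall += accesses * latency
    return max(n_threads * busy, busy + stall) / n_threads


def fits(mapping, items, memories):
    used = dict.fromkeys(memories, 0)
    for name, (size, _accesses) in items.items():
        used[mapping[name]] += size
    return all(used[m] <= memories[m][1] for m in memories)


def exhaustive_search(compute_cc, items, memories, thread_options):
    """Returns (cycles per completion, threads, mapping) of the best solution."""
    best = None
    names = list(items)
    for n in thread_options:                                  # step 1
        for choice in product(memories, repeat=len(names)):  # steps 2 and 3
            mapping = dict(zip(names, choice))
            if not fits(mapping, items, memories):
                continue
            cc = cycles_per_completion(n, compute_cc, mapping, items, memories)
            if best is None or cc < best[0]:
                best = (cc, n, mapping)
    return best


# items: name -> (size in bytes, accesses); memories: name -> (latency, capacity, is_local)
items = {"table": (2000, 25), "queue": (4000, 75)}  # hypothetical
memories = {"local": (3, 2560, True), "scratch": (60, 16384, False)}

best = exhaustive_search(600, items, memories, thread_options=(1, 2, 4, 8))
print(best)
```

With |M| memories and |D| data items the inner loop visits |M|^|D| mappings, which is why the later slides replace it with approximation steps.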
9. Non-local memory assignment
- Sort data items by ratio of accesses to size
- Assign to fastest available non-local memory until capacity is exceeded
- Resolve capacity violation
  - Move either the violating item to the slowest memory, or
  - all other items to the slowest memory
- Resulting solution has throughput at least ½ times optimal
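The greedy steps can be sketched in Python for the simplified case of one fast memory with limited capacity and one slow memory assumed large enough for everything. Choosing between the two repair options by the number of accesses served from the fast memory is an assumption; the slides do not say how the choice is made. The function name and tuple layout are illustrative.

```python
def assign_two_level(items, fast_capacity):
    """items: list of (name, size, accesses) with size > 0.

    Returns (names in fast memory, names in slow memory).
    """
    # Highest accesses-per-byte first.
    order = sorted(items, key=lambda it: it[2] / it[1], reverse=True)

    prefix, used, violator, rest = [], 0, None, []
    for idx, item in enumerate(order):
        if used + item[1] > fast_capacity:
            violator, rest = item, order[idx + 1:]
            break
        prefix.append(item)
        used += item[1]

    if violator is None:  # everything fits in the fast memory
        return [it[0] for it in prefix], []

    # Option A: keep the prefix in fast memory, move the violator out.
    # Option B: keep only the violator in fast memory (if it fits alone).
    prefix_accesses = sum(it[2] for it in prefix)
    if violator[1] <= fast_capacity and violator[2] > prefix_accesses:
        fast, slow = [violator], prefix + rest
    else:
        fast, slow = prefix, [violator] + rest
    return [it[0] for it in fast], [it[0] for it in slow]


print(assign_two_level([("a", 4, 40), ("b", 5, 30), ("c", 8, 40), ("d", 2, 4)], 10))
# (['a', 'b'], ['c', 'd'])
print(assign_two_level([("x", 1, 10), ("y", 10, 90)], 10))
# (['y'], ['x'])
```

The second call shows why the repair step matters: filling by ratio alone would leave the heavily accessed item `y` in the slow memory.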
10. Data item classification
- Create equivalence classes by scaling and rounding
- Number of classes: log(a_max/a_min) × log(s_max/s_min)
[Figure: grid of data item classes. Accesses axis: 1, (1+ε), ..., (1+ε)^i, ..., spanning log(a_max/a_min) classes. Sizes axis: 1, (1+ε), ..., (1+ε)^i, ..., spanning log(s_max/s_min) classes.]
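A minimal Python sketch of such a classification follows. It assumes that sizes and access counts are scaled by their minimum values and rounded down to a power of (1+ε); the slides do not state the rounding direction, and the function name and tuple layout are illustrative.

```python
import math
from collections import defaultdict


def classify(items, eps):
    """items: list of (name, size, accesses), all values positive.

    Returns a dict mapping (size class, access class) to item names.
    """
    s_min = min(size for _, size, _ in items)
    a_min = min(acc for _, _, acc in items)
    log_base = math.log(1 + eps)

    classes = defaultdict(list)
    for name, size, acc in items:
        # Class index i means the scaled value lies in
        # [(1+eps)^i, (1+eps)^(i+1)).
        i = math.floor(math.log(size / s_min) / log_base)
        j = math.floor(math.log(acc / a_min) / log_base)
        classes[(i, j)].append(name)
    return dict(classes)


result = classify([("a", 1, 1), ("b", 3, 5), ("c", 3, 7), ("d", 20, 1)], eps=1.0)
print(result)
```

Items that land in the same class, such as `b` and `c` in this example, are treated as interchangeable, which is what bounds the number of partitions that must be examined.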
11. Data item partitioning
- Number of possible partitions is pseudo-polynomial in problem size
  - Throughput at least 1/(1+ε) times optimal throughput
  - Memory usage no more than (1+ε) times available memory
- Overall
  - Throughput at least 1/(2(1+ε)) times optimal
  - Memory usage no more than (1+ε) times available memory
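The overall bound reads as the product of the two individual guarantees: the factor ½ from the non-local memory assignment and the factor 1/(1+ε) from the partitioning. Written out, assuming the factors compose multiplicatively and using T_alg and T_opt as assumed names for the achieved and optimal throughput:

```latex
T_{\mathrm{alg}} \;\ge\; \frac{1}{2}\cdot\frac{1}{1+\varepsilon}\,T_{\mathrm{opt}}
\;=\; \frac{T_{\mathrm{opt}}}{2(1+\varepsilon)},
\qquad
\text{memory used} \;\le\; (1+\varepsilon)\cdot\text{memory available}.
```

For example, with ε = 0.25 the guarantee is at least 0.4 times the optimal throughput while using at most 1.25 times the available memory.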
12. Experimental results
13. Conclusion
- Described the multi-threading aware data mapping problem
- Approximation algorithm
  - Throughput no less than 1/(2(1+ε)) times optimal
  - Utilizes no more than (1+ε) times available memory
- Experimental results
  - show speedup of up to 130% in comparison to naïve mapping, and
  - are within 80% of the optimal solution