Title: Implementing DSP Algorithms with Networks on Chip
1Implementing DSP Algorithms with Networks on Chip
2Processing Elements and Networks on Chip
Dally01, Benini02
3Problem
- Mapping computations to processing elements
- Hu03, Murali04
- Scheduling switch fabric for traffic patterns known at compile time
4Different Mapping
5Same Mapping, Different Schedule
6Different Results
- Performance
- Latency
- Throughput
- Power
- Reliability
7Outline
- Background: Switch Fabrics, BvN
- Formulation: Graph, Feasibility
- Mapping Heuristics
- Scheduling Heuristics
- Experiments
8BvN Decomposition
Any connection from inputs to outputs is feasible
Decompose traffic matrix into sum of permutation matrices Chang01
Crossbar (Unbuffered, Input Queued)
9BvN Decomposition
- Optimum: Least Number of Matrices
- Leads to fewest number of cycles
- Lower Bound
- Maximum of row and column sums
- Bound is tight
- Proof yields a polynomial time algorithm
- Assumptions
- Traffic known a priori and captured by a traffic matrix
- Rearrangeable fabric can implement all permutations
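A minimal sketch of how the decomposition can be computed, assuming an integer traffic matrix given as a list of lists. The function names (`pad_to_regular`, `perfect_matching`, `bvn_schedule`) are hypothetical and this is not necessarily the algorithm of Chang01: it pads the matrix with dummy packets until every row and column sums to the maximum line sum, then peels off one permutation per cycle with a simple augmenting-path matching. The number of cycles it produces equals the lower bound above.

```python
def pad_to_regular(T):
    """Add dummy packets so every row and column sums to the max line sum M."""
    n = len(T)
    P = [row[:] for row in T]
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    M = max(rows + cols)
    i = j = 0
    while i < n and j < n:
        add = min(M - rows[i], M - cols[j])
        P[i][j] += add
        rows[i] += add
        cols[j] += add
        row_done, col_done = rows[i] == M, cols[j] == M
        if row_done:
            i += 1
        if col_done:
            j += 1
    return P, M


def perfect_matching(P):
    """Return perm with P[i][perm[i]] > 0 for all i (augmenting paths)."""
    n = len(P)
    match_col = [-1] * n  # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if P[i][j] > 0 and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            raise ValueError("no perfect matching; matrix is not regular")
    perm = [0] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm


def bvn_schedule(T):
    """Return one list of (source, destination) transfers per cycle."""
    P, M = pad_to_regular(T)
    remaining = [row[:] for row in T]
    schedule = []
    for _ in range(M):
        perm = perfect_matching(P)
        cycle = []
        for i, j in enumerate(perm):
            P[i][j] -= 1
            if remaining[i][j] > 0:  # skip dummy packets added by padding
                remaining[i][j] -= 1
                cycle.append((i, j))
        schedule.append(cycle)
    return schedule


T = [[0, 2, 1],
     [1, 0, 1],
     [2, 0, 0]]
schedule = bvn_schedule(T)
print(len(schedule), "cycles")
```

For this example the maximum line sum is 3, so the sketch emits three cycles. Emitting one unit permutation per cycle keeps the code short; a production version would group repeated permutations.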
10VLSI Implementation Issues
- Rearrangeable fabrics not scalable
- Excessive connection
- Irregular placement and routing
- Motivates study of simple topologies
- Tree, mesh, torus, fat tree Leighton91
- High quality schedules critical
11Outline
- Background: Switch Fabrics, BvN
- Formulation: Graph, Feasibility
- Mapping Heuristics
- Scheduling Heuristics
- Experiments
12Switch Fabric and Traffic Matrix
- Switch Fabrics
- Links joined by programmable crosspoints
- Simple graph representation
- Traffic Matrix
- Integer entries encoding number of packets from PE i to PE j
[Figure: switch fabric graph with nodes labeled 1 to 6]
13Feasibility 1
- Feasible Matrix: corresponds to a Vertex Disjoint Path Set
- Can be transferred in one cycle
- Rearrangeable Fabrics: every permutation matrix is feasible
- General Fabrics: see the example below
[Figure: example with labels 1 to 6]
14Feasibility 2
[Figure: example fabric with nodes labeled 1 to 6]
1→4 and 2→5 are not feasible together because any paths chosen will share a vertex.
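The feasibility test can be sketched as a small backtracking search for a vertex disjoint path set. The graph below is a hypothetical fabric chosen only to reproduce the situation described (it is not the fabric from the slide), and `find_vdps` is an illustrative name. The search is exponential in the worst case, so it is only suitable for small examples.

```python
def find_vdps(adj, pairs, used=frozenset()):
    """Return vertex-disjoint paths, one per (src, dst) pair, or None."""
    if not pairs:
        return []
    (src, dst), rest = pairs[0], pairs[1:]
    if src in used or dst in used:
        return None
    # Endpoints of the remaining pairs may not be used as intermediate hops.
    reserved = {v for pair in rest for v in pair}

    def extend(path, visited):
        node = path[-1]
        if node == dst:
            others = find_vdps(adj, rest, used | visited)
            return None if others is None else [path] + others
        for nxt in adj[node]:
            if nxt in visited or nxt in used or nxt in reserved:
                continue
            result = extend(path + [nxt], visited | {nxt})
            if result is not None:
                return result
        return None

    return extend([src], frozenset([src]))


# Hypothetical fabric: PEs 1 and 2 hang off switch 3, PEs 4 and 5 off switch 6.
adj = {1: [3], 2: [3], 3: [1, 2, 6], 4: [6], 5: [6], 6: [3, 4, 5]}

print(find_vdps(adj, [(1, 4)]))          # a single transfer is routable
print(find_vdps(adj, [(1, 4), (2, 5)]))  # both at once: None (shared vertices)
print(find_vdps(adj, [(1, 2), (4, 5)]))  # local transfers can share a cycle
```

In this assumed fabric every path between the two sides passes through vertices 3 and 6, so 1→4 and 2→5 cannot be placed in the same cycle.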
15Schedule
- A collection of feasible matrices that sum to the traffic matrix
- Number of Cycles = Number of Matrices
- Optimum schedule has the least number of cycles
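As a rough illustration of this definition, the sketch below checks that a candidate schedule is a list of feasible matrices summing to the traffic matrix. The helper names are hypothetical, and the default feasibility test assumes a rearrangeable fabric, where a matrix is feasible when it is a (partial) permutation; a general fabric would need a fabric-specific test passed in as `is_feasible`.

```python
def line_sum_bound(T):
    """Lower bound on cycles: the maximum row or column sum of T."""
    rows = [sum(r) for r in T]
    cols = [sum(c) for c in zip(*T)]
    return max(rows + cols)


def is_partial_permutation(m):
    """At most one packet leaves each source and enters each destination."""
    return (all(x in (0, 1) for row in m for x in row)
            and all(sum(row) <= 1 for row in m)
            and all(sum(col) <= 1 for col in zip(*m)))


def check_schedule(schedule, T, is_feasible=is_partial_permutation):
    """True if every matrix is feasible and the matrices sum to T."""
    n = len(T)
    total = [[0] * n for _ in range(n)]
    for m in schedule:
        if not is_feasible(m):
            return False
        for i in range(n):
            for j in range(n):
                total[i][j] += m[i][j]
    return total == T


T = [[0, 1],
     [2, 0]]
schedule = [[[0, 1], [1, 0]],
            [[0, 0], [1, 0]]]
print(check_schedule(schedule, T), len(schedule), line_sum_bound(T))
```

Here the schedule uses two matrices, which matches the line-sum bound of 2, so it is optimum for this small example.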
16Optimum vs. Greedy
[Figure: two schedules, labeled Greedy and Optimum]
Same color packets are scheduled in the same cycle
Greedy method takes one more cycle
17Decision Problem
- VDPS: Vertex Disjoint Path Set
- NP-Complete Garey79
- Given a graph representing the switch fabric, can a traffic matrix be scheduled in L cycles?
- L = 1 case: Is the traffic matrix feasible? Equivalently, is there a VDPS for the traffic matrix?
- Hardness of VDPS ⇒ hardness of scheduling on a general fabric
18Outline
- Background: Switch Fabrics, BvN
- Formulation: Graph, Feasibility
- Mapping Heuristics
- Scheduling Heuristics
- Experiments
19Mapping and Scheduling
- One-to-one mapping from DFG nodes to PEs done before scheduling actual traffic
- Given a mapping, scheduling step generates the actual cycle-by-cycle scheme for communication
20Why mapping heuristics?
- Hard to evaluate a mapping
- Deeply combinatorial problem by nature
21Setup for Heuristic
- Distance from u to v when each edge in the graph is of length 1
- FOM (figure of merit) to minimize over all possible mappings
- Finding the best mapping is still NP-hard
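A minimal sketch of an exchange-based mapping heuristic under stated assumptions. The FOM formula did not survive in these notes, so the code assumes a traffic-weighted hop distance, FOM = sum over i, j of T[i][j] * d(map[i], map[j]); this formula, the function names, and the 4-PE line fabric are all assumptions for illustration, not the presenters' definitions.

```python
from collections import deque
from itertools import combinations


def hop_distances(adj):
    """All-pairs distances by BFS, with every edge of length 1."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def fom(T, mapping, dist):
    """Assumed figure of merit: packets times hops, summed over all pairs."""
    n = len(T)
    return sum(T[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))


def improve_by_exchange(T, mapping, dist):
    """Swap the PEs of two DFG nodes whenever that lowers the assumed FOM."""
    mapping = list(mapping)
    best = fom(T, mapping, dist)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(range(len(mapping)), 2):
            mapping[a], mapping[b] = mapping[b], mapping[a]
            cost = fom(T, mapping, dist)
            if cost < best:
                best, improved = cost, True
            else:  # undo a swap that does not help
                mapping[a], mapping[b] = mapping[b], mapping[a]
    return mapping, best


# Hypothetical fabric: four PEs in a line, 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = hop_distances(adj)
T = [[0, 0, 0, 5],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
start = [0, 1, 2, 3]  # mapping[i] is the PE hosting DFG node i
print(fom(T, start, dist), improve_by_exchange(T, start, dist))
```

The exchange loop stops at a local minimum, which is consistent with the point above that finding the best mapping remains NP-hard.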
22Example Inputs
[Figure: Initial Mapping and Traffic Matrix for nodes 0 to 3]
23Example Exchange
[Figure: exchange example on nodes 0 to 3, with value -6]
Source 0 originates 034311 packets
Source 3 originates 03104 packets
Intuitively better to place source 0 closer to destinations
24Example Result Series
[Figure: series of results with values -6, -2, -6, -4, -4, -7, -4]
25Outline
- Background: Switch Fabrics, BvN
- Formulation: Graph, Feasibility
- Mapping Heuristics
- Scheduling Heuristics
- Experiments
26Congestion Metric
- Design Criteria
- Fast to calculate
- Captures hot spots
- Congestion on an edge is computed from the row sums, the column sums, and the distances
- Based on the current traffic matrix
27Generate Schedule
- Pick the source or destination with maximum packets flowing in or out
- BvN and Tree
- Avoid congested links
- Good metric of congestion
- Shortest path based on congestion with fine-tuning
- Keep adding paths to get a vertex disjoint path set
- Record the VDPS in the schedule
- Update traffic matrix, recalculate congestion values
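The loop above can be sketched roughly as follows. This is a simplified stand-in, not the presented heuristic: it orders transfers by the heaviest source or destination load, but it routes with plain breadth-first shortest paths instead of congestion-weighted ones, and it assumes PE k is vertex k of the fabric graph. All names and the ring fabric are hypothetical.

```python
from collections import deque


def shortest_free_path(adj, src, dst, blocked):
    """BFS shortest path from src to dst that avoids blocked vertices."""
    if src in blocked or dst in blocked:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and v not in blocked:
                parent[v] = u
                queue.append(v)
    return None


def greedy_schedule(adj, T):
    """Return a list of cycles; each cycle is a vertex disjoint path set."""
    n = len(T)
    T = [row[:] for row in T]  # work on a copy of the traffic matrix
    schedule = []
    while any(x for row in T for x in row):
        out_load = [sum(row) for row in T]
        in_load = [sum(col) for col in zip(*T)]
        pending = [(i, j) for i in range(n) for j in range(n) if T[i][j] > 0]
        # Serve transfers touching the busiest source or destination first.
        pending.sort(key=lambda p: -max(out_load[p[0]], in_load[p[1]]))
        used, cycle = set(), []
        for i, j in pending:
            path = shortest_free_path(adj, i, j, used)
            if path is not None:
                cycle.append(path)
                used.update(path)
                T[i][j] -= 1  # update the traffic matrix
        if not cycle:
            raise ValueError("some transfer has no path in the fabric")
        schedule.append(cycle)
    return schedule


# Hypothetical fabric: four PEs on a ring, 0 - 1 - 2 - 3 - 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
T = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
for cycle_no, paths in enumerate(greedy_schedule(ring, T)):
    print(cycle_no, paths)
```

On this ring the transfers 0 to 2 and 1 to 3 cannot share a cycle, because any path for one passes through an endpoint of the other, so the sketch needs two cycles.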
28Outline
- Background: Switch Fabrics, BvN
- Formulation: Graph, Feasibility
- Mapping Heuristics
- Scheduling Heuristics
- Experiments
29LDPC
- 96 coders and 48 checkers
- 23x23 mesh
- Spare horizontal and vertical tracks to help routing
[Figure: plot of Number of Cycles]
30Distance Inverted Congestion
[Figure: plot of Number of Cycles]
31Mapping
[Figure: plot of Number of Cycles]
32Discussion Future Work
- Explored mapping and scheduling for NoCs
- Statically scheduled fabrics
- Heuristics beating manual solutions, approaching optimum
- One VDPS may take multiple clocks to finish
- Fast networks are pipelined
- Fixed time cycle not practical
- Tweak heuristics to pack short transfers together
33Thank You