Title: Design and Implementation of a NoC-Based Cellular Computational System
- By Shervin Vakili
- Supervisors: Dr. Sied Mehdi Fakhraie and Dr. Siamak Mohammadi
February 09, 2009
2. Outline
- Introduction and Motivations
- Basics of Evolvable Multiprocessor System (EvoMP)
- EvoMP Operational View
- EvoMP Architectural View
- Simulation and Synthesis Results
- Summary
3. Outline — current section: Introduction and Motivations
4. Introduction and Motivations (1)
- Computing systems have played an important role in the advances of human life over the last four decades.
- The number and complexity of applications are continuously increasing.
- More computational power is required.
- Three main hardware design approaches (trading flexibility against performance):
  - ASIC (hardware realization)
  - Reconfigurable computing
  - Processor-based designs (software realization)
(Figure: flexibility vs. performance trade-off of the three approaches)
5. Introduction and Motivations (2)
- Microprocessors are the most popular approach.
  - Flexibility and reprogrammability
  - Low performance
- Architectural techniques to improve processor performance:
  - Pipelining, out-of-order execution, superscalar, VLIW, etc.
  - These techniques seem to have saturated in recent years.
6. Introduction and Motivations (3)
- Emerging trends aim to achieve:
  - More performance
  - While preserving the classical software development process.
7. Why Multi-Processor?
- One of the main trends is to increase the number of processors.
  - Uses thread-level parallelism (TLP)
  - Similarity to single-processor designs
  - Short time-to-market
  - Post-fabrication reusability
  - Flexibility and programmability
- The field is moving toward a large number of simple processors on a chip.
8. Number of Processing Cores in Different Products [3]
(Figure: core counts of different commercial products)
9. MPSoC Development Challenges (1)
- MP systems face some major challenges.
- Programming models:
  - MP systems require concurrent software.
  - Concurrent software development requires two operations:
    - Decomposition of the program into tasks
    - Scheduling of the tasks among cooperating processors
  - Both are NP-complete problems.
  - Both strongly affect performance.
10. MPSoC Development Challenges (2)
- Two main solutions:
  - 1. Software development using parallel programming libraries (e.g., MPI and OpenMP)
    - Done manually by the programmer.
    - Requires a huge investment to re-develop existing software.
  - 2. Automatic parallelization at compile time
    - Does not require reprogramming, but requires re-compilation.
    - The compiler performs both task decomposition and scheduling.
11. MPSoC Development Challenges (3)
- Control and synchronization
  - Needed to address inter-processor data dependencies.
- Debugging
  - Tracking concurrent execution is difficult.
  - Particularly in heterogeneous architectures with processors of different ISAs.
12. MPSoC Development Challenges (4)
- All MPSoCs can be divided into two categories:
  - Static scheduling
    - Task scheduling is performed before execution.
    - Predetermined number of contributing processors.
    - Has access to the entire program.
  - Dynamic scheduling
    - A run-time scheduler (in hardware or in the OS) performs task scheduling.
    - Does not depend on the number of processors.
    - Only has access to pending tasks and available resources.
13. Outline — current section: Basics of Evolvable Multiprocessor System (EvoMP)
14. Proposal of the Evolvable Multi-processor System (1)
- This thesis introduces a novel MPSoC that:
  - Uses evolutionary strategies for run-time task decomposition and scheduling.
  - Is called EvoMP (Evolvable Multi-Processor system).
- Features:
  - Can directly execute classical sequential code on an MP platform.
  - Uses a hardware evolutionary algorithm core to perform run-time task decomposition and scheduling.
  - Distributed control and computing
  - Flexibility
  - NoC-based, 2D mesh, and homogeneous
15. Proposal of the Evolvable Multi-processor System (2)
- All computational units hold one copy of the entire program.
- The EvoMP architecture exploits a hardware evolutionary core to generate a bit-string (chromosome).
- This bit-string determines which processor is in charge of executing each instruction.
- The primary version of EvoMP uses a genetic algorithm core.
16. Target Applications
- Applications that perform a unique computation on a stream of data, e.g.:
  - Digital signal processing
  - Packet processing in network applications
  - Processing of large volumes of sensory data
17. Streaming Applications Code Style

Two-Tap FIR Filter:
    Initial:
    1-  MOV R1, 0
    2-  MOV R2, 0
    L1 Loop:
    3-  MOV R1, Input
    4-  MUL R3, R1, Coe1
    5-  MUL R4, R2, Coe2
    6-  ADD R1, R3, R4
    7-  MOV Output, R1
    8-  MOV R1, R2
    9-  GENETIC
    10- JUMP L1

- Streaming programs have two main parts:
  - Initialization
  - An infinite (or semi-infinite) loop
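For reference, what the two-tap FIR loop computes can be modeled in plain Python (a hypothetical sketch; the function name and the delay-register update are my reading of the assembly, not part of the thesis):

```python
def fir2(samples, coe1, coe2):
    """Software model of the two-tap FIR loop:
    out[n] = coe1 * x[n] + coe2 * x[n-1]."""
    r1, r2 = 0, 0              # instructions 1-2: initialization
    outputs = []
    for x in samples:          # the L1 loop over the input stream
        r1 = x                 # 3: MOV R1, Input
        r3 = r1 * coe1         # 4: MUL R3, R1, Coe1
        r4 = r2 * coe2         # 5: MUL R4, R2, Coe2
        r1 = r3 + r4           # 6: ADD R1, R3, R4
        outputs.append(r1)     # 7: MOV Output, R1
        r2 = x                 # 8: update the delay register for the next pass
    return outputs
```

For example, `fir2([1, 2, 3], 2, 3)` yields `[2, 7, 12]`.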
18. Outline — current section: EvoMP Operational View
19. EvoMP Top View
- The genetic core produces a bit-string (chromosome).
- The chromosome determines the location of execution of each instruction.

(Figure: a 2x2 mesh of processors P-00, P-01, P-10, and P-11 connected through NoC switches SW00, SW01, SW10, and SW11; each processor holds a full copy of the two-tap FIR program, and the genetic core broadcasts the chromosome, e.g. 011011011, to all of them.)
20. How EvoMP Works (1)
- The following process is repeated in each iteration:
  - At the beginning of each iteration, the genetic core generates the bit-string (chromosome) and sends it to all processors.
  - The processors execute this iteration with the determined decomposition and scheduling scheme.
  - A counter in the genetic core counts the number of clock cycles spent.
  - When all processors have reached the end of the loop, the genetic core uses the output of this counter as the fitness value.
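The iteration loop just described can be sketched in Python (an illustrative model with invented names; the real core is hardware, and `evaluate_iteration` stands for running one loop iteration and reading the cycle counter):

```python
import random

def crossover(a, b):
    """Single-point crossover of two equal-length bit-strings."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(population, evaluate_iteration, generations=50):
    """Model of the genetic core's loop: broadcast each chromosome,
    time one iteration, and use the cycle count as fitness (lower is
    better). The elite count is fixed at two, as in EvoMP."""
    for _ in range(generations):
        scored = sorted(population, key=evaluate_iteration)
        elites = scored[:2]                  # keep the two best unchanged
        offspring = [crossover(random.choice(elites), random.choice(elites))
                     for _ in range(len(population) - 2)]
        population = elites + offspring
    return min(population, key=evaluate_iteration)
```

Because the two elites survive every generation, the best fitness in the population never gets worse over time.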
21. How EvoMP Works (2)
(State diagram: Initialize -> Evolution -> Final, with "terminate" as the transition into Final and "fault detected" returning the system to Evolution.)
- Three main working states:
  - Initialize
    - Used only for the first population.
    - The genetic core generates a random initial population.
  - Evolution
    - Uses recombination to produce new populations.
    - When the termination condition is met, the system goes to the Final state.
  - Final
    - The best chromosome is used as the constant output of the genetic core.
    - When one of the processors becomes faulty, the system returns to the Evolution state.
22. How the Chromosome Codes the Scheduling Data (1)
- Each chromosome consists of several small words (genes).
- Each word contains two fields:
  - A processor number
  - A number of instructions
23. How the Chromosome Codes the Scheduling Data (2)
- Assume a 2x2 mesh.

(Figure: the two-tap FIR program (instructions 1-10, including the GENETIC instruction) beside an example chromosome split into words Word1-Word4; each word pairs a 2-bit processor number, e.g. 10, with a 3-bit number of instructions, e.g. 001.)
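Under these assumptions, decoding such a chromosome can be sketched as follows (field widths are illustrative: 2 processor bits fit a 2x2 mesh and 3 count bits allow up to 7 instructions per word; the function is hypothetical, not from the thesis):

```python
def decode_chromosome(bits, proc_bits=2, count_bits=3):
    """Split a chromosome bit-string into (processor, instruction
    count) pairs, one pair per gene/word."""
    word = proc_bits + count_bits
    schedule = []
    for i in range(0, len(bits) - word + 1, word):
        gene = bits[i:i + word]
        proc = int(gene[:proc_bits], 2)     # which processor executes
        count = int(gene[proc_bits:], 2)    # how many consecutive instructions
        schedule.append((proc, count))
    return schedule
```

For example, `decode_chromosome("1000101010")` gives `[(2, 1), (1, 2)]`: one instruction on processor 10, then two on processor 01.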
24. Data Dependency Problem
- Data dependencies are the main challenge.
- They must be detected dynamically at run-time.
- They are addressed using:
  - A particular machine code style
  - Architectural techniques
25. EvoMP Machine Code Style
- Each source operand is replaced by the line number (ID) of the most recent instruction that has changed it.
- This enormously simplifies dependency detection.

    10. ADD R1, R2, R3    ; R3 <= R1 + R2
    11. AND R2, R6, R7    ; R7 <= R2 & R6
    12. SUB R7, R3, R4    ; R4 <= R7 - R3
    becomes:
    12. SUB (11), (10), R4
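A minimal software model of this renaming, assuming the `OP src1, src2, dest` operand order shown above (the function and tuple format are hypothetical):

```python
def rename_operands(program):
    """Replace each source register with the line number of the most
    recent instruction that wrote it; sources with no earlier writer
    keep their register name, as in the EvoMP machine-code style."""
    last_writer = {}                 # register name -> producing line number
    renamed = []
    for line_no, op, src1, src2, dest in program:
        s1 = last_writer.get(src1, src1)
        s2 = last_writer.get(src2, src2)
        renamed.append((line_no, op, s1, s2, dest))
        last_writer[dest] = line_no  # this line is now dest's producer
    return renamed
```

Applied to the three instructions above, line 12 indeed becomes SUB (11), (10), R4, since R7 was last written by line 11 and R3 by line 10.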
26. Outline — current section: EvoMP Architectural View
27. Architecture of Each Processor
- The number of FUs is configurable.
- Homogeneous or heterogeneous policies can be used for the FUs.
- Supports out-of-order execution.
- The first free FU grabs the instruction from the Instr bus (daisy chain).
28. Fetch_Issue Unit
- The PC1-Instr bus is used for executive instructions.
- The PC2-Invalidate_Instr bus is used for data dependency detection.
29. Functional Unit
- Can be configured to execute different operations:
  - Arithmetic operations
    - Add
    - Subtract
    - Shift/rotate right/left
    - Multiply (add-and-shift)
  - Logical operations
30. Genetic Core
(Figure: the genetic core attached to a 2x2 mesh of cells Cell-00, Cell-01, Cell-10, and Cell-11 through switches SW00, SW01, SW10, and SW11.)
- Population size and mutation rate are configurable.
- The elite count is constant and equal to two in order to reduce hardware complexity.
31. EvoMP Challenges
- The current version uses a centralized memory unit.
  - Located at address 00.
  - This node does not contain computational circuits.
  - A major issue for scalability.
- The search space of the genetic algorithm is very large.
  - It grows exponentially with a linear increase in the number of processors.
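The exponential growth can be illustrated with a simplified count: if each of N instructions could independently be assigned to any of P processors, there are P**N possible assignments (the real chromosome encodes groups of instructions, but the trend is the same):

```python
def assignment_space(num_procs, num_instructions):
    """Simplified search-space size: independent per-instruction
    processor assignment (an illustration, not the exact EvoMP
    chromosome space)."""
    return num_procs ** num_instructions

# Doubling the processors multiplies the space by 2**N, e.g. for the
# 74-instruction FIR-16 program going from 2 to 4 processors:
ratio = assignment_space(4, 74) // assignment_space(2, 74)   # == 2**74
```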
32. PSO Core [8]
33. Outline — current section: Simulation and Synthesis Results
34. Configurable Parameters
- There are several configurable parameters in EvoMP:
  - Word length of the system
  - Size of the mesh (number of processors)
  - Flit length (bit-length of the NoC switch links)
  - Population size
  - Crossover rate
35. Simulation Results
- Two sets of applications are used for performance evaluation:
  - Some DSP programs
  - Some sample neural networks
- Two other decomposition and scheduling methods are implemented to enable comparison:
  - Static Decomposition Genetic Scheduler (SDGS)
    - Decomposition is performed statically, i.e., tasks are predetermined manually.
    - The genetic core only specifies the scheduling scheme.
  - Static Decomposition First Free Scheduler (FF)
    - Assigns the first task in the job queue to the first free processor in the system.
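The FF baseline can be modeled in a few lines (a hypothetical sketch with invented names; task durations stand in for whatever cost the real queue entries carry):

```python
def first_free_makespan(task_times, num_procs):
    """Assign each task, in queue order, to the processor that becomes
    free first; return the resulting makespan in time units."""
    free_at = [0] * num_procs            # when each processor next idles
    for t in task_times:
        p = min(range(num_procs), key=free_at.__getitem__)
        free_at[p] += t                  # the first free processor takes the task
    return max(free_at)
```

For example, `first_free_makespan([3, 2, 2, 1], 2)` returns 4: the greedy policy ignores data dependencies and task shapes, which is why FF trails the evolved schedules in the tables that follow.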
36. 16-Tap FIR Filter
- Parameters:
  - 16-bit mode
  - Population size: 16
  - Crossover rate: 8
  - NoC connection width: 16
- 74 instructions, 16 of them multiplications.
- Best fitness shows the number of clock cycles required to execute one iteration using the best particle found so far.
37. 8-Point DCT
- Parameters:
  - 16-bit mode
  - Population size: 16
  - Crossover rate: 8
  - NoC connection width: 16
- 88 instructions, 32 of them multiplications.
38. 16-Point DCT
- Parameters:
  - 16-bit mode
  - Population size: 16
  - Crossover rate: 6
  - NoC connection width: 16
- 320 instructions, 128 of them multiplications.
39. 5x5 Matrix Multiplication
- Parameters:
  - 16-bit mode
  - Population size: 16
  - Crossover rate: 6
  - NoC connection width: 16
- 406 instructions, 125 of them multiplications.
40. Results on 1x3 and 2x2 Meshes

                                      FIR-16   DCT-8   DCT-16   MATRIX-5x5
Number of Instructions                    74      88      324          406
Number of Multiply Instructions           16      32      128          125
1x2 mesh (one proc., all three schemes):
  Fitness (clock cycles)                 350     671     2722         3181
  Speed-up                                 1       1        1            1
1x3 mesh, Main Design:
  Fitness (clock cycles)                 214     403     1841         2344
  Speed-up                              1.63    1.66     1.47         1.37
  Evolution Time (us)                  27342   42807    74582       198384
1x3 mesh, SDGS:
  Fitness (clock cycles)                 202     401     1812         2218
  Speed-up                              1.73    1.67     1.50         1.43
  Evolution Time (us)                   1967   29315    84365        65119
1x3 mesh, First Free:
  Fitness (clock cycles)                 293     733     2529         2487
  Speed-up                              1.19    0.91     1.08         1.27
2x2 mesh, Main Design:
  Fitness (clock cycles)                 171     319     1460         1868
  Speed-up                              2.04    2.10     1.86         1.70
  Evolution Time (us)                  30174   54790    23319       294828
2x2 mesh, SDGS:
  Fitness (clock cycles)                 161     306     1189         1817
  Speed-up                              2.17    2.19     2.28         1.75
  Evolution Time (us)                  10739   52477   536565        10092
2x2 mesh, First Free:
  Fitness (clock cycles)                 239     681     1933         2098
  Speed-up                              1.46    0.98     1.40         1.51
41. Results on the 2x3 Mesh

                                      FIR-16   DCT-8   DCT-16   MATRIX-5x5
Number of Instructions                    74      88      324          406
Number of Multiply Instructions           16      32      128          125
1x2 mesh (one proc., all three schemes):
  Fitness (clock cycles)                 350     671     2722         3181
  Speed-up                                 1       1        1            1
2x3 mesh, Main Design:
  Fitness (clock cycles)            uneval.      285     1213         1596
  Speed-up                          uneval.     2.33     2.25         1.99
  Evolution Time (us)               uneval.    93034   630482       546095
2x3 mesh, SDGS:
  Fitness (clock cycles)            uneval.      256     1106         1575
  Speed-up                          uneval.     2.62     2.46         2.01
  Evolution Time (us)               uneval.    41023   111118       178219
2x3 mesh, First Free:
  Fitness (clock cycles)            uneval.      496     1587         1815
  Speed-up                          uneval.     1.35     1.71         1.75
42. Neural Network Case Study

Network    # Instr.  # Mult.   1x2 mesh           1x3 mesh                 2x2 mesh                 2x3 mesh
                               Fitness  Speed-up  Fitness  Speed-up  Time  Fitness  Speed-up  Time  Fitness  Speed-up  Time
4-4-1         58       20         450      1         281     1.60     125     245     1.83      52     207     2.17      262
3-9-2         95       45         905      1         570     1.59      52     503     1.80     163     463     1.95      342
12-20-10     924      440        8304      1        5153     1.61     892    4365     1.90    1832    3813     2.18     3436
43. Fault Tolerance Results
- When a fault is detected in a processor, the evolutionary core eliminates that processor from contribution in subsequent iterations.
- The system also returns to the Evolution stage to find a suitable solution for the new situation.
- The best obtained fitness in a 2x3 EvoMP running the 16-point DCT program is evaluated.
- Faults are injected into processors 010, 001, and 101 at 1000000 us, 2000000 us, and 3000000 us, respectively.
44. Genetic vs. PSO
- Population size in both experiments is 16.
- Fit = best fitness (clock cycles); Time = evolution time.

           # Instr.  # Mult.  Particle length (bits)  1x2 mesh (both cores)
FIR-16        74        16            240                      350
DCT-8         88        32            280                      671
DCT-16       324       128            720                     2722
MAT-5x5      406       125            800                     3181

           1x3 Genetic    1x3 PSO       2x2 Genetic    2x2 PSO       2x3 Genetic    2x3 PSO
           Fit    Time    Fit    Time   Fit    Time    Fit    Time   Fit    Time    Fit    Time
FIR-16     214    23.7    211    12.3   171    30.1    174    14.3   uneval.        uneval.
DCT-8      403    93.0    393     6.2   319    99.8    308    21.8   285   138.1    203    15.6
DCT-16    1841    74.5   1831    41.7  1460    23.3   1439    45.3  1213   633.7   1191    98.3
MAT-5x5   2344   198.3   2312    86.3  1868   294.8   1821   148.3  1596   546.7   1518   240.9
45. Synthesis Results
- Synthesis results on a Virtex-II (XC2V3000) FPGA using Synplify Pro.

                   NoC switch  Genetic Core  PSO Core   MMU         Processor   Total System
Area (Total LUTs)  729 (2%)    1864 (6%)     1642 (5%)  3553 (12%)  4433 (15%)  20112 (70%)
Max Freq. (MHz)    -           68.4          94.6       -           -           61.4
46. Outline — current section: Summary
47. Summary
- EvoMP, a novel MPSoC system, was studied.
- EvoMP exploits evolvable strategies to perform run-time task decomposition and scheduling.
- EvoMP does not require concurrent code because it can parallelize sequential code at run-time.
- It exploits a particular and novel processor architecture in order to address the data dependency problem.
- Experimental results confirm the applicability of EvoMP's novel ideas.
48. Main References
- [1] N. S. Voros and K. Masselos, System Level Design of Reconfigurable Systems-on-Chip. Netherlands: Springer, 2005.
- [2] G. Martin, "Overview of the MPSoC design challenge," Proc. Design and Automation Conf., July 2005, pp. 274-279.
- [3] S. Amarasinghe, "Multicore programming primer and programming competition," class notes for 6.189, Computer Architecture Group, Massachusetts Institute of Technology. Available: www.cag.csail.mit.edu/ps3/lectures/6.189-lecture1-intro.pdf
- [4] M. Hubner, K. Paulsson, and J. Becker, "Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores," Proc. Intl. Symp. Parallel and Distributed Processing, 2005.
- [5] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, "Runtime adaptive multi-processor system-on-chip (RAMPSoC)," Proc. Intl. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
- [6] A. Klimm, L. Braun, and J. Becker, "An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores," Proc. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
- [7] Z.Y. Wen and Y.J. Gang, "A genetic algorithm for tasks scheduling in parallel multiprocessor systems," Proc. Intl. Conf. Machine Learning and Cybernetics, Nov. 2003, pp. 1785-1790.
- [8] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, "Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization," Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.
49. Main References (2)
- [9] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips. San Francisco: Morgan Kaufmann Publishers, 2005.
- [10] A.J. Page and T.J. Naughton, "Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing," Proc. Intl. Symp. Parallel and Distributed Processing, April 2005, pp. 189.1.
- [11] E. Carvalho, N. Calazans, and F. Moraes, "Heuristics for dynamic task mapping in NoC based heterogeneous MPSoCs," Proc. Int. Rapid System Prototyping Workshop, 2007, pp. 34-40.
- [12] R. Canham and A. Tyrrell, "An embryonic array with improved efficiency and fault tolerance," Proc. NASA/DoD Conf. on Evolvable Hardware, July 2003, pp. 265-272.
- [13] W. Barker, D. M. Halliday, Y. Thoma, E. Sanchez, G. Tempesti, and A. Tyrrell, "Fault tolerance using dynamic reconfiguration on the POEtic Tissue," IEEE Trans. Evolutionary Computing, vol. 11, no. 5, Oct. 2007, pp. 666-684.
50. Related Publications
- Journal
  - S. Vakili, S. M. Fakhraie, and S. Mohammadi, "EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling," submitted to IET Comp. Digital Tech. (under revision).
  - S. Vakili, S. M. Fakhraie, and S. Mohammadi, "Low-cost fault tolerance in evolvable multiprocessor system: a graceful degradation approach," submitted to Journal of Zhejiang University SCIENCE A (JZUS-A).
- Conference
  - S. Vakili, S. M. Fakhraie, and S. Mohammadi, "Designing an MPSoC architecture with run-time and evolvable task decomposition and scheduling," Proc. 5th IEEE Intl. Conf. Innovations in Information Technology, Dec. 2008.
  - S. Vakili, S. M. Fakhraie, S. Mohammadi, and Ali Ahmadi, "Particle swarm optimization for run-time task decomposition and scheduling in evolvable MPSoC," Proc. IEEE Intl. Conf. Computer Engineering and Technology, Jan. 2009.