Title: A Reconfigurable Signal Processing IC with embedded FPGA and Multi-Port Flash Memory
M. Borgatti, L. Calì, G. De Sandre, B. Forêt, D. Iezzi, F. Lertora, G. Muzzi, M. Pasotti, M. Poles, P.L. Rolandi
STMicroelectronics - Central R&D - Italy
2. Outline of Presentation
- Project motivation and background
- System architecture
- Reconfigurable core
- Memory subsystem
- System performance
- Application example: embedded face recognition system
- Energy efficiency, measurements
- SoC integration and design flow
- System-to-RTL and RTL-to-layout
- Summary
3. Project motivation and background
- Conflicting industry trends
- Economics of system integration
- Ever more complex SoCs
- More integration
- Cost effectiveness and performance (per unit)
- Increasing design complexity and risks
- Increasing NREs
- Shorter time-to-market and product life
- Strong need for
- Faster project turnaround
- Lower risk
- Usage of re-configurable silicon fabrics
4. Project motivation and background
- Pragmatic approach proposed
- Reconfigurable architecture
- Joins a statically extensible processor with e-FPGA
- Tight connection to Flash memory subsystem
- Open architecture with flexible programmable I/O
- Programmable platform approach
- Simple model for programmers
5. Programmable Platform Approach
[Figure: programmable platform layer diagram. Labels: System Applications Family; System Application; Application Compilation; Platform Compilation; Config. Proc e-FPGA; Silicon process, Enabling technologies; Programmable platform]
6. System Architecture
[Figure: system block diagram. Labels: Extensible MPU; 8 kB I; 8 kB D; 48 kB SRAM; bus bridge; 64-bit AHB bus; M/S AHB I/F; DMA; FPGA Prog. I/F; e-FPGA; INTs; Instr. Ext.; Inst. Ext. I/F; Buffer I/F; 1 kB Buffer; Flash Mem with FP, CP and DP ports; AHB/APB Bridge; 64-bit APB bus; I2C Master; I2C bus; I/O registers; GP I/O; General Purpose I/O Lines]
7. e-FPGA Purposes
- Processor ISA extensions
- Simplest programmer's model
- Specific interface to the MPU datapath
- Impact on processor performance
- Impact on processor energy efficiency
- Efficiency limited by instruction stream decoding
- Bus-mapped co-processor
- Maximum benefits in speed/power
- Flexible I/O
8. e-FPGA Microprocessor interface
[Figure: e-FPGA to microprocessor pipeline interface. Labels: e-FPGA Clock; Microprocessor clock; Clock Ctrl; Instruction; Other FPGA Purposes; Pipe Control; Decode; Register File; R; Instruction extension; E; Result]
9. Flash Memory Architecture
[Figure: flash memory block diagram. Labels: DFT; four 2 Mb modules (0 to 3); PMA; Power Block; 128-bit Memory Sub-System Crossbar with four 128-bit module links; µP I/F to an 8-bit µP; Data Port (DP, 64-bit); Code Port (CP, 64-bit); FPGA Port (FP, 32-bit)]
10. Flash Memory Subsystem
- Modular approach
- Customizable array of N independent 2Mb modules
- 3 content-specific ports (CP, DP, FP)
- HW support for file-system implementation (DP)
- Defrag
- Compression
- Virtual erase
- 2Mb Module features
- 128b I/O
- 40ns access time (400MB/s peak throughput)
- Power management and arbitration
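
The 400 MB/s peak figure follows directly from the 128-bit word and the 40 ns access time quoted above. A minimal sketch of that arithmetic, assuming one full word is delivered per access and decimal megabytes (the function name is illustrative and does not come from the slides):

```python
def peak_throughput_mb_s(word_bits, access_time_ns):
    """Peak read throughput in MB/s (1 MB = 10**6 bytes).

    Assumption: exactly one word of `word_bits` is read per access.
    """
    bytes_per_access = word_bits / 8
    accesses_per_second = 1e9 / access_time_ns
    return bytes_per_access * accesses_per_second / 1e6


# One 2 Mb module: 128-bit I/O, 40 ns access time.
module_peak = peak_throughput_mb_s(word_bits=128, access_time_ns=40)
print(module_peak)
```

Under these assumptions a single module already reaches the quoted peak; how the crossbar shares that bandwidth among the three ports is not specified here.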
11. System Memory Hierarchy
- AHB peak throughput: 800 MB/s
- e-FPGA: 400 MB/s (50 MB/s sustained)
- Total aggregate peak: 1.2 GB/s
[Figure: memory hierarchy block diagram. Labels: 32-bit µP Register File; AHB Bridge; 64-bit AHB bus; DMA; 32-bit FPGA PI/F; 64-bit CP I/F; 64-bit DP I/F; 64-bit Port CP; 32-bit Port FP; 64-bit Port DP; 512-B Buffer; 2 x 64-bit and 1 x 32-bit Memory Port I/Fs; 6x4 128-bit Crossbar; 4 x Flash Memory Controller Logic; 4 x 16384 x 128-bit Memory Module]
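
The throughput and capacity figures on this slide are consistent with simple width-times-clock arithmetic. The sketch below is one reading of the numbers, not a statement of how the authors derived them: it assumes a 100 MHz clock (the frequency used elsewhere in the presentation), one transfer per cycle, and that the aggregate peak is the sum of the AHB and e-FPGA figures. All names are illustrative.

```python
CLOCK_HZ = 100_000_000  # assumption: 100 MHz system clock


def bus_peak_mb_s(width_bits, clock_hz=CLOCK_HZ):
    """Peak bus throughput in MB/s, assuming one transfer per clock cycle."""
    return (width_bits // 8) * clock_hz / 1e6


ahb_peak = bus_peak_mb_s(64)        # 64-bit AHB bus
fpga_peak = bus_peak_mb_s(32)       # 32-bit e-FPGA port
aggregate_peak = ahb_peak + fpga_peak

# Capacity: each module is 16384 words of 128 bits; four modules in total.
module_bits = 16384 * 128
total_bytes = 4 * module_bits // 8

print(ahb_peak, fpga_peak, aggregate_peak)
print(module_bits, total_bytes)
```

With these assumptions each module holds 2 Mb and the four modules together hold 1 MB (2**20 bytes), which matches the module size given for the flash subsystem.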
12. Application Example: Face Recognition
- Target application
- Recognize one face out of twenty
- Low-resolution images from CMOS cameras
- Potential applications
- Low cost smart toys
- Advanced human-machine interfaces
- Color CMOS camera processors
- Image preprocessing: Bayer filter
- Face location based on Hough transform
- Face recognition: line-based
- Recognition rates over 90%
- Scale-invariant
- Tolerant to changes in illumination intensity
13. Processor Extension (I)
- 8-issue, 8-bit L2 distance
- Complexity
- 23 8-bit OPS
- 6 64-bit OPS
- 1 GOPS peak throughput
- Distance computation
- 10k equiv. ASIC gates
- Mapped to e-FPGA
[Figure: distance-unit datapath. Labels: µProcessor Load Unit; 8; 16; two 4-segm. blocks; "-" and "x" operators; 64-bit register; Result]
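
As a software reference for what an 8-issue, 8-bit L2 distance operation might compute, the sketch below treats two 64-bit words as eight packed unsigned bytes and returns the sum of squared differences. This is an assumption about the arithmetic only: the slides do not give the lane order, the result width, or whether the square root is taken inside this unit, and the function names are hypothetical.

```python
def pack_u8x8(values):
    """Pack eight unsigned 8-bit values into one 64-bit word (lane 0 = lowest byte)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def unpack_u8x8(word):
    """Split a 64-bit word into its eight unsigned 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(8)]


def squared_l2_u8x8(a, b):
    """Sum of squared differences over the eight 8-bit lanes of a and b."""
    return sum((x - y) ** 2 for x, y in zip(unpack_u8x8(a), unpack_u8x8(b)))


a = pack_u8x8([1, 2, 3, 4, 5, 6, 7, 8])
b = pack_u8x8([1, 2, 3, 4, 5, 6, 7, 10])
print(squared_l2_u8x8(a, b))
```

The worst case, all lanes differing by 255, is 8 x 255^2 = 520200, so the accumulated result needs at least 19 bits.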
14. Processor Extension (II)
- Fixed-point square root kernel
- Complexity
- 12 32-bit OPS
- 2k equiv. ASIC gates
- Mapped to e-FPGA
[Figure: square-root kernel datapath. Labels: Number; Remaind.; root; >>1; <<2; >>2; >>30; 1; "-"; ">"; 2; <<1; Result]
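
The shift and compare labels in this datapath resemble the well-known shift-and-subtract integer square root. The sketch below shows that textbook method for a 32-bit unsigned input; it is offered as an illustration of the kind of kernel involved, not as the circuit actually mapped to the e-FPGA, whose exact algorithm and fixed-point format are not given in the slides.

```python
def isqrt32(number):
    """Floor of the square root of a 32-bit unsigned integer (shift-and-subtract)."""
    if not 0 <= number < 2**32:
        raise ValueError("expected a 32-bit unsigned value")
    remainder = number
    root = 0
    bit = 1 << 30            # highest power of four that fits in 32 bits
    while bit > remainder:   # skip leading powers of four above the input
        bit >>= 2
    while bit:
        if remainder >= root + bit:
            remainder -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root


print(isqrt32(16), isqrt32(17), isqrt32(2**32 - 1))
```

Each iteration uses only shifts, one addition, one comparison and one subtraction, which is why this style of kernel maps to a small amount of logic.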
15. Performance: Processing Time @ 100 MHz
Algorithm Stage                        RISC w/ basic DSP    RISC w/ basic DSP + uP Ext.    Speed-Up
Bayer Filter                           58 msec              24.7 msec                      x 2.3
Edge Detection                         4.5 msec             2.5 msec                       x 1.8
Face Detection                         1.5 sec              382 msec                       x 4
Face Recognition (20-face database)    9.15 sec             860 msec                       x 10.6
Totals                                 10.7 sec             1.26 sec                       x 8.5
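
The speed-up column can be recomputed from the two timing columns. The sketch below does this with the figures from the table, converted to milliseconds; the variable names are illustrative. Small differences from the slide values (for example a summed total near 1.27 s rather than 1.26 s) are presumably due to rounding in the reported timings.

```python
# (baseline ms, extended ms) per stage, taken from the table above.
stages_ms = {
    "Bayer Filter": (58.0, 24.7),
    "Edge Detection": (4.5, 2.5),
    "Face Detection": (1500.0, 382.0),
    "Face Recognition": (9150.0, 860.0),
}

# Per-stage speed-up = baseline time / time with processor extensions.
speedups = {name: base / ext for name, (base, ext) in stages_ms.items()}

total_base_ms = sum(base for base, _ in stages_ms.values())
total_ext_ms = sum(ext for _, ext in stages_ms.values())
total_speedup = total_base_ms / total_ext_ms

for name, value in speedups.items():
    print(f"{name}: x {value:.2f}")
print(f"Totals: x {total_speedup:.2f}")
```

The total is dominated by the face recognition stage, so the overall speed-up sits close to that stage's own gain.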
16. Energy Efficiency vs. Flexibility
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility (coverage). Regions: Dedicated HW; FPGA-mapped CoProcessors; uP + FPGA Instructions; ASIPs, DSPs; Embedded Processors. Annotation: "Energy-Flexibility Gap!". From Zhang et al., ISSCC 2000]
17. Performance: Energy Efficiency
Algorithm Stage                        Speed-Up    Energy Gain    Energy x Delay Gain
Bayer Filter                           x 2.3       x 1.4          x 3.2
Edge Detection                         x 1.8       x 0.95         x 1.7
Face Detection                         x 4         x 2.9          x 11.6
Face Recognition (20-face database)    x 10.6      x 9            x 95.4
Totals                                 x 8.5       x 6.7          x 57
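
The last column is the product of the first two: the energy-delay gain is the speed-up multiplied by the energy gain. A short check using the table values (names are illustrative):

```python
# (speed-up, energy gain) per stage, taken from the table above.
gains = {
    "Bayer Filter": (2.3, 1.4),
    "Edge Detection": (1.8, 0.95),
    "Face Detection": (4.0, 2.9),
    "Face Recognition": (10.6, 9.0),
    "Totals": (8.5, 6.7),
}

# Energy x delay gain = speed-up x energy gain.
energy_delay_gain = {name: s * e for name, (s, e) in gains.items()}

for name, value in energy_delay_gain.items():
    print(f"{name}: x {value:.2f}")
```

An energy gain below 1, as for edge detection, means that stage consumes slightly more energy with the extension even though it runs faster.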
18. [Figure: system-to-RTL design flow. Labels: Functional model (untimed); Partitioning / I/F Synthesis / Refinement; uP ISS; Cycle Accurate Simulation, Performance Analysis; Libraries HW/SW; VHDL (e-FPGA); Inst. Ext. Verilog; HW (RTL): uP, AHB/APB Bus, Peripherals; C; Soft Hardware (eFPGA); SW Apps; eFPGA mapping; eFPGA HARD MACRO; SoC Integration]
19. [Figure: RTL-to-layout design flow. Labels: CPU core, IPs; Interface RTL code; Flash RAM; eFPGA core; Inst. Ext.; Coproc.; I/O I/F; Synthesis; Floorplanning / P&R; Synthesis; Static Timing Analysis, Dynamic Verification; Con.; Mapping (P&R); FPGA Timing DB; Bit-stream; Netlist Timing Database; Static Timing Analysis (SoC + eFPGA); Silicon fab]
20. Chip Layout
Process: 0.18 um CMOS, 2P/6M, embedded Flash
Flash memory (x4): 256 kB x 9 sectors, 128-bit word, 1 MB/s write throughput, 400 MB/s read throughput
SRAM memory: Main 48 kB (64-bit), I 8 kB (64-bit), D 8 kB (64-bit), Buffers 4 x 256 B
Chip size: 8.4 x 8.4 mm2 (e-FPGA size 8.2 mm2)
I/O: 24 inputs, 24 outputs (tristate), 8 bidirs
Supply: 2.7-3.6 V (external), 1.8 V (core)
[Figure: die layout. Labels: DFT; 1 MB Flash memory; Embedded FPGA; 8+8 kB I/D; TAGS; 32b uP + AHB + APB (250k gates); Flash ports + buffers; uP AHB/APB; FPGA; 48 kB SRAM; Buffer]
21. Chip Performances and Power Consumption
Processor maximum speed: 125 MHz (WCMIL)
Reconfiguration speed: 500 us @ 100 MHz clock
Chip average power consumption: 300 mW @ 100 MHz, 1.8 V
22. Summary
- e-FPGAs allow architectural tradeoffs for reconfigurable embedded systems
- Processor ISA extensions
- Bus-mapped co-processor
- Flexible I/O
- Modular, content-specific, multiport e-Flash
- Performance figures
- Up to 10x speedup
- Up to 9x energy reduction
- Dynamic reconfiguration in 500 us
- Specific design-flow for system and RTL
23. Acknowledgements
The authors thank all the colleagues of NVM-DP
Dept. A. Maurelli, F. Piazza and L. Fumagalli.