Title: Introduction to FPGA Technology, Devices and Tools
2. FPGA Devices & Technology
3. World of Integrated Circuits
- Full-Custom ASICs
- Semi-Custom ASICs
- User Programmable
- - PLD
- - FPGA
4. FPGA: Field Programmable Gate Array
- Small development overhead
- No NRE (non-recurring engineering) costs
- Quick time to market
- No minimum quantity order
- Reprogrammable
ASIC: Application Specific Integrated Circuit
- Designed all the way from behavioral description to physical layout
- Designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
5. How can we make logic programmable?
- One-time programmable
- - Fuses (destroy internal links with current)
- - Anti-fuses (grow internal links)
- - PROM
- Reprogrammable
- - EPROM
- - EEPROM
- - Flash
- - SRAM (volatile)
6. What is an FPGA?
[Figure labels: Configurable Logic Blocks, I/O Blocks, Block RAMs]
7. Which Way to Go?
ASICs
- High performance
- Low power
- Low cost in high volumes
FPGAs
- Off-the-shelf
- Low development cost
- Short time to market
- Reconfigurability
8. Other FPGA Advantages
- The manufacturing cycle for an ASIC is very costly and lengthy, and engages a lot of manpower
- - Mistakes not detected at design time have a large impact on development time and cost
- - FPGAs are perfect for rapid prototyping of digital circuits
- Easy upgrades, as with software
- Unique applications
- - Reconfigurable computing
9. Major FPGA Vendors
- SRAM-based FPGAs
- - Xilinx, Inc.
- - Altera Corp.
- - Atmel
- - Lattice Semiconductor
- Flash & antifuse FPGAs
- - Actel Corp.
- - QuickLogic Corp.
[Slide annotation: Share over 60% of the market]
11. Xilinx
- Primary products: FPGAs and the associated CAD software
- Main headquarters in San Jose, CA
- Fabless semiconductor and software company
- - UMC (Taiwan): Xilinx acquired an equity stake in UMC in 1996
- - Seiko Epson (Japan)
- - TSMC (Taiwan)
[Slide label: ISE Alliance and Foundation Series Design Software]
12. Xilinx FPGA Families
- Old families
- - XC3000, XC4000, XC5200
- - Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for modern designs.
- High-performance families
- - Virtex (0.22µm)
- - Virtex-E, Virtex-EM (0.18µm)
- - Virtex-II, Virtex-II PRO (0.13µm)
- Low-cost family
- - Spartan/XL: derived from XC4000
- - Spartan-II: derived from Virtex
- - Spartan-IIE: derived from Virtex-E
- - Spartan-3
13. Basic Spartan-II FPGA Block Diagram
14. CLB Structure
- Each slice has 2 LUT-FF pairs with associated 
 carry logic
- Two 3-state buffers (BUFT) associated with each 
 CLB, accessible by all CLB outputs
15. CLB Slice Structure
- Each slice contains two sets of the following
- - Four-input LUT
- - - Any 4-input logic function,
- - - or 16-bit x 1 sync RAM
- - - or 16-bit shift register
- - Carry & Control
- - - Fast arithmetic logic
- - - Multiplier logic
- - - Multiplexer logic
- - Storage element
- - - Latch or flip-flop
- - - Set and reset
- - - True or inverted inputs
- - - Sync. or async. control
16. LUT (Look-Up Table) Functionality
- Look-up tables are the primary elements for logic implementation
- Each LUT can implement any function of 4 inputs
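A minimal behavioral sketch of this idea, not a vendor primitive: the 16-bit generic holds the truth table, and the four inputs simply address one bit of it. The names `lut4_model`, `INIT`, `i` and `o` are chosen for this example only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Behavioral model of one 4-input LUT (illustrative names, not a library cell).
entity lut4_model is
    generic (
        -- Truth table: bit k is the output for input value k.
        -- x"8000" sets only bit 15, i.e. a 4-input AND.
        INIT : std_logic_vector(15 downto 0) := x"8000"
    );
    port (
        i : in  std_logic_vector(3 downto 0);
        o : out std_logic
    );
end lut4_model;

architecture behavioral of lut4_model is
begin
    -- The inputs form the address of the truth-table bit to output.
    o <= INIT(to_integer(unsigned(i)));
end behavioral;
```

Changing only the generic changes the function: for example, `INIT => x"6996"` turns the same structure into a 4-input XOR, since that constant has a 1 at every index with an odd number of set address bits.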
17. 5-Input Functions Implemented Using Two LUTs
- One CLB slice can implement any function of 5 inputs
- The logic function is partitioned between two LUTs
- The F5 multiplexer selects between the two LUT outputs
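The partitioning can be sketched behaviorally as follows. This is an illustration under assumptions, not the device netlist: `func5_model`, `INIT_LO` and `INIT_HI` are made-up names, and the default truth tables happen to implement a 5-input XOR.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Any 5-input function f(a4..a0) split into two 4-input truth tables,
-- one for a4 = '0' and one for a4 = '1' (illustrative names).
entity func5_model is
    generic (
        INIT_LO : std_logic_vector(15 downto 0) := x"6996";  -- f when a(4) = '0': XOR of a(3..0)
        INIT_HI : std_logic_vector(15 downto 0) := x"9669"   -- f when a(4) = '1': XNOR of a(3..0)
    );
    port (
        a : in  std_logic_vector(4 downto 0);
        f : out std_logic
    );
end func5_model;

architecture behavioral of func5_model is
    signal lut_lo, lut_hi : std_logic;
begin
    -- Both LUTs see the same four low-order inputs.
    lut_lo <= INIT_LO(to_integer(unsigned(a(3 downto 0))));
    lut_hi <= INIT_HI(to_integer(unsigned(a(3 downto 0))));

    -- F5 multiplexer: the fifth input picks which LUT drives the output.
    f <= lut_hi when a(4) = '1' else lut_lo;
end behavioral;
```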
18. 5-Input Functions Implemented Using Two LUTs
[Figure: two LUTs and the F5 multiplexer; output label OUT]
19. Dedicated Expansion Multiplexers
- MUXF5 combines 2 LUTs to create
- - Any 5-input function (LUT5)
- - Or selected functions up to 9 inputs
- - Or 4x1 multiplexer
- MUXF6 combines 2 slices to form
- - Any 6-input function (LUT6)
- - Or selected functions up to 19 inputs
- - Or 8x1 multiplexer
- Dedicated muxes are faster and more space efficient
20. Distributed RAM
- CLB LUT configurable as distributed RAM
- - A LUT equals 16x1 RAM
- - Implements single and dual ports
- - Cascade LUTs to increase RAM size
- Synchronous write
- Synchronous/asynchronous read
- - Accompanying flip-flops used for synchronous read
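A 16x1 memory with synchronous write and asynchronous read can be described as in the sketch below. Whether a given synthesis tool maps this description onto LUT-based distributed RAM depends on the tool and its settings, so treat that mapping as an assumption; the entity and port names are invented for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- 16 x 1 single-port RAM: synchronous write, asynchronous read
-- (illustrative names; mapping to a LUT is tool-dependent).
entity dist_ram16x1 is
    port (
        clk  : in  std_logic;
        we   : in  std_logic;
        addr : in  std_logic_vector(3 downto 0);
        din  : in  std_logic;
        dout : out std_logic
    );
end dist_ram16x1;

architecture rtl of dist_ram16x1 is
    signal mem : std_logic_vector(15 downto 0) := (others => '0');
begin
    -- Write happens only on the rising clock edge.
    process (clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                mem(to_integer(unsigned(addr))) <= din;
            end if;
        end if;
    end process;

    -- Read is combinational: dout follows addr without waiting for a clock.
    dout <= mem(to_integer(unsigned(addr)));
end rtl;
```

Registering `dout` in a clocked process would give the synchronous-read variant that uses the accompanying flip-flop.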
21. Shift Register
- Each LUT can be configured as a shift register
- - Serial in, serial out
- Dynamically addressable delay of up to 16 cycles
- For programmable pipelines
- Cascade for greater cycle delays
- Use CLB flip-flops to add depth
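The addressable delay can be modeled with a 16-bit shift register and a tap selector, as in this sketch. It is a behavioral approximation with invented names (`srl16_model`, `a`, `q`), not the vendor primitive itself.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- 16-stage serial-in, serial-out shift register with a selectable tap
-- (illustrative model; names are assumptions).
entity srl16_model is
    port (
        clk : in  std_logic;
        ce  : in  std_logic;
        d   : in  std_logic;
        a   : in  std_logic_vector(3 downto 0);  -- tap select: delay is a + 1 cycles
        q   : out std_logic
    );
end srl16_model;

architecture rtl of srl16_model is
    signal sr : std_logic_vector(15 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if ce = '1' then
                -- New data enters at bit 0, older data moves toward bit 15.
                sr <= sr(14 downto 0) & d;
            end if;
        end if;
    end process;

    -- Changing a at run time changes the delay, from 1 (a = 0) to 16 (a = 15) cycles.
    q <= sr(to_integer(unsigned(a)));
end rtl;
```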
22. Shift Register
- Register-rich FPGA
- - Allows for addition of pipeline stages to increase throughput
- Data paths must be balanced to keep desired functionality
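To illustrate what balancing means, here is a small hypothetical two-stage pipeline that computes a + b + c. Because the sum a + b is registered before the second addition, c needs one matching register so that all three operands in the result come from the same clock cycle. All names are made up for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Two-stage pipelined adder: y = a + b + c, valid two clock edges after the inputs
-- (illustrative example, not taken from the slides).
entity pipelined_add3 is
    port (
        clk     : in  std_logic;
        a, b, c : in  unsigned(7 downto 0);
        y       : out unsigned(9 downto 0)
    );
end pipelined_add3;

architecture rtl of pipelined_add3 is
    signal ab_reg : unsigned(8 downto 0) := (others => '0');
    signal c_reg  : unsigned(7 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Stage 1: partial sum, widened by one bit to keep the carry.
            ab_reg <= resize(a, 9) + resize(b, 9);
            -- Balancing register: delays c by one cycle to line up with ab_reg.
            c_reg  <= c;
            -- Stage 2: final sum of operands that entered in the same cycle.
            y <= resize(ab_reg, 10) + resize(c_reg, 10);
        end if;
    end process;
end rtl;
```

Without `c_reg`, the second stage would add the previous cycle's a + b to the current cycle's c, which is the loss of functionality the slide warns about.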
23. Carry & Control Logic
[Figure: SLICE block diagram. Two halves, each with a Look-Up Table (inputs G4 G3 G2 G1 or F4 F3 F2 F1, output O), Carry & Control Logic, and a storage element (D, CK, EC, R, S, Q). Other labeled signals: CIN, COUT, CLK, CE, SR, BY, F5IN, X, XB, Y, YB.]
24. Fast Carry Logic
- Each CLB contains separate logic and routing for the fast generation of sum & carry signals
- - Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
- Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing, labeled LSB and MSB]
25. Accessing Carry Logic
- All major synthesis tools can infer carry logic for arithmetic functions
- - Addition (SUM <= A + B)
- - Subtraction (DIFF <= A - B)
- - Comparators (if A < B then)
- - Counters (count <= count + 1)
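The four constructs above can be written out as a complete design unit, sketched below. The entity name, port names and 8-bit width are assumptions, and the sketch uses `ieee.numeric_std` as a deliberate choice; whether carry logic is actually inferred depends on the synthesis tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Arithmetic written at RTL level so that synthesis can infer carry logic
-- (illustrative entity; names and widths are assumptions).
entity carry_examples is
    port (
        clk, reset : in  std_logic;
        a, b       : in  std_logic_vector(7 downto 0);
        sum        : out std_logic_vector(7 downto 0);
        diff       : out std_logic_vector(7 downto 0);
        a_lt_b     : out std_logic;
        count      : out std_logic_vector(7 downto 0)
    );
end carry_examples;

architecture rtl of carry_examples is
    signal count_reg : unsigned(7 downto 0) := (others => '0');
begin
    -- Addition and subtraction (results wrap around at 8 bits).
    sum  <= std_logic_vector(unsigned(a) + unsigned(b));
    diff <= std_logic_vector(unsigned(a) - unsigned(b));

    -- Unsigned comparator.
    a_lt_b <= '1' when unsigned(a) < unsigned(b) else '0';

    -- Counter with synchronous reset.
    process (clk)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                count_reg <= (others => '0');
            else
                count_reg <= count_reg + 1;
            end if;
        end if;
    end process;

    count <= std_logic_vector(count_reg);
end rtl;
```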
26. Block RAM
- Most efficient memory implementation
- - Dedicated blocks of memory
- Ideal for most memory requirements
- - 4 to 14 memory blocks
- - - 4096 bits per block
- - Use multiple blocks for larger memories
- Builds both single and true dual-port RAMs
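One 4096-bit block can hold, for example, 256 words of 16 bits. The sketch below describes such a single-port memory with a registered read. Synthesis tools commonly map this style onto block RAM, but that mapping is tool-dependent and assumed here; the names are invented for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- 256 x 16 = 4096-bit single-port RAM with synchronous write and read
-- (illustrative description; names are assumptions).
entity bram_256x16 is
    port (
        clk  : in  std_logic;
        we   : in  std_logic;
        addr : in  std_logic_vector(7 downto 0);
        din  : in  std_logic_vector(15 downto 0);
        dout : out std_logic_vector(15 downto 0)
    );
end bram_256x16;

architecture rtl of bram_256x16 is
    type ram_t is array (0 to 255) of std_logic_vector(15 downto 0);
    signal ram : ram_t := (others => (others => '0'));
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                ram(to_integer(unsigned(addr))) <= din;
            end if;
            -- Registered read: during a write, dout returns the old word
            -- at addr, because the signal update takes effect after the edge.
            dout <= ram(to_integer(unsigned(addr)));
        end if;
    end process;
end rtl;
```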
27. Dual Port Block RAM
28. Dual-Port Bus Flexibility
[Figure: RAMB4_S4_S16 primitive]
- Port A: 4-bit width, 1K depth; signals WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[3:0], DOA[3:0]
- Port B: 16-bit width, 256 depth; signals WEB, ENB, RSTB, CLKB, ADDRB[7:0], DIB[15:0], DOB[15:0]
- Each port can be configured with a different data 
 bus width
- Provides easy data width conversion without any 
 additional logic
29. Two Independent Single-Port RAMs
[Figure: RAMB4_S1_S1 primitive]
- Port A: 2K depth, 1-bit width; address is VCC, ADDR[10:0]
- Port B: 2K depth, 1-bit width; address is GND, ADDR[10:0]
- To access the lower RAM
- - Tie the MSB address bit to logic low
- To access the upper RAM
- - Tie the MSB address bit to logic high
- Added advantage of true dual-port
- - No wasted RAM bits
- - Can split a dual-port 4K RAM into two single-port 2K RAMs
- - Simultaneous independent access to each RAM
30. I/O Banking
31. Basic I/O Block Structure
[Figure: IOB block diagram. Three flip-flops (D, Q, EC, SR), each with its own FF Enable, with Clock and Set/Reset inputs: one on the Three-State Control path, one on the Output Path, and one on the Input Path, which provides both a Direct Input and a Registered Input.]
32. IOB Functionality
- IOB provides the interface between the package pins and CLBs
- Each IOB can work as uni- or bi-directional I/O
- Outputs can be forced into high impedance
- Inputs and outputs can be registered
- - Advised for high-performance I/O
- Inputs can be delayed
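A bidirectional pin with registered output, registered three-state control and registered input can be sketched as below. Whether a tool packs these flip-flops into the IOB depends on its settings, so that is an assumption; the entity and port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Bidirectional pad with registered output, enable and input
-- (illustrative model; names are assumptions).
entity iob_model is
    port (
        clk      : in    std_logic;
        out_data : in    std_logic;  -- value from internal logic
        out_en   : in    std_logic;  -- '1' drives the pad, '0' releases it
        in_data  : out   std_logic;  -- registered copy of the pad value
        pad      : inout std_logic   -- package pin
    );
end iob_model;

architecture rtl of iob_model is
    signal out_reg, en_reg : std_logic := '0';
begin
    process (clk)
    begin
        if rising_edge(clk) then
            out_reg <= out_data;  -- output path register
            en_reg  <= out_en;    -- three-state control register
            in_data <= pad;       -- input path register
        end if;
    end process;

    -- Forcing the output into high impedance lets an external device drive the pin.
    pad <= out_reg when en_reg = '1' else 'Z';
end rtl;
```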
33. Routing Resources
34. Clock Distribution
35. FPGA Nomenclature
37. Device Families & Tools
38. Logic Element FLEX10K
39. Logic Array Block FLEX10K
40. FLEX10K Architecture
41. Stratix Architecture
42. Stratix Device Family
Feature | EP1S10 | EP1S20 | EP1S25 | EP1S30 | EP1S40 | EP1S60 | EP1S80 | EP1S120
Logic Elements (LEs) | 10,570 | 18,460 | 25,660 | 32,470 | 41,250 | 57,120 | 79,040 | 114,140
M512 RAM Blocks (512 bits + parity) | 94 | 194 | 224 | 295 | 384 | 574 | 767 | 1,118
M4K RAM Blocks (4 Kbits + parity) | 60 | 82 | 138 | 171 | 183 | 292 | 364 | 520
M-RAM Blocks (512 Kbits + parity) | 1 | 2 | 2 | 4 | 4 | 6 | 9 | 12
Total RAM bits | 920,448 | 1,669,248 | 1,944,576 | 3,317,184 | 3,423,744 | 5,215,104 | 7,427,520 | 10,118,016
DSP Blocks | 6 | 10 | 10 | 12 | 14 | 18 | 22 | 28
Embedded Multipliers | 48 | 80 | 80 | 96 | 112 | 144 | 176 | 224
PLLs | 6 | 6 | 6 | 10 | 12 | 12 | 12 | 12
Maximum User I/O Pins | 426 | 586 | 706 | 726 | 822 | 1,022 | 1,238 | 1,314
Engineering Sample Availability | Now | Use Production | Use Production | N/A | Now | N/A | Now | 2003
Production Device Availability | March 2003 | Now | Now | Now | March 2003 | April 2003 | January 2003 | 2003
43. FPGA Technology Roadmap
Year | 1995 | 1996 | 1997 | 2000 | 2003 | 2004 (?)
Technology | 0.6 µm | 0.35 µm | 0.25 µm | 0.18 µm | 0.13 µm | 0.07 µm
Gate count | 25K | 100K | 250K | 1M | 100K LC, 8 Mb RAM, 400 18x18 multipliers | -
Transistor count | 3.5M | 12M | 23M | 75M | 430M | 1B
Note: Xilinx Virtex-II Pro XC2VP100 (9/16/2003)
44. Advanced architecture on modern FPGAs
45. More guts
- Additional components
- - RAM blocks
- - Dedicated multipliers
- - Tri-state buffers
- - Transceivers
- - Processor cores
- - DSP blocks
46. Dedicated Arithmetic Blocks
[Figure panels: QuickLogic, Altera, Xilinx]
47. Processor Cores
48. PowerPC on Virtex-II Pro
- Embedded 300 MHz Harvard Architecture Core
- - Low Power Consumption: 0.9 mW/MHz
- - Five-Stage Data Path Pipeline
- - Hardware Multiply/Divide Unit
- - Thirty-Two 32-bit General Purpose Registers
- - 16 KB Two-Way Set-Associative Instruction Cache
- - 16 KB Two-Way Set-Associative Data Cache
- - Memory Management Unit (MMU)
- - - 64-entry unified Translation Look-aside Buffers (TLB)
- - - Variable page sizes (1 KB to 16 MB)
- - Dedicated On-Chip Memory (OCM) Interface
- - Supports IBM CoreConnect Bus Architecture
- - Debug and Trace Support
- - Timer Facilities
49. ARM in Excalibur
- Industry-standard ARM922T 32-bit RISC processor core operating at up to 200 MHz
- - ARMv4T instruction set with Thumb extensions
- - Memory management unit (MMU) included for real-time operating system (RTOS) support
- - Harvard cache architecture with 64-way set-associative separate 8-Kbyte instruction and 8-Kbyte data caches
- Embedded programmable on-chip peripherals
- - ETM9 embedded trace module to assist software debugging
- - Flexible interrupt controller
- - Universal asynchronous receiver/transmitter (UART)
- - General-purpose timer
- - Watchdog timer
50. FPGA Tools
51. Design process (1)
Specification (Lab Experiments)
Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in Experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds.
VHDL description (Your Source Files)
    library IEEE;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;

    -- Interface of the cipher core: 32-bit data and key inputs, 32-bit data output.
    entity RC5_core is
        port(
            clock, reset, encr_decr : in  std_logic;
            data_input  : in  std_logic_vector(31 downto 0);
            data_output : out std_logic_vector(31 downto 0);
            out_full    : in  std_logic;
            key_input   : in  std_logic_vector(31 downto 0);
            key_read    : out std_logic
        );
    end RC5_core;  -- the slide closes with "end AES_core", which does not match the entity name
Functional simulation
Synthesis
Post-synthesis simulation
52. Design process (2)
Implementation
Timing simulation
Configuration
On chip testing
53. Active-HDL
54. Simulation Tools
Synthesis Tools
55. Logic Synthesis
VHDL description
    architecture MLU_DATAFLOW of MLU is
        signal A1 : STD_LOGIC;
        signal B1 : STD_LOGIC;
        signal Y1 : STD_LOGIC;
        signal MUX_0, MUX_1, MUX_2, MUX_3 : STD_LOGIC;
    begin
        -- Optional inversion of the inputs and of the output
        A1 <= A when (NEG_A = '0') else not A;
        B1 <= B when (NEG_B = '0') else not B;
        Y  <= Y1 when (NEG_Y = '0') else not Y1;
        -- The four candidate logic operations
        MUX_0 <= A1 and B1;
        MUX_1 <= A1 or B1;
        MUX_2 <= A1 xor B1;
        MUX_3 <= A1 xnor B1;
        -- L1 and L0 select the operation
        with (L1 & L0) select
            Y1 <= MUX_0 when "00",
                  MUX_1 when "01",
                  MUX_2 when "10",
                  MUX_3 when others;
    end MLU_DATAFLOW;
Circuit netlist
56. Features of synthesis tools
- Interpret RTL code 
- Produce synthesized circuit netlist in a standard 
 EDIF format
- Give preliminary performance estimates 
- Some can display circuit schematics corresponding 
 to EDIF netlist
57. Implementation
- After synthesis, the entire implementation process is performed by FPGA vendor tools
- - Xilinx ISE Foundation 6.2i
- - Altera Quartus II 4.0
- - Third-party tools for the Alliance version
58. Circuit Compilation
- 1. Technology mapping
- 2. Placement: assign a logical LUT to a physical location.
- 3. Routing: select wire segments and switches for interconnection.
59. Routing Example
[Figure labels: FPGA, Programmable Connections]
60. Static Timing Analyzer
- Performs static analysis of the circuit 
 performance
- Reports critical paths with all sources of delays 
- Determines maximum clock frequency
61. Static Timing Analysis
- Critical path: the longest path from outputs of registers to inputs of registers
- Min. clock period = length of the critical path
- Max. clock frequency = 1 / min. clock period
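As a worked example of these relations, with a made-up critical path delay of 8 ns (not a figure from any report in these slides):

```latex
T_{\min} = t_{\text{critical path}}, \qquad f_{\max} = \frac{1}{T_{\min}}

T_{\min} = 8\ \text{ns} \;\Longrightarrow\; f_{\max} = \frac{1}{8 \times 10^{-9}\ \text{s}} = 125\ \text{MHz}
```

Shortening the critical path, for instance by inserting a pipeline register, lowers the minimum clock period and therefore raises the maximum clock frequency.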
62. Configuration
- Once a design is implemented, you must create a file that the FPGA can understand
- This file is called a bitstream: a BIT file (.bit extension)
- The BIT file can be downloaded directly to the 
 FPGA, or can be converted into a PROM file which
 stores the programming information