FPGA: From Flashing LED to Reconfigurable Computing

Provided by: jywu2
Learn more at: https://www-ppd.fnal.gov

1
FPGA: From Flashing LED to Reconfigurable Computing
  • Wu, Jinyuan
  • Fermilab
  • IIT
  • Mar, 2009

2
Outline
  • Electronic Aspect of FPGA
  • LED Flashing
  • Logic Elements in a Nutshell
  • TDC and ADC
  • FPGA as a Computing Fabric
  • Moore's Law Forever?
  • Space Charge Computing with FPGA Cores
  • Doublet Matching: Hash Sorter
  • Triplet Matching: Tiny Triplet Finder
  • Enclosed Loop Micro-Sequencer (ELMS)

3
Flashing LED, The First Thing First
[Figure: counter with outputs Q23..0]
  • Design at least one LED into every FPGA board.
  • When a board is first powered up, test the
    LED flashing function first.
  • Many things have to be right for the LED to
    flash:
    • All power pins must be connected.
    • Configuration devices must be in the correct mode.
    • Design software must be correct.
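
The flashing-LED test can be modeled in a few lines of software. The sketch below assumes the LED is driven by the top bit (Q23) of a free-running 24-bit counter and that the board clock is 50 MHz; neither detail is stated on the slide, and the function names are made up for illustration.

```python
COUNTER_BITS = 24  # matches the Q23..0 label on the slide


def blink_frequency_hz(clock_hz, bit=COUNTER_BITS - 1):
    """Toggle frequency of counter bit `bit` for a counter clocked at clock_hz."""
    return clock_hz / (1 << (bit + 1))


def simulate_led(n_cycles, bit=COUNTER_BITS - 1):
    """Return the LED state (0 or 1) at each clock cycle."""
    counter = 0
    states = []
    for _ in range(n_cycles):
        states.append((counter >> bit) & 1)
        # Wrap around like a hardware counter of COUNTER_BITS bits.
        counter = (counter + 1) & ((1 << COUNTER_BITS) - 1)
    return states


# Assumed 50 MHz clock: the top bit toggles at roughly 3 Hz, slow enough to see.
print(blink_frequency_hz(50e6))
```

Picking a lower bit gives a faster blink; with these assumed numbers, Q23 is about the highest bit that still flashes visibly.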

4
LED Brightness Variation
[Figure: counter (Q23..0) and a comparator with inputs A and B and output A<B]
  • The LED brightness is varied by changing the
    output pulse duty cycle.
  • Comparator input A is the brightness and B is the
    clock cycle count.
  • A look-up table can be added at input A to obtain a
    different brightness variation curve.
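
As a rough software model of the comparator scheme, the sketch below compares a brightness setting with a wrapping 8-bit cycle count. The 8-bit width, the output polarity (high while the count is below the setting) and the squaring table `SQUARE_LUT` are assumptions chosen for illustration; the slide specifies none of them.

```python
def pwm_output(brightness, count):
    """Comparator: output is high while the cycle count is below the setting."""
    return 1 if count < brightness else 0


def duty_cycle(brightness, n_bits=8):
    """Fraction of one counter period during which the output is high."""
    period = 1 << n_bits
    high = sum(pwm_output(brightness, count) for count in range(period))
    return high / period


# Hypothetical look-up table placed in front of input A: a squaring curve,
# one of many possible brightness variation curves.
SQUARE_LUT = [(i * i) >> 8 for i in range(256)]

print(duty_cycle(64))               # linear setting
print(duty_cycle(SQUARE_LUT[128]))  # same duty cycle reached via the table
```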
5
Duty-Cycle Based Single-Pin DAC (1)
  • The duty-cycle or pulse width of the comparator
    output is proportional to the DAC input at port
    A.
  • Use an external RC network as a low-pass filter.
  • The output voltage of an ideal low-pass filter is
    proportional to the DAC input.

6
LED Brightness Exponential Drop
[Figure: register (D, Q, SET) with a subtractor implementing "if (CO == 1) Q <= Q - Q/32"; counter with carry-out CO; comparator with inputs A and B and output A<B]
  • Narrow pulses are typically stretched so that the LED
    displays them with fixed brightness.
  • The circuit here dims the LED gradually
    for a better visual effect.

7
Exponential Sequence Generator
Possible Student Lab
[Figure: register (D, Q, SET) with a subtractor implementing "if (CO == 1) Q <= Q - Q/32"]
  • An exponential sequence is generated using the
    accumulator shown above.
  • Note that not even one multiplier is used.
  • Sequences of other functions (sine, cosine, tangent,
    cotangent, etc.) can also be generated similarly.
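
The update rule Q <= Q - Q/32 can be tried out in software. This is a minimal sketch, assuming the division by 32 is a 5-bit right shift on an unsigned integer; the starting value and step count are arbitrary.

```python
def exponential_sequence(q0, n_steps, shift=5):
    """Iterate Q <= Q - Q/2**shift using only a shift and a subtraction."""
    q = q0
    sequence = []
    for _ in range(n_steps):
        sequence.append(q)
        q = q - (q >> shift)  # Q/32 for shift=5; no multiplier involved
    return sequence


values = exponential_sequence(1 << 16, 100)
print(values[:4])
```

Each step scales Q by 31/32, so the values follow a decaying exponential. Because of integer truncation, the sequence stops decreasing once Q drops below 32.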

8
Duty-Cycle Based Single-Pin DAC (2)
Possible Student Lab
  • Use carry-out of the accumulator as the output.
  • The number of pulses is proportional to the DAC
    input.
  • Rounding error is carried to later cycles.
  • Output is smoother.
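
The carry-out scheme can be sketched as follows. The 8-bit accumulator width and the function name are assumptions; the point is that the rounding error stays in the accumulator and carries over to later cycles.

```python
def accumulator_dac(value, n_bits=8, n_cycles=256):
    """Add `value` to an n_bits accumulator every cycle; output the carry-out."""
    modulus = 1 << n_bits
    acc = 0
    output = []
    for _ in range(n_cycles):
        acc += value
        output.append(acc >> n_bits)  # carry-out: 0 or 1 while value < modulus
        acc &= modulus - 1            # the remainder stays in the accumulator
    return output


pulses = accumulator_dac(64)
print(sum(pulses), pulses[:8])
```

Over one full period of 2**n_bits cycles the number of output pulses equals the input value, and the pulses are spread out rather than bunched into one wide pulse.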

10
Logic Elements
LUT: Look-Up Table
  • Normal mode: LUT4 (16 RAM cells, inputs A, B, C, D) + DFF
  • Arithmetic mode: 2 x LUT3 (8 cells each; inputs A, B
    and carry-in CI; carry-out CO) + DFF
11
What Can Be Done With a Lookup Table
[Figure: look-up table with inputs A, B, C, D]
12
Xilinx Look-Up Table
  • LUT4: 4-input look-up table
  • RAM16: 16-bit distributed RAM
  • SRL16: 16-bit shift register
13
Pipeline Structure
[Figure: four LUT4 (16 RAM cells) logic cells arranged as a pipeline]
Logic cells are usually arranged in pipeline
structures.
14
Logic Element as a Full Adder Bit
[Figure: two logic elements in arithmetic mode, each with two 8-cell LUT3s, inputs A and B, carry-in CI and carry-out CO]
A logic cell resembles a full adder bit.
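
To make the full-adder analogy concrete, the sketch below models one logic element in arithmetic mode as two 8-entry tables, one for the sum bit and one for the carry-out. The table indexing and the function names are illustrative assumptions, not a vendor-defined layout.

```python
def build_lut3(func):
    """Fill an 8-cell table from a 3-input Boolean function."""
    return [func(i & 1, (i >> 1) & 1, (i >> 2) & 1) for i in range(8)]


SUM_LUT = build_lut3(lambda a, b, ci: a ^ b ^ ci)
CARRY_LUT = build_lut3(lambda a, b, ci: (a & b) | (ci & (a ^ b)))


def full_adder_bit(a, b, ci):
    """One logic element: both tables are addressed by (A, B, CI)."""
    index = a | (b << 1) | (ci << 2)
    return SUM_LUT[index], CARRY_LUT[index]


def ripple_add(x, y, n_bits):
    """Chain n_bits logic elements through their carry signals."""
    carry = 0
    result = 0
    for i in range(n_bits):
        s, carry = full_adder_bit((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_add(5, 6, 4))
```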
15
Myths on FPGA
  • We commonly hear that:
    • FPGA is cheap.
    • FPGA is fast.
    • FPGA is large.
    • FPGA can do anything.
  • Not really; at least, it is not always the case.
  • The reality is:
    • FPGA is ultra-flexible.
    • As the cost of that flexibility, transistor
      usage in FPGA is NOT efficient.
    • Good design tricks are needed.

16
4-Input NAND, 4-Input NOR, 4-Input NAOR
In ASIC: 8 transistors each
[Figure: transistor-level schematics of the three gates, each with inputs A, B, C, D and output Y]
17
Transistor Usage of Logic Element
In FPGA: at least 96 transistors
16-bit LUT: 16 x 6-transistor RAM bit
18
The Mirror Adder (Weste93)
In ASIC: 24-28 transistors
19
Full Adder
In FPGA: at least 96 transistors (two 8-bit LUTs)
20
Other FPGA Resources
  • Other resources are available in FPGA devices:
    • RAM blocks
    • Multipliers
    • Serial data receivers, PowerPC, etc.

[Figure: device floor plan with labels Multipliers, RAM Blocks, 16 Logic Elements]
22
TDC Using FPGA Logic Chain Delay
  • This scheme uses current FPGA technology.
  • Low-cost chip families can be used (e.g.,
    EP2C8T144C6, $31.68).
  • Fine TDC precision can be implemented in slow
    devices (e.g., 20 ps in a 400 MHz chip).

[Figure: logic chain delay line with inputs IN and CLK]
23
Two Major Issues In a Free Operating FPGA
  1. Bin widths differ from bin to bin and vary with
    supply voltage and temperature.
  2. Some bins are ultra-wide due to LAB boundary
    crossing.

24
Auto Calibration Using Histogram Method
  • It provides a bin-by-bin calibration at a given
    temperature.
  • It is a turn-key solution (bin in, ps out).
  • It is semi-continuous (the LUT is updated automatically
    every 16K events).

[Figure: 16K events fill a DNL histogram, which is summed into a LUT; LUT input: bin number, LUT output: time in ps]
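
One way to turn a bin-occupancy histogram into a bin-to-picosecond table is sketched below. It assumes the calibration hits are uniformly distributed in time, that the delay chain spans exactly one clock period, and that each bin is reported at its center; the function name and the example numbers are illustrative, not taken from the slides.

```python
def build_calibration_lut(histogram, clock_period_ps):
    """Map each bin number to the time (ps) at the center of that bin.

    With uniformly distributed hits, the count in a bin is proportional
    to the bin width, so the running sum of counts gives the bin edges.
    """
    total = sum(histogram)
    lut = []
    edge_ps = 0.0
    for count in histogram:
        width_ps = clock_period_ps * count / total
        lut.append(edge_ps + width_ps / 2)
        edge_ps += width_ps
    return lut


# Toy example: 4 bins, 16 events, 1600 ps clock period.
print(build_calibration_lut([1, 3, 4, 8], 1600))
```

Rebuilding the table after every batch of events (16K in the slides) lets the calibration follow slow temperature and voltage drifts.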
25
The Test Module
  • Data output via Ethernet
  • FPGA with 8-channel TDC
  • Two NIM inputs
  • BNC adapter to add delay in 150 ps steps
26
Test Result: NIM Inputs
  • As good as an ASIC TDC
  • RMS 10 ps
  • BNC adapters add delays in 140 ps steps.

[Figure: test setup with a LeCroy 429A NIM fan-out, NIM/LVDS converters and eight Wave Union TDC B channels; plot annotations: 140ps, 0, 1, 2]
27
Multi-Sampling TDC FPGA
  • Ultra low-cost: 48 channels in an $18.27
    EP2C5Q208C7.
  • Sampling rate: 360 MHz x 4 phases = 1.44 GHz.
  • LSB = 0.69 ns.

[Figure: block diagram with labels Multiple Sampling, Clock Domain Changing, Trans. Detection & Encode, 4Ch, Coarse Time Counter; signals c0, c90, c180, c270, Q0-Q3, QD-QF, DV, T0, T1, TS]
Logic elements with non-critical timing are
freely placed by the fitter of the compiler.
This picture represents a placement in a Cyclone FPGA.
28
ADC Using FPGA
[Figure: conventional scheme, with each AMP & Shaper feeding an ADC ahead of the FPGA, compared with the proposed scheme, with each AMP & Shaper feeding a TDC inside the FPGA; VREF is generated through R1, R1, R2 and C]
  • Analog signals from the AMP & Shapers are fed
    directly to FPGA pins.
  • FPGA outputs and a passive RC network are used to
    generate a ramping reference voltage VREF.
  • The input voltages and VREF are compared using
    FPGA differential input receivers.
  • The transition times, which represent the input
    voltage values, are digitized by TDC blocks in
    the FPGA.
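
The conversion from a measured transition time back to a voltage can be sketched as below. The model is an assumption: VREF is taken to follow a single-pole RC charging curve toward the FPGA output level, and the 2.5 V drive level and 100 ns time constant are placeholder values, not the parameters of the actual circuit.

```python
import math

V_DRIVE = 2.5   # assumed FPGA output high level, volts
TAU_NS = 100.0  # assumed RC time constant, nanoseconds


def vref_at(t_ns):
    """Reference voltage t_ns after the ramp starts (simple RC charging)."""
    return V_DRIVE * (1.0 - math.exp(-t_ns / TAU_NS))


def voltage_from_tdc(t_cross_ns):
    """The input voltage equals VREF at the measured crossing time."""
    return vref_at(t_cross_ns)


def crossing_time_ns(v_in):
    """Inverse model: when VREF reaches v_in (used here only for checking)."""
    if not 0.0 <= v_in < V_DRIVE:
        raise ValueError("input voltage outside the ramp range")
    return -TAU_NS * math.log(1.0 - v_in / V_DRIVE)


t = crossing_time_ns(1.0)
print(t, voltage_from_tdc(t))
```

In a real design the time-to-voltage mapping would more likely come from a calibration table than from an analytic curve.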
29
ADC Test: Waveform Digitization on BD3_19
Possible Student Lab
[Figure: FPGA with two TDC blocks and a VREF network (component labels: 50, 50, 100, 1000pF); plots: Input Waveform, Overlap Trigger; Reference Voltage; Raw Data; Converted]
31
Moore's Law
Taken from www.intel.com
  • Number of transistors in a package:
    x2 every 18 months

32
Status of Moore's Law: an Inconvenient Truth
Taken from www.intel.com
  • Number of transistors:
    • Yes, via multi-core.
  • Clock speed:
    • ?

33
The Fever of Moore's Law vs. Maxwell's Equations
[Figure: Op/sec vs. year, 1998-2010; annotations: WRW; MIT, 2002]
  • During the hot days of Moore's Law, the rules of
    thumb were:
    • BRB: Buy Rather than Build
    • URU: Use Rather than Understand
    • WRW: Wait Rather than Work
  • From fundamental principles like Maxwell's
    equations, it is known that limits to Moore's Law
    exist. Technological advances come from hard
    work.

34
The Execution and Non-Execution Cycles
From MIT 6.823 Open Course Site
  • In current microprocessors:
    • Each instruction takes one clock cycle to
      execute.
    • It takes many clock cycles to prepare for
      executing an instruction.
  • Pipelined? Yes. But the non-execution pipeline
    stages consume silicon area, power, etc.
  • Executing an instruction is not the same as doing a
    useful calculation.
  • Can we do something different?

36
The Space Charge Computing
Number of Electrons   Calculations/Iteration   Computing Time/1000 Iterations @ 10^7 Calculations/s
10^3                  10^6                     100 s
10^4                  10^8                     2.7 hours
10^5                  10^10                    11.6 days
10^6                  10^12                    3.2 years
  • Each electron sees the sum of the Coulomb forces from
    the other N-1 electrons.
  • The total number of calculations is about N^2, and
    each calculation of the Coulomb force requires a
    square root, a division and several
    multiplications.
  • Regular sequential computers are not fast enough.
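
The computing times in the table follow from simple arithmetic: about N^2 force calculations per iteration, 1000 iterations, and 10^7 calculations per second. A small sketch (the function name is made up):

```python
def computing_time_s(n_electrons, iterations=1000, rate_per_s=1e7):
    """Seconds needed for `iterations` passes of about N^2 force calculations."""
    return n_electrons ** 2 * iterations / rate_per_s


for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    seconds = computing_time_s(n)
    print(n, seconds, "s =", seconds / 3600, "hours =", seconds / 86400, "days")
```

The quadratic growth is the whole story: every tenfold increase in the number of electrons costs a hundredfold increase in computing time.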

37
The FPGA Board
  • Up to 16 FPGA devices ($32 each) can be installed
    on each board.
  • Each FPGA hosts one core.

38
The 16-bit Demo Core
39
The Lookup Table
LUT 10b in 16b out
40
Two Electrons with Natural Scales
256 nm
28ps
41
256 Charged Particles, Iteration 0
42
256 Charged Particles, Iteration 5
43
256 Charged Particles, Iteration 10
44
256 Charged Particles, Iteration 15
45
256 Charged Particles, Iteration 20
46
256 Charged Particles, Iteration 25
47
256 Charged Particles, Iteration 30
48
256 Charged Particles, Iteration 35
49
256 Charged Particles, Iteration 40
50
Speed Comparison with Regular CPU
  • The FPGA core is 10 times faster than a typical 2.2
    GHz CPU core.
  • The FPGA core runs at 200 MHz, i.e., 200 M Coulomb
    force calculations/s.
  • It seems the CPU core needs 80-100 clock cycles
    for each Coulomb force calculation.

51
One Board: 8 FPGA Cores
[Figure: one core/FPGA is equivalent to 5 dual-core CPUs; 8 cores/board are equivalent to 40 dual-core CPUs]
  • One board has the calculation capacity of 40 dual-
    core CPUs.
  • The power consumption of one board is < 4.5 W.
  • Newer FPGAs capable of hosting 4 cores/FPGA are
    available.

53
Example of Doublet Match, PET
[Figure: hits from Group 1 and Group 2 (time T, data D) are subtracted and tested against a window: ΔT < A? and ΔT > -A?]
  • Positrons and electrons annihilate to produce
    pairs of photons. The back-to-back photons hit
    the detector at nearly the same time.
  • Detector hits are digitized, and hits at nearly
    the same time are to be matched together.
  • The process takes O(n^2) clock cycles.

54
Hash Sorter
  • Pass 1:
    • Data in Group 1 are stored in the hash sorter
      bins based on key number K.
  • Pass 2:
    • Data in Group 2 are fed through and paired up
      with the corresponding Group 1 data with the same
      key number K.

[Figure: Group 1 items (K, D) and Group 2 items (K, D) passing through the hash sorter bins addressed by K]
55
Linked-List Structure of Hash Sorter
[Figure: Index RAM addressed by key K, Pointer RAM, and DATA RAM with DIN and DOUT]
56
Hash Sorter
Using the hash sorter, matching pairs can be grouped
together in 2n, rather than n^2, clock cycles.
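
The two passes and the linked-list storage can be sketched in software as follows. The array names borrow the RAM labels from the slides, but the exact organization (the newest item per key in the index array, each pointer leading to the previous item with the same key) is an assumption, and the sketch only pairs items with identical keys.

```python
NIL = -1  # marks an empty bin or the end of a list


def hash_sort_match(group1, group2, n_keys):
    """Pair up (key, data) items from two groups that share the same key."""
    index_ram = [NIL] * n_keys         # key -> address of the newest item
    pointer_ram = [NIL] * len(group1)  # address -> previous item, same key
    data_ram = [None] * len(group1)

    # Pass 1: store Group 1, one write per item.
    for addr, (key, data) in enumerate(group1):
        data_ram[addr] = data
        pointer_ram[addr] = index_ram[key]
        index_ram[key] = addr

    # Pass 2: each Group 2 item walks only the list for its own key.
    pairs = []
    for key, data2 in group2:
        addr = index_ram[key]
        while addr != NIL:
            pairs.append((data_ram[addr], data2))
            addr = pointer_ram[addr]
    return pairs


g1 = [(3, "a"), (1, "b"), (3, "c")]
g2 = [(3, "x"), (2, "y"), (1, "z")]
print(hash_sort_match(g1, g2, n_keys=4))
```

Each item is touched once per pass, which is where the 2n figure comes from when the lists per key are short. Matching hits at nearly the same time, as in the PET example, would also need a look into neighboring bins, which this sketch leaves out.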
58
Hits, Hit Data & Triplets
  • Hit data come out of the detector planes in
    random order.
  • Hit data from 3 planes generated by the same particle
    track are organized together to form triplets.

59
Triplet Finding
  • Three data items must satisfy the condition
    $x_A + x_C = 2 x_B$.
  • A total of n^3 combinations must be checked (e.g.,
    5 x 5 x 5 = 125).
  • Three layers of loops are needed if the process is
    implemented in software.
  • Large silicon resources, O(N^2), may be needed without
    careful planning.

[Figure: detector planes A, B and C]
60
Tiny Triplet Finder Operations, Pass I: Filling
Bit Arrays
  • $x_A + x_C = 2 x_B$
  • $x_A - x_C = \text{constant}$
  • For any hit, fill a corresponding logic cell.

[Figure: physical planes and bit arrays/shifters; note the flipped bit order]
61
Tiny Triplet Finder Operations, Pass II: Making
Match
  • For any center plane hit, logically shift the bit array.
  • Perform a bit-wise AND in this range.
  • A triplet is found.

[Figure: physical planes and bit arrays/shifters]
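
The two passes can be imitated in software with Python integers standing in for the bit arrays. This is a sketch in the spirit of the slides rather than the hardware design: it assumes integer hit coordinates in the range 0 to width-1, stores plane C in flipped bit order, and uses the condition xA + xC = 2 xB to choose the shift for each center-plane hit.

```python
def fill_bit_arrays(hits_a, hits_c, width):
    """Pass I: set one bit per hit; plane C is stored in flipped bit order."""
    bits_a = 0
    bits_c_flipped = 0
    for x in hits_a:
        bits_a |= 1 << x
    for x in hits_c:
        bits_c_flipped |= 1 << (width - 1 - x)
    return bits_a, bits_c_flipped


def find_triplets(hits_a, hits_b, hits_c, width):
    """Pass II: for each center hit, shift the C array, then AND it with A."""
    bits_a, bits_c_flipped = fill_bit_arrays(hits_a, hits_c, width)
    mask = (1 << width) - 1
    triplets = []
    for xb in hits_b:
        shift = width - 1 - 2 * xb
        if shift >= 0:
            aligned = bits_c_flipped >> shift
        else:
            aligned = bits_c_flipped << -shift
        match = bits_a & aligned & mask  # bit xa is set when xa + xc == 2 * xb
        xa = 0
        while match:
            if match & 1:
                triplets.append((xa, xb, 2 * xb - xa))
            match >>= 1
            xa += 1
    return triplets


print(find_triplets([1, 4, 7], [3, 5], [5, 6, 2], width=8))
```

After the shift, the flipped C bit for position xC lines up with the A bit at position 2 xB - xC, so a single AND tests every (A, C) pair for that center hit at once.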
62
Tiny? Yes, Tiny! Logic Cell Usage
  • AM, CAM, Hough Transform, etc.: O(N^2)
  • Tiny Triplet Finder: O(N logN)
63
Hit Matching
Software                          FPGA Typical                     FPGA Resource Saving Approaches
O(n^2): for() for()               O(n)*O(N): Comparator Array      Hash Sorter: O(n)*O(N) in RAM
O(n^3): for() for() for()         O(n)*O(N^2): CAM, Hough Trans.   Tiny Triplet Finder: O(n)*O(N logN)
O(n^4): for() for() for() for()
64
The Winning Line of FPGA Computing
O Freunde, nicht diese Töne! (Oh friends, not these tones!)
  • We commonly hear:
    • FPGA devices contain millions of gates.
    • High parallelism can be implemented in FPGA.
    • FPGA cost drops by half every 18 months.
  • We want to emphasize, especially to our young
    students:
    • Creativity,
    • Creativity,
    • Creativity: on arithmetic ops, on algorithms, on
      architectures, on all aspects.

66
The End
  • Thanks

67
Micro-computing vs. Reconfigurable Computing
[Figure: example expression evaluated on the data 100, 3, 4, 5, 7 (most operators lost in extraction), with control sequence LD, (-), ...; block diagrams: CPU with Data and Program inputs; FPGA with Data, Program and Configuration inputs]
  • In a microprocessor, users specify the program that
    runs on fixed logic circuits.
  • In an FPGA, users specify the logic circuits (as
    well as the program).
  • FPGA computing need not follow
    microprocessor architectures (but useful
    experience can be borrowed).
  • The usefulness of FPGA reconfigurable computing
    is still to be fully appreciated.

68
FPGA Process Sequencing Options
Sequencer                              Program Type            Program Length (CLK cycles)   Reprogram   Resource Usage
Finite State Machine (FSM)             Fixed Wired             10                            Hard        Small
Enclosed Loop Micro-Sequencer (ELMS)   Memory Stored Program   10-1000                       Easy        Small
Microprocessor (MP)                    Memory Stored Program   >1000                         Easy        Large
69
The Between Counter
[Figure: a between counter (input T) addresses a ROM that holds instr0 to instrD at PC = 0 to D and outputs the control signals. PC sequences shown: 0,1,2,3,4,5,6,7,8,9,A; then 5,6,7,8,9,A (four times); then 5,6,7,8,9,A,B,C,D,E,F]
70
ELMS: Enclosed Loop Micro-Sequencer
  • Allows jump back as in microprocessors.
  • Special in ELMS: supports FOR loops at machine
    code level.

PC   Control Signals    Operation
00   000000000000000
01   001000100011010            LD R1, n
02   000010001000000            LD R2, addr_a
03   000000000000100            LD R3, addr_X
04   000000010001000            LD R7, 0
05   000000000100001    BckA1   LD R4, (R2)
06   000100000010000            INC R2
07   000001000100000            LD R5, (R3)
08   000100010000001            INC R3
09   001001000100000            MUL R6, R4, R5
0a   000000010001000    EndA1   ADD R7, R7, R6
0b   000010000010000            DEC R1
0c   000000100000100            BRNZ BckA1

  • A PC plus a ROM makes a good sequencer in an FPGA.
  • Adding conditional branch logic allows the
    program to loop back.
  • The loop and return logic with its stack is a special
    feature in ELMS that supports FOR loops at machine
    code level.

71
ELMS Detailed Block Diagram
        FOR BckA1 EndA1 n
        LD R2, addr_a
        LD R3, addr_X
        LD R7, 0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
        LD R8, R7
[Figure: ELMS block diagram producing the User Control Signals]
The stack supports nested loops and subroutine
calls up to 128 layers deep.
72
Software: Using Spreadsheet as Compiler
73
What's Good About ELMS: FOR Loops at Machine
Code Level w/ Zero Overhead

Microprocessor:
        LD R1, n
        LD R2, addr_a
        LD R3, addr_X
        LD R7, 0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
        DEC R1
        BRNZ BckA1

The ELMS:
        FOR BckA1 EndA1 n
        LD R2, addr_a
        LD R3, addr_X
        LD R7, 0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6

25%: Conditional Branch
  • The looping sequence is known in this example before
    entering the loop.
  • A regular microprocessor treats the sequence as
    unknown.
  • ELMS supports FOR loops with pre-defined
    iteration counts at machine code level.
  • Execution time is saved and micro-complexities
    (branch penalty, pipeline bubble, etc.)
    associated with conditional branches are avoided.
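
The difference between the two listings can be illustrated with a toy program-counter model. This is a sketch under assumptions, not the ELMS design: the FOR instruction is taken to occupy one cycle and to push (loop start, loop end, count) onto a stack, the count is assumed to be at least 1, and every instruction is counted as one clock cycle.

```python
def elms_pc_trace(program):
    """Return the sequence of program-counter values the sequencer visits."""
    stack = []  # entries: [back_address, end_address, remaining_iterations]
    pc = 0
    trace = []
    while pc < len(program):
        trace.append(pc)
        instr = program[pc]
        if instr[0] == "FOR":
            _, back, end, count = instr
            stack.append([back, end, count])
            pc += 1
        elif stack and pc == stack[-1][1]:
            stack[-1][2] -= 1
            if stack[-1][2] > 0:
                pc = stack[-1][0]  # jump back, no branch instruction needed
            else:
                stack.pop()        # loop finished, fall through
                pc += 1
        else:
            pc += 1
    return trace


def dot_product_program(n):
    """The ELMS listing: loop body at addresses 4..9, repeated n times."""
    return [("FOR", 4, 9, n), ("OP", "LD R2"), ("OP", "LD R3"), ("OP", "LD R7"),
            ("OP", "LD R4"), ("OP", "INC R2"), ("OP", "LD R5"),
            ("OP", "INC R3"), ("OP", "MUL"), ("OP", "ADD")]


def microprocessor_cycles(n):
    """4 setup instructions plus 8 per iteration (body + DEC + BRNZ)."""
    return 4 + 8 * n


n = 100
print(len(elms_pc_trace(dot_product_program(n))), microprocessor_cycles(n))
```

In this model the loop body costs 6 cycles per iteration instead of 8, since DEC and BRNZ disappear from the instruction stream.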

74
ELMS as a Hardware Loop Sequencer
From http://www.analog.com/
  • There are DSP devices that support hardware loops
    for zero-overhead loop implementation.
  • The emphasis of ELMS is that FOR loops and
    subroutine calls/returns are treated the same.
  • Any program passage can be used as a subroutine
    without needing a return instruction.
  • The ELMS uses as few resources as possible in its
    FPGA implementation.

75
No ALU => Small Resource Usage
[Figure: three block diagrams. The von Neumann (Princeton) Architecture: Program Control, Program & DATA Memory, ALU. Harvard Architecture: Program Control, Program Memory, DATA Memory, ALU. Fermilab (?) Architecture: Sequencer (ELMS), Program Memory, DATA Memory, Data Processor]
  • The Princeton Architecture is more suitable at
    system level, while the Harvard Architecture is better
    suited at micro-structure level.
  • Regular microprocessors cannot run looped programs
    without an ALU.
  • The ALU takes a large amount of resources yet may
    not be efficiently utilized for data processing
    tasks in FPGA.
  • The ELMS can run nested loop programs without an
    ALU.
  • Further separation of program and data is
    therefore possible.
  • The ELMS is kept small.

76
The Frequency Spectrum of DAC (2)
Possible Student Lab
  • The first harmonic may be suppressed.
  • Works better with regular low-pass filters.

77
The Frequency Spectrum of DAC (1)
  • The first harmonic dominates the spectrum.
  • Works better with a notch filter.

78
Digital Calibration Using Twice-Recording Method
  • Use a longer delay line.
  • Some signals may be registered twice, at two
    consecutive clock edges.
  • $N_2 - N_1 = (1/f)/\Delta t$, where $1/f$ is the clock
    period and $\Delta t$ is the average bin width.
  • The two measurements can be used:
    • to calibrate the delay.
    • to reduce digitization errors.

[Figure: delay line with inputs IN and CLK]
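
The relation between the two recorded bin numbers and the average bin width can be turned into a small helper. This is a sketch: N1 and N2 are the bin numbers at which the same signal is registered on two consecutive clock edges, and the function name, the 400 MHz clock and the bin numbers in the example are illustrative.

```python
def average_bin_width_ps(n1, n2, clock_hz):
    """Average bin width from N2 - N1 = (1/f) / dt, returned in picoseconds."""
    clock_period_ps = 1e12 / clock_hz
    return clock_period_ps / (n2 - n1)


# Assumed example: a signal registered at bins 10 and 60 with a 400 MHz clock.
dt = average_bin_width_ps(10, 60, 400e6)
print(dt, "ps per bin; first hit at about", 10 * dt, "ps")
```

Because the result is an average over the bins between N1 and N2, it tracks overall delay speed but not the width of any individual bin.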
79
Digital Calibration Result
  • Power supply voltage changes from 2.5 V to 1.8 V
    (about the same as 100 °C to 0 °C).
  • Delay speed changes by 30%.
  • The difference of the two TDC numbers reflects
    delay speed.
  • Warning: the calibration is based on the average bin
    width, not bin-by-bin widths.

[Figure: plot with traces labeled 1st TDC, 2nd TDC and Corrected Time]
80
Indirect Cost of Complexity
If something like this can do the job, why do these?
81
Tiny Triplet Finder: Reuse Coincident Logic via
Shifting Hit Patterns
[Figure: detector layers C1, C2 and C3]
One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift
the hit patterns for C1 and C2 to search for
coincidence.
82
Tiny Triplet Finder for Circular Tracks
Also works with more than 3 layers.
  1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
  2. Loop over C3 hits, shift the bit arrays and check for
    coincidence. (n3 clock cycles)

[Figure: two bit arrays, each followed by a shifter (labels R1/R3 and R2/R3), feeding the bit-wise coincident logic; triplet map output to decoder]