Title: ECE 545 Project 1 Introduction
Slide 1: ECE 545 Project 1 - Introduction and Specification
Slide 2: Schedule
Project 1: RTL design for FPGAs (30 points)
- Due date: Tuesday, November 21, midnight
- Final choice of the project topic: Thursday, October 19
- Progress reports:
  - Thursday-Friday, November 2-3
  - Thursday-Friday, November 16-17
Slide 3: Groups
- ONE-person and TWO-person teams allowed
- Teams must be formed at the moment when the project topic is selected, i.e., by Thursday, October 19
- TWO-person teams work on more complex versions of each project topic
- One final grade per entire team
Slide 4: Honor Code Rules
- Using somebody else's code and presenting it as your own is a serious Honor Code violation and may result in an F grade for the entire course.
- All student teams are expected to write and debug their code by themselves and are not allowed to share their code with other teams.
- Students are encouraged to help and support each other in all problems related to the
  - basic understanding of the problem
  - operation of the CAD tools.
Slide 5: Project 1 - Platform and tools
- Target devices: Xilinx FPGA Spartan 3 family
- Tools
  - VHDL simulation: Aldec Active-HDL or ModelSim
  - VHDL synthesis: Synplify Pro or Xilinx XST
  - Implementation: Xilinx ISE or Xilinx WebPack
Slide 6: Project 1 - Final Deliverables
- All block diagrams and ASM charts describing the entire circuit and its components (electronic form, PDF)
- All synthesizable VHDL source codes
- All testbenches used to verify the operation of the entire circuit and its components, and the corresponding input files containing test vectors, and output files containing results
- Timing waveforms demonstrating the correct operation of the entire circuit and its components
- Final report
Slide 7: Final Report (1)
1. Short description of the block diagrams and ASM charts. Discussion of any alternative architectures and solutions.
2. List of source codes and a short description of major modules.
3. Source of test vectors and a way of generating these test vectors.
4. Format of input and output files. Short description of a testbench.
Slide 8: Final Report (2)
5. Results
- resource utilization (CLB slices, LUTs, FFs, BRAMs, etc.)
- post-synthesis timing
  - clock frequency
  - throughput
  - latency
  - critical path
- post-place-and-route timing
  - clock frequency
  - throughput
  - latency
  - critical path
Slide 9: Final Report (3)
6. Discussion of the obtained results and any optimizations applied in order to obtain the optimum design.
7. Speed-up vs. software implementation.
8. Discussion of dependence of results on parameters of the application.
9. Deviations from the original specification, encountered problems, and unresolved issues.
Slide 10: Two topics from two different areas to choose from
- Cryptography: Stream cipher qualified to Phase 2 of the eSTREAM contest
- Digital Signal Processing: Finite Impulse Response Filter
Slide 11: Stream cipher qualified to Phase 2 of the eSTREAM contest
Slide 12: Cipher
[Figure: cipher block. Inputs: Message / Ciphertext (m bits), Cryptographic Key (k bits), Encrypt/Decrypt (1 bit). Output: Ciphertext / Message (m bits).]
Slide 13: Secret-Key Ciphers
[Figure: Alice and Bob communicate over a network, with encryption on one side and decryption on the other; both use the shared key of Alice and Bob, $K_{AB}$.]
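Slide 14: Block vs. stream ciphers
Block cipher
- Plaintext blocks $M_1, M_2, \dots, M_n$ are encrypted under key $K$ into ciphertext blocks $C_1, C_2, \dots, C_n$
- $C_i = f_K(M_i)$
- Every block of ciphertext is a function of only one corresponding block of plaintext
Stream cipher
- Plaintext blocks $m_1, m_2, \dots, m_n$ are encrypted under key $K$, using internal memory, into ciphertext blocks $c_1, c_2, \dots, c_n$
- $c_i = f_K(m_i, m_{i-1}, \dots, m_2, m_1)$
- Every block of ciphertext is a function of the current and all preceding blocks of plaintext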
Slide 15: Typical stream cipher
[Figure: sender and receiver each feed the key and the initialization vector (seed) into a pseudorandom key generator. At the sender, the keystream $k_i$ is combined with the plaintext $m_i$ to give the ciphertext $c_i$; at the receiver, the same keystream $k_i$ is combined with $c_i$ to recover $m_i$.]
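The sender/receiver symmetry in this figure can be sketched in a few lines. This is only an illustration, assuming the keystream is combined with the data by bitwise XOR (the usual choice for this kind of cipher; the slide text does not name the operation). `toy_keystream` is a hypothetical stand-in built on Python's `random` module; it is not an eSTREAM cipher and is not secure.

```python
import random

def toy_keystream(key, iv, n_bits):
    """Hypothetical stand-in for the pseudorandom key generator (NOT secure)."""
    rng = random.Random((key << 64) | iv)  # same key and IV give the same keystream
    return [rng.getrandbits(1) for _ in range(n_bits)]

def stream_xor(bits, key, iv):
    """Assumed combiner: c_i = m_i XOR k_i. Applying it twice restores m_i."""
    keystream = toy_keystream(key, iv, len(bits))
    return [b ^ k for b, k in zip(bits, keystream)]

message = [1, 0, 1, 1, 0, 0, 1, 0]
ciphertext = stream_xor(message, key=0x1234, iv=0x42)     # sender side
recovered = stream_xor(ciphertext, key=0x1234, iv=0x42)   # receiver side
```

Because both sides run the same generator from the same key and IV, encryption and decryption are the same operation.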
Slide 16: eSTREAM - Contest for a new stream cipher standard, 2004-2008
PROFILE 1
- Stream cipher suitable for software implementations optimized for high speed
- Key size: 128 bits
- Initialization vector: 64 bits or 128 bits
PROFILE 2
- Stream cipher suitable for hardware implementations with limited memory, number of gates, or power supply
- Key size: 80 bits
- Initialization vector: 32 bits or 64 bits
Slide 17: eSTREAM - Contest for a new stream cipher standard, 2004-2008
Schedule of the contest
- November 2004: Request for proposals
- 29 April 2005: Deadline for submissions (34 ciphers; 23 candidates for PROFILE 1, 26 candidates for PROFILE 2)
- 26-27 May 2005: Stream Cipher Workshop, Denmark
- March 2006: End of Phase I
- July 2006: Beginning of the evaluation part of Phase II
- September 2007: End of Phase II
- January 2008: Final report
http://www.ecrypt.eu.org/stream/timetable.html
Slide 18: 10 focus candidates
PROFILE 1 (Software)
- Dragon - Ed Dawson, Kevin Chen, Matt Henricksen, William Millan, Leonie Simpson, HoonJae Lee, SangJae Moon
- HC-256 - Hongjun Wu
- LEX - Alex Biryukov
- Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
- Py - Eli Biham and Jennifer Seberry
- Salsa20 - Daniel Bernstein
- SOSEMANUK - Come Berbain, Olivier Billet, Anne Canteaut, Nicolas Courtois, Henri Gilbert, Louis Goubin, Aline Gouget, Louis Granboulan, Cédric Lauradoux, Marine Minier, Thomas Pornin, Hervé Sibert
PROFILE 2 (Hardware)
- Grain - Martin Hell, Thomas Johansson and Willi Meier
- MICKEY-128 - Steve Babbage and Matthew Dodd
- Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
- Trivium - Christophe De Cannière and Bart Preneel
Slide 19: Your task
- For groups of the size ONE: implement ONE out of the following FIVE ciphers
- For groups of the size TWO: implement TWO out of the following FIVE ciphers
Grain, MICKEY-128, Phelix, Salsa20, Trivium
Slide 20: Optimization Criteria
I. Minimum area
II. Throughput divided by total circuit area (CLB slices)
Slide 21: Required interface
[Figure: eSTREAM cipher block with the ports listed below]
- clk
- reset
- key_IV (k bits)
- key_IV_ready
- key_IV_write
- data_in (d bits)
- data_in_ready
- data_in_write
- enc_dec
- data_out (d bits)
- write
- full
k = 1, 2, 4, 8, 16, 32, 64
d = set of allowed values specific to a given algorithm
Slide 22: Tasks of a TWO-person team
- Implement TWO ciphers
- Compare TWO ciphers against each other
Slide 23: eSTREAM Implementation Hints
Slide 24: Example of an eSTREAM cipher
Slide 25: Linear Feedback Shift Register (LFSR)
An LFSR is denoted $\langle L, C(D) \rangle$, where $L$ is the length and $C(D)$ is the connection polynomial:
$C(D) = 1 + c_1 D + c_2 D^2 + \dots + c_L D^L$
Slide 26: Example of LFSR
$\langle 4, 1 + D + D^4 \rangle$: length $L = 4$, connection polynomial $C(D) = 1 + D + D^4$
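Slide 27
[Figure: shift register with stages $s_{j-1}, s_{j-2}, \dots, s_{j-(L-1)}, s_{j-L}$]
Initial state: $[s_{L-1}, s_{L-2}, \dots, s_1, s_0]$
LFSR recursion:
$s_j = c_1 s_{j-1} \oplus c_2 s_{j-2} \oplus \dots \oplus c_{L-1} s_{j-(L-1)} \oplus c_L s_{j-L}$ for $j \ge L$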
Slide 28: LFSR State Sequence
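The state sequence of the example LFSR $\langle 4, 1 + D + D^4 \rangle$ can be generated with a small software model. This is a minimal sketch of the recursion $s_j = c_1 s_{j-1} \oplus \dots \oplus c_L s_{j-L}$; the function name `lfsr_states` and the starting state are choices made here for illustration and do not come from the slides.

```python
def lfsr_states(taps, state, steps):
    """Return `steps` successive states [s_{j-1}, ..., s_{j-L}] of an LFSR.

    taps  -- [c_1, ..., c_L] from C(D) = 1 + c_1 D + ... + c_L D^L
    state -- initial state [s_{L-1}, ..., s_1, s_0]
    """
    state = list(state)
    history = []
    for _ in range(steps):
        history.append(tuple(state))
        new_bit = 0
        for c, s in zip(taps, state):   # s_j = XOR of the tapped stages
            new_bit ^= c & s
        state = [new_bit] + state[:-1]  # shift: the oldest bit s_{j-L} drops out
    return history

# <4, 1 + D + D^4>: c_1 = 1, c_2 = 0, c_3 = 0, c_4 = 1
states = lfsr_states(taps=[1, 0, 0, 1], state=[0, 0, 0, 1], steps=16)
for s in states:
    print("".join(str(bit) for bit in s))
```

With this connection polynomial the register steps through all 15 nonzero states before returning to the initial one.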
Slide 29: Non-linear Feedback Shift Register (NFSR)
Slide 30: Doubling the speed of Grain
Slide 31: Resources
- eSTREAM PHASE 2, the ECRYPT Stream Cipher Project, available at http://www.ecrypt.eu.org/stream/
- Source of test vectors: reference C implementations provided by the authors of the algorithms.
Slide 32: Finite Impulse Response Filter
Slide 33: Topic proposed and co-advised by
- Dr. David Hwang
- Dr. Kathleen Wage
Slide 34: DSP Project - FIR Digital Filter Design
- Digital filters are widely used in digital communications and audio/video processing.
- In particular, finite impulse response (FIR) filters are used for their ease of implementation and stability.
- In this project, you will investigate different FIR filter structures and their VLSI implementations
  - Step 1: Implement and compare direct form versus direct form transposed structures
  - Step 2: Implement and compare fast FIR structures which reduce the number of required multiplications per sample
Slide 35: Example - Gigabit Ethernet Transceiver
[Figure: Gigabit Ethernet transceiver block diagram with the digital filters boxed in blue]
- As seen above, digital filters (boxed in blue) play a crucial role in digital communication chips such as Ethernet transceivers, cable modems, DSL modems, satellite receivers, mobile phones, etc.
Slide 36: Step 1a - Direct Form FIR Filter
[Figure: direct form FIR filter. The input x(n) feeds a chain of $z^{-1}$ delays; the taps are multiplied by $h_0, h_1, h_2, \dots, h_{N-1}$ and summed to produce y(n).]
- An FIR filter implements a convolution in the time domain
- Critical path of N-tap filter
  - N-1 adds + 1 multiply
- Arithmetic complexity of N-tap filter modeled as
  - N multiplications/sample + N-1 adds/sample
- Problem 1a: Design a parametrizable direct form FIR filter
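A behavioral software model is a handy golden reference when checking RTL. The sketch below mirrors the direct form structure: a tapped delay line, N multipliers, and one chain of N-1 adders. It uses plain Python integers and ignores the project's wordlength and rounding rules; the name `fir_direct` is an illustrative choice, not part of the specification.

```python
def fir_direct(h, x):
    """Direct form FIR: y(n) = sum_i h[i] * x(n - i)."""
    delay_line = [0] * len(h)               # delay_line[i] holds x(n - i)
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        products = [hi * di for hi, di in zip(h, delay_line)]  # N multiplications
        acc = products[0]
        for p in products[1:]:              # N-1 additions in one long chain
            acc += p
        y.append(acc)
    return y

impulse_response = fir_direct([1, 2, 3], [1, 0, 0, 0])
step_response = fir_direct([1, 2, 3], [1, 1, 1, 1])
```

The adder chain inside the loop is what makes the hardware critical path grow with N.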
Slide 37: Step 1b - Direct Form Transpose FIR Filter
[Figure: transposed FIR filter. The input x(n) is broadcast to multipliers $h_{N-1}, h_{N-2}, h_{N-3}, \dots, h_0$, whose outputs are accumulated through a chain of adders and $z^{-1}$ delays to produce y(n).]
- Use a signal flow graph reversal to reduce the critical path → transpose structure
- Critical path of N-tap transposed filter
  - 1 add + 1 multiply
- Arithmetic complexity of N-tap filter modeled as
  - N multiplications/sample + N-1 adds/sample
- Problem 1b: Design a parametrizable direct form transpose FIR filter
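The transposed structure can be modeled the same way. In this sketch the input sample is broadcast to every multiplier and each register stores one product plus the previous register's value, so only one multiply and one add sit between registers. The name `fir_transposed` and the register indexing are assumptions made for the illustration.

```python
def fir_transposed(h, x):
    """Transposed FIR: same output as the direct form, shorter critical path."""
    n_taps = len(h)
    regs = [0] * (n_taps - 1)          # regs[i] is added to the h[i] product
    y = []
    for sample in x:
        products = [hi * sample for hi in h]     # input broadcast to all taps
        y.append(products[0] + (regs[0] if regs else 0))
        # Register i captures h[i+1]*x(n) plus the old value of register i+1.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < n_taps - 1 else 0)
            for i in range(n_taps - 1)
        ]
    return y

impulse_response = fir_transposed([1, 2, 3], [1, 0, 0, 0])
step_response = fir_transposed([1, 2, 3], [1, 1, 1, 1])
```

Both responses match what the convolution sum gives for the same coefficients, which is the equivalence that Step 1 asks you to exploit when comparing the two structures.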
Slide 38: Step 2 - Power Reduction via Parallel Subexpression Sharing
[Figure: 2-parallel FIR structure. Inputs x(2n) and x(2n+1) feed three N/2-tap subfilters $H_0(z)$, $H_0(z) + H_1(z)$, and $H_1(z)$, followed by post-processing adders and a $z^{-1}$ delay, producing y(2n) and y(2n+1).]
- Direct form and transpose form structures (running at the same rate) require N multiplications/sample and N-1 adds/sample
- Methods exist to reduce this complexity by parallel processing and subexpression sharing. See [1] and [2] for details and derivation.
- In the 2-parallel structure above, two inputs arrive at half the original clock rate and are processed in parallel by three ceil(N/2)-tap filters (ceil() is the ceiling function)
- Arithmetic complexity of the 2-parallel filter is approximately
  - 3 x N/2 multiplications / two samples + 3 x (N/2-1) adds / two samples + 4 adds / two samples
  - = 3/4 N multiplications/sample + (3N/4 + 1/2) adds/sample
- If power is dominated by multipliers, 25% power savings over traditional structures!
- Problem 2a: Design a 2-parallel parametrizable FIR filter
Slide 39: Obtaining Coefficients of 2-Parallel Subfilters
- Example for N = 8
  - $H(z)$ = {h0, h1, h2, h3, h4, h5, h6, h7}
- Subfilter coefficients obtained by performing a polyphase decomposition by 2. Each subfilter has N/2 = 4 coefficients
  - $H_0(z)$ = {h0, h2, h4, h6}
  - $H_1(z)$ = {h1, h3, h5, h7}
  - $H_0(z) + H_1(z)$ = {h0+h1, h2+h3, h4+h5, h6+h7}
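The polyphase decomposition and the three-subfilter structure can be checked against a plain convolution in software. The sketch below assumes the usual 2-parallel fast FIR equations, $Y_0 = H_0 X_0 + z^{-2} H_1 X_1$ and $Y_1 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1$; compare them with the derivation in the required reading before relying on them. Function names are illustrative, and wordlengths and rounding are ignored.

```python
def fir(h, x):
    """Reference convolution: y(n) = sum_i h[i] * x(n - i)."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]

def fir_2parallel(h, x):
    """2-parallel fast FIR; assumes len(h) and len(x) are both even."""
    h0, h1 = h[0::2], h[1::2]                # polyphase decomposition by 2
    x0, x1 = x[0::2], x[1::2]                # x(2k) and x(2k+1)
    h01 = [a + b for a, b in zip(h0, h1)]    # H0 + H1, computed once offline
    x01 = [a + b for a, b in zip(x0, x1)]    # 1 pre-processing add per pair
    u = fir(h0, x0)                          # three N/2-tap subfilters
    v = fir(h1, x1)
    w = fir(h01, x01)
    y = []
    for k in range(len(x0)):
        v_delayed = v[k - 1] if k > 0 else 0     # the z^-1 at the half rate
        y.append(u[k] + v_delayed)               # y(2k)
        y.append(w[k] - u[k] - v[k])             # y(2k+1)
    return y

h = [1, 2, 3, 4, 5, 6, 7, 8]
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
```

The loop body uses three post-processing adds or subtractions; together with the pre-add on the inputs this gives the 4 extra adds per two samples counted on the previous slide.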
Slide 40: 3-parallel filter
- In the 3-parallel filter, three inputs arriving at a third of the original rate are processed by six parallel ceil(N/3)-tap filters
- Arithmetic complexity of the 3-parallel filter is approximately
  - 2/3 N multiplications/sample + (2/3 N + 4/3) adds/sample
  - 33% reduction in multiplications/sample
- Problem 2b: Design a 3-parallel parametrizable FIR filter
Slide 41: Obtaining Coefficients of 3-Parallel Subfilters
- Example for N = 9
  - $H(z)$ = {h0, h1, h2, h3, h4, h5, h6, h7, h8}
- Subfilter coefficients obtained by performing a polyphase decomposition by 3. Each subfilter has N/3 = 3 coefficients
  - $H_0(z)$ = {h0, h3, h6}
  - $H_1(z)$ = {h1, h4, h7}
  - $H_2(z)$ = {h2, h5, h8}
  - $H_0(z) + H_1(z)$ = {h0+h1, h3+h4, h6+h7}
  - $H_1(z) + H_2(z)$ = {h1+h2, h4+h5, h7+h8}
  - $H_0(z) + H_1(z) + H_2(z)$ = {h0+h1+h2, h3+h4+h5, h6+h7+h8}
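The six subfilters listed above can be exercised in a software model and compared with a plain convolution. The sketch below uses one common formulation of the 3-parallel fast FIR algorithm (three pre-adds, six subfilters, seven post-adds); treat the exact arrangement of the post-processing as an assumption and check it against the block diagram in the required reading. Function names are illustrative, and wordlengths and rounding are ignored.

```python
def fir(h, x):
    """Reference convolution: y(n) = sum_i h[i] * x(n - i)."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def delay(a):
    """One-sample delay at the 1/3 rate (z^-3 at the full rate)."""
    return [0] + a[:-1]

def fir_3parallel(h, x):
    """3-parallel fast FIR; assumes len(h) and len(x) are multiples of 3."""
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]   # polyphase decomposition by 3
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    p0, p1, p2 = fir(h0, x0), fir(h1, x1), fir(h2, x2)
    p01 = fir(add(h0, h1), add(x0, x1))
    p12 = fir(add(h1, h2), add(x1, x2))
    p012 = fir(add(add(h0, h1), h2), add(add(x0, x1), x2))
    a = sub(p01, p1)            # H0X0 + H0X1 + H1X0
    b = sub(p12, p1)            # H1X2 + H2X1 + H2X2
    c = sub(p0, delay(p2))      # H0X0 - z^-3 H2X2
    y0 = add(c, delay(b))       # y(3k)
    y1 = sub(a, c)              # y(3k+1)
    y2 = sub(sub(p012, a), b)   # y(3k+2)
    y = []
    for s0, s1, s2 in zip(y0, y1, y2):
        y += [s0, s1, s2]
    return y

h = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
```

Six N/3-tap subfilters per three samples give the 2/3 N multiplications per sample quoted for the 3-parallel filter.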
Slide 42: Further parallelism
- These parallel structures introduce issues such as increased area, adder overhead (pre- and post-processing), etc., which eventually become prohibitive as the subsampling rate increases
Slide 43: Assumptions
- All coefficients are loaded to the circuit before the start of processing and do not change during the runtime.
- Registers storing coefficients are connected in a chain, so coefficients must be loaded serially, in the proper order, starting from the ones with the smallest indices.
Slide 44: Parameters of the design
- N = number of taps (N = 8, 12, 16, 24, 32)
- M = fractional wordlength of input (M = 8..10)
- K = fractional wordlength of output (K = 8..10)
- L = fractional wordlength of coefficients (L = 7-11)
Slide 45: Required interface - basic architecture
[Figure: FIR Filter block with the ports listed below]
- clk
- reset_datapath
- reset_coeff
- d_in (1.M)
- d_out (1.K)
- coeff (1.L)
- load_coeff_done
- filt_mode (0 = load coefficients, 1 = run filter)
- load_begin (0 = idle, 1 = start to load coefficients)
Slide 46: Required interface - 2-parallel structure
[Figure: FIR Filter block with the ports listed below]
- clk
- reset_datapath
- reset_coeff
- d_in_1, d_in_2 (1.M each)
- d_out_1, d_out_2 (1.K each)
- coeff (1.L)
- load_coeff_done
- filt_mode (0 = load coefficients, 1 = run filter)
- load_begin (0 = idle, 1 = start to load coefficients)
Slide 47: One-Person Team Requirements
- Matlab code will be given for five different configurations (A, B, C, D, E), each with different values of N, M, L, and K.
  - CASE A: N = 8, M = 8, K = 8, L = 7
  - CASE B: N = 12, M = 9, K = 9, L = 8
  - CASE C: N = 16, M = 9, K = 10, L = 9
  - CASE D: N = 24, M = 10, K = 11, L = 10
  - CASE E: N = 32, M = 11, K = 12, L = 11
- Step 1: Direct form and transpose form structures
  - Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
  - Generate test vectors using Matlab and verify the test vectors in RTL for configurations A-E
  - Implement configurations B and D on FPGA
    - Optimize for minimum area
    - Optimize for maximum ratio of throughput / area (CLB slices)
- Step 2: 2-parallel and 3-parallel fast FIR structures
  - Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
  - Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
  - Implement configurations B and D on FPGA
    - Optimize for minimum area
    - Optimize for maximum ratio of throughput / area (CLB slices)
Slide 48: Two-Person Team Additional Requirements
- Step 3: 4-parallel and 6-parallel fast FIR structures. See ref [2] for block diagrams.
  - Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
  - Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
  - Implement configurations B and D on FPGA
    - Optimize for minimum area
    - Optimize for maximum ratio of throughput / area (CLB slices)
- Step 4: Quantization studies
  - For the 6-parallel filter and configurations B and D, implement truncation instead of rounding after the multipliers.
    - Optimize for minimum area
    - Optimize for maximum ratio of throughput / area (CLB slices)
  - For the 4-parallel filter and configurations B and D, round to K+4 bits after the multipliers. Round again to K bits right before the filter outputs to produce a 1.K output.
    - Optimize for minimum area
    - Optimize for maximum ratio of throughput / area (CLB slices)
Slide 49: Required reading
[1] Z. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322-1332, June 1991.
[2] K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley, pp. 256-275, 1999.
Source of test vectors: Matlab implementation provided by Dr. Hwang
Slide 50: Important Notes on Two's Complement Arithmetic
Slide 51: Project Notation
- For this project, we are using two's complement fractional notation
- An m.M number indicates a two's complement (m+M)-bit word with m integer bits and M fractional bits
- Example: 1.3 number
  - 0.111 = 0.875
  - 1.000 = -1
  - 1.111 = -0.125
- Example: 2.2 number
  - 00.11 = 0.75
  - 10.00 = -2
  - 01.01 = 1.25
- The dynamic range of an m.M number is $[-2^{m-1}, 2^{m-1})$
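The m.M notation can be checked mechanically. The helper below is a small sketch (the names `fixed_value` and `dynamic_range` are chosen here for illustration) that converts a bit string written with a binary point into its two's complement value.

```python
def fixed_value(bits):
    """Value of a two's complement m.M bit string such as '1.111' or '10.00'."""
    int_part, frac_part = bits.split(".")
    width = len(int_part) + len(frac_part)
    raw = int(int_part + frac_part, 2)
    if raw >= 1 << (width - 1):          # sign bit set: subtract 2^(m+M)
        raw -= 1 << width
    return raw / (1 << len(frac_part))   # scale by 2^-M

def dynamic_range(m):
    """Half-open range [-2^(m-1), 2^(m-1)) covered by m integer bits."""
    return (-2 ** (m - 1), 2 ** (m - 1))

for pattern in ["0.111", "1.000", "1.111", "00.11", "10.00", "01.01"]:
    print(pattern, "=", fixed_value(pattern))
```

The upper bound of the range is never reached: the largest 1.M value is $1 - 2^{-M}$.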
Slide 52: Two's Complement Multiplication
[Figure: multiplier with inputs a (1.M) and b (1.L) and output 1.(M+L)]
- The wordlength required for the product of 1.M x 1.L numbers
  - 2.(M+L) if we assume -1 x -1 = 1 may occur
  - 1.(M+L) if we assume -1 x -1 = 1 will never occur
- In general, a product of m.M x l.L numbers requires
  - (m+l).(M+L) if we assume (most neg value of a) x (most neg value of b) may occur
  - (m+l-1).(M+L) if we assume (most neg value of a) x (most neg value of b) will never occur
- In this project, we assume that (most neg value of a) x (most neg value of b) will never occur for any multiplier in any filter structure. This is guaranteed by scaling the inputs and coefficients properly in Matlab.
- Examples: 1.5 x 2.5 = 2.10, 1.4 x 1.6 = 1.10, 3.4 x 2.3 = 4.7
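The product wordlength rule is easy to confirm by brute force on a small format. The sketch below encodes the rule in a hypothetical helper `product_format` and then enumerates every 1.2 x 1.2 product to show that (-1) x (-1) is the only one that does not fit in a single integer bit.

```python
def product_format(m, M, l, L, most_negative_product_possible=False):
    """Wordlength (integer bits, fractional bits) of an m.M x l.L product."""
    int_bits = m + l if most_negative_product_possible else m + l - 1
    return (int_bits, M + L)

def fits(value, int_bits):
    """True if value lies in [-2^(int_bits-1), 2^(int_bits-1))."""
    return -2 ** (int_bits - 1) <= value < 2 ** (int_bits - 1)

M = L = 2
values_a = [raw / 2 ** M for raw in range(-2 ** M, 2 ** M)]   # all 1.M values
values_b = [raw / 2 ** L for raw in range(-2 ** L, 2 ** L)]   # all 1.L values
overflowing = [(a, b) for a in values_a for b in values_b if not fits(a * b, 1)]
print(overflowing)
```

This is why excluding the single most-negative-times-most-negative case, as the project does through Matlab scaling, saves one integer bit on every multiplier output.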
Slide 53: Two's Complement Truncation versus Rounding
- In this project, we ask you to round the output of each multiplier to K fractional bits.
- To round a k.K' number to a k.K number (K < K')
  - Truncate the k.K' number to become a k.K number
  - Add the former fractional bit K+1 to fractional position K
- For information purposes, to truncate a k.K' number to a k.K number (K < K')
  - Truncate the k.K' number to become a k.K number, i.e., discard the K'-K least significant bits
- Rounding and truncation produce equal noise variance, but rounding is (approximately) unbiased and truncation is biased
Slide 54: Truncation versus Rounding - Example: 2.5 number to a 2.3 number
ROUNDING
- 00.01110 + 1 = 00.100
- 11.01000 + 0 = 11.010
- 10.00110 + 1 = 10.010
TRUNCATION
- 00.01110 → 00.011
- 10.00110 → 10.001
- 11.01000 → 11.010
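The two procedures can be reproduced on bit strings to check an RTL implementation against the examples above. This is a sketch under the rounding rule stated on the previous slide (truncate, then add the first discarded bit); the helper names are illustrative, and a result that overflows the integer bits simply wraps around, as it would in hardware of the same width.

```python
def parse_fixed(bits):
    """Split 'ii.fff' into (signed raw integer, integer bits, fractional bits)."""
    int_part, frac_part = bits.split(".")
    width = len(int_part) + len(frac_part)
    raw = int(int_part + frac_part, 2)
    if raw >= 1 << (width - 1):
        raw -= 1 << width
    return raw, len(int_part), len(frac_part)

def format_fixed(raw, int_bits, frac_bits):
    """Render a signed raw integer as a two's complement bit string."""
    width = int_bits + frac_bits
    pattern = format(raw & ((1 << width) - 1), "0{}b".format(width))
    return pattern[:int_bits] + "." + pattern[int_bits:]

def truncate_fixed(bits, frac_out):
    raw, int_bits, frac_in = parse_fixed(bits)
    return format_fixed(raw >> (frac_in - frac_out), int_bits, frac_out)

def round_fixed(bits, frac_out):
    """Assumes frac_out is smaller than the input's fractional wordlength."""
    raw, int_bits, frac_in = parse_fixed(bits)
    shift = frac_in - frac_out
    round_bit = (raw >> (shift - 1)) & 1      # former fractional bit K+1
    return format_fixed((raw >> shift) + round_bit, int_bits, frac_out)

for pattern in ["00.01110", "11.01000", "10.00110"]:
    print(pattern, round_fixed(pattern, 3), truncate_fixed(pattern, 3))
```

Python's `>>` on negative integers is an arithmetic shift, so it discards low bits exactly as dropping wires from a two's complement word does.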
Slide 55: Two's Complement Addition
[Figure: direct form FIR datapath annotated with wordlengths. The input x(n) and its delayed copies are 1.M, the coefficients $h_0, h_1, \dots$ are labeled 1.N and the products 1.(M+N) in the figure, each product passes through a ROUND block to 1.K, and the 1.K values are summed to produce y(n).]
- FIR filters perform chains of additions
- A k.K number plus a k.K number requires a (k+1).K number to represent the sum
  - Ex. 0.110 (0.75) + 0.110 (0.75) = 01.100 (1.5)
  - Ex. 1.000 (-1) + 1.000 (-1) = 10.000 (-2)
- In general, an adder chain summing J numbers, each of wordlength k.K, requires a wordlength of (k + ceil(log2(J))).K after the final adder
- This grows for a large number of coefficients N
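The growth rule can be tabulated for the tap counts used in this project. The helper `chain_format` below is a sketch, not part of the specification; it uses the integer identity ceil(log2(J)) = (J - 1).bit_length() for J >= 1 to avoid floating-point rounding.

```python
def chain_format(k, K, J):
    """Wordlength (integer bits, fractional bits) after summing J k.K numbers."""
    growth = (J - 1).bit_length()      # equals ceil(log2(J)) for J >= 1
    return (k + growth, K)

def worst_case_fits(k, J):
    """Check that J copies of the most negative k-bit value fit after growth."""
    int_bits, _ = chain_format(k, 0, J)
    worst = J * -(2 ** (k - 1))
    return -2 ** (int_bits - 1) <= worst < 2 ** (int_bits - 1)

for taps in [8, 12, 16, 24, 32]:
    print(taps, "taps of 1.K products ->", chain_format(1, "K", taps))
```

For a 32-tap filter the unconstrained sum of 1.K products needs 6 integer bits, which motivates the modulo arithmetic trick on the next slide.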
Slide 56: Two's Complement Adder Chain Trick using Modulo Arithmetic
[Figure: two adder chains producing y(n) from 1.K multiplier outputs. In the first, the intermediate sums grow to 2.K, 3.K, and 3.K; in the second, every intermediate node stays at 1.K.]
- Trick: if we know the output of the adder chain is bounded within a k.K value (where k is some known value), then all intermediate addition nodes only require k.K bit wordlengths
  - Provides hardware savings for a large number of coefficients N!
- This is only true if we know the output of the adder chain is bounded
  - Be careful, because x(2n) + x(2n+1) is not guaranteed to be bounded in 1.M; you need the full 2.M
  - h(0) + h(1) is not guaranteed to be bounded in 1.L; you need the full 2.L
  - In this project, this trick helps after multiplier outputs, not on multiplier inputs
- In our project, the final output y(n) is bounded within a 1.K bit wordlength. This has been controlled by scaling the inputs and coefficients in Matlab.
- To learn about more helpful hardware tricks, take ECE 645 next semester!
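The modulo arithmetic trick can be demonstrated on raw integers. The sketch below models every adder as only `width` bits wide, so intermediate sums wrap around; the final result is still correct whenever the true sum fits in that width. The helper names and the 1.3 example values are chosen here for illustration.

```python
def wrap(raw, width):
    """Keep only `width` bits and reinterpret them as two's complement."""
    raw &= (1 << width) - 1
    return raw - (1 << width) if raw >= 1 << (width - 1) else raw

def wrapped_sum(raws, width):
    """Adder chain in which every adder is only `width` bits wide."""
    acc = 0
    for r in raws:
        acc = wrap(acc + r, width)
    return acc

# 1.3 format: 4-bit words, raw range -8..7 (raw 7 is 0.875, raw -8 is -1).
terms = [7, 6, -5, -4]      # 0.875 + 0.75 - 0.625 - 0.5 = 0.5 (raw 4)
bounded = wrapped_sum(terms, width=4)     # intermediate sums overflow and wrap
unbounded = wrapped_sum([7, 6], width=4)  # true sum 13 does not fit in 4 bits
print(bounded, unbounded)
```

The first chain overflows after the very first addition yet still ends at the exact sum; the second shows why the trick cannot be used where the final value itself may exceed the wordlength, such as x(2n) + x(2n+1).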