ECE 545 Project 1 Introduction

Provided by: Krzysz1

1
ECE 545 Project 1: Introduction and Specification
2
Schedule
Project 1: RTL design for FPGAs (30 points)
Due date: Tuesday, November 21, midnight
Final choice of the project topic: Thursday, October 19
Progress reports:
  Thursday-Friday, November 2-3
  Thursday-Friday, November 16-17
3
Groups

ONE-person and TWO-person teams allowed
Teams must be formed at the moment when the project topic is selected, i.e., by Thursday, October 19
TWO-person teams work on more complex versions
of each project topic
One final grade per entire team

4
Honor Code Rules

Using somebody else's code and presenting it as your own is a serious Honor Code violation and may result in an F grade for the entire course.
All student teams are expected to write and debug their code by themselves and are not allowed to share their code with other teams.
Students are encouraged to help and support each other in all problems related to
  the basic understanding of the problem
  the operation of the CAD tools.

5
Project 1 - Platform and tools

Target devices: Xilinx FPGA Spartan 3 family
Tools:
  VHDL Simulation: Aldec Active-HDL or ModelSim
  VHDL Synthesis: Synplify Pro or Xilinx XST
  Implementation: Xilinx ISE or Xilinx WebPack

6
Project 1 - Final Deliverables

All block diagrams and ASM charts describing the entire circuit and its components (electronic form, PDF)
All synthesizable VHDL source codes
All testbenches used to verify the operation of the entire circuit and its components, and the corresponding input files containing test vectors, and output files containing results
Timing waveforms demonstrating the correct operation of the entire circuit and its components
Final report

7
Final Report (1)

1. Short description of the block diagrams and ASM charts. Discussion of any alternative architectures and solutions.
2. List of source codes and a short description of major modules.
3. Source of test vectors and a way of generating these test vectors.
4. Format of input and output files. Short description of a testbench.

8
Final Report (2)

5. Results
  resource utilization (CLB slices, LUTs, FFs, BRAMs, etc.)
  post-synthesis timing
    clock frequency
    throughput
    latency
    critical path
  post-placing and routing timing
    clock frequency
    throughput
    latency
    critical path

9
Final Report (3)
6. Discussion of the obtained results and any optimizations applied in order to obtain the optimum design.
7. Speed-up vs. software implementation.
8. Discussion of dependence of results on parameters of the application.
9. Deviations from the original specification, encountered problems, and unresolved issues.
10
Two topics from two different areas to choose from
Cryptography: Stream cipher qualified to Phase 2 of the eSTREAM contest
Digital Signal Processing: Finite Impulse Response Filter
11
Stream cipher qualified to Phase 2 of the eSTREAM
contest
12
Cipher
[Figure: a cipher block takes an m-bit message/ciphertext, a k-bit cryptographic key, and a 1-bit encrypt/decrypt control, and produces an m-bit ciphertext/message.]
13
Secret-Key Ciphers
[Figure: Alice encrypts with the key of Alice and Bob, K_AB, and sends the result over the network; Bob decrypts with the same key K_AB.]
14
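Block vs. stream ciphers
Block cipher: plaintext blocks $M_1, M_2, \ldots, M_n$ are encrypted under key $K$ into ciphertext blocks $C_1, C_2, \ldots, C_n$, with
$C_i = f_K(M_i)$
Every block of ciphertext is a function of only one corresponding block of plaintext.
Stream cipher: plaintext blocks $m_1, m_2, \ldots, m_n$ are encrypted under key $K$ by a cipher with memory into ciphertext blocks $c_1, c_2, \ldots, c_n$, with
$c_i = f_K(m_i, m_{i-1}, \ldots, m_2, m_1)$
Every block of ciphertext is a function of the current and all preceding blocks of plaintext.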
15
Typical stream cipher
[Figure: the sender and the receiver each run the same pseudorandom key generator, loaded with the same key and initialization vector (seed), to produce the keystream $k_i$. The sender combines plaintext $m_i$ with $k_i$ to obtain ciphertext $c_i$; the receiver combines $c_i$ with $k_i$ to recover plaintext $m_i$.]
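The sender/receiver structure above can be sketched in a few lines of Python. This is a toy illustration only: the keystream generator is a stand-in built on Python's random module (it is not cryptographically secure and is not one of the eSTREAM ciphers), the names keystream and xor_stream are hypothetical, and the sketch assumes the keystream is combined with the data by XOR, as is typical for stream ciphers.

```python
import random

def keystream(key, iv, n):
    """Toy pseudorandom key generator: n keystream bytes from (key, iv).

    Stand-in only: random.Random is NOT cryptographically secure.
    """
    seed = int.from_bytes(key + iv, "big")
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))

def xor_stream(data, key, iv):
    """Combine data with the keystream by XOR (assumed combining function)."""
    ks = keystream(key, iv, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))

key = b"\x01" * 10          # 80-bit key (PROFILE 2 key size)
iv = b"\x02" * 8            # 64-bit initialization vector
plaintext = b"ECE 545"
ciphertext = xor_stream(plaintext, key, iv)   # sender side
recovered = xor_stream(ciphertext, key, iv)   # receiver side, same key and IV
```

Because XOR is its own inverse, the same routine serves for encryption and decryption as long as both sides generate the same keystream.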
16
eSTREAM - Contest for a new stream cipher
standard, 2004-2008
PROFILE 1

Stream cipher suitable for software implementations optimized for high speed
Key size: 128 bits
Initialization vector: 64 bits or 128 bits

PROFILE 2

Stream cipher suitable for hardware implementations with limited memory, number of gates, or power supply
Key size: 80 bits
Initialization vector: 32 bits or 64 bits

17
eSTREAM - Contest for a new stream cipher
standard, 2004-2008
Schedule of the contest
November 2004: Request for proposals
29 April 2005: Deadline for submissions (34 ciphers; 23 candidates for PROFILE 1, 26 candidates for PROFILE 2)
26-27 May 2005: Stream Cipher Workshop, Denmark
March 2006: End of Phase I
July 2006: Beginning of the evaluation part of Phase II
September 2007: End of Phase II
January 2008: Final report
http://www.ecrypt.eu.org/stream/timetable.html
18
10 focus candidates
PROFILE 1 (Software)
  Dragon - Ed Dawson, Kevin Chen, Matt Henricksen, William Millan, Leonie Simpson, HoonJae Lee, SangJae Moon
  HC-256 - Hongjun Wu
  LEX - Alex Biryukov
  Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
  Py - Eli Biham and Jennifer Seberry
  Salsa20 - Daniel Bernstein
  SOSEMANUK - Come Berbain, Olivier Billet, Anne Canteaut, Nicolas Courtois, Henri Gilbert, Louis Goubin, Aline Gouget, Louis Granboulan, Cédric Lauradoux, Marine Minier, Thomas Pornin, Hervé Sibert
PROFILE 2 (Hardware)
  Grain - Martin Hell, Thomas Johansson and Willi Meier
  MICKEY-128 - Steve Babbage and Matthew Dodd
  Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
  Trivium - Christophe De Cannière and Bart Preneel
19
Your task
For groups of size ONE: implement ONE out of the following FIVE ciphers
For groups of size TWO: implement TWO out of the following FIVE ciphers
Grain, MICKEY-128, Phelix, Salsa20, Trivium
20
Optimization Criteria
I. Minimum area
II. Maximum ratio: Throughput divided by Total Circuit Area (CLB slices)
21
Required interface
[Figure: eSTREAM cipher block with the following ports.]
clk, reset
key_IV (k bits), key_IV_ready, key_IV_write
data_in (d bits), data_in_ready, data_in_write
data_out (d bits), write, full
enc_dec
k = 1, 2, 4, 8, 16, 32, 64
d = set of allowed values specific to a given algorithm
22
Tasks of a TWO-person team

Implement TWO ciphers
Compare TWO ciphers against each other

23
eSTREAM Implementation Hints
24
Example of an eSTREAM cipher
25
Linear Feedback Shift Register (LFSR)
$\langle L, C(D) \rangle$
Length: $L$
Connection polynomial: $C(D)$
$C(D) = 1 + c_1 D + c_2 D^2 + \cdots + c_L D^L$
26
Example of LFSR
$\langle 4, 1 + D + D^4 \rangle$
Length: 4
Connection polynomial: $C(D) = 1 + D + D^4$
27
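[Figure: L-stage shift register holding $s_{j-1}, s_{j-2}, \ldots, s_{j-(L-1)}, s_{j-L}$, with feedback computing $s_j$.]
Initial state:
$[s_{L-1}, s_{L-2}, \ldots, s_1, s_0]$
LFSR recursion:
$s_j = c_1 s_{j-1} \oplus c_2 s_{j-2} \oplus \cdots \oplus c_{L-1} s_{j-(L-1)} \oplus c_L s_{j-L}$ for $j \geq L$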
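The LFSR recursion above can be sketched directly in Python. The function name lfsr and the initial state [s3, s2, s1, s0] = [0, 1, 1, 0] are illustrative assumptions (the slides do not fix an initial state); the connection polynomial is the 1 + D + D^4 example from the earlier slide.

```python
def lfsr(coeffs, init, n):
    """Return s_0 .. s_{n-1} of the LFSR <L, C(D)>.

    coeffs = [c_1, ..., c_L] are the connection polynomial coefficients;
    init   = [s_0, ..., s_{L-1}] is the initial state.
    """
    L = len(coeffs)
    s = list(init)
    for j in range(L, n):
        bit = 0
        for i in range(1, L + 1):
            bit ^= coeffs[i - 1] & s[j - i]   # c_i * s_{j-i}, summed modulo 2
        s.append(bit)
    return s[:n]

# <4, 1 + D + D^4>: c_1 = 1, c_2 = 0, c_3 = 0, c_4 = 1
# Assumed initial state: s_0 = 0, s_1 = 1, s_2 = 1, s_3 = 0
seq = lfsr([1, 0, 0, 1], [0, 1, 1, 0], 30)
```

With this polynomial and a nonzero initial state, the output repeats with period 15 = 2^4 - 1, which the sequence seq shows when its two halves are compared.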
28
LFSR State Sequence
29
Non-linear Feedback Shift Register (NFSR)
30
Doubling the speed of Grain
31
Resources
eSTREAM PHASE 2: the ECRYPT Stream Cipher Project, available at http://www.ecrypt.eu.org/stream/
Source of test vectors:
Reference C implementations provided by the authors of the algorithms.
32
Finite Impulse Response Filter
33
Topic proposed and co-advised by
Dr. David Hwang
Dr. Kathleen Wage
34
DSP Project: FIR Digital Filter Design

Digital filters are widely used in digital
communications and audio/video processing.
In particular, finite impulse response (FIR)
filters are used for their ease of implementation
and stability.
In this project, you will investigate different FIR filter structures and their VLSI implementations.
Step 1: Implement and compare direct form versus direct form transposed structures
Step 2: Implement and compare fast FIR structures which reduce the number of required multiplications per sample

35
Example: Gigabit Ethernet Transceiver

[Figure: block diagram of a Gigabit Ethernet transceiver with the digital filters boxed in blue.]
As the figure shows, digital filters play a crucial role in digital communication chips such as Ethernet transceivers, cable modems, DSL modems, satellite receivers, mobile phones, etc.

36
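Step 1a: Direct Form FIR Filter
[Figure: direct form FIR filter. The input x(n) passes through a chain of Z^-1 delays; the taps are multiplied by the coefficients h0, h1, h2, ..., hN-1 and summed to produce y(n).]

An FIR filter implements a convolution in the time domain.
Critical path of an N-tap filter: (N-1) adds + 1 multiply
Arithmetic complexity of an N-tap filter, modeled as: N multiplications/sample + (N-1) adds/sample
Problem 1a: Design a parametrizable direct form FIR filter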
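As a behavioral reference for the direct form structure, here is a minimal Python sketch (not VHDL, and not the Matlab model mentioned later). The function name fir_direct and the test data are hypothetical, integer arithmetic is used instead of the project's fixed-point formats, and x(n) is assumed to be zero for n < 0.

```python
def fir_direct(h, x):
    """Direct form FIR: y(n) = sum over k of h[k] * x(n - k), x(n) = 0 for n < 0."""
    N = len(h)
    delay = [0] * N                      # delay[k] holds x(n - k)
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]    # shift the tapped delay line by one
        acc = 0
        for k in range(N):
            acc += h[k] * delay[k]       # one multiply per tap, summed in a chain
        y.append(acc)
    return y

h = [1, 2, 3, 4]                         # hypothetical 4-tap coefficients
x = [1, 0, 0, 0, 5, 0, 0, 0]             # two scaled impulses
y = fir_direct(h, x)
```

Each impulse in x reproduces the coefficient list at the output, scaled by the impulse height, which is a quick sanity check for any FIR implementation.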

37
Step 1b: Direct Form Transpose FIR Filter
[Figure: transposed FIR filter. The input x(n) is multiplied by the coefficients hN-1, hN-2, hN-3, ..., h0; the products are accumulated through a chain of adders and Z^-1 registers to produce y(n).]

Use a signal flow graph reversal to reduce the critical path -> transpose structure
Critical path of an N-tap transposed filter: 1 add + 1 multiply
Arithmetic complexity of an N-tap filter, modeled as: N multiplications/sample + (N-1) adds/sample
Problem 1b: Design a parametrizable direct form transpose FIR filter
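The transposed structure can be modeled the same way. In this sketch (hypothetical names, integer arithmetic, registers initialized to zero), every coefficient multiplies the current input, and each register stores one product plus the contents of the next register, so only one add and one multiply sit between registers. The result is compared against a plain convolution sum.

```python
def fir_transposed(h, x):
    """Transposed FIR: all taps see the current input; sums ripple through registers."""
    N = len(h)
    reg = [0] * (N - 1)                  # registers between the adders
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (reg[0] if reg else 0))
        # Register i takes product i+1 plus the old value of register i+1
        # (the last register takes only its product).
        reg = [products[i + 1] + (reg[i + 1] if i + 1 < N - 1 else 0)
               for i in range(N - 1)]
    return y

def convolve_reference(h, x):
    """Plain convolution sum, used only to check the structure above."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

h = [1, 2, 3, 4]                         # hypothetical coefficients
x = [3, -1, 4, 1, -5, 9, 2, -6]          # hypothetical input samples
y = fir_transposed(h, x)
```

Both structures compute the same output; they differ only in where the registers sit, which is what shortens the critical path.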

38
Step 2: Power Reduction via Parallel Subexpression Sharing
[Figure: 2-parallel fast FIR structure. Inputs x(2n) and x(2n+1) feed three N/2-tap subfilters H0(z), H0(z)+H1(z), and H1(z), with a Z^-1 delay, producing y(2n) and y(2n+1).]

Direct form and transpose form structures (running at the same rate) require N multiplications/sample and N-1 adds/sample.
Methods exist to reduce this complexity by parallel processing and subexpression sharing. See [1] and [2] for details and derivation.
In the 2-parallel structure above, two inputs arrive at half the original clock rate and are processed in parallel by three ceil(N/2)-tap filters, where ceil() is the ceiling function.
Arithmetic complexity of the 2-parallel filter is approximately
  3 x N/2 multiplications / two samples
  3 x (N/2 - 1) adds / two samples + 4 adds / two samples
  = 3/4 N multiplications/sample, (3N/4 + 1/2) adds/sample
If power is dominated by multipliers, this gives 25% power savings over traditional structures!
Problem 2a: Design a 2-parallel parametrizable FIR filter
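The 2-parallel idea can be checked numerically with a short Python sketch. It assumes the standard 2-parallel fast FIR arrangement described in the required reading: y(2n) = H0*x0 + (delayed) H1*x1 and y(2n+1) = (H0+H1)*(x0+x1) - H0*x0 - H1*x1. Since the figure is not reproduced here, the exact placement of the delay and the names fir and fir_2parallel are assumptions; the sketch also assumes an even number of taps and samples.

```python
def fir(h, x):
    """Reference FIR: y(n) = sum over k of h[k] * x(n - k), x(n) = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def fir_2parallel(h, x):
    """2-parallel fast FIR using three half-length subfilters."""
    h0, h1 = h[0::2], h[1::2]                     # polyphase decomposition by 2
    h01 = [a + b for a, b in zip(h0, h1)]         # H0 + H1, computed once
    x0, x1 = x[0::2], x[1::2]                     # x(2n) and x(2n+1)
    u = fir(h0, x0)                               # H0 applied to x(2n)
    v = fir(h1, x1)                               # H1 applied to x(2n+1)
    w = fir(h01, [a + b for a, b in zip(x0, x1)]) # (H0+H1) applied to x0 + x1
    y = []
    for n in range(len(x0)):
        v_delayed = v[n - 1] if n > 0 else 0      # Z^-1 at the half rate
        y.append(u[n] + v_delayed)                # y(2n)
        y.append(w[n] - u[n] - v[n])              # y(2n+1)
    return y

h = [1, -2, 3, 4, -5, 6, 7, -8]                   # hypothetical N = 8 coefficients
x = [3, 1, -4, 1, 5, -9, 2, 6, -5, 3, 5, 8]       # hypothetical input samples
y_fast = fir_2parallel(h, x)
y_ref = fir(h, x)
```

Per pair of output samples the sketch uses three N/2-tap filters plus four extra additions (one before the middle filter, three after), matching the operation count given above.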

39
Obtaining Coefficients of 2-Parallel Subfilters

Example for N = 8
H(z) = {h0, h1, h2, h3, h4, h5, h6, h7}
Subfilter coefficients are obtained by performing a polyphase decomposition by 2. Each subfilter has N/2 = 4 coefficients.
H0(z) = {h0, h2, h4, h6}
H1(z) = {h1, h3, h5, h7}
H0(z) + H1(z) = {h0+h1, h2+h3, h4+h5, h6+h7}

40
3-parallel filter

In the 3-parallel filter, three inputs arriving at a third of the original rate are processed by six parallel ceil(N/3)-tap filters.
Arithmetic complexity of the 3-parallel filter is approximately
  2/3 N multiplications/sample, (2/3 N + 4/3) adds/sample
33% reduction in multiplications/sample
Problem 2b: Design a 3-parallel parametrizable FIR filter

41
Obtaining Coefficients of 3-Parallel Subfilters

Example for N = 9
H(z) = {h0, h1, h2, h3, h4, h5, h6, h7, h8}
Subfilter coefficients are obtained by performing a polyphase decomposition by 3. Each subfilter has N/3 = 3 coefficients.
H0(z) = {h0, h3, h6}
H1(z) = {h1, h4, h7}
H2(z) = {h2, h5, h8}
H0(z) + H1(z) = {h0+h1, h3+h4, h6+h7}
H1(z) + H2(z) = {h1+h2, h4+h5, h7+h8}
H0(z) + H1(z) + H2(z) = {h0+h1+h2, h3+h4+h5, h6+h7+h8}
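The subfilter coefficient sets listed above can be generated mechanically. The following Python sketch uses hypothetical helper names (polyphase, add_filters) and made-up coefficient values 10..18 standing in for h0..h8, so that each entry is easy to trace back to its index.

```python
def polyphase(h, P):
    """Split h into P subfilters: subfilter i takes h[i], h[i+P], h[i+2P], ..."""
    return [h[i::P] for i in range(P)]

def add_filters(*filters):
    """Tap-by-tap sum of equal-length subfilters."""
    return [sum(taps) for taps in zip(*filters)]

h = [10, 11, 12, 13, 14, 15, 16, 17, 18]   # stand-ins for h0 .. h8 (N = 9)
h0, h1, h2 = polyphase(h, 3)
h01 = add_filters(h0, h1)                  # H0 + H1
h12 = add_filters(h1, h2)                  # H1 + H2
h012 = add_filters(h0, h1, h2)             # H0 + H1 + H2
```

These sums depend only on the coefficients, so under the assumption that coefficients do not change at runtime they could be computed once at load time or offline.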

42
Further parallelism

These parallel structures introduce issues such
as increased area, adder overhead (pre- and
post-processing), etc. which eventually become
prohibitive as the subsampling rate increases

43
Assumptions
All coefficients are loaded into the circuit before the start of processing and do not change during runtime.
Registers storing coefficients are connected in a chain, so coefficients must be loaded serially, in the proper order, starting from the ones with the smallest indices.
44
Parameters of the design
N: number of taps (N = 8, 12, 16, 24, 32)
M: fractional wordlength of input (M = 8..10)
K: fractional wordlength of output (K = 8..10)
L: fractional wordlength of coefficients (L = 7-11)
45
Required interface - basic architecture
[Figure: FIR Filter block with the following ports.]
clk
reset_datapath
reset_coeff
d_in (1.M)
d_out (1.K)
coeff (1.L)
load_coeff_done
filt_mode (0 = load coefficients, 1 = run filter)
load_begin (0 = idle, 1 = start to load coefficients)
46
Required interface - 2-parallel structure
[Figure: FIR Filter block with the following ports.]
clk
reset_datapath
reset_coeff
d_in_1, d_in_2 (1.M each)
d_out_1, d_out_2 (1.K each)
coeff (1.L)
load_coeff_done
filt_mode (0 = load coefficients, 1 = run filter)
load_begin (0 = idle, 1 = start to load coefficients)
47
One-Person Team Requirements

Matlab code will be given for five different configurations (A, B, C, D, E), each with different values of N, M, L, and K.
  CASE A: N = 8, M = 8, K = 8, L = 7
  CASE B: N = 12, M = 9, K = 9, L = 8
  CASE C: N = 16, M = 9, K = 10, L = 9
  CASE D: N = 24, M = 10, K = 11, L = 10
  CASE E: N = 32, M = 11, K = 12, L = 11
Step 1: Direct form and transpose form structures
  Generate parametrizable VHDL code; round the output of each multiplier to K fractional bits
  Generate test vectors using Matlab and verify the test vectors in RTL for configurations A-E
  Implement configurations B and D on FPGA
  Optimize for minimum area
  Optimize for maximum ratio of throughput / area (CLB slices)
Step 2: 2-parallel and 3-parallel fast FIR structures
  Generate parametrizable VHDL code; round the output of each multiplier to K fractional bits
  Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
  Implement configurations B and D on FPGA
  Optimize for minimum area
  Optimize for maximum ratio of throughput / area (CLB slices)

48
Two-Person Team Additional Requirements

Step 3: 4-parallel and 6-parallel fast FIR structures. See ref. [2] for block diagrams.
  Generate parametrizable VHDL code; round the output of each multiplier to K fractional bits
  Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
  Implement configurations B and D on FPGA
  Optimize for minimum area
  Optimize for maximum ratio of throughput / area (CLB slices)
Step 4: Quantization studies
  For the 6-parallel filter and configurations B and D, implement truncation instead of rounding after the multipliers.
  Optimize for minimum area
  Optimize for maximum ratio of throughput / area (CLB slices)
  For the 4-parallel filter and configurations B and D, round to K+4 bits after the multipliers. Round again to K bits right before the filter outputs to produce a 1.K output.
  Optimize for minimum area
  Optimize for maximum ratio of throughput / area (CLB slices)

49
Required reading
[1] Z. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322-1332, June 1991.
[2] K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley, pp. 256-275, 1999.
Source of test vectors:
Matlab implementation provided by Dr. Hwang
50
Important Notes on Two's Complement Arithmetic
51
Project Notation

For this project, we are using two's complement fractional notation.
An m.M number indicates a two's complement (m+M)-bit word with m integer bits and M fractional bits.
Example 1.3 number:
  0.111 = 0.875
  1.000 = -1
  1.111 = -0.125
Example 2.2 number:
  00.11 = 0.75
  10.00 = -2
  01.01 = 1.25
The dynamic range of an m.M number is $[-2^{m-1}, 2^{m-1})$.
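The example values above can be checked with a small Python helper. The names fixed_value and dynamic_range are hypothetical; the helper reads the number of integer and fractional bits from the position of the binary point in the string.

```python
def fixed_value(bits):
    """Value of a two's complement m.M bit string such as "1.111"."""
    int_part, frac_part = bits.split(".")
    word = int_part + frac_part
    raw = int(word, 2)
    if word[0] == "1":                 # sign bit set: subtract 2^(m+M)
        raw -= 1 << len(word)
    return raw / (1 << len(frac_part)) # scale by 2^-M

def dynamic_range(m):
    """Half-open range [low, high) of an m.M number, for m >= 1."""
    return (-2 ** (m - 1), 2 ** (m - 1))

values_1 = [fixed_value(b) for b in ("0.111", "1.000", "1.111")]
values_2 = [fixed_value(b) for b in ("00.11", "10.00", "01.01")]
```

The upper bound of the range is exclusive: with one integer bit the largest representable value is just below +1, while -1 itself is representable.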

52
Two's Complement Multiplication
[Figure: multiplier with inputs a (1.M) and b (1.L) and a 1.(M+L) product.]

The wordlength required for the product of 1.M x 1.L numbers:
  2.(M+L) if we assume -1 x -1 = 1 may occur
  1.(M+L) if we assume -1 x -1 = 1 will never occur
In general, for a product of m.M x l.L numbers:
  (m+l).(M+L) if we assume (most neg value of a) x (most neg value of b) may occur
  (m+l-1).(M+L) if we assume (most neg value of a) x (most neg value of b) will never occur
In this project, we assume that (most neg value of a) x (most neg value of b) will never occur for any multiplier in any filter structure. This is guaranteed by scaling the inputs and coefficients properly in Matlab.
Examples: 1.5 x 2.5 = 2.10, 1.4 x 1.6 = 1.10, 3.4 x 2.3 = 4.7
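The wordlength rule above is simple enough to capture in a helper. This Python sketch (the name product_format and the tuple representation of a format are illustrative assumptions) returns the integer and fractional bit counts of a product under either assumption.

```python
def product_format(a, b, allow_most_negative=False):
    """Format (integer bits, fractional bits) of an m.M x l.L product.

    a = (m, M) and b = (l, L). Set allow_most_negative=True if
    (most negative a) x (most negative b) may occur.
    """
    (m, M), (l, L) = a, b
    int_bits = m + l if allow_most_negative else m + l - 1
    return (int_bits, M + L)

examples = [
    product_format((1, 5), (2, 5)),    # 1.5 x 2.5
    product_format((1, 4), (1, 6)),    # 1.4 x 1.6
    product_format((3, 4), (2, 3)),    # 3.4 x 2.3
]
worst_case = product_format((1, 4), (1, 6), allow_most_negative=True)
```

The extra integer bit in the worst case exists only to hold (-1) x (-1) = +1, which does not fit in a format whose range is [-1, 1).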

53
Two's Complement Truncation versus Rounding

In this project, we ask you to round the output of each multiplier to K fractional bits.
To round a k.K' number to a k.K number (K < K'):
  Truncate the k.K' number to become a k.K number
  Add the former fractional bit K+1 to fractional position K
For information purposes, to truncate a k.K' number to a k.K number (K < K'):
  Truncate the k.K' number to become a k.K number
Rounding and truncation produce equal noise variance, but rounding is (approximately) unbiased whereas truncation is biased.

54
Truncation versus Rounding
Example: 2.5 number to a 2.3 number
ROUNDING
  00.01110 (round bit 1) -> 00.100
  11.01000 (round bit 0) -> 11.010
  10.00110 (round bit 1) -> 10.010
TRUNCATION
  00.01110 -> 00.011
  10.00110 -> 10.001
  11.01000 -> 11.010
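The rounding and truncation rules can be reproduced bit-exactly in Python. The helper names below are hypothetical, the target number of fractional bits is assumed to be strictly smaller than the source, and a rounded result that overflows the integer bits simply wraps around in this sketch (the project avoids that case by scaling).

```python
def parse(bits):
    """ "00.01110" -> (signed raw integer, integer bits, fractional bits)."""
    int_part, frac_part = bits.split(".")
    word = int_part + frac_part
    raw = int(word, 2)
    if word[0] == "1":
        raw -= 1 << len(word)                  # two's complement sign
    return raw, len(int_part), len(frac_part)

def fmt(raw, k, K):
    """Signed raw integer -> k.K bit string (wraps if out of range)."""
    word = format(raw & ((1 << (k + K)) - 1), "0{}b".format(k + K))
    return word[:k] + "." + word[k:]

def truncate(bits, K):
    raw, k, K_old = parse(bits)
    return fmt(raw >> (K_old - K), k, K)       # drop the low fractional bits

def round_bits(bits, K):
    raw, k, K_old = parse(bits)
    shift = K_old - K
    round_bit = (raw >> (shift - 1)) & 1       # former fractional bit K+1
    return fmt((raw >> shift) + round_bit, k, K)

rounded = [round_bits(b, 3) for b in ("00.01110", "11.01000", "10.00110")]
truncated = [truncate(b, 3) for b in ("00.01110", "10.00110", "11.01000")]
```

Python's right shift on negative integers rounds toward minus infinity, which matches dropping bits from a two's complement word.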
55
Two's Complement Addition
[Figure: FIR filter datapath annotated with wordlengths: 1.M inputs, multiplier outputs rounded to 1.K, and a chain of adders producing y(n).]

FIR filters perform chains of additions.
A k.K number plus a k.K number requires a (k+1).K number to represent the sum.
  Ex. 0.110 (0.75) + 0.110 (0.75) = 01.100 (1.5)
  Ex. 1.000 (-1) + 1.000 (-1) = 10.000 (-2)
In general, an adder chain summing J numbers, each of wordlength k.K, requires a wordlength of (k + ceil(log2(J))).K after the final adder.
This grows for a large number of coefficients N.
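The growth rule for an adder chain can be tabulated with a one-line helper. In this Python sketch the name adder_chain_format is hypothetical, and (J - 1).bit_length() is used as an exact integer way of computing ceil(log2(J)) for J >= 1.

```python
def adder_chain_format(k, K, J):
    """Format (integer bits, fractional bits) after summing J k.K numbers."""
    growth = (J - 1).bit_length()      # equals ceil(log2(J)) for J >= 1
    return (k + growth, K)

two_terms = adder_chain_format(1, 3, 2)      # the (k+1).K case
case_b = adder_chain_format(1, 9, 12)        # N = 12 products of format 1.9
case_e = adder_chain_format(1, 12, 32)       # N = 32 products of format 1.12
```

For the largest configuration this worst-case sizing adds five integer bits to the final adder, which motivates looking for a way to avoid the growth when the output is known to be bounded.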

56
Two's Complement Adder Chain Trick using Modulo Arithmetic
[Figure: two adder chains summing 1.K multiplier outputs into y(n). In the first, the intermediate wordlengths grow (2.K, 3.K, 3.K); in the second, every intermediate node is kept at 1.K.]

Trick: if we know the output of the adder chain is bounded within a k.K value (where k is some known value), then all intermediate addition nodes only require k.K-bit wordlengths.
This provides hardware savings for a large number of coefficients N!
This is only true if we know the output of the adder chain is bounded.
Be careful, because
  x(2n) + x(2n+1) is not guaranteed to be bounded in 1.M; you need the full 2.M
  h(0) + h(1) is not guaranteed to be bounded in 1.L; you need the full 2.L
  In this project, this trick helps after multiplier outputs, not on multiplier inputs
In our project, the final output y(n) is bounded within a 1.K-bit wordlength. This has been controlled by scaling the inputs and coefficients in Matlab.
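The modulo arithmetic trick can be demonstrated with raw integers. In this Python sketch (hypothetical names wrap and wrapped_sum, with K = 3 chosen only for illustration) every intermediate sum is cut back to 1 + K bits, so it may overflow, yet the final result is correct because the true sum fits in 1.K.

```python
def wrap(raw, bits):
    """Keep only `bits` bits of raw and reinterpret them as two's complement."""
    raw &= (1 << bits) - 1
    if raw >> (bits - 1):              # sign bit set
        raw -= 1 << bits
    return raw

def wrapped_sum(values, bits):
    """Adder chain in which every node is only `bits` bits wide."""
    acc = 0
    for v in values:
        acc = wrap(acc + v, bits)      # intermediate overflow is allowed
    return acc

K = 3
terms = [7, 7, -8, -3]                 # raw 1.3 values: 0.875, 0.875, -1.0, -0.375
result = wrapped_sum(terms, 1 + K)     # true sum is 3 (0.375), which fits in 1.3
unbounded = wrapped_sum([7, 7], 1 + K) # true sum 14 does not fit in 1.3
```

The second call shows the caveat from the slide: when the final sum itself is out of range, the wrapped result is wrong, so the bound on the output must be known in advance.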
To learn about more helpful hardware tricks
take ECE 645 next semester!