Title: FPGA Implementation of Advanced Encryption Standard
1FPGA Implementation of Advanced Encryption Standard
- Srihari Sridharan
- October 22nd 2007
2Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements and Design Tradeoffs
- François-Xavier Standaert, Gaël Rouvroy, Jean-Jacques Quisquater, and Jean-Didier Legat
- CHES, Springer-Verlag Berlin Heidelberg, 2003
3OUTLINE
- Performance Evaluation of AES Algorithm
- Effective FPGA implementation
- Heuristics to evaluate hardware efficiency
- Arrive at optimum throughput/area efficiency
- Optimum: throughput 18.5 Gbps, area 542 slices, 10 RAM blocks
4Hardware Description
5Hardware Description
- Xilinx Virtex-E
- 32448 slices
- 64896 LUTs and flip-flops
- 208 RAM blocks
- Synthesis: Synopsys
- Circuit modeling: VHDL
6Hardware Description
- 2 slices per CLB
- Slice: 2 logic cells (LCs)
- Logic cell: one 4-input LUT, a storage element, and additional logic
- Storage element: latch or edge-triggered D flip-flop
- Additional logic: multiplexers F5 and F6
- Arithmetic logic: carry (CY) logic, XOR, AND
7Evaluation Parameters
- 2 types of performance evaluation parameters
- In terms of performance
- Throughput: bits processed per second
- Area: slices
- Throughput/area ratio is the evaluation parameter
- In terms of resources
- Number of LUTs
- Number of registers
- LUT/register ratio is the evaluation parameter
8Encryption Block
9Plain Text - Block Ciphers
- Input: 128-bit blocks
- State transformed
- s[r, c] = in[r + 4c]
- out[r + 4c] = s[r, c]
- 0 <= r < 4, 0 <= c < Nb (Nb = 4)
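The input-to-state mapping above can be sketched in a few lines of Python. This is only an illustration: the function names `bytes_to_state` and `state_to_bytes` are made up for this example, and the state is modelled as a list of 4 rows with Nb = 4 columns.

```python
NB = 4  # number of state columns (Nb on the slide)

def bytes_to_state(block):
    """Copy a 16-byte input block into the state: s[r][c] = in[r + 4c]."""
    if len(block) != 4 * NB:
        raise ValueError("expected a 128-bit (16-byte) block")
    return [[block[r + 4 * c] for c in range(NB)] for r in range(4)]

def state_to_bytes(state):
    """Copy the state back to the output: out[r + 4c] = s[r][c]."""
    out = [0] * (4 * NB)
    for r in range(4):
        for c in range(NB):
            out[r + 4 * c] = state[r][c]
    return bytes(out)

block = bytes(range(16))
state = bytes_to_state(block)
```

The block fills the state column by column, so consecutive input bytes land in the same column rather than the same row.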
10Implementation
- 2 Types of Optimization
- Algorithmic
- SBox
- Multiplexer Model
- RAM based
- Composite field
- MixColumns
- MixColumns transform
- Mixadd transform
- Architectural
- Loop unrolling
- Pipelining
- Sub-Pipelining
11SBOX - Mux Model
12Mux Model - Background
- N-input Boolean function G(x) represented by
- In AES
- Which is bit representation
- Implemented as
13Mux Model
14Mux Model
- Realization on FPGA
- LUT based
- 4-input, 4-output lookup
- Four 4-input, 1-output LUTs
- Coupled with a 4:1 mux
- 4:1 mux realized with three 2:1 muxes
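The LUT-plus-mux structure above can be modelled in software as a sanity check. The sketch below assumes a 6-input Boolean function split into four 4-input truth tables, selected by a 4:1 mux made of three 2:1 muxes. `example_function` is a hypothetical function chosen only for the check, and the MUXF5/MUXF6 comments indicate the role each mux would presumably play on the FPGA.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0

def mux4(sel1, sel0, d):
    """4:1 multiplexer built from three 2:1 multiplexers."""
    low = mux2(sel0, d[0], d[1])    # first-level mux (MUXF5 role)
    high = mux2(sel0, d[2], d[3])   # first-level mux (MUXF5 role)
    return mux2(sel1, low, high)    # second-level mux (MUXF6 role)

def build_luts(func):
    """Split a 6-input function into four 16-entry truth tables (LUT4s)."""
    return [[func((upper << 4) | lower) & 1 for lower in range(16)]
            for upper in range(4)]

def eval_mux_model(luts, x):
    """Evaluate the function: low 4 bits address the LUTs, top 2 bits select."""
    lower = x & 0xF
    sel0 = (x >> 4) & 1
    sel1 = (x >> 5) & 1
    return mux4(sel1, sel0, [lut[lower] for lut in luts])

def example_function(x):
    """Hypothetical 6-input Boolean function used only for the check."""
    return bin((x * 37 + 11) & 0xFF).count("1") & 1

luts = build_luts(example_function)
matches = all(eval_mux_model(luts, x) == example_function(x) for x in range(64))
```

The same decomposition extends to wider functions by adding more LUTs and further mux levels.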
15Mux Model - Implementation
16Mux Model - Analysis
- 1-bit output
- Repeated 16 times and looped 16 times
- Critical path: LUT4, MUXF5, MUXF6
- 2-level pipelining
- 12 clock pulses
18SBOX RAM Based
- Lookup type
- BRAM: two single-port 256 x 8-bit memories
- Write enable of RAM held low
- Input held low
- ROM implemented
- Latency: 1 clock cycle
- Design
- S-box: 16 x 16 x 8 = 2048 bits = 2 Kbits
- 16 S-boxes for each state
- 1 BRAM = two 2-Kbit RAMs
- Hence 8 BRAMs required
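The block RAM count above follows from simple arithmetic, written out below as a small Python sketch. The constant names are illustrative, and the figure of two S-boxes per BRAM is taken directly from the slide.

```python
import math

SBOX_ENTRIES = 16 * 16        # 256 table entries
BITS_PER_ENTRY = 8            # one output byte per entry
SBOXES_PER_STATE = 16         # one S-box lookup per state byte
SBOXES_PER_BRAM = 2           # slide: 1 BRAM = two 2-Kbit RAMs

sbox_bits = SBOX_ENTRIES * BITS_PER_ENTRY
brams_needed = math.ceil(SBOXES_PER_STATE / SBOXES_PER_BRAM)

print(sbox_bits, brams_needed)
```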
20Composite field - Math Basics
- Byte representation in Galois field GF(2^8)
- E.g. 01100011 is x^6 + x^5 + x + 1
- Addition: modulo-2 arithmetic (XOR); subtraction is the same operation
- Multiplication: polynomial multiplication modulo the irreducible polynomial (degree 8) m(x) = x^8 + x^4 + x^3 + x + 1
- Multiplicative inverse
- b(x)a(x) + m(x)c(x) = 1
- b^-1(x) = a(x) mod m(x), because
- a(x) b(x) mod m(x) = 1
- E.g. 3m = 1 (mod 11), so 3^-1 = m (mod 11), giving m = 4
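The field operations above can be illustrated with a minimal Python sketch. Bytes are treated as polynomials over GF(2) and reduced modulo m(x) = 0x11B. The inverse here is computed by exponentiation (a^254) rather than the extended Euclidean relation shown on the slide, since that is shorter to write; the function names are chosen for this example only.

```python
AES_POLY = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1

def gf256_mul(a, b):
    """Multiply two bytes as polynomials over GF(2), reduced modulo m(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:            # degree reached 8: subtract (XOR) m(x)
            a ^= AES_POLY
    return result

def gf256_inv(a):
    """Multiplicative inverse via a^254, valid because a^255 = 1 for a != 0."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        exp >>= 1
    return result

product = gf256_mul(0x57, 0x83)
inverse = gf256_inv(0x53)
```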
21Composite model equations - Multiplicative Inverse
- GF(2^8) mapped to GF((2^4)^2)
- Element of GF((2^4)^2): a1 x + a0, with a1, a0 in GF(2^4)
- Inverse given by
- x satisfies x^2 + x + λ = 0
- b0 = (a0 + a1) Δ^-1
- b1 = a1 Δ^-1
- Δ = a0 (a0 + a1) + λ a1^2
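The composite-field inverse can be checked numerically. The sketch below makes two assumptions not stated on the slide: GF(2^4) is built with the polynomial x^4 + x + 1, and the constant of the quadratic x^2 + x + c (called `LAMBDA` here) is found by brute force as the first value that makes it irreducible. An element a1 x + a0 is modelled as the pair (a1, a0), and `delta` is the quantity a0 (a0 + a1) + c a1^2 that gets inverted in GF(2^4).

```python
GF16_POLY = 0x13  # x^4 + x + 1 (assumed field polynomial for GF(2^4))

def gf16_mul(a, b):
    """Multiply two 4-bit values in GF(2^4)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= GF16_POLY
    return result

def gf16_inv(a):
    """Inverse in GF(2^4) by exhaustive search (the field has 15 units)."""
    for candidate in range(1, 16):
        if gf16_mul(a, candidate) == 1:
            return candidate
    raise ZeroDivisionError("0 has no multiplicative inverse")

def find_lambda():
    """First constant for which x^2 + x + constant has no root in GF(2^4)."""
    for lam in range(1, 16):
        if all(gf16_mul(t, t) ^ t ^ lam for t in range(16)):
            return lam
    raise ValueError("no irreducible quadratic found")

LAMBDA = find_lambda()

def composite_mul(a, b):
    """(a1 x + a0)(b1 x + b0) reduced with x^2 = x + LAMBDA."""
    a1, a0 = a
    b1, b0 = b
    hh = gf16_mul(a1, b1)
    high = hh ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)
    low = gf16_mul(a0, b0) ^ gf16_mul(LAMBDA, hh)
    return (high, low)

def composite_inv(a):
    """Inverse of a1 x + a0 using a single GF(2^4) inversion."""
    a1, a0 = a
    delta = gf16_mul(a0, a0 ^ a1) ^ gf16_mul(LAMBDA, gf16_mul(a1, a1))
    d_inv = gf16_inv(delta)
    return (gf16_mul(a1, d_inv), gf16_mul(a0 ^ a1, d_inv))

all_ok = all(composite_mul((a1, a0), composite_inv((a1, a0))) == (0, 1)
             for a1 in range(16) for a0 in range(16) if (a1, a0) != (0, 0))
```

The point of the decomposition is that the only inversion left is in GF(2^4), which is small enough for a compact lookup or logic network.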
22Composite field - Affine Transformation
- Affine transformation = linear transformation + translation
- Linear transformation: rotations, scaling, shear
- Translation: shift
- In AES
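As an illustration of the AES case, the S-box can be computed as the GF(2^8) inverse followed by the standard AES affine transformation: each output bit is the XOR of five input bits, then the constant 0x63 is added. The sketch below is a software model only; the helper names are chosen for this example, and the inverse uses repeated multiplication for brevity rather than the composite-field method.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def gf256_inv(a):
    """Inverse as a^254; AES maps 0 to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result

def rotl8(x, k):
    """Rotate an 8-bit value left by k positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF

def affine(x):
    """Linear part (XOR of rotations) plus translation (XOR with 0x63)."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63

def sbox(x):
    """AES S-box value: affine transformation of the field inverse."""
    return affine(gf256_inv(x))
```

For example, `sbox(0x53)` evaluates to `0xED` and `sbox(0x00)` to `0x63`, matching the published AES S-box table.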
23Composite field - implementation
25Mixcolumns transform - Background
- Four-term polynomials
- Coefficients are bytes
- M(x) = x^4 + 1
- Product defined as a(x) ⊗ b(x) = d(x), i.e. a(x) b(x) mod M(x)
26Mixcolumns transform - Equations
- Solution
- Multiplication of a GF(2^8) polynomial by x = multiplication by 02 = left shift plus conditional XOR (based on MSB)
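Multiplication by 02 can be sketched as follows. The name `xtime` is the conventional one for this operation, and the constant 0x1B is the low byte of the AES reduction polynomial x^8 + x^4 + x^3 + x + 1.

```python
def xtime(a):
    """Multiply a byte by x (02) in GF(2^8): shift left, reduce if MSB was set."""
    shifted = (a << 1) & 0xFF
    if a & 0x80:                 # MSB set: the shift overflowed degree 7
        shifted ^= 0x1B          # conditional XOR with the reduction constant
    return shifted

examples = [xtime(0x57), xtime(0xAE), xtime(0x80)]
```

In hardware this costs no logic for the shift itself (it is wiring), so only the conditional XOR consumes resources.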
27Mixcolumns transform - Implementation
28Mixcolumns transform - Implementation
- To implement
- 03·a1 = (02 + 01)·a1 = 02·a1 + a1 (addition is XOR)
- Hence we have
- 2 multiplications by x (of a0 and a1)
- XOR addition of 5 terms: the two products above plus a1, a2, a3
- 2-level pipelined
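The decomposition above can be sketched for a full column in Python. Each output byte is the XOR of two multiplications by x and three plain bytes, following the standard AES MixColumns coefficients (02, 03, 01, 01). The function names are illustrative, and this is a software model rather than the pipelined hardware structure.

```python
def xtime(a):
    """Multiply a byte by x (02) in GF(2^8)."""
    shifted = (a << 1) & 0xFF
    return shifted ^ 0x1B if a & 0x80 else shifted

def mix_column(col):
    """MixColumns on one 4-byte column, using 03*a = xtime(a) XOR a."""
    a0, a1, a2, a3 = col
    x0, x1, x2, x3 = (xtime(v) for v in col)
    return [
        x0 ^ x1 ^ a1 ^ a2 ^ a3,   # 02*a0 + 03*a1 + a2 + a3
        a0 ^ x1 ^ x2 ^ a2 ^ a3,   # a0 + 02*a1 + 03*a2 + a3
        a0 ^ a1 ^ x2 ^ x3 ^ a3,   # a0 + a1 + 02*a2 + 03*a3
        x0 ^ a0 ^ a1 ^ a2 ^ x3,   # 03*a0 + a1 + a2 + 02*a3
    ]

mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Every row uses exactly two `xtime` results and three unmodified bytes, which is the 5-term XOR counted on the slide.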
29Mixcolumns transform - Implementation
31Mixadd transform - Principle
- X(a0) and X(a1) are mostly a shift operation
- In both bytes the XOR affects only 3 bits (bits 1, 3, and 4)
- So these three bits are added separately
- Now pipelined
- Combined with key addition
32Mixadd transform -Implementation
34Unrolled Architecture
- 10 AES rounds unrolled
- Lots of hardware
- Area is increased
- Throughput is Increased
36Pipelined Architecture - I
- Only one round at a time
- Hardware reduced
- Throughput reduced
- Area reduced
37Pipelined Architecture - II
- All 10 rounds taken inside loop
- Loss of mixadd combination
- Additional Mux
- Good choice in ASIC
38Heuristic optimization
39Results
- Pipelined -I architecture
- Unrolled Architecture
40Results Contd
RAM/unrolled
RAM/pipelined
Mux/pipelined
composite/pipelined
41Summary
- http://www.cs.bc.edu/straubin/cs381-05/blockciphers/rijndael_ingles2004.swf
42Conclusion
- Algorithmic and architectural design tradeoffs were evaluated
- Optimum design principle found through heuristics
- Throughput: 1563 Mbps
- Performance (throughput/area): 0.69
43Phase 2 preview
- Implement S-box: RAM based
- Implement MixColumns: MixColumns transform
- Implement AddRoundKey: direct XOR
- Implement ShiftRows: simple cyclic shift
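The last two items are simple enough to sketch directly. The Python model below assumes the state is a list of 4 rows of 4 bytes, and the function names are illustrative. In hardware ShiftRows would be pure wiring and key addition one XOR per bit, so this only shows the data movement.

```python
def shift_rows(state):
    """Cyclically shift row r of the state left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def add_round_key(state, round_key):
    """XOR each state byte with the corresponding round-key byte."""
    return [[s ^ k for s, k in zip(state_row, key_row)]
            for state_row, key_row in zip(state, round_key)]

# State filled column by column with bytes 0..15.
state = [[0, 4, 8, 12],
         [1, 5, 9, 13],
         [2, 6, 10, 14],
         [3, 7, 11, 15]]
shifted = shift_rows(state)
```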