Title: An Efficient Reconfigurable Architecture for Asymmetric and Symmetric-key Cryptography

1
An Efficient Reconfigurable Architecture for
Asymmetric and Symmetric-key Cryptography

Master's Thesis Defense
Ali Mustafa Zaidi

2
Research Motivation

Need to minimize security risks in the
Information Age
Future economies critically dependent on digital
communication.
Security Protocols (SSL, IPSec) are Algorithm
Independent
For future-proofing, enhanced security.
Ciphers negotiated between communicating entities
Modern line rates are in the Gbps range
Ethernet (LAN), VPNs, ATM.
System performance not keeping pace with
communication capacity.

3
Research Motivation

Essential security services for Modern Digital
Communication
Confidentiality
Data Integrity
Authenticity
Non-Repudiation
Both Symmetric and Asymmetric-key ciphers used to
provide these services
Examples of Need
SSL and IPSec servers require
High Throughput, High Flexibility, Low Cost
implementations.
Portable PCs, embedded systems etc. require
Low Cost, Low Power, Reasonable Flexibility
implementations.

4
Implementing Ciphers

Ciphers may be implemented either in Hardware or
in Software
ASIC implementations
Very High Throughput,
Low Power Requirements,
More secure (if designed properly),
Inflexible
Software implementations
Highly flexible
Low Throughput,
High Power Requirements
Less secure
Reconfigurable Hardware: a third option

5
What is Reconfigurable Computing?

Programmable Hardware
Spatial vs. Temporal implementation of
computation
Ideal for exploiting parallelism at all levels
Thus a reconfigurable device
Functions like hardware implementation, but is
Programmable like software
Cost of flexibility: area inefficiency
May require 10x greater area than dedicated
ASIC implementations.

6
Why Reconfigurable Hardware?

Reconfigurable Hardware can provide
Flexibility equivalent to software
implementations
Algorithm Agility
Algorithm Modification
Algorithm Upload
Throughput close to ASIC implementations
Exploits parallelism at multiple levels
Cost Advantages
Better power consumption than software
Cheaper to develop than ASICs (at least for low
volumes)
Lower turnaround-time and time-to-market than
ASICs

7
Why Domain-Specific Reconfigurability?

Commercial FPGAs used extensively for
implementation of Ciphers
General-purpose reconfigurable device
Provides very good throughput values for most
ciphers
Poor power and area characteristics
Efficiency/Throughput may be improved by
Optimizing resources for a single application
domain.
Greater potential for implementing Run-time
Reconfiguration
Design Challenges
Requires careful consideration of entire
application domain, in order to understand
trade-offs involved when designing reconfigurable
resources.

8
Related Work

USC Mark II's Advanced Cryptographic Engine for
IPSec [11]: based on a general-purpose FPGA

Cavium Networks Security Processor [13]: ASIC
with multiple dedicated cipher cores.

9
Related Work

COBRA [8]: reconfigurable architecture for
secret-key ciphers

10
Our Work

A specialized reconfigurable architecture for
both secret-key and public key cryptography.
Design Goals
Better area efficiency and throughput than
commercial FPGAs for symmetric and asymmetric-key
ciphers
Throughput performance close to reported results
for ASICs
Architecture Flexibility sufficient for
implementing most (if not all) current and future
ciphers.
Reduce configuration data requirements to
facilitate Run-Time Reconfigurability

11
Presentation Overview

Our Design Methodology
Review of Ciphers: Rijndael, Serpent, IDEA, and
Public-Key
Requirements of Application Domain
Reconfigurable Resources to meet these
requirements
Mapping of Cryptographic Algorithms
Rijndael, and Long Integer Modular Multiplication
Results
Functional Simulation
Standard Cell Synthesis
Throughput/Area Comparison with Published Results
Conclusion and Future Directions

12
Our Design Methodology

Divide-and-conquer
Identify independently the
Logic,
Interconnect, and
Memory
requirements of different ciphers.
Prioritize ciphers for influence on architectural
decisions
Ensure architectural support for all ciphers
(except ECC), but
Focus on optimizing support for most important
ciphers
Rijndael, Serpent (best two AES candidates)
DES (due to popularity)
IDEA (due to different nature, also popularity)
Long-integer Modular Multiplication (required by
all public-key ciphers except ECC)

13
Overview of Ciphers: Rijndael

Rijndael
AddRoundKey: 128-bit XOR
SubBytes: 16 memory lookups, each 8x8 (8-bit in, 8-bit out)
ShiftRows: 128-bit permutation of 8-bit
sub-words.
MixColumns: the most complex operation.
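
As a rough illustration of why MixColumns is the most demanding of the four steps, the sketch below applies the standard AES MixColumns matrix to a single 4-byte column using constant multiplication in GF(2^8). The helper names `xtime` and `mix_column` are illustrative and are not taken from the thesis.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mix_column(col):
    """Apply the AES MixColumns matrix (2 3 1 1 / 1 2 3 1 / 1 1 2 3 / 3 1 1 2)."""
    a0, a1, a2, a3 = col

    def mul3(x):
        # 3*x = 2*x XOR x in GF(2^8)
        return xtime(x) ^ x

    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Every output byte is an XOR of four terms, two of which need a constant multiplication, which is why this step maps onto XOR-capable logic plus fixed shifts rather than onto a simple lookup or permutation.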

14
Overview of Ciphers: Serpent, IDEA

Serpent

IDEA

15
Public Key Ciphers

RSA, Diffie-Hellman, Elgamal
Require Long Integer Modular Exponentiation
Elliptic Curve Cryptography
GF(p) requires Long Integer Modular
Multiplication
GF(2^m) requires support for Finite Field
Arithmetic
Not supported in current architecture

16
Overview of Ciphers: RSA

Square and Multiply Algorithm for Modular
Exponentiation

Radix-2^k Long Integer Montgomery Modular
Multiplication (LIMM)

Ripple-Carry Addition of Long Integers !
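
A minimal sketch of how square-and-multiply exponentiation sits on top of Montgomery multiplication is given below. It assumes the simplest radix (2, i.e. one multiplier bit per iteration) and an odd modulus; the function names `mont_mul` and `mont_exp` and the final conditional subtraction are illustrative choices, not details from the slides. The plain `+` on Python integers plays the role of the carry-propagating long-integer addition mentioned above.

```python
def mont_mul(a, b, m, k):
    """Return a * b * 2^(-k) mod m for odd m, with a, b < m and k >= m.bit_length()."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q = (s + a_i * b) & 1            # chosen so the next sum is even
        s = (s + a_i * b + q * m) >> 1   # exact division by 2
    return s - m if s >= m else s


def mont_exp(x, e, m):
    """Compute x^e mod m with right-to-left square-and-multiply in Montgomery form."""
    k = m.bit_length()
    r2 = pow(2, 2 * k, m)                # 2^(2k) mod m, precomputed outside the loop
    z = mont_mul(r2, x % m, m, k)        # x converted to Montgomery form
    p = mont_mul(r2, 1, m, k)            # 1 converted to Montgomery form
    while e:
        if e & 1:
            p = mont_mul(p, z, m, k)     # multiply step
        z = mont_mul(z, z, m, k)         # square step
        e >>= 1
    return mont_mul(1, p, m, k)          # convert back out of Montgomery form


if __name__ == "__main__":
    print(mont_exp(4, 13, 497), pow(4, 13, 497))
```

Each loop iteration of `mont_mul` performs full-width additions, so in hardware the carry chain of a ripple-carry adder becomes the critical path as operands grow to 512 or 1024 bits.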
17
Overview of Ciphers: RSA

Fastest Implementation of LIMM [42]
Throughput: 5 Mb/s for both 512-bit and 1024-bit
RSA

Five-to-two Multiplier Modular Exponentiation (X, E, M)
K = 2^(2k) mod M, computed externally
1. P1[0], P2[0] = 5to2_MontMult(K, 0, 1, 0, M),
   Z1[0], Z2[0] = 5to2_MontMult(K, 0, X, 0, M)
2. FOR i = 0 to n-1 DO
3.   Z1[i+1], Z2[i+1] = 5to2_MontMult(Z1[i], Z2[i], Z1[i], Z2[i], M)
4.   IF e_i = 1 THEN P1[i+1], P2[i+1] = 5to2_MontMult(P1[i], P2[i], Z1[i], Z2[i], M)
     ELSE P1[i+1], P2[i+1] = P1[i], P2[i]
5. ENDFOR
6. P1[n], P2[n] = 5to2_MontMult(1, 0, P1[n-1], P2[n-1], M)
7. P = P1[n] + P2[n]
8. RETURN P

Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
1. S1[0], S2[0] = 0, 0
2. FOR i = 0 to m-1 DO
3.   q_i = ((S1[i] + S2[i]) + A_i (B1 + B2)) mod 2
4.   S1[i+1], S2[i+1] = CSR((S1[i] + S2[i]) + A_i (B1 + B2) + q_i M) div 2
5. ENDFOR
Carry Save Addition of Long Integers
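
The five-to-two CSA Montgomery multiplication named on this slide can be sketched in software as follows. This is an illustrative model under stated assumptions, not the thesis implementation: Python integers stand in for long registers, the five-to-two compressor is built from three 3-to-2 carry-save adders, and the multiplier bits A_i are taken from the resolved sum A1 + A2 (real hardware would derive them bit-serially). The names `csa`, `five_to_two`, and `mont_mul_5to2` are hypothetical.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def five_to_two(a, b, c, d, e):
    """Compress five operands into a (sum, carry) pair without carry propagation."""
    s1, c1 = csa(a, b, c)
    s2, c2 = csa(s1, d, e)
    return csa(s2, c1, c2)


def mont_mul_5to2(a1, a2, b1, b2, m, k):
    """Return (S1, S2) with S1 + S2 == (a1+a2)*(b1+b2)*2^(-k) (mod m).

    Assumes m is odd and a1 + a2 < 2^k.
    """
    a = a1 + a2                          # simplification: resolve A up front
    s1 = s2 = 0
    for i in range(k):
        a_i = (a >> i) & 1
        # q_i only needs the least significant bits of the operands
        q = (s1 ^ s2 ^ (a_i & (b1 ^ b2))) & 1
        s1, s2 = five_to_two(s1, s2, a_i * b1, a_i * b2, q * m)
        # the carry word is always even, so the sum word is even too:
        # shifting both words right by one is an exact division by 2
        s1, s2 = s1 >> 1, s2 >> 1
    return s1, s2


if __name__ == "__main__":
    s1, s2 = mont_mul_5to2(100, 23, 300, 45, 497, 12)
    print((s1 + s2) * pow(2, 12, 497) % 497, (123 * 345) % 497)
```

Because the result stays in carry-save form, no long carry chain is exercised inside the loop; a single carry-propagating addition is needed only when the final pair is resolved.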
18
Overview of Ciphers: RSA

Five-to-two CSA Montgomery Multiplication

19
Requirements of Application Domain

After an extensive survey of the logic,
interconnect, and memory requirements of both
symmetric-key and asymmetric-key ciphers, the
following requirements were identified:

20
Requirements of Application Domain

Logic Requirements
Symmetric-Key Ciphers
Bitwise XOR, AND, OR
Addition and Subtraction Modulo 2^n
Modular Multiplication (mod 2^32, 2^64, 2^16 + 1,
2^32 - 1)
Galois field constant multiplication.
Fixed Permutations on 32-bit, 64-bit and 128-bit
words
Both with and without mappings, constants
Includes fixed shifts and rotates.
Variable rotations.
Asymmetric-Key Ciphers (for chosen implementation
of LIMM)
Long word Carry-Save Addition
Random-Logic functionality for generating a few
1-bit global signals
Single-bit Shifting of long words.

21
Configurable Logic Block

Configurable Logic Block: Reconfigurable Carry
Save Adder cell

Composed of 64 Reconfigurable CSA Cells

Outputs optionally registered.
Inputs may be AND/XOR masked.

22
Configurable Logic Block

Each CLB has an associated Random logic block
Takes bit-wise signals from the outputs of the
Logic Block
Used to implement Bit signals for
Long integer Modular Multiplication
IDEA multiplication (generation of partial
products)

23
Configurable Logic Block

Overall architecture of Configurable Logic Block

24
Permutation Unit

Most demanding functionality requirements
Bitwise permutation of up to 128-bits (Serpent,
DES)
Permutations with repetitions within 8-bit words
(Rijndael)
1-bit shifts of long integer values (LIMM)
Potential approaches for implementing a Permute
Unit
Crossbar
Low latency
Supports permutations with repetitions
High area requirements (n^2 switches required)
Benes Networks, Omega-flip Networks, etc. [14]
Low area requirements
No support for repetitions
High latency (2 log n logic levels)
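
To make the crossbar side of this trade-off concrete, the sketch below models a crossbar as a per-output select vector. It is an illustrative model only; `crossbar` and `crossbar_switches` are hypothetical names. Because every output independently picks any input, mappings with repetition come for free, at the cost of n^2 switch points.

```python
def crossbar(inputs, select):
    """Drive output j from input select[j]; one input may feed several outputs."""
    return [inputs[s] for s in select]


def crossbar_switches(n):
    """Switch points needed for an n-input, n-output single-bit crossbar."""
    return n * n


if __name__ == "__main__":
    # A mapping with repetition: input 3 is copied to two outputs.
    print(crossbar(["a", "b", "c", "d"], [3, 3, 0, 1]))
    print(crossbar_switches(128))
```

A rearrangeable multistage network realizes only one-to-one permutations, so the repeated source in the first call is exactly the case it cannot express.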

25
Permutation Unit

The Benes Network
Our design
Replace each middle Benes stage with equivalent
crossbar.
For example, if each of the eight 16x16 Benes networks
in a 128-bit network is replaced by a 16x16
crossbar:
Mappings within 8-bit words are now supported.
Latency is reduced from 14 logic levels to 5.
Modest Area overhead incurred.

26
Permutation Unit

Overall architecture of Permutation Unit

27
Requirements of Application Domain

Memory Requirements
Symmetric-Key Ciphers
Table Lookups for implementing S-Boxes. Most
Common Configurations include
8-to-8 (e.g. Rijndael)
4-to-4 (e.g. Serpent)
6-to-4 (e.g. DES)
8-to-32 (e.g. MARS)
Asymmetric-Key Ciphers (for chosen method)
Primarily for storage of input values and
buffering of intermediate results.

28
Configurable Memory Block

Overall architecture of Configurable Memory Block

29
Requirements of Application Domain

Interconnect Requirements
Symmetric-Key Ciphers
Implies hierarchical interconnect organization
Straightforward producer-to-consumer data transfer
between functional blocks (i.e. rounds, or
sub-operations within rounds). Requires
Simple interconnect, with
Coarse-granularity
Diverse communication requirements within
functional blocks. Requires
High Flexibility,
Finer Granularity.
Asymmetric-Key Ciphers (for chosen method)
Straightforward producer-to-consumer
communication
Coarse-Granularity
Few global bitwise interconnect wires

30
Reconfigurable Interconnect Design

Overall Interconnect Design
Two levels of hierarchy
Top Level
Simple 4-NN (four nearest neighbors) Local Interconnect, augmented with a
Locality-aware Global Interconnect: Binary
Fat-tree
128-bit granularity
Group Level
Crossbar Interconnect,
Interconnects Logic, Permute, Memory blocks, and
I/O ports.
32-bit granularity
Interconnect design depends on interaction between
Reconfigurable Elements.
Interactions identified by mapping ciphers to
elements.

31
Reconfigurable Interconnect Design

Reconfigurable Logic Group Architecture (CGA2)
8 permute units, 4 memory blocks, 6 logic blocks.
Designed based on study of cipher mappings to
reconfigurable elements.
Crossbar Area estimate
(24) x (48) x 32 = 36,864 < 49,152 << 76,800
switches.
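
The three switch counts compared here follow from the estimate in-ports x out-ports x granularity used in this presentation. The short check below recomputes them; the function name is illustrative, and the way the port counts of the two alternative organizations are split into sums (24 + 16, 44 + 16, 16 + 8 + 8) is an assumption read off the backup slides.

```python
def crossbar_switches(in_ports, out_ports, granularity):
    """Crossbar area estimate in switches: in-ports x out-ports x granularity."""
    return in_ports * out_ports * granularity


# Chosen group architecture (CGA2): 24 in-ports, 48 out-ports, 32-bit granularity.
cga2 = crossbar_switches(24, 48, 32)

# Candidate group architecture 1: two crossbars (port split is an assumption).
cga1 = crossbar_switches(16, 16 + 8 + 8, 32) + crossbar_switches(16 + 8 + 8, 32, 32)

# Original group architecture (port split is an assumption).
original = crossbar_switches(24 + 16, 44 + 16, 32)

if __name__ == "__main__":
    print(cga2, cga1, original)
```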

32
Reconfigurable Interconnect Design

Hierarchical, Locality aware global interconnect
4-NN connections,
Fat-tree.

33
System Organization
34
Mapping of Algorithms

Rijndael,
Our Mapping

35
Mapping of Algorithms

Long Integer Modular Multiplication
Chosen implementation

36
Mapping of Algorithms

Long Integer Modular Multiplication
Our mapping

37
Functional Simulation Results

AES (Rijndael)
Implemented a single round of AES,
Verified using test vectors provided by NIST
Plaintext: 0x000102030405060708090A0B0C0D0E0F
Key: 0x000102030405060708090A0B0C0D0E0F
Expected Round Output: 0xB5C9179EB1CC1199B9C51B92B5C8159D
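
The expected value above can be reproduced in software with the sketch below. It assumes that "a single round" means the initial AddRoundKey followed by one full AES-128 round (SubBytes, ShiftRows, MixColumns, AddRoundKey with the first expanded round key); that reading is an assumption. The S-box is derived from the GF(2^8) inverse and affine transform rather than typed in, and all function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: multiplicative inverse followed by the affine transform."""
    box = []
    for x in range(256):
        inv = 1
        for _ in range(254):             # x^254 is the inverse (and maps 0 to 0)
            inv = gf_mul(inv, x)
        y = inv
        for s in (1, 2, 3, 4):
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        box.append(y ^ 0x63)
    return box


SBOX = make_sbox()


def shift_rows(state):
    """State is 16 bytes in column-major order: index = 4*column + row."""
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c:4 * c + 4]
        out += [gf_mul(2, a[i]) ^ gf_mul(3, a[(i + 1) % 4])
                ^ a[(i + 2) % 4] ^ a[(i + 3) % 4] for i in range(4)]
    return out


def round_key_1(key):
    """First expanded round key of the AES-128 key schedule."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    prev = [SBOX[b] for b in w[3][1:] + w[3][:1]]   # RotWord then SubWord
    prev[0] ^= 0x01                                 # round constant
    out = []
    for i in range(4):
        prev = [x ^ y for x, y in zip(w[i], prev)]
        out += prev
    return out


def first_round(plaintext, key):
    state = [p ^ k for p, k in zip(plaintext, key)]           # initial AddRoundKey
    state = mix_columns(shift_rows([SBOX[b] for b in state]))
    return bytes(s ^ k for s, k in zip(state, round_key_1(key)))


if __name__ == "__main__":
    block = bytes.fromhex("000102030405060708090A0B0C0D0E0F")
    print(first_round(block, block).hex().upper())
```

Since plaintext and key are identical here, the state after the initial AddRoundKey is all zeros, so the test mainly exercises SubBytes on a constant and the first step of the key schedule.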

38
Functional Simulation Results

Long Integer Modular Multiplication
Implemented a 255-bit long integer modular
multiplier using our blocks.
Generated own test vectors.

39
Synthesis Results

Synthesized for
LSI 10k standard cell library (0.6 micron
technology)
Using Synopsys Design Compiler
Operating Conditions

WCCOM:
Process variation index: 1.5
Temperature: 70
Voltage: 4.75 V
Interconnect model: worst_case_tree
BCCOM:
Process variation index: 0.6
Temperature: 0
Voltage: 5.25 V
Interconnect model: best_case_tree

05x05 Wire Load Model for Logic, Permute and
Memory Blocks
wire_load("05x05") {
  resistance : 0 ;
  capacitance : 1 ;
  area : 0 ;
  slope : 0.186 ;
  fanout_length(1, 0.39) ;
}
50x50 Wire Load Model for the Crossbar
Interconnect
wire_load("50x50") {
  resistance : 0 ;
  capacitance : 1 ;
  area : 0 ;
  slope : 1.218 ;
  fanout_length(1, 1.8) ;
}

40
Synthesis Results

We use the BCCOM results for our comparison below

41
Comparison with Published Results

Group Area requirements
Permute Area + Logic Area + Memory Area + Interconnect Area
= (8 x 21,256) + (6 x 8,100) + (4 x 19,251) + (1.5 x 3 x 39,104)
= 170,048 + 48,600 + 77,004 + 175,968
= 471,620 transistors.
Transistor count per Group: approximately 0.5 million
36.1% for the 8 Permute Units,
37.3% for the 24x48 Crossbar Interconnect,
16.3% for the 4 Memory Blocks, and
10.3% for the 6 Logic Blocks

42
Rijndael Results Comparison
43
Comparison with Published Results

AES Encryption
Throughput Comparison
Results compete well with FPGA implementations
Despite being on 0.6 micron standard cell
technology.
Full Custom implementation on recent fabrication
process should provide even better throughput.
Area Comparison
Direct comparison with FPGAs not possible.
Poor utilization of reconfigurable resources.
Area overhead: 30x that of ASIC
Primarily due to very high memory requirements of
Rijndael

44
RSA Results Comparison
45
Comparison with Published Results

RSA Encryption
Throughput Results
3 times worse than FPGA
Full-Custom Implementation may improve situation
somewhat.
Primary reason: delay incurred by 2 Permute
Units on critical path.
Area Comparison
Direct comparison with FPGA not possible.
Good resource utilization
Possible to implement cheaply on single IC.
Area overhead: 14x that of ASIC
10x acceptable for reconfigurable device

46
Discussion

Need to make improvements
Area requirements of Rijndael
Throughput of LIMM.
Proposed improvements
Configurable Memory Group
Replace 1 of every 4 Groups with a Memory group
Dramatic increase in area efficiency of Rijndael,
but 25% drop in area efficiency of all other
ciphers.
Pre-aligned CSA outputs
Eliminate need to shift long integers using
permute units
Shift-Before-Registering of Logic block outputs
Optimize logic block for delay

47
Conclusion and Future Work

Results are promising
Better than other Domain-specific reconfigurable
architectures.
Still room for improvement.
Future work
Functional simulation of other key ciphers
Implement proposed enhancements
Evaluate potential for incorporating ECC support
Development of Control elements and Programming
Model
With support for Run-time Reconfiguration
Full custom implementation of reconfigurable
elements.
And estimation of area/performance
Explore potential for supporting other
application domains at minimal additional cost.

48
Thank You
49
Mapping of Algorithms

Rijndael Contd,
Combinatorial S-boxes
Lots of equations similar to MixColumn
Also Mapped

50
Mapping of Algorithms

Rijndael Contd,
Combinatorial S-boxes
Map and Map^-1 operations.

Map Operation
aH3 = (a5 ⊕ a7)
aH2 = (a5 ⊕ a7) ⊕ (a2 ⊕ a3)
aH1 = (a4 ⊕ a6) ⊕ (a1 ⊕ a7)
aH0 = (a4 ⊕ a6) ⊕ (a5)
aL3 = (a2 ⊕ a4)
aL2 = (a1 ⊕ a7)
aL1 = (a1 ⊕ a2)
aL0 = (a4 ⊕ a6) ⊕ (a0 ⊕ a5)

Map^-1 Operation
a7 = (aH0 ⊕ aH1) ⊕ (aL2 ⊕ aH3)
a6 = (aL1 ⊕ aH3) ⊕ (aL2 ⊕ aL3) ⊕ aH0
a5 = (aH0 ⊕ aH1) ⊕ aL2
a4 = (aL1 ⊕ aH3) ⊕ (aH0 ⊕ aH1) ⊕ aL3
a3 = (aH0 ⊕ aH1) ⊕ (aL1 ⊕ aH2)
a2 = (aL1 ⊕ aH3) ⊕ (aH0 ⊕ aH1)
a1 = (aH0 ⊕ aH1) ⊕ aH3
a0 = (aL0 ⊕ aH0).
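
Reading the operators lost from the Map and Map^-1 equations as XOR, the two transforms can be checked against each other in software. The sketch below is illustrative: the function names and the packing of the two 4-bit halves into one byte (aH in the upper nibble, aL in the lower) are assumptions made only for this check.

```python
def bits(x):
    """Return the eight bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(8)]


def map_op(a_byte):
    """Map a GF(2^8) element to its (aH, aL) composite-field representation."""
    a = bits(a_byte)
    h3 = a[5] ^ a[7]
    h2 = a[5] ^ a[7] ^ a[2] ^ a[3]
    h1 = a[4] ^ a[6] ^ a[1] ^ a[7]
    h0 = a[4] ^ a[6] ^ a[5]
    l3 = a[2] ^ a[4]
    l2 = a[1] ^ a[7]
    l1 = a[1] ^ a[2]
    l0 = a[4] ^ a[6] ^ a[0] ^ a[5]
    hi = h3 << 3 | h2 << 2 | h1 << 1 | h0
    lo = l3 << 3 | l2 << 2 | l1 << 1 | l0
    return hi << 4 | lo


def inv_map_op(packed):
    """Inverse of map_op: recover the original GF(2^8) byte."""
    l0, l1, l2, l3, h0, h1, h2, h3 = bits(packed)
    a7 = h0 ^ h1 ^ l2 ^ h3
    a6 = l1 ^ h3 ^ l2 ^ l3 ^ h0
    a5 = h0 ^ h1 ^ l2
    a4 = l1 ^ h3 ^ h0 ^ h1 ^ l3
    a3 = h0 ^ h1 ^ l1 ^ h2
    a2 = l1 ^ h3 ^ h0 ^ h1
    a1 = h0 ^ h1 ^ h3
    a0 = l0 ^ h0
    return (a7 << 7 | a6 << 6 | a5 << 5 | a4 << 4
            | a3 << 3 | a2 << 2 | a1 << 1 | a0)


if __name__ == "__main__":
    print(all(inv_map_op(map_op(x)) == x for x in range(256)))
```

Both directions use only two-input XORs on shared subterms such as (aH0 ⊕ aH1), which is the same style of equation as MixColumns.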
51
Mapping of Algorithms

Rijndael Contd,
Combinatorial S-boxes
Affine and Affine^-1 operations.

The Affine Transform
q7 = (a4 ⊕ a5) ⊕ (a6 ⊕ a7) ⊕ a3
q6 = (a2 ⊕ a3) ⊕ (a4 ⊕ a5) ⊕ (a6 ⊕ 1)
q5 = (a2 ⊕ a3) ⊕ (a4 ⊕ a5) ⊕ (a1 ⊕ 1)
q4 = (a0 ⊕ a1) ⊕ (a2 ⊕ a3) ⊕ a4
q3 = (a0 ⊕ a1) ⊕ (a2 ⊕ a3) ⊕ a7
q2 = (a0 ⊕ a1) ⊕ (a6 ⊕ a7) ⊕ a2
q1 = (a0 ⊕ a1) ⊕ (a6 ⊕ a7) ⊕ (a5 ⊕ 1)
q0 = (a4 ⊕ a5) ⊕ (a6 ⊕ a7) ⊕ (a0 ⊕ 1)

The Inverse Affine Transform
q7 = (a1 ⊕ a4) ⊕ a6
q6 = (a0 ⊕ a5) ⊕ a3
q5 = (a2 ⊕ a7) ⊕ a4
q4 = (a3 ⊕ a6) ⊕ a1
q3 = (a0 ⊕ a5) ⊕ a2
q2 = (a1 ⊕ a4) ⊕ (a7 ⊕ 1)
q1 = (a3 ⊕ a6) ⊕ a0
q0 = (a2 ⊕ a7) ⊕ (a5 ⊕ 1).
52
Mapping of Algorithms

Rijndael Contd,
Combinatorial S-boxes
GF(2^4) Multiplication

q  | t     | u     | v     | w     | x     | y     | z
q3 | a0.b0 | a3.b1 | a2.b2 | a1.b3 | 0     | 0     | 0
q2 | a1.b0 | a0.b1 | a3.b2 | a2.b3 | a3.b1 | a2.b2 | a1.b3
q1 | a2.b0 | a1.b1 | a0.b2 | a3.b3 | 0     | a3.b2 | a2.b3
q0 | a3.b0 | a2.b1 | a1.b2 | a0.b3 | 0     | 0     | a3.b3
53
Mapping of Algorithms

IDEA Contd
Addition Modulo 2^16
XOR of 16-bit values
Multiplication Modulo 2^16 + 1 -> more tricky
We use the Modified Low-High Algorithm from [3]
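
For orientation, the basic low-high technique for multiplication modulo 2^16 + 1 is sketched below. This is the textbook form, not the modified variant from [3], whose details are not given on this slide; the function names are illustrative. As in IDEA, the 16-bit word 0 is taken to represent 2^16.

```python
MOD = (1 << 16) + 1  # 65537, a prime


def idea_mul(a, b):
    """Multiply two 16-bit words modulo 2^16 + 1, with 0 standing for 2^16."""
    if a == 0:
        return (MOD - b) & 0xFFFF        # 2^16 is -1 mod 65537, so the result is -b
    if b == 0:
        return (MOD - a) & 0xFFFF
    t = a * b                            # plain 16x16 -> 32-bit unsigned multiply
    lo, hi = t & 0xFFFF, t >> 16
    # 2^16 is -1 mod 65537, so t = lo - hi; add 1 (i.e. 65537 mod 2^16) on borrow
    return (lo - hi + (1 if lo < hi else 0)) & 0xFFFF


def idea_mul_reference(a, b):
    """Direct definition, used only to check idea_mul."""
    r = ((a or 1 << 16) * (b or 1 << 16)) % MOD
    return r & 0xFFFF                    # 2^16 is stored as the word 0


if __name__ == "__main__":
    print(idea_mul(0, 0), idea_mul(0xFFFF, 0xFFFF), idea_mul(2, 0x8000))
```

The only wide operation is an unsigned 16x16 multiply; the reduction itself is a 16-bit subtraction with a conditional increment, which is why the operation splits naturally into an integer multiplier plus a small amount of remaining logic.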

54
Mapping of Algorithms

IDEA Contd,
Our Mapping
The unsigned integer Multiplier

55
Mapping of Algorithms

IDEA Contd,
Our Mapping
The Remaining Logic (multiplexers, RCAs etc.)

56
Mapping of Algorithms

Serpent Contd
Mapping fairly simple

57
Mapping of Algorithms

58
Mapping of Algorithms

DES Contd
Our mapping

59
Configurable Memory Block - REMOVE

Our design
2-kilobits of LUT data per Configurable Memory
Block
Most commonly required memory size.
Composed of 32 4-to-4 LUTs
Minimum LUT granularity required by any
symmetric-key cipher.
Configurable interconnect at the Address-in and
Data-out ports of each 4-to-4 LUT
4-bit granularity
Supports the commonly required LUT
configurations, as well as a 4x128 configuration
for data input.
Possible to add support for even more LUT
configurations, at the cost of area.

60
Reconfigurable Interconnect Design

Initial Architecture
Focus on Nearest Neighbor Interconnect
128-bit in and out ports to at least 4-NN
Issues
Lack of Locality Awareness
Potential for Inflexibility when mapping
complicated structures.

61
Reconfigurable Interconnect Design

Original Group Architecture
Key Issues
Very large Interconnect Area Requirements
Poor Utilization of interconnect
Crossbar Area estimate
In-ports x out-ports x granularity
(24 + 16) x (44 + 16) x 32 = 76,800 switches.
Also consider
During algorithm mapping, discovered the need to
separate permute units from logic block outputs.
This incurred yet another crossbar, from logic
units to permute units.
Thus, area issue is critical to architecture
efficiency.

62
Reconfigurable Interconnect Design

Methods of Reducing Area of Intra-Group
Interconnect
Reduce ports in crossbar.
By reducing number of components in group.
Reducing the number of NN ports.
By sharing ports between elements.
Increase Granularity, e.g. from 32- to 64-bits
(effectively halving number of switches for same
number of ports)
Others...
All these approaches will affect the
functionality and flexibility of the
architecture.
Must be done cleverly
The following methods were developed by extensive
mapping and remapping of algorithms to find the
best solutions
Share ports between Permute Units and Memory
Block.
Use memory block as configuration store for
Permute Units.
Restrict number of NN ports in Group x-bar to
256-bits in and 256-bits out, by selecting via
configuration the required NN ports.
Resulted in two possible organizations
that reduce area without overly compromising
functionality.

63
Reconfigurable Interconnect Design

Candidate Group Architecture 1
4 permute units, 4 memory blocks, 4 logic blocks.
Requires 256-bit bypass paths as well.
Provides less functionality than CGA 2, selection
dependent on area characteristics.
Crossbar Area estimate
(16) x (16 + 8 + 8) x 32 + (16 + 8 + 8) x (32) x 32
= 49,152 switches << 76,800 switches.
