Title: An Efficient Reconfigurable Architecture for Asymmetric and Symmetric-key Cryptography
1. An Efficient Reconfigurable Architecture for Asymmetric and Symmetric-key Cryptography
- Masters Thesis Defense
- Ali Mustafa Zaidi
2. Research Motivation
- Need to minimize security risks in the Information Age:
- Future economies critically dependent on digital communication.
- Security protocols (SSL, IPSec) are algorithm-independent:
- For future-proofing and enhanced security.
- Ciphers are negotiated between communicating entities.
- Modern line rates are in the Gbps range:
- Ethernet (LAN), VPNs, ATM.
- System performance is not keeping pace with communication capacity.
3. Research Motivation
- Essential security services for modern digital communication:
- Confidentiality
- Data Integrity
- Authenticity
- Non-Repudiation
- Both symmetric-key and asymmetric-key ciphers are used to provide these services.
- Examples of need:
- SSL and IPSec servers require high-throughput, high-flexibility, low-cost implementations.
- Portable PCs, embedded systems, etc. require low-cost, low-power, reasonably flexible implementations.
4. Implementing Ciphers
- Ciphers may be implemented either in hardware or in software.
- ASIC implementations:
- Very high throughput,
- Low power requirements,
- More secure (if designed properly),
- Inflexible
- Software implementations:
- Highly flexible
- Low throughput,
- High power requirements
- Less secure
- Reconfigurable hardware: the third option
5. What is Reconfigurable Computing?
- Programmable hardware
- Spatial vs. temporal implementation of computation
- Ideal for exploiting parallelism at all levels
- Thus a reconfigurable device:
- Functions like a hardware implementation, but is
- Programmable like software
- Cost of flexibility: area inefficiency
- May require 10x greater area than dedicated ASIC implementations.
6. Why Reconfigurable Hardware?
- Reconfigurable hardware can provide:
- Flexibility equivalent to software implementations
- Algorithm Agility
- Algorithm Modification
- Algorithm Upload
- Throughput close to ASIC implementations
- Exploits parallelism at multiple levels
- Cost advantages
- Better power consumption than software
- Cheaper to develop than ASICs (at least for low volumes)
- Lower turnaround time and time-to-market than ASICs
7. Why Domain-Specific Reconfigurability?
- Commercial FPGAs are used extensively for implementing ciphers:
- General-purpose reconfigurable devices
- Provide very good throughput for most ciphers
- Poor power and area characteristics
- Efficiency/throughput may be improved by:
- Optimizing resources for a single application domain.
- Greater potential for implementing run-time reconfiguration
- Design challenges:
- Requires careful consideration of the entire application domain, in order to understand the trade-offs involved when designing reconfigurable resources.
8. Related Work
- USC Mark II's Advanced Cryptographic Engine for IPSec [11]: based on a general-purpose FPGA
- Cavium Networks Security Processor [13]: ASIC with multiple dedicated cipher cores.
9. Related Work
- COBRA [8]: reconfigurable architecture for secret-key ciphers
10. Our Work
- A specialized reconfigurable architecture for both secret-key and public-key cryptography.
- Design goals:
- Better area efficiency and throughput than commercial FPGAs for symmetric-key and asymmetric-key ciphers
- Throughput close to reported results for ASICs
- Architectural flexibility sufficient for implementing most (if not all) current and future ciphers.
- Reduce configuration data requirements to facilitate run-time reconfigurability
11. Presentation Overview
- Our Design Methodology
- Review of Ciphers: Rijndael, Serpent, IDEA, and Public-Key
- Requirements of Application Domain
- Reconfigurable Resources to meet these requirements
- Mapping of Cryptographic Algorithms
- Rijndael, and Long Integer Modular Multiplication
- Results
- Functional Simulation
- Standard Cell Synthesis
- Throughput/Area Comparison with Published Results
- Conclusion and Future Directions
12. Our Design Methodology
- Divide-and-conquer
- Identify independently the logic, interconnect, and memory requirements of different ciphers.
- Prioritize ciphers for influence on architectural decisions
- Ensure architectural support for all ciphers (except ECC), but
- Focus on optimizing support for the most important ciphers:
- Rijndael, Serpent (best two AES candidates)
- DES (due to popularity)
- IDEA (due to different nature, also popularity)
- Long-integer Modular Multiplication (required by all public-key ciphers except ECC)
13. Overview of Ciphers: Rijndael
- Rijndael
- AddRoundKey: 128-bit XOR
- SubBytes: sixteen 8x8 memory lookups
- ShiftRows: 128-bit permutation of 8-bit sub-words.
- MixColumns: the most complex operation.
14. Overview of Ciphers: Serpent, IDEA
15. Public Key Ciphers
- RSA, Diffie-Hellman, Elgamal
- Require Long Integer Modular Exponentiation
- Elliptic Curve Cryptography
- GF(p) requires Long Integer Modular Multiplication
- GF(2^m) requires support for Finite Field Arithmetic
- Not supported in current architecture
16. Overview of Ciphers: RSA
- Square-and-Multiply Algorithm for Modular Exponentiation
- Radix-2^k Long Integer Montgomery Modular Multiplication (LIMM)
- Ripple-Carry Addition of Long Integers!
17. Overview of Ciphers: RSA
- Fastest implementation of LIMM [42]
- Throughput: 5 Mb/s for both 512-bit and 1024-bit RSA
Five-to-two Multiplier Modular Exponentiation (X, E, M)
    K = 2^(2k) mod M, computed externally
    1. (P1_0, P2_0) = 5to2_MontMult(K, 0, 1, 0, M)
       (Z1_0, Z2_0) = 5to2_MontMult(K, 0, X, 0, M)
    2. FOR i = 0 to n-1 DO
    3.   (Z1_(i+1), Z2_(i+1)) = 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
    4.   IF e_i = 1 THEN (P1_(i+1), P2_(i+1)) = 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
         ELSE (P1_(i+1), P2_(i+1)) = (P1_i, P2_i)
    5. ENDFOR
    6. (P1_n, P2_n) = 5to2_MontMult(1, 0, P1_(n-1), P2_(n-1), M)
    7. P = P1_n + P2_n
    8. RETURN P
Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
    1. (S1_0, S2_0) = (0, 0)
    2. FOR i = 0 to m-1 DO
    3.   q_i = ((S1_i + S2_i) + A_i (B1 + B2)) mod 2
    4.   (S1_(i+1), S2_(i+1)) = CSR((S1_i + S2_i + A_i (B1 + B2) + q_i M) div 2)
    5. ENDFOR
Carry-Save Addition of Long Integers
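The two routines on this slide can be modelled in software. The following is a minimal Python sketch, not the thesis implementation: the helper names (`csa`, `mont_mult_5to2`, `mod_exp`) are hypothetical, the five-to-two compression is modelled as three chained 3:2 carry-save adders, the bits A_i are taken from A1 + A2 directly (real hardware would derive them bit-serially), and the iteration count k = bitlen(M) + 2 is an assumption chosen so that intermediate values stay below 2M.

```python
def csa(x, y, z):
    """3:2 carry-save adder on non-negative ints: x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def mont_mult_5to2(a1, a2, b1, b2, m, k):
    """Return (S1, S2) with S1 + S2 == A * B * 2^-k (mod m), m odd.

    A = a1 + a2 and B = b1 + b2 are given in carry-save form.
    """
    a = a1 + a2  # simplification: bits of A are read from the resolved sum
    s1 = s2 = 0
    for i in range(k):
        ai = (a >> i) & 1
        t1 = b1 if ai else 0
        t2 = b2 if ai else 0
        # q_i is the parity of S1 + S2 + A_i * (B1 + B2)
        q = (s1 ^ s2 ^ t1 ^ t2) & 1
        t3 = m if q else 0
        # Compress the five operands (s1, s2, t1, t2, t3) down to two.
        x, y = csa(s1, s2, t1)
        x, y = csa(x, y, t2)
        x, y = csa(x, y, t3)
        # x + y is even and y is even, so both halves shift exactly.
        s1, s2 = x >> 1, y >> 1
    return s1, s2

def mod_exp(x, e, m):
    """Right-to-left square-and-multiply using the carry-save multiplier."""
    k = m.bit_length() + 2            # assumption: R = 2^k > 4m
    big_k = pow(2, 2 * k, m)          # "computed externally" on the slide
    p1, p2 = mont_mult_5to2(big_k, 0, 1, 0, m, k)  # 1 in Montgomery form
    z1, z2 = mont_mult_5to2(big_k, 0, x, 0, m, k)  # X in Montgomery form
    for i in range(e.bit_length()):
        if (e >> i) & 1:
            p1, p2 = mont_mult_5to2(p1, p2, z1, z2, m, k)
        z1, z2 = mont_mult_5to2(z1, z2, z1, z2, m, k)
    p1, p2 = mont_mult_5to2(1, 0, p1, p2, m, k)    # leave Montgomery form
    return (p1 + p2) % m              # single carry-propagate addition
```

The only full carry-propagate addition in the exponentiation is the final `p1 + p2`, which is the point of keeping every intermediate result in carry-save form.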
18. Overview of Ciphers: RSA
- Five-to-two CSA Montgomery Multiplication
19. Requirements of Application Domain
- After an extensive survey of the logic, interconnect, and memory requirements of both symmetric-key and asymmetric-key ciphers, the following requirements were identified:
20. Requirements of Application Domain
- Logic Requirements
- Symmetric-Key Ciphers
- Bitwise XOR, AND, OR
- Addition and subtraction modulo 2^n
- Modular multiplication (mod 2^32, 2^64, 2^16 + 1, 2^32 - 1)
- Galois field constant multiplication.
- Fixed permutations on 32-bit, 64-bit and 128-bit words
- Both with and without mappings, constants
- Includes fixed shifts and rotates.
- Variable rotations.
- Asymmetric-Key Ciphers (for chosen implementation of LIMM)
- Long-word Carry-Save Addition
- Random-logic functionality for generating a few 1-bit global signals
- Single-bit shifting of long words.
21. Configurable Logic Block
- Configurable Logic Block: based on a Reconfigurable Carry-Save Adder (CSA) cell
- Composed of 64 Reconfigurable CSA Cells
- Outputs optionally registered.
- Inputs may be AND/XOR masked.
22. Configurable Logic Block
- Each CLB has an associated random logic block
- Takes bit-wise signals from the outputs of the Logic Block
- Used to implement bit signals for:
- Long Integer Modular Multiplication
- IDEA multiplication (generation of partial products)
23. Configurable Logic Block
- Overall architecture of Configurable Logic Block
24. Permutation Unit
- Most demanding functionality requirements:
- Bitwise permutation of up to 128 bits (Serpent, DES)
- Permutations with repetitions within 8-bit words (Rijndael)
- 1-bit shifts of long integer values (LIMM)
- Potential approaches for implementing a Permute Unit:
- Crossbar
- Low latency
- Supports permutations with repetitions
- High area requirements (n^2 switches required)
- Benes Networks, Omega-flip Networks, etc. [14]
- Low area requirements
- No support for repetitions
- High latency (2 log n logic levels)
25. Permutation Unit
- The Benes Network
- Our design:
- Replace each middle Benes stage with an equivalent crossbar.
- E.g., if each of the eight 16x16 Benes networks in a 128-bit network is replaced by a 16x16 crossbar:
- Mappings within 8-bit words are now supported.
- Latency reduced: 14 logic levels reduced to 5.
- Modest area overhead incurred.
26. Permutation Unit
- Overall architecture of Permutation Unit
27. Requirements of Application Domain
- Memory Requirements
- Symmetric-Key Ciphers
- Table lookups for implementing S-boxes. Most common configurations include:
- 8-to-8 (e.g. Rijndael)
- 4-to-4 (e.g. Serpent)
- 6-to-4 (e.g. DES)
- 8-to-32 (e.g. MARS)
- Asymmetric-Key Ciphers (for chosen method)
- Primarily for storage of input values and buffering of intermediate results.
28. Configurable Memory Block
- Overall architecture of Configurable Memory Block
29. Requirements of Application Domain
- Interconnect Requirements
- Symmetric-Key Ciphers
- Implies hierarchical interconnect organization
- Straightforward producer-to-consumer data transfer between functional blocks (i.e. rounds, or sub-operations within rounds). Requires:
- Simple interconnect, with
- Coarse granularity
- Diverse communication requirements within functional blocks. Requires:
- High flexibility,
- Finer granularity.
- Asymmetric-Key Ciphers (for chosen method)
- Straightforward producer-to-consumer communication
- Coarse granularity
- Few global bitwise interconnect wires
30. Reconfigurable Interconnect Design
- Overall Interconnect Design
- Two levels of hierarchy
- Top Level
- Simple 4-NN (nearest-neighbor) local interconnect, augmented with a
- Locality-aware global interconnect: binary fat-tree
- 128-bit granularity
- Group Level
- Crossbar interconnect,
- Interconnects Logic, Permute, Memory blocks, and I/O ports.
- 32-bit granularity
- Interconnect design depends on interaction between Reconfigurable Elements.
- Interactions identified by mapping ciphers to elements.
31. Reconfigurable Interconnect Design
- Reconfigurable Logic Group Architecture (CGA 2)
- 8 permute units, 4 memory blocks, 6 logic blocks.
- Designed based on study of cipher mappings to reconfigurable elements.
- Crossbar area estimate:
- 24 x 48 x 32 = 36,864 < 49,152 << 76,800 switches.
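The three switch counts compared here all follow the estimate used elsewhere in this deck: in-ports x out-ports x granularity. A small Python check, offered as a sketch: the function name is hypothetical, and the port breakdowns for the original group and for Candidate Group Architecture 1 are read from the backup slides, where the plus signs were lost in extraction, so they are an assumption.

```python
def crossbar_switches(in_ports, out_ports, granularity):
    """Switch count estimate: in-ports x out-ports x granularity (bits)."""
    return in_ports * out_ports * granularity

# Original group architecture (assumed reading: (24 + 16) x (44 + 16) x 32).
original = crossbar_switches(24 + 16, 44 + 16, 32)

# Candidate Group Architecture 1: two crossbars (assumed reading of the
# backup slide: 16 x (16 + 8 + 8) x 32 plus (16 + 8 + 8) x 32 x 32).
cga1 = crossbar_switches(16, 16 + 8 + 8, 32) + crossbar_switches(16 + 8 + 8, 32, 32)

# Chosen group architecture (CGA 2): 24 x 48 x 32.
cga2 = crossbar_switches(24, 48, 32)

print(cga2, cga1, original)
```

Under these readings the chosen architecture needs a little under half the switches of the original group crossbar.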
32. Reconfigurable Interconnect Design
- Hierarchical, locality-aware global interconnect
- 4-NN connections,
- Fat-tree.
33. System Organization
34. Mapping of Algorithms
35. Mapping of Algorithms
- Long Integer Modular Multiplication
- Chosen implementation
36. Mapping of Algorithms
- Long Integer Modular Multiplication
- Our mapping
37. Functional Simulation Results
- AES (Rijndael)
- Implemented a single round of AES,
- Verified using test vectors provided by NIST
- Plaintext: 0x000102030405060708090A0B0C0D0E0F
- Key: 0x000102030405060708090A0B0C0D0E0F
- Expected Round Output: 0xB5C9179EB1CC1199B9C51B92B5C8159D
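The quoted round output can be reproduced in software. The sketch below is an illustrative Python model, not the thesis test bench: all function names are hypothetical, the S-box is generated rather than tabulated, and it assumes that the "single round" consists of the initial AddRoundKey followed by round 1 (SubBytes, ShiftRows, MixColumns, AddRoundKey with the first expanded round key).

```python
def xtime(b):
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r

def build_sbox():
    """S-box = affine transform of the multiplicative inverse (0 maps to 0)."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 == x^-1 in GF(2^8)
                inv = gf_mul(inv, x)
        y = inv
        for shift in (1, 2, 3, 4):        # XOR of four left rotations
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox

def round_key_1(key, sbox):
    """First expanded round key of AES-128 (words w4..w7)."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    t = w[3][1:] + w[3][:1]               # RotWord
    t = [sbox[b] for b in t]              # SubWord
    t[0] ^= 0x01                          # Rcon[1]
    out, prev = [], t
    for i in range(4):
        prev = [a ^ b for a, b in zip(w[i], prev)]
        out += prev
    return out

def aes_round(state, rk, sbox):
    """One full round on a 16-byte state stored column by column."""
    s = [sbox[b] for b in state]                          # SubBytes
    s = [s[(i + 4 * (i % 4)) % 16] for i in range(16)]    # ShiftRows
    out = []
    for c in range(4):                                    # MixColumns
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return [x ^ k for x, k in zip(out, rk)]               # AddRoundKey

def first_round_output(plaintext, key):
    sbox = build_sbox()
    state = [p ^ k for p, k in zip(plaintext, key)]       # initial AddRoundKey
    return bytes(aes_round(state, round_key_1(key, sbox), sbox))

block = bytes(range(16))                  # 0x000102...0F, used as plaintext and key
result = first_round_output(block, block).hex().upper()
print(result)
```

Because plaintext and key are identical, the state entering round 1 is all zeros, so every S-box output is 0x63 and the result is simply 0x63 XORed into each byte of the first round key.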
38. Functional Simulation Results
- Long Integer Modular Multiplication
- Implemented a 255-bit long integer modular multiplier using our blocks.
- Generated own test vectors.
39. Synthesis Results
- Synthesized for:
- LSI 10k standard cell library (0.6 micron technology)
- Using Synopsys Design Compiler
- Operating Conditions
- WCCOM
- Process variation index: 1.5
- Temperature: 70
- Voltage: 4.75 V
- Interconnect model: worst_case_tree
- BCCOM
- Process variation index: 0.6
- Temperature: 0
- Voltage: 5.25 V
- Interconnect model: best_case_tree
- 05x05 Wire Load Model for Logic, Permute and Memory Blocks
- wire_load("05x05")
- resistance: 0
- capacitance: 1
- area: 0
- slope: 0.186
- fanout_length(1, 0.39)
- 50x50 Wire Load Model for the Crossbar Interconnect
- wire_load("50x50")
- resistance: 0
- capacitance: 1
- area: 0
- slope: 1.218
- fanout_length(1, 1.8)
40. Synthesis Results
- We use the BCCOM results for our comparison below
41. Comparison with Published Results
- Group area requirements:
- Permute Area + Logic Area + Memory Area + Interconnect Area
- = (8 x 21,256) + (6 x 8,100) + (4 x 19,251) + (1.5 x 3 x 39,104)
- = 170,048 + 48,600 + 77,004 + 175,968
- = 471,620 transistors.
- Transistor count per Group: approximately 0.5 million
- 36.1% for the 8 Permute Units,
- 37.3% for the 24x48 Crossbar Interconnect,
- 16.3% for the 4 Memory Blocks, and
- 10.3% for the 6 Logic Blocks
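The group area total and the percentage breakdown can be re-derived from the per-block figures on this slide. A short Python sketch with hypothetical variable names; the multipliers, including the 1.5 x 3 factor on the interconnect, are taken from the slide as given.

```python
# (multiplier, per-unit transistor count) for each element type in a group
blocks = {
    "Permute Units": (8, 21256),
    "Logic Blocks": (6, 8100),
    "Memory Blocks": (4, 19251),
    "Crossbar Interconnect": (1.5 * 3, 39104),
}

areas = {name: round(mult * size) for name, (mult, size) in blocks.items()}
total = sum(areas.values())
shares = {name: round(100 * area / total, 1) for name, area in areas.items()}

for name in blocks:
    print(f"{name}: {areas[name]} transistors ({shares[name]}%)")
print("Total:", total)
```

Permute units and the crossbar together account for nearly three quarters of the group, which is why the deck's area discussion concentrates on the interconnect.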
42. Rijndael Results Comparison
43. Comparison with Published Results
- AES Encryption
- Throughput Comparison
- Results compete well with FPGA implementations
- Despite being on 0.6 micron standard cell technology.
- Full-custom implementation on a recent fabrication process should provide even better throughput.
- Area Comparison
- Direct comparison with FPGAs not possible.
- Poor utilization of reconfigurable resources.
- Area overhead: 30x that of ASIC
- Primarily due to the very high memory requirements of Rijndael
44. RSA Results Comparison
45. Comparison with Published Results
- RSA Encryption
- Throughput Results
- 3 times worse than FPGA
- Full-custom implementation may improve the situation somewhat.
- Primary reason: delay incurred by 2 Permute Units on the critical path.
- Area Comparison
- Direct comparison with FPGA not possible.
- Good resource utilization
- Possible to implement cheaply on a single IC.
- Area overhead: 14x that of ASIC
- 10x is acceptable for a reconfigurable device
46. Discussion
- Need to make improvements in:
- Area requirements of Rijndael
- Throughput of LIMM.
- Proposed improvements
- Configurable Memory Group
- Replace 1 of every 4 Groups with a Memory Group
- Dramatic increase in area efficiency of Rijndael, but 25% drop in area efficiency of all other ciphers.
- Pre-aligned CSA outputs
- Eliminate need to shift long integers using permute units
- Shift-before-registering of logic block outputs
- Optimize logic block for delay
47. Conclusion and Future Work
- Results are promising
- Better than other domain-specific reconfigurable architectures.
- Still room for improvement.
- Future work
- Functional simulation of other key ciphers
- Implement proposed enhancements
- Evaluate potential for incorporating ECC support
- Development of control elements and programming model
- With support for run-time reconfiguration
- Full-custom implementation of reconfigurable elements.
- And estimation of area/performance
- Explore potential for supporting other application domains at minimal additional cost.
48. Thank You
49. Mapping of Algorithms
- Rijndael (contd.)
- Combinatorial S-boxes
- Lots of equations similar to MixColumn
- Also mapped
50. Mapping of Algorithms
- Rijndael (contd.)
- Combinatorial S-boxes
- Map and Map^-1 operations.
Map Operation:
    aH3 = (a5 ⊕ a7)
    aH2 = (a5 ⊕ a7) ⊕ (a2 ⊕ a3)
    aH1 = (a4 ⊕ a6) ⊕ (a1 ⊕ a7)
    aH0 = (a4 ⊕ a6) ⊕ a5
    aL3 = (a2 ⊕ a4)
    aL2 = (a1 ⊕ a7)
    aL1 = (a1 ⊕ a2)
    aL0 = (a4 ⊕ a6) ⊕ (a0 ⊕ a5)
Map^-1 Operation:
    a7 = (aH0 ⊕ aH1) ⊕ (aL2 ⊕ aH3)
    a6 = (aL1 ⊕ aH3) ⊕ (aL2 ⊕ aL3) ⊕ aH0
    a5 = (aH0 ⊕ aH1) ⊕ aL2
    a4 = (aL1 ⊕ aH3) ⊕ (aH0 ⊕ aH1) ⊕ aL3
    a3 = (aH0 ⊕ aH1) ⊕ (aL1 ⊕ aH2)
    a2 = (aL1 ⊕ aH3) ⊕ (aH0 ⊕ aH1)
    a1 = (aH0 ⊕ aH1) ⊕ aH3
    a0 = (aL0 ⊕ aH0)
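A quick way to sanity-check the Map and Map^-1 equations is to confirm that they undo each other for all 256 byte values. The Python sketch below is illustrative only: the function names are hypothetical, the operators lost in extraction are assumed to be XOR, and a0 is assumed to be the least significant bit (the round trip holds either way).

```python
def iso_map(a):
    """Split byte a into (aL, aH), two 4-bit values, per the Map equations."""
    b = [(a >> i) & 1 for i in range(8)]   # assumption: a0 is the LSB
    a_lo = [
        b[4] ^ b[6] ^ b[0] ^ b[5],         # aL0
        b[1] ^ b[2],                       # aL1
        b[1] ^ b[7],                       # aL2
        b[2] ^ b[4],                       # aL3
    ]
    a_hi = [
        b[4] ^ b[6] ^ b[5],                # aH0
        b[4] ^ b[6] ^ b[1] ^ b[7],         # aH1
        b[5] ^ b[7] ^ b[2] ^ b[3],         # aH2
        b[5] ^ b[7],                       # aH3
    ]
    pack = lambda bits: sum(bit << i for i, bit in enumerate(bits))
    return pack(a_lo), pack(a_hi)

def iso_unmap(lo, hi):
    """Recombine (aL, aH) into a byte, per the Map^-1 equations."""
    L = [(lo >> i) & 1 for i in range(4)]
    H = [(hi >> i) & 1 for i in range(4)]
    x = L[1] ^ H[3]                        # shared term (aL1 xor aH3)
    y = H[0] ^ H[1]                        # shared term (aH0 xor aH1)
    a = [
        L[0] ^ H[0],                       # a0
        y ^ H[3],                          # a1
        x ^ y,                             # a2
        y ^ L[1] ^ H[2],                   # a3
        x ^ y ^ L[3],                      # a4
        y ^ L[2],                          # a5
        x ^ L[2] ^ L[3] ^ H[0],            # a6
        y ^ L[2] ^ H[3],                   # a7
    ]
    return sum(bit << i for i, bit in enumerate(a))

round_trip_ok = all(iso_unmap(*iso_map(v)) == v for v in range(256))
print(round_trip_ok)
```

The two shared terms in `iso_unmap` mirror the repeated parenthesised pairs in the slide's equations, which is where a hardware mapping would share XOR gates.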
51. Mapping of Algorithms
- Rijndael (contd.)
- Combinatorial S-boxes
- Affine and Affine^-1 operations.
The Affine Transform:
    q7 = (a4 ⊕ a5) ⊕ (a6 ⊕ a7) ⊕ a3
    q6 = (a2 ⊕ a3) ⊕ (a4 ⊕ a5) ⊕ (a6 ⊕ 1)
    q5 = (a2 ⊕ a3) ⊕ (a4 ⊕ a5) ⊕ (a1 ⊕ 1)
    q4 = (a0 ⊕ a1) ⊕ (a2 ⊕ a3) ⊕ a4
    q3 = (a0 ⊕ a1) ⊕ (a2 ⊕ a3) ⊕ a7
    q2 = (a0 ⊕ a1) ⊕ (a6 ⊕ a7) ⊕ a2
    q1 = (a0 ⊕ a1) ⊕ (a6 ⊕ a7) ⊕ (a5 ⊕ 1)
    q0 = (a4 ⊕ a5) ⊕ (a6 ⊕ a7) ⊕ (a0 ⊕ 1)
The Inverse Affine Transform:
    q7 = (a1 ⊕ a4) ⊕ a6
    q6 = (a0 ⊕ a5) ⊕ a3
    q5 = (a2 ⊕ a7) ⊕ a4
    q4 = (a3 ⊕ a6) ⊕ a1
    q3 = (a0 ⊕ a5) ⊕ a2
    q2 = (a1 ⊕ a4) ⊕ (a7 ⊕ 1)
    q1 = (a3 ⊕ a6) ⊕ a0
    q0 = (a2 ⊕ a7) ⊕ (a5 ⊕ 1)
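The affine equations can be checked the same way: the transform of zero should be the AES constant 0x63, and the inverse transform should undo the forward one for every byte. A minimal Python sketch with hypothetical function names, assuming the lost operators are XOR and that a0/q0 are the least significant bits.

```python
def _bits(a):
    return [(a >> i) & 1 for i in range(8)]

def _pack(q):
    return sum(bit << i for i, bit in enumerate(q))

def affine(a):
    """Forward affine transform, written out per output bit."""
    a = _bits(a)
    return _pack([
        a[4] ^ a[5] ^ a[6] ^ a[7] ^ a[0] ^ 1,   # q0
        a[0] ^ a[1] ^ a[6] ^ a[7] ^ a[5] ^ 1,   # q1
        a[0] ^ a[1] ^ a[6] ^ a[7] ^ a[2],       # q2
        a[0] ^ a[1] ^ a[2] ^ a[3] ^ a[7],       # q3
        a[0] ^ a[1] ^ a[2] ^ a[3] ^ a[4],       # q4
        a[2] ^ a[3] ^ a[4] ^ a[5] ^ a[1] ^ 1,   # q5
        a[2] ^ a[3] ^ a[4] ^ a[5] ^ a[6] ^ 1,   # q6
        a[4] ^ a[5] ^ a[6] ^ a[7] ^ a[3],       # q7
    ])

def affine_inv(a):
    """Inverse affine transform, written out per output bit."""
    a = _bits(a)
    return _pack([
        a[2] ^ a[7] ^ a[5] ^ 1,                 # q0
        a[3] ^ a[6] ^ a[0],                     # q1
        a[1] ^ a[4] ^ a[7] ^ 1,                 # q2
        a[0] ^ a[5] ^ a[2],                     # q3
        a[3] ^ a[6] ^ a[1],                     # q4
        a[2] ^ a[7] ^ a[4],                     # q5
        a[0] ^ a[5] ^ a[3],                     # q6
        a[1] ^ a[4] ^ a[6],                     # q7
    ])

affine_round_trip_ok = all(affine_inv(affine(v)) == v for v in range(256))
print(hex(affine(0)), affine_round_trip_ok)
```

Each forward output bit needs at most five XOR inputs plus an optional constant, while the inverse needs only three, which gives a feel for how much logic block capacity each direction consumes.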
52. Mapping of Algorithms
- Rijndael (contd.)
- Combinatorial S-boxes
- GF(2^4) Multiplication
    q  | t      u      v      w      x      y      z
    q3 | a0.b0  a3.b1  a2.b2  a1.b3  0      0      0
    q2 | a1.b0  a0.b1  a3.b2  a2.b3  a3.b1  a2.b2  a1.b3
    q1 | a2.b0  a1.b1  a0.b2  a3.b3  0      a3.b2  a2.b3
    q0 | a3.b0  a2.b1  a1.b2  a0.b3  0      0      a3.b3
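The product terms in this table can be compared against ordinary polynomial multiplication in GF(2^4). The Python sketch below rests on several assumptions that the slide does not state: that the terms in each row are XORed together, that the field polynomial is x^4 + x + 1, and that the four rows, in the order listed, give the coefficients of x^0 through x^3 of the product. The function names are hypothetical.

```python
def gf16_mul_rows(a, b):
    """Multiply using the AND/XOR terms listed row by row in the table."""
    a0, a1, a2, a3 = [(a >> i) & 1 for i in range(4)]
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    rows = [
        a0 & b0 ^ a3 & b1 ^ a2 & b2 ^ a1 & b3,
        a1 & b0 ^ a0 & b1 ^ a3 & b2 ^ a2 & b3 ^ a3 & b1 ^ a2 & b2 ^ a1 & b3,
        a2 & b0 ^ a1 & b1 ^ a0 & b2 ^ a3 & b3 ^ a3 & b2 ^ a2 & b3,
        a3 & b0 ^ a2 & b1 ^ a1 & b2 ^ a0 & b3 ^ a3 & b3,
    ]
    # Assumption: first row is the x^0 coefficient, last row is x^3.
    return sum(bit << i for i, bit in enumerate(rows))

def gf16_mul_ref(a, b, poly=0b10011):
    """Reference: carry-less multiply, then reduce modulo x^4 + x + 1."""
    r = 0
    for i in range(4):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(6, 3, -1):              # clear bits 6, 5, 4 in turn
        if (r >> i) & 1:
            r ^= poly << (i - 4)
    return r

table_matches = all(
    gf16_mul_rows(a, b) == gf16_mul_ref(a, b)
    for a in range(16) for b in range(16)
)
print(table_matches)
```

Under these assumptions each row is a pure AND/XOR expression of at most seven terms, matching the seven columns t through z.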
53. Mapping of Algorithms
- IDEA (contd.)
- Addition modulo 2^16
- XOR of 16-bit values
- Multiplication modulo 2^16 + 1: more tricky
- We use the Modified Low-High Algorithm from [3]
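For reference, the standard low-high technique for multiplication modulo 2^16 + 1 can be sketched as follows. This is the textbook form, not the modified variant from [3] that the deck uses; the function names are hypothetical, and the usual IDEA convention that the 16-bit value 0 stands for 2^16 is assumed.

```python
MOD = 0x10001  # 2^16 + 1

def idea_mul(a, b):
    """Multiply two 16-bit words modulo 2^16 + 1 (0 represents 2^16)."""
    if a == 0:                       # 2^16 == -1 (mod 2^16 + 1)
        return (MOD - b) & 0xFFFF
    if b == 0:
        return (MOD - a) & 0xFFFF
    p = a * b
    lo, hi = p & 0xFFFF, p >> 16
    # 2^16 == -1, so p == lo - hi; add 1 (mod 2^16) when that goes negative.
    return (lo - hi + (lo < hi)) & 0xFFFF

def idea_mul_ref(a, b):
    """Direct reference using big-integer arithmetic."""
    return ((a or 0x10000) * (b or 0x10000) % MOD) & 0xFFFF

samples = [0, 1, 2, 3, 255, 256, 0x7FFF, 0x8000, 0x8001, 0xFFFE, 0xFFFF]
low_high_ok = all(idea_mul(a, b) == idea_mul_ref(a, b)
                  for a in samples for b in samples)
print(low_high_ok)
```

The appeal for hardware is that only one 16x16 unsigned multiply, one subtraction, and a conditional increment are needed, with no division by 2^16 + 1.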
54. Mapping of Algorithms
- IDEA (contd.)
- Our mapping
- The unsigned integer multiplier
55. Mapping of Algorithms
- IDEA (contd.)
- Our mapping
- The remaining logic (multiplexers, RCAs, etc.)
56. Mapping of Algorithms
- Serpent (contd.)
- Mapping fairly simple
57. Mapping of Algorithms
58. Mapping of Algorithms
59. Configurable Memory Block - REMOVE
- Our design
- 2 kilobits of LUT data per Configurable Memory Block
- Most commonly required memory size.
- Composed of 32 4-to-4 LUTs
- Minimum LUT granularity required by any symmetric-key cipher.
- Configurable interconnect at the Address-in and Data-out ports of each 4-to-4 LUT
- 4-bit granularity
- Supports the commonly required LUT configurations, as well as a 4x128 configuration for data input.
- Possible to add support for even more LUT configurations, at the cost of area.
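The sizing on this slide can be checked with simple storage arithmetic. The Python sketch below is illustrative: the function names are hypothetical, and it counts raw table bits only, ignoring any multiplexing or interconnect needed to combine 4-to-4 LUTs into wider tables.

```python
LUT_BITS = (1 << 4) * 4      # one 4-to-4 LUT: 16 entries x 4 bits
LUTS_PER_BLOCK = 32          # per Configurable Memory Block
BLOCK_BITS = LUT_BITS * LUTS_PER_BLOCK

def luts_needed(addr_bits, data_bits):
    """4-to-4 LUTs needed to store one addr_bits-to-data_bits table."""
    table_bits = (1 << addr_bits) * data_bits
    return -(-table_bits // LUT_BITS)          # ceiling division

def blocks_needed(addr_bits, data_bits):
    """Memory blocks needed for one table, by storage alone."""
    return -(-luts_needed(addr_bits, data_bits) // LUTS_PER_BLOCK)

configs = {"8-to-8 (Rijndael)": (8, 8), "4-to-4 (Serpent)": (4, 4),
           "6-to-4 (DES)": (6, 4), "8-to-32 (MARS)": (8, 32)}
for name, (addr, data) in configs.items():
    print(name, luts_needed(addr, data), "LUTs,", blocks_needed(addr, data), "block(s)")
```

By this count one Rijndael S-box exactly fills one 2-kilobit memory block, which is consistent with the slide calling this the most commonly required memory size.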
60. Reconfigurable Interconnect Design
- Initial Architecture
- Focus on Nearest-Neighbor (NN) Interconnect
- 128-bit in and out ports to at least 4 nearest neighbors
- Issues
- Lack of locality awareness
- Potential for inflexibility when mapping complicated structures.
61. Reconfigurable Interconnect Design
- Original Group Architecture
- Key issues
- Very large interconnect area requirements
- Poor utilization of interconnect
- Crossbar area estimate:
- In-ports x out-ports x granularity
- (24 + 16) x (44 + 16) x 32 = 76,800 switches.
- Also consider:
- During algorithm mapping, discovered the need to separate permute units from logic block outputs.
- This incurred yet another crossbar, from logic units to permute units.
- Thus, the area issue is critical to architecture efficiency.
62. Reconfigurable Interconnect Design
- Methods of reducing area of intra-group interconnect:
- Reduce ports in crossbar.
- By reducing number of components in group.
- Reducing the number of NN ports.
- By sharing ports between elements.
- Increase granularity, e.g. from 32 to 64 bits (effectively halving the number of switches for the same number of ports)
- Others...
- All these approaches will affect the functionality and flexibility of the architecture.
- Must be done cleverly
- The following methods were developed by extensive mapping and remapping of algorithms to find the best solutions:
- Share ports between Permute Units and Memory Block.
- Use memory block as configuration store for Permute Units.
- Restrict number of NN ports in Group crossbar to 256 bits in and 256 bits out, by selecting the required NN ports via configuration.
- Resulted in two possible organizations that reduce area without overly compromising functionality.
63. Reconfigurable Interconnect Design
- Candidate Group Architecture 1 (CGA 1)
- 4 permute units, 4 memory blocks, 4 logic blocks. Requires 256-bit bypass paths as well.
- Provides less functionality than CGA 2; selection depends on area characteristics.
- Crossbar area estimate:
- 16 x (16 + 8 + 8) x 32 + (16 + 8 + 8) x 32 x 32 = 49,152 switches << 76,800 switches.