An Efficient Reconfigurable Architecture for Asymmetric and Symmetric-key Cryptography


Title: An Efficient Reconfigurable Architecture for Asymmetric and Symmetric-key Cryptography


1
An Efficient Reconfigurable Architecture for
Asymmetric and Symmetric-key Cryptography
  • Masters Thesis Defense
  • Ali Mustafa Zaidi

2
Research Motivation
  • Need to minimize security risks in the
    Information Age
  • Future economies critically dependent on digital
    communication.
  • Security Protocols (SSL, IPSec) are Algorithm
    Independent
  • For future-proofing, enhanced security.
  • Ciphers negotiated between communicating entities
  • Modern line rates are in the Gbps range
  • Ethernet (LAN), VPNs, ATM.
  • System performance not keeping pace with
    communication capacity.

3
Research Motivation
  • Essential security services for Modern Digital
    Communication
  • Confidentiality
  • Data Integrity
  • Authenticity
  • Non-Repudiation
  • Both Symmetric and Asymmetric-key ciphers used to
    provide these services
  • Examples of Need
  • SSL and IPSec Servers require
  • High Throughput, High Flexibility, Low Cost
    implementations.
  • Portable PCs, embedded systems etc. require
  • Low Cost, Low Power, Reasonable Flexibility
    implementations.

4
Implementing Ciphers
  • Ciphers may be implemented either in Hardware or
    in Software
  • ASIC implementations
  • Very High Throughput,
  • Low Power Requirements,
  • More secure (if designed properly),
  • Inflexible
  • Software implementations
  • Highly flexible
  • Low Throughput,
  • High Power Requirements
  • Less secure
  • Reconfigurable Hardware: a third option

5
What is Reconfigurable Computing?
  • Programmable Hardware
  • Spatial vs. Temporal implementation of
    computation
  • Ideal for exploiting parallelism at all levels
  • Thus a reconfigurable device
  • Functions like hardware implementation, but is
  • Programmable like software
  • Cost of flexibility: area inefficiency
  • May require 10x greater area than dedicated
    ASIC implementations.

6
Why Reconfigurable Hardware?
  • Reconfigurable Hardware can provide
  • Flexibility equivalent to software
    implementations
  • Algorithm Agility
  • Algorithm Modification
  • Algorithm Upload
  • Throughput close to ASIC implementations
  • Exploits parallelism at multiple levels
  • Cost Advantages
  • Better power consumption than software
  • Cheaper to develop than ASICs (at least for low
    volumes)
  • Lower turnaround-time and time-to-market than
    ASICs

7
Why Domain-Specific Reconfigurability?
  • Commercial FPGAs used extensively for
    implementation of Ciphers
  • General-purpose reconfigurable device
  • Provides very good throughput values for most
    ciphers
  • Poor power and area characteristics
  • Efficiency/Throughput may be improved by
  • Optimizing resources for a single application
    domain.
  • Greater potential for implementing Run-time
    Reconfiguration
  • Design Challenges
  • Requires careful consideration of entire
    application domain, in order to understand
    trade-offs involved when designing reconfigurable
    resources.

8
Related Work
  • USC Mark II's Advanced Cryptographic Engine for
    IPSec [11]: based on a general-purpose FPGA
  • Cavium Networks Security Processor [13]: ASIC
    with multiple dedicated cipher cores.

9
Related Work
  • COBRA [8]: Reconfigurable architecture for
    Secret-Key ciphers

10
Our Work
  • A specialized reconfigurable architecture for
    both secret-key and public key cryptography.
  • Design Goals
  • Better area efficiency and throughput than
    commercial FPGAs for symmetric and asymmetric-key
    ciphers
  • Throughput performance close to reported results
    for ASICs
  • Architecture Flexibility sufficient for
    implementing most (if not all) current and future
    ciphers.
  • Reduce configuration data requirements to
    facilitate Run-Time Reconfigurability

11
Presentation Overview
  • Our Design Methodology
  • Review of Ciphers: Rijndael, Serpent, IDEA, and
    Public-Key
  • Requirements of Application Domain
  • Reconfigurable Resources to meet these
    requirements
  • Mapping of Cryptographic Algorithms
  • Rijndael, and Long Integer Modular Multiplication
  • Results
  • Functional Simulation
  • Standard Cell Synthesis
  • Throughput/Area Comparison with Published Results
  • Conclusion and Future Directions

12
Our Design Methodology
  • Divide-and-conquer
  • Identify independently the
  • Logic,
  • Interconnect, and
  • Memory
  • requirements of different ciphers.
  • Prioritize ciphers for influence on architectural
    decisions
  • Ensure architectural support for all ciphers
    (except ECC), but
  • Focus on optimizing support for most important
    ciphers
  • Rijndael, Serpent (best two AES candidates)
  • DES (due to popularity)
  • IDEA (due to different nature, also popularity)
  • Long-integer Modular Multiplication (required by
    all public-key ciphers except ECC)

13
Overview of Ciphers: Rijndael
  • Rijndael
  • AddRoundKey: 128-bit XOR
  • SubBytes: 16 8x8 memory lookups
  • ShiftRows: 128-bit permutation of 8-bit
    sub-words.
  • MixColumns:
  • Most complex operation.

14
Overview of Ciphers: Serpent, IDEA
  • Serpent
  • IDEA

15
Public Key Ciphers
  • RSA, Diffie-Hellman, Elgamal
  • Require Long Integer Modular Exponentiation
  • Elliptic Curve Cryptography
  • GF(p) requires Long Integer Modular
    Multiplication
  • GF(2^m) requires support for Finite Field
    Arithmetic
  • Not supported in current architecture

16
Overview of Ciphers: RSA
  • Square and Multiply Algorithm for Modular
    Exponentiation
  • Radix 2^k Long Integer Montgomery Modular
    Multiplication (LIMM)

Ripple-Carry Addition of Long Integers !
17
Overview of Ciphers: RSA
  • Fastest Implementation of LIMM [42]
  • Throughput: 5 Mb/s for both 512-bit and 1024-bit
    RSA

Five-to-two Multiplier Modular Exponentiation (X, E, M)
K = 2^(2k) mod M, computed externally
  1. P1_0, P2_0 = 5to2_MontMult(K, 0, 1, 0, M);
     Z1_0, Z2_0 = 5to2_MontMult(K, 0, X, 0, M)
  2. FOR i = 0 to n-1 DO
  3.   Z1_(i+1), Z2_(i+1) = 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
  4.   IF e_i = 1 THEN P1_(i+1), P2_(i+1) = 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
       ELSE P1_(i+1), P2_(i+1) = P1_i, P2_i
  5. ENDFOR
  6. P1_n, P2_n = 5to2_MontMult(1, 0, P1_(n-1), P2_(n-1), M)
  7. P = P1_n + P2_n
  8. RETURN P

Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
  1. S1_0, S2_0 = 0, 0
  2. FOR i = 0 to m-1 DO
  3.   q_i = (S1_i + S2_i) + A_i(B1 + B2) mod 2
  4.   S1_(i+1), S2_(i+1) = CSR((S1_i + S2_i) + A_i(B1 + B2) + q_i M) div 2
  5. ENDFOR

Carry Save Addition of Long Integers
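
The two routines above can be modelled in software roughly as follows. This is a minimal sketch that uses ordinary Python integers instead of the five-to-two carry-save representation, so it illustrates only the arithmetic (radix-2 Montgomery multiplication inside a right-to-left square-and-multiply loop), not the hardware datapath. The names `mont_mult` and `mod_exp`, and the choice `k = M.bit_length()`, are assumptions made for illustration.

```python
def mont_mult(a: int, b: int, m: int, k: int) -> int:
    """Radix-2 Montgomery product a*b*2^(-k) mod m, for odd m < 2^k and a, b < m."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1            # picks q_i so the next sum is even
        s = (s + a_i * b + q_i * m) >> 1   # exact division by 2
    return s - m if s >= m else s          # single final correction


def mod_exp(x: int, e: int, m: int) -> int:
    """Square-and-multiply exponentiation carried out in the Montgomery domain."""
    k = m.bit_length()
    r2 = pow(2, 2 * k, m)                  # K = 2^(2k) mod M, "computed externally"
    p = mont_mult(r2, 1, m, k)             # 1 in Montgomery form
    z = mont_mult(r2, x % m, m, k)         # x in Montgomery form
    while e:
        if e & 1:
            p = mont_mult(p, z, m, k)      # multiply step
        z = mont_mult(z, z, m, k)          # square step
        e >>= 1
    return mont_mult(1, p, m, k)           # convert back out of Montgomery form


if __name__ == "__main__":
    print(mod_exp(4, 13, 497), pow(4, 13, 497))
```

The modulus must be odd, as is always the case for RSA. In the carry-save version on the slide, each value is held as a pair (sum, carry) so that no long carry chain has to ripple inside the loop.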
18
Overview of Ciphers: RSA
  • Five-to-two CSA Montgomery Multiplication

19
Requirements of Application Domain
  • After an extensive survey of the logic,
    interconnect, and memory requirements of both
    symmetric-key and asymmetric-key ciphers, the
    following requirements were identified

20
Requirements of Application Domain
  • Logic Requirements
  • Symmetric-Key Ciphers
  • Bitwise XOR, AND, OR
  • Addition and Subtraction Modulo 2^n
  • Modular Multiplication (mod 2^32, 2^64, 2^16 + 1,
    2^32 - 1)
  • Galois field constant multiplication.
  • Fixed Permutations on 32-bit, 64-bit and 128-bit
    words
  • Both with and without mappings, constants
  • Includes fixed shifts and rotates.
  • Variable rotations.
  • Asymmetric-Key Ciphers (for chosen implementation
    of LIMM)
  • Long word Carry-Save Addition
  • Random-Logic functionality for generating a few
    1-bit global signals
  • Single-bit Shifting of long words.

21
Configurable Logic Block
  • Configurable Logic Block: based on a Reconfigurable
    Carry-Save Adder cell
  • Composed of 64 Reconfigurable CSA Cells
  • Outputs optionally registered.
  • Inputs may be AND/XOR masked.

22
Configurable Logic Block
  • Each CLB has an associated Random logic block
  • Takes bit-wise signals from the outputs of the
    Logic Block
  • Used to implement Bit signals for
  • Long integer Modular Multiplication
  • IDEA multiplication (generation of partial
    products)

23
Configurable Logic Block
  • Overall architecture of Configurable Logic Block

24
Permutation Unit
  • Most demanding functionality requirements
  • Bitwise permutation of up to 128-bits (Serpent,
    DES)
  • Permutations with repetitions within 8-bit words
    (Rijndael)
  • 1-bit shifts of long integer values (LIMM)
  • Potential approaches for implementing a Permute
    Unit
  • Crossbar
  • Low latency
  • Supports permutations with repetitions
  • High area requirements (n^2 switches required)
  • Benes Networks, Omega-flip Networks, etc. [14]
  • Low area requirements
  • No support for repetitions
  • High latency (2 log n logic levels)
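
The area/latency trade-off between the two approaches can be tabulated with a few lines of arithmetic. This sketch uses the slide's own figures (n^2 switches for a crossbar, 2 log n logic levels for a multistage network); the 2x2-switch count for a Benes network, (n/2)(2 log2 n - 1), is the standard textbook figure and is not taken from the slides. Function names are illustrative, and n is assumed to be a power of two.

```python
def crossbar_switches(n: int) -> int:
    """Crosspoints in a full n x n crossbar (slide: n^2 switches)."""
    return n * n


def multistage_levels(n: int) -> int:
    """Logic levels for a Benes-style network, using the slide's 2*log2(n) estimate."""
    return 2 * (n.bit_length() - 1)      # n is assumed to be a power of two


def benes_2x2_switches(n: int) -> int:
    """2x2 switching elements in an n-input Benes network (textbook figure)."""
    stages = 2 * (n.bit_length() - 1) - 1
    return (n // 2) * stages


if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(n, crossbar_switches(n), benes_2x2_switches(n), multistage_levels(n))
```

For a 128-bit permutation this gives 16,384 crosspoints for the crossbar against 14 logic levels for the multistage network, which is the tension the hybrid design on the next slide tries to resolve.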

25
Permutation Unit
  • The Benes Network
  • Our design
  • Replace each middle Benes stage with equivalent
    crossbar.
  • E.g., each of the eight 16x16 Benes networks
    in a 128-bit network is replaced by a 16x16
    crossbar:
  • Mappings within 8-bit words now supported.
  • Latency reduced: 14 logic levels reduced to 5.
  • Modest Area overhead incurred.

26
Permutation Unit
  • Overall architecture of Permutation Unit

27
Requirements of Application Domain
  • Memory Requirements
  • Symmetric-Key Ciphers
  • Table Lookups for implementing S-Boxes. Most
    Common Configurations include
  • 8-to-8 (e.g. Rijndael)
  • 4-to-4 (e.g. Serpent)
  • 6-to-4 (e.g. DES)
  • 8-to-32 (e.g. MARS)
  • Asymmetric-Key Ciphers (for chosen method)
  • Primarily for storage of input values and
    buffering of intermediate results.

28
Configurable Memory Block
  • Overall architecture of Configurable Memory Block

29
Requirements of Application Domain
  • Interconnect Requirements
  • Symmetric-Key Ciphers
  • Implies hierarchical interconnect organization
  • Straightforward producer-to-consumer data transfer
    b/w functional blocks (i.e. rounds, or
    sub-operations within rounds). Requires
  • Simple interconnect, with
  • Coarse-granularity
  • Diverse communication requirements within
    functional blocks. Requires
  • High Flexibility,
  • Finer Granularity.
  • Asymmetric-Key Ciphers (for chosen method)
  • Straightforward producer-to-consumer
    communication
  • Coarse-Granularity
  • Few global bitwise interconnect wires

30
Reconfigurable Interconnect Design
  • Overall Interconnect Design
  • Two levels of hierarchy
  • Top Level
  • Simple 4-NN Local Interconnect, augmented with a
  • Locality-aware Global Interconnect: Binary
    Fat-tree
  • 128-bit granularity
  • Group Level
  • Crossbar Interconnect,
  • Interconnects Logic, Permute, Memory blocks, and
    I/O ports.
  • 32-bit granularity
  • Interconnect design depends on interaction b/w
    Reconfigurable Elements.
  • Interactions identified by mapping ciphers to
    elements.

31
Reconfigurable Interconnect Design
  • Reconfigurable Logic Group Architecture (CGA2)
  • 8 permute units, 4 memory blocks, 6 logic blocks.
  • Designed based on study of cipher mappings to
    reconfigurable elements.
  • Crossbar Area estimate
  • (24) x (48) x 32 = 36,864 < 49,152 << 76,800
    switches.

32
Reconfigurable Interconnect Design
  • Hierarchical, Locality aware global interconnect
  • 4-NN connections,
  • Fat-tree.

33
System Organization
34
Mapping of Algorithms
  • Rijndael,
  • Our Mapping

35
Mapping of Algorithms
  • Long Integer Modular Multiplication
  • Chosen implementation

36
Mapping of Algorithms
  • Long Integer Modular Multiplication
  • Our mapping

37
Functional Simulation Results
  • AES (Rijndael)
  • Implemented a single round of AES,
  • Verified using test vectors provided by NIST
  • Plaintext: 0x000102030405060708090A0B0C0D0E0F
  • Key: 0x000102030405060708090A0B0C0D0E0F
  • Expected Round Output: 0xB5C9179EB1CC1199B9C51B92B5C8159D
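
The round output quoted above can be reproduced with a small software model. The sketch below is an illustrative reference model, not the thesis's hardware mapping: it applies the initial AddRoundKey, then SubBytes, ShiftRows, MixColumns, and AddRoundKey with the first expanded round key. The S-box is generated from its definition rather than stored as a table, and all function names are assumptions.

```python
def xtime(b: int) -> int:
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def sbox_entry(a: int) -> int:
    """Multiplicative inverse (0 maps to 0) followed by the AES affine transform."""
    inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
    r = inv
    for s in range(1, 5):
        r ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
    return r ^ 0x63


SBOX = [sbox_entry(a) for a in range(256)]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def shift_rows(s: bytes) -> bytes:
    # Column-major state: byte r + 4*c is row r, column c; row r rotates left by r.
    return bytes(s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


def mix_columns(s: bytes) -> bytes:
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return bytes(out)


def round_key_1(key: bytes) -> bytes:
    """First expanded round key of AES-128."""
    w = [key[4 * i:4 * i + 4] for i in range(4)]
    t = w[3][1:] + w[3][:1]                      # RotWord
    t = bytes(SBOX[b] for b in t)                # SubWord
    prev = bytes([t[0] ^ 0x01]) + t[1:]          # XOR with Rcon[1]
    out = []
    for i in range(4):
        prev = xor(w[i], prev)
        out.append(prev)
    return b"".join(out)


def aes_first_round(plaintext: bytes, key: bytes) -> bytes:
    state = xor(plaintext, key)                  # initial AddRoundKey
    state = bytes(SBOX[b] for b in state)        # SubBytes
    state = mix_columns(shift_rows(state))       # ShiftRows, MixColumns
    return xor(state, round_key_1(key))          # AddRoundKey (round 1)


if __name__ == "__main__":
    block = bytes(range(16))
    print(aes_first_round(block, block).hex().upper())
```

Because the plaintext equals the key here, the state entering SubBytes is all zeros, so this vector mainly exercises the S-box and key expansion; ShiftRows and MixColumns need additional vectors to be checked meaningfully.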

38
Functional Simulation Results
  • Long Integer Modular Multiplication
  • Implemented a 255-bit long integer modular
    multiplier using our blocks.
  • Generated own test vectors.

39
Synthesis Results
  • Synthesized for
  • LSI 10k standard cell library (0.6 micron
    technology)
  • Using Synopsys Design Compiler
  • Operating Conditions
  • WCCOM
  • Process variation Index: 1.5
  • Temperature: 70
  • Voltage: 4.75 V
  • Interconnect Model: worst_case_tree
  • BCCOM
  • Process variation Index: 0.6
  • Temperature: 0
  • Voltage: 5.25 V
  • Interconnect Model: best_case_tree
  • 05x05 Wire Load Model for Logic, Permute and
    Memory Blocks
  • wire_load("05x05")
  • resistance: 0
  • capacitance: 1
  • area: 0
  • slope: 0.186
  • fanout_length(1, 0.39)
  • 50x50 Wire Load Model for the Crossbar
    Interconnect
  • wire_load("50x50")
  • resistance: 0
  • capacitance: 1
  • area: 0
  • slope: 1.218
  • fanout_length(1, 1.8)

40
Synthesis Results
  • We use the BCCOM results for our comparison below

41
Comparison with Published Results
  • Group Area requirements
  • Group Area = Permute Area + Logic Area + Memory Area
    + Interconnect Area
    = (8 x 21,256) + (6 x 8,100) + (4 x 19,251)
      + (1.5 x 3 x 39,104)
    = 170,048 + 48,600 + 77,004 + 175,968
    = 471,620 transistors.
  • Transistor count per Group: approx. 0.5 million
  • 36.1% for the 8 Permute Units,
  • 37.3% for the 24x48 Crossbar Interconnect,
  • 16.3% for the 4 Memory Blocks, and
  • 10.3% for the 6 Logic Blocks
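
The group total and the percentage breakdown follow directly from the per-block transistor counts. The short calculation below reproduces them; the per-block figures and multipliers are the ones on the slide, while the dictionary layout and names are only illustrative.

```python
# Transistor counts per group, using the per-block figures from the slide.
group = {
    "permute units (8)": 8 * 21_256,
    "logic blocks (6)": 6 * 8_100,
    "memory blocks (4)": 4 * 19_251,
    "crossbar interconnect": 1.5 * 3 * 39_104,
}

total = sum(group.values())
shares = {name: round(100 * count / total, 1) for name, count in group.items()}

if __name__ == "__main__":
    print(f"total: {total:,.0f} transistors")
    for name, share in shares.items():
        print(f"{name}: {share}%")
```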

42
Rijndael Results Comparison
43
Comparison with Published Results
  • AES Encryption
  • Throughput Comparison
  • Results compete well with FPGA implementations
  • Despite being on 0.6 micron standard cell
    technology.
  • Full Custom implementation on recent fabrication
    process should provide even better throughput.
  • Area Comparison
  • Direct comparison with FPGAs not possible.
  • Poor utilization of reconfigurable resources.
  • Area overhead: 30x that of ASIC
  • Primarily due to very high memory requirements of
    Rijndael

44
RSA Results Comparison
45
Comparison with Published Results
  • RSA Encryption
  • Throughput Results
  • 3 times worse than FPGA
  • Full-Custom Implementation may improve situation
    somewhat.
  • Primary reason: delay incurred by 2 Permute
    Units on critical path.
  • Area Comparison
  • Direct comparison with FPGA not possible.
  • Good resource utilization
  • Possible to implement cheaply on single IC.
  • Area overhead: 14x that of ASIC
  • 10x acceptable for reconfigurable device

46
Discussion
  • Need to make improvements
  • Area requirements of Rijndael
  • Throughput of LIMM.
  • Proposed improvements
  • Configurable Memory Group
  • Replace 1 of every 4 Groups with a Memory group
  • Dramatic increase in area efficiency of Rijndael,
    but 25% drop in area efficiency of all other
    ciphers.
  • Pre-aligned CSA outputs
  • Eliminate need to shift long integers using
    permute units
  • Shift-Before-Registering of Logic block outputs
  • Optimize logic block for delay

47
Conclusion and Future Work
  • Results are promising
  • Better than other Domain-specific reconfigurable
    architectures.
  • Still room for improvement.
  • Future work
  • Functional simulation of other key ciphers
  • Implement proposed enhancements
  • Evaluate potential for incorporating ECC support
  • Development of Control elements and Programming
    Model
  • With support for Run-time Reconfiguration
  • Full custom implementation of reconfigurable
    elements.
  • And estimation of area/performance
  • Explore potential for supporting other
    application domains at minimal additional cost.

48
Thank You
49
Mapping of Algorithms
  • Rijndael Contd,
  • Combinatorial S-boxes
  • Lots of equations similar to MixColumn
  • Also Mapped

50
Mapping of Algorithms
  • Rijndael Contd,
  • Combinatorial S-boxes
  • Map and Map^-1 operations.

Map Operation                       Map^-1 Operation
aH3 = (a5 ⊕ a7)                     a7 = (aH0 ⊕ aH1) ⊕ (aL2 ⊕ aH3)
aH2 = (a5 ⊕ a7) ⊕ (a2 ⊕ a3)         a6 = (aL1 ⊕ aH3) ⊕ (aL2 ⊕ aL3) ⊕ aH0
aH1 = (a4 ⊕ a6) ⊕ (a1 ⊕ a7)         a5 = (aH0 ⊕ aH1) ⊕ aL2
aH0 = (a4 ⊕ a6) ⊕ (a5)              a4 = (aL1 ⊕ aH3) ⊕ (aH0 ⊕ aH1) ⊕ aL3
aL3 = (a2 ⊕ a4)                     a3 = (aH0 ⊕ aH1) ⊕ (aL1 ⊕ aH2)
aL2 = (a1 ⊕ a7)                     a2 = (aL1 ⊕ aH3) ⊕ (aH0 ⊕ aH1)
aL1 = (a1 ⊕ a2)                     a1 = (aH0 ⊕ aH1) ⊕ aH3
aL0 = (a4 ⊕ a6) ⊕ (a0 ⊕ a5)         a0 = (aL0 ⊕ aH0).
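
As a sanity check, the Map and Map^-1 equations on this slide can be evaluated bit by bit, reading the operator between terms as XOR (the usual choice for such GF(2) bit equations). The sketch below assumes a0 is the least significant bit and that aH and aL are the high and low 4-bit halves of the mapped value; the function names are illustrative.

```python
def _bits(v: int, n: int) -> list[int]:
    return [(v >> i) & 1 for i in range(n)]


def _pack(bits: list[int]) -> int:
    return sum(b << i for i, b in enumerate(bits))


def gf_map(a: int) -> tuple[int, int]:
    """Map a byte to the pair (aH, aL) of 4-bit values."""
    a0, a1, a2, a3, a4, a5, a6, a7 = _bits(a, 8)
    t17, t57, t46 = a1 ^ a7, a5 ^ a7, a4 ^ a6              # shared sub-terms
    low = [t46 ^ a0 ^ a5, a1 ^ a2, t17, a2 ^ a4]           # aL0..aL3
    high = [t46 ^ a5, t46 ^ t17, t57 ^ a2 ^ a3, t57]       # aH0..aH3
    return _pack(high), _pack(low)


def gf_map_inv(high: int, low: int) -> int:
    """Inverse mapping from (aH, aL) back to a byte."""
    h0, h1, h2, h3 = _bits(high, 4)
    l0, l1, l2, l3 = _bits(low, 4)
    hx, lx = h0 ^ h1, l1 ^ h3                              # shared sub-terms
    return _pack([
        l0 ^ h0,                # a0
        hx ^ h3,                # a1
        lx ^ hx,                # a2
        hx ^ l1 ^ h2,           # a3
        lx ^ hx ^ l3,           # a4
        hx ^ l2,                # a5
        lx ^ l2 ^ l3 ^ h0,      # a6
        hx ^ l2 ^ h3,           # a7
    ])


if __name__ == "__main__":
    ok = all(gf_map_inv(*gf_map(a)) == a for a in range(256))
    print("round trip holds for all bytes:", ok)
```

The shared sub-terms mirror the parenthesised pairs in the equations, which is also where a hardware mapping would reuse XOR outputs.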
51
Mapping of Algorithms
  • Rijndael Contd,
  • Combinatorial S-boxes
  • Affine and Affine^-1 operations.

The Affine Transform                     The Inverse Affine Transform
q7 = (a4 ⊕ a5) ⊕ (a6 ⊕ a7) ⊕ a3          q7 = (a1 ⊕ a4) ⊕ a6
q6 = (a2 ⊕ a3) ⊕ (a4 ⊕ a5) ⊕ (a6 ⊕ 1)    q6 = (a0 ⊕ a5) ⊕ a3
q5 = (a2 ⊕ a3) ⊕ (a4 ⊕ a5) ⊕ (a1 ⊕ 1)    q5 = (a2 ⊕ a7) ⊕ a4
q4 = (a0 ⊕ a1) ⊕ (a2 ⊕ a3) ⊕ a4          q4 = (a3 ⊕ a6) ⊕ a1
q3 = (a0 ⊕ a1) ⊕ (a2 ⊕ a3) ⊕ a7          q3 = (a0 ⊕ a5) ⊕ a2
q2 = (a0 ⊕ a1) ⊕ (a6 ⊕ a7) ⊕ a2          q2 = (a1 ⊕ a4) ⊕ (a7 ⊕ 1)
q1 = (a0 ⊕ a1) ⊕ (a6 ⊕ a7) ⊕ (a5 ⊕ 1)    q1 = (a3 ⊕ a6) ⊕ a0
q0 = (a4 ⊕ a5) ⊕ (a6 ⊕ a7) ⊕ (a0 ⊕ 1)    q0 = (a2 ⊕ a7) ⊕ (a5 ⊕ 1).
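
The affine equations on this slide can be checked the same way, with the terms combined by XOR and a0 taken as the least significant bit. The following sketch evaluates both transforms bit by bit; the function names are illustrative. The value 0xCA used in the example is the GF(2^8) inverse of 0x53, so its affine image should be the S-box entry for 0x53.

```python
def _bits8(a: int) -> list[int]:
    return [(a >> i) & 1 for i in range(8)]


def _pack(bits: list[int]) -> int:
    return sum(b << i for i, b in enumerate(bits))


def affine(a: int) -> int:
    """AES affine transform (second half of SubBytes), written per output bit."""
    a0, a1, a2, a3, a4, a5, a6, a7 = _bits8(a)
    lo, mid, hi, top = a0 ^ a1, a2 ^ a3, a4 ^ a5, a6 ^ a7   # shared pairs
    return _pack([
        hi ^ top ^ a0 ^ 1,    # q0
        lo ^ top ^ a5 ^ 1,    # q1
        lo ^ top ^ a2,        # q2
        lo ^ mid ^ a7,        # q3
        lo ^ mid ^ a4,        # q4
        mid ^ hi ^ a1 ^ 1,    # q5
        mid ^ hi ^ a6 ^ 1,    # q6
        hi ^ top ^ a3,        # q7
    ])


def affine_inv(a: int) -> int:
    """Inverse affine transform (first half of InvSubBytes)."""
    a0, a1, a2, a3, a4, a5, a6, a7 = _bits8(a)
    return _pack([
        a2 ^ a7 ^ a5 ^ 1,     # q0
        a3 ^ a6 ^ a0,         # q1
        a1 ^ a4 ^ a7 ^ 1,     # q2
        a0 ^ a5 ^ a2,         # q3
        a3 ^ a6 ^ a1,         # q4
        a2 ^ a7 ^ a4,         # q5
        a0 ^ a5 ^ a3,         # q6
        a1 ^ a4 ^ a6,         # q7
    ])


if __name__ == "__main__":
    print(hex(affine(0x00)), hex(affine(0xCA)))
```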
52
Mapping of Algorithms
  • Rijndael Contd,
  • Combinatorial S-boxes
  • GF(2^4) Multiplication

q  | t      u      v      w      x      y      z
---+------------------------------------------------
q3 | a0.b0  a3.b1  a2.b2  a1.b3  0      0      0
q2 | a1.b0  a0.b1  a3.b2  a2.b3  a3.b1  a2.b2  a1.b3
q1 | a2.b0  a1.b1  a0.b2  a3.b3  0      a3.b2  a2.b3
q0 | a3.b0  a2.b1  a1.b2  a0.b3  0      0      a3.b3
53
Mapping of Algorithms
  • IDEA Contd
  • Addition Modulo 2^16
  • XOR of 16-bit values
  • Multiplication Modulo 2^16 + 1 -> more tricky
  • We use the Modified Low-High Algorithm from [3]
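
For readers unfamiliar with the low-high idea, here is a minimal sketch of the standard low-high multiplication modulo 2^16 + 1 as commonly used in IDEA software, where the 16-bit value 0 stands for 2^16. It is not the modified variant from the cited reference, whose details are not given on the slide; the function names are illustrative.

```python
def idea_mul(a: int, b: int) -> int:
    """Multiply modulo 2^16 + 1 using the low-high trick (0 encodes 2^16)."""
    if a == 0:                      # 2^16 is congruent to -1, so the result is -b
        return (1 - b) & 0xFFFF
    if b == 0:
        return (1 - a) & 0xFFFF
    p = a * b
    lo, hi = p & 0xFFFF, p >> 16    # p = hi * 2^16 + lo, congruent to lo - hi
    return (lo - hi + (lo < hi)) & 0xFFFF


def idea_mul_reference(a: int, b: int) -> int:
    """Direct definition, used only to check the trick."""
    x = ((a or 0x10000) * (b or 0x10000)) % 0x10001
    return x & 0xFFFF               # maps 2^16 back to the encoding 0


if __name__ == "__main__":
    samples = [0, 1, 2, 3, 255, 256, 0x8000, 0xFFFE, 0xFFFF]
    print(all(idea_mul(a, b) == idea_mul_reference(a, b)
              for a in samples for b in samples))
```

The trick works because 2^16 is congruent to -1 modulo 2^16 + 1, so the 32-bit product reduces to a 16-bit subtraction and a conditional increment rather than a division.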

54
Mapping of Algorithms
  • IDEA Contd,
  • Our Mapping
  • The unsigned integer Multiplier

55
Mapping of Algorithms
  • IDEA Contd,
  • Our Mapping
  • The Remaining Logic (multiplexers, RCAs etc.)

56
Mapping of Algorithms
  • Serpent Contd
  • Mapping fairly simple

57
Mapping of Algorithms
  • DES

58
Mapping of Algorithms
  • DES Contd
  • Our mapping

59
Configurable Memory Block
  • Our design
  • 2-kilobits of LUT data per Configurable Memory
    Block
  • Most commonly required memory size.
  • Composed of 32 4-to-4 LUTs
  • Minimum LUT granularity required by any
    symmetric-key cipher.
  • Configurable interconnect at the Address-in and
    Data-out ports of each 4-to-4 LUT
  • 4-bit granularity
  • Supports the commonly required LUT
    configurations, as well as a 4x128 configuration
    for data input.
  • Possible to add support for even more LUT
    configurations, at the cost of area.
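
The 2-kilobit figure and its relation to the common S-box shapes can be worked out with simple arithmetic. The sketch below is a storage-capacity estimate only: it counts bits and ignores the address and data interconnect constraints that decide which configurations a real block supports. The names and the table of cipher examples (taken from the memory-requirements slide) are illustrative.

```python
LUT_BITS = (1 << 4) * 4           # one 4-to-4 LUT: 16 entries x 4 bits
BLOCK_BITS = 32 * LUT_BITS        # 32 LUTs per Configurable Memory Block


def table_bits(addr_bits: int, data_bits: int) -> int:
    """Storage needed for one addr_bits-to-data_bits lookup table."""
    return (1 << addr_bits) * data_bits


def blocks_needed(addr_bits: int, data_bits: int) -> int:
    """Memory blocks needed for one table, by capacity alone (ceiling division)."""
    return -(-table_bits(addr_bits, data_bits) // BLOCK_BITS)


if __name__ == "__main__":
    shapes = {"Rijndael": (8, 8), "Serpent": (4, 4), "DES": (6, 4), "MARS": (8, 32)}
    print("block capacity:", BLOCK_BITS, "bits")
    for cipher, (a, d) in shapes.items():
        print(cipher, f"{a}-to-{d}", table_bits(a, d), "bits,",
              blocks_needed(a, d), "block(s)")
```

By this count a single 8-to-8 Rijndael S-box fills an entire block, which is consistent with the high memory area reported for Rijndael in the results.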

60
Reconfigurable Interconnect Design
  • Initial Architecture
  • Focus on Nearest Neighbor Interconnect
  • 128-bit in and out ports to at least 4-NN
  • Issues
  • Lack of Locality Awareness
  • Potential for Inflexibility when mapping
    complicated structures.

61
Reconfigurable Interconnect Design
  • Original Group Architecture
  • Key Issues
  • Very large Interconnect Area Requirements
  • Poor Utilization of interconnect
  • Crossbar Area estimate
  • In-ports x out-ports x granularity
  • (24 + 16) x (44 + 16) x 32 = 76,800 switches.
  • Also consider
  • During algorithm mapping, discovered the need to
    separate permute units from logic block outputs.
  • This incurred yet another crossbar, from logic
    units to permute units.
  • Thus, area issue is critical to architecture
    efficiency.

62
Reconfigurable Interconnect Design
  • Methods of Reducing Area of Intra-Group
    Interconnect
  • Reduce ports in crossbar.
  • By reducing number of components in group.
  • Reducing the number of NN ports.
  • By sharing ports between elements.
  • Increase Granularity, e.g. from 32- to 64-bits
    (effectively halving number of switches for same
    number of ports)
  • Others...
  • All these approaches will affect the
    functionality and flexibility of the
    architecture.
  • Must be done cleverly
  • The following methods were developed by extensive
    mapping and remapping of algorithms to find the
    best solutions
  • Share ports between Permute Units and Memory
    Block.
  • Use memory block as configuration store for
    Permute Units.
  • Restrict number of NN ports in Group x-bar to
    256-bits in and 256-bits out, by selecting via
    configuration the required NN ports.
  • Resulted in two possible organizations
  • that reduce area without overly compromising
    functionality.

63
Reconfigurable Interconnect Design
  • Candidate Group Architecture 1
  • 4 permute units, 4 memory blocks, 4 logic blocks.
    Requires 256-bit bypass paths as well.
  • Provides less functionality than CGA 2, selection
    dependent on area characteristics.
  • Crossbar Area estimate
  • (16) x (16 + 8 + 8) x 32 + (16 + 8 + 8) x (32) x 32
    = 49,152 switches << 76,800 switches.
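
The three crossbar estimates quoted in the interconnect slides all follow the same rule (in-ports x out-ports x granularity) and can be compared side by side. The sketch below recomputes them, reading the port counts as sums (24 + 16, 44 + 16, 16 + 8 + 8), which is the reading that reproduces the stated totals; the function and variable names are illustrative.

```python
def crossbar_switches(in_ports: int, out_ports: int, granularity: int = 32) -> int:
    """Switch estimate for one crossbar: in-ports x out-ports x granularity."""
    return in_ports * out_ports * granularity


# Original group architecture: a single large crossbar.
original = crossbar_switches(24 + 16, 44 + 16)

# Candidate Group Architecture 1: two smaller crossbars.
cga1 = crossbar_switches(16, 16 + 8 + 8) + crossbar_switches(16 + 8 + 8, 32)

# Candidate Group Architecture 2 (the 24x48 crossbar).
cga2 = crossbar_switches(24, 48)

if __name__ == "__main__":
    for name, count in (("original", original), ("CGA 1", cga1), ("CGA 2", cga2)):
        print(f"{name}: {count:,} switches ({100 * count / original:.0f}% of original)")
```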