Cryptographic Algorithms and their Implementations

Provided by: AliMusta2
1
Cryptographic Algorithms and their Implementations
  • Discussion of how to map different algorithms to
    our architecture
  • Public-Key Algorithms (Modular Exponentiation)
  • Rijndael
  • Serpent
  • Others (Mars, RC6, Twofish, etc.)

2
Modular Exponentiation
  • Square and Multiply Algorithm for Modular
    Exponentiation
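
The slide's figure is not preserved in this transcript. As a minimal sketch of the square-and-multiply idea, assuming the right-to-left (least-significant-bit first) variant that the later slides use, one might write the following. The function name is illustrative and not taken from the slides.

```python
def square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    """Right-to-left binary exponentiation: base**exponent mod modulus."""
    result = 1 % modulus          # running product P
    square = base % modulus       # running square Z = base^(2^i)
    while exponent > 0:
        if exponent & 1:          # current exponent bit e_i is 1
            result = (result * square) % modulus
        square = (square * square) % modulus
        exponent >>= 1            # move on to the next exponent bit
    return result
```

Each exponent bit costs one squaring, plus one multiplication when the bit is set, which is why the speed of the modular multiplier dominates overall performance.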

3
Modular Exponentiation
  • Montgomery Modular Multiplication
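
The slide's figure is not preserved here either. A minimal sketch of bit-serial (radix-2) Montgomery multiplication, assuming an odd modulus M, operands below M, and m iterations with 2^m > M, could look like this. The names are illustrative only.

```python
def mont_mult(a: int, b: int, modulus: int, m: int) -> int:
    """Return a * b * 2^(-m) mod modulus (modulus odd, a and b < modulus < 2^m)."""
    s = 0
    for i in range(m):
        a_i = (a >> i) & 1                 # scan a one bit at a time
        q = (s + a_i * b) & 1              # q makes the next sum even
        s = (s + a_i * b + q * modulus) >> 1   # exact division by 2
    # s is below 2 * modulus here, so one conditional subtraction suffices
    return s - modulus if s >= modulus else s
```

The division by 2 replaces the trial division of ordinary modular reduction, at the price of an extra factor 2^(-m) that is cancelled by working with operands pre-multiplied by 2^m.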

4
Modular Exponentiation
  • Several Approaches to implementing Modular
    Multiplication
  • Redundant Representation based (e.g. Carry-save)
  • Residue Number System based.
  • Systolic Array Based.
  • Word-based implementations are preferable, due to
    their similarity with symmetric-key algorithms
  • This rules out systolic arrays

5
Modular Exponentiation
  • Most popular and fastest were Carry-Save
    representation based implementations.
  • Carry-save based were also word-oriented.
  • We selected fastest, simplest implementation
  • Extremely beneficial to have simplicity and
    homogeneity in algorithms when designing a custom
    reconfigurable fabric.
  • Performance when implemented on Xilinx Virtex
    FPGAs: almost 5 Mb/s (the highest reported figure
    that we could find)

6
Modular Exponentiation
  • Five-to-two Multiplier Modular Exponentiation (P,
    E, M)
  • K = 2^(2k) mod M, computed externally
  • 1. P1[0], P2[0] = 5to2_MontMult(K, 0, 1, 0, M),
  •    Z1[0], Z2[0] = 5to2_MontMult(K, 0, P, 0, M)
  • 2. FOR i = 0 to n-1 DO
  • 3.   Z1[i+1], Z2[i+1] = 5to2_MontMult(Z1[i], Z2[i],
         Z1[i], Z2[i], M)
  • 4.   IF e[i] = 1 THEN
  •        P1[i+1], P2[i+1] = 5to2_MontMult(P1[i], P2[i],
           Z1[i], Z2[i], M)
  •      ELSE
  •        P1[i+1], P2[i+1] = P1[i], P2[i]
  • 5. ENDFOR
  • 6. P1[n], P2[n] = 5to2_MontMult(1, 0, P1[n-1],
       P2[n-1], M)
  • 7. P = P1[n] + P2[n]
  • 8. RETURN P

7
Modular Exponentiation
  • Five-to-two CSA Montgomery Multiplication (A1 ,
    A2 , B1 , B2 , M)
  • 1. S1[0], S2[0] = 0, 0
  • 2. FOR i = 0 to m-1 DO
  • 3.   q[i] = (S1[i] + S2[i] + A[i](B1 + B2)) mod 2
  • 4.   S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + A[i](B1 + B2)
         + q[i]M) div 2
  • 5. ENDFOR
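
The two algorithms above (exponentiation on carry-save pairs and the five-to-two multiplier) can be modelled in software as a rough sketch. The assumptions here are mine, not the slides': Python integers stand in for the hardware words, the five-to-two compressor is built from three 3-to-2 carry-save adders, the iteration count is m = bit length of M plus 2 so that intermediate pairs stay below 2M, the constant is 2^(2m) mod M, and the base is smaller than the odd modulus M. All function names are hypothetical.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def compress_5_to_2(v0, v1, v2, v3, v4):
    """Reduce five operands to a sum/carry pair with the same total."""
    s, c = csa(v0, v1, v2)
    s, c = csa(s, c, v3)
    return csa(s, c, v4)

def mont_mult_5to2(a1, a2, b1, b2, modulus, m):
    """Carry-save pair (s1, s2) with s1 + s2 = (a1+a2)(b1+b2) 2^(-m) mod M."""
    a = a1 + a2  # hardware derives the bits A_i on the fly; modelled by a plain add
    s1, s2 = 0, 0
    for i in range(m):
        a_i = (a >> i) & 1
        t1, t2 = (b1, b2) if a_i else (0, 0)      # A_i * (B1 + B2), kept as a pair
        q = (s1 ^ s2 ^ t1 ^ t2) & 1               # parity of the sum's low bit
        s1, s2 = compress_5_to_2(s1, s2, t1, t2, modulus if q else 0)
        s1, s2 = s1 >> 1, s2 >> 1                 # both low bits are 0: exact div 2
    return s1, s2

def mod_exp_5to2(base, exponent, modulus):
    """base**exponent mod modulus, following steps 1 to 8 of the slide."""
    m = modulus.bit_length() + 2                  # assumption: keeps pairs < 2M
    k_const = pow(2, 2 * m, modulus)              # computed externally in the slides
    p1, p2 = mont_mult_5to2(k_const, 0, 1, 0, modulus, m)
    z1, z2 = mont_mult_5to2(k_const, 0, base, 0, modulus, m)
    for i in range(exponent.bit_length()):
        if (exponent >> i) & 1:                   # uses Z[i], before it is squared
            p1, p2 = mont_mult_5to2(p1, p2, z1, z2, modulus, m)
        z1, z2 = mont_mult_5to2(z1, z2, z1, z2, modulus, m)
    p1, p2 = mont_mult_5to2(1, 0, p1, p2, modulus, m)
    return (p1 + p2) % modulus                    # final carry-propagate add
```

The only carry-propagating additions in this model are the recombination of A and the final P1 + P2; everything inside the loop stays in carry-save form, which is the property that makes the approach fast in hardware.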

8
Modular Exponentiation
  • Their Implementation of MM

9
Modular Exponentiation
  • Implementing MM on our design

10
Modular Exponentiation
  • Each of the 64-CSA blocks maps to a single basic
    block
  • Outputs of the last basic block are registered.
  • qi is generated by random-logic block at the
    second basic-block
  • Broadcast to all groups
  • Ai is generated in a similar manner, utilizing
    two more basic-blocks
  • Also broadcast to all groups

11
Modular Exponentiation
  • Efficient and scalable mapping to our design
  • 1024-bit RSA will need to use 16 groups, while
    2048-bit will use 32, and 4096-bit will use 64
    groups
  • Primary concern: clock rate may be limited by
    bit-broadcasts of qi and Ai
  • Potential impediment to scalability
  • We are exploring methods for pipelining these
    broadcasts as well, to improve clock rate and
    scalability.
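
The group counts quoted above are consistent with each group handling 64 operand bits. That figure is inferred from the numbers on the slide rather than stated there, so the constant below is an assumption.

```python
BITS_PER_GROUP = 64  # assumption: inferred from 1024 bits -> 16 groups

def groups_needed(modulus_bits: int) -> int:
    """Number of groups for a given RSA modulus size (rounded up)."""
    return -(-modulus_bits // BITS_PER_GROUP)  # ceiling division
```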

12
Rijndael
  • Primary operations
  • Sub-Bytes
  • Shift-Rows
  • Mix-Columns
  • Add-Round-Key

13
Rijndael
  • Representation of data: 128-bit state.

14
Rijndael
  • Add-Round-Key
  • Simple 128-bit XOR operation: uses 1 basic-block
  • Sub-Bytes
  • Simple operation: byte-wise table lookup from
    S-Box
  • Each S-box is 2 kbits (256 entries x 8 bits).
  • 16 parallel S-boxes required!
  • No basic-blocks required, but ALL memory-blocks
    required!
  • Shift-Rows
  • Simple operation: 4 x 32-bit permutations
  • Uses only 1 basic-block
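
As a software sketch of the two single-block operations above, assuming the usual Rijndael byte order in which the 16-byte state is stored column by column (byte index = row + 4 x column), Add-Round-Key and Shift-Rows might be written as follows. The function names are illustrative.

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """128-bit XOR of the state with the round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))

def shift_rows(state: bytes) -> bytes:
    """Rotate row r of the 4x4 state left by r positions (a fixed permutation)."""
    return bytes(state[r + 4 * ((c + r) % 4)]
                 for c in range(4) for r in range(4))
```

Both are pure wiring or bitwise logic with no data-dependent control, which fits the claim that each needs only one basic-block.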

15
Rijndael
  • Mix-Columns
  • Somewhat complicated: can be implemented using
    table lookups, but we are out of memory!
  • Alternative implementation

16
Rijndael
  • Mix-Columns
  • Operation may be expressed in terms of xtime()
    function
  • Mix-columns implementation requires xtime()
    operation on each byte, followed by 4 XOR
    operations
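
The slide's formula is not preserved in this transcript. As a sketch of how Mix-Columns reduces to xtime() plus XORs, assuming the standard Rijndael column matrix (02 03 01 01, rotated per row) and the standard reduction polynomial 0x11B, one column can be computed as below. The names are illustrative.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b

def mix_column(col):
    """Mix one 4-byte column: xtime() on each byte, then 4 XORs per output byte."""
    x = [xtime(a) for a in col]
    return [x[i] ^ x[(i + 1) % 4] ^ col[(i + 1) % 4]
            ^ col[(i + 2) % 4] ^ col[(i + 3) % 4]
            for i in range(4)]
```

Multiplication by 03 is folded in as xtime(a) XOR a, which is why each output byte needs exactly four XOR operations once the xtime() values are available.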

17
Rijndael
  • Mix-Columns
  • In order to efficiently implement xtime(), we
    modified it this way
  • In this form, only 2 basic-blocks are needed to
    apply xtime() to all 16 bytes
  • A single basic-block will take the 128-bit data
    as input, and generate the xtime() mask
    (0, 0, 0, 0, x7, x7, 0, x7) for each of the 16
    bytes at the permute unit, where x7 is the most
    significant bit of that byte.
  • Another basic-block will now first perform the
    XOR operation, followed by a left shift (and
    substitute LSB with x7) at the permute unit.
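
The reordered xtime() described above (mask, XOR, then shift with x7 inserted as the new LSB) can be checked against the textbook definition with a short sketch. The function names are mine; the mask value 0x0D is simply the bit pattern (0, 0, 0, 0, x7, x7, 0, x7) with x7 = 1.

```python
def xtime_reference(b: int) -> int:
    """Textbook xtime(): shift left, then reduce by 0x1B if a bit fell out."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b

def xtime_masked(b: int) -> int:
    """Slide's form: XOR with the x7-derived mask first, then shift in x7."""
    x7 = (b >> 7) & 1
    mask = 0x0D if x7 else 0x00       # bits (0,0,0,0,x7,x7,0,x7)
    return (((b ^ mask) << 1) & 0xFF) | x7
```

The two agree because 0x1B is 0x0D shifted left by one with its low bit set, so applying the mask before the shift and writing x7 into the LSB gives the same byte.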

18
Rijndael
  • Mix-Columns
  • After generating output from the xtime()
    function, 4 x 128-bit XOR operations need to be
    performed
  • 4 basic-blocks will be used
  • Note that the mix-column operation is carried out
    in parallel on all 4 columns.

19
Rijndael
  • Implementation summary
  • Only 8 basic-blocks required
  • 2 (1 each) for Add-Round-Key and Shift-Rows
  • 6 for Mix-Columns (2 for xtime(), 4 for XOR
    operations)
  • 16 memory-blocks required!
  • All memory blocks used up in a single round!
  • Inefficient implementation due to the
    memory-intensive nature of Rijndael
  • Only 10% logic used, versus 100% memory usage.

20
Rijndael
  • Potential Solutions
  • Add lots of memory!
  • At least 10 times more
  • Issues with memory placement
  • Consider memory-less implementations of Sub-Byte
  • Requires GF(2^8) constant multiplication and Inverse
    Affine Transforms
  • Currently under study as the more efficient and
    practical option.
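
To illustrate what a memory-less Sub-Byte involves, here is a sketch that computes the forward Rijndael S-box from its definition (inversion in GF(2^8) followed by the affine transform) instead of a table. It shows the arithmetic only and is not the hardware decomposition under study; the inversion by repeated multiplication is chosen for clarity, not efficiency, and the names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; maps 0 to 0 as Rijndael requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sub_byte(a: int) -> int:
    """Forward S-box: field inverse, then the affine transform with 0x63."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63
```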

21
Serpent
  • Substitution-permutation cipher comprised of
  • Key Mixing,
  • S-Box Substitution, and
  • Linear Transformation.
  • S-boxes 4 x 4 bit
  • 32 copies required each round
  • 16 x 4 x 32 = 2048 bits per round.

22
Serpent
  • The Linear Transformation step consists of
  • 8 fixed permute operations, and
  • 8 XOR operations
  • All operands are 32 bits wide
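
As a sketch of the eight fixed permutes (six rotates, two shifts) and eight XORs, the linear transformation can be written as below. The rotation and shift constants are those of the published Serpent specification as best I can reproduce them, so they should be checked against the specification before being relied on; the inverse is included only to show the step is reversible.

```python
MASK32 = 0xFFFFFFFF

def rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotr(x: int, n: int) -> int:
    return rotl(x, 32 - n)

def serpent_lt(x0, x1, x2, x3):
    """Serpent linear transformation on four 32-bit words."""
    x0 = rotl(x0, 13)
    x2 = rotl(x2, 3)
    x1 ^= x0 ^ x2
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 = rotl(x1, 1)
    x3 = rotl(x3, 7)
    x0 ^= x1 ^ x3
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 = rotl(x0, 5)
    x2 = rotl(x2, 22)
    return x0, x1, x2, x3

def serpent_lt_inverse(x0, x1, x2, x3):
    """Undo serpent_lt by reversing its steps."""
    x2 = rotr(x2, 22)
    x0 = rotr(x0, 5)
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 ^= x1 ^ x3
    x3 = rotr(x3, 7)
    x1 = rotr(x1, 1)
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 ^= x0 ^ x2
    x2 = rotr(x2, 3)
    x0 = rotr(x0, 13)
    return x0, x1, x2, x3
```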

23
Serpent
  • Serpent is an ideal match for our architecture
  • 8 x 32-bit fixed shifts and rotates can be easily
    implemented by the permute units of 2
    basic-blocks.
  • Additional 2 basic-blocks required to implement
    the 8 x 32-bit XOR operations.
  • 128-bit key mixing stage per round would require
    1 more basic-block
  • Total of 5 basic-blocks and 2kbits of memory
    required per round.
  • Each round perfectly fits in a single group of
    our architecture!
  • 16 of Serpent's 32 rounds may be unrolled in
    our architecture

24
Other Algorithms
  • DES
  • Implementation of a single round is trivial; a
    single group may implement multiple rounds!
  • Twofish
  • Complex structure, requires more time to define
    implementation on our architecture.
  • However, all its basic operations are directly
    supported.
  • RC6 and MARS
  • Involve complicated operations requiring special
    purpose logic
  • Data-dependent rotations
  • Multiplication modulo 2^32
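
For reference, the two operations named above look like this in software. The last function shows how RC6 combines them in its quadratic step for 32-bit words, as I understand the cipher's definition; the names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, r: int) -> int:
    """Rotate left by a data-dependent amount (only the low 5 bits of r count)."""
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK32

def mul_mod_2_32(a: int, b: int) -> int:
    """Multiplication modulo 2^32: keep the low 32 bits of the product."""
    return (a * b) & MASK32

def rc6_quadratic(b: int) -> int:
    """RC6-style step: (b * (2b + 1) mod 2^32) rotated left by 5."""
    return rotl32(mul_mod_2_32(b, (2 * b + 1) & MASK32), 5)
```

A variable rotate needs a barrel shifter and the multiply needs a full 32 x 32 multiplier array, neither of which falls out of fixed permutes and XORs.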

25
Other Algorithms
  • RC6 and MARS
  • This special-purpose logic was not incorporated
    because
  • Algorithms are more suitable for software
    implementations than in hardware
  • Lack of support and popularity of these
    algorithms
  • Addition of special-purpose logic would incur
    overhead beyond its own area, as additional
    supporting interconnect must be provided.

26
Comparison with Related Work
  • Although we cannot provide results based on
    empirical evaluation, we can present a logical
    framework for comparison of individual features
  • Through deductive reasoning, we identify what
    possible advantages one approach may have over
    the other, assuming all other factors normalized.

27
Comparison with Related Work
  • Comparison with FPGA based implementations
  • Area Efficiency
  • Use of basic gates instead of LUTs
  • Basic-blocks with limited flexibility, thus fewer
    configuration bits
  • Basic units (full adders) combined into clusters
    of 64, and programmed as a single entity:
    further savings in configuration memory elements
  • Performance
  • Use of basic gates instead of LUTs
  • Simpler Interconnect, with fewer routing-switches
  • Hierarchical organization: no long wires (except
    for bit-broadcast)
  • Far smaller configuration data required: faster
    reconfiguration time

28
Comparison with Related Work
  • Comparison with FPGA based implementations
  • Potential pitfalls
  • Design dedicates considerable amount of area to
    inter-block interconnect.
  • Until actual area can be quantified, we are
    unsure of area efficiency estimates.
  • Need to identify most suitable Performance/Area
    tradeoff.

29
Comparison with Related Work
  • Comparison with COBRA Architecture
  • Uses multiple copies of special-purpose logic
    blocks, coupled with extremely simple
    interconnect.

30
Comparison with Related Work
  • Comparison with COBRA Architecture
  • Low logic-utilization; we have more generic
    blocks
  • Fixed latency operations
  • Intermediate values registered only at RCE
    boundary.

31
Programming Methodology
  • Reconfigurable computing devices suffer from the
    following two critical issues:
  • Lack of a comprehensive programming model
  • Lack of hardware virtualization
  • The first issue reflects the difficulty of
    programming RC architectures such as FPGAs
  • The second issue concerns the exposure of hardware
    resource limitations to the programmer.

32
Programming Methodology
  • How COBRA deals with these issues
  • Essentially a special-purpose programmable
    architecture rather than a configurable one
  • VLIW like instructions alleviate some of the
    programming model related issues
  • Also resolve the virtualization aspect.

33
Programming Methodology
  • The programming methodology and the impact of the
    issues mentioned can be seen in terms of a
    spectrum

34
Programming Methodology
  • Programming model issue less severe for us
    because
  • Simple, highly specialized architecture
  • Hardware Virtualization is still a concern.

35
Programming Methodology
  • Programming model
  • Provide basic primitives that are supported by
    our architecture.
  • Programming is to be accomplished by expressing
    an algorithm using these primitives and
    interconnecting these primitives together using
    32-bit interconnect.
  • Mapping such a description onto our design should
    be a trivial software challenge.
  • Due to special purpose nature, primitives are
    limited in number and thus programming should be
    an easy task.

36
Programming Methodology
Programming Primitives
  • 32-bit Carry Save Adder
  • 32-bit XOR
  • 32-bit AND
  • 32-bit OR
  • 32, 64, and 128-bit Ripple Carry Adder
  • 32, 64, and 128-bit Fixed Shifts
  • 32-bit Rotates and random permutes.
  • 64-bit, 128-bit limited permutes (TBD).
  • ANDing 32-bit value with a single bit
  • 128-bit shift-register
  • Random bit-logic implementation, since each block
    is also capable of implementing
  • single 4-input function
  • two 3-input functions
  • four 2-input functions
  • 4 global bit-broadcast lines
  • 32-bit interconnect, point to point.
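
To give a feel for programming with these primitives, here is a sketch that models a few of them as 32-bit functions and wires them together into a three-operand add. The function names and the composition are hypothetical illustrations, not the architecture's actual programming interface.

```python
MASK32 = 0xFFFFFFFF

def xor32(a: int, b: int) -> int:
    """32-bit XOR primitive."""
    return (a ^ b) & MASK32

def and_bit(value: int, bit: int) -> int:
    """AND a 32-bit value with a single (broadcast) bit."""
    return value & MASK32 if bit & 1 else 0

def csa32(x: int, y: int, z: int):
    """32-bit carry-save adder: (sum, carry) with sum + carry = x + y + z mod 2^32."""
    s = (x ^ y ^ z) & MASK32
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK32
    return s, c

def rca32(a: int, b: int) -> int:
    """32-bit ripple-carry adder (result modulo 2^32)."""
    return (a + b) & MASK32

def add3(x: int, y: int, z: int) -> int:
    """Compose primitives: carry-save compress, then one carry-propagate add."""
    s, c = csa32(x, y, z)
    return rca32(s, c)
```

Expressing an algorithm then amounts to choosing such primitives and connecting their 32-bit inputs and outputs, which is the style of description the mapping software would consume.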

37
Conclusion: Work in Progress
  • The following areas of the design are still under
    consideration and not yet completely defined:
  • Configurable Memory-block Architecture
  • VLSI Design to evaluate performance metrics and
    fine-tuning of logical design
  • i.e. if found to be too slow, reduce the number of
    switches, use longer wires, minimize the amount
    of interconnect to that which is necessary, etc.
  • Furthermore, the iterative process of evaluating
    more symmetric-key algorithms and refining the
    architecture is still in progress.