Cryptographic Algorithms and their Implementations - PowerPoint PPT Presentation

Slides: 38

Provided by: alimusta

Transcript and Presenter's Notes

1
Cryptographic Algorithms and their Implementations

Discussion of how to map different algorithms to
our architecture
Public-Key Algorithms (Modular Exponentiation)
Rijndael
Serpent
Others (Mars, RC6, Twofish, etc.)

2
Modular Exponentiation

Square and Multiply Algorithm for Modular
Exponentiation
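The slide's figure is not in the transcript. As a reminder of the technique it names, here is a minimal Python sketch of left-to-right square-and-multiply. The function name and the choice of bit-scan direction are assumptions for illustration, not taken from the slide.

```python
def square_and_multiply(base, exponent, modulus):
    """Return base**exponent % modulus using one squaring per exponent bit."""
    if modulus == 1:
        return 0
    result = 1
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus       # square for every bit
        if bit == "1":
            result = (result * base) % modulus     # multiply only when the bit is set
    return result
```

The cost is one modular squaring per exponent bit plus one modular multiplication per set bit, which is why the speed of the modular multiplier dominates.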

3
Modular Exponentiation

Montgomery Modular Multiplication

4
Modular Exponentiation

Several Approaches to implementing Modular
Multiplication
Redundant Representation based (e.g. Carry-save)
Residue Number System based.
Systolic Array Based.
Word-based implementations are preferable, due to
their similarity with symmetric-key algorithms
This rules out systolic arrays

5
Modular Exponentiation

Most popular and fastest were Carry-Save
representation based implementations.
Carry-save based were also word-oriented.
We selected fastest, simplest implementation
Extremely beneficial to have simplicity and
homogeneity in algorithms when designing a custom
reconfigurable fabric.
Performance when implemented on Xilinx Virtex
FPGAs: almost 5 Mb/s (the highest reported figure
that we could find)
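To illustrate why the carry-save representation is fast, the following Python sketch models a 3-to-2 carry-save adder on arbitrary-width integers. The function names are hypothetical; the point is only that each step is bitwise, so no carry ripples across the word until the final conversion.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: returns (sum_word, carry_word) with sum + carry == a + b + c."""
    sum_word = a ^ b ^ c                              # per-bit sum, no carry chain
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted up one place
    return sum_word, carry_word


def accumulate(operands):
    """Add many operands while keeping the running total as a (sum, carry) pair."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s, c


s, c = accumulate([13, 7, 250, 1024])
total = s + c   # the only carry-propagating addition, done once at the end
```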

6
Modular Exponentiation

Five-to-two Multiplier Modular Exponentiation (P, E, M)
K := 2^(2k) mod M, computed externally
1. (P1_0, P2_0) := 5to2_MontMult(K, 0, 1, 0, M),
   (Z1_0, Z2_0) := 5to2_MontMult(K, 0, P, 0, M)
2. FOR i = 0 to n-1 DO
3.   (Z1_{i+1}, Z2_{i+1}) := 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
4.   IF e_i = 1 THEN
       (P1_{i+1}, P2_{i+1}) := 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
     ELSE
       (P1_{i+1}, P2_{i+1}) := (P1_i, P2_i)
5. ENDFOR
6. (P1_n, P2_n) := 5to2_MontMult(1, 0, P1_{n-1}, P2_{n-1}, M)
7. P := P1_n + P2_n
8. RETURN P
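The control flow of this exponentiation can be checked with a small Python model. This is a sketch under stated assumptions: the carry-save pairs are collapsed into single integers, the Montgomery product is computed directly with a modular inverse instead of the five-to-two multiplier, and R = 2^k with k the bit length of M. All function names are hypothetical.

```python
def mont_product(a, b, modulus, r_bits):
    """Reference Montgomery product: a * b * 2**(-r_bits) mod modulus (modulus must be odd)."""
    r_inverse = pow(1 << r_bits, -1, modulus)   # three-argument pow with -1 needs Python 3.8+
    return (a * b * r_inverse) % modulus


def mont_exp(base, exponent, modulus):
    """Right-to-left exponentiation carried out entirely in the Montgomery domain."""
    r_bits = modulus.bit_length()
    k_const = pow(2, 2 * r_bits, modulus)                 # K = 2^(2k) mod M
    p = mont_product(k_const, 1, modulus, r_bits)         # step 1: 1 in Montgomery form
    z = mont_product(k_const, base, modulus, r_bits)      # step 1: base in Montgomery form
    for i in range(exponent.bit_length()):                # steps 2 to 5
        if (exponent >> i) & 1:
            p = mont_product(p, z, modulus, r_bits)       # step 4 uses Z before it is squared
        z = mont_product(z, z, modulus, r_bits)           # step 3
    return mont_product(1, p, modulus, r_bits)            # step 6: leave the Montgomery domain
```

For an odd modulus, `mont_exp(base, e, M)` should agree with Python's built-in `pow(base, e, M)`.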

7
Modular Exponentiation

Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
1. (S1_0, S2_0) := (0, 0)
2. FOR i = 0 to m-1 DO
3.   q_i := ((S1_i + S2_i) + A_i (B1 + B2)) mod 2
4.   (S1_{i+1}, S2_{i+1}) := CSR((S1_i + S2_i) + A_i (B1 + B2) + q_i M) div 2
5. ENDFOR
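A bit-serial software model of this multiplication might look like the Python sketch below. It assumes that A_i is bit i of A = A1 + A2, that M is odd, and that a five-to-two compressor can be modeled as three chained 3-to-2 carry-save adders; the helper names are hypothetical and the hardware structure on the slides may differ.

```python
def csa(a, b, c):
    """3:2 carry-save adder: the two outputs sum to a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compress_5_to_2(v, w, x, y, z):
    """Reduce five operands to a (sum, carry) pair using three chained CSAs."""
    s, c = csa(v, w, x)
    s, c = csa(s, c, y)
    return csa(s, c, z)


def mont_mult_5to2(a1, a2, b1, b2, modulus, m):
    """Return (S1, S2) with S1 + S2 congruent to A * B * 2**(-m) mod modulus."""
    a = a1 + a2            # simplification: A is resolved up front, then scanned bit by bit
    s1, s2 = 0, 0
    for i in range(m):
        a_i = (a >> i) & 1
        # q_i depends only on the least significant bits of the operands.
        q_i = (s1 ^ s2 ^ (a_i & (b1 ^ b2))) & 1
        t1, t2 = compress_5_to_2(s1, s2, a_i * b1, a_i * b2, q_i * modulus)
        # The total is even by construction and t2 always has a zero LSB,
        # so both words can be shifted right without losing a bit.
        s1, s2 = t1 >> 1, t2 >> 1
    return s1, s2
```

The result stays in carry-save form; a single ordinary addition `s1 + s2` is needed only when the value finally leaves the multiplier.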

8
Modular Exponentiation

Their Implementation of MM

9
Modular Exponentiation

Implementing MM on our design

10
Modular Exponentiation

Each of the 64-CSA blocks maps to a single basic
block
Outputs of the last basic block are registered.
qi is generated by random-logic block at the
second basic-block
Broadcast to all groups
Ai is generated in a similar manner, utilizing
two more basic-blocks
Also broadcast to all groups

11
Modular Exponentiation

Efficient and scalable mapping to our design
1024-bit RSA will need to use 16 groups, while
2048-bit will use 32, and 4096-bit will use 64
groups
Primary concern: clock rate may be limited by
bit-broadcasts of qi and Ai
Potential impediment to scalability
We are exploring methods for pipelining these
broadcasts as well, to reduce cycle time and
improve scalability.
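The group counts quoted above are consistent with one group per 64 bits of operand width. That figure is an inference from the 1024-bit to 16-group ratio, not something the slides state directly, so the constant below is an assumption.

```python
BITS_PER_GROUP = 64   # assumption inferred from "1024-bit RSA will need 16 groups"


def groups_needed(modulus_bits):
    """Number of groups for a given modulus width, rounding up partial groups."""
    return -(-modulus_bits // BITS_PER_GROUP)   # ceiling division


for width in (1024, 2048, 4096):
    print(width, groups_needed(width))
```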

12
Rijndael

Primary operations
Sub-Bytes
Shift-Rows
Mix-Columns
Add-Round-Key

13
Rijndael

Representation of Data: 128-bit state.

14
Rijndael

Add-Round-Key
Simple 128-bit XOR operation: uses 1 basic-block
Sub-Bytes
Simple operation: byte-wise table lookup from
S-box
Each S-box is 2 kbits.
16 parallel S-boxes required!
No basic-blocks required, ALL memory-blocks
required!
Shift-Rows
Simple operation: 4 x 32-bit permutations
Uses only 1 basic-block

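A byte-level Python sketch of the two logic-only operations, assuming the standard column-major AES state layout (byte index = row + 4 * column). The function names are illustrative.

```python
def add_round_key(state, round_key):
    """XOR a 16-byte state with a 16-byte round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))


def shift_rows(state):
    """Rotate row r of the 4x4 state left by r byte positions (a fixed permutation)."""
    return bytes(
        state[row + 4 * ((col + row) % 4)]
        for col in range(4)
        for row in range(4)
    )
```

Both functions only XOR or reorder bits, which is why each fits in a single basic-block, while Sub-Bytes needs table storage.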
15
Rijndael

Mix-Columns
Somewhat complicated: can be implemented using
table lookups, but we are out of memory!
Alternative implementation

16
Rijndael

Mix-Columns
Operation may be expressed in terms of xtime()
function
Mix-columns implementation requires xtime()
operation on each byte, followed by 4 XOR
operations

17
Rijndael

Mix-Columns
In order to efficiently implement xtime(), we
modified it this way
In this form, only 2 basic-blocks are needed to
apply xtime() to all 16 bytes
A single basic-block will take the 128-bit data
as input, and generate the xtime() mask
(0 0 0 0 x7 x7 0 x7) for each of the 16 bytes at the
permute unit.
Another basic-block will now first perform the
XOR operation, followed by a left shift (and
substitute LSB with x7) at the permute unit.
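The reordering described here (XOR with a mask first, then shift left and place x7 in the LSB) can be checked against the textbook xtime() in Python. The sketch reads the mask as the byte 0 0 0 0 x7 x7 0 x7, that is 0x0D when the top bit x7 is set, which is 0x1B pre-shifted right by one place. The function names and the 128-bit integer packing are assumptions for illustration.

```python
LSB_OF_EACH_BYTE = int.from_bytes(b"\x01" * 16, "big")
CLEAR_LSB_OF_EACH_BYTE = int.from_bytes(b"\xfe" * 16, "big")


def xtime_reference(x):
    """Textbook xtime(): shift left, then reduce by the AES polynomial 0x11B."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11B
    return x & 0xFF


def xtime_mask_then_shift(x):
    """Same result with the XOR done before the shift, as on the slide."""
    x7 = (x >> 7) & 1
    mask = 0x0D if x7 else 0x00          # bits 0 0 0 0 x7 x7 0 x7
    return (((x ^ mask) << 1) & 0xFF) | x7


def xtime_all_16_bytes(state):
    """Apply the mask-then-shift form to all 16 bytes of a 128-bit integer at once."""
    msb = (state >> 7) & LSB_OF_EACH_BYTE         # x7 of every byte, moved to that byte's LSB
    mask = msb * 0x0D                             # per-byte mask; no carries cross byte boundaries
    shifted = ((state ^ mask) << 1) & CLEAR_LSB_OF_EACH_BYTE
    return shifted | msb
```

The two single-byte forms agree for all 256 input values, and the 128-bit form uses only wide XOR, shift, and bit-placement steps.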

18
Rijndael

Mix-Columns
After generating output from the xtime()
function, 4 x 128-bit XOR operations need to be
performed
4 basic-blocks will be used
Note that the mix-column operation is carried out
in parallel on all 4 columns.
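As a concrete reading of "xtime() on each byte, followed by 4 XOR operations", each output byte of a column can be written as xtime(a_i) XOR xtime(a_{i+1}) XOR a_{i+1} XOR a_{i+2} XOR a_{i+3}, with indices taken modulo 4. The Python sketch below follows that formula; the names are illustrative and the byte layout is assumed to be column-major.

```python
def xtime(x):
    """Multiply a byte by 2 in the AES field GF(2^8)."""
    return ((x << 1) ^ (0x1B if x & 0x80 else 0x00)) & 0xFF


def mix_column(column):
    """Mix one 4-byte column using only xtime() and XOR."""
    doubled = [xtime(b) for b in column]
    return [
        doubled[i] ^ doubled[(i + 1) % 4] ^ column[(i + 1) % 4]
        ^ column[(i + 2) % 4] ^ column[(i + 3) % 4]
        for i in range(4)
    ]


def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte state."""
    out = []
    for col in range(4):
        out.extend(mix_column(list(state[4 * col: 4 * col + 4])))
    return bytes(out)
```

In hardware the four columns are independent, so the five terms become wide XORs over the whole 128-bit state.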

19
Rijndael

Implementation summary
Only 8 basic-blocks required
2 (1 each) for Add-Round-Key and Shift-Rows
6 for Mix-Columns (2 for xtime(), 4 for XOR
operations)
16 memory-blocks required!
All memory blocks used up in a single round!
Inefficient implementation, because Rijndael as
mapped here is memory-intensive
Only 10% logic used, versus 100% memory usage.

20
Rijndael

Potential Solutions
Add lots of memory !!
At least 10 times more
Issues with memory placement
Consider memory-less implementations of Sub-Byte
Requires GF(2^8) constant multiplication and Inverse
Affine Transforms
Currently under study as the more efficient and
practical option.
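For reference, the AES S-box can be computed without any table as a multiplicative inverse in GF(2^8) followed by an affine transform. The Python sketch below shows that definition in its simplest form; it is not the circuit under study on the slide, and the naive inversion by repeated multiplication is chosen for clarity, not efficiency.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0x00)) & 0xFF
        b >>= 1
    return product


def gf_inverse(a):
    """Inverse as a**254 in GF(2^8); maps 0 to 0, as the S-box definition requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(b):
    """AES affine transform: b XOR four left-rotations of b, XOR the constant 0x63."""
    out = b
    for shift in range(1, 5):
        out ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
    return out ^ 0x63


def sub_byte(x):
    """Table-free S-box lookup."""
    return affine(gf_inverse(x))
```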

21
Serpent

Substitution-permutation cipher comprised of
Key Mixing,
S-Box Substitution, and
Linear Transformation.
S-boxes: 4 x 4 bit
32 copies required each round
16 x 4 x 32 = 2048 bits per round.

22
Serpent

The Linear Transformation step consists of
8 fixed permute operations, and
8 XOR operations
All operands are 32-bits wide
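The eight fixed permutes (six rotates and two shifts) and eight XORs can be seen in a Python sketch of the linear transformation, written here from the published Serpent definition with the four 32-bit words named x0 to x3. The inverse is included only to make the sketch checkable; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl(x, n):
    """Rotate a 32-bit word left by n (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotr(x, n):
    """Rotate a 32-bit word right by n (0 < n < 32)."""
    return rotl(x, 32 - n)


def serpent_lt(x0, x1, x2, x3):
    """Forward linear transformation: 6 rotates, 2 shifts, 8 XORs."""
    x0 = rotl(x0, 13)
    x2 = rotl(x2, 3)
    x1 ^= x0 ^ x2
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 = rotl(x1, 1)
    x3 = rotl(x3, 7)
    x0 ^= x1 ^ x3
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 = rotl(x0, 5)
    x2 = rotl(x2, 22)
    return x0, x1, x2, x3


def serpent_lt_inverse(x0, x1, x2, x3):
    """Undo serpent_lt by reversing its steps in the opposite order."""
    x2 = rotr(x2, 22)
    x0 = rotr(x0, 5)
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 ^= x1 ^ x3
    x3 = rotr(x3, 7)
    x1 = rotr(x1, 1)
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 ^= x0 ^ x2
    x2 = rotr(x2, 3)
    x0 = rotr(x0, 13)
    return x0, x1, x2, x3
```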

23
Serpent

Serpent is an ideal match for our architecture
8 x 32-bit fixed shifts and rotates can be easily
implemented by the permute units of 2
basic-blocks.
Additional 2 basic-blocks required to implement
the 8 x 32-bit XOR operations.
128-bit key mixing stage per round would require
1 more basic-block
Total of 5 basic-blocks and 2kbits of memory
required per round.
Each round perfectly fits in a single group of
our architecture!
16 of Serpent's 32 rounds may be
unrolled in our architecture

24
Other Algorithms

DES
Implementation of a single round is trivial: a
single group may implement multiple rounds!
Twofish
Complex structure, requires more time to define
implementation on our architecture.
However, all its basic operations are directly
supported.
RC6 and MARS
Involve complicated operations requiring special
purpose logic
Data-dependent rotations
Multiplication modulo 2^32
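To show why these two operations resist a fixed-permutation fabric, here is a Python sketch modeled on the RC6 round function (key addition omitted). The function names are illustrative. The rotation amount is a run-time value, so it cannot be configured into a permute unit ahead of time, and the multiplication needs a full 32-bit multiplier.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, amount):
    """Rotate a 32-bit word left by a data-dependent amount (low 5 bits are used)."""
    amount &= 31
    return ((x << amount) | (x >> (32 - amount))) & MASK32


def mul_mod_2_32(a, b):
    """Multiply two words and keep the low 32 bits."""
    return (a * b) & MASK32


def rc6_quadratic(b):
    """RC6-style f(b) = b * (2b + 1) mod 2^32, rotated left by 5."""
    return rotl32(mul_mod_2_32(b, (2 * b + 1) & MASK32), 5)


def rc6_style_mix(a, b, d):
    """XOR a with f(b), then rotate by an amount derived from d."""
    return rotl32(a ^ rc6_quadratic(b), rc6_quadratic(d))
```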

25
Other Algorithms

RC6 and MARS
This special-purpose logic was not incorporated
because
Algorithms are more suitable for software
implementations than in hardware
Lack of support and popularity of these
algorithms
Addition of special-purpose logic would incur
overhead beyond its own area, as additional
supporting interconnect must be provided.

26
Comparison with Related Work

Although we cannot provide results based on
empirical evaluation, we can present a logical
framework for comparison of individual features
Through deductive reasoning, we identify what
possible advantages one approach may have over
the other, assuming all other factors normalized.

27
Comparison with Related Work

Comparison with FPGA based implementations
Area Efficiency
Use of basic gates instead of LUTs
Basic-blocks with limited flexibility, thus fewer
configuration bits
Basic units (full adders) combined into clusters
of 64, and programmed as a single entity:
further savings in configuration memory elements
Performance
Use of basic gates instead of LUTs
Simpler Interconnect, with fewer routing-switches
Hierarchical organization: no long wires (except
for bit-broadcast)
Far less configuration data required: faster
reconfiguration

28
Comparison with Related Work

Comparison with FPGA based implementations
Potential pitfalls
Design dedicates considerable amount of area to
inter-block interconnect.
Until actual area can be quantified, we are
unsure of area efficiency estimates.
Need to identify most suitable Performance/Area
tradeoff.

29
Comparison with Related Work

Comparison with COBRA Architecture
Uses multiple copies of special purpose logic
blocks, coupled with extremely simple
interconnect.

30
Comparison with Related Work

Comparison with COBRA Architecture
Low logic-utilization: we have more generic
blocks
Fixed latency operations
Intermediate values registered only at RCE
boundary.

31
Programming Methodology

Reconfigurable Computing devices suffer from the
following two critical issues
Lack of a comprehensive programming model
Lack of hardware virtualization
First issue implies the difficulty of programming
RC architectures such as FPGAs
Second issue deals with exposure of hardware
resource limitations to the programmer.

32
Programming Methodology

How COBRA deals with these issues
Essentially a special-purpose programmable
architecture rather than a configurable one
VLIW like instructions alleviate some of the
programming model related issues
Also resolve the virtualization aspect.

33
Programming Methodology

The programming methodology and the impact of the
issues mentioned can be seen in terms of a
spectrum

34
Programming Methodology

Programming model issue less severe for us
because
Simple, highly specialized architecture
Hardware Virtualization is still a concern.

35
Programming Methodology

Programming model
Provide basic primitives that are supported by
our architecture.
Programming is to be accomplished by expressing
an algorithm using these primitives and
interconnecting these primitives together using
32-bit interconnect.
Mapping such a description onto our design should
be a trivial software challenge.
Due to special purpose nature, primitives are
limited in number and thus programming should be
an easy task.

36
Programming Methodology
Programming Primitives

32-bit Carry Save Adder
32-bit XOR
32-bit AND
32-bit OR
32, 64, and 128-bit Ripple Carry Adder
32, 64, and 128-bit Fixed Shifts
32 bit Rotates and random permutes.
64-bit, 128-bit limited permutes (TBD).

ANDing 32-bit value with a single bit
128-bit shift-register
Random bit-logic implementation, since each block
is also capable of implementing
single 4-input function
two 3-input functions
four 2-input functions
4 global bit-broadcast lines
32-bit interconnect, point to point.
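As a rough illustration of programming with such primitives, the Python sketch below models three of them as functions on 32-bit words and composes a 128-bit key-mixing step from four 32-bit XORs. Every name here is hypothetical; the slides do not define a programming interface, and Python lists merely stand in for the 32-bit point-to-point interconnect.

```python
MASK32 = 0xFFFFFFFF


def xor32(a, b):
    """32-bit XOR primitive."""
    return (a ^ b) & MASK32


def and_with_bit(word, bit):
    """AND a 32-bit value with a single broadcast bit (used for partial products)."""
    return word if bit & 1 else 0


def csa32(a, b, c):
    """32-bit carry-save adder; the carry out of bit 31 is dropped in this model."""
    sum_word = (a ^ b ^ c) & MASK32
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return sum_word, carry_word


def key_mix_128(state_words, key_words):
    """128-bit key mixing expressed as four independent xor32 primitives."""
    return [xor32(s, k) for s, k in zip(state_words, key_words)]
```

An algorithm description in this style is just a dataflow graph of primitive calls, which is what a mapping tool would place onto basic-blocks.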

37
Conclusion: Work in Progress

Following areas of design still under
consideration and not completely defined yet
Configurable Memory-block Architecture
VLSI Design to evaluate performance metrics and
fine-tuning of logical design
i.e. if found to be too slow, reduce the number of
switches, use longer wires, minimize the amount
of interconnect to that which is necessary, etc.
Furthermore, the iterative process of evaluating
more symmetric-key algorithms and refining the
architecture is still in progress.