Unified Architectures for Efficient and Compact CryptoProcessing

Provided by: ipam

1
Unified Architectures for Efficient and Compact
Crypto-Processing

Erkay Savas
Sabanci University

2
Outline

Research Motivation
Public Key Cryptography
Unified Arithmetic
High-Radix Multiplication
Dual-Radix Multiplication
Support for GF(3^n) Arithmetic
Implementation Results
Future Research

3
Motivation

Compatibility
support for fast arithmetic in different finite
fields and groups
Saving in Area
Improve the time × area metric
Algorithm Agility
NTRU ↔ ECC

4
Public Key Cryptography (PKC)

Each user has a pair of keys
Private Key - known only to the owner
Public Key - known to everyone in the system,
with assurance
Encryption
Encryption with the Public Key of the receiver
Decryption
Only the receiver can decrypt the message, using
her/his Private Key

5
Public Key Cryptography in Use

RSA, Rabin's scheme
Integer factorization, square roots modulo a
composite number
Discrete Logarithm Based Algorithms
Diffie-Hellman Key Exchange, ElGamal
Elliptic curve DH Key Exchange, ECDSA
Discrete logarithm over elliptic curves
IBE
pairings over elliptic curve points

6
RSA

Most popular PKC
Invented by Rivest/Shamir/Adleman in 1977 at MIT.
Its patent expired in 2000.
Based on the integer factorization problem
Each user has public and private key pair.

7
RSA Encryption Decryption

Encryption done by using public key
y ≡ x^e mod n, where x, y < n
Decryption done by using private key
x ≡ y^d mod n
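
A minimal sketch of the two operations above, using Python's built-in modular exponentiation. The primes, exponent and message are illustrative assumptions, far too small for real use, and are not taken from the presentation.

```python
# Toy RSA round trip; all parameters are assumed for illustration only.
p, q = 61, 53                 # assumed small primes
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)       # Euler's totient of n
e = 17                        # public exponent, coprime to phi
d = pow(e, -1, phi)           # private exponent (needs Python 3.8+)

x = 65                        # message, x < n
y = pow(x, e, n)              # encryption: y = x^e mod n
x_back = pow(y, d, n)         # decryption: x = y^d mod n
```

Both directions are a single modular exponentiation, which is why modular multiplication dominates the cost.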

8
DL Based Cryptosystems

Fundamental operation
g^x mod p, where x, g < p and g is primitive
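
As a hedged illustration of why this operation reduces to repeated modular multiplication, here is a left-to-right square-and-multiply sketch. The function name `mod_exp` is hypothetical and the method is a textbook one, not necessarily what the presenter had in mind.

```python
def mod_exp(g, x, p):
    """Left-to-right square-and-multiply: returns g^x mod p."""
    result = 1
    for bit in bin(x)[2:]:                 # exponent bits, most significant first
        result = (result * result) % p     # always square
        if bit == "1":
            result = (result * g) % p      # multiply when the bit is set
    return result
```

Each exponent bit costs one squaring and, on average, half a multiplication, so a fast modular multiplier speeds up the whole scheme.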

9
Elliptic Curve Cryptography 1/2

Emerging public key cryptography standard for
constrained devices.
A 160-bit key is equivalent in cryptographic
strength to 1024-bit RSA.
313-bit ECC is equivalent to 4096-bit RSA
Elliptic curves, as algebraic/geometric entities,
have been studied extensively for the past 150 years.
Rich and deep theory suitable to cryptography
First proposed for cryptographic usage in 1985
independently by Neal Koblitz and Victor Miller

10
Elliptic Curve Cryptography 2/2

Dominant fundamental operations
Multiplication in GF(q), where q = p^k and p is
prime
Alternatives
GF(p): k = 1
GF(2^k): p = 2
GF(p^k)
GF(3^k): p = 3

11
Identity Based Encryption (IBE)

Public key can be any string
e-mail address, name, etc.
No need for certificates
Anonymity achieved
users can choose any public key without revealing
their ID
A user can easily change it

12
IBE Bilinear Mapping

e(xP, yQ) = e(P, Q)^(xy) = e(yP, xQ) = g
g lies in (an extension of) the underlying field.
Bilinear mapping over elliptic curves
Weil pairing
Tate pairing
Resource consuming
Most efficient bilinear mappings
defined on curves over GF(3^k)

13
An Introduction to Unified Arithmetic

Three types of finite fields are heavily used
Prime fields, GF(p)
Binary extension fields, GF(2^k)
Ternary extension fields, GF(3^k) (recently, due to
IBE schemes)
These finite fields feature dissimilar properties
Different implementations on specialized hardware

14
Unified Arithmetic

Unified hardware design methodology requires
A single (unified) datapath
A single (unified) control
Insignificant overhead in the area
Insignificant overhead in the time complexity
(e.g. critical path delay)
Good time × area metric

15
Unified Arithmetic (GF(p) and GF(2^k))

A unified hardware design methodology for both
fields is possible since
the elements of either field are represented
using almost the same data structures in digital
systems
the algorithms for basic arithmetic operations in
both fields have structural similarities (i.e.
the steps of the algorithms are almost identical)
Hence, eventually unified arithmetic is possible

16
Finite Field Operations in ECC

Addition in GF(p) and GF(2^k)
Relatively inexpensive in area and time
complexity
Multiplicative inversion in GF(p) and GF(2^k)
Prohibitively expensive in terms of time
Possible to avoid some of them
Multiplication in GF(p) and GF(2^k)
Expensive in terms of time and area
Usually most important operation
Our focus

17
Montgomery Multiplication

Very efficient way of doing multiplication in
GF(p) and GF(2^k) (now also in GF(3^k))
Faster (replaces division by shifts)
Suitable for unified design
Suitable for scalable design
Highly parallel
Suitable for pipelining

18
Montgomery Multiplication

Definition
Given a, b ∈ GF(p), MonMul(a, b) = a·b·R^(-1) mod p,
where R = 2^k mod p and k = ⌈log2 p⌉.
Algorithm
c = 0
for i = 0 to k-1
  c = c + a_i·b
  c = (c + c_0·p)/2
if c > p then c = c - p (final subtraction)
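
The bit-serial loop above can be sketched in Python as follows. The function name `mon_mul` is an assumption, p is taken to be odd with a, b < p < 2^k, and the final test uses `>=` rather than `>` so that a result equal to p would also be reduced.

```python
def mon_mul(a, b, p, k):
    """Bit-serial Montgomery product a*b*2^(-k) mod p (p odd, a, b < p < 2^k)."""
    c = 0
    for i in range(k):
        a_i = (a >> i) & 1            # scan the multiplier one bit per step
        c = c + a_i * b               # conditional addition of b
        c = (c + (c & 1) * p) >> 1    # add p if c is odd, then halve exactly
    if c >= p:                        # final subtraction
        c -= p
    return c
```

Multiplying by R = 2^k mod p returns the other operand unchanged, which gives a quick sanity check: `mon_mul(a, pow(2, k, p), p, k) == a`.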

19
Algorithm for GF(2^k)

Input: a(x), b(x) ∈ GF(2^k), p(x) and k
Output: c(x) = a(x)b(x)x^(-k) ∈ GF(2^k)
c(x) = 0
for i = 0 to k-1
  c(x) = c(x) ⊕ a_i·b(x)
  c(x) = (c(x) ⊕ c_0·p(x))/x
No final subtraction
Note that
c/2 and c(x)/x are implemented in an identical
way in SW and HW
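
A sketch of the same loop for GF(2^k), with polynomials packed into Python integers (bit i holds the coefficient of x^i). The name `mon_mul_gf2k` and the bit-packed encoding are assumptions; p(x) is taken to be irreducible of degree k.

```python
def mon_mul_gf2k(a, b, p, k):
    """Montgomery product a(x)*b(x)*x^(-k) mod p(x); polynomials are bit-packed ints."""
    c = 0
    for i in range(k):
        if (a >> i) & 1:      # a_i = 1: add b(x), i.e. XOR
            c ^= b
        if c & 1:             # c_0 = 1: add p(x) to clear the constant term
            c ^= p
        c >>= 1               # divide by x
    return c                  # degree stays below k, so no final subtraction
```

Compared with the GF(p) loop, only the additions change (XOR instead of carry-propagating add); the right shift is the same operation in both fields.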

20
Representation

Addition
Atomic operation: multiplication is performed as
repeated addition
Unified addition
most efficient when carry-save representation is
used for elements of GF(p)
Carry-save representation
an integer is represented as the sum of two other
integers
x = xs + xc (sum and carry parts, resp.)
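
To illustrate the carry-save idea, the sketch below accumulates several integers while keeping the running value as a (sum, carry) pair. The function name `carry_save_add` and the sample operands are assumptions chosen for illustration.

```python
def carry_save_add(xs, xc, y):
    """Add y to the carry-save pair (xs, xc) without propagating carries."""
    s = xs ^ xc ^ y                               # per-bit sum
    c = ((xs & xc) | (xs & y) | (xc & y)) << 1    # per-bit carries, moved up one position
    return s, c

xs, xc = 0, 0
for term in (13, 7, 22):          # repeated addition, as in a multiplier loop
    xs, xc = carry_save_add(xs, xc, term)
total = xs + xc                   # single carry-propagate addition at the end
```

Every step costs only one level of bitwise logic regardless of operand length, which is what keeps the critical path short.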

21
Scalability

Original Montgomery multiplication algorithm
performs full-precision integer additions
Not scalable
Instead,
long integers are divided into words
Additions of words are handled separately on word
adders.
Choice of word length depends on the precision,
area and speed requirements
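
As a rough illustration of word-level processing, the sketch below splits operands into w-bit words and adds them one word at a time, passing the carry along. The helper names `to_words`, `add_words` and `from_words` are hypothetical.

```python
def to_words(x, w, e):
    """Split x into e words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * j)) & mask for j in range(e)]

def add_words(a_words, b_words, w):
    """Add two equal-length word vectors on a w-bit adder, word by word."""
    mask = (1 << w) - 1
    carry, out = 0, []
    for a_j, b_j in zip(a_words, b_words):
        t = a_j + b_j + carry
        out.append(t & mask)      # low w bits stay in this word
        carry = t >> w            # carry moves to the next word
    out.append(carry)
    return out

def from_words(words, w):
    """Reassemble an integer from its w-bit words."""
    return sum(word << (w * j) for j, word in enumerate(words))
```

The adder width w is fixed by the hardware, while the number of words e grows with the operand precision, so the same unit serves any key length.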

22
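Word-Based Multiplication

[Figure: processing unit PU_i takes multiplier bit a_i and word c^(j) (bits c^(j)_0 ... c^(j)_(w-1)) and produces word c^(j+1) (bits c^(j+1)_0 ... c^(j+1)_(w-1))]

23
Dependency Graph

24
Processing Unit (PU) with w = 2

[Figure: PU datapath; surviving labels: C1(j), C0(j)]

25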
Dual-Field Adder (DFA) 1/2

Almost identical to a full-adder (FA)
Difference
it has an additional (control) input (FSEL)
which suppresses the carry output of the adder when
it is set to logic 0
Namely, when FSEL = 0 the adder operates in
GF(2^k) mode; otherwise it behaves as a regular FA
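
The behaviour described above can be modelled at bit level as follows. This is a sketch of the stated truth table only; the function name `dual_field_adder` is an assumption and the gate-level structure of the real cell is not reproduced.

```python
def dual_field_adder(a, b, c, fsel):
    """One-bit DFA: full adder when fsel = 1, carry-free GF(2) adder when fsel = 0."""
    s = a ^ b ^ c                                     # sum bit is the same in both modes
    cout = fsel & ((a & b) | (a & c) | (b & c))       # carry is suppressed when fsel = 0
    return s, cout
```

With the carry forced to 0, a row of these cells performs bitwise XOR, which is exactly addition in GF(2^k).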

26
DFA 2/2

[Figure: DFA schematic with inputs A, B, C and FSEL, and outputs S and Cout]

27
Pipeline Organization with two PUs
s: the number of PUs

28
Total Computation Time (in clock cycles)
w: word size, k: precision, e = ⌈k/w⌉, s: the
number of PUs

29
Example Execution Times

Example: k = 1024, w = 32
s = 17 → T = 2105
s = 15 → T = 2305
s = 10 → T = 3415
s = 1 → T = 33792
Example: k = 2048, w = 32
s = 33 → T = 4221
s = 30 → T = 4543
s = 10 → T = 13343
s = 1 → T = 133120
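
The cycle-count formula itself did not survive in this transcript. The sketch below assumes a two-case form (pipeline kept busy versus pipeline stalling) that is not taken from the slides; the function name `total_cycles` is hypothetical. Under this assumption it reproduces seven of the eight figures listed above, and gives 3417 rather than 3415 for k = 1024, s = 10.

```python
def total_cycles(k, w, s):
    """Assumed cycle count for a pipeline of s PUs, k-bit precision, w-bit words."""
    e = -(-k // w)                  # e = ceil(k / w) words per operand
    passes = -(-k // s)             # ceil(k / s) passes through the pipeline
    if e + 1 >= 2 * s:              # words arrive fast enough: no stalls
        return passes * (e + 1) + 2 * (s - 1)
    return passes * 2 * s + e - 1   # pipeline waits for the first PU to free up
```

In this model, adding PUs beyond roughly (e + 1)/2 no longer shortens the computation, which matches the small gain from s = 15 to s = 17 in the first example.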

30
Comparison to the single-field (GF(p)) design
w: word size; 1.2 µm CMOS technology

31
Design Alternatives

Higher Radix
Original design is radix 2
Namely, the multiplier is scanned one bit in
each clock cycle
Possible to scan two or more bits of the
multiplier a
Radix-4: two bits
Radix-8: three bits
More complex design: lower clock frequency,
higher area
Fewer clock cycles → faster execution of
multiplication
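
A hedged software sketch of the higher-radix idea: the loop below consumes r multiplier bits per step (r = 1 is radix-2, r = 2 is radix-4, r = 3 is radix-8). The name `mon_mul_radix` and the use of a precomputed constant -p^(-1) mod 2^r are assumptions; the hardware design in the slides may choose the quotient digit differently.

```python
def mon_mul_radix(a, b, p, k, r):
    """Montgomery product a*b*2^(-r*n) mod p, scanning r multiplier bits per step.

    Returns (product, n) where n = ceil(k / r) is the number of steps.
    Assumes p odd and a, b < p < 2^k.
    """
    n = -(-k // r)                            # steps needed for a k-bit multiplier
    base = 1 << r
    p_neg_inv = (-pow(p, -1, base)) % base    # -p^(-1) mod 2^r (Python 3.8+)
    c = 0
    for i in range(n):
        digit = (a >> (r * i)) & (base - 1)   # next r-bit digit of the multiplier
        c += digit * b
        q = (c * p_neg_inv) % base            # makes c + q*p divisible by 2^r
        c = (c + q * p) >> r
    if c >= p:                                # final subtraction
        c -= p
    return c, n
```

The step count drops from k to ⌈k/r⌉, while each step needs multiples of b and p up to 2^r - 1, which is where the extra area and longer critical path come from.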

32
Comparison

Higher radix vs. single radix
Metric
area × time
For small total area (i.e. < 10000 equivalent NAND
gates) the performances of radix-2 and radix-8
are comparable
Radix-8 multiplier outperforms radix-2 multiplier
by more than a factor of 3 when the total area is
around 25000 NAND gates

33
Dual-Radix Multiplier

Radix-2 for GF(p) and radix-4 for GF(2^k)

34
Dual-Radix Multiplier

Three multipliers
A1: GF(p)-only multiplier
A2: single-radix unified multiplier (with
precomp.)
A3: dual-radix multiplier
Performance (area × time)
A3 performs slightly worse than A1 and A2
(between 7% and 19%) in GF(p) mode
A3 outperforms A2 by 38% to 46% in GF(2^k) mode

35
Unified Arithmetic?

Unified multiplier
carry-save adders used in multiplier
With carry-save representation it is not easy to
perform other arithmetic operations such as
subtraction and comparison (essential in
inversion)

36
New Redundant Representation

Recall
Carry-save representation
X = xs + xc.
New redundant representation
Redundant signed digit (RSD) representation
X = xp - xn.
Subtraction is equivalent to addition
X - Y = (xp - xn) - (yp - yn) = (xp - xn) + (yn -
yp)
Comparison is relatively easy
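
The subtraction identity above can be checked with a value-level sketch in which a number is stored as a pair (xp, xn). The helper names are hypothetical, and real RSD hardware works digit by digit rather than on whole integers as done here.

```python
def rsd_neg(x):
    """Negate by swapping the parts: -(xp - xn) = xn - xp."""
    xp, xn = x
    return (xn, xp)

def rsd_add(x, y):
    """Add two pairs componentwise: (xp + yp) - (xn + yn)."""
    return (x[0] + y[0], x[1] + y[1])

def rsd_sub(x, y):
    """Subtract by adding the negated operand, so no separate subtractor is needed."""
    return rsd_add(x, rsd_neg(y))

def rsd_value(x):
    """Convert back to an ordinary integer (the single reverse transformation)."""
    return x[0] - x[1]
```

For example, `rsd_sub((9, 2), (5, 11))` represents 7 - (-6), and `rsd_value` of the result is 13.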

37
RSD

All previous multipliers require a reverse
transformation to non-redundant form after each
multiplication
There are thousands of multiplications in ECC
With RSD, all the computation can be done in RSD
form without any reverse transformation
A single transformation is necessary only if the
result is needed in non-redundant form.

38
Support for GF(3^n) Arithmetic