Montgomery's Multiplication Technique: How to make it Smaller and Faster


1
Montgomery's Multiplication Technique: How to make it Smaller and Faster

Colin D. Walter
Computation Department, UMIST, UK
www.co.umist.ac.uk

2
Peter Montgomery

"Modular Multiplication without Trial Division",
Math. Computation, vol. 44 (1985), 519-521
Computes $(A \times B) \bmod M$
without obtaining the digits of $q = (A \times B) \text{ div } M$

3
Motivation

Faster RSA cryptosystem:
through a pipelined array
Safer encryption:
against timing or DPA attacks

4
Overview

RSA Notation
Classical Algorithm
Montgomery's Version
Comparison
carry propagation
digit distribution
communication
timing/power attacks
Conclusion

5
Enigma

Special purpose: Colossus (1943-44)
Tommy Flowers,
Bletchley Park, England.
General purpose: ENIAC (1943-46)
John Eckert & John Mauchly,
Philadelphia, US.

6
RSA

Modulus $M$ of around 1024 bits
Two keys $d$ and $e$ such that $A^{de} \equiv A \bmod M$
$A$ encrypted to $C = A^e \bmod M$
$C$ decrypted by $A = C^d \bmod M$
$M = PQ$, a product of two large primes
$e$ is often small (e.g. a Fermat prime)
$d$ satisfies $de \equiv 1 \bmod (P-1)(Q-1)$
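
As a small illustration of the notation on this slide, here is a toy RSA round trip. The primes, exponent and message are illustrative values chosen here, far too small for real use; the modular inverse via `pow(e, -1, phi)` assumes Python 3.8 or later.

```python
from math import gcd

# Toy parameters (assumed for illustration; real moduli are around 1024 bits).
P, Q = 61, 53
M = P * Q                      # modulus M = PQ
phi = (P - 1) * (Q - 1)        # (P-1)(Q-1)
e = 17                         # a small Fermat prime
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # d satisfies d*e = 1 mod (P-1)(Q-1)

A = 65                         # plaintext, 0 <= A < M
C = pow(A, e, M)               # encrypt: C = A^e mod M
A_back = pow(C, d, M)          # decrypt: A = C^d mod M
print(A_back == A)
```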

7
Faster H/W $\Rightarrow$ More Secure Encryption

Work to factorize $M$ doubles for every
extra 15 bits (for key lengths around $2^{10}$
bits)
Work to en/decrypt grows by a factor of
$((1024+15)/1024)^2$ per multiplication
$((1024+15)/1024)^3$ per exponentiation,
i.e. only 5% extra!
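
The two growth factors on this slide are easy to evaluate. The sketch below assumes the quadratic and cubic cost model stated on the slide and simply computes the ratios.

```python
bits, extra = 1024, 15
ratio = (bits + extra) / bits

mult_cost = ratio ** 2     # relative cost of one modular multiplication
exp_cost = ratio ** 3      # relative cost of one exponentiation

print(f"multiplication: +{(mult_cost - 1) * 100:.1f}%")
print(f"exponentiation: +{(exp_cost - 1) * 100:.1f}%")
```

The exponentiation overhead comes out a little under 5%, which matches the slide's rounded figure.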

8
Number representations

$X = \sum_{i=0}^{n-1} x_i r^i$
$r = 2^k$ is the radix (prime to $M$)
$x_i$ is the $i$th digit (usually $0 \le x_i < r$)
$n$ = max no. of digits in any number
Redundant reps:
wider digit range than $0 \ldots r-1$
H/W is built from $k \times k$-bit multipliers
$n$ fixed by H/W register size
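
A minimal sketch of the non-redundant representation, assuming digits are stored least significant first. The helper names `to_digits` and `from_digits` are hypothetical and not taken from the presentation.

```python
def to_digits(x, k, n):
    """Return the n digits of x in radix r = 2^k, least significant first."""
    r = 1 << k
    assert 0 <= x < r ** n
    return [(x >> (k * i)) & (r - 1) for i in range(n)]

def from_digits(digits, k):
    """Rebuild X as the sum of x_i * r^i."""
    return sum(d << (k * i) for i, d in enumerate(digits))

k, n = 8, 4
X = 0x12345678
digits = to_digits(X, k, n)
print([hex(d) for d in digits], from_digits(digits, k) == X)
```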

9
Redundancy

Digits $x_j$ split into carry-save parts: $x_j = x_{j,s} + r x_{j,c}$
$X \leftarrow X + Y$
is performed by digit-parallel addition
$x_j \leftarrow x_{j,s} + x_{j-1,c} + y_j$
No carry propagation: only old carries appear on the right-hand side
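
The digit-parallel addition can be sketched in software as follows. This is an illustration under assumptions made here: sum and carry parts are kept in two lists, $Y$ is non-redundant, and the register has spare top positions so the highest carry is always zero.

```python
def cs_value(s, c, r):
    """Value of a carry-save number with digits x_j = s_j + r*c_j."""
    return sum((sj + r * cj) * r**j for j, (sj, cj) in enumerate(zip(s, c)))

def cs_add(s, c, y, r):
    """One step X <- X + Y: every position reads only OLD carries."""
    assert c[-1] == 0, "top position needs headroom for the carry"
    new_s, new_c = [], []
    for j in range(len(s)):
        carry_in = c[j - 1] if j > 0 else 0   # old carry from the position below
        t = s[j] + carry_in + y[j]
        new_s.append(t % r)                   # new sum part
        new_c.append(t // r)                  # new carry part, used on the next step
    return new_s, new_c

r, n = 16, 6
s, c = [0] * n, [0] * n
total = 0
for Y in (0xFFFF, 0xABCD, 0x1234):
    y = [(Y // r**j) % r for j in range(n)]
    s, c = cs_add(s, c, y, r)
    total += Y
print(cs_value(s, c, r) == total)
```

Each position can be updated independently in the same step because no new carry is consumed until the following addition.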

10
Multiplication $A \times B$

Use $n$ digit multipliers to form $a_i \times B$ and add to a
partial product $P$
$P \leftarrow 0$
For $i = n-1$ downto $0$ do
$P \leftarrow r \times P + a_i \times B$
Post-condition: $P = A \times B$
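
The loop above translates directly into a few lines of Python. The function name and the test values are assumptions made for illustration.

```python
def multiply_msd_first(A, B, r, n):
    """Accumulate A*B one digit of A at a time, most significant digit first."""
    assert 0 <= A < r ** n
    a = [(A // r**i) % r for i in range(n)]
    P = 0
    for i in range(n - 1, -1, -1):
        P = r * P + a[i] * B      # shift up, then add the digit multiple a_i * B
    return P

print(multiply_msd_first(0xBEEF, 0x1234, 16, 4) == 0xBEEF * 0x1234)
```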

Either: use redundancy in $P$ and parallel digit
addition to add $a_i B$ in one clock cycle
Cell $j$ computes $a_i b_j$ in cycle $i$

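[Figure: cells $j+1$, $j$, $j-1$ with input $a_i$, local digits $b_{j+1}$, $b_j$, $b_{j-1}$, and carry-save digits $p_{j,s}$, $p_{j,c}$ entering and leaving each cell.]
$P \leftarrow P + a_i \times B$ (digit-parallel)
$P$ in carry-save form: $p_j = p_{j,s} + r \times p_{j,c}$
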
12

or: pipeline the addition of $a_i \times B$ over $n$ cycles
and propagate carries with no redundancy
Cell $j$ computes $a_i b_j$ in cycle $i+j$

[Figure: cells holding $p_{j+1}$, $b_{j+1}$; $p_j$, $b_j$; $p_{j-1}$, $b_{j-1}$ at times $j+1$, $j$, $j-1$, with $a_i$ and the carry passed from cell to cell.]
$P \leftarrow P + a_i \times B$ (digit-serial)
13
Multiplier Complexity

Assume wires take area but not time (or power).
Area $\times$ Time$^2$ complexity for un-pipelined $k$-bit
multiplication is bounded below by $k^2$
This can be achieved for any time in the range $\log k \ldots \sqrt{k}$
Discrete Fourier Transform has large constants
for time and area.
Designs that are better in practice, but asymptotically poorer, are
expected for the $k$ used here.

Cross-over point?
$10^7$ transistors available for RSA $\Rightarrow$
$k \le 64$ to accommodate $a_i \times B$
Speed by using at least $n$ multipliers to perform
a full length $a_i \times B$ (or equivalent) in one cycle.

15
Real-Time?

Assume:
bus is one $k$-bit digit per cycle
$k$-bit multiplier operates in one cycle
Then:
$A \times B$ takes $n$ cycles using $n$ multipliers
Throughput is one digit per cycle for multiplication.
Need $O(nk)$ multiplications for decryption
Conclude:
Need $O(nk)$ rows of $n$ multipliers.

16
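Classical Modular Multiplication Algorithm

Pre-condition: $0 \le A < r^n$
$P \leftarrow 0$
For $i = n-1$ downto $0$ do
Begin
$P \leftarrow rP + a_i B$
$q_i \leftarrow P \text{ div } M$
$P \leftarrow P - q_i M$
End
Post-conditions: $P = AB - QM$,
$P \equiv (AB) \bmod M$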
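
A direct software rendering of the classical loop, as a sketch. It uses exact integer division for $q_i$, whereas the slides go on to estimate $q$ from the top digits only; the function name and the accumulated $Q$ are assumptions added here to make the post-condition checkable.

```python
def classical_modmul(A, B, M, r, n):
    """Interleaved multiply-and-reduce, most significant digit of A first."""
    assert 0 <= A < r ** n
    a = [(A // r**i) % r for i in range(n)]
    P, Q = 0, 0
    for i in range(n - 1, -1, -1):
        P = r * P + a[i] * B
        q_i = P // M              # quotient digit: needs the top of P, so carries matter
        P = P - q_i * M
        Q = r * Q + q_i           # Q collects the quotient digits
    return P, Q

A, B, M = 0xBEEF, 0x1234, 0xC001
P, Q = classical_modmul(A, B, M, 16, 4)
print(P == (A * B) % M, P == A * B - Q * M)
```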

17
Comments

Carry propagation a problem
(it slows finding q)
Use only top digits of M and P to determine
a good multiple of M to remove
P is bounded by small multiple of M
Clean up only at end
Critical path is finding q.

18
Disadvantages

Redundant rep. for digit-parallel operation
Global broadcast of q to each digit position

19
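Montgomery's Modular Multiplication Algorithm

Pre-condition: $0 \le A < r^n$
$P \leftarrow 0$
For $i = 0$ to $n-1$ do
Begin
$q_i \leftarrow (p_0 + a_i b_0)(-m_0^{-1}) \bmod r$
$P \leftarrow (P + a_i B + q_i M) \text{ div } r$
Invariant: $0 \le P < M + B$
End
Post-condition: $P r^n = AB + QM$,
$P \equiv (AB r^{-n}) \bmod M$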
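
The same algorithm as a short Python sketch. The function name, the returned $Q$ and the test values are assumptions for illustration; `pow(x, -1, r)` requires Python 3.8 or later and fails if $M$ is not prime to $r$.

```python
def montgomery_modmul(A, B, M, r, n):
    """Return (P, Q) with P * r^n == A*B + Q*M, using least significant digits first."""
    assert 0 <= A < r ** n
    neg_m0_inv = (-pow(M % r, -1, r)) % r     # (-m_0^{-1}) mod r
    b0 = B % r
    P, Q = 0, 0
    for i in range(n):
        a_i = (A // r**i) % r
        q_i = ((P % r + a_i * b0) * neg_m0_inv) % r   # uses lowest digits only
        P = (P + a_i * B + q_i * M) // r              # division by r is exact
        assert 0 <= P < M + B                         # loop invariant from the slide
        Q += q_i * r**i
    return P, Q

A, B, M, r, n = 0xBEEF, 0x1234, 0xC001, 16, 4
P, Q = montgomery_modmul(A, B, M, r, n)
print(P * r**n == A * B + Q * M)
```

Note that `P` is congruent to $AB r^{-n}$ modulo $M$ but is not necessarily fully reduced.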

20
Peter Montgomery

reverses the multiplication order
chooses digits from least to most significant
shifts down on each iteration.
uses the least significant digits
to determine multiple of M to
subtract.
Computes $(AB r^{-n}) \bmod M$

The factor $r^{-n}$ is cleared up in post-processing
Any extra multiple of $M$ is removed then
$q_i$ has no carries to wait for
Pipelining of the digits can now take place:
compute $a_i b_{j+1}$ on the cycle after $a_i b_j$
use a non-redundant representation
no broadcasting of $q_i$

22
The Post-Condition

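$m_0^{-1} \bmod r$ exists, since $M$ is prime to $r$
$q_i$ chosen so division by $r$ is exact
Define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously
Then $A_i = A_{i-1} + r^i a_i$ and $A_{n-1} = A$
So $r^{i+1} P = A_i B + Q_i M$ at end of $i$th
iteration
Hence $r^n P = AB + QM$ at end.
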
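
The inductive argument can be checked numerically by tracking the partial sums $A_i$ and $Q_i$ alongside $P$. This is a sketch with a hypothetical function name; it verifies the identity after every iteration rather than proving it.

```python
def check_post_condition(A, B, M, r, n):
    """Check r^(i+1) * P == A_i * B + Q_i * M after each iteration i."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = A_i = Q_i = 0
    for i in range(n):
        a_i = (A // r**i) % r
        # (P + a_i*B) mod r equals (p_0 + a_i*b_0) mod r
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        total = P + a_i * B + q_i * M
        if total % r != 0:                    # q_i makes the division exact
            return False
        P = total // r
        A_i += a_i * r**i
        Q_i += q_i * r**i
        if r**(i + 1) * P != A_i * B + Q_i * M:
            return False
    return A_i == A and r**n * P == A * B + Q_i * M

print(check_post_condition(0xBEEF, 0x1234, 0xC001, 16, 4))
```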
23
The Bounds

$A$ converted on-line to non-redundant form
Can assume $a_i \le r-1$
So loop invariant: $P < M + B$

If critical path length is computing $q$:
Scale $M$ to ensure $(-m_0^{-1}) \bmod r = 1$
Shift $B$ up to make $b_0 = 0$
Result:
$q_i = p_0 \bmod r$ is simple
Critical path in repeated cell.
Cost:
Increase $n$ by 2
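
One way to realise this simplification is sketched below, under assumptions made here: $M$ is scaled by the digit $d = (-M^{-1}) \bmod r$ so that its lowest digit becomes $r-1$, $B$ is shifted up by one digit, and the loop runs for $n+2$ iterations. The result is then congruent to $AB r^{-(n+1)}$ modulo $M$, and is only bounded relative to the scaled modulus.

```python
def scaled_montgomery(A, B, M, r, n):
    """Montgomery loop with q_i = p_0, after scaling M and shifting B."""
    d = (-pow(M % r, -1, r)) % r      # digit d with d*M = -1 (mod r)
    M_s = d * M                       # scaled modulus: (-m_0^{-1}) mod r == 1
    B_s = r * B                       # shifted multiplicand: b_0 == 0
    n_s = n + 2                       # extra iterations for the longer operands
    assert (-pow(M_s % r, -1, r)) % r == 1 and B_s % r == 0
    assert 0 <= A < r ** n_s
    P = 0
    for i in range(n_s):
        a_i = (A // r**i) % r
        q_i = P % r                   # no multiplication needed to find q_i
        P = (P + a_i * B_s + q_i * M_s) // r
    return P

A, B, M, r, n = 0xBEEF, 0x1234, 0xC001, 16, 4
P = scaled_montgomery(A, B, M, r, n)
print((P * r**(n + 1) - A * B) % M == 0)
```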

25
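Removing $r^{-n}$

The Montgomery class of $A$ is
$\bar{A} \equiv r^n A \bmod M$
Montgomery modular multiplication is denoted $\otimes$.
Montgomery product of $\bar{A}$ and $\bar{B}$ is
$\bar{A} \otimes \bar{B} \equiv \bar{A} \bar{B} r^{-n} \equiv AB r^n \equiv \overline{AB} \bmod M$.
Applying $\otimes$ to $\bar{A}$ instead of $\times$ to $A$ produces
$\overline{A^e}$ in an exponentiation algorithm
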
26
Encryption Process
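
Process: $A \to \bar{A} \to \overline{A^e} \to A^e$
Precompute $R_2 = \overline{r^n} \equiv r^{2n} \bmod M$
Start with the Montgomery product $A \otimes R_2 \equiv A r^n \equiv \bar{A} \bmod M$
Exponentiate to obtain $\overline{A^e}$
End with $\overline{A^e} \otimes 1 \equiv A^e \bmod M$
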
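
The whole process can be sketched as follows. The left-to-right square-and-multiply loop is a choice made here for illustration, since the slides do not fix an exponentiation method. The parameters assume $2rM < r^n$, and no subtraction of $M$ is performed at any point.

```python
def mont_mul(X, Y, M, r, n):
    """Montgomery product: returns P with P * r^n == X*Y + Q*M for some Q."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X // r**i) % r
        q_i = ((P + x_i * Y) * neg_m0_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P

def mont_pow(A, e, M, r, n):
    """Compute A^e mod M through the Montgomery class of A (e >= 1)."""
    assert e >= 1 and 2 * r * M < r ** n
    R2 = pow(r, 2 * n, M)                    # precomputed r^(2n) mod M
    A_bar = mont_mul(A, R2, M, r, n)         # into the Montgomery class
    result = A_bar
    for bit in bin(e)[3:]:                   # remaining bits after the leading 1
        result = mont_mul(result, result, M, r, n)
        if bit == "1":
            result = mont_mul(result, A_bar, M, r, n)
    return mont_mul(result, 1, M, r, n)      # multiply by 1 to leave the class

print(mont_pow(65, 17, 3233, 16, 5) == pow(65, 17, 3233))
```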
27
2M Bound

Outputs are re-used as inputs.
So need to bound I/O
Suppose $a_{n-1} = 0$
Then $P < M + B$ at end of loop $n-2$
yields $P < M + r^{-1} B$ at very end.
e.g. If $B < 2M$ then $P < 2M$

Suppose $2rM < r^n$, $A < 2M$ and $R_2 < 2M$
Then $\bar{A} < 2M$, $\overline{A^e} < 2M$ and $P = \overline{A^e} \otimes 1 < 2M$
Final output $P$ satisfies
$P r^n = \overline{A^e} + QM$ where $Q \le r^n - 1$.
Here $\overline{A^e} < 2M$ yields $P r^n < (r^n + 1) M$. So $P \le M$
$P = M \Rightarrow \overline{A^e} \equiv 0 \bmod M \Rightarrow A \equiv 0 \bmod M$
$A = M$ should never arise; $A = 0$ yields $P = 0$.
So no final modular adjustment is necessary.

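
The bounds can be spot-checked by random sampling. This is a sketch, not a proof: the function names, the seed and the trial count are arbitrary choices made here, and the parameters are assumed to satisfy $2rM < r^n$.

```python
import random

def mont_mul(X, Y, M, r, n):
    """Montgomery product without any final subtraction of M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X // r**i) % r
        q_i = ((P + x_i * Y) * neg_m0_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P

def bounds_hold(M, r, n, trials=1000, seed=0):
    """Sample inputs below 2M; check outputs stay below 2M, and at most M for Y = 1."""
    assert 2 * r * M < r ** n
    rng = random.Random(seed)
    for _ in range(trials):
        X, Y = rng.randrange(2 * M), rng.randrange(2 * M)
        if mont_mul(X, Y, M, r, n) >= 2 * M:
            return False
        if mont_mul(X, 1, M, r, n) > M:
            return False
    return True

print(bounds_hold(3233, 16, 5))
```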
29
Digit-Parallel Implementation

Classical vs Montgomery
Similarities
Broadcasting of $q_i$ and $a_i$
Redundant representations
Computing $q_i$ takes time
Differences
Bits to determine $q_i$

30
[Figure: cells $j+1$, $j$, $j-1$ with broadcast inputs $a_i$, $q_i$, local digits $m_j$, $b_j$, and carry-save digits $p_{j,s}$, $p_{j,c}$ entering and leaving each cell.]
$P \leftarrow P + a_i \times B$ (digit-parallel, not modular)
$P$ in carry-save form: $p_j = p_{j,s} + r \times p_{j,c}$
31
[Figure: cells $n-1$, $j+1$, $j$, $j-1$ with inputs $a_i$, $q_i$, local digits $m_j$, $b_j$, outputs $q_i$ and $q_{i+1}$, and carry-save digits $p_{j,s}$, $p_{j,c}$ passed between cells.]
Digit-Parallel $P \leftarrow rP + a_i \times B - q_i \times M$
(Classical)
32
[Figure: cells $j+1$, $j$, $j-1$, ..., $0$ with inputs $a_i$, $q_i$, digits of $M$ and $B$, carries $c_{i,j}$, and digits $p_j^{(i)}$ flowing to $p_j^{(i+1)}$ and finally $p_j^{(n)}$.]
Data Flow for $P^{(i+1)} = (P^{(i)} + a_i \times B + q_i \times M)/r$
(Montgomery)
33
Systolic Array (Montgomery)

Write the $i$th value of $P$ as $P^{(i)} = \sum_{j=0}^{n-1} p^{(i-1)}_{j+1} r^j$
Cells in col $j$ compute $p^{(i)}_j$ at time $2i+j$:
$p^{(i)}_j + r c^{(i)}_j \leftarrow p^{(i-1)}_{j+1} + c^{(i)}_{j-1} + a_i b_j + q_i m_j$
Cells in col 0 compute $q_i$ at time $2i$:
$q_i \leftarrow (p^{(i-1)}_1 + a_i b_0)(-m_0^{-1}) \bmod r$
Any number of rows may be constructed
Different timing schedules are possible

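
The cell recurrence can be simulated digit by digit in software. The sketch below is one reading of the slide, with assumptions made here: row $i$ reads digit $j+1$ of the previous row's output, one extra column absorbs the top carry, and the $2i+j$ timing is not modelled, only the data flow.

```python
def to_digits(x, r, n):
    return [(x // r**j) % r for j in range(n)]

def systolic_montgomery(A, B, M, r, n):
    """Row i, column j computes p_j + r*c_j = p_prev[j+1] + c_{j-1} + a_i*b_j + q_i*m_j."""
    a = to_digits(A, r, n)
    b = to_digits(B, r, n) + [0]          # extra column sees b_n = m_n = 0
    m = to_digits(M, r, n) + [0]
    neg_m0_inv = (-pow(m[0], -1, r)) % r
    cols = n + 1
    p = [0] * (cols + 1)                  # output digits of the previous row
    for i in range(n):
        q_i = ((p[1] + a[i] * b[0]) * neg_m0_inv) % r   # computed in column 0
        out = [0] * (cols + 1)
        carry = 0
        for j in range(cols):
            t = p[j + 1] + carry + a[i] * b[j] + q_i * m[j]
            out[j], carry = t % r, t // r
        out[cols] = carry
        assert out[0] == 0                # lowest digit vanishes: division by r is exact
        p = out
    return sum(p[j + 1] * r**j for j in range(cols))

A, B, M, r, n = 0xBEEF, 0x1234, 0xC001, 16, 4
P = systolic_montgomery(A, B, M, r, n)
print((P * r**n - A * B) % M == 0)
```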
34
Systolic Array for $P = (A \times B + Q \times M) r^{-n}$
[Figure: rows $i$ and $i+1$ of cells $(i, j-1)$, $(i, j)$, $(i, j+1)$; each cell has local digits $m_j$, $b_j$, inputs $a_i$, $q_i$ and a carry, and the digits $p^{(i)}$ pass down to $p^{(i+1)}$ and $p^{(i+2)}$.]
35
[Figure: the Montgomery data flow diagram from slide 32, repeated.]
Data Flow for $P^{(i+1)} = (P^{(i)} + a_i \times B + q_i \times M)/r$
36
Digit-Serial Implementation (Montgomery)

Advantages
Local communication
Shorter critical path
Critical path easily in repeated cell
Non-redundant representation
Digit serial I/O
Different digits $q_i$ and $a_i$ in use in each cycle, which makes DPA harder

37
Digit-Serial Implementation (Montgomery)

Disadvantage
H/W only half used
Solutions
Interleave two multiplications
E.g. configure for exponentiation: 75% use
Group digits as per Peter Kornerup 94

Other cell boundaries/groupings are possible
Timing front angles in the data dependency graph
can be altered
For current speed of array implementations see
Blum and Paar 99
Vuillemin et al. 97 constructed an array
Design is parametrised by k and no. of rows.

39
Data Dependency Diagrams
40
Data Dependency Diagrams
Parallel Digit Implementation
[Figure: data dependency diagram with timing fronts at t = 0, 1, 2, 3.]
41
Data Dependency Diagrams
Walter 93
[Figure: data dependency diagram with timing fronts at t = 0 to 7; arrows labelled "1 tick" and "2 ticks".]
42
Data Dependency Diagrams
Kornerup 94
[Figure: data dependency diagram with timing fronts at t = 0 to 6.]
43
Data Integrity

$P = AB - QM$ or $P r^n = AB + QM$
These are easily checked mod $m$,
e.g. $m$ a prime just above the maximum cell
output.
Cost: one cell in the array, i.e. increasing $n$
by 1.
On error, abort or re-compute by another route,
e.g. $M$ replaced by $dM$ for a digit $d$ prime to $r$.
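
The residue check can be sketched as follows for the Montgomery identity. The check prime $m = 257$ and the function names are assumptions chosen for illustration; the slide only asks for a prime just above the maximum cell output.

```python
def montgomery_pq(A, B, M, r, n):
    """Return (P, Q) with P * r^n == A*B + Q*M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = Q = 0
    for i in range(n):
        a_i = (A // r**i) % r
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) // r
        Q += q_i * r**i
    return P, Q

def residue_check(P, Q, A, B, M, r, n, m):
    """Verify P * r^n == A*B + Q*M using residues mod the small prime m only."""
    lhs = (P % m) * pow(r, n, m) % m
    rhs = ((A % m) * (B % m) + (Q % m) * (M % m)) % m
    return lhs == rhs

A, B, M, r, n, m = 0xBEEF, 0x1234, 0xC001, 16, 4, 257
P, Q = montgomery_pq(A, B, M, r, n)
print(residue_check(P, Q, A, B, M, r, n, m))        # correct result passes
print(residue_check(P ^ 1, Q, A, B, M, r, n, m))    # a flipped low bit is caught
```

A check of this kind misses any error that happens to be a multiple of $m$, so it detects faults with high probability rather than with certainty.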

44
Timing & Power Attacks

Most attacks which succeed on the classical
algorithm have equivalents which will succeed on
the corresponding implementation of Montgomery's
algorithm.
With parallel digit processing, the same digits
of A and Q are used in every digit slice in the
same cycle. So DPA might reveal them.
Pipelined version has no equivalent (see data
dependency graph). It uses many different digits
of A and Q in each cycle. DPA is more difficult.

45
Conclusions

For a single $k$-bit multiplier or an array of $n$
parallel cells, classical and Montgomery
algorithms are almost equal.
For a pipelined array, Montgomery's method has
advantages: smaller time and area constants, better
I/O, better against DPA
Pipeline is more complex for 100% use, but has a faster
clock.
Parameters can be chosen for specific purposes.