Montgomery's Multiplication Technique: How to Make It Smaller and Faster


1
Montgomery's Multiplication Technique: How to
Make It Smaller and Faster
  • Colin D. Walter
  • Computation Department, UMIST, UK
  • www.co.umist.ac.uk

2
Peter Montgomery
  • "Modular Multiplication Without Trial Division",
    Mathematics of Computation, vol. 44 (1985),
    pp. 519-521
  • Computes $(A \times B) \bmod M$
    without obtaining the digits of the quotient $q = \lfloor (A \times B) / M \rfloor$

3
Motivation
  • Faster RSA cryptosystem
      • through a pipelined array
  • Safer encryption
      • against timing or DPA attacks

4
Overview
  • RSA Notation
  • Classical Algorithm
  • Montgomery's Version
  • Comparison
      • carry propagation
      • digit distribution
      • communication
      • timing/power attacks
  • Conclusion

5
Enigma
  • Special purpose: Colossus (1943-44)
      • Tommy Flowers,
      • Bletchley Park, England.
  • General purpose: ENIAC (1943-46)
      • John Eckert and John Mauchly,
      • Philadelphia, US.

6
RSA
  • Modulus M of around 1024 bits
  • Two keys d and e such that $A^{de} \equiv A \bmod M$
  • A encrypted to $C = A^e \bmod M$
  • C decrypted by $A = C^d \bmod M$
  • $M = PQ$, a product of two large primes
  • e is often small (e.g. a Fermat prime)
  • d satisfies $de \equiv 1 \bmod (P-1)(Q-1)$
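The relations on this slide can be tried out with toy numbers. The sketch below is illustrative only: the primes 61 and 53, the exponent 17 and the message 1234 are assumptions chosen for readability, not values from the talk, and a real modulus would be around 1024 bits.

```python
# Toy RSA parameters (assumed for illustration; far too small for real use).
P, Q = 61, 53                       # two small primes
M = P * Q                           # modulus M = PQ
e = 17                              # small public exponent (17 = 2**4 + 1 is a Fermat prime)
phi = (P - 1) * (Q - 1)
d = pow(e, -1, phi)                 # d*e ≡ 1 mod (P-1)(Q-1); needs Python 3.8+

A = 1234                            # message with 0 <= A < M
C = pow(A, e, M)                    # encryption: C = A^e mod M
recovered = pow(C, d, M)            # decryption: A = C^d mod M
```

Every `pow(..., M)` call here hides a long chain of modular multiplications, which is the operation the rest of the talk tries to speed up.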

7
Faster H/W ⇒ More Secure Encryption
  • Work to factorize M doubles for every
    extra 15 bits (for key lengths around $2^{10}$
    bits)
  • Work to en/decrypt grows by a factor of
      • $((1024+15)/1024)^2$ per multiplication
      • $((1024+15)/1024)^3$ per exponentiation,
      • i.e. only about 5% extra!
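The arithmetic behind the "5% extra" figure is quick to check. This sketch simply evaluates the two ratios for a 1024-bit key extended by 15 bits; the variable names are made up for the example.

```python
bits, extra = 1024, 15
growth = (bits + extra) / bits          # relative increase in operand length

per_multiplication = growth ** 2        # multiplication cost grows quadratically
per_exponentiation = growth ** 3        # exponentiation cost grows cubically

print(f"{(per_multiplication - 1) * 100:.1f}% more work per multiplication")
print(f"{(per_exponentiation - 1) * 100:.1f}% more work per exponentiation")
```

The exponentiation overhead comes out at roughly 4.5%, which the slide rounds to 5%, while the attacker's factoring work doubles.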

8
Number representations
  • $X = \sum_{i=0}^{n-1} x_i r^i$
  • $r = 2^k$ is the radix (prime to M)
  • $x_i$ is the ith digit (usually $0 \le x_i < r$)
  • n = max no. of digits in any number
  • Redundant reps
      • wider digit range than $0 .. r-1$
  • H/W is built from $k \times k$-bit multipliers
  • n fixed by H/W register size
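As a small illustration of the non-redundant representation above, the following sketch splits an integer into n radix-$2^k$ digits and reassembles it. The helper names `to_digits` and `from_digits` are hypothetical, introduced only for this example.

```python
def to_digits(X, k, n):
    """Return the n radix-2**k digits of X, least significant first."""
    r = 1 << k
    assert 0 <= X < r ** n, "X must fit in n digits"
    return [(X >> (k * i)) & (r - 1) for i in range(n)]


def from_digits(x, k):
    """Evaluate the sum of x[i] * r**i for r = 2**k."""
    r = 1 << k
    return sum(x_i * r ** i for i, x_i in enumerate(x))


digits = to_digits(0xBEEF, k=4, n=4)    # hexadecimal digits, lowest first
```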

9
Redundancy
  • Digits $x_j$ split into carry-save parts: $x_j = x_{j,s} + r\,x_{j,c}$
  • $X \leftarrow X + Y$
  • is performed by digit-parallel addition
  • $x_j \leftarrow x_{j,s} + x_{j-1,c} + y_j$
  • No carry propagation: only old carries appear on the right
    side
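A minimal sketch of this digit-parallel update follows, assuming X is held as two digit lists `s` (save parts) and `c` (carry parts). The names `cs_add` and `cs_value`, and the choice to grow the result by one digit position so that the top carry is not lost, are assumptions of this example rather than details from the slides.

```python
def cs_value(s, c, r):
    """Integer represented by carry-save digits x_j = s_j + r * c_j."""
    return sum((s_j + r * c_j) * r ** j for j, (s_j, c_j) in enumerate(zip(s, c)))


def cs_add(s, c, y, r):
    """Add the plain digit list y to the carry-save number (s, c)."""
    n = len(s)
    y = list(y) + [0] * (n - len(y))
    new_s, new_c = [], []
    for j in range(n + 1):                     # one extra position absorbs c[n-1]
        s_j = s[j] if j < n else 0
        y_j = y[j] if j < n else 0
        carry_in = c[j - 1] if j > 0 else 0    # OLD carry from the position below
        t = s_j + carry_in + y_j
        new_s.append(t % r)                    # new save part
        new_c.append(t // r)                   # new carry part, not propagated
    return new_s, new_c


r = 16
s, c = [0, 0, 0], [0, 0, 0]
s, c = cs_add(s, c, [15, 15, 15], r)
s, c = cs_add(s, c, [15, 15, 15], r)
```

Each position reads only values that existed before the step, so all positions could be updated in the same clock cycle.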

10
Multiplication $A \times B$
  • Use n digit multipliers to form $a_i \times B$ and add to a
    partial product P
  • $P \leftarrow 0$
  • For $i = n-1$ downto 0 do
  • $P \leftarrow r \times P + a_i \times B$
  • Post-condition: $P = A \times B$
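The loop above can be sketched directly in software. Here Python integers stand in for the hardware registers, and the function name `multiply` is an assumption of this example.

```python
def multiply(A, B, k, n):
    """Most-significant-digit-first multiplication with radix r = 2**k."""
    r = 1 << k
    assert 0 <= A < r ** n
    P = 0
    for i in range(n - 1, -1, -1):          # i = n-1 downto 0
        a_i = (A >> (k * i)) & (r - 1)      # ith digit of A
        P = r * P + a_i * B                 # shift up, then add a_i * B
    return P                                # post-condition: P == A * B
```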

11
  • Either: use redundancy in P and parallel digit
    addition to add $a_i B$ in one clock cycle
  • Cell j computes $a_i b_j$ in cycle i

[Figure: cells j+1, j, j-1, each with inputs $a_i$ and $b_j$ and carry-save digits $p_{j,s}$, $p_{j,c}$]
$P \leftarrow P + a_i \times B$ (digit-parallel)
P in carry-save form: $p_j = p_{j,s} + r\,p_{j,c}$

12
  • Or: pipeline the addition of $a_i \times B$ over n cycles
    and propagate carries with no redundancy
  • Cell j computes $a_i b_j$ in cycle $i+j$

[Figure: cells j+1, j, j-1 operating at times j+1, j, j-1, each with inputs $p_j$, $b_j$ and $a_i$, passing a carry to the next cell]
$P \leftarrow P + a_i \times B$ (digit-serial)

13
Multiplier Complexity
  • Assume wires take area but not time (or power).
  • $\text{Area} \times \text{Time}^2$ complexity for un-pipelined k-bit
    multiplication is bounded below by $k^2$
  • This bound can be achieved for any time in the range $\log k$ .. $\sqrt{k}$
  • Discrete Fourier Transform designs have large constants
    for time and area.
  • Asymptotically poorer designs are expected to be better
    for the values of k used here.

14
  • Cross-over point?
  • $10^7$ transistors available for RSA?
  • $k \le 64$ to accommodate $a_i \times B$
  • Gain speed by using at least n multipliers to perform
    a full-length $a_i \times B$ (or equivalent) in one cycle.

15
Real-Time?
  • Assume
      • bus carries one k-bit digit per cycle
      • k-bit multiplier operates in one cycle
  • Then
      • $A \times B$ takes n cycles using n multipliers
      • Throughput is one digit per cycle for multn.
      • Need O(nk) multiplications for decryption
  • Conclude
      • Need O(nk) rows of n multipliers.

16
Classical Mod Multn Algorithm
  • Pre-condition: $0 \le A < r^n$
  • $P \leftarrow 0$
  • For $i = n-1$ downto 0 do
  • Begin
  • $P \leftarrow rP + a_i B$
  • $q_i \leftarrow P \text{ div } M$
  • $P \leftarrow P - q_i M$
  • End
  • Post-conditions: $P = AB - QM$,
  • $P \equiv (AB) \bmod M$
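A software sketch of the classical loop is given below. It takes the quotient digit $q_i$ by exact integer division, whereas hardware would estimate it from the top digits; the function name is an assumption of this example.

```python
def classical_mod_mul(A, B, M, k, n):
    """Interleaved multiply-and-reduce, most significant digit of A first."""
    r = 1 << k
    assert 0 <= A < r ** n
    P = 0
    for i in range(n - 1, -1, -1):          # i = n-1 downto 0
        a_i = (A >> (k * i)) & (r - 1)
        P = r * P + a_i * B
        q_i = P // M                        # quotient digit: needs the top of P
        P = P - q_i * M                     # remove that multiple of M
    return P                                # P == (A * B) mod M
```

The division `P // M` depends on the most significant digits of P, so in hardware it has to wait for carries to settle. That is the step Montgomery's method avoids.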

17
Comments
  • Carry propagation is a problem
    (it slows finding q)
  • Use only the top digits of M and P to determine
    a good multiple of M to remove
  • P is then bounded by a small multiple of M
  • Clean up only at the end
  • Critical path is finding q.

18
Disadvantages
  • Redundant rep. for digit-parallel operation
  • Global broadcast of q to each digit position

19
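Montgomery's Mod Multn Algm
  • Pre-condition: $0 \le A < r^n$
  • $P \leftarrow 0$
  • For $i = 0$ to $n-1$ do
  • Begin
  • $q_i \leftarrow (p_0 + a_i b_0)(-m_0^{-1}) \bmod r$
  • $P \leftarrow (P + a_i B + q_i M) \text{ div } r$
  • Invariant: $0 \le P < M + B$
  • End
  • Post-condition: $P r^n = AB + QM$,
  • $P \equiv (A B r^{-n}) \bmod M$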
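The Montgomery loop translates almost line for line into software. The sketch below assumes M is odd (so that $m_0$ is invertible mod $r = 2^k$) and uses Python 3.8+ for the modular inverse; the function name `mont_mul` is hypothetical.

```python
def mont_mul(A, B, M, k, n):
    """Return P with P * r**n ≡ A * B (mod M), for r = 2**k and odd M."""
    r = 1 << k
    assert 0 <= A < r ** n and M % 2 == 1
    neg_m0_inv = (-pow(M % r, -1, r)) % r       # (-m_0^{-1}) mod r
    P = 0
    for i in range(n):                          # i = 0 to n-1
        a_i = (A >> (k * i)) & (r - 1)
        # Only the lowest digits p_0 and b_0 are needed to choose q_i.
        q_i = (((P % r) + a_i * (B % r)) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) >> k        # exact division by r
        assert 0 <= P < M + B                   # loop invariant from the slide
    return P
```

The result is not reduced below M here, matching the slide: it only satisfies $0 \le P < M + B$.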

20
Peter Montgomery
  • reverses the multiplication order
  • chooses digits from least to most significant
  • shifts down on each iteration.
  • uses the least significant digits
    to determine multiple of M to
    subtract.
  • Computes $(A B r^{-n}) \bmod M$

21
  • The factor $r^{-n}$ is cleared up in post-processing
  • Any extra multiple of M is removed then
  • $q_i$ has no carries to wait for
  • Pipelining of the digits can now take place
      • compute $a_i b_{j+1}$ on the cycle after $a_i b_j$
      • use a non-redundant representation
      • no broadcasting of $q_i$

22
The Post-Condition
  • $m_0^{-1}$ exists mod r (r is prime to M)
  • $q_i$ chosen so division by r is exact
  • Define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously
  • Then $A_i = A_{i-1} + r^i a_i$ and $A_{n-1} = A$
  • So $r^{i+1} P = A_i B + Q_i M$ at end of ith
    iteration
  • Hence $r^n P = AB + QM$ at end.

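The inductive argument can be checked numerically by carrying the partial sums $A_i$ and $Q_i$ through the loop and asserting the identity after every iteration. This is a sketch under the same assumptions as before (odd M, radix $2^k$, Python 3.8+), with a made-up function name.

```python
def mont_mul_traced(A, B, M, k, n):
    """Montgomery loop that verifies r^(i+1) * P == A_i * B + Q_i * M each step."""
    r = 1 << k
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = A_i = Q_i = 0
    for i in range(n):
        a_i = (A >> (k * i)) & (r - 1)
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        total = P + a_i * B + q_i * M
        assert total % r == 0                   # division by r is exact
        P = total // r
        A_i += a_i * r ** i                     # A_i = A_{i-1} + r^i a_i
        Q_i += q_i * r ** i                     # Q_i defined analogously
        assert r ** (i + 1) * P == A_i * B + Q_i * M
    return P, Q_i
```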
23
The Bounds
  • A converted on-line to non-redundant form
  • Can assume $a_i \le r-1$
  • So loop invariant $P < M + B$

24
  • If critical path length is computing q
      • Scale M to ensure $(-m_0^{-1}) \bmod r = 1$
      • Shift B up to make $b_0 = 0$
  • Result
      • $q_i = p_0 \bmod r$ is simple
      • Critical path in repeated cell.
  • Cost
      • Increase n by 2
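One way to read this slide in software is sketched below. It assumes M is scaled by the digit $d = (-m_0^{-1}) \bmod r$, so that the scaled modulus ends in the digit $r-1$, and that B is shifted up by one digit. The slide does not spell out the scaling factor, so that choice and the function name are assumptions of this example.

```python
def mont_mul_simple_q(A, B, M, k, n):
    """Montgomery loop where q_i is just the lowest digit of P.

    Returns P with P * r**n ≡ A * B * r (mod M); n must cover the digits of A.
    """
    r = 1 << k
    d = (-pow(M % r, -1, r)) % r        # assumed scaling digit
    M_scaled = d * M                    # lowest digit is r-1, so -m_0^{-1} ≡ 1
    B_shifted = B * r                   # lowest digit b_0 is now 0
    assert M_scaled % r == r - 1
    P = 0
    for i in range(n):
        a_i = (A >> (k * i)) & (r - 1)
        q_i = P % r                     # no multiplication needed to find q_i
        P = (P + a_i * B_shifted + q_i * M_scaled) >> k
    return P
```

Both the scaled M and the shifted B are one digit longer than before, which is consistent with the slide's stated cost of increasing n by 2.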

25
Removing $r^n$
  • The Montgomery class of A is
  • $\bar{A} \equiv r^n A \bmod M$
  • Montgomery modular multn is denoted $\otimes$.
  • Montgomery product of $\bar{A}$ and $\bar{B}$ is
  • $\bar{A} \otimes \bar{B} \equiv \bar{A}\,\bar{B}\,r^{-n} \equiv A B r^n \equiv \overline{AB} \bmod M$.
  • Applying $\otimes$ to $\bar{A}$ instead of $\times$ to A produces
  • $\overline{A^e}$ in an expn algorithm

26
Encryption Process
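  • Process: $A \to \bar{A} \to \overline{A^e} \to A^e$
  • Precompute $R_2 = \overline{r^n} \equiv r^{2n} \bmod M$
  • Start with $A \otimes R_2 \equiv A r^n \equiv \bar{A} \bmod M$
  • Exponentiate to obtain $\overline{A^e}$
  • End with $\overline{A^e} \otimes 1 \equiv A^e \bmod M$
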
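The whole process can be sketched as a left-to-right square-and-multiply exponentiation built on the Montgomery product. The sketch assumes an odd modulus, picks the smallest n with $2rM < r^n$ so that intermediate values stay below $2M$ without any conditional subtraction, and uses hypothetical function names; it is one possible arrangement, not necessarily the one used in the talk.

```python
def mont_mul(A, B, M, k, n):
    """Montgomery product: returns P with P * r**n ≡ A * B (mod M)."""
    r = 1 << k
    neg_m0_inv = (-pow(M % r, -1, r)) % r       # Python 3.8+
    P = 0
    for i in range(n):
        a_i = (A >> (k * i)) & (r - 1)
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) >> k
    return P


def mont_exp(A, e, M, k):
    """Compute A**e mod M via Montgomery classes (M odd, 0 <= A < M)."""
    r = 1 << k
    n = 1
    while r ** n <= 2 * r * M:                  # smallest n with 2rM < r^n
        n += 1
    R2 = pow(r, 2 * n, M)                       # precomputed r^(2n) mod M
    A_bar = mont_mul(A, R2, M, k, n)            # A -> Montgomery class of A
    result = mont_mul(1, R2, M, k, n)           # Montgomery class of 1
    for bit in bin(e)[2:]:                      # exponent bits, most significant first
        result = mont_mul(result, result, M, k, n)
        if bit == "1":
            result = mont_mul(result, A_bar, M, k, n)
    return mont_mul(result, 1, M, k, n)         # multiply by 1 to strip r^n
```

For example, `mont_exp(1234, 17, 3233, 4)` should agree with Python's built-in `pow(1234, 17, 3233)`.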
27
2M Bound
  • Outputs are re-used as inputs.
  • So need to bound I/O
  • Suppose $a_{n-1} = 0$
  • Then $P < M + B$ at end of loop $n-2$
  • yields $P < M + r^{-1}B$ at very end.
  • e.g. If $B < 2M$ then $P < 2M$
28
  • Suppose $2rM < r^n$, $A < 2M$ and $R_2 < 2M$
  • Then $\bar{A} < 2M$, $\overline{A^e} < 2M$ and $P = \overline{A^e} \otimes 1 < 2M$
  • Final output P satisfies
  • $P r^n = \overline{A^e} + QM$ where $Q \le r^n - 1$.
  • Here $\overline{A^e} < 2M$ yields $P r^n < (r^n + 1)M$. So $P \le M$
  • $P = M \Rightarrow \overline{A^e} \equiv 0 \bmod M \Rightarrow A \equiv 0 \bmod M$
  • $A = M$ should never arise; $A = 0$ yields $P = 0$.
  • So no final modular adjustment is necessary.

29
Digit-Parallel Implementation
  • Classical vs Montgomery
  • Similarities
      • Broadcasting of $q_i$ and $a_i$
      • Redundant representations
      • Computing $q_i$ takes time
  • Differences
      • Bits to determine $q_i$

30
[Figure: cells j+1, j, j-1, each receiving $a_i$, $q_i$, $m_j$, $b_j$ and carry-save digits $p_{j,s}$, $p_{j,c}$]
$P \leftarrow P + a_i \times B$ (digit-parallel, not modular)
P in carry-save form: $p_j = p_{j,s} + r\,p_{j,c}$

31
[Figure: cells n-1, j+1, j, j-1 with inputs $a_i$, $q_i$, $m_j$, $b_j$, carry-save digits $p_{j,s}$, $p_{j,c}$, and $q_{i+1}$ produced at the top cell]
Digit-Parallel $P \leftarrow rP + a_i \times B - q_i \times M$
(Classical)

32
[Figure: cells j+1, j, j-1, ..., 0 with inputs $m_j$, $b_j$, $a_i$, $q_i$, carries $c_{i,j}$, and digits $p^{(i)}_j$ flowing to $p^{(i+1)}_{j-1}$]
Data Flow for $P^{(i+1)} \leftarrow (P^{(i)} + a_i \times B + q_i \times M)/r$
(Montgomery)

33
Systolic Array (Montgomery)
  • Write the ith value of P as $P^{(i)} = \sum_{j=0}^{n-1} p^{(i)}_{j+1} r^j$
  • Cells in col j compute $p^{(i)}_j$ at time $2i+j$
  • $p^{(i)}_j + r\,c^{(i)}_j \leftarrow p^{(i-1)}_{j+1} + c^{(i)}_{j-1} + a_i b_j + q_i m_j$
  • Cells in col 0 compute $q_i$ at time $2i$
  • $q_i \leftarrow (p^{(i-1)}_1 + a_i b_0)(-m_0^{-1}) \bmod r$
  • Any number of rows may be constructed
  • Different timing schedules are possible

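The cell recurrence can be mimicked in software by working on digit lists and passing a carry from column to column. The sketch below is a sequential simulation, not a model of the array timing, and it uses the conventional indexing where digit j has weight $r^j$ (each row shifts its output down by one position) rather than the slide's offset indexing. The function name is hypothetical.

```python
def mont_mul_digit_serial(a, b, m, k):
    """Digit-level Montgomery product on little-endian digit lists of length n.

    Returns the n+1 digits of P, where P * r**n ≡ A * B (mod M), r = 2**k.
    """
    r = 1 << k
    n = len(m)
    neg_m0_inv = (-pow(m[0], -1, r)) % r        # needs odd m[0]; Python 3.8+
    p = [0] * (n + 1)
    for i in range(n):                          # one row of cells per digit a_i
        q_i = ((p[0] + a[i] * b[0]) * neg_m0_inv) % r    # column 0 finds q_i
        carry = 0
        nxt = [0] * (n + 1)
        for j in range(n + 1):                  # columns, lowest first
            b_j = b[j] if j < n else 0
            m_j = m[j] if j < n else 0
            t = p[j] + carry + a[i] * b_j + q_i * m_j
            digit, carry = t % r, t // r        # carry goes to the next column
            if j > 0:
                nxt[j - 1] = digit              # shift down; digit 0 is always zero
        nxt[n] = carry
        p = nxt
    return p
```

Each column only needs its own digits, the incoming carry and the values $a_i$ and $q_i$ handed along the row, which is the local communication pattern the systolic array exploits.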
34
Systolic Array for $P = (A \times B + Q \times M)\,r^{-n}$
[Figure: rows i and i+1 of cells (i, j+1), (i, j), (i, j-1); each cell takes $m_j$, $b_j$ and $p^{(i)}_j$, passes $a_i$, $q_i$ and a carry along the row, and sends $p^{(i+1)}_j$ to the row below, which outputs $p^{(i+2)}_j$]

35
[Figure: cells j+1, j, j-1, ..., 0 with inputs $m_j$, $b_j$, $a_i$, $q_i$, carries $c_{i,j}$, and digits $p^{(i)}_j$ flowing to $p^{(i+1)}_{j-1}$]
Data Flow for $P^{(i+1)} \leftarrow (P^{(i)} + a_i \times B + q_i \times M)/r$

36
Digit-Serial Implementation (Montgomery)
  • Advantages
  • Local communication
  • Shorter critical path
  • Critical path easily in repeated cell
  • Non-redundant representation
  • Digit serial I/O
  • Different digits $q_i$ and $a_i$ in use in each cycle (helps against DPA)

37
Digit-Serial Implementation (Montgomery)
  • Disadvantage
  • H/W only half used
  • Solutions
  • Interleave two multiplications
  • E.g. configure for exponentiation ⇒ 75% use
  • Group digits as per Peter Kornerup 94
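The "half used" observation follows from the timing schedule in which the cell in column j handles digit $a_i$ at time $2i+j$. The sketch below assumes a single row of n cells reused for every i, and a second multiplication started one cycle after the first; both the function name and the `start` parameter are assumptions of this example.

```python
def busy_columns(t, n, start=0):
    """Map column j -> iteration i for the cells active at time t.

    Column j handles iteration i at time start + 2*i + j.
    """
    active = {}
    for j in range(n):
        steps = t - start - j
        if steps >= 0 and steps % 2 == 0 and steps // 2 < n:
            active[j] = steps // 2
    return active


n = 4
first = busy_columns(4, n)              # one multiplication on its own
second = busy_columns(4, n, start=1)    # a second one started a cycle later
```

At any time only columns of one parity are busy with a given multiplication, so a second multiplication offset by one cycle can occupy the idle columns.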

38
  • Other cell boundaries/groupings are possible
  • Timing front angles in the data dependency graph
    can be altered
  • For current speed of array implementations see
    Blum and Paar 99
  • Vuillemin et al. 97 constructed an array
  • Design is parametrised by k and no. of rows.

39
Data Dependency Diagrams
40
Data Dependency Diagrams
[Figure: Parallel Digit Implementation, with timing fronts t = 0, 1, 2, 3]

41
Data Dependency Diagrams
[Figure: Walter 93 schedule, with timing fronts t0 to t7; steps of 1 tick and 2 ticks marked]

42
Data Dependency Diagrams
[Figure: Kornerup 94 schedule, with timing fronts t0 to t6]

43
Data Integrity
  • $P = AB - QM$ or $P r^n = AB + QM$
  • These are easily checked mod m.
  • e.g. m a prime just above the maximum cell
    output.
  • Cost: one cell in the array, i.e. increasing n
    by 1.
  • On error, abort or re-compute by another route
  • e.g. M replaced by dM for a digit d prime to r.
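A residue check of this kind can be sketched as follows. The check modulus 17 (a prime just above the digit range for r = 16) and the function name are assumptions of this example; the point is only that the identity $P r^n = AB + QM$ is tested on small residues rather than on full-length numbers.

```python
def mont_mul_checked(A, B, M, k, n, m_check):
    """Montgomery product plus a residue check of P * r^n == A*B + Q*M mod m_check."""
    r = 1 << k
    neg_m0_inv = (-pow(M % r, -1, r)) % r       # Python 3.8+
    P = Q = 0
    for i in range(n):
        a_i = (A >> (k * i)) & (r - 1)
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) >> k
        Q += q_i * r ** i                       # accumulate the digits of Q
    lhs = (P % m_check) * pow(r, n, m_check) % m_check
    rhs = ((A % m_check) * (B % m_check) + (Q % m_check) * (M % m_check)) % m_check
    return P, Q, lhs == rhs


def residues_agree(P, Q, A, B, M, k, n, m_check):
    """Re-run only the residue comparison, e.g. on a possibly corrupted P."""
    r = 1 << k
    return (P * pow(r, n, m_check) - A * B - Q * M) % m_check == 0
```

A single corrupted digit in P changes the left-hand residue, so the comparison fails and the caller can abort or recompute.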

44
Timing and Power Attacks
  • Most attacks which succeed on the classical
    algorithm have equivalents which will succeed on
    the corresponding implementation of Montgomery's
    algorithm.
  • With parallel digit processing, the same digits
    of A and Q are used in every digit slice in the
    same cycle. So DPA might reveal them.
  • Pipelined version has no equivalent (see data
    dependency graph). It uses many different digits
    of A and Q in each cycle. DPA is more difficult.

45
Conclusions
  • For a single k-bit multiplier or an array of n
    parallel cells, classical and Montgomery
    algorithms are almost equal.
  • For a pipelined array, the Montgomery method has
    advantages: smaller time and area constants, better
    I/O, better against DPA
  • Pipeline is more complex for 100% use, but has a faster
    clock.
  • Parameters can be chosen for specific purposes.

46
  • Go forth and Multiply