Long Modular Multiplication for Cryptographic Applications presentation

Transcript and Presenter's Notes

Title: Long Modular Multiplication for Cryptographic Applications

1
Long Modular Multiplication for Cryptographic Applications

Laszlo Hars
Seagate Research
Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004, Boston, MA
Full version of the paper is at
http://www.hars.us/Papers/ModMult.pdf

2
Outline

Background (need, algorithms, complexity)
Target: occasional PK crypto (smartcard, OSD)
Optimizations
Hardware architecture
General purpose, support fast modular reduction
Speed: parallel operation (multiply + add / load)
Memory: in-place update
Algorithmic improvements
Multiply with short reciprocal (trial division)
Precision scaling of reciprocals
Drop insignificant terms
Modulus scaling

3
Modular Multiplication

a·b mod m = remainder of (a·b) ÷ m
Used in RSA, ECC, ElGamal, Diffie-Hellman,
primality tests, BBS-PRNG
Assume a, b, m are n-digit numbers
m normalized: ½·d^n ≤ m < d^n
Digit size (machine word): 16 bits (8…64)
n = 64 for RSA-1024 (10…256)
Squaring: twice as fast
Conserve memory
Divide after multiply: double-length product

4
Modular Multiplication

Interleaved multiplication and division
Barrett multiplication
Multiply with reciprocal (d^(2n)/m: extra n digits)
Quisquater's multiplication
Scaling the modulus for many MS 1-bits (S: extra n digits of storage)
Montgomery multiplication
Number representation: a → a·d^n mod m
Right-to-left (simple) interleaved division
Needs pre- and post-processing
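
The Montgomery representation a → a·d^n mod m and its pre- and post-processing can be illustrated with a minimal Python sketch. It uses Python big integers rather than digit-serial right-to-left reduction, assumes an odd modulus, and the function names (montgomery_setup, redc, mont_mul) are illustrative, not taken from the paper.

```python
def montgomery_setup(m, n, d=1 << 16):
    """Precompute constants for an odd modulus m with m < d^n."""
    r = d ** n
    m_neg_inv = (-pow(m, -1, r)) % r  # -m^(-1) mod d^n (needs Python 3.8+)
    return r, m_neg_inv


def redc(t, m, r, m_neg_inv):
    """Montgomery reduction: t * r^(-1) mod m, for 0 <= t < m * r."""
    u = ((t % r) * m_neg_inv) % r
    s = (t + u * m) // r  # exact division: t + u*m is a multiple of r
    return s - m if s >= m else s


def mont_mul(a, b, m, n, d=1 << 16):
    """a*b mod m via the Montgomery representation x -> x*d^n mod m."""
    r, m_neg_inv = montgomery_setup(m, n, d)
    a_bar = (a * r) % m  # pre-processing
    b_bar = (b * r) % m
    p_bar = redc(a_bar * b_bar, m, r, m_neg_inv)  # a*b*d^n mod m
    return redc(p_bar, m, r, m_neg_inv)  # post-processing


if __name__ == "__main__":
    m = 0xF123456789ABCDEF  # odd, 4 digits of 16 bits
    print(hex(mont_mul(0x1234567890ABCDEF, 0x0FEDCBA987654321, m, 4)))
```

In a long exponentiation the conversion cost is paid once, and only the middle redc call is repeated per multiplication.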

5
Sub-Quadratic time algorithms

Fast multiplications
Complicated algorithms
Pays for very long numbers
Karatsuba: O(n^(log2 3)), faster if n > 10…30
Toom-Cook 3-, 4-way: O(n^α)
3FT (Finite Field Fourier Transform):
O(n log n log log n)
Division: multiplication with reciprocal
Long reciprocal: d^(2n)/m
Newton iteration: 0.62 multiplication time
Speed-ups for PKC: www.hars.us/Papers/Truncated
Products.pdf
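
As a rough illustration of why Karatsuba is sub-quadratic, the sketch below splits each operand in half and uses three recursive products instead of four. It works on Python integers by bit length; the cutoff value is an arbitrary assumption, not the n > 10…30 digit threshold quoted above.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Karatsuba product of two non-negative integers."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # small operands: fall back to the built-in product
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask
    y1, y0 = y >> half, y & mask
    hi = karatsuba(x1, y1, cutoff_bits)
    lo = karatsuba(x0, y0, cutoff_bits)
    # One extra product replaces the two cross products x1*y0 + x0*y1.
    mid = karatsuba(x1 + x0, y1 + y0, cutoff_bits) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo


if __name__ == "__main__":
    print(karatsuba(3 ** 500, 7 ** 400) == 3 ** 500 * 7 ** 400)
```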

6
Quadratic time algorithms

School multiplication: n² digit products
School division: k·n² digit operations
Quotient digits estimated with short divisions
Digit multiplications + other operations
Simple structure
No extra storage when interleaved
Slower
Quotient digits with trial-and-error
Goal: reduce correction steps

7
Multiply-Accumulate

DSP: multiplication in parallel to load / store / add / compare
Order of the digit-product calculation
Row-order (use input digits sequentially)
for i = 0 … |a|-1, for j = 0 … |b|-1: a_i·b_j
More memory access
Column-order (output digits sequentially)
for k = 0 … |a|+|b|-2, for i, j with i+j = k: a_i·b_j
Longer accumulator (can be split)
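
The two orders of digit-product calculation can be compared in a short Python sketch. Digits are 16-bit values stored least significant first; the helper names are illustrative. Row order reads and rewrites the partial result in every inner step, while column order finishes one output digit at a time using a longer accumulator.

```python
D = 1 << 16  # digit base d


def to_digits(x, n):
    """Split x into n base-D digits, least significant first."""
    return [(x >> (16 * i)) & 0xFFFF for i in range(n)]


def from_digits(digits):
    return sum(v << (16 * i) for i, v in enumerate(digits))


def mul_row_order(a, b):
    """Use the digits of a sequentially; r is read and written each step."""
    r = [0] * (len(a) + len(b))
    for i, a_i in enumerate(a):
        carry = 0
        for j, b_j in enumerate(b):
            t = r[i + j] + a_i * b_j + carry
            r[i + j] = t % D
            carry = t // D
        r[i + len(b)] = carry
    return r


def mul_column_order(a, b):
    """Produce output digits sequentially with one long accumulator."""
    r = [0] * (len(a) + len(b))
    acc = 0
    for k in range(len(a) + len(b) - 1):
        for i in range(max(0, k - len(b) + 1), min(k, len(a) - 1) + 1):
            acc += a[i] * b[k - i]  # all products with i + j = k
        r[k] = acc % D  # each output digit is written once
        acc //= D
    r[len(a) + len(b) - 1] = acc
    return r


if __name__ == "__main__":
    a = to_digits(0x123456789ABCDEF0FEDCBA9876543210, 8)
    b = to_digits(0x0F1E2D3C4B5A6978, 4)
    print(mul_row_order(a, b) == mul_column_order(a, b))
```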

8
HW Architecture

General purpose µP with enhancements
Circuit utilization: multi-use
DSP structure: multiplication + others
Multiplier is large and slow
Long accumulator
Split adder / counter
In-Accumulator instructions
Quotient-digit correction circuit
Updateable memory: circular offset write

9
HW Architecture

16-bit digits
Shift-add 17.5-bit mult
In-Accumulator: shift, add

10
Quotient Digits

No need to store q
q ← multiplication with short reciprocal µ
µ is used many times
µ ← Newton iteration, look-up table
All bits - 2 MS digits and 1 bit: error 0 or 1 (-1)
More than 1-digit reciprocal: quotient often OK
Most economical: µ = d^(n+2) / 2m
µ = (µ1, µ0); scale: 2m, making µ exactly 2 digits
Special case: m = ½·d^n → µ = d² - 1
Usable: µ = d^(n+1.5) / m, µ = 2d^(n+1) / m
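
A small Python experiment can illustrate the quotient-digit estimate with the two-digit reciprocal µ = d^(n+2) / 2m. The rounding (floor) and the way the top two digits of the remainder are taken are assumptions of this sketch, and the function names are illustrative. Under these assumptions the estimate is never too large and is at most one too small.

```python
import random


def short_reciprocal(m, n, d=1 << 16):
    """Two-digit reciprocal floor(d^(n+2) / (2m)), clipped to d^2 - 1."""
    return min(d ** (n + 2) // (2 * m), d * d - 1)


def estimate_quotient(r, mu, n, d=1 << 16):
    """Estimate r // m from the top two digits of r (assumes r < m*d)."""
    top = r // d ** (n - 1)  # digits n and n-1 of r
    return (2 * top * mu) // d ** 3


def observed_errors(trials=1000, n=4, d=1 << 16, seed=1):
    """Set of (true quotient - estimate) values over random inputs."""
    rng = random.Random(seed)
    errors = set()
    for _ in range(trials):
        m = rng.randrange(d ** n // 2, d ** n)  # normalized modulus
        r = rng.randrange(0, m * d)  # one quotient digit
        mu = short_reciprocal(m, n, d)
        errors.add(r // m - estimate_quotient(r, mu, n, d))
    return errors


if __name__ == "__main__":
    print(sorted(observed_errors()))
```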

11
The basic algorithm LRL4

R[n-1…n-3] = a_{na-1}·b_{nb-1}·d + a_{na-1}·b_{nb-2} + a_{na-2}·b_{nb-1}   // Col 1,2
for k = na+nb-4 … n-3                   // Columns to left
    R[n…n-4] += Σ_{i+j=k} a_i·b_j       // Loop-1 to right
    if (overflow) R -= m
    q = 2·(R_{n-1}·µ1·d² + R_{n-1}·µ0·d + R_{n-2}·µ1·d + R_{n-2}·µ0) / d³
    R = (R − q·m)·d                     // Loop-2
for k = 0 … n-4                         // LS digits to left
    R[n…k] += Σ_{i+j=k} a_i·b_j         // Loop-3
while (R_n > 0) R -= m                  // fix overflow

Left-Right-Left (military step) algorithm
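
The idea of interleaving multiplication with reduction, using a quotient digit estimated from a short reciprocal, can be sketched in Python as follows. This is a simplified left-to-right variant on big integers, not LRL4 itself: it omits the column ordering, the fixed-width accumulator and the in-place memory layout, and the function name is illustrative.

```python
def modmul_interleaved(a, b, m, n, d=1 << 16):
    """a*b mod m for a < d^n and normalized m (d^n/2 <= m < d^n).

    Returns (result, number of correction subtractions).
    """
    mu = min(d ** (n + 2) // (2 * m), d * d - 1)  # 2-digit reciprocal
    r = 0
    corrections = 0
    for i in reversed(range(n)):  # digits of a, most significant first
        a_i = (a // d ** i) % d
        r = r * d + a_i * b  # shift and add one row of digit products
        top = r // d ** (n - 1)  # leading digits of the remainder
        q = (2 * top * mu) // d ** 3  # estimate, never above r // m
        r -= q * m
        while r >= m:  # fix an underestimated quotient
            r -= m
            corrections += 1
    return r, corrections


if __name__ == "__main__":
    m = 0x89ABCDEF01234567
    r, c = modmul_interleaved(0x0123456789ABCDEF, 0x7EDCBA9876543210, m, 4)
    print(hex(r), c)
```

Because the estimate is never larger than the true quotient, the remainder stays non-negative and only the final while loop is needed to finish each step.
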
12
Inner Loops (multiply-add)
Σ_{i+j=k} a_i·b_j:

Q = 0                                   // 50-bit accumulator
for k = 0 … n-4
    Q = MS(Q) + r_k
    for j = max(0, k+1-na) … min(k+1, nb)
        Q += a_{k-j}·b_j
    r_k = D0(Q)
for i = n-3 … n                         // storing digits
    Q = MS(Q) + r_i
    r_i = D0(Q)

(R − q·m)·d:

c = 0                                   // 1-digit temp store
Q = 0                                   // 33-bit accumulator
for k = 0 … n-1
    Q = MS(Q) + c − q·m_k
    c = r_k
    r_k = D0(Q)
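
The in-place update R − q·m with a short accumulator can be sketched in Python as below. The sketch assumes D0(Q) means the least significant digit of the accumulator and MS(Q) the remaining upper part, that digits are stored least significant first, and that q·m ≤ R. It leaves out the final shift by one digit, and the names are illustrative.

```python
def to_digits(x, n, d=1 << 16):
    """Split x into n base-d digits, least significant first."""
    return [(x // d ** i) % d for i in range(n)]


def from_digits(digits, d=1 << 16):
    return sum(v * d ** i for i, v in enumerate(digits))


def sub_q_times_m_inplace(r, q, m_digits, d=1 << 16):
    """Overwrite r (n+1 digits) with r - q*m, one digit at a time."""
    acc = 0  # signed accumulator; holds the borrow between digits
    for k, m_k in enumerate(m_digits):
        acc += r[k] - q * m_k
        r[k] = acc % d  # D0: low digit, always in 0..d-1 in Python
        acc //= d  # MS: floor division keeps the (negative) borrow
    r[len(m_digits)] += acc  # apply the last borrow to the top digit
    return r


if __name__ == "__main__":
    m, q, rem = 0x89ABCDEF01234567, 0x1234, 0x1111222233334444
    r = to_digits(q * m + rem, 5)
    print(hex(from_digits(sub_q_times_m_inplace(r, q, to_digits(m, 4)))))
```
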
13
Improvements

Probability of an overflow < n/d
When a, b and m are uniformly random (?)
DSP SW: mod reduction time = 1.0001 n² + 4n
multiply time = 10 additions: 1.00001 n² + 4n
HW assisted: time = n² + 4n
Variants (Accumulator: x_n·d³ + x_{n-1}·d² + …)
LRL4: q = 2(µ1·x_n·d² + (µ1·x_{n-1} + µ0·x_n)·d + µ0·x_{n-1}) / d³
LRL3: q = 2(µ1·x_n·d + (µ1·x_{n-1} + µ0·x_n)) / d²
LRL2: q = (µ1·x_n·d + µ0·x_n) / d 2d, many
corrections
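
The LRL4 and LRL3 estimates can be compared numerically. LRL4 is the full product of the two-digit reciprocal (µ1, µ0) with the two leading accumulator digits (x_n, x_{n-1}); LRL3 drops the least significant term µ0·x_{n-1}. The sketch below assumes truncating (floor) division and uses illustrative function names; under that assumption the two estimates differ by at most one.

```python
import random


def q_lrl4(xn, xn1, mu1, mu0, d=1 << 16):
    """Full four-term estimate."""
    return 2 * (mu1 * xn * d * d + (mu1 * xn1 + mu0 * xn) * d + mu0 * xn1) // d ** 3


def q_lrl3(xn, xn1, mu1, mu0, d=1 << 16):
    """Three-term estimate: the product mu0 * x_{n-1} is dropped."""
    return 2 * (mu1 * xn * d + (mu1 * xn1 + mu0 * xn)) // d ** 2


def max_difference(trials=1000, d=1 << 16, seed=7):
    """Largest observed q_lrl4 - q_lrl3 over random digits."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        xn, xn1, mu0 = (rng.randrange(d) for _ in range(3))
        mu1 = rng.randrange(d // 2, d)  # 2-digit mu has its top bit set
        worst = max(worst, q_lrl4(xn, xn1, mu1, mu0) - q_lrl3(xn, xn1, mu1, mu0))
    return worst


if __name__ == "__main__":
    print(max_difference())
```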

Sequential quotient correction
14
Shorter reciprocal

1 digit → error explosion
1 digit + 2 bits: OK
µ = ½·⌊2d^(n+1) / m⌋ = d + µ0 + δ, with δ = 0 or ½
50-bit accumulator with carry c = 0 or 1
R = c·d³ + x_n·d² + x_{n-1}·d + x_{n-2}
Estimated quotient digit:
q = (R + R·δ/d) / d² + µ0·c + µ0·x_n / d ≈ µ·R / d³
Mod reduction time
SW: 1.25 n² + n (mult = 10 adds: 1.025 n² + n)
HW: n² + n

µ1 = 1
Quotient correction
15
Modulus Scaling

Special m: NO multiplication for quotient digit
Quotient digit: q = r_n + 1
(0F…): MS digit of m = d - 1 = 11…1₂
(10…): MS 2 digits of m = 1, 0
Transform m: 1-digit scaling factor S
mS is (n+1)-digit
Last reduction step is with m → n-digit result
Need to store m and mS
Faster than Montgomery: n² + const
Montgomery with modulus scaling: n² + const
LS digit of m = d - 1 = 11…1₂ (x…F)
Last reduction step is with m → n-digit result
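
Modulus scaling can be illustrated with a hedged Python sketch. The choice S = ⌊(d^(n+1) − 1) / m⌋, which makes the leading digit of m·S equal to d − 1, is an assumption of this example (it can need one bit more than a single digit) and may differ from the scaling used in the paper. With such a modulus the quotient digit is read directly from the top of the remainder, with no multiplication, and a few subtractions correct it.

```python
def scale_modulus(m, n, d=1 << 16):
    """Return (S, m*S) such that the (n+1)-digit m*S has MS digit d-1."""
    s = (d ** (n + 1) - 1) // m
    return s, m * s


def reduce_scaled(x, m, n, d=1 << 16):
    """x mod m, reducing digit by digit with the scaled modulus m*S."""
    _, ms = scale_modulus(m, n, d)
    digits = []
    while x:
        digits.append(x % d)
        x //= d
    r = 0
    for digit in reversed(digits):  # most significant digit first
        r = r * d + digit
        q = r // d ** (n + 1)  # top digit of r: no multiplication needed
        r -= q * ms
        while r >= ms:  # the estimate can be slightly too small
            r -= ms
    return r % m  # last reduction step is with the original m


if __name__ == "__main__":
    m = 0x89ABCDEF01234567
    s, ms = scale_modulus(m, 4)
    print(hex(ms >> 64), hex(reduce_scaled(3 ** 90, m, 4)))
```

Both m and m·S have to be kept, which matches the storage note above.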

16
Summary
