Title: Long Modular Multiplication for Cryptographic Applications
1Long Modular Multiplication for Cryptographic Applications
- Laszlo Hars
- Seagate Research
- Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004, Boston, MA
- Full version of the paper is at http://www.hars.us/Papers/ModMult.pdf
2Outline
- Background (need, algorithms, complexity)
- Target: occasional PK crypto (smartcard, OSD)
- Optimizations
- Hardware architecture
- General purpose, support fast modular reduction
- Speed: multiply operates in parallel with add / load
- Memory: in-place update
- Algorithmic improvements
- Multiply with short Reciprocal (trial division)
- Precision scaling of reciprocals
- Drop insignificant terms
- Modulus scaling
3Modular Multiplication
- $ab \bmod m$: remainder of $(ab) \div m$
- Used in RSA, ECC, ElGamal, Diffie-Hellman, primality tests, BBS-PRNG
- Assume a, b, m are n-digit numbers
- m normalized: $\tfrac12 d^n \le m < d^n$
- Digit size (machine word): 16 bits (8...64)
- n = 64 for RSA-1024 (10...256)
- Squaring: twice faster
- Conserve memory
- Divide after multiply: double-length product
4Modular Multiplication
- Interleaved multiplication and division
- Barrett multiplication
- Multiply with reciprocal ($d^{2n}/m$: extra n digits)
- Quisquater's multiplication
- Scaling the modulus for many MS 1-bits (S: extra n digits storage)
- Montgomery multiplication
- Number representation: $a \to a d^n \bmod m$
- Right-to-left (simple) interleaved division
- Needs pre- and post processing
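The Montgomery entry above can be illustrated with a minimal sketch on Python integers. It is not the digit-serial routine of the talk: the function names are hypothetical, the reduction is done in one step on whole numbers rather than digit by digit, and an odd modulus m is assumed so that the inverse exists.

```python
D = 1 << 16  # digit radix d (16-bit digits, as on the slides)

def montgomery_setup(m, n, d=D):
    """Precompute constants for an odd modulus m < d^n (hypothetical helper)."""
    R = d ** n                       # Montgomery radix d^n
    m_neg_inv = (-pow(m, -1, R)) % R  # -m^(-1) mod d^n, needs Python 3.8+
    return R, m_neg_inv

def mont_mul(x, y, m, R, m_neg_inv):
    """Return x*y*R^(-1) mod m for 0 <= x, y < m."""
    t = x * y
    u = (t * m_neg_inv) % R          # chosen so that t + u*m is divisible by R
    r = (t + u * m) // R             # exact division clears the low n digits
    return r - m if r >= m else r    # single conditional subtraction

def modmul_via_montgomery(a, b, m, n, d=D):
    """Compute a*b mod m, showing the pre- and post-processing steps."""
    R, m_neg_inv = montgomery_setup(m, n, d)
    a_bar = (a * R) % m              # pre-processing: a -> a*d^n mod m
    b_bar = (b * R) % m
    c_bar = mont_mul(a_bar, b_bar, m, R, m_neg_inv)
    return mont_mul(c_bar, 1, m, R, m_neg_inv)  # post-processing: leave the representation
```

In a long computation such as an exponentiation, the conversions are paid once at the start and once at the end, which is why the pre- and post-processing cost is usually acceptable.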
5Sub-Quadratic time algorithms
- Fast multiplications
- Complicated algorithms
- Pays for very long numbers
- Karatsuba: $O(n^{\log_2 3})$, faster if n > 10...30
- Toom-Cook 3-, 4-way: $O(n^{\alpha})$
- 3FT (Finite Field Fourier Transform): $O(n \log n \log\log n)$
- Division: multiplication with reciprocal
- Long reciprocal: $d^{2n}/m$
- Newton iteration: 0.62 multiplication time
- Speed-ups for PKC: www.hars.us/Papers/TruncatedProducts.pdf
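The Karatsuba entry can be illustrated with a minimal sketch on non-negative Python integers. The function name, the bit-based split and the `cutoff_bits` threshold are assumptions made for this illustration; the slides only give the complexity and the crossover range.

```python
def karatsuba(x, y, cutoff_bits=512):
    """Multiply non-negative integers with three half-size products per level."""
    # Below the (assumed) cutoff, fall back to the built-in multiplication,
    # standing in for the quadratic school method.
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y
    h = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask        # x = x1*2^h + x0
    y1, y0 = y >> h, y & mask        # y = y1*2^h + y0
    z2 = karatsuba(x1, y1, cutoff_bits)
    z0 = karatsuba(x0, y0, cutoff_bits)
    # One extra product replaces the two cross products x1*y0 and x0*y1.
    z1 = karatsuba(x0 + x1, y0 + y1, cutoff_bits) - z2 - z0
    return (z2 << (2 * h)) + (z1 << h) + z0
```

The recursion bookkeeping (splitting, extra additions) is the overhead that makes the method pay off only for longer operands.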
6Quadratic time algorithms
- School multiplication: $n^2$ digit-products
- School division: $k n^2$ digit-operations
- Quotient digits estimated with short divisions
- Digit-multiplications + other operations
- Simple structure
- No extra storage when interleaved
- Slower
- Quotient digits with trial-and-error
- Goal: reduce correction steps
7Multiply-Accumulate
- DSP: multiplication in parallel to load / store / add / compare
- Order of the digit-product calculation
- Row-order (use input digits sequentially)
- for $i = 0 \dots n_a-1$, for $j = 0 \dots n_b-1$: $a_i b_j$ ($n_a$, $n_b$: number of digits of a, b)
- More memory access
- Column-order (output digits sequentially)
- for $k = 0 \dots n_a+n_b-2$, for $i, j$ with $i+j = k$: $a_i b_j$
- Longer accumulator (can be split)
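The two orderings can be compared in a short sketch. Digits are stored little-endian in Python lists with radix d = 2^16; the function names and the helper conversions are hypothetical and only serve the illustration.

```python
D = 1 << 16  # digit radix d (16-bit digits)

def to_digits(x, n, d=D):
    """Split a non-negative integer into n little-endian radix-d digits."""
    return [(x // d**i) % d for i in range(n)]

def from_digits(digits, d=D):
    """Inverse of to_digits."""
    return sum(v * d**i for i, v in enumerate(digits))

def mul_row_order(a, b, d=D):
    """Row order: each input digit a[i] is used once, the result is updated repeatedly."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry   # read-modify-write of r[i+j]
            r[i + j] = t % d
            carry = t // d
        r[i + len(b)] = carry                # this position has not been written yet
    return r

def mul_column_order(a, b, d=D):
    """Column order: each output digit is finished before the next one is started."""
    r = []
    acc = 0                                  # long accumulator for a whole column
    for k in range(len(a) + len(b) - 1):
        for j in range(max(0, k + 1 - len(a)), min(k + 1, len(b))):
            acc += a[k - j] * b[j]           # all products with i + j = k
        r.append(acc % d)                    # store the finished digit once
        acc //= d                            # keep the carry for the next column
    r.append(acc)
    return r
```

Row order touches every result digit many times, while column order writes each result digit exactly once at the price of an accumulator wide enough to hold a full column sum.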
8HW Architecture
- General purpose µP with enhancements
- Circuit utilization: multi-use
- DSP structure: multiplication in parallel with other operations
- Multiplier is large and slow
- Long accumulator
- Split adder / counter
- In-Accumulator instructions
- Quotient-digit correction circuit
- Updateable memory circular offset write
9HW Architecture
- 16-bit digits
- Shift-add 17.5-bit mult
- In-Accumulator
- Shift
- Add
10Quotient Digits
- No need to store q
- q ← multiplication with short reciprocal µ
- µ is used many times
- µ ← Newton iteration, look-up table
- All bits - 2 MS digits and 1 bit: error 0 or 1 (-1)
- More than 1-digit reciprocal: quotient often OK
- Most economical: $\mu = d^{n+2}/(2m) = \{\mu_1, \mu_0\}$; scale 2m, making µ exact 2-digit
- Special case: $m = \tfrac12 d^n \Rightarrow \mu = d^2 - 1$
- Usable: $\mu = d^{n+1.5}/m$, $\mu = 2d^{n+1}/m$
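As a rough illustration of the 2-digit reciprocal, the sketch below computes µ with integer (floor) division and uses it to estimate one quotient digit from the two most significant digits of the partial remainder. The function names, the floor rounding and the correction loop are assumptions of this sketch; the remainder R is assumed to satisfy R < d·m so that the quotient is a single digit.

```python
D = 1 << 16  # digit radix d (16-bit digits)

def short_reciprocal(m, n, d=D):
    """2-digit reciprocal mu ~ d^(n+2) / (2m) for a normalized n-digit modulus."""
    assert d**n // 2 <= m < d**n          # m normalized: d^n/2 <= m < d^n
    mu = d**(n + 2) // (2 * m)
    return min(mu, d * d - 1)             # special case m = d^n/2 gives d^2 - 1

def quotient_digit(R, m, mu, n, d=D):
    """Estimate q = R // m from the two MS digits of R, then correct it."""
    x = R // d**(n - 1)                   # two MS digits of the (n+1)-digit R
    q = (2 * mu * x) // d**3              # short multiplication, no long division
    R -= q * m
    while R >= m:                         # truncation only makes q too small
        q += 1
        R -= m
    return q, R
```

Both µ and x are truncated, so the estimate never exceeds the true quotient digit, and in this sketch the correction loop runs at most once.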
11The basic algorithm LRL4
R[n-1..n-3] = a[na-1] b[nb-1] d + a[na-1] b[nb-2] + a[na-2] b[nb-1]   // Col 1,2
for k = na+nb-4 ... n-3                     // Columns to left
    R[n..n-4] += Σ_{i+j=k} a[i] b[j]        // Loop-1 to right
    if (overflow) R -= m
    q = 2 (R[n-1] µ1 d^2 + R[n-1] µ0 d + R[n-2] µ1 d + R[n-2] µ0) / d^3
    R = (R - q m) d                         // Loop-2
for k = 0 ... n-4                           // LS digits to left
    R[n..k] += Σ_{i+j=k} a[i] b[j]          // Loop-3
while (R[n] > 0) R -= m                     // fix overflow
Left-Right-Left (military step) algorithm
12Inner Loops (multiply-add)
Inner loop for Σ_{i+j=k} a[i] b[j]:
    Q = 0                                   // 50-bit accumulator
    for k = 0 ... n-4
        Q = MS(Q) + r[k]
        for max(0, k+1-na) <= j < min(k+1, nb)
            Q += a[k-j] b[j]
        r[k] = D0(Q)
    for i = n-3 ... n                       // storing digits
        Q = MS(Q) + r[i]
        r[i] = D0(Q)
Inner loop for (R - q m) d:
    c = 0                                   // 1-digit temp store
    Q = 0                                   // 33-bit accumulator
    for k = 0 ... n-1
        Q = MS(Q) + c - q m[k]
        c = r[k]
        r[k] = D0(Q)
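The second inner loop (shift the remainder by one digit and subtract q·m in place) can be sketched as follows. The signs lost in the slide text are read here as R ← R·d − q·m, which is an assumption; the function and helper names are hypothetical, and Python's floor division plays the role of MS(Q) on a signed accumulator.

```python
D = 1 << 16  # digit radix d (16-bit digits)

def to_digits(x, n, d=D):
    """Split a non-negative integer into n little-endian radix-d digits."""
    return [(x // d**i) % d for i in range(n)]

def from_digits(digits, d=D):
    """Inverse of to_digits."""
    return sum(v * d**i for i, v in enumerate(digits))

def shift_subtract_inplace(r, m, q, d=D):
    """Update r <- r*d - q*m in place (little-endian digits); return the carry-out."""
    c = 0                                  # 1-digit temp: previous digit of r
    Q = 0                                  # signed accumulator
    for k in range(len(r)):
        m_k = m[k] if k < len(m) else 0    # m is treated as zero-extended
        Q = (Q // d) + c - q * m_k         # MS(Q) + c - q*m_k
        c = r[k]                           # save the old digit before overwriting
        r[k] = Q % d                       # D0(Q): least significant digit
    return (Q // d) + c                    # zero when the result fits in len(r) digits
```

Only one digit of temporary storage is needed because each old digit is read exactly once, just before its slot is overwritten with the shifted and reduced value.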
13Improvements
- Probability of an overflow < n/d
- When a, b and m are uniform random (?)
- DSP SW: mod reduction time $1.0001 n^2 + 4n$
- multiply time = 10 additions: $1.00001 n^2 + 4n$
- HW assisted: time $n^2 + 4n$
- Variants (accumulator: $x_n d^3 + x_{n-1} d^2 + \dots$)
- LRL4: $q = 2(\mu_1 x_n d^2 + (\mu_1 x_{n-1} + \mu_0 x_n) d + \mu_0 x_{n-1}) / d^3$
- LRL3: $q = 2(\mu_1 x_n d + (\mu_1 x_{n-1} + \mu_0 x_n)) / d^2$
- LRL2: $q = (\mu_1 x_n d + \mu_0 x_n)$ / d 2d, many corrections
[Figure: sequential quotient correction]
14Shorter reciprocal
- 1 digit → error explosion
- 1 digit + 2 bits: OK
- $\mu = \tfrac12 \lfloor 2 d^{n+1}/m \rfloor = d + \mu_0 + \delta$, with $\delta = 0$ or $\tfrac12$
- 50-bit accumulator with carry c = 0 or 1
- $R = c d^3 + x_n d^2 + x_{n-1} d + x_{n-2}$
- Estimated quotient digit:
- $q = (R + R\delta/d)/d^2 + \mu_0 c + \mu_0 x_n/d \approx \mu R/d^3$
- Mod reduction time
- SW: $1.25 n^2 + n$ (mult = 10 adds: $1.025 n^2 + n$)
- HW: $n^2 + n$
[Figure: quotient correction, $\mu_1 = 1$]
15Modulus Scaling
- Special m: NO multiplication for quotient digit
- Quotient digit q rn 1
- (0F): MS digit of m = d-1 = 11...1₂
- (10): MS 2 digits of m = 1, 0
- Transform m: 1-digit scaling factor S
- mS is (n+1)-digit
- Last reduction step is with m → n-digit result
- Need to store m and mS
- Faster than Montgomery: $n^2 + \text{const}$
- Montgomery with modulus scaling: $n^2 + \text{const}$
- LS digit of m = d-1 = 11...1₂ (xF)
- Last reduction step is with m → n-digit result
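The idea of reading the quotient digit directly from the remainder can be sketched on Python integers. Everything below is an illustration under stated assumptions: the function names are hypothetical, and the scaling factor is chosen as S = ⌊d^(n+1)/m⌋, which is simply the largest S keeping mS below d^(n+1). For a normalized m that choice can be one bit wider than a digit, whereas the slides call for a 1-digit S, so the selection of S here should not be taken as the method of the talk.

```python
D = 1 << 16  # digit radix d (16-bit digits)

def scale_modulus(m, n, d=D):
    """Return (S, m*S) with the MS digit of m*S equal to d-1 (assumed choice of S)."""
    S = d**(n + 1) // m
    return S, m * S

def reduce_scaled(x, m, n, d=D):
    """Compute x mod m, reducing digit-serially with the scaled modulus mS."""
    S, M = scale_modulus(m, n, d)
    assert M // d**n == d - 1      # MS digit of mS is all ones (m must not divide d^(n+1))
    digits = []
    while x:                       # little-endian radix-d digits of x
        digits.append(x % d)
        x //= d
    R = 0
    for dig in reversed(digits):   # most significant digit first
        R = R * d + dig            # shift in the next digit, R < d*M
        q = R // d**(n + 1)        # quotient digit = MS digit of R, no multiplication
        R -= q * M
        if R >= M:                 # the estimate is at most one too small
            R -= M
    return R % m                   # last reduction step is with the original m
```

Because mS is just below d^(n+1), dividing by mS and dividing by d^(n+1) give nearly the same digit, which is why a single conditional subtraction is enough per step in this sketch.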
16Summary