Long Modular Multiplication for Cryptographic Applications


1
Long Modular Multiplication for Cryptographic
Applications
  • Laszlo Hars
  • Seagate Research
  • Workshop on Cryptographic Hardware and Embedded
    Systems, CHES 2004, Boston, MA
  • Full version of the paper is at
    http://www.hars.us/Papers/ModMult.pdf

2
Outline
  • Background (need, algorithms, complexity)
  • Target: occasional PK crypto (smartcard, OSD)
  • Optimizations
  • Hardware architecture
  • General purpose, support fast modular reduction
  • Speed: parallel operation of multiply and add /
    load
  • Memory: in-place update
  • Algorithmic improvements
  • Multiply with short Reciprocal (trial division)
  • Precision scaling of reciprocals
  • Drop insignificant terms
  • Modulus scaling

3
Modular Multiplication
  • ab mod m: remainder of (ab) ÷ m
  • Used in RSA, ECC, ElGamal, Diffie-Hellman,
    primality tests, BBS-PRNG
  • Assume a, b, m are n-digit numbers
  • m normalized: ½ d^n ≤ m < d^n
  • Digit size (machine word): 16 bits (8…64)
  • n = 64 for RSA-1024 (10…256)
  • Squaring is twice as fast
  • Conserve memory
  • Divide after multiply: double-length product
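
The digit conventions on this slide can be illustrated with a minimal Python sketch. The helper names (`to_digits`, `from_digits`, `is_normalized`) and the example modulus are assumptions made for illustration; they do not come from the presentation.

```python
DIGIT_BITS = 16                 # digit size taken from the slide
D = 1 << DIGIT_BITS             # radix d = 2^16

def to_digits(x, n):
    """Split x into n digits, least significant first."""
    return [(x >> (DIGIT_BITS * i)) & (D - 1) for i in range(n)]

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(dg << (DIGIT_BITS * i) for i, dg in enumerate(digits))

def is_normalized(m, n):
    """True when d^n / 2 <= m < d^n (top bit of the top digit is set)."""
    return D**n // 2 <= m < D**n

n = 1024 // DIGIT_BITS          # 64 digits for a 1024-bit modulus
m = (1 << 1023) | 12345         # hypothetical normalized modulus
a, b = 3**600 % m, 5**400 % m
product = a * b                 # "divide after multiply": up to 2n digits
r = product % m                 # remainder, n digits again
```

The double-length `product` is what the interleaved methods on the following slides avoid storing.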

4
Modular Multiplication
  • Interleaved multiplication and division
  • Barrett multiplication
  • Multiply with reciprocal (d^(2n)/m: extra n
    digits)
  • Quisquater's multiplication
  • Scaling the modulus for many MS 1-bits (S extra
    n digits storage)
  • Montgomery multiplication
  • Number representation: a → a·d^n mod m
  • Right-to-left (simple) interleaved division
  • Needs pre- and post processing
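
As an illustration of the Montgomery bullets, here is a minimal digit-serial sketch in Python. It is a textbook formulation, not necessarily the variant used in the talk; the function names are hypothetical, and `pow(-m, -1, D)` requires Python 3.8 or later.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def mont_mul(a, b, m, n):
    """Return a*b*d^(-n) mod m for odd m < d^n and a, b < m."""
    m_inv = pow(-m, -1, D)                  # -m^(-1) mod d
    r = 0
    for i in range(n):                      # right to left over the digits of a
        a_i = (a >> (DIGIT_BITS * i)) & (D - 1)
        r += a_i * b
        q = ((r & (D - 1)) * m_inv) & (D - 1)   # makes r + q*m divisible by d
        r = (r + q * m) >> DIGIT_BITS       # exact division by d
    return r - m if r >= m else r

def to_mont(a, m, n):
    """Pre-processing: a -> a*d^n mod m."""
    return (a << (DIGIT_BITS * n)) % m

def from_mont(x, m, n):
    """Post-processing: multiply by 1 to strip the factor d^n."""
    return mont_mul(x, 1, m, n)
```

The quotient digit `q` depends only on the least significant digit of `r`, which is why the division runs right to left and needs no trial-and-error correction inside the loop.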

5
Sub-Quadratic time algorithms
  • Fast multiplications
  • Complicated algorithms
  • Pays for very long numbers
  • Karatsuba: O(n^(log2 3)), faster if n > 10…30
  • Toom-Cook 3-, 4-way: O(n^α)
  • 3FT (Finite Field Fourier Transform):
    O(n log n log log n)
  • Division: multiplication with reciprocal
  • Long reciprocal: d^(2n)/m
  • Newton iteration: 0.62 multiplication time
  • Speed-ups for PKC: www.hars.us/Papers/Truncated
    Products.pdf
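
The long reciprocal d^(2n)/m can be computed by Newton iteration. The sketch below uses Python big integers and full-precision steps, so it shows only the iteration itself, not the precision scaling that makes it fast; the function name and the final fix-up loops are assumptions added for illustration.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def long_reciprocal(m, n):
    """Return floor(d^(2n) / m) for a normalized n-digit m."""
    N = D ** (2 * n)
    x = D ** n                          # start value, below N/m by at most a factor 2
    while True:
        x_new = 2 * x - (m * x * x) // N    # Newton step for the root of 1/x - m/N
        if x_new <= x:                  # no further progress
            break
        x = x_new
    while m * x > N:                    # fix-up: the iteration may overshoot by one
        x -= 1
    while m * (x + 1) <= N:
        x += 1
    return x
```

Each step roughly doubles the number of correct digits, so only a handful of iterations are needed even for long moduli.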

6
Quadratic time algorithms
  • School multiplication: n² digit products
  • School division: k·n² digit operations
  • Quotient digits estimated with short divisions
  • Digit-Multiplications other operations
  • Simple structure
  • No extra storage when interleaved
  • Slower
  • Quotient digits with trial-and-error
  • Goal: reduce correction steps
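
School division with trial-and-error quotient digits can be sketched as follows. This is the classical estimate (top two digits of the remainder divided by the top digit of m), written with Python big integers; the function name and the correction counter are assumptions for illustration.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def school_divmod(x, m, n):
    """Divide x by a normalized n-digit m, one quotient digit at a time.
    Returns (quotient, remainder, number of correction steps)."""
    m_top = m >> (DIGIT_BITS * (n - 1))             # MS digit of m
    x_len = max(1, -(-x.bit_length() // DIGIT_BITS))
    quotient, r, corrections = 0, 0, 0
    for i in reversed(range(x_len)):                # bring down digits, MS first
        r = r * D + ((x >> (DIGIT_BITS * i)) & (D - 1))
        # short division: top two digits of r by the top digit of m
        q = min(D - 1, (r >> (DIGIT_BITS * (n - 1))) // m_top)
        while q * m > r:                            # estimate too high: correct it
            q -= 1
            corrections += 1
        r -= q * m
        quotient = quotient * D + q
    return quotient, r, corrections
```

With a normalized m this estimate is never too low, so the correction loop only ever decreases `q`; the methods on the later slides aim to make such corrections rare.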

7
Multiply-Accumulate
  • DSP: multiplication parallel to load / store /
    add / compare
  • Order of the digit-product calculation
  • Row-order (use input digits sequentially)
  • for i = 0 … a-1: for j = 0 … b-1: a_i·b_j
  • More memory access
  • Column-order (output digits sequentially)
  • for k = 0 … a+b-2: for i, j with i+j = k: a_i·b_j
  • Longer accumulator (can be split)
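
The two orderings can be compared in a short Python sketch on digit lists (least significant digit first). The function names are hypothetical; the point is where the result digits are read and written.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def mul_row_order(a, b):
    """Row order: use each digit of a in turn, updating r repeatedly."""
    r = [0] * (len(a) + len(b))
    for i, a_i in enumerate(a):
        carry = 0
        for j, b_j in enumerate(b):
            t = r[i + j] + a_i * b_j + carry    # read-modify-write of r[i+j]
            r[i + j] = t & (D - 1)
            carry = t >> DIGIT_BITS
        r[i + len(b)] = carry
    return r

def mul_column_order(a, b):
    """Column order: finish one output digit per pass in a long accumulator."""
    r = [0] * (len(a) + len(b))
    acc = 0
    for k in range(len(a) + len(b) - 1):
        for i in range(max(0, k - len(b) + 1), min(k, len(a) - 1) + 1):
            acc += a[i] * b[k - i]              # all products with i + j = k
        r[k] = acc & (D - 1)                    # each output digit written once
        acc >>= DIGIT_BITS
    r[-1] = acc
    return r
```

In row order every partial result digit is loaded and stored once per row; in column order the accumulator must hold a sum of up to min(a, b) double-digit products, which is why it has to be longer.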

8
HW Architecture
  • General purpose µP with enhancements
  • Circuit utilization: multi-use
  • DSP structure: multiplication in parallel with others
  • Multiplier is large and slow
  • Long accumulator
  • Split adder / counter
  • In-Accumulator instructions
  • Quotient-digit correction circuit
  • Updateable memory: circular offset write

9
HW Architecture
  • 16-bit digits
  • Shift-add 17.5-bit mult
  • In-Accumulator
  • Shift
  • Add

10
Quotient Digits
  • No need to store q
  • q ← multiplication with short reciprocal µ
  • µ is used many times
  • µ ← Newton iteration, look-up table
  • All bits - 2 MS digits and 1 bit error 0 or 1
    (-1)
  • More than 1-digit reciprocal: quotient often OK
  • Most economical: µ = d^(n+2) / 2m = (µ1, µ0);
    scale 2m, making µ exact 2-digit
  • Special case: m = ½ d^n ⇒ µ = d² − 1
  • Usable: µ = d^(n+1.5) / m, µ = 2d^(n+1) / m
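
A minimal Python sketch of a quotient-digit estimate with the 2-digit reciprocal µ = d^(n+2) / 2m. Taking the integer part of µ and using only the two most significant digits of the remainder are assumptions of this sketch; the function names are hypothetical.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def short_reciprocal(m, n):
    """2-digit mu = floor(d^(n+2) / (2m)); d^2 - 1 in the special case m = d^n / 2."""
    return min(D**(n + 2) // (2 * m), D * D - 1)

def estimate_q(r, mu, n):
    """Estimate r // m for r < m*d from the two MS digits of r."""
    x = r >> (DIGIT_BITS * (n - 1))             # x_n * d + x_(n-1)
    return (2 * mu * x) >> (3 * DIGIT_BITS)     # 2 * mu * x / d^3
```

Because both µ and the truncated remainder are rounded down, this estimate is never above the true quotient digit; with 16-bit digits it falls short by at most one.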

11
The basic algorithm LRL4
R[n-1…n-3] = a[na-1]·b[nb-1]·d + a[na-1]·b[nb-2] + a[na-2]·b[nb-1]   // Col 1,2
for k = na+nb-4 … n-3                     // Columns to left
    R[n…n-4] += Σ(i+j=k) a[i]·b[j]        // Loop-1 to right
    if (overflow) R -= m
    q = 2·(R[n-1]·µ1·d² + R[n-1]·µ0·d + R[n-2]·µ1·d + R[n-2]·µ0) / d³
    R = (R − q·m)·d                       // Loop-2
for k = 0 … n-4                           // LS digits to left
    R[n…k] += Σ(i+j=k) a[i]·b[j]          // Loop-3
while (R[n] > 0) R -= m                   // fix overflow
Left-Right-Left (military step) algorithm
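
The interplay of multiply steps, quotient-digit estimation and reduction can be sketched in Python. This is a simplified row-wise interleaving on big integers, not the LRL4 column schedule of the slide; it borrows only the 2-digit reciprocal µ, and the function name and correction counter are assumptions.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def mod_mul_interleaved(a, b, m, n):
    """a*b mod m for a normalized n-digit m and a, b < m.
    Returns (result, number of correction subtractions)."""
    mu = min(D**(n + 2) // (2 * m), D * D - 1)      # short reciprocal, 2 digits
    r, corrections = 0, 0
    for i in reversed(range(n)):                    # MS digit of a first
        a_i = (a >> (DIGIT_BITS * i)) & (D - 1)
        r = r * D + a_i * b                         # multiply step
        x = r >> (DIGIT_BITS * (n - 1))             # MS digits of r
        q = (2 * mu * x) >> (3 * DIGIT_BITS)        # estimate, never above r // m
        r -= q * m                                  # divide step
        while r >= m:                               # rare correction
            r -= m
            corrections += 1
    return r, corrections
```

The partial result never grows beyond n + 2 digits, so no double-length product is stored.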
12
Inner Loops (multiply-add)
Q = 0                                     // 50-bit accumulator
for k = 0 … n-4
    Q = MS(Q) + r[k]
    for j = max(0, k+1-na) … min(k+1, nb)
        Q += a[k-j]·b[j]
    r[k] = D0(Q)
for i = n-3 … n                           // storing digits
    Q = MS(Q) + r[i]
    r[i] = D0(Q)
Σ(i+j=k) a[i]·b[j]
  • c = 0 // 1-digit temp store
  • Q = 0 // 33-bit accumulator
  • for k = 0 … n-1
  • Q = MS(Q) + c − q·m[k]
  • c = r[k]
  • r[k] = D0(Q)

(R − q·m)·d
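
The in-place update R ← (R − q·m)·d can be sketched on digit lists as below. The loop is an illustrative arrangement with a one-digit temporary, not a transcription of the slide's loop; it assumes 0 ≤ R − q·m < d^n, and the function name is hypothetical.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def sub_mul_shift(r, q, m):
    """In place: r <- (r - q*m) * d.
    r has n+1 digits and m has n digits, least significant first.
    Returns the digit shifted out at the top (0 when 0 <= r - q*m < d^n)."""
    n = len(m)
    borrow = 0                      # signed carry between digit positions
    pending = 0                     # digit waiting to be stored one position higher
    for k in range(n):
        acc = r[k] - q * m[k] + borrow      # short signed accumulator
        borrow = acc >> DIGIT_BITS          # floor shift, also for negative acc
        r[k], pending = pending, acc & (D - 1)
    top = r[n] + borrow
    r[n] = pending
    return top
```

Only one extra digit of temporary storage is needed, because each result digit is written to the position that was read one step earlier.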
13
Improvements
  • Probability of an overflow < n/d.
  • When a, b and m uniform random (?)
  • DSP SW mod reduction time: 1.0001 n² + 4n
  • multiply time = 10 additions: 1.000 01 n² + 4n
  • HW assisted time: n² + 4n
  • Variants (Accumulator: x_n d³ + x_(n-1) d² + …)
  • LRL4: q = 2(µ1 x_n d² + (µ1 x_(n-1) + µ0 x_n) d + µ0
    x_(n-1)) / d³
  • LRL3: q = 2(µ1 x_n d + (µ1 x_(n-1) + µ0 x_n)) / d²
  • LRL2: q = (µ1 x_n d + µ0 x_n) / d 2d, many
    corrections

Sequential quotient correction
14
Shorter reciprocal
  • 1 digit → error explosion
  • 1 digit + 2 bits: OK
  • µ = ½ ⌊2d^(n+1) / m⌋ = d + µ0 + δ, with δ = 0 or
    ½
  • 50-bit accumulator with carry c = 0 or 1
  • R = c d³ + x_n d² + x_(n-1) d + x_(n-2)
  • Estimated quotient-digit
  • q = (R + R δ / d) / d² + µ0 c + µ0 x_n / d
    ≈ µR / d³
  • Mod reduction time
  • SW: 1.25 n² + n (mult = 10 adds: 1.025 n² + n)
  • HW: n² + n
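
The decomposition µ = d + µ0 + δ and the resulting quotient estimate can be checked with a small Python sketch. It stores t = 2µ as an integer to avoid fractions; the clamp for m = ½ d^n, the restriction to remainders below m·d (so the carry c is 0), and the function names are assumptions of this sketch.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def shorter_reciprocal(m, n):
    """Return (t, mu0, delta2) with t = floor(2 d^(n+1) / m) = 2 * (d + mu0) + delta2,
    so that mu = t / 2 = d + mu0 + delta, delta = delta2 / 2."""
    t = min(2 * D**(n + 1) // m, 4 * D - 1)     # clamp the special case m = d^n / 2
    mu0 = (t >> 1) - D                          # the one stored digit
    delta2 = t & 1                              # the extra half bit
    return t, mu0, delta2

def estimate_q(r_full, t, n):
    """Estimate r_full // m for r_full < m*d from its top digits (n >= 2)."""
    R = r_full >> (DIGIT_BITS * (n - 2))        # c d^3 + x_n d^2 + x_(n-1) d + x_(n-2)
    return (t * R) >> (3 * DIGIT_BITS + 1)      # mu * R / d^3 with mu = t / 2
```

The leading digit of µ is always 1 for a normalized m, so only µ0 and the half bit δ need to be stored, and multiplying by the leading digit reduces to an addition.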

µ1 = 1
Quotient correction
15
Modulus Scaling
  • Special m: NO multiplication for quotient-digit
  • Quotient digit q rn 1
  • (0F) MS digit of m = d − 1 = 11…1₂
  • (10) MS 2 digits of m = 1, 0
  • Transform m: 1-digit scaling factor S
  • mS is (n+1)-digit
  • Last reduction step is with m → n-digit result
  • Need to store m and mS
  • Faster than Montgomery n2 const
  • Montgomery with modulus scaling n2 const
  • LS digit of m = d − 1 = 11…1₂ (xF)
  • Last reduction step is with m → n-digit result
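
Modulus scaling can be sketched in Python as follows. The way S is chosen here is one possible construction, assumed for illustration; for a normalized n-digit m it can need one bit more than a single digit. The quotient digit is read directly off the top digit of the running remainder, and the final step reduces with the original m.

```python
DIGIT_BITS = 16
D = 1 << DIGIT_BITS

def scale_modulus(m, n):
    """Pick S so that m*S has n+1 digits with MS digit d - 1 (assumed construction)."""
    S = (D**(n + 1) - 1) // m
    return S, m * S

def reduce_with_scaled_modulus(x, m, n):
    """Return (x mod m, number of correction subtractions)."""
    S, mS = scale_modulus(m, n)
    x_len = max(1, -(-x.bit_length() // DIGIT_BITS))
    r, corrections = 0, 0
    for i in reversed(range(x_len)):            # MS digit of x first
        r = r * D + ((x >> (DIGIT_BITS * i)) & (D - 1))
        q = r >> (DIGIT_BITS * (n + 1))         # quotient digit: no multiplication
        r -= q * mS
        while r >= mS:                          # correction steps
            r -= mS
            corrections += 1
    return r % m, corrections                   # last step: reduce with m itself
```

Since m divides mS, reducing modulo mS first and modulo m last gives the same remainder as reducing modulo m throughout; the price is storing both m and mS.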

16
Summary