Title: Matt Henricksen
1Fast Implementation of Symmetric Ciphers
- Matt Henricksen
- Information Security Institute, Queensland
University of Technology
2Introduction
- Dragon: A Fast Word Based Stream Cipher.
- Kevin Chen, Matt Henricksen, Bill Millan, Joanne
Fuller, Leonie Simpson, Ed Dawson, H. Lee, S.
Moon.
- Why we need fast symmetric ciphers
- Ten steps to designing fast symmetric ciphers
- Some conclusions
3Why do we care about speed?
- Benchmark cipher: AES
- New ciphers must be:
- ultra-efficient (multi-gigabit per second) or
- efficient in constrained devices or
- demonstrably more secure than AES
Steve Babbage. Stream ciphers: what does
industry want? http://www.ecrypt.eu.org/stvl/sasc/slides21.pdf
4ECRYPT NoE eSTREAM
- 2005 call for stream cipher primitives
slow broken IP
http://www.ecrypt.eu.org/stream/perf/results
5eSTREAM Candidate Statements about Software
Implementation
6Low Risk, High Reward Guidelines
Cipher efficiency depends not only on the number
and type of operations but also on the match
between design and architecture
7Timing Symmetric Ciphers
- cpuid
- rdtsc
- mov subtime, eax
- cpuid
- rdtsc
- sub eax, subtime
- mov subtime, eax
- cpuid
- rdtsc
- mov subtime, eax
- cpuid
- rdtsc
- sub eax, subtime
- mov subtime, eax
- cpuid
- rdtsc
- mov subtime, eax
- cpuid
- rdtsc
- sub eax, subtime
- mov subtime, eax
- cpuid
- rdtsc
- mov time, eax
- Do operation
- cpuid
- rdtsc
- sub eax, time
- mov time, eax
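The measurement sequence above can be sketched in C. This is a minimal sketch, not the code used in the talk. It assumes GCC or Clang on x86, where `__cpuid` comes from `<cpuid.h>` and `__rdtsc` from `<x86intrin.h>`; on other targets it falls back to `clock_gettime`, which counts nanoseconds rather than cycles. The names `read_ticks`, `timing_overhead`, `time_operation` and `count_call` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime in the fallback path */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#include <x86intrin.h>
/* cpuid serialises the pipeline so that rdtsc is not reordered
   around the code under test. */
static uint64_t read_ticks(void) {
    unsigned int a, b, c, d;
    __cpuid(0, a, b, c, d);
    (void)a; (void)b; (void)c; (void)d;
    return __rdtsc();
}
#else
/* Fallback (assumption): monotonic clock in nanoseconds, not cycles. */
static uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*op_fn)(void *arg);

/* Cost of an empty measurement (the "subtime" of the slide).
   The minimum over several runs filters out cache and interrupt noise. */
static uint64_t timing_overhead(int reps) {
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

/* Ticks for one call of op, with the measurement overhead subtracted. */
static uint64_t time_operation(op_fn op, void *arg, int reps) {
    assert(reps > 0);
    uint64_t overhead = timing_overhead(reps);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = read_ticks();
        op(arg);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best > overhead ? best - overhead : 0;
}

/* Hypothetical operation under test: counts how often it was called. */
static void count_call(void *arg) { *(int *)arg += 1; }
```

Calling `time_operation(count_call, &calls, 5)` runs the operation five times and reports the fastest run. Repeating the empty measurement first plays the same role as the warm-up passes on the slide.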
8Timing Symmetric Ciphers
953.3
10Intel Pentium 4 Architecture
Execution Engine
L2 Cache
Front-end
Execution Core
Ports
Fast/Normal Integer
Memory Load
Memory Store
Retirement Unit
11Registers
- Register pressure
- seven registers, more variables
- invisible overhead
- PUSH register
- MOV register, eighth variable
- MOV eighth variable, register
- POP register
12Phelix
Doug Whiting, Bruce Schneier, Stefan Lucks and
Frédéric Muller. Phelix: Fast Encryption and
Authentication in a Single Cryptographic
Primitive
13Phelix
14Mir-1 (Loop State Update)
15Large States and Small Updates
Dragon Stream Cipher
17L2 Cache
Execution Engine
Front-end
Execution Core
Retirement Unit
18FSRs
S0
S4
x
(x-4) mod l
S4
S0
19Unrolling FSRs
- Advantages
- no need for index
- loop unrolling benefit
- no bound checking
- Disadvantages
- increased code footprint
- reduced flexibility
- possible overhead
- reduced applicability
- cache penalties
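The trade-off listed above can be made concrete with a toy register. This is a minimal sketch under stated assumptions: the register length `FSR_LEN`, the taps (stages 0, 4 and 7) and the `feedback` function are hypothetical and belong to no real cipher. It shows three equivalent ways to clock a word-based FSR: shifting every stage, sliding an index modulo the length, and unrolling so that every index is a compile-time constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FSR_LEN 8  /* hypothetical register length */

static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << (r & 31)) | (x >> ((32 - r) & 31));
}

/* Hypothetical feedback function over stages 0, 4 and 7. */
static uint32_t feedback(uint32_t s0, uint32_t s4, uint32_t s7) {
    return (s0 + rotl32(s4, 5)) ^ s7;
}

/* (a) Naive: physically move every stage on each clock. */
static uint32_t step_shift(uint32_t s[FSR_LEN]) {
    uint32_t out = s[0];
    uint32_t fb = feedback(s[0], s[4], s[7]);
    memmove(&s[0], &s[1], (FSR_LEN - 1) * sizeof s[0]);
    s[FSR_LEN - 1] = fb;
    return out;
}

/* (b) Sliding index: the data stays put, the head moves modulo FSR_LEN. */
typedef struct {
    uint32_t buf[FSR_LEN];
    unsigned head;
} fsr_ring;

static uint32_t step_ring(fsr_ring *r) {
    unsigned h = r->head;
    uint32_t out = r->buf[h];
    /* The slot of the outgoing stage is reused for the new last stage. */
    r->buf[h] = feedback(out, r->buf[(h + 4) % FSR_LEN], r->buf[(h + 7) % FSR_LEN]);
    r->head = (h + 1) % FSR_LEN;
    return out;
}

/* (c) Unrolled: FSR_LEN clocks per call, all indices are constants,
   so no head variable and no run-time modulo are needed. */
#define STEP(k)                                                        \
    do {                                                               \
        out[k] = b[k];                                                 \
        b[k] = feedback(b[k], b[((k) + 4) % FSR_LEN],                  \
                        b[((k) + 7) % FSR_LEN]);                       \
    } while (0)

static void step_unrolled8(uint32_t b[FSR_LEN], uint32_t out[FSR_LEN]) {
    STEP(0); STEP(1); STEP(2); STEP(3);
    STEP(4); STEP(5); STEP(6); STEP(7);
}
```

The unrolled variant can only be called in multiples of `FSR_LEN` clocks, which is one form of the reduced flexibility noted in the list, and its code size grows with the register length.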
20Py
P: 256 byte array
Pi: 1 word pointer into P
s: 1 word memory
Y: 260 word array (1040 bytes)
Yi: 1 word pointer into Y
Eli Biham and Jennifer Seberry. Py: A Fast and
Secure Stream Cipher using Rolling Arrays
21Py Strategy
F
default strategy: allocate 4000 stages (32
kilobytes)
- Advantages
- 25% increase in speed
- loop unrolling benefit
- no bound checking
- Disadvantages
- increased code footprint (1500)
- only partially unrolled
- extensive copy retained
- reduced flexibility
- reduced applicability
- cache penalties
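The rolling-array idea behind this strategy can be sketched generically. This is a minimal sketch, not Py's actual state update: `WINDOW`, `CAPACITY` and the `rolling_*` names are hypothetical. The window slides through an over-allocated buffer, so a step needs no modular index; the price is a copy back to the start each time the window reaches the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WINDOW   4    /* hypothetical number of live stages */
#define CAPACITY 16   /* hypothetical allocation, larger than WINDOW */

typedef struct {
    uint32_t buf[CAPACITY];
    size_t start;     /* index of the oldest live stage */
} rolling;

static void rolling_init(rolling *r, const uint32_t init[WINDOW]) {
    memcpy(r->buf, init, WINDOW * sizeof init[0]);
    r->start = 0;
}

/* Stage i of the window, 0 = oldest. */
static uint32_t rolling_at(const rolling *r, size_t i) {
    assert(i < WINDOW);
    return r->buf[r->start + i];
}

/* Drop the oldest stage, append v as the newest, return the dropped value. */
static uint32_t rolling_push(rolling *r, uint32_t v) {
    if (r->start + WINDOW == CAPACITY) {
        /* Window has reached the end: copy it back to the front.
           This is the copy that a larger allocation makes rarer. */
        memmove(r->buf, r->buf + r->start, WINDOW * sizeof r->buf[0]);
        r->start = 0;
    }
    uint32_t oldest = r->buf[r->start];
    r->buf[r->start + WINDOW] = v;
    r->start++;
    return oldest;
}
```

With these sizes the copy happens once every `CAPACITY - WINDOW` pushes; allocating more stages lowers the copy frequency at the cost of memory and cache footprint.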
22Merkle-Damgård Construction
(128 bits)
23Iterated Halving
Praveen S.S. Gauravaram, Lauren May and William
L. Millan. CRUSH: A New Cryptographic Hash
Function using Iterative Halving Technique
24Execution Engine
Port 0
Port 3
Port 2
Port 1
Memory Store
ALU x2
Memory Load
ALU x2
Int x1
MMX x½
Move x1
+, -, ?, AND, OR, NOT, If-then, Store
+, -, ?
?, /, <<, <<<
25Schedule of Instructions
From IA-32 Intel Architecture Optimization
Reference Manual, April 2006
26Latency and Throughput
28
Time: 0 1 2 3 4 5 6 6.5 7 7.5 8 9 10 10.5 11 12 13 14
ALU0 (Port 0): ADD eax, edx; ADD ebx, esi; XOR ecx, eax; XOR edx, ebx; ADD esi, ecx; ADD edx, edi; XOR eax, edx
ALU1 (Port 1): ROL edx, 15; ROL esi, 25; ROL eax, 9; ROL ebx, 10; ROL ecx, 17
LOAD (Port 2): MOV edi; MOV eax; MOV ebx; MOV ecx; MOV edx; MOV esi; MOV edi; POP edx
STORE (Port 3): PUSH edx
29S-boxes
- High source of non-linearity
- 8x8 s-box single lookup
- do not use fast execution units
- at least three cycles per byte
- 8 x 8 s-box lookup: y = S(x)
ALU0 MOV MOV
30Large s-boxes
S3
S2
S1
S0
8 x 32
x2
x1
x0
x3
0
8
31
x
y
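The structure in the diagram, where each byte of x indexes its own 8 x 32 table, can be sketched as follows. This is a minimal sketch under stated assumptions: the toy s-box, the `spread` mixing step and the use of XOR to combine the four table outputs are all hypothetical and are not taken from any cipher in this talk. It shows why large tables pay off: substitution and mixing are folded into one lookup that yields 32 bits, instead of one 8 x 8 lookup per byte followed by separate mixing.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t  sbox8[256];      /* hypothetical 8 x 8 s-box */
static uint32_t sbox32[4][256];  /* hypothetical 8 x 32 tables S0..S3 */

static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << (r & 31)) | (x >> ((32 - r) & 31));
}

/* Hypothetical mixing step: copy one byte into three byte lanes. */
static uint32_t spread(uint8_t t) {
    return (uint32_t)t * 0x01000101u;
}

static void build_tables(void) {
    for (unsigned b = 0; b < 256; b++) {
        /* Toy bijection on bytes; NOT a cryptographically strong s-box. */
        sbox8[b] = (uint8_t)(b * 167u + 13u);
    }
    for (unsigned i = 0; i < 4; i++) {
        for (unsigned b = 0; b < 256; b++) {
            sbox32[i][b] = rotl32(spread(sbox8[b]), 8 * i);
        }
    }
}

/* Reference: four 8 x 8 lookups, each followed by explicit mixing. */
static uint32_t lookup_slow(uint32_t x) {
    uint32_t y = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint8_t byte = (uint8_t)(x >> (8 * i));
        y ^= rotl32(spread(sbox8[byte]), 8 * i);
    }
    return y;
}

/* Large-table version: four 8 x 32 lookups combined with XOR. */
static uint32_t lookup_fast(uint32_t x) {
    return sbox32[0][x & 0xff] ^ sbox32[1][(x >> 8) & 0xff] ^
           sbox32[2][(x >> 16) & 0xff] ^ sbox32[3][x >> 24];
}
```

The four tables occupy 4 KB, which is the memory and cache cost traded for fewer operations per output word.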
32Hermes
114 cycles/byte
33MAG
30.7 cycles/byte
34- Productivity
- Don't use small word sizes (HERMES)
- Don't throw away word as filter (MAG)
pCIPHER
35Branching
conditional
branch 1
branch 2
36Branching in MAG
throughput 0.5
2^27 iterations: mean 1.21 seconds, standard deviation 0.046975
seconds
37Improving the Branching
throughput 1 latency 10
2^27 iterations: mean 0.66 seconds, standard deviation 0.0037839
seconds
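One common way to remove a data-dependent branch is to turn the condition into a bit mask. This is a minimal sketch of that general technique, not MAG's code or the exact replacement used in the talk; the function names and the update rule in `update_branch` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branching select: may mispredict when cond depends on cipher data. */
static uint32_t select_branch(int cond, uint32_t x, uint32_t y) {
    if (cond) {
        return x;
    }
    return y;
}

/* Branch-free select: mask is all ones when cond is true, else zero. */
static uint32_t select_mask(int cond, uint32_t x, uint32_t y) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (x & mask) | (y & ~mask);
}

/* Hypothetical state update with a branch: XOR in v or its complement. */
static uint32_t update_branch(uint32_t acc, uint32_t v, uint32_t a, uint32_t b) {
    if (a > b) {
        return acc ^ v;
    }
    return acc ^ ~v;
}

/* Same update without a branch: XOR with ~mask flips all bits of v
   exactly when the comparison is false. */
static uint32_t update_mask(uint32_t acc, uint32_t v, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(a > b);
    return acc ^ v ^ ~mask;
}
```

An optimising compiler may already turn the branching forms into conditional moves, so the gain has to be measured on the target rather than assumed.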
38MMX/SSE
39SSE2
- Limited instruction set
- cannot work with immediates
- cannot indirectly address memory (sboxes)
- Operations ≥ 2 cycles each
- (half speed execution unit)
- Not interoperable with other register sets
- penalty for transferring from general registers
- Alignment penalties
- Misalignment within a cache line -40
- Misalignment across a cache line -500
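The alignment penalties above are avoided by making sure cipher state is 16-byte aligned and that no 16-byte access straddles a cache line. This is a minimal sketch using C11 `alignas`; the `cipher_state` type, the helper names and the 64-byte `CACHE_LINE` value are assumptions for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Hypothetical cipher state, forced onto a 16-byte boundary so that
   aligned 128-bit loads and stores can be used on it. */
typedef struct {
    alignas(16) uint32_t words[8];
} cipher_state;

/* True when p is a multiple of alignment. */
static int is_aligned(const void *p, uintptr_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* True when an access of width bytes starting at p touches two cache lines. */
static int crosses_cache_line(const void *p, size_t width) {
    uintptr_t first = (uintptr_t)p / CACHE_LINE;
    uintptr_t last = ((uintptr_t)p + width - 1) / CACHE_LINE;
    return first != last;
}
```

A 16-byte access that starts on a 16-byte boundary can never cross a 64-byte line, which is why aligning the state once removes both penalties.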
40Dragon
15 cycles
1.5 cycles
42 cycles
1.5 cycles
23 cycles
Time 0 0.5 1.0 1.5
ALU0 XOR EBX XOR EDX XOR EAX
ALU1 XOR ESI ADD ECX ADD EDI
41
0 MOV EDX, DWORD [EBP+0x8]
0
127
63
1 MOVDQA XMM0, [EDX]
2 MOVDQA XMM1, [EDX+16]
10 XORPD XMM1, XMM0
12 PSRLDQ XMM1, 4
14 PADDQ XMM1, 4
16 MOVDQA [EDX], XMM0
18 MOVDQA [EDX+16], XMM1
26 CALL sboxes
42Summary
- Main points of this talk
- don't make arbitrary assumptions
- understand architecture in general terms
- implement your cipher during design phase
- Secondary points
- all of the fast eSTREAM ciphers follow most of
these guidelines
- but not all guidelines
- (Py uses lots of memory, Dragon uses large
s-boxes, etc.)
- these guidelines are orthogonal to cipher
security
- they are guidelines, not constraints!
43Questions?