Matt Henricksen presentation

1
Fast Implementation of Symmetric Ciphers

Matt Henricksen
Information Security Institute
Queensland University of Technology

2
Introduction

Dragon: A Fast Word-Based Stream Cipher.
Kevin Chen, Matt Henricksen, Bill Millan, Joanne Fuller, Leonie Simpson, Ed Dawson, H. Lee, S. Moon.
Why we need fast symmetric ciphers
Ten steps to designing fast symmetric ciphers
Some conclusions

3
Why do we care about speed?

Benchmark cipher: AES
New ciphers must be
ultra-efficient (multi-gigabit per second), or
efficient in constrained devices, or
demonstrably more secure than AES

Steve Babbage. Stream ciphers: what does industry want? http://www.ecrypt.eu.org/stvl/sasc/slides21.pdf
4
ECRYPT NoE eSTREAM

2005 call for stream cipher primitives

slow broken IP
http://www.ecrypt.eu.org/stream/perf/results
5
eSTREAM Candidate Statements about Software Implementation
6
Low Risk, High Reward Guidelines
Cipher efficiency depends not only on the number and type of operations
but also on the match between design and architecture
7
Timing Symmetric Ciphers

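; Pass 1: measure the overhead of the timing code itself
cpuid                  ; serialize so earlier instructions cannot overlap the measurement
rdtsc                  ; read the time-stamp counter (low 32 bits in eax)
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax       ; subtime = cycles spent in cpuid/rdtsc alone

; Pass 2: repeat the overhead measurement
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax

; Time the operation
cpuid
rdtsc
mov time, eax
; ... do operation ...
cpuid
rdtsc
sub eax, time
mov time, eax          ; time = raw cycle count; subtract subtime for the net cost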
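
The same measure-the-overhead-then-subtract idea can be sketched in a higher-level language. This is a minimal Python sketch, assuming time.perf_counter_ns as a stand-in for rdtsc: it reports nanoseconds rather than cycles and does not serialize the pipeline the way cpuid does. The names timer_overhead_ns and time_operation_ns are hypothetical, and taking the minimum over several repeats is a choice of this sketch, not something the slide states.

```python
import time


def timer_overhead_ns(repeats=1000):
    """Smallest observed cost of two back-to-back timer reads (the 'subtime' role)."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        delta = stop - start
        if best is None or delta < best:
            best = delta
    return best


def time_operation_ns(operation, repeats=1000):
    """Smallest observed run time of operation(), with the timer overhead subtracted."""
    overhead = timer_overhead_ns(repeats)
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        operation()
        stop = time.perf_counter_ns()
        delta = stop - start
        if best is None or delta < best:
            best = delta
    # Clamp at zero: a very cheap operation can measure below the overhead estimate.
    return max(best - overhead, 0)


if __name__ == "__main__":
    print(time_operation_ns(lambda: sum(range(1000))), "ns")
```

Such a measurement is only a rough guide; cycle-accurate figures like those on these slides need the rdtsc approach.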
8
Timing Symmetric Ciphers
9
53.3
10
Intel Pentium 4 Architecture
Diagram labels: execution engine, L2 cache, front-end, execution core, ports (fast/normal integer, memory load, memory store), retirement unit
11
Registers

Register pressure
seven usable registers, often more variables
invisible overhead of spilling an eighth variable:
PUSH register
MOV register, eighth variable
MOV eighth variable, register
POP register

12
Phelix
Doug Whiting, Bruce Schneier, Stefan Lucks and Frédéric Muller. Phelix: Fast Encryption and Authentication in a Single Cryptographic Primitive
13
Phelix
14
Mir-1 (Loop State Update)
15
Large States and Small Updates
Dragon Stream Cipher
16
(No Transcript)
17
Diagram labels: L2 cache, execution engine, front-end, execution core, retirement unit
18
FSRs
Diagram labels: register stages S0 and S4; buffer indices x and (x-4) mod l
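
The (x-4) mod l label suggests a feedback shift register kept in a fixed buffer and addressed through a moving index, instead of physically shifting every stage. The following is a minimal Python sketch under that assumption. The class names, the layout (stage j stored at index (x-j) mod l), and the feedback S0 XOR S4 are placeholders chosen for illustration, not the update function of any cipher in this talk.

```python
MASK32 = 0xFFFFFFFF


class ShiftingFSR:
    """Reference version: physically moves every stage on each clock."""

    def __init__(self, stages):
        self.s = list(stages)

    def clock(self):
        new = (self.s[0] ^ self.s[4]) & MASK32  # placeholder feedback
        self.s = [new] + self.s[:-1]            # new word enters at S0, last stage drops out
        return new


class IndexedFSR:
    """Same register, but stages stay in place and only the index x moves."""

    def __init__(self, stages):
        stages = list(stages)
        self.l = len(stages)
        self.x = 0
        self.buf = [0] * self.l
        for j, value in enumerate(stages):
            self.buf[(self.x - j) % self.l] = value  # stage j lives at (x - j) mod l

    def stage(self, j):
        return self.buf[(self.x - j) % self.l]

    def clock(self):
        new = (self.stage(0) ^ self.stage(4)) & MASK32  # placeholder feedback
        self.x = (self.x + 1) % self.l
        self.buf[self.x] = new  # overwrites the slot of the old last stage
        return new
```

The indexed form replaces l word moves per clock with one index update, at the cost of a modular reduction on every stage access.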
19
Unrolling FSRs

Advantages
no need for index
loop unrolling benefit
no bound checking

Disadvantages
increased code footprint
reduced flexibility
possible overhead
reduced applicability
cache penalties
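
As an illustration of the trade-off listed above, here is a minimal Python sketch of one full rotation of an 8-stage register, first with a moving index and then fully unrolled. The register length, the feedback (stage 0 XOR stage 4), and the function names are assumptions made for the example. The unrolled form needs no index and no modular reduction, but its code grows with the register length and it is tied to that one length.

```python
def clock_indexed(buf, x):
    """One clock of an 8-stage register addressed through a moving index x."""
    new = buf[x % 8] ^ buf[(x - 4) % 8]  # placeholder feedback: stage 0 XOR stage 4
    x = (x + 1) % 8
    buf[x] = new
    return x


def clock8_unrolled(buf):
    """Eight clocks with every index fixed; assumes x == 0 on entry and on exit."""
    buf[1] = buf[0] ^ buf[4]
    buf[2] = buf[1] ^ buf[5]
    buf[3] = buf[2] ^ buf[6]
    buf[4] = buf[3] ^ buf[7]
    buf[5] = buf[4] ^ buf[0]
    buf[6] = buf[5] ^ buf[1]
    buf[7] = buf[6] ^ buf[2]
    buf[0] = buf[7] ^ buf[3]
```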

20
Py
P: 256-byte array
Pi: 1-word pointer into P
s: 1 word of memory
Y: 260-word array (1040 bytes)
Yi: 1-word pointer into Y
Eli Biham and Jennifer Seberry. Py: A Fast and Secure Stream Cipher using Rolling Arrays
21
Py Strategy
F
default strategy: allocate 4000 stages, 32 kilobytes

Advantages
25% increase in speed
loop unrolling benefit
no bound checking

Disadvantages
increased code footprint (1500)
only partially unrolled
extensive copy retained
reduced flexibility
reduced applicability
cache penalties
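
The memory-for-speed trade behind this strategy can be sketched generically. The following is a minimal Python sketch of a rolling array: a fixed-length window slides forward through a larger buffer, and the contents are copied back to the start only when the window reaches the end. The class RollingArray and its capacity parameter are hypothetical; this is not Py's reference code.

```python
class RollingArray:
    """A window of fixed length that slides through a larger backing buffer."""

    def __init__(self, initial, capacity):
        self.w = len(initial)
        if capacity < self.w:
            raise ValueError("capacity must be at least the window length")
        self.buf = list(initial) + [0] * (capacity - self.w)
        self.base = 0  # index of window element 0 inside buf

    def __getitem__(self, j):
        return self.buf[self.base + j]

    def roll(self, new):
        """Drop element 0 and append `new` as the last element."""
        if self.base + self.w == len(self.buf):
            # Out of room: copy the surviving w-1 elements back to the start.
            self.buf[0:self.w - 1] = self.buf[self.base + 1:self.base + self.w]
            self.base = 0
            self.buf[self.w - 1] = new
        else:
            # Common case: no copying, just write one word and advance the base.
            self.buf[self.base + self.w] = new
            self.base += 1
```

A larger capacity makes the copy rarer, which is the trade the slide describes: more memory and possible cache penalties in exchange for fewer moves per output word.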

22
Merkle-Damgård Construction
(128 bits)
23
Iterated Halving
Praveen S.S. Gauravaram, Lauren May and William L. Millan. CRUSH: A New Cryptographic Hash Function using Iterative Halving Technique
24
Execution Engine
Port 0: ALU x2 (+, -, ?, AND, OR, NOT, if-then, store); Move x1
Port 1: ALU x2 (+, -, ?); Int x1 (?, /, <<, <<<); MMX x½
Port 2: Memory Load
Port 3: Memory Store
25
Schedule of Instructions
From IA-32 Intel Architecture Optimization
Reference Manual, April 2006
26
Latency and Throughput
27
(No Transcript)
28
Time: 0 1 2 3 4 5 6 6.5 7 7.5 8 9 10 10.5 11 12 13 14
ALU0 (Port 0): ADD eax, edx; ADD ebx, esi; XOR ecx, eax; XOR edx, ebx; ADD esi, ecx; ADD edx, edi; XOR eax, edx
ALU1 (Port 1): ROL edx, 15; ROL esi, 25; ROL eax, 9; ROL ebx, 10; ROL ecx, 17
LOAD (Port 2): MOV edi; MOV eax; MOV ebx; MOV ecx; MOV edx; MOV esi; MOV edi; POP edx
STORE (Port 3): PUSH edx
29
S-boxes

High source of non-linearity
8x8 s-box single lookup
do not use fast execution units
at least three cycles per byte
8 x 8 s-box lookup: y = S(x)

ALU0 MOV MOV
30
Large s-boxes
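Diagram: 32-bit input x (bit positions 0, 8 and 31 marked) split into bytes x0, x1, x2, x3; four 8 x 32 s-boxes S0, S1, S2, S3; 32-bit output y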
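
One common way to use 8 x 32 tables is to spend one lookup per input byte and combine the four 32-bit results. The following is a minimal Python sketch under that assumption. The toy s-box, the pre-shifted tables T, and the XOR combination are all placeholders chosen so the result can be checked against a byte-by-byte substitution; they are not the s-boxes or the combining rule of any cipher in this talk.

```python
def toy_sbox(b):
    """Placeholder 8x8 permutation; NOT cryptographically sound."""
    return (b * 7 + 3) & 0xFF


SBOX = [toy_sbox(b) for b in range(256)]

# Four 8x32 tables: table k maps a byte to the s-box output pre-shifted into byte k.
T = [[SBOX[b] << (8 * k) for b in range(256)] for k in range(4)]


def sub_word_bytewise(x):
    """Substitute each byte with the 8x8 s-box, reassembling the word by shifting."""
    y = 0
    for k in range(4):
        byte = (x >> (8 * k)) & 0xFF
        y |= SBOX[byte] << (8 * k)
    return y


def sub_word_tables(x):
    """Same result from four 8x32 lookups combined with XOR."""
    return (T[0][x & 0xFF]
            ^ T[1][(x >> 8) & 0xFF]
            ^ T[2][(x >> 16) & 0xFF]
            ^ T[3][(x >> 24) & 0xFF])
```

The tables cost 4 x 256 x 4 bytes = 4 kilobytes, which is the memory price paid for having each lookup return a full word.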
31
(No Transcript)
32
Hermes
114 cycles/byte
33
MAG
30.7 cycles/byte
34

Productivity
Don't use small word sizes (Hermes)
Don't throw away a word as filter (MAG)

pCIPHER
35
Branching
conditional
branch 1
branch 2
36
Branching in MAG
throughput 0.5
2^27 iterations: μ = 1.21 seconds, σ = 0.046975 seconds
37
Improving the Branching
throughput 1, latency 10
2^27 iterations: μ = 0.66 seconds, σ = 0.0037839 seconds
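
The slide does not say which instruction sequence replaced the branch. As a general illustration of removing a data-dependent branch, here is a minimal Python sketch of a branch-free select on 32-bit words using a mask; the function names are hypothetical. In Python this shows only the logic, since the speed benefit comes from avoiding branch mispredictions in compiled code.

```python
MASK32 = 0xFFFFFFFF


def select_branching(cond, a, b):
    """Return a if cond is set, else b, using a conditional branch."""
    if cond & 1:
        return a
    return b


def select_branchless(cond, a, b):
    """Same result with no branch: the mask is all ones when cond is 1, else all zeros."""
    mask = (-(cond & 1)) & MASK32
    return (a & mask) | (b & ~mask & MASK32)
```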
38
MMX/SSE

39
SSE2

Limited instruction set
cannot work with immediates
cannot indirectly address memory (s-boxes)
Operations ≥ 2 cycles each
(half-speed execution unit)
Not interoperable with other register sets
penalty for transferring from general registers
Alignment penalties
Misalignment within a cache line -40
Misalignment across a cache line -500

40
Dragon
15 cycles
1.5 cycles
42 cycles
1.5 cycles
23 cycles
Time   0         0.5       1.0       1.5
ALU0   XOR EBX   XOR EDX   XOR EAX
ALU1   XOR ESI   ADD ECX   ADD EDI
41
Diagram labels (bit positions in the 128-bit register): 0, 63, 127
0    MOV EDX, DWORD [EBP + 0x8]
1    MOVDQA XMM0, [EDX]
2    MOVDQA XMM1, [EDX + 16]
10   XORPD XMM1, XMM0
12   PSRLDQ XMM1, 4
14   PADDQ XMM1, 4
16   MOVDQA [EDX], XMM0
18   MOVDQA [EDX + 16], XMM1
26   CALL sboxes
42
Summary

Main points of this talk
don't make arbitrary assumptions
understand the architecture in general terms
implement your cipher during the design phase
Secondary points
all of the fast eSTREAM ciphers follow most of these guidelines, but not all of them
(Py uses lots of memory, Dragon uses large s-boxes, etc.)
these guidelines are orthogonal to cipher security
they are guidelines, not constraints!

43
Questions?
