Title: Master
1 Master's Thesis: Fast Flexible Architectures for Secure Communication
- Lisa Wu
- University of Michigan
- Advanced Computer Architecture Laboratory
- Advisor: Professor Todd Austin
2 Project Overview
- Cipher Kernel Analyses
- Throughput analysis, bottleneck analysis, relative run time cost, kernel characterization
- Architectural Extensions
- CryptoManiac Architecture
- Instruction architecture, system architecture, processing element architecture, physical design characteristics
- Super Optimizer
- Validation and parameter studies
- Performance Analysis
- Encryption rate studies
3 My Research Contribution
- Design and implementation of the CryptoManiac co-processor
- Hardware models of CryptoManiac
- 8WC, 4WC, 3WC, 2WC, and 4WNC
- ISA and scheduling of kernels
- Timing, area, power, and performance analyses of the CryptoManiac co-processor
- Design and implementation of the super optimizer
- Instruction combination study
- Automatic generation of varied-width schedules
- Publication: ISCA 2001
4 Cryptography
- Definitions
- Encryption vs. decryption
- Public-key cipher vs. secret-key cipher
- Public-secret key ciphers are the most commonly used
[Figure: two flows, each plaintext -> ciphertext -> plaintext. The first uses a Public Key followed by a Private Key; the second uses the same Private Key for both steps.]
5 SSL Session Breakdown (Focus: Secret-Key Ciphers)
[Figure: SSL session timeline between server and client. Stages: authenticate, https get, https recv, ..., close. Labels: public, private key, private.]
6 Benchmark Suite
Cipher    Key Size (bits)  Blk Size (bits)  Rnds/Blk  Author        Application
3DES      112              64               48        CryptSoft     SSL, SSH
Blowfish  128              64               16        CryptSoft     Norton Utilities
IDEA      128              64               8         Ascom         PGP, SSH
Mars      128              128              16        IBM           AES Candidate
RC4       128              8                1         CryptSoft     SSL
RC6       128              128              18        RSA Security  AES Candidate
Rijndael  128              128              10        Rijmen        AES Standard
Twofish   128              128              16        Counterpane   AES Candidate
7 Cipher Throughput Analysis
- Alpha 21264 vs. 4W
- All except Mars and Twofish were within 10% of the actual machine tests
- Mars 11%, Twofish 15%
- Alpha 21264 vs. DF
- Blowfish, IDEA, and RC6 are running within 20% of DF performance
- Mars 29%, Twofish 76%
- RC4 and Rijndael are outliers
8 Cipher Bottleneck Analysis
- Alias - impact of stalling loads in the pipeline until all earlier store addresses have been resolved
- Branch - effects of mispredictions
- Issue - impact of reducing issue width
- Mem - impact of introducing a realistic memory system
- Res - impact of limited functional unit resources
- Window - impact of a limited-size instruction window
9 Cipher Relative Run Time Cost (Focus: Kernel Loop)
- 3DES and IDEA are small even for 16 byte sessions
- Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k byte sessions
- Blowfish is the outlier; it drops below 10% only for 64k byte sessions
10 Cipher Kernel Characterization
- SBOX - substitutions
- XBOX - permutations
- IDEA, Mars, RC4, and RC6 rely on arithmetic computations; they benefit from more resources (multiplies) and from faster operations (rotates)
- Blowfish, 3DES, Rijndael, and Twofish rely on substitutions; they benefit from increased memory bandwidth and accesses
11 Architectural Extensions
- All instructions are limited to two register input operands and one register output
- ROL and ROR (rotates) for 64 and 32-bit data types
- ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
- MULMOD computes the modular multiplication of two register values modulo the value 0x10001
- SBOX speeds the accessing of substitution tables with 256-entry tables and 32-bit contents
- SBOXSYNC synchronizes the SBOX table with memory
- XBOX implements a portion of a full 64-bit permutation
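The rotate and modular-multiply extensions listed above can be modeled in a few lines of Python. This is only a sketch for the 32-bit case: the function names and the operand order of `rolx` and `rorx` are assumptions made for illustration, not the actual instruction encoding, and IDEA's convention of treating a zero operand as 2^16 is deliberately not modeled.

```python
MASK32 = 0xFFFFFFFF
MULMOD_MODULUS = 0x10001  # 2**16 + 1, the modulus named on the slide


def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits (models ROL)."""
    amount %= 32
    value &= MASK32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits (models ROR)."""
    return rol32(value, 32 - (amount % 32))


def rolx(rotated, other, amount):
    """Constant left rotate of one register, then XOR with another (models ROLX).

    The argument order is a hypothetical choice for this sketch.
    """
    return rol32(rotated, amount) ^ (other & MASK32)


def rorx(rotated, other, amount):
    """Constant right rotate of one register, then XOR with another (models RORX)."""
    return ror32(rotated, amount) ^ (other & MASK32)


def mulmod(a, b):
    """Multiply two register values modulo 0x10001 (models MULMOD)."""
    return (a * b) % MULMOD_MODULUS


if __name__ == "__main__":
    print(hex(rol32(0x80000001, 1)))
    print(hex(rolx(0x80000001, 0xFF, 1)))
    print(hex(mulmod(2, 0x8000)))
```

In software without these instructions, each rotate costs two shifts and an OR, and ROLX adds a further XOR, which is why folding them into single operations helps rotate-heavy kernels.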
12 SBOX Instruction Semantics
- SBOX instruction eliminates address generation
- All SBOX tables are aligned to a 1k byte boundary
- Address generation becomes zero-latency bit concatenation
- Stores to SBOX storage are not visible to later SBOX instructions until:
- An SBOXSYNC is executed
- An alias bit is set
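To see why 1k byte alignment turns address generation into bit concatenation, note that a 256-entry table of 32-bit words occupies exactly 1024 bytes, so the low 10 bits of an aligned base address are zero. The sketch below compares the conventional add-based address with the concatenated one; the function names and the example base address are hypothetical, chosen only for illustration.

```python
SBOX_ENTRIES = 256                         # entries per table, from the slide
ENTRY_BYTES = 4                            # 32-bit contents
TABLE_BYTES = SBOX_ENTRIES * ENTRY_BYTES   # 1024 bytes, i.e. 1k


def conventional_address(table_base, index):
    """Address computed the usual way: scale the index, then add."""
    return table_base + index * ENTRY_BYTES


def sbox_address(table_base, index):
    """Address formed by concatenating base bits with the index bits.

    Because the base is 1k-aligned, its low 10 bits are zero, so OR-ing in
    the index shifted left by 2 can never carry into the base bits.
    """
    if table_base % TABLE_BYTES != 0:
        raise ValueError("SBOX table must be aligned to a 1k byte boundary")
    if not 0 <= index < SBOX_ENTRIES:
        raise ValueError("SBOX index must fit in 8 bits")
    return table_base | (index << 2)


if __name__ == "__main__":
    base = 0x4000  # hypothetical aligned table address
    print(hex(sbox_address(base, 0x12)))
    print(all(sbox_address(base, i) == conventional_address(base, i)
              for i in range(SBOX_ENTRIES)))
```

The OR has no carry chain, which is what allows the hardware to treat the address as wiring rather than as an adder on the critical path.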
13 Performance of ISA Extensions
14 The CryptoManiac Processor
- A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
- Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
- Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism
15 CryptoManiac ISA
- bundle := <inst><inst><inst><inst>
- inst := <operation pair><dest><operand 1><operand 2><operand 3>
- operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
- tiny := <xor> | <and> | <inc> | <signext> | <nop>
- short := <add> | <sub> | <rot> | <sbox> | <nop>
- long := <mul> | <mulmod>
- Examples
- Instruction                 Expression
- Add-Xor R4, R1, R2, R3      R4 <- (R1 + R2) ⊕ R3
- And-Rot R4, R1, R2, R3      R4 <- (R1 & R2) <<< R3
- And-Xor R4, R1, R2, R3      R4 <- (R1 & R2) ⊕ R3
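The three example instructions can be mimicked with a small interpreter for combined operation pairs. This is a sketch under stated assumptions: it reads the operation-pair grammar as allowing short-tiny, tiny-short, and tiny-tiny combinations, it assumes 32-bit wraparound, and it models only add, sub, rot, xor, and and. The names `OPS`, `LEGAL_PAIRS`, and `execute_pair` are hypothetical and do not come from the CryptoManiac tools.

```python
MASK32 = 0xFFFFFFFF


def _rotl(value, amount):
    """Rotate a 32-bit value left; the amount is taken modulo 32."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


# Operation name -> (latency class, two-input function).
OPS = {
    "add": ("short", lambda a, b: (a + b) & MASK32),
    "sub": ("short", lambda a, b: (a - b) & MASK32),
    "rot": ("short", _rotl),
    "xor": ("tiny", lambda a, b: a ^ b),
    "and": ("tiny", lambda a, b: a & b),
}

# Assumed reading of the operation-pair grammar (long-nop pairs are omitted).
LEGAL_PAIRS = {("short", "tiny"), ("tiny", "short"), ("tiny", "tiny")}


def execute_pair(first, second, r1, r2, r3):
    """Compute second(first(r1, r2), r3), as one combined instruction would."""
    first_class, first_fn = OPS[first]
    second_class, second_fn = OPS[second]
    if (first_class, second_class) not in LEGAL_PAIRS:
        raise ValueError(f"{first}-{second} is not a legal operation pair")
    intermediate = first_fn(r1 & MASK32, r2 & MASK32)
    return second_fn(intermediate, r3 & MASK32)


if __name__ == "__main__":
    print(hex(execute_pair("add", "xor", 5, 3, 0xF)))          # Add-Xor
    print(hex(execute_pair("and", "rot", 0xF0F0, 0x0FF0, 4)))  # And-Rot
    print(hex(execute_pair("and", "xor", 0xFF, 0x0F, 0xF0)))   # And-Xor
```

Each call consumes three register inputs and produces one result, which is the triadic form that lets two dependent operations share a single instruction slot.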
16 Scheduling Example: Blowfish
17 High-Level Schematic of a Single Functional Unit
18 CryptoManiac Architecture
19 CryptoManiac System Architecture
20 Timing and Area Results
21 Encryption Performance
22 Special Case Studies: 3DES and Rijndael
23 The Super Optimizer
- Validate hand-scheduled kernel results
- Automate generation of optimized kernels for the various CryptoManiac architectures studied
- Instruction combination studies give insight into which hardware may be unnecessary and could be eliminated
24 Instruction Combination Study
25 Instruction Combining Characteristics
26 Conclusion
- Two hardware/software design techniques to improve the performance of secret-key cipher algorithms
- Add instruction support for fast substitutions, general permutations, rotates, and modular arithmetic
- SBOX eliminates address generation
- Overall speedup of 59% over baseline machine w/ rotates
- Design an efficient 4-wide VLIW cryptographic co-processor called the CryptoManiac
- Instruction combining - efficient utilization of clock cycle
- Rijndael runs 2.25 times faster with 1/100th the area and power of a 600MHz Alpha processor
27 Future Work
- Assess the cost of programmability in the CryptoManiac by comparing design and performance of:
- A dedicated hardware Rijndael implementation (no programmability)
- An FPGA Rijndael implementation (hardware programmability)
- CryptoManiac (software programmability)
- Other application-specific processors, such as audio processing, speech recognition, and soft-radio
28 Acknowledgement
- Credit for much of the work described in this thesis belongs to my advisor, Professor Todd Austin, for his insight, guidance, and patience. He provided an excellent research environment, left me enough freedom to do things the way I thought they should be done, and was always available to discuss ideas and problems.
- I would also like to thank my committee members, Professor Steve Reinhardt and Professor Gary Tyson, for reviewing this document and serving on the defense committee.
- Other people who have contributed to the CryptoManiac project include Chris Weaver, for hardware design and synthesis support, and Jerome Burke and John McDonald, for earlier versions of the ISA extension code modifications.