Title: Generating Truly Optimal Code Using a Metaprogramming Library
1. Generating Truly Optimal Code Using a Metaprogramming Library
- Don Clugston
- First D Programming Conference, 24 August 2007
2. String mixins in D: undercooked, but very tasty
    char[] greet(char[] greeting) {
        return `writefln("` ~ greeting ~ `, world!");`;
    }
    void main() {
        mixin( greet("Hello") );
    }
- Compiles to:
    void main() { writefln("Hello, world!"); }
- Vindicates built-in string operations
3. The Challenge
- Fortran BLAS (a standard set of highly optimised routines). The crucial functions are coded in asm.
  - y += a * x
    void DAXPY(double[] y, double[] x, double a) {
        for (int i = 0; i < y.length; ++i)
            y[i] += x[i] * a;
    }
- But BLAS is limited: nothing for simple things
  - x = y - z
  - a = r*0.3 + g*0.5 + b*0.2
4. Operator overloading
- Gives ideal syntax, always works
- Can't operate on built-in types
- Inefficient because:
  - Creates unnecessary temporaries.
  - Multiple loops, e.g. a = b + c + d →
    double[] temp1 = new double[b.length], temp2 = new double[b.length];
    for (int i = 0; i < b.length; ++i)
        temp1[i] = b[i] + c[i];
    for (int i = 0; i < temp1.length; ++i)
        temp2[i] = temp1[i] + d[i];
    a = temp2;
- Somehow, we need to get the expression inside the for loop!
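The cost described on this slide can be made visible with a small D2 sketch. The `Vec` wrapper, its `allocations` counter and `fusedAdd` are hypothetical names introduced only for illustration; they are not part of the talk or of any library. The sketch assumes a current D2 compiler rather than the 2007 one.

```d
// Hypothetical vector wrapper (illustrative only, not from the talk).
struct Vec
{
    double[] data;
    static int allocations;   // counts temporaries created by the operators

    // Element-wise + or *: every call allocates a fresh array and runs its own loop.
    Vec opBinary(string op)(Vec rhs) if (op == "+" || op == "*")
    {
        auto result = new double[data.length];
        ++allocations;
        foreach (i; 0 .. data.length)
            result[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return Vec(result);
    }
}

// Hand-fused equivalent of a = b + c + d: one loop, no temporaries.
void fusedAdd(double[] a, const double[] b, const double[] c, const double[] d)
{
    foreach (i; 0 .. a.length)
        a[i] = b[i] + c[i] + d[i];
}

unittest
{
    auto b = Vec([1.0, 2.0]);
    auto c = Vec([10.0, 20.0]);
    auto d = Vec([100.0, 200.0]);

    Vec.allocations = 0;
    Vec a = b + c + d;                 // two operators: two loops, two temporaries
    assert(a.data == [111.0, 222.0]);
    assert(Vec.allocations == 2);

    auto fused = new double[2];
    fusedAdd(fused, b.data, c.data, d.data);
    assert(fused == a.data);           // same result from a single loop
}
```

Building with `dmd -unittest -main` runs the checks. The overloaded form needs one allocation and one pass per operator, while the fused loop needs neither.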
5. The Wizard Solution: Expression Templates (e.g., Blitz++)
- Overloaded operators don't do the calculation; instead, they record the operation as a proxy type, creating a syntax tree.
- Example: (a+b)/(c-d)
    DVExpr<DVBinExprOp<
        DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApAdd>>,
        DVExpr<DVBinExprOp<DVec::iterT, DVec::iterT, DApSubtract>>,
        DApDivide>>
- Need a good optimiser.
- Works in D as well as C++. BUT we are fighting the compiler!
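For comparison, here is a minimal D2 sketch of the same proxy-type idea. `Leaf`, `BinExpr` and `assign` are hypothetical names chosen for this illustration; they do not come from Blitz++ or from the talk, and the sketch assumes a current D2 compiler.

```d
// Leaf node: wraps an array (illustrative name).
struct Leaf
{
    const(double)[] data;

    double opIndex(size_t i) { return data[i]; }

    // Operators do no arithmetic; they only build a node type.
    auto opBinary(string op, R)(R rhs)
    {
        return BinExpr!(op, Leaf, R)(this, rhs);
    }
}

// Interior node: the operator and both operand types are part of the type.
struct BinExpr(string op, L, R)
{
    L lhs;
    R rhs;

    // The arithmetic happens here, one element at a time.
    double opIndex(size_t i)
    {
        return mixin("lhs[i] " ~ op ~ " rhs[i]");
    }

    auto opBinary(string op2, R2)(R2 rhs2)
    {
        return BinExpr!(op2, typeof(this), R2)(this, rhs2);
    }
}

// Evaluates the whole tree inside a single loop.
void assign(E)(double[] dest, E expr)
{
    foreach (i; 0 .. dest.length)
        dest[i] = expr[i];
}

unittest
{
    auto a = Leaf([8.0, 9.0]);
    auto b = Leaf([2.0, 3.0]);
    auto c = Leaf([7.0, 5.0]);
    auto d = Leaf([2.0, 1.0]);

    auto tree = (a + b) / (c - d);
    // The syntax tree is encoded in the type of the expression.
    static assert(is(typeof(tree) ==
        BinExpr!("/", BinExpr!("+", Leaf, Leaf), BinExpr!("-", Leaf, Leaf))));

    auto result = new double[2];
    assign(result, tree);
    assert(result == [2.0, 3.0]);   // (8+2)/(7-2) and (9+3)/(5-1)
}
```

The nested type in the `static assert` plays the role of the long C++ type on the slide. Whether the generated loop is fast still depends on the optimiser inlining every `opIndex` call.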
6. Representing the Syntax Tree in D
- In D, any expression can be represented in a single template.
- Represent types and values in a tuple. Represent the expression in a char[]. A..Z correspond to T[0]..T[25].
- e.g.
    void vectorOperation(char[] expression, T...)(T values)
    vectorOperation!("A+=(B*C)/(A+D)")(x, y, z, u, v);
- Note that A appears twice in the expression (operator overloading can't represent that).
7. Finding the vectors in a tuple
- It's a vector if you can index it.
- Imperfection: can't index a tuple in CTFE.
- Workaround: create an array of results.
- Usage:
  - if ( isVector!(Tuple)[i] )
    template isVector(T...) {
        static if (T.length == 0)
            const bool[] isVector = [];
        else static if ( is( typeof(T[0].init[0]) ) )   // indexable: a vector
            const bool[] isVector = [true] ~ isVector!(T[1..$]);
        else
            const bool[] isVector = [false] ~ isVector!(T[1..$]);
    }
8. Metaprogramming For Muggles
    char[] muggle(char[] expr, Values...)() {
        char[] code = "for (int i=0; i<values[0].length; ++i) { ";
        foreach (c; expr) {
            if (c >= 'A' && c <= 'Z') {
                // A-Z become tuple members.
                code ~= "values[" ~ itoa(c-'A') ~ "]";
                // add [i] if it was a vector
                if (isVector!(Values)[c-'A']) code ~= "[i]";
            } else
                code ~= c;   // Everything else is unchanged
        }
        return code ~ "; }";
    }

    template VEC(char[] expr) {
        void VEC(Values...)(Values values) {
            mixin( muggle!(expr, Values) );
        }
    }
- USAGE
    double[] firstvec, secondvec, thirdvec;
    VEC!("A+=B*(C+A*D)")(firstvec, secondvec, thirdvec, 25.7);
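The same generator can be sketched in present-day D2, which makes it easy to try out. This is an adaptation under stated assumptions, not the talk's code: `canIndex`, `generateLoop` and `vec` are names invented here, `std.conv.to` stands in for the slide's `itoa` helper, and the operators in the test expression are chosen for illustration.

```d
import std.conv : to;

// True if a value of type T can be indexed, i.e. it is treated as a vector.
enum bool canIndex(T) = is(typeof(T.init[0]));

// Builds the loop source at compile time. A..Z map to values[0]..values[25].
string generateLoop(Values...)(string expr)
{
    // Same workaround as on the slide: an array with one flag per tuple member.
    bool[] isVec;
    foreach (T; Values)
        isVec ~= canIndex!T;

    string code = "foreach (i; 0 .. values[0].length) { ";
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
        {
            code ~= "values[" ~ to!string(c - 'A') ~ "]";
            if (isVec[c - 'A'])
                code ~= "[i]";          // vectors are indexed, scalars are not
        }
        else
            code ~= c;                  // operators and parentheses pass through
    }
    return code ~ "; }";
}

// Mixes the generated loop into a function specialised for this expression.
void vec(string expr, Values...)(Values values)
{
    mixin(generateLoop!Values(expr));
}

// The generated source can be inspected at compile time.
static assert(generateLoop!(double[], double)("A*=B") ==
    "foreach (i; 0 .. values[0].length) { values[0][i]*=values[1]; }");

unittest
{
    double[] a = [1.0, 2.0];
    double[] b = [3.0, 4.0];
    double[] c = [5.0, 6.0];
    vec!"A+=B*(C+A*D)"(a, b, c, 2.0);
    // a[0] = 1 + 3*(5 + 1*2) = 22,  a[1] = 2 + 4*(6 + 2*2) = 42
    assert(a == [22.0, 42.0]);
}
```

Because the whole expression ends up inside one loop body, there are no temporaries and no proxy types for the optimiser to see through.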
9. Trivial enhancements
- Ensure all vectors are the same length.
    // i counts from 0 over the slice, so it refers to values[i + 1]
    foreach (int i, bool b; isVector!(Values)[1..$])
        if (b) code ~= "assert(values[" ~ itoa(i + 1) ~ "].length == values[0].length);";
- Assert no aliasing (vectors don't overlap).
- Equalize with hand-coded asm BLAS routines.
    static if ( expr == "A+=B*C" && is( Values[0] : double[] )
            && is( Values[1] : double[] ) && is( Values[2] : double ) )
        return DAXPY(values[0].length, values[0].ptr, values[1].ptr, values[2]);
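The three enhancements can be sketched together in D2. Everything named here is an assumption made for illustration: `daxpy` is a plain D stand-in for a hand-tuned BLAS routine, and `overlaps` and `vecDispatch` are hypothetical helpers, not functions from the talk.

```d
// Plain D stand-in for a hand-coded asm DAXPY: y += a * x.
void daxpy(double[] y, const double[] x, double a)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

// True when the two slices share at least one element of memory.
bool overlaps(const double[] p, const double[] q)
{
    return p.length != 0 && q.length != 0
        && p.ptr < q.ptr + q.length
        && q.ptr < p.ptr + p.length;
}

// Recognises one expression pattern at compile time and forwards it.
void vecDispatch(string expr, Values...)(Values values)
{
    static if (expr == "A+=B*C" && is(Values[0] == double[])
            && is(Values[1] == double[]) && is(Values[2] == double))
    {
        assert(values[0].length == values[1].length, "vector lengths differ");
        assert(!overlaps(values[0], values[1]), "vectors alias each other");
        daxpy(values[0], values[1], values[2]);
    }
    else
        static assert(false, "no specialised routine for " ~ expr);
}

unittest
{
    double[] y = [1.0, 2.0];
    double[] x = [10.0, 20.0];
    vecDispatch!"A+=B*C"(y, x, 0.5);
    assert(y == [6.0, 12.0]);

    assert(!overlaps(x, y));
    assert(overlaps(y, y[1 .. $]));
}
```

The pattern match costs nothing at run time because `static if` resolves it during compilation. In a full generator the `else` branch would fall back to the generic loop instead of rejecting the expression.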
10. Asm code via perturbation
- It's hard to determine the optimal asm for an algorithm; much easier to modify existing code.
- Begin with Agner Fog's optimal asm code for DAXPY. Use the same loop design and register allocation strategy.
- Ignore difficult cases: fall back to D code.
11. X87 (stack-based)
- Convert the infix expression into postfix. Split += into + and =.
- Swap operands to avoid FMUL latency.
  - A += B - C*D → A = (A+B) - (C*D)
  - → C D * A B + - A =
- Avoid gaps in the instruction set
  - E.g., fewer instructions for 80-bit reals, so load them first whenever possible.
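The infix-to-postfix step can be run entirely at compile time. The sketch below uses the standard shunting-yard algorithm, which is a swapped-in textbook technique and not necessarily how the talk's library does it. It handles only single-letter operands, the four arithmetic operators and parentheses, and it does not attempt the operand swapping or the += splitting described on the slide. `precedence` and `toPostfix` are names invented here.

```d
// Binding strength of each supported operator; 0 means "not an operator".
int precedence(char op)
{
    switch (op)
    {
        case '+', '-': return 1;
        case '*', '/': return 2;
        default:       return 0;
    }
}

// Shunting-yard conversion; all operators are treated as left-associative.
string toPostfix(string infix)
{
    string output;
    char[] stack;
    foreach (c; infix)
    {
        if (c >= 'A' && c <= 'Z')
            output ~= c;                         // operands go straight out
        else if (c == '(')
            stack ~= c;
        else if (c == ')')
        {
            // Flush operators back to the matching '('.
            while (stack.length && stack[$ - 1] != '(')
            {
                output ~= stack[$ - 1];
                stack = stack[0 .. $ - 1];
            }
            if (stack.length)
                stack = stack[0 .. $ - 1];       // discard the '('
        }
        else if (precedence(c) > 0)
        {
            // Pop operators that bind at least as tightly as c.
            while (stack.length && precedence(stack[$ - 1]) >= precedence(c))
            {
                output ~= stack[$ - 1];
                stack = stack[0 .. $ - 1];
            }
            stack ~= c;
        }
        // Any other character (for example a space) is skipped.
    }
    while (stack.length)
    {
        output ~= stack[$ - 1];
        stack = stack[0 .. $ - 1];
    }
    return output;
}

// Evaluated by the compiler through CTFE.
static assert(toPostfix("(A+B)-(C*D)") == "AB+CD*-");
static assert(toPostfix("A+B*C") == "ABC*+");
static assert(toPostfix("A-B-C") == "AB-C-");
```

A postfix string maps naturally onto a stack machine such as the x87: each operand becomes a load and each operator an arithmetic instruction on the top of the stack.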
12. X87 code generation
- Directly convert postfix to inline asm.
    VEC!("C+=B*(A+D)")(2213.3, vec1, floatvec, vec2);
    // Postfix: B A D + * C + C =
    L1: fld   double ptr [EAX + 8*ESI];      // B
        fld   double ptr [EAX + 8*ESI];      // A
        fadd  double ptr [EDX + 8*ESI];      // D
        fmulp ST(1), ST;                     // *
        fadd  float ptr [ECX + 4*ESI];       // C
        fxch  ST(1), ST;
        fstp  float ptr [ECX + 4*ESI - 4];   // C =
    L2: inc   ESI;
        jnz   L1;
13. SSE/SSE2 (register-based)
- Can't do mixed-precision operations.
- Unroll loop by 2 or 4, to take advantage of SIMD.
- Instruction scheduling is less critical, but
register allocation is more complicated than for
x87.
14. GPGPU
- Use the GPU in modern video cards to perform massively parallel calculations.
- Uses OpenGL or DirectX calls, instead of inline asm.
- Full of hacks (pretend your data is a texture!) but a rational API should emerge soon.
- This should NOT be built into a compiler!
15. Adding a front end
- Operator overloading
  - Same limitations as before
- Mixins, e.g. mixin(blade("firstvec += secondvec * 2.38")); clumsy syntax, BUT
  - Can detect aliases
  - Allows better error messages
  - Can unroll small loops inline
  - Closer to proposed macro syntax
16. Front end using mixins
1. Lex: first += second * 2.38 → A+=B*C.
2. Determine types, resolve aliases, convert constants to literals.
3. Determine precedence and associativity
4. Perform constant folding
- We can do most of this using mixins
- Compiler help is most required for 4
  - __traits could help
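The lexing step in the list above can be sketched as an ordinary D2 function that the compiler evaluates through CTFE. The `Lexed` struct and the `lex`, `isIdentChar` and `isDigit` helpers are hypothetical names for this illustration; the token rules are deliberately crude (identifiers, numbers with a decimal point, everything else copied through) and at most 26 distinct symbols are supported.

```d
// Result of lexing: the normalised expression plus the text behind each letter.
struct Lexed
{
    string expr;        // e.g. "A+=B*C"
    string[] symbols;   // symbols[0] is A, symbols[1] is B, ...
}

bool isIdentChar(char c)
{
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

bool isDigit(char c)
{
    return c >= '0' && c <= '9';
}

// Replaces every identifier or numeric literal with a letter A..Z.
Lexed lex(string source)
{
    Lexed result;
    size_t i = 0;
    while (i < source.length)
    {
        char c = source[i];
        if (c == ' ')
        {
            ++i;                                  // whitespace is dropped
        }
        else if (isIdentChar(c) || isDigit(c))
        {
            size_t start = i;
            while (i < source.length && (isIdentChar(source[i])
                    || isDigit(source[i]) || source[i] == '.'))
                ++i;
            string token = source[start .. i];

            // A symbol that appears twice gets the same letter both times.
            size_t index = result.symbols.length;
            foreach (k, s; result.symbols)
                if (s == token) { index = k; break; }
            if (index == result.symbols.length)
                result.symbols ~= token;
            assert(index < 26, "too many distinct symbols");
            result.expr ~= cast(char)('A' + index);
        }
        else
        {
            result.expr ~= c;                     // operators, brackets, etc.
            ++i;
        }
    }
    return result;
}

// Both checks run at compile time.
enum lexed = lex("first += second * 2.38");
static assert(lexed.expr == "A+=B*C");
static assert(lexed.symbols == ["first", "second", "2.38"]);
static assert(lex("x = x + y").expr == "A=A+B");
```

Giving a repeated symbol the same letter is what lets later stages notice that an operand appears more than once in the expression.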
17. Determining types
    char[] getSymbolTable(char[][] symbols) {
        char[] result = "[";
        for (int i = 0; i < symbols.length; ++i) {
            if (i > 0) result ~= ",";
            result ~= "[typeof(" ~ symbols[i] ~ ").stringof, "
                    ~ symbols[i] ~ ".stringof]";
        }
        result ~= "]";
        return result;
    }
- When mixed in, this creates an array[2] of string literals.
- [0] is the type, [1] is the value
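A present-day D2 version of the same trick shows what the mixed-in table looks like. This is a sketch that assumes a current D2 compiler; the variable names in the test are made up for illustration.

```d
// Builds source text for an array of [type, symbol] string pairs.
string getSymbolTable(string[] symbols)
{
    string result = "[";
    foreach (i, sym; symbols)
    {
        if (i > 0)
            result ~= ", ";
        result ~= "[typeof(" ~ sym ~ ").stringof, " ~ sym ~ ".stringof]";
    }
    return result ~ "]";
}

// The generated text itself can be checked at compile time.
static assert(getSymbolTable(["x", "y"]) ==
    "[[typeof(x).stringof, x.stringof], [typeof(y).stringof, y.stringof]]");

unittest
{
    double[] firstvec;
    float[] secondvec;

    // The mixin is compiled in this scope, so it can see the local variables.
    enum table = mixin(getSymbolTable(["firstvec", "secondvec", "2.38"]));

    static assert(table[0] == ["double[]", "firstvec"]);
    static assert(table[1] == ["float[]", "secondvec"]);
    static assert(table[2][0] == "double");   // a literal is typed like any expression
}
```

The generator function only sees strings; the types are resolved because the mixin is expanded at the point of use, where the symbols are in scope.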
18. Determining precedence
    class AST(char[] expr) {
        alias expr text;
        AST!("(" ~ text ~ "+" ~ T.text ~ ")") opAdd(T)(T x) { return null; }
        AST!("(" ~ text ~ "*" ~ T.text ~ ")") opMul(T)(T x) { return null; }
        AST!( text ~ "[" ~ T.text ~ "]" ) opIndex(T)(T x) { return null; }
    }

    char[] getPrecedence(char[] expr) {
        char[] code = "typeof(";
        for (int i = 0; i < expr.length; ++i) {
            if (expr[i] >= 'A' && expr[i] <= 'Z')
                code ~= "(cast(AST!(\"" ~ expr[i] ~ "\"))(null))";
            else
                code ~= expr[i];
        }
        return code ~ ").text";
    }

    mixin(getPrecedence("A[B*C+D]")) → "A[((B*C)+D)]"
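The trick of letting the compiler's own parser work out precedence can be reproduced in current D2 with `opBinary`. The sketch below is an adaptation, not the slide's code: it uses a struct named `Node` (a hypothetical name) instead of a class cast from null, and a single templated `opBinary` in place of the separate D1 `opAdd` and `opMul`.

```d
// Node!"text" carries an expression string in its type; it holds no data.
struct Node(string expr)
{
    enum text = expr;

    // Any binary operator yields a fully parenthesised node type.
    auto opBinary(string op, T)(T rhs)
    {
        return Node!("(" ~ text ~ op ~ T.text ~ ")")();
    }

    // Indexing keeps its brackets around the already-grouped index.
    auto opIndex(T)(T index)
    {
        return Node!(text ~ "[" ~ T.text ~ "]")();
    }
}

// Rewrites each letter as a Node value and asks for the type of the result.
string getPrecedence(string expr)
{
    string code = "typeof(";
    foreach (c; expr)
    {
        if (c >= 'A' && c <= 'Z')
            code ~= "Node!\"" ~ c ~ "\"()";
        else
            code ~= c;
    }
    return code ~ ").text";
}

// The compiler applies its normal precedence and associativity rules.
static assert(mixin(getPrecedence("A*B+C*D")) == "((A*B)+(C*D))");
static assert(mixin(getPrecedence("A-B-C")) == "((A-B)-C)");
static assert(mixin(getPrecedence("A[B*C+D]")) == "A[((B*C)+D)]");
```

Nothing is executed at run time: the expression only ever appears inside `typeof`, so the operators exist purely to compute a type whose `text` member spells out the grouping.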
19. Conclusion
- Implementation and syntactic issues remain
  - Syntax for runtime and compile-time reflection
  - Macros, and an extended __traits syntax, should help.
  - How to clean up mixin(), yet retain its power?
- Yet perfectly optimal code is already possible. Libraries can perform optimisations that previously required a compiler back-end.