1. Fast reverse-mode automatic differentiation using expression templates in C++
- Robin Hogan
- University of Reading
2. Overview
- Spaceborne radar and lidar
- Adjoint coding
- Automatic differentiation
- New approach
- Testing with lidar multiple-scattering forward
models
3. Spaceborne radar, lidar and radiometers
- The A-Train (NASA)
  - 700-km orbit
  - CloudSat 94-GHz radar (launch 2006)
  - Calipso 532/1064-nm depolarization lidar
  - MODIS multi-wavelength radiometer
  - CERES broad-band radiometer
  - AMSR-E microwave radiometer
- EarthCARE (ESA/JAXA, launch 2015?)
  - 400-km orbit: more sensitive
  - 94-GHz Doppler radar
  - 355-nm HSRL/depolarization lidar
  - Multispectral imager
  - Broad-band radiometer
  - Heart-warming name
4. What do CloudSat and Calipso see?
- Radar: signal ∝ D⁶; detects the whole profile, and the surface echo provides an integral constraint
- Lidar: signal ∝ D²; more sensitive to thin cirrus and liquid, but attenuated
- The radar-lidar ratio provides the particle size D

[Figure: CloudSat radar and CALIPSO lidar curtains with target classification: insects, aerosol, rain, supercooled liquid cloud, warm liquid cloud, ice and supercooled liquid, ice, clear, no ice/rain but possibly liquid, ground]

Delanoë and Hogan (2008, 2010)
5. Unified retrieval

[Figure: flow chart of the unified retrieval, with components marked as "ingredients developed", "implement previous work" and "not yet developed"]
6. Unified retrieval: forward model

- From the state vector x to the forward-modelled observations H(x)...
- The state vector x contains: ice & snow, liquid cloud, rain, aerosol
- Lookup tables to obtain profiles of extinction, scattering and backscatter coefficients, and asymmetry factor
- Sum the contributions from each constituent
- Radiative transfer models
7. Radiative transfer models
Observation | Model | Speed | Status
Radar reflectivity factor | Multiscatter: single-scattering option | N | OK
Radar reflectivity factor in deep convection | Multiscatter: single scattering plus TDTS multiple-scattering model (Hogan and Battaglia 2008) | N² | OK
Radar Doppler velocity | Single scattering OK if no NUBF; a fast multiple-scattering model with Doppler does not exist | N² | Not available for MS
HSRL lidar in ice and aerosol | Multiscatter: PVC model (Hogan 2008) | N | OK
HSRL lidar in liquid cloud | Multiscatter: PVC plus TDTS models | N² | OK
Lidar depolarization | Multiscatter: under development | N² | In progress
Infrared radiances | Delanoë and Hogan (2008) two-stream source function method | N | No adjoint
Infrared radiances | RTTOV (EUMETSAT license) | N | Disappointing accuracy for clouds
Solar radiances | LIDORT (permissive license) | N | Testing
- After much pain I have a hand-coded adjoint for the multiscatter model (in C), but still need adjoints for all the rest of the algorithm (in C++)
8. Adjoint and Jacobian coding
- Variational retrieval methods are posed as: find the vector x that minimises the cost function J(x)
- Two common minimization methods:
  - The quasi-Newton method requires adjoint code to compute the gradient ∂J/∂x for any x
  - The Gauss-Newton method writes the observational part of the cost function as the sum of squared deviations of the observations from their forward-modelled counterparts y, and requires code to compute the Jacobian matrix H = ∂y/∂x
- Since J(x) is complicated (containing all of our radiative transfer models), the code to generate ∂J/∂x or ∂y/∂x is even more complicated
- Can it be generated automatically?
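For reference, a cost function of this type in its standard variational form, assuming Gaussian error statistics with prior state x_a, prior covariance B and observation-error covariance R (notation assumed here, not given in the slides), is

    J(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x}-\mathbf{x}_a)^T \mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_a)
                  + \tfrac{1}{2}\,[\mathbf{y}-H(\mathbf{x})]^T \mathbf{R}^{-1}\,[\mathbf{y}-H(\mathbf{x})]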
9. Approaches to adjoint coding
- Do it by hand (e.g. ECMWF)
  - Painful and time-consuming to debug
  - Generates the most efficient code
- Do it numerically: perturb each element of x one by one
  - Inefficient and infeasible for large x
  - Subject to round-off error
  - What I'm using at the moment with the Unified Algorithm
- Automatic differentiation 1: use a source-to-source compiler
  - E.g. TAPENADE/TAF/TAC generate an adjoint source file from the algorithm file; this produces quite efficient code
  - Commercial: 5k/year for a TAF/TAC academic license, and permission is needed to distribute the generated source code
  - TAPENADE requires you to upload your file to a server
  - Limited support for C++ classes and no support for C++ templates
- Automatic differentiation 2: use an operator-overloading technique
  - E.g. CppAD, ADOL-C; in principle these can work with any language features
  - Typically 25 times slower than a hand-coded adjoint!
  - Can we do better?
10. Simple example
- Consider a simple algorithm y(x0, x1), contrived for didactic purposes
- Implemented in C or Fortran 90 as below
- Task: given ∂J/∂y, we want to compute ∂J/∂x0 and ∂J/∂x1

    function algorithm(x) result(y)
      implicit none
      real, intent(in) :: x(2)
      real :: y
      real :: s
      y = 4.0
      s = 2.0*x(1) + 3.0*x(2)*x(2)
      y = y * sin(s)
      return
    end function

    double algorithm(const double x[2]) {
      double y = 4.0;
      double s = 2.0*x[0] + 3.0*x[1]*x[1];
      y *= sin(s);
      return y;
    }
11. Creating the adjoint code 1
- Differentiate the algorithm
- Write each statement in matrix form
- Transpose the matrix to get the equivalent adjoint statement
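Sketched for the example algorithm of the previous slide, the differential statements are

    \delta s = 2.0\,\delta x_0 + 6.0\,x_1\,\delta x_1
    \delta y = \sin(s)\,\delta y + y\cos(s)\,\delta s

The second statement in matrix form, and its transpose giving the equivalent adjoint statement, are

    \begin{pmatrix} \delta y \\ \delta s \end{pmatrix} =
    \begin{pmatrix} \sin s & y\cos s \\ 0 & 1 \end{pmatrix}
    \begin{pmatrix} \delta y \\ \delta s \end{pmatrix}
    \qquad\Rightarrow\qquad
    \begin{pmatrix} y_{AD} \\ s_{AD} \end{pmatrix} =
    \begin{pmatrix} \sin s & 0 \\ y\cos s & 1 \end{pmatrix}
    \begin{pmatrix} y_{AD} \\ s_{AD} \end{pmatrix}

i.e. s_AD += y cos(s) y_AD followed by y_AD = sin(s) y_AD, where y is its value before the statement.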
12. Creating the adjoint code 2
- Apply adjoint statements in reverse order

    double algorithm_AD(const double x[2], double y_AD[1], double x_AD[2]) {
      /* Forward pass */
      double y = 4.0;
      double s = 2.0*x[0] + 3.0*x[1]*x[1];
      double y1 = y;  /* store the pre-update value of y for the reverse pass */
      y *= sin(s);
      /* Adjoint part */
      double s_AD = 0.0;
      s_AD += y1*cos(s)*y_AD[0];  /* adjoint of y *= sin(s) */
      y_AD[0] *= sin(s);
      x_AD[0] += 2.0*s_AD;        /* adjoint of s = 2.0*x[0] + 3.0*x[1]*x[1] */
      x_AD[1] += 6.0*x[1]*s_AD;
      s_AD = 0.0;
      y_AD[0] = 0.0;              /* adjoint of y = 4.0 */
      return y;
    }

Note: we need to store intermediate values for the reverse pass. Hand-coding is time-consuming and error-prone for large codes.
13. Automatic differentiation
- We want something like this (now in C++):
- Operators (e.g. *, /) and functions (e.g. sin, exp, log) applied to adouble objects are overloaded not only to return the result of the operation, but also to store the gradient information in a stack
- The libraries CppAD, SACADO and ADOL-C do this, but the result is around 25 times slower than a hand-coded adjoint. Why?

    adouble algorithm(const adouble x[2]) {
      adouble y = 4.0;
      adouble s = 2.0*x[0] + 3.0*x[1]*x[1];
      y *= sin(s);
      return y;
    }

    // Main code
    Stack stack;                    // Object where info will be stored
    adouble x[2] = {..., ...};      // Set algorithm inputs
    adouble y = algorithm(x);       // Run algorithm and store info in stack
    y.set_gradient(y_AD);           // Set dJ/dy
    stack.reverse();                // Run adjoint code from stored info
    x_AD[0] = x[0].get_gradient();  // Save resulting values of dJ/dx0
    x_AD[1] = x[1].get_gradient();  // ... and dJ/dx1
14. Minimum necessary storage
- What is the minimum necessary storage for these statements?
- If we label each gradient with an integer (since the gradients are unknown in the forward pass), then we need two stacks that can be added to as the algorithm progresses
- We can then run backwards through the stacks to compute the adjoints

Statement stack:

Index to LHS gradient | Index to first operation
2 (dy) | 0
3 (ds) | 0
2 (dy) | 2

Operation stack:

Entry | Multiplier | Index to RHS gradient
0 | 2.0 | 0 (dx0)
1 | 6.0*x1 | 1 (dx1)
2 | sin(s) | 2 (dy)
3 | y*cos(s) | 3 (ds)
15. Adjoint algorithm is simple
- Need to cope with three different types of differential statement, converting each forward-mode statement into its equivalent adjoint statements
- The general differential statement is δy = a₀ δx₀ + a₁ δx₁ + ... + aₙ δxₙ; its adjoint equivalent is x_i_AD += a_i y_AD for i = 0 to n, followed by y_AD = 0
16. ...which can be coded as follows
- This does the right thing in our three cases (see the sketch below):
  - Zero on the RHS
  - One or more gradients on the RHS
  - The same gradient on the LHS and RHS
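A minimal sketch of such a reverse pass, assuming the two-stack layout of the previous slide (the type and member names are illustrative, not Adept's actual internals):

    #include <vector>

    struct Statement {
      int lhs_index;  // index of the gradient on the LHS
      int first_op;   // index of first entry in the operation stack
    };
    struct Operation {
      double multiplier;  // e.g. sin(s)
      int rhs_index;      // index of the gradient it multiplies
    };

    // Run backwards through the stacks, accumulating adjoints in
    // "gradient". stmt must end with a sentinel whose first_op equals
    // the total number of operations.
    void reverse(const std::vector<Statement>& stmt,
                 const std::vector<Operation>& op,
                 std::vector<double>& gradient) {
      for (int i = static_cast<int>(stmt.size()) - 2; i >= 0; --i) {
        double a = gradient[stmt[i].lhs_index];
        gradient[stmt[i].lhs_index] = 0.0;  // handles "zero on the RHS"
        if (a != 0.0) {
          // Distribute a to each RHS gradient; if the LHS gradient also
          // appears on the RHS, it is correctly re-incremented after
          // being zeroed above
          for (int j = stmt[i].first_op; j < stmt[i + 1].first_op; ++j) {
            gradient[op[j].rhs_index] += op[j].multiplier * a;
          }
        }
      }
    }

Applied to the stacks on slide 14, this loop reproduces exactly the adjoint statements of the hand-coded version on slide 12.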
17. Dual-numbers approach
- How can these stacks be created?
- Consider what happens when the compiler sees this line:

    y = y * sin(s);

- The compiler splits this up into two parts with a temporary t:

    adouble t = sin(s);
    y = operator*(y, t);

- We could define adouble as a dual number {x, dx} (invented by Clifford 1873) and then overload sin and operator*:

    sin({s, ds}) = {sin(s), cos(s) ds}
    {y, dy} * {t, dt} = {y t, t dy + y dt}

- This would correctly apply the chain rule, but only if the gradient terms on the right-hand side are known!
- This is not useful for the reverse (adjoint) mode, when we want to store a symbolic representation of the gradient on the forward sweep, which is then filled in on the reverse sweep
- Dual numbers are used in some forward-mode-only (tangent-linear) automatic differentiation tools
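A minimal self-contained sketch of this dual-number (forward-mode) idea; the ddouble type here is illustrative and, for the reason just given, is not how Adept works:

    #include <cmath>
    #include <iostream>

    // A dual number: a value and its derivative with respect to the input
    struct ddouble {
      double x;   // value
      double dx;  // derivative
    };

    // Overloads propagate known gradients via the chain rule
    inline ddouble sin(const ddouble& a) {
      return { std::sin(a.x), std::cos(a.x) * a.dx };
    }
    inline ddouble operator*(const ddouble& a, const ddouble& b) {
      return { a.x * b.x, a.x * b.dx + a.dx * b.x };
    }

    int main() {
      ddouble y = {4.0, 0.0};  // constant: zero derivative
      ddouble s = {0.5, 1.0};  // differentiate with respect to s
      y = y * sin(s);
      // Prints the value y and the tangent-linear derivative dy/ds
      std::cout << y.x << " " << y.dx << "\n";
      return 0;
    }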
18. So how do CppAD and ADOL-C work?
- In the forward pass they store the whole algorithm symbolically, not just its derivative form!
- This means every operator and function needs to be stored symbolically (e.g. 0 for plus, 1 for minus, 42 for atan, etc.)
- The stored algorithm can then be analysed to generate an adjoint function
- This all happens behind the scenes, so it is easy to use, but it is not surprising that it is 25 times slower than a hand-coded adjoint
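Schematically, such a symbolically stored algorithm might look like the following (an illustrative layout, not the actual CppAD or ADOL-C tape format):

    #include <vector>

    // Each operator/function is encoded as an integer opcode
    enum Opcode { ADD = 0, SUB = 1, MUL = 2, SIN = 3 /* ..., ATAN = 42 */ };

    struct TapeEntry {
      Opcode op;      // which operation was applied
      int    arg[2];  // tape indices of its operands
    };

    // The adjoint is obtained by interpreting this tape in reverse,
    // which is far slower than executing inlined compiled code
    std::vector<TapeEntry> tape;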
19. Computational graphs
- The basic problem is that standard operator overloading can only pass information from the most nested operation outwards

[Figure: expression tree for y = y * sin(s); the sin node passes the value of sin(s) up to operator*, which passes y * sin(s) up to become the new y]
20. Implementing the chain rule
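The idea developed in words on the next slide can be written compactly: in reverse mode, a node representing y(a, b) receives a weight w = ∂J/∂y from the expression above it and passes weighted derivatives down to its arguments, so that the adjoints accumulate as

    a_{AD} \mathrel{+}= w\,\frac{\partial y}{\partial a}, \qquad
    b_{AD} \mathrel{+}= w\,\frac{\partial y}{\partial b}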
21. Computational graph 2
- Clearly differentiation most naturally involves passing information in the opposite sense:
  - Each node representing an arbitrary function or operator y(a) needs to be able to take a real number w and pass w·dy/da down the chain
  - A binary function or operator y(a, b) would pass w·dy/da to one argument and w·dy/db to the other
  - At the end of the chain, store the result on the stack
- But how do we implement this?
22. What is a template?
- Templates are a key ingredient of generic programming in C++
- Imagine we have a function like this:

    double cube(const double x) {
      double y = x*x*x;
      return y;
    }

- We want it to work with any numerical type (single precision, complex numbers, etc.), but we don't want to laboriously define a new overloaded function for each possible type
- We can use a function template:

    template <typename Type>
    Type cube(Type x) {
      Type y = x*x*x;
      return y;
    }

    double a = 1.0;
    double b = cube(a);            // compiler creates function cube<double>
    complex<double> c(1.0, 2.0);   // c = 1 + 2i
    complex<double> d = cube(c);   // compiler creates function cube<complex<double> >
23. What is an expression template?
- C++ also supports class templates
- Veldhuizen (1995) used this feature to introduce the idea of expression templates, to optimize array operations and make C++ as fast as Fortran-90 for array-wise operations
- We use it as a way to pass information in both directions through the expression tree:
  - sin(A) for an argument of arbitrary type A is overloaded to return an object of type Sin<A>
  - operator*(A, B) for arguments of arbitrary types A and B is overloaded to return an object of type Multiply<A,B>
- Now when we compile the statement y = y*sin(x):
  - The right-hand side resolves to an object RHS of type Multiply<adouble, Sin<adouble> >
  - The overloaded assignment operator first calls RHS.value() to get y
  - It then calls RHS.calc_gradient(), to add entries to the operation stack
- Multiply and Sin are defined with member functions so that they can correctly pass information up and down the expression tree
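A sketch of how that assignment might drive the recording (illustrative; the actual member names in Adept may differ):

    // Assigning an arbitrary expression to an adouble triggers recording
    template <class A>
    inline adouble& adouble::operator=(const Expression<A>& rhs) {
      value_ = rhs.value();           // first get the numerical result
      rhs.calc_gradient(stack, 1.0);  // then push (multiplier, index) pairs
                                      //   onto the operation stack
      stack.push_statement(gradient_index_);  // record the LHS statement
      return *this;
    }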
24. New approach
- The following types are passed up the chain at compile time:

[Figure: expression tree for y = y * sin(s); the adouble s becomes Sin<adouble> at the sin node, which combines with the adouble y at operator* to give Multiply<adouble, Sin<adouble> >]
25. Implementation of Sin<A>
    // Definition of the Sin class
    template <class A>
    class Sin : public Expression<Sin<A> > {
    public:
      // Member functions
      // Constructor: store reference to a and its numerical value
      Sin(const Expression<A>& a)
        : a_(a), a_value_(a.value()) { }
      // Return the value
      double value() const {
        return sin(a_value_);
      }
      // Compute derivative and pass to a
      void calc_gradient(Stack& stack, double multiplier) const {
        a_.calc_gradient(stack, cos(a_value_)*multiplier);
      }
    private:
      // Data members
      const A& a_;      // A reference to the object
      double a_value_;  // The numerical value of the object
    };

    // Overload the sin function: it returns a Sin<A> object
    template <class A>
    inline Sin<A> sin(const Expression<A>& a) {
      return Sin<A>(a);
    }
- The Adept library has done this for all operators and functions
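For a binary operator the same pattern passes a weighted derivative to each argument; a sketch in the style of the Sin<A> listing above (member names illustrative):

    // Definition of the Multiply class for a*b
    template <class A, class B>
    class Multiply : public Expression<Multiply<A,B> > {
    public:
      Multiply(const Expression<A>& a, const Expression<B>& b)
        : a_(a), b_(b), a_value_(a.value()), b_value_(b.value()) { }
      double value() const {
        return a_value_ * b_value_;
      }
      // d(a*b)/da = b and d(a*b)/db = a, each weighted by multiplier
      void calc_gradient(Stack& stack, double multiplier) const {
        a_.calc_gradient(stack, b_value_*multiplier);
        b_.calc_gradient(stack, a_value_*multiplier);
      }
    private:
      const A& a_;
      const B& b_;
      double a_value_, b_value_;
    };

    // Overload operator*: it returns a Multiply<A,B> object
    template <class A, class B>
    inline Multiply<A,B> operator*(const Expression<A>& a,
                                   const Expression<B>& b) {
      return Multiply<A,B>(a, b);
    }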
26. Optimizations
- Why are expression templates fast?
  - Compound types representing complex expressions are known at compile time
  - C++ automatically inlines the function calls between the objects in an expression, leaving little more than the operations you would put in a hand-coded application of the chain rule
- Further optimizations:
  - The Stack object keeps memory allocated between calls, to avoid time spent allocating incrementally more memory
  - If the Jacobian is computed, it is done in strips to exploit vectorization (SSE/SSE2 on Intel) and loop unrolling
  - The current stack is accessed via a global but thread-local variable, rather than storing a link to the stack in every adouble object (as in CppAD and ADOL-C)
27. Testing using lidar multiple-scattering models
- Photon Variance-Covariance (PVC) method for small-angle multiple scattering
  - Hogan (JAS 2008)
  - Somewhat similar to a monochromatic radiance model
  - Four coupled ODEs are integrated forward in space
  - Several variables at N gates give N output signals
  - Computational cost proportional to N
- Time-Dependent Two-Stream (TDTS) method for wide-angle multiple scattering
  - Hogan and Battaglia (JAS 2008)
  - Similar to a time-dependent 1D advection model
  - Four coupled PDEs are integrated forward in time
  - Several variables at N gates give N output signals
  - Computational cost proportional to N²
28. Simulation of 3D photon transport
- Animation of scalar flux (I⁺ + I⁻)
  - Colour scale is logarithmic
  - Represents 5 orders of magnitude
- Domain properties:
  - 500 m thick
  - 2 km wide
  - Optical depth of 20
  - No absorption
- In this simulation the lateral distribution is Gaussian at each height and each time
29. Benchmark results
- Time relative to the original code; gcc-4.4, Pentium 2.5 GHz, 2-MB cache

Adjoint | PVC, N=50 | TDTS, N=50
Hand-coded adjoint | 3.0 (1.0+2.0) | 3.6 (1.0+2.6)
New C++ library "Adept" | 3.5 (2.7+0.8) | 3.8 (2.6+1.2)
ADOL-C | 25 (18+7) | 20 (15+5)
CppAD | 29 (15+7+7) | 34 (17+8+9)
30. Outlook
- The new library Adept (Automatic Differentiation using Expression Templates) produces an adjoint with minimum difficulty for the user
  - No knowledge of templates is required of the user at all
  - Simple and efficient to compute the Jacobian matrix as well
  - Freely available at http://www.met.reading.ac.uk/clouds/adept/
  - Typically 5-20% slower than hand-coded adjoints
  - But immeasurably faster in terms of programmer time
  - The code is complete for applying to any C++ code with real numbers
- Further development desirable:
  - Complex numbers
  - Use within C++ matrix/vector libraries, particularly those that already use expression templates (like the one I use for the Unified Algorithm)
  - Easily facilitate checkpointing so that large codes don't exhaust memory
  - Automatically compute higher-order derivatives (e.g. the Hessian matrix)
- Potential for student projects to get small data assimilation systems up and running, and efficient, quickly
- Impossible to apply in Fortran: no template capability!
32. Minimizing the cost function
- The gradient of the cost function (a vector) and its 2nd derivative (the Hessian matrix) are both needed; see the sketch below
- Gradient-descent methods:
  - A fast adjoint method to calculate ∇xJ means we don't need to calculate the Jacobian
  - Disadvantage: more iterations are needed, since we don't know the curvature of J(x)
  - A quasi-Newton method to get the search direction (e.g. L-BFGS, used by ECMWF) builds up an approximate inverse Hessian A for improved convergence
  - Scales well for large x
  - Poorer estimate of the error at the end
- Gauss-Newton method:
  - Rapid convergence (instant for linear problems)
  - Get the solution error covariance for free at the end
  - Levenberg-Marquardt is a small modification to ensure convergence
  - Needs the Jacobian matrix H of every forward model; this can be expensive for larger problems, as the forward model may need to be rerun with each element of the state vector perturbed
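In the standard Gaussian form sketched after slide 8 (same assumed notation, with Jacobian \mathbf{H} = ∂y/∂x), the gradient and the Gauss-Newton approximation to the Hessian are

    \nabla_{\mathbf{x}} J = \mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_a) - \mathbf{H}^T \mathbf{R}^{-1}\,[\mathbf{y}-H(\mathbf{x})]
    \nabla^2_{\mathbf{x}} J \approx \mathbf{B}^{-1} + \mathbf{H}^T \mathbf{R}^{-1} \mathbf{H}

where the second expression is valid when H(x) is nearly linear.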
33. Time-dependent two-stream approximation
- Describe the diffuse flux in terms of an outgoing stream I⁺ and an incoming stream I⁻, and numerically integrate the coupled PDEs sketched below
- These can be discretized quite simply in time and space (no implicit methods or matrix inversion required)
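In schematic form, assuming the standard structure of the time-dependent two-stream equations (the coefficients α, γ and S± here are placeholders for the definitions given in Hogan and Battaglia 2008):

    \frac{1}{c}\frac{\partial I^+}{\partial t} + \frac{\partial I^+}{\partial r} = -(\alpha+\gamma)\,I^+ + \gamma\,I^- + S^+
    \frac{1}{c}\frac{\partial I^-}{\partial t} - \frac{\partial I^-}{\partial r} = -(\alpha+\gamma)\,I^- + \gamma\,I^+ + S^-

Each labelled term below corresponds to one part of these equations.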
- Time derivative: remove this and we have the time-independent two-stream approximation
- Spatial derivative: transport of radiation from upstream
- Loss by absorption or scattering: some of the lost radiation will enter the other stream
- Gain by scattering: radiation scattered from the other stream
- Source: scattering from the quasi-direct beam into each of the streams

Hogan and Battaglia (2008, J. Atmos. Sci.)