1. Fast reverse-mode automatic differentiation using expression templates in C++
- Robin Hogan
- University of Reading
2. Overview
- Spaceborne radar and lidar
- Adjoint coding
- Automatic differentiation
- New approach
- Testing with lidar multiple-scattering forward
models
3. Spaceborne radar, lidar and radiometers
- The A-Train (NASA)
  - 700-km orbit
  - CloudSat 94-GHz radar (launch 2006)
  - Calipso 532/1064-nm depolarization lidar
  - MODIS multi-wavelength radiometer
  - CERES broad-band radiometer
  - AMSR-E microwave radiometer
- EarthCARE (ESA/JAXA, launch 2015?)
  - 400-km orbit: more sensitive
  - 94-GHz Doppler radar
  - 355-nm HSRL/depolarization lidar
  - Multispectral imager
  - Broad-band radiometer
  - Heart-warming name
4. What do CloudSat and Calipso see?
- Radar: signal ∝ D⁶; detects the whole profile, and the surface echo provides an integral constraint
- Lidar: signal ∝ D²; more sensitive to thin cirrus and liquid, but attenuated
- The radar-lidar ratio provides the particle size D

[Figure: CloudSat radar and CALIPSO lidar curtains with target classification: insects, aerosol, rain, supercooled liquid cloud, warm liquid cloud, ice and supercooled liquid, ice, clear, no ice/rain but possibly liquid, ground]

Delanoë and Hogan (2008, 2010)
5. Unified retrieval

[Figure: flow chart of the unified retrieval, with components marked as "ingredients developed", "implement previous work" and "not yet developed"]
6. Unified retrieval: forward model

- From the state vector x to the forward-modelled observations H(x)...
- The state vector x contains: ice & snow, liquid cloud, rain, aerosol
- Lookup tables to obtain profiles of extinction, scattering and backscatter coefficients, and asymmetry factor
- Sum the contributions from each constituent
- Radiative transfer models
7. Radiative transfer models
Observation | Model | Speed | Status
Radar reflectivity factor | Multiscatter: single-scattering option | N | OK
Radar reflectivity factor in deep convection | Multiscatter: single scattering plus TDTS multiple-scattering model (Hogan and Battaglia 2008) | N² | OK
Radar Doppler velocity | Single scattering OK if no NUBF; a fast multiple-scattering model with Doppler does not exist | N² | Not available for MS
HSRL lidar in ice and aerosol | Multiscatter: PVC model (Hogan 2008) | N | OK
HSRL lidar in liquid cloud | Multiscatter: PVC plus TDTS models | N² | OK
Lidar depolarization | Multiscatter: under development | N² | In progress
Infrared radiances | Delanoë and Hogan (2008) two-stream source function method | N | No adjoint
Infrared radiances | RTTOV (EUMETSAT license) | N | Disappointing accuracy for clouds
Solar radiances | LIDORT (permissive license) | N | Testing
- After much pain I have a hand-coded adjoint for the multiscatter model (in C), but still need adjoints for all the rest of the algorithm (in C++)
8. Adjoint and Jacobian coding
- Variational retrieval methods are posed as: find the vector x that minimises the cost function J(x)
- Two common minimization methods:
  - The quasi-Newton method requires adjoint code to compute the gradient ∂J/∂x for any x
  - The Gauss-Newton method writes the observational part of the cost function as the sum of squared deviations of the observations from their forward-modelled counterparts y, and requires code to compute the Jacobian matrix H = ∂y/∂x
- Since J(x) is complicated (containing all of our radiative transfer models), the code to generate ∂J/∂x or ∂y/∂x is even more complicated
- Can it be generated automatically?
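For reference, a cost function of this type in its standard variational form, assuming Gaussian error statistics with prior state x_a, prior covariance B and observation-error covariance R (notation assumed here, not given in the slides), is

    J(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x}-\mathbf{x}_a)^T \mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_a)
                  + \tfrac{1}{2}\,[\mathbf{y}-H(\mathbf{x})]^T \mathbf{R}^{-1}\,[\mathbf{y}-H(\mathbf{x})]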
9. Approaches to adjoint coding
- Do it by hand (e.g. ECMWF)
  - Painful and time-consuming to debug
  - Generates the most efficient code
- Do it numerically: perturb each element of x one by one
  - Inefficient and infeasible for large x
  - Subject to round-off error
  - What I'm using at the moment with the Unified Algorithm
- Automatic differentiation 1: use a source-to-source compiler
  - E.g. TAPENADE/TAF/TAC generate an adjoint source file from the algorithm file; this produces quite efficient code
  - Commercial: 5k/year for a TAF/TAC academic license, and permission is needed to distribute the generated source code
  - TAPENADE requires you to upload your file to a server
  - Limited support for C++ classes and no support for C++ templates
- Automatic differentiation 2: use an operator-overloading technique
  - E.g. CppAD, ADOL-C; in principle these can work with any language features
  - Typically 25 times slower than a hand-coded adjoint!
  - Can we do better?
10. Simple example
- Consider a simple algorithm y(x0, x1), contrived for didactic purposes
- Implemented in C or Fortran 90 as below
- Task: given ∂J/∂y, we want to compute ∂J/∂x0 and ∂J/∂x1

    function algorithm(x) result(y)
      implicit none
      real, intent(in) :: x(2)
      real :: y
      real :: s
      y = 4.0
      s = 2.0*x(1) + 3.0*x(2)*x(2)
      y = y * sin(s)
      return
    end function

    double algorithm(const double x[2]) {
      double y = 4.0;
      double s = 2.0*x[0] + 3.0*x[1]*x[1];
      y *= sin(s);
      return y;
    }
11. Creating the adjoint code 1
- Differentiate the algorithm
- Write each statement in matrix form
- Transpose the matrix to get the equivalent adjoint statement
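Sketched for the example algorithm of the previous slide, the differential statements are

    \delta s = 2.0\,\delta x_0 + 6.0\,x_1\,\delta x_1
    \delta y = \sin(s)\,\delta y + y\cos(s)\,\delta s

The second statement in matrix form, and its transpose giving the equivalent adjoint statement, are

    \begin{pmatrix} \delta y \\ \delta s \end{pmatrix} =
    \begin{pmatrix} \sin s & y\cos s \\ 0 & 1 \end{pmatrix}
    \begin{pmatrix} \delta y \\ \delta s \end{pmatrix}
    \qquad\Rightarrow\qquad
    \begin{pmatrix} y_{AD} \\ s_{AD} \end{pmatrix} =
    \begin{pmatrix} \sin s & 0 \\ y\cos s & 1 \end{pmatrix}
    \begin{pmatrix} y_{AD} \\ s_{AD} \end{pmatrix}

i.e. s_AD += y cos(s) y_AD followed by y_AD = sin(s) y_AD, where y is its value before the statement.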
12. Creating the adjoint code 2
- Apply adjoint statements in reverse order

    double algorithm_AD(const double x[2], double y_AD[1], double x_AD[2]) {
      /* Forward pass */
      double y = 4.0;
      double s = 2.0*x[0] + 3.0*x[1]*x[1];
      double y1 = y;  /* store the pre-update value of y for the reverse pass */
      y *= sin(s);
      /* Adjoint part */
      double s_AD = 0.0;
      s_AD += y1*cos(s)*y_AD[0];  /* adjoint of y *= sin(s) */
      y_AD[0] *= sin(s);
      x_AD[0] += 2.0*s_AD;        /* adjoint of s = 2.0*x[0] + 3.0*x[1]*x[1] */
      x_AD[1] += 6.0*x[1]*s_AD;
      s_AD = 0.0;
      y_AD[0] = 0.0;              /* adjoint of y = 4.0 */
      return y;
    }

Note: we need to store intermediate values for the reverse pass. Hand-coding is time-consuming and error-prone for large codes.
13. Automatic differentiation
- We want something like this (now in C++):
- Operators (e.g. *, /) and functions (e.g. sin, exp, log) applied to adouble objects are overloaded not only to return the result of the operation, but also to store the gradient information in a stack
- The libraries CppAD, SACADO and ADOL-C do this, but the result is around 25 times slower than a hand-coded adjoint. Why?

    adouble algorithm(const adouble x[2]) {
      adouble y = 4.0;
      adouble s = 2.0*x[0] + 3.0*x[1]*x[1];
      y *= sin(s);
      return y;
    }

    // Main code
    Stack stack;                    // Object where info will be stored
    adouble x[2] = {..., ...};      // Set algorithm inputs
    adouble y = algorithm(x);       // Run algorithm and store info in stack
    y.set_gradient(y_AD);           // Set dJ/dy
    stack.reverse();                // Run adjoint code from stored info
    x_AD[0] = x[0].get_gradient();  // Save resulting values of dJ/dx0
    x_AD[1] = x[1].get_gradient();  // ... and dJ/dx1
14. Minimum necessary storage
- What is the minimum necessary storage for these statements?
- If we label each gradient with an integer (since the gradients are unknown in the forward pass), then we need two stacks that can be added to as the algorithm progresses
- We can then run backwards through the stacks to compute the adjoints

Statement stack:

Index to LHS gradient | Index to first operation
2 (dy) | 0
3 (ds) | 0
2 (dy) | 2

Operation stack:

Entry | Multiplier | Index to RHS gradient
0 | 2.0 | 0 (dx0)
1 | 6.0*x1 | 1 (dx1)
2 | sin(s) | 2 (dy)
3 | y*cos(s) | 3 (ds)
15. Adjoint algorithm is simple
- Need to cope with three different types of differential statement, converting each forward-mode statement into its equivalent adjoint statements
- The general differential statement is δy = a₀ δx₀ + a₁ δx₁ + ... + aₙ δxₙ; its adjoint equivalent is x_i_AD += a_i y_AD for i = 0 to n, followed by y_AD = 0
16. ...which can be coded as follows
- This does the right thing in our three cases (see the sketch below):
  - Zero on the RHS
  - One or more gradients on the RHS
  - The same gradient on the LHS and RHS
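A minimal sketch of such a reverse pass, assuming the two-stack layout of the previous slide (the type and member names are illustrative, not Adept's actual internals):

    #include <vector>

    struct Statement {
      int lhs_index;  // index of the gradient on the LHS
      int first_op;   // index of first entry in the operation stack
    };
    struct Operation {
      double multiplier;  // e.g. sin(s)
      int rhs_index;      // index of the gradient it multiplies
    };

    // Run backwards through the stacks, accumulating adjoints in
    // "gradient". stmt must end with a sentinel whose first_op equals
    // the total number of operations.
    void reverse(const std::vector<Statement>& stmt,
                 const std::vector<Operation>& op,
                 std::vector<double>& gradient) {
      for (int i = static_cast<int>(stmt.size()) - 2; i >= 0; --i) {
        double a = gradient[stmt[i].lhs_index];
        gradient[stmt[i].lhs_index] = 0.0;  // handles "zero on the RHS"
        if (a != 0.0) {
          // Distribute a to each RHS gradient; if the LHS gradient also
          // appears on the RHS, it is correctly re-incremented after
          // being zeroed above
          for (int j = stmt[i].first_op; j < stmt[i + 1].first_op; ++j) {
            gradient[op[j].rhs_index] += op[j].multiplier * a;
          }
        }
      }
    }

Applied to the stacks on slide 14, this loop reproduces exactly the adjoint statements of the hand-coded version on slide 12.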
17. Dual-numbers approach
- How can these stacks be created?
- Consider what happens when the compiler sees this line:

    y = y * sin(s);

- The compiler splits this up into two parts with a temporary t:

    adouble t = sin(s);
    y = operator*(y, t);

- We could define adouble as a dual number {x, dx} (invented by Clifford 1873) and then overload sin and operator*:

    sin({s, ds}) = {sin(s), cos(s) ds}
    {y, dy} * {t, dt} = {y t, t dy + y dt}

- This would correctly apply the chain rule, but only if the gradient terms on the right-hand side are known!
- This is not useful for the reverse (adjoint) mode, when we want to store a symbolic representation of the gradient on the forward sweep, which is then filled in on the reverse sweep
- Dual numbers are used in some forward-mode-only (tangent-linear) automatic differentiation tools
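A minimal self-contained sketch of this dual-number (forward-mode) idea; the ddouble type here is illustrative and, for the reason just given, is not how Adept works:

    #include <cmath>
    #include <iostream>

    // A dual number: a value and its derivative with respect to the input
    struct ddouble {
      double x;   // value
      double dx;  // derivative
    };

    // Overloads propagate known gradients via the chain rule
    inline ddouble sin(const ddouble& a) {
      return { std::sin(a.x), std::cos(a.x) * a.dx };
    }
    inline ddouble operator*(const ddouble& a, const ddouble& b) {
      return { a.x * b.x, a.x * b.dx + a.dx * b.x };
    }

    int main() {
      ddouble y = {4.0, 0.0};  // constant: zero derivative
      ddouble s = {0.5, 1.0};  // differentiate with respect to s
      y = y * sin(s);
      // Prints the value y and the tangent-linear derivative dy/ds
      std::cout << y.x << " " << y.dx << "\n";
      return 0;
    }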
18. So how do CppAD and ADOL-C work?
- In the forward pass they store the whole algorithm symbolically, not just its derivative form!
- This means every operator and function needs to be stored symbolically (e.g. 0 for plus, 1 for minus, 42 for atan, etc.)
- The stored algorithm can then be analysed to generate an adjoint function
- This all happens behind the scenes, so it is easy to use, but it is not surprising that it is 25 times slower than a hand-coded adjoint
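Schematically, such a symbolically stored algorithm might look like the following (an illustrative layout, not the actual CppAD or ADOL-C tape format):

    #include <vector>

    // Each operator/function is encoded as an integer opcode
    enum Opcode { ADD = 0, SUB = 1, MUL = 2, SIN = 3 /* ..., ATAN = 42 */ };

    struct TapeEntry {
      Opcode op;      // which operation was applied
      int    arg[2];  // tape indices of its operands
    };

    // The adjoint is obtained by interpreting this tape in reverse,
    // which is far slower than executing inlined compiled code
    std::vector<TapeEntry> tape;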
19. Computational graphs
- The basic problem is that standard operator overloading can only pass information from the most nested operation outwards

[Figure: expression tree for y = y * sin(s); the sin node passes the value of sin(s) up to operator*, which passes y * sin(s) up to become the new y]
20. Implementing the chain rule
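The idea developed in words on the next slide can be written compactly: in reverse mode, a node representing y(a, b) receives a weight w = ∂J/∂y from the expression above it and passes weighted derivatives down to its arguments, so that the adjoints accumulate as

    a_{AD} \mathrel{+}= w\,\frac{\partial y}{\partial a}, \qquad
    b_{AD} \mathrel{+}= w\,\frac{\partial y}{\partial b}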
21. Computational graph 2
- Clearly differentiation most naturally involves passing information in the opposite sense:
  - Each node representing an arbitrary function or operator y(a) needs to be able to take a real number w and pass w·dy/da down the chain
  - A binary function or operator y(a, b) would pass w·dy/da to one argument and w·dy/db to the other
  - At the end of the chain, store the result on the stack
- But how do we implement this?
22. What is a template?
- Templates are a key ingredient of generic programming in C++
- Imagine we have a function like this:

    double cube(const double x) {
      double y = x*x*x;
      return y;
    }

- We want it to work with any numerical type (single precision, complex numbers, etc.), but we don't want to laboriously define a new overloaded function for each possible type
- We can use a function template:

    template <typename Type>
    Type cube(Type x) {
      Type y = x*x*x;
      return y;
    }

    double a = 1.0;
    double b = cube(a);            // compiler creates function cube<double>
    complex<double> c(1.0, 2.0);   // c = 1 + 2i
    complex<double> d = cube(c);   // compiler creates function cube<complex<double> >
23. What is an expression template?
- C++ also supports class templates
- Veldhuizen (1995) used this feature to introduce the idea of expression templates, to optimize array operations and make C++ as fast as Fortran-90 for array-wise operations
- We use it as a way to pass information in both directions through the expression tree:
  - sin(A) for an argument of arbitrary type A is overloaded to return an object of type Sin<A>
  - operator*(A, B) for arguments of arbitrary types A and B is overloaded to return an object of type Multiply<A,B>
- Now when we compile the statement y = y*sin(x):
  - The right-hand side resolves to an object RHS of type Multiply<adouble, Sin<adouble> >
  - The overloaded assignment operator first calls RHS.value() to get y
  - It then calls RHS.calc_gradient(), to add entries to the operation stack
- Multiply and Sin are defined with member functions so that they can correctly pass information up and down the expression tree
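A sketch of how that assignment might drive the recording (illustrative; the actual member names in Adept may differ):

    // Assigning an arbitrary expression to an adouble triggers recording
    template <class A>
    inline adouble& adouble::operator=(const Expression<A>& rhs) {
      value_ = rhs.value();           // first get the numerical result
      rhs.calc_gradient(stack, 1.0);  // then push (multiplier, index) pairs
                                      //   onto the operation stack
      stack.push_statement(gradient_index_);  // record the LHS statement
      return *this;
    }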
24. New approach
- The following types are passed up the chain at compile time:

[Figure: expression tree for y = y * sin(s); the adouble s becomes Sin<adouble> at the sin node, which combines with the adouble y at operator* to give Multiply<adouble, Sin<adouble> >]
25. Implementation of Sin<A>
    // Definition of the Sin class
    template <class A>
    class Sin : public Expression<Sin<A> > {
    public:
      // Member functions
      // Constructor: store reference to a and its numerical value
      Sin(const Expression<A>& a)
        : a_(a), a_value_(a.value()) { }
      // Return the value
      double value() const {
        return sin(a_value_);
      }
      // Compute derivative and pass to a
      void calc_gradient(Stack& stack, double multiplier) const {
        a_.calc_gradient(stack, cos(a_value_)*multiplier);
      }
    private:
      // Data members
      const A& a_;      // A reference to the object
      double a_value_;  // The numerical value of the object
    };

    // Overload the sin function: it returns a Sin<A> object
    template <class A>
    inline Sin<A> sin(const Expression<A>& a) {
      return Sin<A>(a);
    }
- The Adept library has done this for all operators and functions
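For a binary operator the same pattern passes a weighted derivative to each argument; a sketch in the style of the Sin<A> listing above (member names illustrative):

    // Definition of the Multiply class for a*b
    template <class A, class B>
    class Multiply : public Expression<Multiply<A,B> > {
    public:
      Multiply(const Expression<A>& a, const Expression<B>& b)
        : a_(a), b_(b), a_value_(a.value()), b_value_(b.value()) { }
      double value() const {
        return a_value_ * b_value_;
      }
      // d(a*b)/da = b and d(a*b)/db = a, each weighted by multiplier
      void calc_gradient(Stack& stack, double multiplier) const {
        a_.calc_gradient(stack, b_value_*multiplier);
        b_.calc_gradient(stack, a_value_*multiplier);
      }
    private:
      const A& a_;
      const B& b_;
      double a_value_, b_value_;
    };

    // Overload operator*: it returns a Multiply<A,B> object
    template <class A, class B>
    inline Multiply<A,B> operator*(const Expression<A>& a,
                                   const Expression<B>& b) {
      return Multiply<A,B>(a, b);
    }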
26. Optimizations
- Why are expression templates fast?
  - Compound types representing complex expressions are known at compile time
  - C++ automatically inlines the function calls between the objects in an expression, leaving little more than the operations you would put in a hand-coded application of the chain rule
- Further optimizations:
  - The Stack object keeps memory allocated between calls, to avoid time spent allocating incrementally more memory
  - If the Jacobian is computed, it is done in strips to exploit vectorization (SSE/SSE2 on Intel) and loop unrolling
  - The current stack is accessed via a global but thread-local variable, rather than storing a link to the stack in every adouble object (as in CppAD and ADOL-C)
27. Testing using lidar multiple-scattering models
- Photon Variance-Covariance (PVC) method for small-angle multiple scattering
  - Hogan (JAS 2008)
  - Somewhat similar to a monochromatic radiance model
  - Four coupled ODEs are integrated forward in space
  - Several variables at N gates give N output signals
  - Computational cost proportional to N
- Time-Dependent Two-Stream (TDTS) method for wide-angle multiple scattering
  - Hogan and Battaglia (JAS 2008)
  - Similar to a time-dependent 1D advection model
  - Four coupled PDEs are integrated forward in time
  - Several variables at N gates give N output signals
  - Computational cost proportional to N²
28. Simulation of 3D photon transport
- Animation of scalar flux (I⁺ + I⁻)
  - Colour scale is logarithmic
  - Represents 5 orders of magnitude
- Domain properties:
  - 500 m thick
  - 2 km wide
  - Optical depth of 20
  - No absorption
- In this simulation the lateral distribution is Gaussian at each height and each time
29. Benchmark results
- Time relative to the original code; gcc-4.4, Pentium 2.5 GHz, 2-MB cache

Adjoint | PVC, N=50 | TDTS, N=50
Hand-coded adjoint | 3.0 (1.0+2.0) | 3.6 (1.0+2.6)
New C++ library "Adept" | 3.5 (2.7+0.8) | 3.8 (2.6+1.2)
ADOL-C | 25 (18+7) | 20 (15+5)
CppAD | 29 (15+7+7) | 34 (17+8+9)
30. Outlook
- The new library Adept (Automatic Differentiation using Expression Templates) produces an adjoint with minimum difficulty for the user
  - No knowledge of templates is required of the user at all
  - Simple and efficient to compute the Jacobian matrix as well
  - Freely available at http://www.met.reading.ac.uk/clouds/adept/
  - Typically 5-20% slower than hand-coded adjoints
  - But immeasurably faster in terms of programmer time
  - The code is complete for applying to any C++ code with real numbers
- Further development desirable:
  - Complex numbers
  - Use within C++ matrix/vector libraries, particularly those that already use expression templates (like the one I use for the Unified Algorithm)
  - Easily facilitate checkpointing so that large codes don't exhaust memory
  - Automatically compute higher-order derivatives (e.g. the Hessian matrix)
- Potential for student projects to get small data assimilation systems up and running, and efficient, quickly
- Impossible to apply in Fortran: no template capability!
32. Minimizing the cost function
- The gradient of the cost function (a vector) and its 2nd derivative (the Hessian matrix) are both needed; see the sketch below
- Gradient-descent methods:
  - A fast adjoint method to calculate ∇xJ means we don't need to calculate the Jacobian
  - Disadvantage: more iterations are needed, since we don't know the curvature of J(x)
  - A quasi-Newton method to get the search direction (e.g. L-BFGS, used by ECMWF) builds up an approximate inverse Hessian A for improved convergence
  - Scales well for large x
  - Poorer estimate of the error at the end
- Gauss-Newton method:
  - Rapid convergence (instant for linear problems)
  - Get the solution error covariance for free at the end
  - Levenberg-Marquardt is a small modification to ensure convergence
  - Needs the Jacobian matrix H of every forward model; this can be expensive for larger problems, as the forward model may need to be rerun with each element of the state vector perturbed
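In the standard Gaussian form sketched after slide 8 (same assumed notation, with Jacobian \mathbf{H} = ∂y/∂x), the gradient and the Gauss-Newton approximation to the Hessian are

    \nabla_{\mathbf{x}} J = \mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_a) - \mathbf{H}^T \mathbf{R}^{-1}\,[\mathbf{y}-H(\mathbf{x})]
    \nabla^2_{\mathbf{x}} J \approx \mathbf{B}^{-1} + \mathbf{H}^T \mathbf{R}^{-1} \mathbf{H}

where the second expression is valid when H(x) is nearly linear.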
33. Time-dependent two-stream approximation
- Describe the diffuse flux in terms of an outgoing stream I⁺ and an incoming stream I⁻, and numerically integrate the coupled PDEs sketched below
- These can be discretized quite simply in time and space (no implicit methods or matrix inversion required)
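In schematic form, assuming the standard structure of the time-dependent two-stream equations (the coefficients α, γ and S± here are placeholders for the definitions given in Hogan and Battaglia 2008):

    \frac{1}{c}\frac{\partial I^+}{\partial t} + \frac{\partial I^+}{\partial r} = -(\alpha+\gamma)\,I^+ + \gamma\,I^- + S^+
    \frac{1}{c}\frac{\partial I^-}{\partial t} - \frac{\partial I^-}{\partial r} = -(\alpha+\gamma)\,I^- + \gamma\,I^+ + S^-

Each labelled term below corresponds to one part of these equations.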
- Time derivative: remove this and we have the time-independent two-stream approximation
- Spatial derivative: transport of radiation from upstream
- Loss by absorption or scattering: some of the lost radiation will enter the other stream
- Gain by scattering: radiation scattered from the other stream
- Source: scattering from the quasi-direct beam into each of the streams

Hogan and Battaglia (2008, J. Atmos. Sci.)