
  • Introduction to Algorithmic Differentiation

    Kshitij Kulshreshtha

    Universität Paderborn

    ESSIM Summer School, 13–16 August 2012

    Dresden

  • Schedule

    1. Introduction / Motivation / Function Representation / Forward Mode
    2. Reverse Mode / Higher Order Derivatives
    3. Implementations / ADOL-C
    4. Hands-on Exercises (ADOL-C)

  • Computing Derivatives

    Simulation, Sensitivity Calculation, Optimization

    [Diagram: data and theory feed a modelling step that produces a computer
    program mapping input x to output y for the user; differentiation turns
    this program into an enhanced program that additionally delivers the
    sensitivity ∂y/∂x to an optimization algorithm, which in turn maps
    input x to output y for the user]

  • Computing Derivatives

    Given: a description of the functional relation as
    - a formula: an explicit expression y = F(x)
    - a computer program: ?

    Task: computation of derivatives taking
    - requirements on exactness
    - computational effort
    into account

  • Derivatives: What for?

    - Optimization:
      unconstrained: min f(x),  f: Rⁿ → R
      constrained:   min f(x),  f: Rⁿ → R
                     s.t. c(x) = 0,  c: Rⁿ → Rᵐ
                          h(x) ≤ 0,  h: Rⁿ → Rˡ
    - Solution of nonlinear equation systems:
      F(x) = 0,  F: Rⁿ → Rⁿ
      Newton's method requires F′(x) ∈ Rⁿˣⁿ
    - Simulation of complex procedures:
      - system definition
      - integration of differential equations
    - Sensitivity analysis
    - Real-time control

  • Symbolic Differentiation

    x²       →  2x
    xᵖ       →  p·xᵖ⁻¹
    cos(x)   →  −sin(x)
    exp(x)   →  exp(x)
    log(x)   →  1/x
    g(h(x))  →  g′(h(x)) · h′(x)
    ...

  • Symbolic Differentiation

    - Exact derivatives, e.g.
        f(x) = exp(sin(x²))
        f′(x) = exp(sin(x²)) · cos(x²) · 2x
    - min J(x, u) such that ẋ = f(x, u) + initial conditions;
      reduced formulation J(x, u) → Ĵ(u):
      Ĵ′(u) based on the symbolic adjoint λ̇ = −f_x(x, u)ᵀ λ + terminal conditions
    - Cost (common subexpressions, implementation)
    - Legacy code with a large number of lines:
      a closed-form expression is not available

  • Legacy code (excerpt from euler2d.c):

    read_input_file(argv[1], &code_control);
    code_control.timestep_type = 0;  // calculate timestep size like TAU

    // read in CFD mesh
    read_cfd_mesh(code_control.CFDmesh_name, &gridbase);
    grid[0] = gridbase;
    // remove mesh corner points arising more than once, e.g. for the block
    // structured area and at the interface between the block structured
    // and unstructured area
    remove_double_points(&gridbase, grid);
    // write out mesh in tecplot format
    write_pointdata(name, &(grid[0]));

    // calculate metric of finest grid level
    calc_metric(&(grid[0]), &code_control);
    puts("calc_metric ready");
    // create coarse meshes for multigrid, calculate their metric
    // and initialize forcing functions to zero
    for (i = 1; i < code_control.nlevels; i++) {
        create_coarse_mesh(&(grid[i-1]), &(grid[i]));
        init2zero(&(grid[i]), grid[i].force);
    }
    puts("create_coarse_mesh ready");
    // initialize flow field on all grid levels to free stream quantities
    for (i = 0; i < code_control.nlevels; i++)
        init_field(&(grid[i]), &code_control);
    puts("init_field ready");
    // if selected, read restart file
    if (code_control.restart == 1)
        read_restart("restart", grid, &code_control,
                     &first_residual, &first_step);

    // calculate primitive variables for all grid levels and
    // initialize states at the boundary
    for (i = 0; i < code_control.nlevels; i++) {
        cons2prim(&(grid[i]), &code_control);
        init_bdry_states(&(grid[i]));
    }
    // open file for writing convergence history
    conv = fopen("conv.dat", "w");
    fprintf(conv, "title = convergence\n");
    fprintf(conv, "variables = iter, l2res, lift, drag\n");

    level = 0;
    printf("will perform %d steps\n", code_control.nsteps[level]);

    // starting time of computation
    t1 = time(&t1);
    double lift, drag;

    // loop over all multigrid cycles
    for (it = 0; it < code_control.nsteps[level]; it++) {
        double residual;
        lift = 0.0;
        drag = 0.0;
        // ... time step, residual, lift and drag computation ...
        // print out convergence information to file and standard output
        printf("IT = %d %20.10e %20.10e %20.10e %4.2f\n", sum_it,
               residual / first_residual, lift, drag, weight);
        fprintf(conv, "%d %20.10e %20.10e %20.10e\n", sum_it + first_step,
                residual / first_residual, lift, drag);
        sum_it++;
    }

    // final time of computation
    t2 = time(&t2);
    // print out time needed for the time loop
    printf("Zeit : %f\n", difftime(t2, t1));
    last_step = first_step + code_control.nsteps[0];
    fclose(conv);

    // map solution from cell centers to vertices
    center2point(grid);
    // write out field solution
    write_eulerdata("euler.dat", grid, &code_control);
    // write out solution on walls
    write_surf("eulersurf.dat", grid, &code_control);
    // write restart file
    write_restart("restart", grid, &code_control, first_residual, last_step);

    return 0;
    }

  • Finite Differences

    Idea: Taylor expansion. For smooth f: R → R,

      f(x + h) = f(x) + h·f′(x) + h²·f″(x)/2 + h³·f‴(x)/6 + ...
      ⇒ f(x + h) ≈ f(x) + h·f′(x)

      Df(x) ≈ (f(x + h) − f(x)) / h

    - simple derivative calculation (only function evaluations!)
    - inexact derivatives
    - computational cost often too high:
      F: Rⁿ → R  ⇒  OPS(F′(x)) ≈ (n + 1) · OPS(F(x))
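
    To make the cost estimate concrete, here is a minimal C++ sketch of the
    one-sided difference above (the test function, step size h, and all names
    are illustrative choices, not from the slides):

      #include <cmath>
      #include <vector>

      // One-sided finite-difference gradient: n + 1 evaluations of F,
      // matching the estimate OPS(F'(x)) ≈ (n + 1) · OPS(F(x)).
      template <typename Func>
      std::vector<double> fd_gradient(Func F, std::vector<double> x,
                                      double h = 1e-7) {
          const double f0 = F(x);          // evaluation at the base point
          std::vector<double> g(x.size());
          for (std::size_t i = 0; i < x.size(); ++i) {
              x[i] += h;                   // perturb one coordinate
              g[i] = (F(x) - f0) / h;      // truncation error O(h),
              x[i] -= h;                   // cancellation error O(eps/h)
          }
          return g;
      }

      // Example: f(x) = exp(sin(x1^2)) + x2
      int main() {
          auto f = [](const std::vector<double>& x) {
              return std::exp(std::sin(x[0] * x[0])) + x[1];
          };
          std::vector<double> g = fd_gradient(f, {1.5, 2.0});
          return g.size() == 2 ? 0 : 1;
      }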

  • Example 1

    [figure-only slides: plots for Example 1]

  • Example 2

    [figure-only slides: plots for Example 2]

  • Algorithmic Differentiation (AD)

    Main Products:
    - Quantitative dependence information (local):
      - weighted and directed partial derivatives
      - error and condition number estimates ...
      - Lipschitz constants, interval enclosures ...
      - eigenvalues, Newton steps ...
    - Qualitative dependence information (regional):
      - sparsity structures, degrees of polynomials
      - ranks, eigenvalue multiplicities ...

  • Algorithmic Differentiation (AD)

    Main Properties:
    - No truncation errors
    - Chain rule applied to numbers
    - Applicability to arbitrary programs
    - A priori bounded and/or adjustable costs:
      - total operations count
      - maximal memory requirement
      - total memory traffic
    always relative to the original function. (Griewank, Walther 09)

  • The Hello-World Example of AD

    [Figure: a lighthouse whose beam at angle ωt hits the quay, given by the
    line y₂ = γ·y₁, at the point (y₁, y₂)]

    y₁ = ν·tan(ωt) / (γ − tan(ωt))   and   y₂ = γ·ν·tan(ωt) / (γ − tan(ωt))

  • Automation: computer algebra tools such as Maple, Mathematica, ...

    Maple provides

    [screenshots of Maple output for the example]

  • Forward Mode of AD

    [Figure: F maps the curve x(t) to y(t); the tangent ẋ at x is mapped to
    the tangent ẏ at y = F(x)]

    ẏ(t) = ∂/∂t F(x(t)) = F′(x(t)) · ẋ(t) ≡ Ḟ(x, ẋ)

  • Forward Mode (Lighthouse)

    Function:                  Tangent:
    v₋₃ = x₁                   v̇₋₃ = ẋ₁
    v₋₂ = x₂                   v̇₋₂ = ẋ₂
    v₋₁ = x₃                   v̇₋₁ = ẋ₃
    v₀  = x₄ = t               v̇₀  = ẋ₄
    v₁ = v₋₁ · v₀              v̇₁ = v̇₋₁·v₀ + v₋₁·v̇₀
    v₂ = tan(v₁)               v̇₂ = v̇₁ / cos²(v₁)
    v₃ = v₋₂ − v₂              v̇₃ = v̇₋₂ − v̇₂
    v₄ = v₋₃ · v₂              v̇₄ = v̇₋₃·v₂ + v₋₃·v̇₂
    v₅ = v₄ / v₃               v̇₅ = (v̇₄ − v̇₃·v₅) · (1/v₃)
    v₆ = v₅ · v₋₂              v̇₆ = v̇₅·v₋₂ + v₅·v̇₋₂
    y₁ = v₅                    ẏ₁ = v̇₅
    y₂ = v₆                    ẏ₂ = v̇₆

  • Complexity (Forward Mode)

                v = c    v = u ± w    v = u·w    v = φ(u)
    MOVES       1 + 1    3 + 3        3 + 3      2 + 2
    ADDS        0        1 + 1        0 + 1      0 + 0
    MULTS       0        0            1 + 2      0 + 1
    NLOPS       0        0            0          1 + 1

    OPS(F′(x)·ẋ) ≤ c · OPS(F(x))

    with c ∈ [2, 5/2] platform dependent

  • Reverse Mode AD = Discrete Adjoints

    [Figure: F maps x to y; the normal ȳ of the hyperplane ȳᵀy = c in the
    range is pulled back to the normal of its preimage ȳᵀF(x) = c]

    x̄ᵀ ≡ ȳᵀ F′(x),  i.e.  x̄ = F′(x)ᵀ ȳ ≡ F̄(x, ȳ)

  • Reverse Mode (Lighthouse)

    Forward sweep:
    v₋₃ = x₁;  v₋₂ = x₂;  v₋₁ = x₃;  v₀ = x₄;
    v₁ = v₋₁ · v₀;    v₂ = tan(v₁);
    v₃ = v₋₂ − v₂;    v₄ = v₋₃ · v₂;
    v₅ = v₄ / v₃;     v₆ = v₅ · v₋₂;
    y₁ = v₅;  y₂ = v₆;

    Reverse sweep:
    v̄₅ = ȳ₁;  v̄₆ = ȳ₂;
    v̄₅ += v̄₆·v₋₂;   v̄₋₂ += v̄₆·v₅;
    v̄₄ += v̄₅/v₃;    v̄₃ -= v̄₅·v₅/v₃;
    v̄₋₃ += v̄₄·v₂;   v̄₂ += v̄₄·v₋₃;
    v̄₋₂ += v̄₃;      v̄₂ -= v̄₃;
    v̄₁ += v̄₂/cos²(v₁);
    v̄₋₁ += v̄₁·v₀;   v̄₀ += v̄₁·v₋₁;
    x̄₄ = v̄₀;  x̄₃ = v̄₋₁;  x̄₂ = v̄₋₂;  x̄₁ = v̄₋₃;

  • Complexity (Reverse Mode)

                v = c    v = u ± w    v = u·w    v = φ(u)
    MOVES       1 + 1    3 + 6        3 + 8      2 + 5
    ADDS        0        1 + 2        0 + 2      0 + 1
    MULTS       0        0            1 + 2      0 + 1
    NLOPS       0        0            0          1 + 1

    OPS(ȳᵀF′(x)) ≤ c · OPS(F(x))
    MEM(ȳᵀF′(x)) ~ OPS(F(x))

    with c ∈ [3, 4] platform dependent

    Remarks:
    - The cost of the gradient calculation is independent of n
    - The memory requirement may cause problems! → Checkpointing

  • Algorithmic Differentiation (AD)

    Differentiation of computer programs within machine precision

    Basic forms:
    forward mode:  OPS(F′(x)ẋ) ≤ c₁ · OPS(F),    c₁ ∈ [2, 5/2]
    reverse mode:  OPS(ȳᵀF′(x)) ≤ c₂ · OPS(F),   c₂ ∈ [3, 4]
                   MEM(ȳᵀF′(x)) ~ OPS(F)
    combination:   OPS(ȳᵀF″(x)ẋ) ≤ c₃ · OPS(F),  c₃ ∈ [7, 10]

    Tasks: derivatives (of any order), sparsity patterns, differentiation of
    fixed-point iterations and time-stepping procedures, ...

    Implementations?

  • Intermediate Conclusions

    - Evaluation of derivatives with working accuracy
    - Efficient evaluation of derivatives of any order
    - Numerous tools available,
      based on source transformation or operator overloading
    - Nondifferentiability only on a meager set
    - Optimal Jacobian accumulation is NP-hard
    - Full Jacobians/Hessians often not needed, or sparse
    - see www.autodiff.org

  • Forward Mode of AD

    [Figure: F maps the curve x(t) to y(t); the tangent ẋ at x is mapped to
    the tangent ẏ at y = F(x)]

    ẏ(t) = ∂/∂t F(x(t)) = F′(x(t)) · ẋ(t) ≡ Ḟ(x, ẋ)

  • Evaluation procedure for

      F(x) = exp(sin(x²)),   d/dt F(x(t)) = exp(sin(x²)) · cos(x²) · 2x · ẋ

    v₀ = x          v̇₀ = ẋ
    v₁ = v₀²        v̇₁ = 2·v₀·v̇₀
    v₂ = sin(v₁)    v̇₂ = cos(v₁)·v̇₁
    v₃ = exp(v₂)    v̇₃ = exp(v₂)·v̇₂
    y  = v₃         ẏ  = v̇₃ = d/dt F(x(t))
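
    A direct C++ transcription of this tangent procedure might look as
    follows (the function name F_tangent is illustrative):

      #include <cmath>

      // Hand-coded tangent of F(x) = exp(sin(x^2)): a line-by-line
      // translation of the trace above.
      // Returns y = F(x) and sets *ydot = F'(x) * xdot.
      double F_tangent(double x, double xdot, double* ydot) {
          double v0 = x,            v0d = xdot;
          double v1 = v0 * v0,      v1d = 2.0 * v0 * v0d;
          double v2 = std::sin(v1), v2d = std::cos(v1) * v1d;
          double v3 = std::exp(v2), v3d = std::exp(v2) * v2d;
          *ydot = v3d;              // = d/dt F(x(t))
          return v3;
      }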

  • General: with uᵢ ≡ (vⱼ)_{j≺i}

      v = φᵢ(uᵢ)   ⇒   v̇ = φᵢ′(uᵢ) · u̇ᵢ

    Elementary tangents for intrinsic functions:
    v = sin(u)   →  v̇ = cos(u) · u̇
    v = √u       →  v̇ = 0.5 · u̇ / v
    v = tan(u)   →  v̇ = u̇ / cos²(u)

    Elementary tangents for arithmetic operations:
    v = u ± w    →  v̇ = u̇ ± ẇ
    v = u · w    →  v̇ = ẇ·u + u̇·w
    v = u / w    →  v̇ = (u̇ − v·ẇ) / w

  • Forward Differentiation of the Lighthouse Example

    v₋₃ = x₁         v̇₋₃ = ẋ₁
    v₋₂ = x₂         v̇₋₂ = ẋ₂
    v₋₁ = x₃         v̇₋₁ = ẋ₃
    v₀  = x₄ = t     v̇₀  = ẋ₄
    v₁ = v₋₁ · v₀    v̇₁ = v̇₋₁·v₀ + v₋₁·v̇₀
    v₂ = tan(v₁)     v̇₂ = v̇₁ / cos²(v₁)
    v₃ = v₋₂ − v₂    v̇₃ = v̇₋₂ − v̇₂
    v₄ = v₋₃ · v₂    v̇₄ = v̇₋₃·v₂ + v₋₃·v̇₂
    v₅ = v₄ / v₃     v̇₅ = (v̇₄ − v̇₃·v₅) · (1/v₃)
    v₆ = v₅          v̇₆ = v̇₅
    v₇ = v₅ · v₋₂    v̇₇ = v̇₅·v₋₂ + v₅·v̇₋₂
    y₁ = v₆          ẏ₁ = v̇₆
    y₂ = v₇          ẏ₂ = v̇₇

  • Define φ̇ᵢ(uᵢ, u̇ᵢ) ≡ φᵢ′(uᵢ) · u̇ᵢ

    General tangent procedure:

    [vᵢ₋ₙ, v̇ᵢ₋ₙ] = [xᵢ, ẋᵢ]               i = 1, ..., n
    [vᵢ, v̇ᵢ]     = [φᵢ(uᵢ), φ̇ᵢ(uᵢ, u̇ᵢ)]    i = 1, ..., l
    [yₘ₋ᵢ, ẏₘ₋ᵢ] = [vₗ₋ᵢ, v̇ₗ₋ᵢ]            i = m − 1, ..., 0

  • Complexity:

                v = c    v = u ± w    v = u·w    v = φ(u)
    MOVES       1 + 1    3 + 3        3 + 3      2 + 2
    ADDS        0        1 + 1        0 + 1      0 + 0
    MULTS       0        0            1 + 2      0 + 1
    NLOPS       0        0            0          1 + 1

    OPS(F′(x)·ẋ) ≤ c · OPS(F(x))

    with c ∈ [2, 5/2] platform dependent

    Remarks:
    - How to arrange the evaluation of [vᵢ, v̇ᵢ] optimally?
    - Overwrites are no problem

  • Reverse Mode of AD

    [Figure: F maps x to y; the normal ȳ of the hyperplane ȳᵀy = c in the
    range is pulled back to the normal of its preimage ȳᵀF(x) = c]

    x̄ᵀ ≡ ȳᵀ F′(x),  i.e.  x̄ = F′(x)ᵀ ȳ ≡ F̄(x, ȳ)

  • Evaluation procedure for

      F(x) = exp(sin(x²)),   ȳ·F′(x) = ȳ · exp(sin(x²)) · cos(x²) · 2x

    v₀ = x           v̄₃ = ȳ
    v₁ = v₀²         v̄₂ = exp(v₂)·v̄₃
    v₂ = sin(v₁)     v̄₁ = cos(v₁)·v̄₂
    v₃ = exp(v₂)     v̄₀ = 2·v₀·v̄₁
    y  = v₃          x̄  = v̄₀
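
    The corresponding hand-coded adjoint in C++, as a minimal sketch (the
    name F_adjoint is illustrative): the forward sweep records the
    intermediates, the reverse sweep propagates the adjoints.

      #include <cmath>

      // Hand-coded adjoint of F(x) = exp(sin(x^2)).
      // Returns y = F(x) and sets *xbar = ybar * F'(x).
      double F_adjoint(double x, double ybar, double* xbar) {
          // forward sweep: record the intermediates
          double v0 = x;
          double v1 = v0 * v0;
          double v2 = std::sin(v1);
          double v3 = std::exp(v2);
          // reverse sweep: propagate the adjoints
          double v3b = ybar;
          double v2b = std::exp(v2) * v3b;
          double v1b = std::cos(v1) * v2b;
          double v0b = 2.0 * v0 * v1b;
          *xbar = v0b;              // = ybar * F'(x)
          return v3;
      }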

  • Computation of ȳ F′(x)?

    Rewrite the general tangent procedure

    [vᵢ₋ₙ, v̇ᵢ₋ₙ] = [xᵢ, ẋᵢ]               i = 1, ..., n
    [vᵢ, v̇ᵢ]     = [φᵢ(uᵢ), φ̇ᵢ(uᵢ, u̇ᵢ)]    i = 1, ..., l
    [yₘ₋ᵢ, ẏₘ₋ᵢ] = [vₗ₋ᵢ, v̇ₗ₋ᵢ]            i = m − 1, ..., 0

    in matrix-vector notation.

  • Define

      cᵢⱼ ≡ ∂φᵢ/∂vⱼ

    and let Aᵢ ∈ R⁽ⁿ⁺ˡ⁾ˣ⁽ⁿ⁺ˡ⁾ be the identity matrix whose (n+i)-th row is
    replaced by the row of elemental partials (cᵢ,₁₋ₙ, cᵢ,₂₋ₙ, ..., cᵢ,ᵢ₋₁, 0, ..., 0),
    i.e.

      Aᵢ = I + eₙ₊ᵢ [∇φᵢ(uᵢ) − eₙ₊ᵢ]ᵀ,   i = 1, ..., l,

      Pₙ ≡ [I, 0, ..., 0] ∈ Rⁿˣ⁽ⁿ⁺ˡ⁾,
      Qₘ ≡ [0, ..., 0, I] ∈ Rᵐˣ⁽ⁿ⁺ˡ⁾.

  • The tangent operation v̇ᵢ = φᵢ′(uᵢ)·u̇ᵢ equals v̇ = Aᵢ·v̇.

    The linear transformation ẏ = F′(x)·ẋ equals

      ẏ = Qₘ Aₗ Aₗ₋₁ ... A₂ A₁ Pₙᵀ ẋ,

    with the product representation

      F′(x) = Qₘ Aₗ Aₗ₋₁ ... A₂ A₁ Pₙᵀ ∈ Rᵐˣⁿ.

    The linear transformation x̄ = ȳ F′(x) (reverse mode) equals

      x̄ᵀ = Pₙ A₁ᵀ A₂ᵀ ... Aₗ₋₁ᵀ Aₗᵀ Qₘᵀ ȳᵀ.

  • Because of

      Aᵢᵀ = I + [∇φᵢ(uᵢ) − eₙ₊ᵢ] eₙ₊ᵢᵀ,

    the product Aᵢᵀ v̄ with v̄ ∈ Rⁿ⁺ˡ forms an incremental operation:

    1. indices j ≠ i with j ⊀ i: v̄ⱼ is left unchanged
    2. indices j ≺ i: v̄ⱼ is incremented by v̄ᵢ·cᵢⱼ
    3. index i: v̄ᵢ is set to zero

    Hence, for computing x̄ = ȳ F′(x) one has to
    - evaluate all intermediates vᵢ, obtaining the partial derivatives cᵢⱼ
    - increment the adjoint values v̄ⱼ in reverse order.

  • Corresponding Incremental Adjoint Recursion

    v̄ᵢ = 0                                  i = 1 − n, ..., l
    vᵢ₋ₙ = xᵢ                               i = 1, ..., n
    vᵢ = φᵢ(vⱼ)_{j≺i}                        i = 1, ..., l
    yₘ₋ᵢ = vₗ₋ᵢ                              i = m − 1, ..., 0
    v̄ₗ₋ᵢ = ȳₘ₋ᵢ                              i = 0, ..., m − 1
    v̄ⱼ += v̄ᵢ · ∂φᵢ(uᵢ)/∂vⱼ   for j ≺ i       i = l, ..., 1
    x̄ᵢ = v̄ᵢ₋ₙ                                i = n, ..., 1

    No overwriting ⇒ zeroing out of the v̄ᵢ is omitted.

  • Adjoint Recursion of the Lighthouse Example

    Forward sweep:
    v₋₃ = x₁;  v₋₂ = x₂;  v₋₁ = x₃;  v₀ = x₄;
    v₁ = v₋₁ · v₀;    v₂ = tan(v₁);
    v₃ = v₋₂ − v₂;    v₄ = v₋₃ · v₂;
    v₅ = v₄ / v₃;     v₆ = v₅ · v₋₂;
    y₁ = v₅;  y₂ = v₆;

    Reverse sweep:
    v̄₅ = ȳ₁;  v̄₆ = ȳ₂;
    v̄₅ += v̄₆·v₋₂;   v̄₋₂ += v̄₆·v₅;
    v̄₄ += v̄₅/v₃;    v̄₃ -= v̄₅·v₅/v₃;
    v̄₋₃ += v̄₄·v₂;   v̄₂ += v̄₄·v₋₃;
    v̄₋₂ += v̄₃;      v̄₂ -= v̄₃;
    v̄₁ += v̄₂/cos²(v₁);
    v̄₋₁ += v̄₁·v₀;   v̄₀ += v̄₁·v₋₁;
    x̄₄ = v̄₀;  x̄₃ = v̄₋₁;  x̄₂ = v̄₋₂;  x̄₁ = v̄₋₃;

  • Reverse Operations of Elementals

    Elemental        Reverse
    v = c            v̄ = 0
    v = u − w        ū += v̄;  w̄ -= v̄;  v̄ = 0
    v = u · w        ū += v̄·w;  w̄ += v̄·u;  v̄ = 0
    v = 1/u          ū -= (v·v)·v̄;  v̄ = 0
    v = √u           ū += 0.5·v̄/v;  v̄ = 0
    v = uᶜ           ū += (v̄·c)·v/u;  v̄ = 0
    v = exp(u)       ū += v̄·v;  v̄ = 0
    v = log(u)       ū += v̄/u;  v̄ = 0
    v = sin(u)       ū += v̄·cos(u);  v̄ = 0

  • Complexity:

                v = c    v = u ± w    v = u·w    v = φ(u)
    MOVES       1 + 1    3 + 6        3 + 8      2 + 5
    ADDS        0        1 + 2        0 + 2      0 + 1
    MULTS       0        0            1 + 2      0 + 1
    NLOPS       0        0            0          1 + 1

    OPS(ȳ F′(x)) ≤ c · OPS(F(x))

    with c ∈ [3, 5] platform dependent

    Remark:
    - Overwrites cause problems! See checkpointing.

  • Interpretation of the adjoint values v̄ᵢ:

    - Lagrange multipliers of
        min  ȳᵀy
        s.t. φᵢ(uᵢ) − vᵢ = 0    i = 1, ..., l
             vₗ₋ᵢ − yₘ₋ᵢ = 0    i = 1, ..., m

    - Error propagation factors: with perturbed intermediates
        ṽᵢ = vᵢ(1 + εᵢ),  |εᵢ| ≤ macheps,
      one has
        ỹ − y = Σᵢ₌₁ˡ v̄ᵢ vᵢ εᵢ + h.o.t.  ≤  Σᵢ₌₁ˡ |v̄ᵢ vᵢ| · macheps + h.o.t.

  • Vector Modes

    How to reduce the overhead in forward and reverse mode?
    Propagate a bundle of vectors instead of a single vector!

    Forward mode: compute Y = F′(x)·X ∈ Rᵐˣᵖ for X ∈ Rⁿˣᵖ

    instead of

    ẏ = F′(x)·ẋ ∈ Rᵐ for ẋ ∈ Rⁿ.

    - Replace v̇ⱼ ∈ R by V̇ⱼ ∈ Rᵖ in the tangent procedure
    - Replace ż ∈ R^(l−m) by Ż ∈ R^((l−m)×p)
    - Everything else remains unchanged

  • Vector Tangents:

    v = u ± w     →  V̇ = U̇ ± Ẇ
    v = u · w     →  V̇ = w·U̇ + u·Ẇ
    v = sin(u)    →  V̇ = cos(u)·U̇

    Complexity:
                v = c    v = u ± w    v = u·w    v = φ(u)
    MOVES       1 + p    3 + 3p       3 + 3p     2 + 2p
    ADDS        0        1 + p        0 + p      0 + 0
    MULTS       0        0            1 + 2p     0 + p
    NLOPS       0        0            0          2

    OPS(F′(x)·X) ≤ c · OPS(F(x))

    with c ∈ [1 + p, 1 + 1.5p]
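
    As a sketch of how one elemental propagates a bundle of p tangents at
    once (the names VecAd and vmul are illustrative):

      #include <cstddef>
      #include <vector>

      struct VecAd {
          double val;                // the value v
          std::vector<double> dot;   // p tangent directions Vdot
      };

      // v = u * w with its vector tangent Vdot = w*Udot + u*Wdot:
      // the elemental itself is evaluated once and shared by all directions.
      VecAd vmul(const VecAd& u, const VecAd& w) {
          VecAd v;
          v.val = u.val * w.val;
          v.dot.resize(u.dot.size());
          for (std::size_t k = 0; k < u.dot.size(); ++k)
              v.dot[k] = w.val * u.dot[k] + u.val * w.dot[k];
          return v;
      }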

  • Applications:

    - Computation of the full Jacobian: set Ẋ = Iₙ, then

        OPS(F′(x)) ≤ (1 + 1.5·n) · OPS(F(x))

      Compare with the scalar mode, i.e. F′(x)·eᵢ, i = 1, ..., n:

        OPS(F′(x)) ≈ 2.5·n · OPS(F(x))

      Remarkable run-time savings can be obtained.
    - Computation of sparse Jacobians

  • Reverse mode:

    Compute X̄ = Ȳ·F′(x) ∈ R^(q×n) for Ȳ ∈ R^(q×m)

    instead of

    x̄ = ȳ·F′(x) ∈ Rⁿ for ȳ ∈ Rᵐ.

    Implementation:
    - Replace v̄ᵢ ∈ R by V̄ᵢ ∈ R^q in the adjoint recursion
    - Everything else remains unchanged

  • Complexity:

                v = c    v = u ± w    v = u·w    v = φ(u)
    MOVES       1 + q    3 + 6q       5 + 6q     3 + 4q
    ADDS        0        1 + 2q       0 + 2q     0 + q
    MULTS       0        0            1 + 2q     0 + q
    NLOPS       0        0            0          2

    OPS(Ȳ F′(x)) ≤ c · OPS(F(x))

    with c ∈ [1 + 2q, 1.5 + 2.5q]

    Remarks:
    - Computation of the full Jacobian
    - Remarkable run-time savings can be obtained
    - Application: e.g. matrix compression for sparse Jacobians

  • General Evaluation Procedure:

    vᵢ₋ₙ = xᵢ              for i = 1, ..., n
    vᵢ = φᵢ(vⱼ)_{j≺i}       for i = 1, ..., l
    yₘ₋ᵢ = vₗ₋ᵢ             for i = m − 1, ..., 0

    where
    - vᵢ, i ≤ 0, are the independents
    - vᵢ, i ≥ l − m + 1, are the dependents
    - φᵢ is an elemental function, Φ = set of elemental functions
    - j ≺ i is the dependence relation: vᵢ depends directly on vⱼ

  • Essential / Optional / Vector elementals

                Essential                 Optional                    Vector
    Smooth      u + v, u · v, c,          u − v, u/v, c ± u, c·u,     Σₖ uₖvₖ, Σₖ cₖuₖ,
                rec(u) = 1/u,             uᶜ, arcsin(u), tan(u),      Πₖ uₖ
                exp(u), log(u),           |u|ᶜ with c > 1, ...
                sin(u), cos(u)
    Lipschitz   abs(u) = |u|,             max(u, v), min(u, v)        max{|uₖ|}ₖ, Σₖ |uₖ|,
                ‖(u, v)‖ = √(u² + v²)                                 √(Σₖ uₖ²)
    General     heav(u)                   sign(u), (u > 0) ? v : w

  • Representation as computational graph

    [Figure: computational graph of the lighthouse example, with independent
    nodes v₋₃, v₋₂, v₋₁, v₀ and intermediate nodes v₁, ..., v₇]

  • Representation as an extended system E(x; v) = 0

    - Define
        uᵢ ≡ (vⱼ)_{j≺i}        for i = 1 − n, ..., l
        φᵢ(uᵢ) ≡ xᵢ₊ₙ           for i = 1 − n, ..., 0
        cᵢⱼ ≡ ∂φᵢ/∂vⱼ, called the elemental partials
    - Assume the dependent variables are mutually independent.

    Then one has:
    - i < 1 or j > l − m implies cᵢⱼ ≡ 0
    - The evaluation procedure is equivalent to the nonlinear system

        0 = E(x; v) ≡ (φᵢ(uᵢ) − vᵢ)_{i = 1−n, ..., l}

  • Topics to keep in mind:

    - Composite differentiation (overall differentiability):
      here it is assumed that all φᵢ are differentiable at their
      respective arguments.
    - Overwriting = shared allocation of vᵢ and vⱼ:
      not considered here, but there are ways to handle it.
    - Aliasing of arguments and results:
      not considered here, but again there are ways to handle it.

  • Main Results:

    - Gradients are cheap: ~ function costs
    - Costs grow only quadratically in the degree
    - Nondifferentiability only on a meager set
    - Optimal Jacobian accumulation is NP-hard
    - Jacobian accumulation is often unnecessary

  • Higher Order Derivatives

    Second derivatives: use forward and reverse differentiation together!

      y = F(x)       --reverse differentiation-->   x̄ = ȳ F′(x)

      x̄ = ȳ F′(x)    --forward differentiation-->   x̄̇ = ȳ F″(x) ẋ + ȳ̇ F′(x)

    For a scalar-valued function F with ȳ = 1, ȳ̇ = 0, and direction ẋ one has

      x̄̇ = F″(x) ẋ = Hessian-vector product
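
    A minimal sketch of this forward-over-reverse combination for the scalar
    example f(x) = exp(sin(x²)): running the hand-coded adjoint from earlier
    in dual-number arithmetic yields the Hessian-vector product in the dual
    part (all names here are illustrative):

      #include <cmath>

      struct Dual { double v, d; };            // value and tangent

      Dual operator*(Dual a, Dual b) { return {a.v*b.v, a.d*b.v + a.v*b.d}; }
      Dual dsin(Dual a) { return {std::sin(a.v),  std::cos(a.v) * a.d}; }
      Dual dcos(Dual a) { return {std::cos(a.v), -std::sin(a.v) * a.d}; }
      Dual dexp(Dual a) { return {std::exp(a.v),  std::exp(a.v) * a.d}; }

      // Returns f''(x) * xdot for f(x) = exp(sin(x^2)).
      double hess_vec(double x, double xdot) {
          Dual v0 = {x, xdot};                 // independent with tangent
          Dual v1 = v0 * v0;                   // forward sweep
          Dual v2 = dsin(v1);
          Dual v3 = dexp(v2);
          (void)v3;                            // y = F(x), not needed below
          Dual v3b = {1.0, 0.0};               // reverse sweep with ybar = 1
          Dual v2b = dexp(v2) * v3b;
          Dual v1b = dcos(v1) * v2b;
          Dual two = {2.0, 0.0};
          Dual v0b = two * v0 * v1b;
          // v0b.v = f'(x); its dual part is the directional second derivative
          return v0b.d;                        // = f''(x) * xdot
      }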

  • Implementation

    [Diagram: source code for y = F(x) ∈ Rᵐ is compiled into object code.
    Forward mode: from (x, ẋ) ∈ R²ⁿ compute y = F(x) ∈ Rᵐ and
    ẏ = F′(x)·ẋ ∈ Rᵐ. Reverse mode: intermediate values are stored on a
    stack (tape); from (x, ȳ) ∈ Rⁿ⁺ᵐ compute x̄ = ȳ F′(x) ∈ Rⁿ and, combined
    with the forward mode, x̄̇ = ȳ F″(x) ẋ ∈ Rⁿ]

  • Higher derivatives: think in Taylor polynomials

    Let x be the path

      x(t) ≡ Σⱼ₌₀ᵈ xⱼ tʲ : R → Rⁿ   with   xⱼ = (1/j!) · ∂ʲx(t)/∂tʲ |_{t=0}

    Hence
    - xⱼ is the scaled derivative at t = 0,
    - x₁ = tangent, x₂ = curvature.

    Now consider F: Rⁿ → Rᵐ, d times continuously differentiable.
    ⇒ y(t) ≡ F(x(t)): R → Rᵐ is smooth (chain rule).

  • There exist Taylor coefficients yⱼ ∈ Rᵐ, 0 ≤ j ≤ d, with

      y(t) = Σⱼ₌₀ᵈ yⱼ tʲ + O(t^(d+1)).

    Using the chain rule, one obtains:

      y₀ = F(x₀)
      y₁ = F′(x₀)x₁
      y₂ = F′(x₀)x₂ + ½ F″(x₀)x₁x₁
      y₃ = F′(x₀)x₃ + F″(x₀)x₁x₂ + ⅙ F‴(x₀)x₁x₁x₁

    ⇒ directional derivatives of high order (forward mode):

      OPS(x₀, ..., x_d → y₀, ..., y_d) ≈ d² · OPS(x₀ → y₀)

    using Taylor arithmetic.
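
    A minimal sketch of the Taylor arithmetic behind this estimate:
    propagating d + 1 coefficients through a multiplication is a Cauchy
    product and costs O(d²) operations, which is where the overall d² factor
    comes from (the type alias and function name are illustrative):

      #include <cstddef>
      #include <vector>

      using Taylor = std::vector<double>;   // coefficients [x0, x1, ..., xd]

      // Truncated Taylor multiplication v = u * w:
      // v_j = sum_{k=0..j} u_k * w_{j-k}
      Taylor mul(const Taylor& u, const Taylor& w) {
          const std::size_t d = u.size();
          Taylor v(d, 0.0);
          for (std::size_t j = 0; j < d; ++j)        // coefficient of t^j
              for (std::size_t k = 0; k <= j; ++k)   // Cauchy product
                  v[j] += u[k] * w[j - k];
          return v;
      }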

  • Cross Country / Preaccumulation

    Consider again the Jacobian of the extended system E(x; v):

      E′(x; v) =  ⎛ −I    0    0 ⎞
                  ⎜  B   L−I   0 ⎟
                  ⎝  R    T   −I ⎠

    with F′(x) = R + T(I − L)⁻¹B = Schur complement

    - Forward mode: row-block elimination of T by L − I
    - Reverse mode: column-block elimination of B by L − I
    - Cross country: apply another elimination order

  • Another Interpretation

    Elimination in the Jacobian of the extended system =
    vertex elimination in the computational graph.

    [Figure: example graph for forward and reverse mode, with vertices
    −1, 0, 1, 2, 3, 4, 5, 6 and edge labels c₁,₀, c₁,₋₁, c₂,₁, c₃,₂, c₄,₃,
    c₅,₁, c₅,₄, c₆,₀, c₆,₄]

  • AD by Source Transformation

    Code for Simulation → Code for Derivatives

    func.src →
      1. Lexical analysis
      2. Syntax analysis
      3. Semantic analysis
      4. Static data flow analyses (symbol table, error handler)
      5. AD !!!
      6. Code optimization
      7. Unparsing
    → func&der.drc

  • Remarks

    Pros:
    - Makes use of compiler optimization !!
    - Derivative code is readable

    Cons:
    - Requires a complete compiler, activity analysis, new language features
    - Flexibility

  • Speelpenning's Example

    Speelpenning's function f(x) = Πᵢ₌₁ⁿ xᵢ coded in Fortran:

    SUBROUTINE speelpenning(x, y)
      DOUBLE PRECISION x(10), y
      INTEGER i
      y = 1.0
      DO i = 1, 10
        y = y*x(i)
      END DO
    END

  • Forward Mode Differentiation

    Requirements for the AD tool:
    - determine active variables
    - associate derivative objects
    - generate differentiated statements
    - generate differentiated subroutines if required

  • Forward Mode Differentiation

    SUBROUTINE SPEELPENNING_D(x, xd, y, yd)
      DOUBLE PRECISION x(10), xd(10), y, yd
      INTEGER i
      y = 1.0
      yd = 0.D0
      DO i = 1, 10
        yd = yd*x(i) + y*xd(i)
        y = y*x(i)
      END DO
    END

  • Reverse Mode Differentiation

    Requirements for the AD tool:
    - determine active variables
    - associate derivative objects
    - generate adjoint statements
    - determine values to be recorded
    - reverse the control flow
    - generate the recording part of the code

  • Reverse Mode Differentiation

    SUBROUTINE SPEELPENNING_B(x, xb, y, yb)
      DOUBLE PRECISION x(10), xb(10), y, yb
      INTEGER i, adTo
      y = 1.0
      DO i = 1, 10
        CALL PUSHREAL8(y)
        y = y*x(i)
      END DO
      CALL PUSHINTEGER4(i - 1)
      CALL POPINTEGER4(adTo)
      DO i = adTo, 1, -1
        CALL POPREAL8(y)
        xb(i) = xb(i) + y*yb
        yb = x(i)*yb
      END DO
      yb = 0.D0
    END

  • Source Transformation Tools

    Fortran:
    AMPL        Bell Labs, USA                    forward + reverse
    NAG         RWTH Aachen, Germany + NAG        forward + reverse
    TAF         FastOpt, Germany                  forward + reverse
    Tapenade    INRIA, France                     forward + reverse
    OpenAD      ANL, USA; RWTH Aachen, Germany    forward
    ...

    All tools provide at most second derivatives!
    Goal: get AD into the compiler.

    Matlab:
    ADIMAT          RWTH Aachen, Germany          forward
    MAD (in TOMS)   Univ. Shrivenham, GB          forward
    ...

  • AD by Operator Overloading

    Implementation strategies:
    - New variable type for active variables,
      i.e., for independents, dependents, intermediates (!)
    - Derivative information included as
      - a field in a composite structure
      - generation of an internal function representation

    Principle: new class definition

      type adouble = class
        VAL : double;
        ???
      end;

  • AD by a Field in a Composite Structure

    Idea for the forward mode:

      type adouble = class
        VAL : double;
        DER : array [1..p] of double;
      end;

    All operations are overloaded, e.g., multiplication:

      operator * (U, V : adouble) W : adouble;
        i : integer;
      begin
        W.VAL := U.VAL * V.VAL;
        for i := 1 to p do
          W.DER[i] := U.VAL*V.DER[i] + U.DER[i]*V.VAL;
      end;
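
    Rendered in C++, the same idea might look as follows; this is a sketch
    akin to a tapeless forward mode, not ADOL-C's actual implementation (the
    type name fwd_adouble and the compile-time p are illustrative):

      #include <cmath>

      constexpr int p = 2;                 // number of tangent directions

      struct fwd_adouble {
          double val;
          double der[p];
      };

      // overloaded multiplication: product rule per direction
      fwd_adouble operator*(const fwd_adouble& u, const fwd_adouble& v) {
          fwd_adouble w;
          w.val = u.val * v.val;
          for (int i = 0; i < p; ++i)
              w.der[i] = u.val * v.der[i] + u.der[i] * v.val;
          return w;
      }

      // overloaded intrinsic: chain rule per direction
      fwd_adouble sin(const fwd_adouble& u) {
          fwd_adouble w;
          w.val = std::sin(u.val);
          for (int i = 0; i < p; ++i)
              w.der[i] = std::cos(u.val) * u.der[i];
          return w;
      }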

  • Remarks:

    - Implements the basic rules of differentiation for
      - each arithmetic operation
      - each intrinsic function
    - Easy to realize → tapeless forward mode of ADOL-C

    Idea for the reverse mode: keep the graph in core.
    Calculating adjoints needs backward pointers → new node type:

    [Node layout: VAL, DER, ADJ fields plus IND pointers to the arguments]

  • W := U * V generates

    [Figure: a node W with fields VAL, DER, IND, ADJ whose IND pointers
    reference the argument nodes U and V, each with their own VAL, DER,
    IND, ADJ fields]

    Data dependency in the opposite direction !!

  • Remarks:

    - Calculation of derivatives is only possible after a
      function evaluation at the same point
    - The complete computational graph is kept in core:
      the amount of internal storage may cause problems
    - Access to the nodes is largely random

    Conclusion:
    Derivative calculation in an active class is OK for the forward mode, but
    not practicable for the reverse mode on large problems !!

  • AD by Internal Representation

    Idea:

      class adouble {
        VAL : double;
        IND : integer;
      }

    W = U * V causes recording onto three tapes:

      INDEXCOUNT += 1
      W.IND = INDEXCOUNT
      W.VAL = U.VAL * V.VAL
      W.VAL                  → VALUE-TAPE
      MULT                   → OPERATION-TAPE
      (U.IND, V.IND, W.IND)  → INDEX-TAPE
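
    A minimal sketch of how such tapes can be interpreted backwards to
    propagate adjoints; this illustrates the principle only and is not
    ADOL-C's actual internal format (all names are illustrative):

      #include <vector>

      enum Op { MULT };                    // only MULT is handled here

      struct Tapes {
          std::vector<Op> ops;             // OPERATION-TAPE
          std::vector<int> index;          // INDEX-TAPE (3 entries per op)
          std::vector<double> value;       // VALUE-TAPE (recorded results)
      };

      // Reverse sweep: interpret the tapes backwards, incrementing adjoints.
      // v holds all intermediate values addressed by their IND location,
      // vbar the corresponding adjoints.
      void reverse(const Tapes& t, const std::vector<double>& v,
                   std::vector<double>& vbar) {
          for (int i = static_cast<int>(t.ops.size()) - 1; i >= 0; --i) {
              int u = t.index[3*i], w = t.index[3*i + 1], r = t.index[3*i + 2];
              if (t.ops[i] == MULT) {      // recorded v_r = v_u * v_w
                  vbar[u] += vbar[r] * v[w];
                  vbar[w] += vbar[r] * v[u];
                  vbar[r] = 0.0;
              }
          }
      }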

  • Remarks for Tape-based Derivatives

    - Mechanical application of the chain rule based on the tapes
    - Data access on the tapes is strictly sequential
    - Tapes can be reused for function / derivative
      calculation at different points
    - Several tapes for different control flows depending on
      active variables

  • General Remarks for Operator Overloading

    Pros:
    - Only choice for C/C++
    - Great flexibility with respect to mode/order
    - Simple to maintain

    Cons:
    - Compiler optimization may be lost
    - Only object code for derivatives

  • Operator Overloading Tools

    C/C++:
    AD01      Cranfield Uni., GB                 forward + reverse
    ADOL-C    Uni. Paderborn, Germany            forward + reverse
    CppAD     Uni. Washington, USA               forward + reverse
    FADBAD    TU Denmark                         forward + reverse
    TADIFF    TU Denmark                         forward
    ...

    Matlab:
    ADMAT     Cayuga Research Associates, USA    forward + reverse
    ...

  • Conclusions

    - Evaluation of derivatives with working accuracy
    - Forward mode: OPS(F′(x)ẋ) ≤ c · OPS(F), c ∈ [2, 5/2]
    - Reverse mode: OPS(ȳᵀ F′(x)) ≤ c · OPS(F), c ∈ [3, 4]
      ⇒ Gradients are cheap: ~ function costs !!
    - Combination: OPS(ȳᵀ F″(x) ẋ) ≤ c · OPS(F), c ∈ [7, 10]
    - Cost of higher derivatives grows quadratically in the degree

  • Conclusions

    - Evaluation of derivatives with working accuracy
    - Efficient evaluation of derivatives of any order
    - Numerous tools available,
      based on source transformation or operator overloading
    - Nondifferentiability only on a meager set
    - Optimal Jacobian accumulation is NP-hard
    - Full Jacobians/Hessians often not needed, or sparse
    - see www.autodiff.org

  • Automatic differentiation by overloading in C++: ADOL-C version 2.1

    - Reorganization of taping:
      tape-dependent information kept in separate structures
    - Different differentiation contexts
    - Documented external function facility
    - Documented fixed-point iteration facility
    - Documented checkpointing facility based on revolve
    - Documented parallelization of derivative calculation
    - Coupled with ColPack for exploitation of sparsity
    - Available at COIN-OR since May 2009

  • Source Code Modifications

    - Include the needed header files; easiest way: #include "adolc.h"
    - Define the region that has to be differentiated:
        trace_on(tag, keep);    // start of the active section
        ...
        trace_off(file);        // ... and its end
    - Mark the independents and dependents in the active section:
        xa <<= xp;    // mark independents
        ya >>= yp;    // mark dependents
    - Declare all active variables of type adouble
    - Calculate derivative objects after trace_off(file)

  • Evaluation Procedure (Lighthouse)

    y₁ = ν·tan(ωt) / (γ − tan(ωt))   and   y₂ = γ·ν·tan(ωt) / (γ − tan(ωt))

    v₋₃ = x₁,  v₋₂ = x₂,  v₋₁ = x₃,  v₀ = x₄ = t
    v₁ = v₋₁ · v₀    = φ₁(v₋₁, v₀)
    v₂ = tan(v₁)     = φ₂(v₁)
    v₃ = v₋₂ − v₂    = φ₃(v₋₂, v₂)
    v₄ = v₋₃ · v₂    = φ₄(v₋₃, v₂)
    v₅ = v₄ / v₃     = φ₅(v₄, v₃)
    v₆ = v₅ · v₋₂    = φ₆(v₅, v₋₂)
    y₁ = v₅
    y₂ = v₆

  • Model Generation with ADOL-C

    [Figure: computational graph of the lighthouse example with inputs
    x₁, ..., x₄ and intermediates v₁, ..., v₆]

    Plain evaluation:

      double x1, x2, x3, x4;
      double v1, v2, ...;
      double y1, y2;

      x1 = 3.7; x2 = 0.7;
      x3 = 0.5; x4 = 0.5;

      v1 = x3*x4;
      v2 = tan(v1);
      v3 = x2-v2;
      v4 = x1*v2;
      v5 = v4/v3;
      v6 = v5*x2;

      y1 = v5; y2 = v6;

    Taped evaluation:

      adouble x1, x2, x3, x4;
      adouble v1, v2, ...;
      double y1, y2;

      trace_on(tag);
      x1 <<= 3.7; x2 <<= 0.7;
      x3 <<= 0.5; x4 <<= 0.5;

      v1 = x3*x4;
      v2 = tan(v1);
      v3 = x2-v2;
      v4 = x1*v2;
      v5 = v4/v3;
      v6 = v5*x2;

      v5 >>= y1; v6 >>= y2;
      trace_off();

  • Easy-to-use routines

    int gradient(tag, n, x[n], g[n]):        tag = tape number
                                             n = # indeps
                                             x[n] = values of indeps
                                             g[n] = ∇f(x)

    int jacobian(tag, m, n, x[n], J[m][n]):  tag = tape number
                                             m = # deps
                                             n = # indeps
                                             x[n] = values of indeps
                                             J[m][n] = F′(x)

    int hessian(tag, n, x[n], H[n][n]):      tag = tape number
                                             n = # indeps
                                             x[n] = values of indeps
                                             H[n][n] = ∇²f(x)
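
    Putting the pieces together, taping the lighthouse example and calling a
    driver might look as follows (a minimal sketch assuming ADOL-C 2.x, with
    the values from the model-generation slide):

      #include <adolc/adolc.h>

      int main() {
          const short tag = 1;
          const int n = 4, m = 2;
          double xp[4] = {3.7, 0.7, 0.5, 0.5};
          double yp[2];

          trace_on(tag);                  // start of the active section
          adouble x1, x2, x3, x4;
          x1 <<= xp[0]; x2 <<= xp[1];     // mark independents
          x3 <<= xp[2]; x4 <<= xp[3];
          adouble v1 = x3 * x4;
          adouble v2 = tan(v1);
          adouble v3 = x2 - v2;
          adouble v4 = x1 * v2;
          adouble v5 = v4 / v3;
          adouble v6 = v5 * x2;
          v5 >>= yp[0]; v6 >>= yp[1];     // mark dependents
          trace_off();

          double** J = myalloc2(m, n);    // ADOL-C helper for a dense array
          jacobian(tag, m, n, xp, J);     // J[i][j] = dF_i/dx_j from the tape
          myfree2(J);
          return 0;
      }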

  • Further drivers for optimization:

    - vec_jac(tag, m, n, repeat, x[n], u[m], z[n])
      computes z = uᵀ F′(x)
    - jac_vec(tag, m, n, x[n], v[n], z[n])
      computes z = F′(x) v
    - hess_vec(tag, n, x[n], v[n], z[n])
      computes z = ∇²f(x) v
    - lagra_hess_vec(tag, n, m, x[n], v[n], u[m], h[n])
      computes h = uᵀ F″(x) v; an extension to uᵀ F″(x) V is available
    - jac_solv(tag, n, x[n], b[n], sparse, mode)
      computes w with F′(x) w = b and stores the result in b

  • Drivers for sparse derivative matrices

    Jacobian:
      sparse_jac(tag, m, n, repeat, x, &nnz, &row_ind, &col_ind,
                 &values, options);
      jac_pat(tag, m, n, x, JP, options);
      generate_seed_jac(m, n, JP, &seed, &p, option);

    Hessian:
      sparse_hess(tag, n, repeat, x, &nnz, &row_ind, &col_ind,
                  &values, options);
      hess_pat(tag, n, x, HP, options);
      generate_seed_hess(n, HP, &seed, &p, option);

    → graph coloring routines by ColPack (Gebremedhin, Pothen)

  • Advanced taping techniques

    [Diagrams: a computation consisting of Part A, Part B, and Part C can be
    recorded on one basic ADOL-C tape, on nested ADOL-C tapes, or with parts
    provided as externally differentiated functions]

  • Differentiation of fixed-point iterations

    [Diagrams: a basic ADOL-C tape records the phases initialization,
    iteration φ(·), and evaluation; the iteration is recorded on a nested
    subtape via an ADOL-C function fp_iteration(), and during the reverse
    evaluation this subtape is evaluated repeatedly]

  • Summary and Outlook

    - AD for C and C++: derivatives of any order
    - Easy-to-use routines for standard derivative objects,
      optimization, ODEs, tensors, ...
    - Randomly accessed memory of the original magnitude
    - Sequential access to data on the tape, i.e., array or file
    - Wide range of applications:
      aerodynamics, chemical engineering, medicine, network optimization, ...

    Several ideas for improvement:
    - runtime activity, parallelization, front ends, ...

    Contact: http://www.coin-or.org/projects/ADOL-C.xml