I spent several months intermittently reading about Automatic Differentiation, and I was always left scratching my head. Until one day it clicked. Hopefully, this post will spare you all that, and Dual Numbers will click for you by the end of it.
The Quest for Your Derivatives
If you want to differentiate a function, you usually take one of these options:
1- Numerical Differentiation (ND)
2- Symbolic Differentiation (SD)
Symbolic is the way you go in your Calculus course: most of the time, you take a derivative on paper to solve a math problem, or you use software such as Mathematica or SymPy.
Numerical is the way you go when you code in, say, C, or write physics simulations.
(* Of course, you could also write physics simulations using symbolic differentiation, or take a numerical derivative on paper, but I guess the other way round is more common.)
Each one has pros and cons which we will analyze later.
But there happens to be a completely different new kid on the block:
3- Automatic Differentiation (AD)
Here is a comparison of the features of the 3 methods:
| | Symbolic | Numerical | Automatic |
|---|---|---|---|
| Exact? | yes | no | yes |
| Returns | an expression | a number | a number |
| Fast? | yes/no¹ | yes | yes |
| Needs | symbolic expression | function | function |
| Handles arbitrary functions? | no² | yes | yes |
¹ Computing and then simplifying a derivative is usually slow. But you do that only once. After that, you can use the resulting expression to evaluate the derivative at as many points as you want, and that's usually fast.
² Functions with `for` loops, `if` branches, and other programming constructs are beyond the scope of application of SD (what's the derivative of a `for` loop?). On the other hand, ND and AD handle these with ease.
If you are only familiar with Symbolic and Numerical, that table might look quite weird to you (it's exact but returns a number? How come?).
In order to introduce you to Automatic Differentiation, I must first introduce you to Dual Numbers.
Dual Numbers
Remember complex numbers? This has nothing to do with them, but…
You know how we defined

$$i^2 = -1$$

and then suddenly all functions defined on the reals were also defined on the complex numbers by virtue of their Taylor expansion? (If you don't remember, that's ok.)
Well, this is similar, only this time we define

$$\epsilon^2 = 0 \quad (\text{but } \epsilon \neq 0)$$
You can think of $\epsilon$ as a number so small that if you square it, the result is 0. Actually 0. Yet it's not so small that we could actually say that $\epsilon$ itself is just 0. 😕 Does that make any sense? So far the intuition goes…
Unable to resist the temptation to follow the mathematical abstraction caravan, we go ahead and define Dual Numbers as any number of the form $z = a + b\epsilon$, and we call $a$ the real part of $z$ and $b$ the infinitesimal part of $z$, analogous to the real and imaginary parts of complex numbers.
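To get a feel for the arithmetic, note that multiplication follows directly from $\epsilon^2 = 0$:

$$(a + b\epsilon)(c + d\epsilon) = ac + (ad + bc)\,\epsilon + bd\,\epsilon^2 = ac + (ad + bc)\,\epsilon$$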
What shall come out of this unholy abomination? Let’s see…
Evaluating functions on Dual Numbers
Let us take a dual number $a + b\epsilon$ and see what's the result of applying a function $f$ to it:

$$f(a + b\epsilon) = f(a) + f'(a)\,b\epsilon + \frac{f''(a)}{2!}(b\epsilon)^2 + \cdots$$

by virtue of the Taylor expansion of $f$ in $a$. Using the definition of $\epsilon$ above, we finally get that

$$f(a + b\epsilon) = f(a) + f'(a)\,b\epsilon$$
For the special case when $b = 1$, we get:

$$f(a + \epsilon) = f(a) + f'(a)\,\epsilon$$

That is, we get the value of $f$ in the real part, and the derivative of $f$ in the infinitesimal part, of the result.
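For a concrete example, take $f(x) = x^2$ at $a = 3$:

$$(3 + \epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon,$$

and indeed $f(3) = 9$ and $f'(3) = 6$.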
“Alright, but what's the point?
1- We still need to calculate both $f(a)$ and $f'(a)$…
2- Also, setting the infinitesimal part of $a + b\epsilon$, $b$, to anything other than 1 is quite pointless, right?”
Well… hold on. I’ll answer question 2 first.
Chain Rule
I assume you took your basic Calculus course and remember the chain rule. Ok, so let's see what's the result of applying a composition of functions, $f$ and $g$, to the dual number $a + \epsilon$. We know the result should be:

$$f(g(a + \epsilon)) = f(g(a)) + f'(g(a))\,g'(a)\,\epsilon$$
because we know the chain rule. But a computer doesn’t, and we don’t want to implement it. At most, we could implement a DualNumber class and overload some operators, but that’s it.
If we do so, and ask the computer to evaluate $f(g(a + \epsilon))$, it would first compute

$$g(a + \epsilon) = g(a) + g'(a)\,\epsilon$$

and then

$$f(g(a) + g'(a)\,\epsilon) = f(g(a)) + f'(g(a))\,g'(a)\,\epsilon$$

thus, effectively computing the derivative of $f \circ g$ without knowing anything about the chain rule, only knowing how to apply functions on dual numbers.
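To see this with concrete functions, take $g(x) = x^2$ and $f(x) = 3x$. Then

$$g(a + \epsilon) = a^2 + 2a\,\epsilon, \qquad f(a^2 + 2a\,\epsilon) = 3a^2 + 6a\,\epsilon,$$

and $6a$ is exactly $(f \circ g)'(a) = f'(g(a))\,g'(a) = 3 \cdot 2a$: the chain rule fell out of plain dual arithmetic.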
To sum up, having a non-1 infinitesimal part in the dual number above enabled us to compute the derivative of a composition of functions without knowing the chain rule. That answers question 2. Let's tackle question 1.
Automatic Differentiation
If you understood everything above, you are pretty much done. You know that, if you had an implementation for a function $f$, and evaluated it in $x + \epsilon$, you would get

$$f(x + \epsilon) = f(x) + f'(x)\,\epsilon$$

thus, you get the value of $f$ and the derivative for free… Hence the name Automatic Differentiation.
“Wow, hold on, hold on, not so fast…
1- My function `f` doesn't take `DualNumber`, it takes, say, `float`.
2- Also, if I overloaded `f` to also take `DualNumber`, I would still need to implement the code to compute $f'(x)$.
3- Evaluating $f(x + \epsilon)$ will be slow, because it has to compute both $f(x)$ and $f'(x)$.”
1- Yes, if you work in a statically typed language (say, C++). If you work in a dynamically typed language (say, Python), `f` will take anything you throw at it. In a language like C++, you would ideally have your functions implemented with templates, so they would take any type of number.
2- No. `f` will likely be composed of a chain of elementary functions, and all those would already work on `DualNumber`, since the class definition would overload all basic operators. And, by the section above, all compositions of functions which work on `DualNumber` also work on `DualNumber`.
3- That’s the worst case scenario. Even then it wouldn’t be that bad, since you are usually interested in both values. But many times, you can even compute both in the same time it would take to compute only one.
For example, to compute $\sin(a + \epsilon) = \sin(a) + \cos(a)\,\epsilon$ you would have to compute both $\sin(a)$ and $\cos(a)$, but there is a `sincos` function in the GNU C library's `<math.h>` (a GNU extension, not standard C) which returns both `sin` and `cos` in the same time it would take to compute either separately. It does so by using the x86 `FSINCOS` assembly instruction.
As another example, $e^{a + \epsilon} = e^a + e^a\,\epsilon$ has a trivial optimization (see why?).
The takeaway is that the intermediate results for computing $f(a)$ and $f'(a)$ can be shared, enabling optimizations; there are also other, more sophisticated methods for computing both simultaneously.
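As an illustrative sketch of such sharing (hypothetical code, not any particular library's API; `Dual`, `d_exp`, and `d_sin` are names made up for this example), dual-aware elementary functions can reuse the real-part computation for the infinitesimal part:

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    real: float  # real part a
    inf: float   # infinitesimal part b

def d_exp(x: Dual) -> Dual:
    # exp(a + b*eps) = e^a + e^a * b * eps:
    # the single expensive exp() call serves both parts
    e = math.exp(x.real)
    return Dual(e, e * x.inf)

def d_sin(x: Dual) -> Dual:
    # sin(a + b*eps) = sin(a) + cos(a) * b * eps;
    # a sincos-style primitive could produce both values in one go
    return Dual(math.sin(x.real), math.cos(x.real) * x.inf)

print(d_exp(Dual(0.0, 1.0)))  # Dual(real=1.0, inf=1.0)
```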
Either way, you wouldn't have to worry about any of that, since your `f`s would already work with `DualNumber` out of the box, because they are a composition of elementary functions.
Code Sample
Here I show you how this would look in practice. This will be a hypothetical API (at the end of the post I list real ones), which contains a `Dual` class to represent dual numbers. `Dual` receives two numbers as arguments: the first represents the real part and the second the infinitesimal part.
```python
from AD import Dual

def square_it_baby(x):
    return x**2

def factorial(x):
    """Function to showcase that AD handles programming constructs
    such as recursion and 'if' with no hassle."""
    if x < 1:
        return 1
    return x * factorial(x - 1)

print(square_it_baby(Dual(0, 1)))   # prints Dual(0, 0)
print(square_it_baby(Dual(-1, 1)))  # prints Dual(1, -2)
print(factorial(Dual(3, 1)))        # prints Dual(6, 11)
print(factorial(Dual(2.5, 1)))      # prints Dual(3.75, 4)
```
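The `AD` module above is hypothetical, but a bare-bones `Dual` class that would make the sample actually run could look something like this (a sketch, under the assumption that comparisons act on the real part only):

```python
class Dual:
    def __init__(self, real, inf=0.0):
        self.real, self.inf = real, inf

    def _coerce(self, other):
        # promote plain numbers to duals with zero infinitesimal part
        return other if isinstance(other, Dual) else Dual(other)

    def __mul__(self, other):
        other = self._coerce(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.inf + self.inf * other.real)
    __rmul__ = __mul__

    def __sub__(self, other):
        other = self._coerce(other)
        return Dual(self.real - other.real, self.inf - other.inf)

    def __pow__(self, n):
        # d/dx x^n = n * x^(n-1), for a plain-number exponent n
        return Dual(self.real ** n, n * self.real ** (n - 1) * self.inf)

    def __lt__(self, other):
        # comparisons (used by 'if x < 1') look only at the real part
        return self.real < self._coerce(other).real

    def __repr__(self):
        return f"Dual({self.real}, {self.inf})"
```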
N-dimensional case
Your functions will usually take many arguments, not just one. So how do we compute the derivative in such a case?
It depends on what you mean by derivative. You could mean at least 3 things:
1- Partial Derivatives
If you evaluate $f(x_1 + \epsilon, x_2, \dots, x_n)$ you will get:

$$f(x_1, x_2, \dots, x_n) + \frac{\partial f}{\partial x_1}\,\epsilon$$

a partial derivative. What if you put dual numbers in the other arguments as well? We generalize this to…
2- Directional Derivatives
We can say your function takes an n-dimensional vector as an argument. Then, if you evaluate it in $\mathbf{x} + \mathbf{v}\epsilon$, where $\mathbf{x}$ and $\mathbf{v}$ are vectors, you get:

$$f(\mathbf{x} + \mathbf{v}\epsilon) = f(\mathbf{x}) + \nabla f(\mathbf{x}) \cdot \mathbf{v}\;\epsilon$$

by the Taylor expansion of $f$, as usual. Note that what we get in the infinitesimal part is a directional derivative in the direction of $\mathbf{v}$.
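For example, with $f(x, y) = xy$ and direction $\mathbf{v} = (v_1, v_2)$:

$$f(x + v_1\epsilon,\; y + v_2\epsilon) = (x + v_1\epsilon)(y + v_2\epsilon) = xy + (v_1 y + v_2 x)\,\epsilon,$$

and indeed $\nabla f \cdot \mathbf{v} = (y, x) \cdot (v_1, v_2) = v_1 y + v_2 x$.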
Directional derivatives are many times enough for our purposes (such as optimization), but it may happen that what we really want is $\nabla f(\mathbf{x})$. Setting $\mathbf{v} = (1, 1, \dots, 1)$ won't help us, since we would get the directional derivative in the direction of the vector $(1, 1, \dots, 1)$. So, what we do is…
3- Vector Derivative (or Gradient)
We define a vector $\boldsymbol{\epsilon} = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)$ such that:

$$\boldsymbol{\epsilon}\,\boldsymbol{\epsilon}^T = 0 \quad (\text{but } \boldsymbol{\epsilon} \neq \mathbf{0})$$

Or, in terms of components, we define numbers $\epsilon_i$ for $i = 1, \dots, n$ such that:

$$\epsilon_i \epsilon_j = 0 \text{ for all } i, j \quad (\text{but } \epsilon_i \neq 0)$$
Our multi-dual numbers will now have multiple infinitesimal parts (or a multi-infinitesimal part). They will be numbers of the form $z = a + \mathbf{b} \cdot \boldsymbol{\epsilon}$, and we will call $a$ the real part of $z$ and the vector $\mathbf{b}$ the multi-infinitesimal part of $z$.
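Arithmetic works as before; for example, multiplication follows from $\epsilon_i \epsilon_j = 0$:

$$(a + \mathbf{b} \cdot \boldsymbol{\epsilon})(c + \mathbf{d} \cdot \boldsymbol{\epsilon}) = ac + (a\mathbf{d} + c\mathbf{b}) \cdot \boldsymbol{\epsilon}$$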
Then we evaluate our function in a vector of multi-dual numbers $\mathbf{x} + B\boldsymbol{\epsilon}$, where $\mathbf{x}$ is a vector and $B$ is a square matrix. We get:

$$f(\mathbf{x} + B\boldsymbol{\epsilon}) = f(\mathbf{x}) + \nabla f(\mathbf{x})^T B\,\boldsymbol{\epsilon}$$

by the Taylor expansion of $f$, as usual. For the special case when $B = I$ (the identity matrix), we get:

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \nabla f(\mathbf{x}) \cdot \boldsymbol{\epsilon}$$

that is, we get $\nabla f(\mathbf{x})$ in the multi-infinitesimal part of the result.
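To make this concrete, here is a minimal sketch of how multi-dual numbers might look in Python (hypothetical, illustrative code; only `+` and `*` are overloaded, and `gradient` is a helper name introduced here, not part of any API mentioned above):

```python
class MultiDual:
    def __init__(self, real, inf):
        self.real = real        # real part a
        self.inf = list(inf)    # multi-infinitesimal part, vector b

    def _coerce(self, other):
        if isinstance(other, MultiDual):
            return other
        return MultiDual(other, [0.0] * len(self.inf))

    def __add__(self, other):
        other = self._coerce(other)
        return MultiDual(self.real + other.real,
                         [u + v for u, v in zip(self.inf, other.inf)])
    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        # eps_i * eps_j = 0: only products of a real part with an
        # infinitesimal part survive
        return MultiDual(self.real * other.real,
                         [self.real * v + u * other.real
                          for u, v in zip(self.inf, other.inf)])
    __rmul__ = __mul__

def gradient(f, xs):
    """Seed the i-th argument with the i-th row of the identity (B = I)."""
    n = len(xs)
    duals = [MultiDual(x, [1.0 if j == i else 0.0 for j in range(n)])
             for i, x in enumerate(xs)]
    out = f(*duals)
    return out.real, out.inf   # f(x) and its gradient

# f(x, y) = x*y + x, so grad f = (y + 1, x)
value, grad = gradient(lambda x, y: x * y + x, [2.0, 3.0])
print(value, grad)   # 8.0 [4.0, 2.0]
```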
Application case: Backpropagation
N-dimensional code sample
Generalizations
Forward-reverse
Higher order Derivatives