Folding vs. Currying

In language design, there is a tradeoff to make between implicit folding and implicit currying.

Let’s take the + function as an example: it is a function that takes 2 arguments and returns 1 result. What if we passed 3 or more arguments to it instead? What if we passed fewer than 2? In principle, it doesn’t mean anything. So we have 3 options:

1- If the user passes more or fewer than 2 arguments, raise an error. We have a left operand and a right operand; more or fewer than 2 arguments has no meaning, so what would you even mean by that?

This is the path that Python takes:

>>> from operator import add
>>> add(2,3)
5
>>> add(1,2,3) # oops! expected 2 arguments, got 3
>>> add(1)     # oops! expected 2 arguments, got 1

2- If the user passes any number of arguments, add them all. Thus, you are performing an implicit fold.

This is the path that Lisp takes:

(+ 1 2 3) ; => 6
(+ 1) ; => 1

3- If the user passes less than the required arguments (2 in this case), partially apply the function to the supplied arguments. Thus, the + is implicitly curried.

This is the path that Haskell takes:

(+ 7) -- a function of 1 argument that adds 7 to the number you pass as argument

PROS & CONS

Raising Error

By defining what it means to pass fewer or more than the required number of arguments to a function, you get to squeeze more meaning into your language, potentially resulting in a smaller, more concise language. By choosing option number 1 (raising an error), you miss that opportunity. A pro is that eventually some newbie will pass the wrong number of arguments by accident, and you want to let them know.

Implicit folding

A pro of option 2 (implicit folding) is that you can pass any number of arguments and it always has a meaning. On the other hand, option number 3 (implicit currying) only has a meaning if you pass fewer than the required arguments (passing more is still an error).

For functions like + and *, this is very useful; you will use folding with these functions a lot. In general, it is useful for functions of signature a→b→a (with + and * being a particular case, with signature a→a→a). As a con, not all functions have such signatures, and it is unclear how implicit folding should be generalized to the rest.
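To make the fold explicit, here is how a Lisp-style variadic + could be recovered with an ordinary fold in Python (variadic is just a name made up for this illustration):

from functools import reduce
from operator import add

def variadic(f, identity):
    """Turn a 2-argument function of signature a→b→a into a Lisp-style variadic one."""
    def g(*args):
        return reduce(f, args, identity)
    return g

plus = variadic(add, 0)
print(plus(1, 2, 3))  # 6, like (+ 1 2 3)
print(plus(1))        # 1, like (+ 1)
print(plus())         # 0, the identity element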

Implicit Currying

Option 3 generalizes easily to any function, unlike option 2. If you think in terms of functions and function composition (instead of function application), this is very handy: you will be creating partially applied functions all the time.

This won’t be a substitute for general partial application (what if you wanted to partially apply a function on its second argument?), but it still covers many use cases.
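For comparison, here is what general partial application on the second argument looks like in Python with functools.partial (a tiny sketch, unrelated to any of the three languages above):

from functools import partial

def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)  # partially apply on the *second* argument
print(square(5))  # 25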

Conclusion

My personal favorite is option 3, implicit currying. What about you?

 

 

 

 


The Efficient Frontier

The concept of efficient frontier arises naturally when asking the portfolio selection question (i.e. What combination of stocks and bonds should I buy?)

The Efficient Frontier is the set of all portfolios which give a minimum risk for a given expected return (ergo, you should always seek to invest in the Efficient Frontier).

The derivation of the Efficient Frontier equation was first published in…. by…

I haven’t found many resources explaining the subject in-depth, beyond the original paper; so here is a derivation of the Efficient Frontier equation, modernized to soothe your vector-notation cravings.

Prerequisite: Lagrange multipliers

We often find the problem of constrained optimization, which can be stated as:

minimize f(X)

Subject to

\forall_i \: c_i (X) = 0

Lagrange multipliers reduce this problem to an unconstrained optimization problem. The previous problem is reduced to:
minimize f + \sum_i \lambda_i c_i
(* A generalization of Lagrange multipliers is given by the KKT conditions, in which the constraints can be a mix of equalities and inequalities. We will revisit these when we analyze the case of portfolio selection where you can’t hold short positions.)
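A quick toy example, just to fix ideas: minimize f(x,y) = x^2 + y^2 subject to x + y - 1 = 0 . The unconstrained version is

minimize x^2 + y^2 + \lambda (x + y - 1)

and setting the derivatives with respect to x , y and \lambda to zero gives 2x + \lambda = 0 , 2y + \lambda = 0 and x + y = 1 , so x = y = \frac{1}{2} : the point of the line closest to the origin, as expected.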

Derivation

Statement of the problem

We have the option to invest in several assets. Our portfolio is vector P , the ratios of our total capital that we invest in each asset. Since they are ratios and we are supposed to invest all our money, they add up to 1:

1 = \textbf{1}.P

The return of an asset on a given time interval is a random variable. As there are usually many assets in which we could invest, we are dealing with a vector of random variables X . The return of our portfolio will also be a random variable, given by:

x_p = X . P

The expected return of an asset, on the other hand, is just a number. Therefore, for a vector of expected returns \mu = E(X) , the expected return of our portfolio will be:

\mu_p = \mu . P

The risk of a single asset is statistically described by its variance \sigma^2 . But since we are dealing with a vector of assets, the quantity of interest is its covariance matrix C = cov(X) . By the definition of covariance, C will be a symmetric matrix. The risk of our portfolio will be:

\sigma_p^2 = P^T . C . P

We want to minimize this risk (or minimize half this risk, which is the same), for a given expected return \mu_p . We are now ready to formalize our problem:

minimize \frac{1}{2} P^T . C . P

Subject to:

1 = \textbf{1}.P
\mu_p = \mu . P

Solution

By Lagrange multipliers, an equivalent formulation of the above problem is

minimize \frac{1}{2} P^T . C . P + \lambda_1 (1 - \textbf{1}.P) + \lambda_2 (\mu_p - \mu . P)

Taking the derivatives of this expression with respect to \lambda_1 , \lambda_2 and P , and setting them to zero, yields a system of equations that determines the solution of the problem (for the derivative with respect to P , note that the gradient of \frac{1}{2} P^T . C . P is C . P , since C is symmetric). We get:

C . P = \lambda_1 \textbf{1} + \lambda_2 . \mu
1 = \textbf{1}.P
\mu_p = \mu . P

To simplify this, let’s define:

\lambda = (\lambda_1, \lambda_2)
U = (\textbf{1}, \mu)
U_p = (1, \mu_p)

With these definitions, the above equations turn into:

C. P = U. \lambda
U_p = U^T . P

Solving these equations yields:

\lambda = M^{-1} . U_p
P = C^{-1}.U. M^{-1} . U_p
where M = U^T . C^{-1} . U

Note that we finally got our risk minimizing portfolios P , as a function of portfolio return \mu_p . By the definition of \sigma_p^2 in the previous section, the risk of such portfolios is:

\sigma_p^2 = U_p^T . M^{-1} . U_p

Note that this gives us the minimum possible risk for a given return \mu_p , and it turns out to be a quadratic function.

With Risk-Free Asset

Besides investing in risky assets (stocks), we can also invest in a risk-free asset (bonds). This particular case is usually given a separate treatment. Instead, we will just say that in this particular case the matrix C becomes singular, so we replace inverses by pseudo-inverses where appropriate in the equations above. That’s all.

No short-position

Statement

It may happen that we are not allowed to hold short positions in some assets. That means all components of P must be non-negative. The resulting problem is:

minimize \frac{1}{2} P^T . C . P

Subject to:

1 = \textbf{1}.P
\mu_p = \mu . P
P \geq 0

Solution

Implementation

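A minimal NumPy sketch of the closed-form solution derived above (the three-asset numbers at the bottom are made up, just to exercise the code):

import numpy as np

def efficient_frontier(mu, C, mu_p):
    """Minimum-risk portfolio P and its variance, for a target expected return mu_p.

    mu : vector of expected asset returns, shape (n,)
    C  : covariance matrix of asset returns, shape (n, n)
    """
    ones = np.ones_like(mu)
    U = np.column_stack([ones, mu])      # U = (1, mu)
    U_p = np.array([1.0, mu_p])          # U_p = (1, mu_p)
    C_inv = np.linalg.pinv(C)            # pseudo-inverse also covers the risk-free-asset case
    M_inv = np.linalg.pinv(U.T @ C_inv @ U)
    P = C_inv @ U @ M_inv @ U_p          # P = C^-1 . U . M^-1 . U_p
    variance = U_p @ M_inv @ U_p         # sigma_p^2 = U_p^T . M^-1 . U_p
    return P, variance

# made-up example with three assets
mu = np.array([0.05, 0.08, 0.12])
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
P, var = efficient_frontier(mu, C, mu_p=0.10)
print(P, P.sum(), var)                   # the weights sum to 1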

Ditching Backpropagation: Automatic Differentiation and Dual Numbers

I spent several months intermittently reading about Automatic Differentiation, and I was always left scratching my head. Until one day it clicked. Hopefully, I will spare you all that, and Dual Numbers will click for you by the end of this post.

The Quest for your Derivatives

When you want to differentiate a function, you usually take one of these options:

1- Numerical Differentiation (ND)
2- Symbolic Differentiation (SD)

Symbolic is the way you went in your Calculus course: most of the time, you take a derivative on paper to solve a math problem, or you use software such as Mathematica or SymPy.

Numerical is the way you go when you code in, say, C, or make physics simulations.

(* Of course, you could also make physics simulations using symbolic differentiation, or take a numerical derivative on paper, but I guess the other way round is more common)

Each one has pros and cons which we will analyze later.

But there happens to be a completely different new kid on the block:

3- Automatic Differentiation (AD)

Here is a comparison of the features of the 3 methods:

                               Symbolic                Numerical    Automatic
Exact?                         yes                     no           yes
Returns                        an expression           a number     a number
Fast?                          yes/no¹                 yes          yes
Needs                          a symbolic expression   a function   a function
Handles arbitrary functions?   no²                     yes          yes

¹ computing and then simplifying a derivative is usually slow. But you only do that once. After that, you can use the resulting expression to evaluate the derivative at as many points as you want, and that is usually fast.

² functions with for loops, if branches and other programming constructs are beyond the scope of application of SD (what’s the derivative of a for loop?). On the other hand, ND and AD handle these with ease.

If you are only familiar with Symbolic and Numerical, that table might look quite weird to you (it’s exact but returns a number? How come?)

In order to introduce you to Automatic Differentiation, I must first introduce you to Dual Numbers.

Dual Numbers

Remember complex numbers? This has nothing to do with them, but…

You know how we defined
i^2 = -1
and then suddenly all functions defined on the reals were also defined on the complex numbers by virtue of their Taylor expansion? (If you don’t remember, that’s ok )

Well, this is similar, only this time we define

\epsilon ^ 2 = 0 (but \epsilon \neq 0 )

You can think of \epsilon as a number so small that, if you square it, the result is 0. Actually 0. Yet it’s not so small that we could say \epsilon is just 0. 😕 Does that make any sense? So much for the intuition…

Unable to resist the temptation to follow the mathematical abstraction caravan, we go ahead and define Dual Numbers as any number w of the form a + \epsilon b , and we call a = Re(w) the real part of w and b = In(w) the infinitesimal part of w , analogous to the real and imaginary parts of complex numbers.

What shall come out of this unholy abomination? Let’s see…

Evaluating functions on Dual Numbers

Let us take a dual number w = a + \epsilon b and see what’s the result of applying a function f to it:

f(w) = f(a+\epsilon b) = f(a) + f'(a)(\epsilon b) + f''(a) \frac{(b \epsilon)^2}{2}+O(\epsilon^3)

by virtue of the Taylor expansion of f in a . Using the definition of \epsilon above, we finally get that

f(w) = f(a+\epsilon b) = f(a) + \epsilon (f'(a) b)

For the special case when b=1 , we get:

f(a+\epsilon) = f(a) +\epsilon f'(a)

That is, we get the value of f in the real part of f(w) and the derivative of f in the infinitesimal part of f(w) .
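For a concrete check, take f(x) = x^2 and a = 3 :

(3 + \epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon

so the real part is f(3) = 9 and the infinitesimal part is f'(3) = 6 , as promised.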

“Alright, but what’s the point?
1- We still need to calculate both f(a) and f'(a) … 2- Also, setting the infinitesimal part of w , b , to anything other than 1 is quite pointless, right?”

Well… hold on. I’ll answer question 2 first.

Chain Rule

I assume you took your basic Calculus course and remember the chain rule. Ok, so let’s see what’s the result of applying a composition of two functions, f and g , to the dual number a + \epsilon . We know the result should be:

h(a+\epsilon) = f(g(a+\epsilon)) = h(a) + \epsilon h'(a) = f(g(a)) + \epsilon f'(g(a)) g'(a)

because we know the chain rule. But a computer doesn’t, and we don’t want to implement it. At most, we could implement a DualNumber class and overload some operators, but that’s it.

If we do so, and ask the computer to evaluate h(a+\epsilon) = f(g(a+\epsilon )) , it would first compute

k = g(a+\epsilon) = g(a) + \epsilon g'(a)

and then

f(k) = f(Re(k)) + \epsilon f'(Re(k)) In(k) =  f(g(a)) + \epsilon f'(g(a)) g'(a)

thus, effectively computing the derivative of h without knowing anything about the chain rule, only knowing how to apply functions on dual numbers.

To sum up, having a non-1 infinitesimal part on the dual number k above enabled us to compute the derivative of a composition of functions without knowing the chain rule. That answers question 2. Let’s tackle question 1.

Automatic Differentiation

If you understood everything above, you are pretty much done. You know that, if you had an implementation for a function f , and evaluated it in a + \epsilon , you would get

f(a+\epsilon) = f(a) + \epsilon f'(a)

thus, you get the value of f and the derivative for free… Hence the name Automatic Differentiation.

“Wow, hold on, hold on, not so fast…
1- My function f doesn’t take DualNumber; it takes, say, float.
2- Also, if I overloaded f to also take DualNumber, I would still need to implement the code to compute f' .
3- Evaluating f(a+\epsilon) will be slow, because it has to compute both f(a) and f'(a) .”

1- Yes, if you work in a statically typed language (say, C++). If you work in a dynamically typed language (say, Python), f will take anything you throw at it. In a language like C++, you would ideally have your functions implemented with templates, so they would take any type of number.

2- No. f will likely be composed of a chain of elementary functions, and all those would already work on DualNumber, since the class definition would overload all basic operators. And, by the section above, all compositions of functions which work on DualNumber also work on DualNumber.

3- That’s the worst-case scenario. Even then it wouldn’t be that bad, since you are usually interested in both values. But many times, you can compute both in the same time it would take to compute only one.
For example, to compute sin(x+\epsilon) you have to compute both sin(x) and cos(x) , but there is a sincos function (a common extension to the C <math.h> library, e.g. in glibc) which returns both sin and cos in the same time it would take to compute either separately. It can do so by using the x87 FSINCOS instruction.
As another example, e^{x+\epsilon} has a trivial optimization (see why?)

The takeaway is that the intermediate results for computing f and f' can be shared, enabling optimizations, and there are other, more sophisticated methods for computing both simultaneously.
Either way, you wouldn’t have to worry about any of that, since your f s would already work with DualNumber out of the box, because they are compositions of elementary functions.

Code Sample

Here I show you how this would look in practice. This will be a hypothetical API (at the end of the post I list real ones), which contains a Dual class to represent dual numbers. Dual receives two numbers as arguments: the first represents the real part and the second the infinitesimal part.

from AD import Dual

def square_it_baby(x):
    return x**2

def factorial(x):
    """
    function to showcase that AD handles
    programming constructs such as
    recursion and 'if' with no hassle
    """
    if x<1:
        return 1
    return x*factorial(x-1)

print( square_it_baby(Dual(0, 1)))   # prints Dual(0, 0)
print( square_it_baby(Dual(-1, 1)))  # prints Dual(1, -2)
print( factorial(Dual(3, 1)) )       # prints Dual(6, 11)
print( factorial(Dual(2.5, 1)))     # prints Dual(3.75, 4)
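In case you are curious, here is a minimal sketch of what such a Dual class could look like under the hood; it overloads just enough operators to run the examples above (a real AD library would cover every operator and elementary function):

class Dual:
    def __init__(self, re, inf=0.0):
        self.re, self.inf = re, inf    # real and infinitesimal parts

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.re + other.re, self.inf + other.inf)
    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.re - other.re, self.inf - other.inf)

    def __mul__(self, other):
        # (a + eps b)(c + eps d) = ac + eps (ad + bc), since eps^2 = 0
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.re * other.re,
                    self.re * other.inf + self.inf * other.re)
    __rmul__ = __mul__

    def __pow__(self, n):
        # d/dx x^n = n x^(n-1), for a constant exponent n
        return Dual(self.re ** n, n * self.re ** (n - 1) * self.inf)

    def __lt__(self, other):
        # comparisons only look at the real part
        other = other.re if isinstance(other, Dual) else other
        return self.re < other

    def __repr__(self):
        return "Dual({}, {})".format(self.re, self.inf)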

N-dimensional case

Your functions will usually take many arguments, not just one. So how do we compute the derivative in such case?

It depends on what you mean by derivative. You could mean at least 3 things:

1- Partial Derivatives

If you evaluate f(x+\epsilon, y, z) you will get:

f(x+\epsilon, y, z) = f(x, y, z) +\epsilon f_x(x, y, z)

a partial derivative. What if you put dual numbers for the other arguments as well? We generalize this to…

2- Directional Derivatives

We can say your function takes an n-dimensional vector as an argument. Then, if you evaluate it at \textbf{a} +\epsilon \textbf{b} , where \textbf{a} and \textbf{b} are vectors, you get:

f(\textbf{ a} +\epsilon \textbf{ b} ) = f(\textbf{a} ) +\epsilon \nabla f(\textbf{a} ) . \textbf{ b}

by the Taylor expansion of f , as usual. Note that what we get for the infinitesimal part is a directional derivative in the direction of \textbf{ b} .

Directional derivatives are often enough for our purposes (such as optimization), but it may happen that what we really want is \nabla f . Setting \textbf{b} = \textbf{1} won’t help us, since we would get the directional derivative in the direction of the vector \textbf{1} . So, what we do is…

3- Vector Derivative (or Gradient)

We define a vector \boldsymbol{ \epsilon} such that:

\boldsymbol{ \epsilon} \otimes \boldsymbol{ \epsilon} = \textbf{0} (but \boldsymbol{ \epsilon} \neq \textbf{ 0} )

Or, in terms of components, we define n numbers \epsilon_i for i \in [1,n] such that:

\forall_{i, j} \:\epsilon_i \epsilon_j = 0  (but \forall_i \:\epsilon_i \neq  0)

Our multi-dual numbers will now have multiple infinitesimal parts (or a multi-infinitesimal part). They will be numbers of the form w = a + \textbf{b} . \boldsymbol{\epsilon} , and we will call a = Re(w) the real part of w and the vector \textbf{b} = In(w) the multi-infinitesimal part of w .

Then we evaluate our function in a vector of multi-dual numbers, \textbf{ a} + \textbf{ B} . \boldsymbol{\epsilon} where \textbf{ a} is a vector and \textbf{ B} is a square matrix. We get:

f(\textbf{a} +\textbf{B} . \boldsymbol{\epsilon} ) = f(\textbf{a} ) + \nabla f(\textbf{a} ) .\textbf{B} . \boldsymbol{ \epsilon}

by the Taylor expansion of f , as usual. For the special case when \textbf{B} = \textbf{I} (the identity matrix), we get:

f(\textbf{a} +\boldsymbol{ \epsilon} ) = f(\textbf{a} ) +\nabla f(\textbf{a} ). \boldsymbol{ \epsilon}

that is, we get \nabla f in the multi-infinitesimal part of f(\textbf{a} +\boldsymbol{\epsilon} ) .

Application case: Backpropagation

N-dimensional code sample
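Purely as an illustration (the class and function names here are mine), a self-contained sketch of the \textbf{B} = \textbf{I} case, with the infinitesimal part stored as a NumPy vector, so that the multi-infinitesimal part of the result is \nabla f :

import numpy as np

class MultiDual:
    """A number of the form a + b . eps, where b is a vector of infinitesimal parts."""
    def __init__(self, re, inf):
        self.re, self.inf = re, np.asarray(inf, dtype=float)

    def __add__(self, other):
        if isinstance(other, MultiDual):
            return MultiDual(self.re + other.re, self.inf + other.inf)
        return MultiDual(self.re + other, self.inf)
    __radd__ = __add__

    def __mul__(self, other):
        if isinstance(other, MultiDual):
            # eps_i eps_j = 0, so only first-order terms survive
            return MultiDual(self.re * other.re,
                             self.re * other.inf + self.inf * other.re)
        return MultiDual(self.re * other, self.inf * other)
    __rmul__ = __mul__

def gradient(f, a):
    """Evaluate f at a + I . eps; the multi-infinitesimal part of the result is grad f(a)."""
    n = len(a)
    args = [MultiDual(a[i], np.eye(n)[i]) for i in range(n)]
    result = f(*args)
    return result.re, result.inf

# made-up example: f(x, y) = x*y + 3*x, whose gradient is (y + 3, x)
value, grad = gradient(lambda x, y: x * y + 3 * x, [2.0, 5.0])
print(value, grad)   # 16.0 [8. 2.]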

Generalizations

Forward-reverse
Higher order Derivatives

AD libraries

A different view on Functional Programming

If you have never heard about functional programming before, I’ll have the honor to instill in you my view on it. 

If you have, you might know that it’s usually defined by some of its features, such as: reification of functions, avoidance or banning of functions with side-effects, use of higher-order functions and so on. But, doesn’t that sound like a collection of random features? What’s the motivation behind them?

I’ll present you a view which, at least to me, makes that set of features seem coherent; and functional programming, like a deep and beautiful paradigm:
Functional programming attempts to reduce programs to equations.

An example

Say we have a list of real numbers x and we wish to calculate the geometric mean of the positive numbers in that list. 

Good old C

#include <math.h>
// the caller of this function must keep track of
// the length of x and pass it explicitly,
// otherwise we couldn't know how long it is
double geometric_mean(double * x, int length) {
    int n = 0;
    double prod = 1.0;
    int i;
    for(i = 0; i< length; i++) {
        if( x[i] > 0 ){
            n++;
            prod = prod * x[i];
        } 
    }
    return pow(prod, 1./(double) n);
} 

 

Python

def geometric_mean(x):
    positives = [ e for e in x if e > 0 ]
    product = 1
    for p in positives:
        product *= p
    return product ** (1./len(positives))

 

Functional Python

from functools import reduce  # needed in Python 3

def geometric_mean(x):
    positives = list(filter(lambda a: a > 0, x))
    product = reduce(lambda a, b: a * b, positives, 1)
    return product ** (1. / len(positives))

 
“Well, that’s supposedly functional code, and it still doesn’t look like an equation” – I can hear you say

It does, we only need to change the syntax (semantics will stay the same), and the underlying mathematical structure will flourish before your eyes.

Meet APL

geometric \textunderscore mean \leftarrow \{(\times/P) \: \hat{} \div \rho P \leftarrow \omega [0<\omega]\}

Yes, that’s code. It’s a language and you can actually write your next program in it. It’s called APL.
This is beautiful in two ways:
1- It makes the mathematical structure of programs explicit (so it’s easier to check their correctness) and makes code super concise (less room for bugs).

2- It extends the mathematical language so that you can use it to express algorithms (if you are mathematically savvy, think: how would you describe how to sort a list using standard mathematical language alone?). This was the original intent of Ken Iverson, the creator of APL; only later were computers able to execute it. (Isn’t it weird that we have to resort to natural language or pseudo-code to express algorithms, which are inherently mathematical entities?)

All of this effectively blurs the line between mathematics and code.

J

People looked down on APL for its non-ASCII character set. Therefore, Ken Iverson created J, a successor of APL which restricts itself to the ASCII character set and incorporates more powerful concepts along the way, such as tacit programming.
Here are the 3 Pythagorean means implemented in J:

arithmetic_mean =: (+/ % #)
geometric_mean  =: (+/ % #) &. ^.
harmonic_mean   =: (+/ % #) &. %

 

Those are not totem-pole smileys; they are wonderfully abstract and concise tacit functions.

Conclusion

I hope I have whetted your interest in functional programming. Remember, you don’t need to learn APL or J to think along the lines of this paradigm, as long as you can see through the syntax of imperative languages.

(* Just in case all that wasn’t enough to convince you of the superiority of the functional gospel, take this threat: FP will displace OOP, so start thinking along our lines or you will eventually lose your job)

A neat currying implementation in Python 

I wanted to share what I think is a neat implementation of currying in Python.

Getting the arity of a function

First we need to be able to get the arity of a function. This is how we do it:

import inspect

def arity(f):
  # Python 3.3+
  return len(inspect.signature(f).parameters)
  # in older versions, use:
  # args, varargs, varkw, defaults = inspect.getargspec(f)
  # return len(args)
  

 

Now, behold!

def curry(f):
  def g(*args):
    if len(args) < arity(f):
      return lambda *moreargs: g(*(args+moreargs))
    return f(*args)
  return g

Let’s test it

@curry
def f(a, b, c, d, e):
  print("finally calling 'f', with args:", a, b, c, d) 
  result = a + b + c + d + e
  print("result was:", result) 
  return result

f1 = f(2)         # f1 is a curried function of arity 4
f2 = f1(1, 3)     # f2 is a curried function of arity 2
f2(4, 5)          # returns 15
f2(4)(5)          # returns 15 too  

Papercraft with Blender

Blender version 2.79 incorporated functionality which enables you to fabricate 3D models by folding paper. I decided to try it out.

Drawing the model

I downloaded a 3D model of a goat and re-meshed it so that it had fewer polygons and it would be easier to assemble. This is how it looks.

Printing the templates

Now it was time to let the Blender plugin do the mathemagic. After some parameter tweaks, this is a sample of what I got printed on a cardboard.

Folding and glueing

I used a mix of glue and paper tape. 

After several hours of folding and pasting, here is the final result:

Conclusion 

Although this is a really time-consuming endeavor, it’s a fun way to burn away a whole weekend, plus you’ll have a lasting reminder that you lived that weekend, as opposed to other weekends in which you just spent the time and got nothing for it. The end results look superb.

Some ideas floating my head:

  • What if I plasma-cut it from a metal sheet with a CNC and TIG-welded it? 🤔
  • What if I filled the hull with some other material? Would it have any application? 
  • How do I make it faster? 

Remember to be nice to goats and other animals; keep them off your plate.