Erlang's Tail Recursion is Not a Silver Bullet

2011/07/12

Erlang's Tail Recursion is Not a Silver Bullet

One common belief held by Erlang programmers has to do with tail recursion generally being the best way around when writing code. It's absolutely vital when writing long-running processes (you do not want to go out of memory because of your stack!), I'm not debating that here. However, most sources show how to write Erlang functions by always favoring tail recursive solutions. I know Learn You Some Erlang does fall into the same trap.

Why do I call it a trap? Well, because tail recursion is sometimes vital but body recursion never is, it's kind of normal to always show examples with tail recursion so it becomes absolutely natural to everyone. It's also easier to avoid problems because bad tail recursion often has fewer negative side-effects than bad body recursion (the fibonacci function comes to mind here). However, in some cases, tail recursion is actually slowing your application down The cases where this happens include pretty much every function where you build a list. Now I figure this is somewhat limited given how many possible data structures can be used, but the truth is, lists are everywhere and will almost always be a sizeable portion of the code you write in Erlang.

Let's go with an example, the flush function. The flush function will look at the mailbox and remove messages from there until nothing's left. My version here will accumulate the messages in the order they happened:

tail_flush() -> tail_flush([]).
tail_flush(Acc) ->
    receive
        Msg -> tail_flush([Msg|Acc])
    after 0 -> lists:reverse(Acc)
    end.

This is a rather standard tail recursive function. Here's the body-recursive version:

flush() ->
    receive
        Msg -> [Msg|flush()]
    after 0 -> []
    end.

I think few Erlang programmers would disagree with me if I were to say that flush/0 is much simpler and elegant than tail_flush/0. It's easier to understand and more idiomatic. But here's the thing: the tail recursive one is likely the one we'll see the most in existing code (without proof).

I would dare say that tail recursion is not important here. We can just drop it and go with the simplest way of writing code. In both cases, I basically end up creating a list, which will hold all the values of all messages. In one case, I will be creating a list of N elements to be carried on the stack, in the other case, I'll be having a stack that is about the size of N elements, right? The difference is that the tail recursive version will create a lot of garbage by just dropping the accumulator after reversing it. A step by step explanation to this problem and hypothesis were given by Raimo Niskanen on the Erlang mailing list. Here's the theoretical conclusion:

tail recursion should require a top memory consumption of 2 * size(MsgList) words more than body recursion. This will cause extra garbage collection. Also the lists:reverse/1 is extra work that the body recursive variant avoids. In this case I speculate body recursion wins.

So I decided to test this with a crappy micro-benchmark. I usually do not like to trust micro-benchmarks as they are a bit unreliable if you don't think of everything, but still, this seemed a good proof enough for a blog post:

-module(bench).
-compile(export_all).

main() ->
    L = lists:seq(1,100000),
    F = fun(X) -> X end,
    spawn(fun() -> io:format("Tail:~p~n", [element(1, timer:tc(?MODULE, tail_map, [F, L]))]) end),
    spawn(fun() -> io:format("Body:~p~n", [element(1, timer:tc(?MODULE, body_map, [F, L]))]) end).

tail_map(F, L) ->
    tail_map(F, L, []).

tail_map(_, [], Acc) -> lists:reverse(Acc);
tail_map(F, [H|T], Acc) -> tail_map(F, T, [F(H)|Acc]).

body_map(_, []) -> [];
body_map(F, [H|T]) -> [F(H) | body_map(F, T)].

So each process will iterate (while doing nothing) over a list of 100,000 integers. Fairly simple. Here are the results of a few runs, in microseconds:

Body:9384
Tail:14556
Body:11150
Tail:16260
Body:9419
Tail:14732
Body:11386
Tail:17946
Body:11380
Tail:16512
Body:10885
Tail:16876
Body:9923
Tail:14085

Body recursion is consistently faster. This might have to do with the size of each process and garbage collection. Quick tracing showed that in the case of this benchmark, with the arguments given to the function, the body recursive version had to be garbage collected once, while the tail recursive version had to be garbage collected twice. This would seem to support the theory. But hey, this could be related to some concurrency or allocation of processes and whatnot. So here I'm adding another overly simplistic benchmark, where both calls will run one after the other, in a single process:

main2() ->
    spawn(fun() ->
        L = lists:seq(1,100000),
        F = fun(X) -> X end,
        tail_map(F,L),
        body_map(F,L),
        { {A,_}, {B,_}} = {timer:tc(?MODULE, tail_map, [F, L]), timer:tc(?MODULE, body_map, [F, L])},
        io:format("Tail:~p~nBody:~p~n", [A,B])
    end).

Here I'm running both functions once before the timing. In case I ever need to resize the existing process so everything won't need to be garbage collected as much:

Tail:5329
Body:8648
Tail:5292
Body:6873
Tail:6738
Body:8798
Tail:5718
Body:6818
Tail:5266
Body:8669
Tail:5470
Body:6716
Tail:8108
Body:8952
Tail:5461
Body:6545
Tail:5648
Body:8607
Tail:5424
Body:7153

Aw crap. All my pretty conclusions, gone. Now body recursion is slower than tail recursion. However, not all is lost. What we can see is that body recursion is somewhat stable: running the function once or many times seems to yield very similar performances. In the case of tail recursion, having all the required space allocated beforehand and the GC having done its thing already seems to vastly improve the run time compared to a cold start.

Moreover, if I change the benchmarks a bit so that each of the looping functions runs a few hundred times within their own process, body recursion becomes faster again:

main3() ->
    L = lists:seq(1,100000),
    F = fun(X) -> X end,
    spawn(fun() -> io:format("Tail:~p~n", [element(1, timer:tc(?MODULE, m_tail_map, [F, L, 500]))]) end),
    spawn(fun() -> io:format("Body:~p~n", [element(1, timer:tc(?MODULE, m_body_map, [F, L, 500]))]) end).

m_tail_map(_, _, 0) -> ok;
m_tail_map(F, L, N) ->
    tail_map(F,L),
    m_tail_map(F, L, N-1).

m_body_map(_, _, 0) -> ok;
m_body_map(F, L, N) ->
    body_map(F,L),
    m_body_map(F, L, N-1).

And the results:

Body:2807180
Tail:3577161
Body:2810430
Tail:3644639
Body:2852699
Tail:3611445
Body:2898529
Tail:3699570
Body:3016723
Tail:3711662
Body:3027023
Tail:3712915
Body:2933290
Tail:3623219

What does this mean? Well if I can get ahead of myself and pretend I can trust these results for the general case (which is probably not the smartest thing to do), it'd be that it seems right to believe that tail recursion does require more work. Tracing the main/0 function for garbage collection revealed that it ended up needing more heap space towards the end of its run time when calling lists:reverse/1. On the other hand, body recursion was able to do its thing within the space that was already allocated to it. It seems to be that this memory work is what makes tail recursion be slower for list building, although there are definitely cases where this is false.

This could also mean that using the most idiomatic form of recursion would be what you want to use when you need a lower memory footprint, sometimes more speed and cleaner code when building lists. This is a great property of Erlang: writing idiomatic and clean code often leads to optimal solutions. I would still say that it is very, very unlikely that the type of recursion you use when building lists would be the bottleneck of your application. In any case, to quote Raimo (who is quoting von Siemens), to measure is to know.

If you clicked on the Erlang mailing list link a few paragraphs earlier, you will have seen that Raimo added warnings relevent to this case:

In some other implementation where heap and stack are not in the same memory block the situation changes. I know of no such implementation yet. In the Erjang prototype stack space is limited so all body recursion should be avoided. That might be an unacceptable limitation since it impacts so hard at body recursion.

The measurements in this case were done on a single laptop, with a certain architecture, hardware and a given load on it, on the BEAM virtual machine, which has particular optimizations and a particular way to lay out a process' memory. These results are far from guaranteed to be repeatable in other settings (yay! no scientific proof here!). Measure, measure, measure.

One thing that's to remember no matter what is that writing clean code generally leads to better performances. This is something that many languages can not claim as true, but Erlang certainly can. In the general case, the optimisations in the Erlang VM come from observing idiomatic code in production, and clean one at that. This is why body recursion might be better, and why many other optimizations were made. The best recent example I can think of has to do with synchronous messaging patterns, where mailbox scanning has been optimized a little while ago. Clean code is the best code.