

On the hunt for a bug with a hump and a long tail

In many of my ReVault posts, I've hinted at performance issues that needed a look. One of the patterns I had noticed by looking into my laptop's network monitor was one of a hump and a long tail, which I finally managed to reproduce with OpenTelemetry instrumentation, here rendered by Honeycomb:

Metrics from a server showing network transfers: count (of receive operations), the rate of growth of the buffered data, the number of packets received, and the number of bytes received all display a pattern of an early, spiky hump, followed by a long flat tail

This graph shows, for a single query, the number of receive calls (COUNT), the rate of growth of in-memory buffered data (RATE_MAX(buf)), the number of packets received (RATE_MAX(sock.recvcnt)), and the number of bytes received for a file's transfer (RATE_MAX(sock.recvoct)).

The pattern is shared across all charts: a rapid rise followed by a quick decrease (the hump), and a long tail of data being received. This felt abnormal; the usual pattern would be more like the following one, where, as the connection warms up and the TCP window widens, more and more traffic is allowed until it more or less stabilizes:

a graph showing the y-axis (bandwidth) and x-axis (time) with a logarithmic curve, rapidly rising and then flattening out

And, well, my graph looks like that at first, but then it craters. I ran another transfer while capturing the packets to see what was going on with the connection, and I spotted a bunch of events where the TCP window would be filled, which sort of lined up with the gaps on the same graph I posted earlier:

same graph as the first one in the article but with the dips highlighted

A full TCP window turns into a reduced window size that quickly ramps back up, and the fact that it happened repeatedly pointed at a connection that was indeed getting saturated, but not because the pipe was too small; most likely, the consumer side was too slow to pull data out of the buffer.

I have written proxies and routers in Erlang and I damn well know that it can do better than this laughable 50KB/s on a local WiFi with no transmission error. So something weird was happening with the code I had written.

From the resource monitor's graph this could make sense: the buffer is empty and a glut of packets is sent without a problem; that's the hump. Then the buffer fills and transmission dies down to whatever the software can pull out; that's the long tail. But the diagram above is measured from within Erlang. That means this explanation is wrong: the graphs come from my own software, so they must represent actual in-code operations running fast, then slow.

So I did the reasonable thing, and ran an even bigger transfer with even more metrics graphed:

Metrics from a server showing network transfers: count (of receive operations), the rate of growth of the buffered data, the number of packets received, and the number of bytes received all display a pattern of an early, spiky hump, followed by a long flat tail; values for the process's memory remain flat

The key metrics standing out here are:

- the process's own memory usage, which stays mostly flat for the whole transfer
- the reduction count (the VM's measure of work done), which keeps accruing steadily rather than dipping

On their own these may not feel too significant, but here's why they matter. First, there's the code that the server actually runs:

worker_loop(Dir, C=#conn{localname=Name, sock=Sock, buf=Buf0}) ->
    receive
        %% ... when sent TLS data
        {ssl, Sock, Data} ->
            %% Take the data and append it to the buffer, then try to
            %% decode it
            {Unwrapped, IncompleteBuf} = unwrap_all(<<Buf0/binary, Data/binary>>),

            %% Forward unpacked messages if any
            [revault_tls:send_local(Name, {revault, {self(), Marker}, Msg})
             || {revault, Marker, Msg} <- Unwrapped],

            %% recurse
            worker_loop(Dir, C#conn{buf = IncompleteBuf});
        %% ...
    end.
The fun bit here is <<Buf0/binary, Data/binary>>, a construct where the new data (Data) is appended to the old buffer (Buf0) in a single binary expression. This operation is generally treated as "constant" time, in the sense that the sizes of the buffer and the data barely matter: both just get moved into a new area as a single new data structure (technically it's linear, but with a nearly imperceptible cost compared to other operations). This pattern is a common one used in most Erlang and Elixir web servers, though those tend to parse short header lines rather than multi-megabyte well-delimited payloads.
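As an illustration, the same append-to-one-binary pattern can be sketched in isolation (hypothetical module name, not part of ReVault): each step builds a fresh binary out of the previous buffer plus the incoming chunk.

```erlang
-module(append_buf).
-export([accumulate/1]).

%% Fold a list of chunks into a single binary the way the worker loop
%% does: every iteration constructs a new binary from the old buffer
%% and the new chunk. Simple and correct, and usually cheap while the
%% buffer stays small.
accumulate(Chunks) ->
    lists:foldl(fun(Chunk, Buf) ->
                        <<Buf/binary, Chunk/binary>>
                end, <<>>, Chunks).
```

For example, `append_buf:accumulate([<<"he">>, <<"llo">>])` yields `<<"hello">>`.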

Another gotcha is that while Erlang has isolated, immutable process memory, binary blobs like that have a special rule. Past 64 bytes, rather than being stored on a process's heap, binaries become reference-counted and stored elsewhere in the VM (only the reference is on the heap). This is useful for most network handling, such as with TLS servers like mine, because the process that deals with receiving data from the socket and decrypting data can do all of that in its own memory, then pass the resulting reference in a message to cheaply make it available to any consumer.
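This split between heap and reference-counted binaries can be observed from within the VM. As a sketch (hypothetical module name), erlang:process_info(self(), binary) lists the off-heap binaries a process holds references to, and only binaries past the 64-byte mark tend to show up there:

```erlang
-module(refc_demo).
-export([off_heap_count/0]).

%% Build one small and one large binary, then ask the VM which
%% reference-counted (off-heap) binaries this process refers to.
%% The 8-byte binary lives directly on the process heap; the 1KB
%% one is stored outside of it, with only a reference on the heap.
off_heap_count() ->
    Small = binary:copy(<<"x">>, 8),
    Large = binary:copy(<<"x">>, 1024),
    {binary, Refs} = erlang:process_info(self(), binary),
    {byte_size(Small), byte_size(Large), length(Refs)}.
```

Run in a freshly spawned process, the reference list typically contains just the large binary.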

So quickly enough, we can assume that this is why the process memory is mostly flat: the actual storage is pretty much all spent on that big binary buffer that keeps being reallocated outside of the process heap.

What about reductions? I'd have expected reductions to drop as garbage collections ran, but nearly none need to run here since the memory usage of the process is flat. The cost of continuously copying a binary buffer, however, would probably be rather high, especially since thousands of packets need to be received to reconstitute the whole 30MB message I wanted to send (breaking larger files down into small segments to dump on disk is something I will implement later; right now they're all in RAM).

That copying could certainly explain the long tail, but not the hump. Linear allocation should show linear growth. For that, I had a fancier theory, which involved how the Erlang VM allocates memory.

a visual representation of Erlang allocators
a visual representation of Erlang sub-allocators

From my free ebook, Erlang in Anger:

By default, there will be one instance of each allocator per scheduler (and you should have one scheduler per core) [...]

Each of these sub-allocators will request memory from mseg_alloc and sys_alloc depending on the use case, and in two possible ways. The first way is to act as a multiblock carrier (mbcs), which will fetch chunks of memory that will be used for many Erlang terms at once. For each mbc, the VM will set aside a given amount of memory (about 8MB by default in our case, which can be configured by tweaking VM options), and each term allocated will be free to go look into the many multiblock carriers to find some decent space in which to reside.

Whenever the item to be allocated is greater than the single block carrier threshold, the allocator switches this allocation into a single block carrier (sbcs). A single block carrier will request memory directly from mseg_alloc for the first [N] entries, and then switch over to sys_alloc and store the term there until it’s deallocated.

Basically, while the data is small enough, it fits into the fast, pre-allocated pools of memory within multiblock carriers. Once it grows past the threshold, each large binary needs to be allocated individually in a single block carrier via malloc-style calls. That's when consuming the data gets drastically slower.
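The threshold in question can be peeked at from a running node. Here's a sketch (the exact shape of the system_info result varies across OTP releases, so treat this as an assumption-laden probe rather than a stable API):

```erlang
-module(alloc_info).
-export([sbc_threshold/0]).

%% Look up the binary allocator's single block carrier threshold
%% (sbct). Binaries larger than this get their own carrier instead
%% of sharing space in a multiblock carrier.
sbc_threshold() ->
    Info = erlang:system_info({allocator, binary_alloc}),
    %% There is one allocator instance per scheduler; peek at the
    %% first one's reported options.
    [Opts | _] = [O || {instance, _, O} <- Info],
    {options, Options} = lists:keyfind(options, 1, Opts),
    lists:keyfind(sbct, 1, Options).
```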

I could have tested this hypothesis with some systemtap/dtrace (or eBPF, I guess?) shenanigans, but decided to do it experimentally, with a small code change. First I introduced a new buffer data structure, which mostly kept the same semantics as before:

worker_loop(Dir, C=#conn{localname=Name, sock=Sock, buf=Buf0}) ->
    receive
        %% ...
        {ssl, Sock, Data} ->
            %% Take the data and append it to the buffer
            Buf1 = revault_tls:buf_add(Data, Buf0),
            %% then try to decode it
            {Unwrapped, IncompleteBuf} = revault_tls:unwrap_all(Buf1),

            %% Forward unpacked messages if any
            [revault_tls:send_local(Name, {revault, {self(), Marker}, Msg})
             || {revault, Marker, Msg} <- Unwrapped],

            %% then recurse
            worker_loop(Dir, C#conn{buf = IncompleteBuf});
        %% ...
    end.

The internal buffer interface is what really changed though:

%% If this is a brand new buffer
buf_add(Bin, B=#buf{seen=0, needed=0, acc=Acc}) ->
    %% check the first message for its header and find the final size
    case iolist_to_binary([lists:reverse(Acc), Bin]) of
        <<Size:64/unsigned, ?VSN:16/unsigned, _/binary>> = NewBin ->
            %% Add 10 bytes to the needed size to cover the Size+Vsn
            B#buf{seen=byte_size(NewBin), needed=Size+10, acc=[NewBin]};
        IncompleteBin ->
            %% not enough data for a full header yet; keep accumulating
            B#buf{seen=byte_size(IncompleteBin), acc=[IncompleteBin]}
    end;
%% if this buffer already has content
buf_add(Bin, B=#buf{acc=Acc, seen=N}) ->
    %% just add to it and track its size
    B#buf{acc=[Bin|Acc], seen=N+byte_size(Bin)}.

The main difference is that the buffer keeps a process-local list of all the binaries received in order, and only peeks at the first one received (which holds the full length required) to know when to even try collating all the data into a single binary, which can then be used to de-serialize data:

unwrap(B=#buf{seen=S, needed=N, acc=Acc}) when S > 0, S >= N ->
    Bin = iolist_to_binary(lists:reverse(Acc)),
    <<Size:64/unsigned, ?VSN:16/unsigned, Payload/binary>> = Bin,
    <<Term:Size/binary, Rest/binary>> = Payload,
    {revault, Marker, Msg} = binary_to_term(Term),
    {ok, ?VSN, {revault, Marker, unpack(Msg)}, buf_add(Rest, buf_reset(B))};
unwrap(B=#buf{}) ->
    {error, incomplete, B}.
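To see the approach in isolation, here's a simplified, self-contained sketch of the same idea (a hypothetical listbuf module with a 32-bit size prefix, a plain tuple instead of the #buf{} record, and no version field): chunks accumulate in a list, and the buffer is only flattened once the whole frame has arrived.

```erlang
-module(listbuf).
-export([new/0, add/2, unwrap/1]).

%% Buffer state: {Seen, Needed, Acc}, where Acc is a reversed list
%% of received chunks. Frames are a 32-bit size prefix followed by
%% the payload; Needed = 0 means the header hasn't been seen yet.
new() -> {0, 0, []}.

%% No header yet: flatten what little we have to peek for the
%% 4-byte size prefix and learn the full frame length.
add(Bin, {_Seen, 0, Acc}) ->
    case iolist_to_binary(lists:reverse([Bin | Acc])) of
        <<Size:32/unsigned, _/binary>> = New ->
            {byte_size(New), Size + 4, [New]};
        Incomplete ->
            {byte_size(Incomplete), 0, [Incomplete]}
    end;
%% Header known: just prepend the chunk and count bytes, no copying.
add(Bin, {Seen, Needed, Acc}) ->
    {Seen + byte_size(Bin), Needed, [Bin | Acc]}.

%% Only flatten the accumulated list once the whole frame is in.
unwrap({Seen, Needed, Acc}) when Needed > 0, Seen >= Needed ->
    <<Size:32/unsigned, Payload:Size/binary, Rest/binary>> =
        iolist_to_binary(lists:reverse(Acc)),
    {ok, Payload, add(Rest, new())};
unwrap(Buf) ->
    {error, incomplete, Buf}.
```

Feeding it a frame split across two chunks returns `{error, incomplete, _}` until the last chunk arrives, after which `unwrap/1` yields the payload and a fresh buffer seeded with any leftover bytes.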

The theory went that since each packet only contains small bits of data (large enough to still be refc binaries, but small enough that they could all individually fit into the fast multiblock carriers), they could accumulate really fast, and we'd pay the single block carrier allocation cost only once per file, suffering none of the slowdowns.

(Late edit: a friend pointed out the possibility that the old code had specific access patterns preventing common compiler optimizations, and that the new one fixed that as well. He may very well be right.)

Well lo and behold:

Metrics from a server showing network transfers: count (of receive operations), the rate of growth of the buffered data, the number of packets received, and the number of bytes received all display a continuous, fast transfer, with short gaps in between

Everything's linear! And fast! The graph looks really jagged, which (along with the message queue never really filling up) points at the network now being faster than what the client fills it with, and hints at further places to investigate. These jagged bits are also only visible because of how zoomed in we are, given how much faster the code is. Here it is at the same scale as the initial slow run:

Same graph as before, but taking a tiny fraction of the screen

That is tiny by comparison, and measurements show that the same file transfer now takes a fraction of the old time; the slowest receive operation went from 31ms down to 0.645ms. In fact, the whole transfer is ~20x faster (from ~20 seconds to below 1 second for a 28MB file on my network). The bigger the file, the more notable the speed gain should be.

There's more juice to press out of the lemon, but going from 50KB/s to 30MB/s by just changing the way the buffer is tracked seems like a good enough win for a night.