My bad opinions

2019/08/24

Clever Functional Design

One of the best bits of software design I recall participating in was something done a few years ago at Heroku. It's been long enough since then that I feel comfortable boasting about it here. It's one small piece of data-centred functional design, where thinking a bit harder about the problem at hand yielded something far simpler than the straightforward, obvious implementation choice a mutable language would have invited.

The Feature

Heroku's routing stack had one very interesting feature it provided to all users: a router log. The router log contains fields such as the overall request time, the time it took to send the request, the time it took to send the body, Heroku-specific error codes, and so on. They represent broad categories of interesting time spans.

At the same time, Heroku engineers had internal logs for all the requests for which users had public logs. These internal logs contained more detailed information that more often made sense to display during specific debugging activities. They included information such as time spent parsing headers from the client, time to first response packet (essentially the gap after the whole request had been sent, but before the back-end application started responding), POSIX statuses detected by the proxy on socket issues, and so on.

During more intricate debugging, other values could be required from the logs, but adding them would need code changes. This information was maintained in what was essentially a monolithic proxy that contained both the business logic (what logs to create, which servers to route to, and so on) and the proxying logic (how to shuttle HTTP from point A to point B).

At some point during my Heroku days, we rewrote the entire routing stack to clearly divide the business concerns (routing and features for Heroku apps) and the proxying logic. The idea was to clean up deeply intertwined code, clarify and properly specify the proxying behaviour, and allow us to reason about and change what we offered to customers without having to know all the innards of HTTP proxying logic. This divide was successful, and eventually allowed us to open source the proxying logic: since it was no longer business related and was commodity infrastructure, vegur became public.

This division came with a few challenges though, and one of them was with the logs: how were we to take a Heroku feature, such as the router logs we had, with their own specific needs that could change according to business requirements, and bake them into a generic proxy library, where all the interesting measurements and samplings were to take place?

The Straightforward Design

The approach we took in the original router was to just take all the samples as we needed them, mostly as spans. You essentially just intersperse the logging and observational needs with the actual business end of the code, and do what you must.

One straightforward way to do this is with timestamps and might look something like this:

T1 = stamp(),
do_something(),
T2 = stamp(),
report_duration("some label", T2-T1)

You take a timestamp, do the thing, take a second timestamp, and report the difference as the duration of the operation. Eventually you get tired of doing this, and you might wrap them up in a helper that takes a closure (or wraps some object in OO design) and hides the reporting:

with_timestamp("some label", fun() -> do_something() end)
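
A minimal sketch of such a helper in Erlang, reusing the illustrative stamp/0 and report_duration/2 from above, and using try ... after so the duration gets reported even if the closure fails:

with_timestamp(Label, Fun) ->
    T1 = stamp(),
    try
        Fun()
    after
        %% runs whether Fun() returns or throws
        T2 = stamp(),
        report_duration(Label, T2-T1)
    end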

Both approaches work fine. The latter offers slightly more encapsulation, and also prevents having overlapping timestamps where two measurements intersect. The logic is always set at the call-site, and things can be a bit tricky with error handling, but that's generally how you do it. The same kind of approach is still rather broadly used in distributed tracing with the addition of a context, which lets you define some lineage or nesting of operations:

%% approach 1
Ctx = new_span("some label"),
NewCtx = do_something(Ctx),
close_span(NewCtx)

%% approach 2
with_span("some_label", fun(Ctx) -> do_something(Ctx) end)

Of course if you've got mutability going on and some global scope available, you'll cheat a bit and hide the span context within the program's state differently.
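
In Erlang, for instance, the closest analogue would be stashing the span in the process dictionary, which acts as per-process mutable state. A sketch reusing the illustrative new_span/1 and close_span/1 from above:

with_span(Label, Fun) ->
    %% put/2 returns the previous value, letting us save the parent span
    Parent = erlang:put(current_span, new_span(Label)),
    try
        Fun()
    after
        close_span(erlang:erase(current_span)),
        %% restore the parent span (undefined at the top level)
        erlang:put(current_span, Parent)
    end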

In any case, the old approach was based on these kinds of mechanisms. When the time came to split the router into its business and general parts, the tools used for logging needed to be decoupled as well. The general and straightforward approach to that is dependency injection. Our new approach might now look something like this:

f(Logger) ->
    T1 = stamp(),
    do_something(Logger),
    T2 = stamp(),
    Logger("some label", T2-T1)

Much like passing the context or a span in the distributed tracing approach, you now parametrize the contract of each dependent function to take some contextual cues that explain how to do things. It would have been a bit cumbersome to thread this through all the components of a proxying library, but it was possible, and even easier with tool or IDE support.

This, however, was the road we decided not to take.

The Weakness of the Straightforward Approach

The problem with the dependency injection approach, aside from its cumbersomeness, is that it did not sufficiently decouple what was a business concern from what was a generic library. Sure, we would hide and abstract away the tools chosen (which logging or tracing library would be used), but in no way would it really decouple the design.

It would be tricky, for example, to properly track the concerns of "public user logs" and "internal engineer logs". The biggest design issue was something we uncovered by simply asking ourselves this question: if some other project were to use this library, would their reporting need to change every time Heroku decided to log new metrics?

Sure, the implementation could be independent. But the straightforward design only decoupled the technical dependencies and which code was used. It did not get rid of the logical coupling that still existed between Heroku's business logic and the proxy's need to just shuttle HTTP. If we went with that approach, there would still be a very deep dependency between both code bases. Heroku would unsurprisingly rely on the proxy, but it felt weird that the proxy's instrumentation would have to be defined by Heroku's product requirements.

Another slightly less important issue came from implementation details. I mention it not because it was a huge blocker to logical decoupling, but because this implementation detail ended up providing the key to the nicer design. The Vegur proxy had been written to use passive TCP sockets in blocking mode, because those were faster back in the day (implementation changes and optimizations within the BEAM VM have since made this unnecessary). This, and other earlier design choices, meant the proxy itself had three major moving parts:

  1. an HTTP server parsing module, which would listen to incoming requests from the public Internet and parse them
  2. an HTTP client parsing module, which would forward the traffic to a customer's back-end and listen to the responses
  3. an inner loop that would use both the client and server bits and would handle the data transfer across both of them as a bridge.

This meant that some of the metrics we cared about would reside entirely within one bit of code (e.g. the time it takes to parse headers is self-contained within a single component), but others would cross boundaries. For example, knowing the time it took to parse the request body required taking measurements in the HTTP server that could be intertwined with operations taking place in both the inner loop and the HTTP client. There was no clear structural hierarchy.

Worse, some debugging scenarios required taking some measurements that started in the HTTP server, and finished in the HTTP client. That made it particularly difficult to localize and isolate concerns well, and the overall requirements of Heroku's reporting risked having a huge impact on the structure of the proxy.

Dependency injection would not be enough to fix this; we needed to think about the problem differently.

A Functional Design

Even in today's modern distributed tracing ecosystem, the design of most tracing libraries is deeply centered on the concept of a span: a pre-defined point A to point B period of time, which is written and instrumented as such in the code.

The big Eureka! moment for our approach in Vegur was in realizing that, when debugging, what we care about is picking arbitrary points on a timeline. Spans let you represent things a bit like this:

|------------------- Request --------------------|
 |--- Header Parsing ---|-- Body parsing --| ...
     |-- Cookie --|
        | ... |

Those are contiguous subdivisions of the whole timeline. What we wanted, instead, was a flat timeline on which we could pick arbitrary intervals:

request start                                          end of request
|                                                       |
| start header parsing                                  |
| |                first packet sent           ...      |
| |                |                             |      |
|-x--x--x------x---x----x------------------x-----x---...|
     |  |      |        |                  |
     |  |     ...      start body parsing  |
     | end cookie                     end body parsing
 start cookie

All these things happen in a continuum. The divisions of what we wanted to report were not structural to this timeline; they were views or selections of various points of interest, and a measurement between them. The "first packet sent" event, for example, is something that could be useful to multiple metrics: the time from the start of the request to the first packet (a time to first byte), the gap between the end of header parsing and the first packet, the interval between the first packet and the end of the request, and so on. Being able to report on all of these was context-dependent for the consumer, meaning that's usually a business concern. But the proxying library itself only cared about capturing specific arbitrary points we thought could be useful to its technical users.

That distinction, and the view of the data as a timeline, was the pivotal point we needed in the design. What the proxying library needed to do was not provide all the metrics and spans Heroku expected. What it needed to provide was the list of all important points on a timeline, whether they represented a singular event or a duration. It would then be up to the consumer to report things however they wanted, whenever they wanted.

The flat timeline was particularly interesting for this because it is easily representable as an immutable data structure. If all you have is a bunch of local monotonic timestamps, all you need is a local data structure that keeps a sequence of labelled points in time: [{Label1, T1}, {Label2, T2}, ...]
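
A minimal sketch of that structure in Erlang (the module and function names are illustrative, not Vegur's actual API):

-module(timeline).
-export([new/0, stamp/2]).

%% A timeline is just a list of {Label, Timestamp} pairs,
%% most recent first; recording an event is a single cons.
new() -> [].

%% The monotonic clock keeps entries sortable even if the
%% wall clock jumps around.
stamp(Label, Timeline) ->
    [{Label, erlang:monotonic_time()} | Timeline].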

Since timestamps are generally sortable locally (you need some fancy cheats to make that work in a distributed setting), all the local timelines between the HTTP client, server, and inner loop modules could be maintained independently, but merged reliably: just sort by timestamp.

|-x----------------x---------------------------------...|

  |--x--x------x--------x------------------|

                   |----------------------------x---...|

         |               |             |
         v               v             v

|-x--x--x------x---x----x------------------x----x----...|
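
In code, that merge is about as small as it gets; sketched here against the illustrative structure above, taking a list of timelines:

%% merge a list of timelines into one chronological sequence
%% by sorting on the timestamp element of each pair
merge(Timelines) ->
    lists:keysort(2, lists:append(Timelines))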

This would let us write one generic, well-defined data structure, use it wherever and whenever we needed it, and just merge them near the end of each timeline. No need to coordinate context-passing around; just a fetch and a merge once per item.

Then, the business end of code in Heroku's router could ask for that timeline once the request was ready to be logged, get one well-known data structure, and do as many selections for as many debugging reports as it required. If you wanted to send 15 logs out of there, it did not matter to the proxy library. Just analyze the timeline, generate what you need, and that's it.
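
A consumer-side selection could look something like this (duration/3 is hypothetical, built on the sketched timeline structure above):

%% pick two labelled points out of a merged timeline and
%% report the interval between them in milliseconds;
%% assumes both labels are present in the timeline
duration(StartLabel, EndLabel, Timeline) ->
    {_, T1} = lists:keyfind(StartLabel, 1, Timeline),
    {_, T2} = lists:keyfind(EndLabel, 1, Timeline),
    erlang:convert_time_unit(T2 - T1, native, millisecond)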

Interestingly, since the final data structure could be represented easily in the base types of the language, Heroku's router was able to create its own compatible timeline that could be cast and merged with the proxy's timeline, without the two actually sharing an implementation (which would also have been fine). This would later let us augment the proxying logs with all the routing and business decisions for all kinds of debugging purposes (how much time did we spend queued, waiting for an available back-end to route to?). This turned into app-specific routing flags that allowed deeper introspection of the routing logic for specific applications, at nearly no overhead in code.

Lesson Learned

The approach itself here is mildly interesting, and has some intriguing implications for the design of distributed tracing libraries. The implementation in Vegur is straightforward enough that I didn't spend the time to describe it here.

The real, bigger lesson here is in systems design. It relates to functional, immutable, and declarative approaches to structuring communication flows.

The straightforward answer to our decoupling issue was to respond to the technical concern: just make it so the dependency does not know about the libraries used by its parent. Strictly speaking, I think it would have solved the most blocking problem of getting the code to build. This would have been easy to do, and the only thing that made it annoying was the fact we were using a functional programming language with no mutable data structures or global shared context. But satisfying the compiler is not enough to make for good design. This approach would have made the code hard to maintain given the implementation details, and it would not have removed any logical coupling.

Rather than dismissing this challenge as "a bad fit for functional programming", it is what led to a better solution: re-think the data structure, and gain better insight into the distinction between how the data is produced and how the data is consumed. Take that separation, and make it explicit. Build around it. Turn it into a data contract. This, in turn, lets you more transparently change either end. You might need to add new measurement points on the producer side when a consumer needs them, but the properly declared abstraction means the other consumers will not be affected by the change.

The end result was a purely functional data structure that was mergeable, testable, and in line with functional design, but that's just a technical aspect of the result: it was the structural constraint of already being in an immutable context that prompted the cleaner design. Most challenges of working in a functional, declarative, or immutable language are not necessarily due to the language itself. They come from being thrown into a context where the easy shortcuts are not as practical as we are used to them being, forcing us to re-think both our problem and our solutions.