My bad opinions

2024/02/07

A Distributed Systems Reading List

This document contains various resources and quick definitions of a lot of the background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I had written it some time in 2019 when coworkers at the time asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.

Since I was asked for resources again recently, I decided to pop this text into my blog. I have verified the links again and replaced those that broke with archive links or other ones, but have not sought alternative sources when the old links worked, nor taken the time to add any extra content for new material that may have been published since then.

It is meant to be used as a quick reference to understand various distsys discussions, and to discover the overall space and possibilities that are around this environment.

Foundational theory

This section provides the basics of distsys theory. Most of the papers or resources you read will make references to some of these concepts, so explaining them first makes sense.

Models

In a Nutshell

There are three model types used by computer scientists doing distributed system theory:

  1. synchronous models
  2. semi-synchronous models
  3. asynchronous models

A synchronous model means that each message sent within the system has a known upper bound on communication (the max delay between a message being sent and received) and on the processing speed between nodes or agents. This means that you can know for sure that after a period of time, a message was missed. This model is applicable in rare cases, such as hardware signals, and is mostly beginner mode for distributed system proofs.

An asynchronous model means that you have no upper bound. It is legit for agents and nodes to process and delay things indefinitely. You can never assume that a "lost" message you haven't seen for the last 15 years won't just happen to be delivered tomorrow. The other node can also be stuck in a GC loop that lasts 500 centuries, and that's fine.

Proving something works in the asynchronous model means it works with all the other model types. This is expert mode for proofs, and in most cases it is even trickier than getting real-world implementations to work.

Semi-synchronous models are the cheat mode for the real world. There are upper bounds on communications and processing everywhere, but they are often configurable and left unspecified. This is what lets a protocol designer go "you know what, we're gonna stick a ping message in there, and if you miss too many of them we consider you're dead."

You can't assume all messages are delivered reliably, but you give yourself a chance to say "now that's enough, I won't wait here forever."

Protocols like Raft, Paxos, and ZAB (quorum protocols behind etcd, Chubby, and ZooKeeper respectively) all fit this category.
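As a rough illustration of that cheat mode, here's a minimal sketch in Python of the "miss too many pings and you're dead" logic; the names (last_seen, SUSPECT_AFTER) and the 15-second bound are made up for the example:

    import time

    SUSPECT_AFTER = 3 * 5.0  # allow three missed 5-second heartbeats (arbitrary bound)

    last_seen = {}  # node name -> timestamp of the last heartbeat we received

    def record_heartbeat(node):
        last_seen[node] = time.monotonic()

    def suspected_down(node):
        # In a purely asynchronous model this check would be meaningless;
        # the semi-synchronous model lets us pick an upper bound and act on it.
        if node not in last_seen:
            return True
        return time.monotonic() - last_seen[node] > SUSPECT_AFTER

The whole trick of the semi-synchronous model lives in that one configurable constant: pick it too low and you suspect slow-but-healthy nodes, pick it too high and you take forever to notice real crashes.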

Theoretical Failure Modes

The way failures happen and are detected is important to a bunch of algorithms. The following are the most commonly used ones:

  1. Fail-stop failures
  2. Crash failures
  3. Omission failures
  4. Performance failures
  5. Byzantine failures

First, Fail-stop failures mean that if a node encounters a problem, everyone can know about it and detect it, and the node can restore its state from stable storage. This is easy mode for theory and protocols, but super hard to achieve in practice (and in some cases impossible).

Crash failures mean that if a node or agent has a problem, it crashes and then never comes back. You are either correct or late forever. This is actually easier to design around than fail-stop in theory (but a huge pain to operate because redundancy is the name of the game, forever).

Omission failures imply that you either give correct results that respect the protocol, or never answer at all.

Performance failures assume that while you respect the protocol in terms of the content of the messages you send, you may also send results late.

Byzantine failures mean that anything can go wrong (including people willingly trying to break your protocol with bad software pretending to be good software). There's a special class of authentication-detectable byzantine failures which at least adds the constraint that you can't forge messages from other nodes, but that is an optional thing. Byzantine modes are the worst.

By default, most distributed system theory assumes that there are no bad actors or agents that are corrupted and willingly trying to break stuff, and byzantine failures are left up to blockchains and some forms of package management.

Most modern papers and stuff will try and stick with either crash or fail-stop failures since they tend to be practical.

See this typical distsys intro slide deck for more details.

Consensus

This is one of the core problems in distributed systems: how can all the nodes or agents in a system agree on one value? The reason it's so important is that if you can agree on just one value, you can then do a lot of things.

The most common example of picking a single very useful value is the name of an elected leader that enforces decisions, just so you can stop having to build more consensuses because holy crap consensuses are painful.

Variations exist on what exactly counts as a consensus, including whether everyone has to agree fully (strong) or just a majority (t-resilient), and the same question gets asked under the various synchronicity and failure models.

Note that while classic protocols like Paxos use a leader to speed up execution while remaining consistent, a bunch of systems will forgo these requirements.

FLP Result

In A Nutshell

Stands for Fischer-Lynch-Paterson, the authors of a 1985 paper which shows that proper consensus, where all participants agree on a value, is unsolvable in a purely asynchronous model (even though it is solvable in a synchronous one) as long as any kind of failure is possible, even if it's just a delay.

It's one of the most influential papers in the area because it triggered a lot of follow-up work by other academics to define what exactly is going on in distributed systems.

Detailed reading

Fault Detection

Following the FLP result, which showed that failure detection is kind of super-critical to making things work, a lot of computer science folks started working on what exactly it means to detect failures.

This stuff is hard and often much less impressive than we'd hope for it to be. There are strong and weak fault detectors. The former implies all faulty processes are eventually identified by all non-faulty ones, and the latter that only some non-faulty processes find out about faulty ones.

Then there are degrees of accuracy:

  1. Nobody who has not crashed is suspected of being crashed
  2. It's possible that a non-faulty process is never suspected at all
  3. You can be confused because there's chaos but at some point non-faulty processes stop being suspected of being bad
  4. At some point there's at least one non-faulty process that is not suspected

You may realize that a strong and fully accurate detector (said to be perfect) kind of implies that you get consensus, and since consensus is not really doable in a fully asynchronous system model with failures, there are hard limits to what you can detect reliably.

This is often why semi-synchronous system models make sense: if you treat delays greater than T as a failure, then you can start doing adequate failure detection.
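As a hedged sketch of how that plays out, here is a toy detector that suspects nodes after a timeout T, and doubles T whenever a suspicion turns out to be wrong, so that slow-but-alive processes eventually stop being suspected (the class and method names are invented for this example):

    import time

    class Detector:
        def __init__(self, initial_timeout=1.0):
            self.timeout = initial_timeout     # the bound T we are willing to wait
            self.last_seen = {}                # node -> time of last heartbeat
            self.suspects = set()

        def on_heartbeat(self, node):
            self.last_seen[node] = time.monotonic()
            if node in self.suspects:
                # We suspected a node that was merely slow: take back the
                # suspicion and double T so this mistake gets rarer over time.
                self.suspects.discard(node)
                self.timeout *= 2

        def check(self):
            now = time.monotonic()
            for node, seen in self.last_seen.items():
                if now - seen > self.timeout:
                    self.suspects.add(node)
            return self.suspects

This mirrors the "eventually" flavours of accuracy above: mistakes can happen early on, but the detector adapts until it stops making them, as long as delays really are bounded.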

See this slide deck for a decent intro

CAP Theorem

The CAP theorem was for a long while just a conjecture, but was proven in the early 2000s, leading to a lot of eventually consistent databases.

In A Nutshell

There are three properties to a distributed system:

  1. Consistency: every read sees the most recent write, so all nodes agree on the current data
  2. Availability: every request eventually gets a non-error response
  3. Partition tolerance: the system keeps operating even when messages between nodes get lost or delayed

In theory, you can get a system that is both available and consistent, but only under synchronous models on a perfect network. Those don't really exist so in practice P is always there.

What the CAP theorem states is essentially that given P, you have to choose either A (keep accepting writes and potentially corrupt data) or C (stop accepting writes to save the data, and go down).

Refinements

CAP is a bit strict in what you get in practice. Not all partitions in a network are equivalent, and not all consistency levels are the same.

Two of the most common approaches to add some flexibility to the CAP theorem are the Yield/Harvest models and PACELC.

Yield/Harvest essentially says that you can think of the system differently: yield is your ability to fulfill requests (as in uptime), and harvest is the fraction of all the potential data you can actually return. Search engines are a common example here: they will increase their yield (answering more often) by reducing their harvest (ignoring some search results) so they can respond faster, or respond at all.

PACELC adds the idea that the trade-off is not only about partitions: in case of network Partitioning you have to choose between Availability or Consistency, but Else, when the system is running normally, one has to choose between Latency and Consistency. The idea is that you can decide to degrade your consistency for availability (but only when you really need to), or you could decide to always forego consistency because you gotta go fast.

It is important to note that you cannot beat the CAP theorem (as long as you respect the models under which it was proven), and anyone claiming to do so is often a snake oil salesman.

Resources

There have been countless rehashes of the CAP theorem and various discussions over the years; the result is mathematically proven, even if many keep trying to argue that their systems are so reliable it doesn't matter.

Message Passing Definitions

Messages can be sent zero or more times, in various orderings. Some terms are introduced to define what they are:

  1. at-most-once delivery: every message is sent once at most; it can get lost, but it will never be duplicated
  2. at-least-once delivery: every message is retried until acknowledged; it will not get lost (as long as someone is there to ack it), but it can be duplicated
  3. exactly-once delivery: every message is seen exactly once; this is generally impossible to guarantee at the transport level, and is usually approximated with at-least-once delivery plus idempotence

Regarding ordering:

  1. no ordering: messages can show up in any order
  2. FIFO ordering: messages from a given sender are seen in the order that sender sent them
  3. causal ordering: a message is only seen after the messages that caused it (the ones it depends on)
  4. total ordering: all processes see all messages in the same order

There isn't a "best" ordering; each provides different possibilities and comes with different costs, optimizations, and related failure modes.
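As one concrete example, here is a small sketch of FIFO ordering built out of per-sender sequence numbers: messages that arrive early get buffered until the gap before them is filled. All the names here are made up for the example:

    from collections import defaultdict

    next_expected = defaultdict(int)   # sender -> next sequence number to deliver
    buffered = defaultdict(dict)       # sender -> {sequence number: message}

    def receive(sender, seq, msg):
        """Buffer the message, then deliver everything that is now in order."""
        buffered[sender][seq] = msg
        delivered = []
        while next_expected[sender] in buffered[sender]:
            delivered.append(buffered[sender].pop(next_expected[sender]))
            next_expected[sender] += 1
        return delivered  # messages from this sender, in the order they were sent

Calling receive("a", 1, "world") delivers nothing until receive("a", 0, "hello") arrives, at which point both come out in the order they were sent. Causal and total orderings need more machinery on top of this (vector clocks, sequencers, or consensus).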

Idempotence

Idempotence is important enough to warrant its own entry. Idempotence means that when messages are seen more than once, resent or replayed, they don't impact the system differently than if they were sent just once.

Common strategies include having each message refer to previously seen messages so that you can define an ordering that prevents replaying older messages, assigning unique IDs (such as transaction IDs) coupled with a store that prevents replaying transactions, and so on.
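A minimal sketch of the unique-ID strategy, where an in-memory set stands in for a durable store of already-processed transaction IDs:

    handled = set()             # stand-in for a durable store of processed transaction IDs
    balances = {"alice": 100}

    def apply_deposit(txn_id, account, amount):
        if txn_id in handled:
            return "already applied"   # replaying the message changes nothing
        balances[account] = balances.get(account, 0) + amount
        handled.add(txn_id)
        return "applied"

    # Sending the same message twice has the same effect as sending it once:
    apply_deposit("txn-42", "alice", 10)   # -> "applied", balance becomes 110
    apply_deposit("txn-42", "alice", 10)   # -> "already applied", still 110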

See Idempotence is not a medical condition for a great read on it, with various related strategies.

State Machine Replication

This is a theoretical model by which, given the same initial state and the same operations applied in the same order (disregarding all kinds of non-determinism), all state machines will end up with the same result.

This model ends up being critical to most reliable systems out there, which tend to all try to replay all events to all subsystems in the same order, ensuring predictable data sets in all places.

This is generally done by picking a leader; all writes are done through the leader, and all the followers get a consistent replicated state of the system, allowing them to eventually become leaders or to fan-out their state to other actors.
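A toy sketch of the principle, with made-up operations, just to show that replicas applying the same deterministic operations in the same order end up identical:

    def apply(state, op):
        """Deterministic state transition: op is ("set", key, value) or ("incr", key)."""
        kind, key, *rest = op
        if kind == "set":
            state[key] = rest[0]
        elif kind == "incr":
            state[key] = state.get(key, 0) + 1
        return state

    log = [("set", "x", 1), ("incr", "x"), ("set", "y", 10)]   # the agreed-upon order

    leader = {}
    follower = {}
    for op in log:
        apply(leader, op)
        apply(follower, op)

    assert leader == follower == {"x": 2, "y": 10}

Consensus protocols like Raft exist mostly to get every replica to agree on that log and on its order.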

State-Based Replication

State-based replication is conceptually simpler than state-machine replication, with the idea that if you just replicate the state itself, you get the state at the end!

The problem is that it is extremely hard to make this fast and efficient. If your state is terabytes large, you don't want to re-send it on every operation. Common approaches include splitting, hashing, and bucketing data to detect changes and only send the changed bits (think of rsync), Merkle trees, or the idea of a patch to source code.
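Here is a rough sketch of the "only send what changed" idea, hashing fixed-size chunks of the state and shipping only the chunks whose hashes differ; it's a very simplified cousin of what rsync does, and every name and the chunk size are made up:

    import hashlib

    CHUNK = 4  # unrealistically small chunk size, just for the example

    def chunk_hashes(data):
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        return [hashlib.sha256(c).hexdigest() for c in chunks]

    def changed_chunks(old, new):
        old_h, new_h = chunk_hashes(old), chunk_hashes(new)
        # Keep only the chunks whose hashes differ from (or don't exist in) the old copy.
        return {i: new[i * CHUNK:(i + 1) * CHUNK]
                for i, h in enumerate(new_h)
                if i >= len(old_h) or h != old_h[i]}

    old = b"hello world!"
    new = b"hello World!"
    print(changed_chunks(old, new))   # only the chunk containing "W" gets resent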

Practical Matters

Here are a bunch of resources worth digging into for various system design elements.

End-to-End Argument in System Design

This is a foundational practical aspect of system design for distributed systems.

The conclusion is that if you want anything to be reliable, you need an end-to-end acknowledgement, usually written by the application layer.

These ideas are behind the design of TCP as a protocol, but the authors also note that it wouldn't be sufficient to leave it at the protocol level; the application layer must be involved.
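As a sketch of what an end-to-end acknowledgement can look like at the application layer: the sender keeps retrying until the application on the other side, not just the transport, confirms it processed the request. The send_request function below is a stand-in that randomly fails, just to keep the example self-contained:

    import random
    import time

    def send_request(payload):
        # Placeholder for the actual network call; it randomly fails here to
        # simulate lost messages and missing acknowledgements.
        if random.random() < 0.5:
            raise TimeoutError("no reply")
        return {"status": "processed"}

    def reliable_submit(payload, attempts=5, backoff=0.5):
        for attempt in range(attempts):
            try:
                reply = send_request(payload)
                if reply.get("status") == "processed":
                    return reply              # the application itself confirmed the work
            except TimeoutError:
                pass                          # a transport-level send alone proves nothing
            time.sleep(backoff * (attempt + 1))
        raise RuntimeError("no end-to-end acknowledgement; the outcome is unknown")

Note that retrying like this is only safe when combined with the idempotence strategies described earlier; otherwise the final error is exactly the "did it happen or not?" problem that shows up again under Timeouts below.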

Fallacies of Distributed Computing

The fallacies are:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Partial explanations on the Wiki page or full ones in the paper.

Common Practical Failure Modes

In practice, when you switch from Computer Science to Engineering, the types of faults you will find are a bit more varied, but can map to any of the theoretical models.

This section is an informal list of common sources of issues in a system. See also the CAP theorem checklist for other common cases.

Netsplit

Some nodes can talk to each other, but some nodes are unreachable to others. A common example is that a US-based network can communicate fine internally, and so can an EU-based network, but both are unable to speak to each other.

Asymmetric Netsplit

Communication between groups of nodes is not symmetric. For example, imagine that the US network can send messages to the EU network, but the EU network cannot respond back.

This is a rarer mode when using TCP (although it has happened before), and a potentially common one when using UDP.

Split Brain

The way a lot of systems deal with failures is to keep a majority going. A split brain is what happens when both sides of a netsplit think they are the leader, and start making conflicting decisions.

Timeouts

Timeouts are particularly tricky because they are non-deterministic. They can only be observed from one end, and you never know if a timeout that is ultimately interpreted as a failure was actually a failure, or just a delay due to networking, hardware, or GC pauses.

There are times when retransmission is not safe if the message has already been acted on (i.e. it is not idempotent), and timeouts essentially make it impossible to know whether retransmission is safe to try: was the message acted on, dropped, or is it still in transit or in a buffer somewhere?

Missing Messages due to Ordering

Generally, using TCP and crashes will tend to mean that few messages get missed across systems, but frequent cases can include:

Clock Drift

Not all clocks on all systems are synchronized properly (even when using NTP), and they will drift at different speeds.

Using a timestamp to sort through events is almost guaranteed to be a source of bugs, even more so if the timestamps come from multiple computers.

The Client is Part of the System

A very common pitfall is to forget that the client that participates in a distributed system is part of it. Consistency on the server-side will not necessarily be worth much if the client can't make sense of the events or data it receives.

This is particularly insidious for database clients that run a non-idempotent transaction, time out, and have no way to know whether they can safely try it again.

Restoring from multiple backups

A single backup is kind of easy to handle. Multiple backups run into a problem called consistent cuts (high level view) and distributed snapshots, which means that not all the backups are taken at the same time, and this introduces inconsistencies that can be construed as corrupting data.

The good news is there's no great solution and everyone suffers the same.

Consistency Models

There are dozens of different levels of consistency, all of which are documented on Wikipedia, in Peter Bailis' paper on the topic, or in Kyle Kingsbury's overview post on them.

Note that while these definitions have clear semantics that academics tend to respect, they are not adopted uniformly or respected in various projects' or vendors' documentation in the industry.

Database Transaction Scopes

By default, most people assume database transactions are linearizable, and they tend not to be because that's way too slow as a default.

Each database might have different semantics, so the following links may cover the major ones.

Be aware that while the PostgreSQL documentation is likely the clearest and easiest one for getting introduced to the topic, various vendors can assign different meanings to the same standard transaction scopes.
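For instance, rather than relying on whatever default the vendor picked, a client can request an isolation level explicitly. The sketch below assumes a PostgreSQL database accessed through psycopg2 and a hypothetical accounts table; other vendors may interpret (or not support) the same statement differently:

    import psycopg2

    conn = psycopg2.connect("dbname=example")   # placeholder connection string
    with conn, conn.cursor() as cur:
        # Must be the first statement of the transaction in PostgreSQL.
        cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
        cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
    # Leaving the `with conn` block commits, or rolls back on error.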

Logical Clocks

These are data structures that let you create either total or partial orderings between messages or state transitions.

The most common ones are:

  1. Lamport timestamps
  2. Vector clocks
  3. Version vectors
  4. Dotted version vectors
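As a tiny illustration, a Lamport timestamp, the simplest of these, is just a counter that gets bumped on every local event and merged on every receive:

    class LamportClock:
        def __init__(self):
            self.time = 0

        def local_event(self):
            self.time += 1
            return self.time

        def send(self):
            # Stamp outgoing messages with the current logical time.
            self.time += 1
            return self.time

        def receive(self, msg_time):
            # Take the max of both clocks, then tick: the receive event is
            # guaranteed to be ordered after the send it depends on.
            self.time = max(self.time, msg_time) + 1
            return self.time

This only captures causality in one direction: if event a happened before event b then a's timestamp is smaller, but a smaller timestamp alone doesn't prove that anything happened before anything else. Vector clocks are the usual next step when you need that.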

CRDTs

CRDTs essentially are data structures that restrict operations that can be done such that they can never conflict, no matter which order they are done in or how concurrently this takes place.

Think of them as the specification for how someone would write a distributed Redis that is never wrong, but where only the mathematically safe operations are left in.

This is still an active area of research and countless papers and variations are always coming out.
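As one concrete example, here is a sketch of a grow-only counter (G-Counter), about the simplest CRDT there is: each node only increments its own slot, and merging takes the per-node maximum, so merges can be applied in any order, any number of times, and still converge:

    class GCounter:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counts = {}          # node id -> how many increments that node did

        def increment(self):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Commutative, associative, and idempotent: replaying or reordering
            # merges never changes the outcome.
            for node, n in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), n)

    a, b = GCounter("a"), GCounter("b")
    a.increment(); a.increment(); b.increment()
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 3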

Other interesting material

The bible for putting all of these views together is Designing Data-Intensive Applications by Martin Kleppmann. Be advised, however, that everyone I know who absolutely loves this book already had a good foundation in distributed systems from reading a bunch of papers, and greatly appreciated having it all put in one place. Most people I've seen read it in book clubs with the aim of getting better at distributed systems still found it challenging and confusing at times, and benefitted from having someone around they could ask questions to in order to bridge some gaps. It is still the clearest source I can imagine for everything in one place.