Distributed Systems Foundations
Distributed systems lie at the heart of modern computing infrastructure, enabling fault-tolerant, scalable services that span data centers, continents, and networks of unreliable components. The challenge of designing such systems is to reconcile correctness and performance under the realities of partial failure, concurrency, and asynchrony.
This literature review traces the intellectual evolution of key ideas that define the distributed systems landscape. It begins with foundational work by Leslie Lamport on the logical structure of time and causality in asynchronous systems, proceeds through impossibility results like the FLP theorem, and culminates in practical consensus algorithms that form the backbone of today’s distributed databases and coordination services.
Alongside consensus theory, we explore the articulation of tradeoff models that formalize the limits of distributed consistency. The CAP theorem, and later PACELC, provide a vocabulary for understanding what can be achieved—and what must be sacrificed—when systems face network partitions or demand low-latency responses.
Finally, we examine architectural responses to these theoretical constraints. The Lambda and Kappa architectures represent two distinct attempts to structure computation in the presence of uncertainty and scale: one by separating batch and stream computation, the other by consolidating them into a unified streaming model.
Evolution of Distributed Systems Theory and Architecture
year | milestone |
---|---|
1978 | Lamport introduces Logical Clocks and the “happened-before” relation |
1982 | Byzantine Generals Problem formalizes consensus under arbitrary failures |
1985 | FLP Impossibility Theorem proves deterministic consensus cannot be guaranteed in an asynchronous system with even one crash failure |
1998 | Paxos protocol published as “The Part-Time Parliament” |
2002 | CAP Theorem formalized by Gilbert & Lynch |
2010 | PACELC model proposed by Abadi, introducing latency-consistency tradeoff |
2013 | Marz proposes the Lambda Architecture with immutable data and dual-path computation |
2014 | Kreps critiques Lambda and introduces the Kappa Architecture as a streaming-first alternative |
2014 | Ongaro & Ousterhout present Raft, emphasizing understandability and implementability |
Several broad themes emerge across the literature:
- Causality and Ordering: Without synchronized clocks, distributed systems rely on logical or physical surrogates to preserve causal reasoning.
- Fault Tolerance Through Redundancy and Agreement: Quorums and majority-based protocols are central to tolerating crashes while ensuring safety.
- Tradeoff-Aware Design: CAP and PACELC enforce humility—designers must consciously choose what to sacrifice under different conditions.
- Architectural Patterns for Scalability: Event logs, immutability, and reprocessing offer powerful techniques to build scalable, resilient applications.
Time, Clocks, and the Ordering of Events in a Distributed System (1978)
Leslie Lamport, Communications of the ACM 21(7), 1978. [pdf]
Leslie Lamport’s seminal 1978 paper introduces the concept of logical clocks to formalize the partial ordering of events in a distributed system. The paper demonstrates that a total ordering can be imposed upon a system of partially ordered events using a simple algorithm for maintaining logical clocks. This ordering is essential for synchronizing distributed operations without relying on synchronized physical clocks, which may drift or be inconsistent. The work laid the groundwork for reasoning about causality and synchronization in distributed environments, influencing virtually every distributed algorithm that followed.
The “Happened-Before” Relation
Lamport defines a partial order among events based on causality, denoted by \(\rightarrow\):
Let \(a\) and \(b\) be events.
\(a \rightarrow b\) if:
- \(a\) and \(b\) occur in the same process and \(a\) precedes \(b\)
- \(a\) is the sending of a message and \(b\) is the receipt of that message
- \(\exists c\) such that \(a \rightarrow c \rightarrow b\)
This relation is a partial ordering. Two events are concurrent if neither \(a \rightarrow b\) nor \(b \rightarrow a\).
graph LR
subgraph Process P1
A1[Event a] --> A2[Send m]
end
subgraph Process P2
B1[Receive m] --> B2[Event b]
end
A2 --> B1
The “happened-before” relationship between events in different processes via message passing.
Logical Clocks
To extend the partial order to a total order, each process maintains a logical clock \(C_i\). The clocks are updated with the following rules:
- Each event at process \(P_i\): increment \(C_i \leftarrow C_i + d\), typically \(d = 1\)
- On sending a message \(m\): send timestamp \(T_m = C_i\)
- On receiving a message \(m\) with timestamp \(T_m\):
- \(C_j \leftarrow \max(C_j, T_m) + 1\)
This ensures that \(a \rightarrow b \Rightarrow C(a) < C(b)\).
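As a concrete illustration, the update rules can be sketched in a few lines of Python, assuming the increment \(d = 1\); the class and method names below are illustrative, not taken from the paper.

```python
# A minimal Lamport logical clock, assuming d = 1 per event.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        # Each local event increments the clock.
        self.time += 1
        return self.time

    def send_event(self):
        # Sending is an event; the returned value is the message timestamp T_m.
        self.time += 1
        return self.time

    def receive_event(self, t_m):
        # On receipt: C_j <- max(C_j, T_m) + 1.
        self.time = max(self.time, t_m) + 1
        return self.time
```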
Total Ordering Using Logical Clocks
To define a total order \(\Rightarrow\) on events:
\[ a \Rightarrow b \iff C(a) < C(b) \ \text{or} \ \bigl(C(a) = C(b) \text{ and } a \prec b\bigr) \]
where \(\prec\) is an arbitrary but fixed total ordering of the processes (e.g., by process identifier), used only to break ties.
This total order respects causality and can be used to serialize distributed operations in a consistent way.
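A small sketch of the tie-breaking comparison; the event representation used here is an assumption made for illustration, not the paper's notation.

```python
# Order events by (C(e), process id); the process id is the arbitrary but
# fixed tie-breaker. The dict shape below is purely illustrative.
def total_order_key(event):
    return (event["clock"], event["pid"])

events = [
    {"clock": 3, "pid": "P2", "op": "write y"},
    {"clock": 3, "pid": "P1", "op": "write x"},
    {"clock": 1, "pid": "P3", "op": "read x"},
]
# Every node that sorts by this key obtains the same serialization.
for e in sorted(events, key=total_order_key):
    print(e["clock"], e["pid"], e["op"])
```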
Application: Mutual Exclusion
Using the logical clock, Lamport proposes a distributed algorithm for mutual exclusion:
- When a process wants access to a resource, it timestamps a request and multicasts it.
- It enters the critical section only when:
- Its own request has the smallest timestamp among all pending requests in its queue, and
- It has received a message (acknowledgement or later request) from every other process with a timestamp greater than its own request's timestamp.
This guarantees safety (no two processes enter the critical section simultaneously) and fairness (requests granted in order).
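A compact sketch of the bookkeeping behind this algorithm is given below. It assumes reliable FIFO channels, a fixed set of peers, and an externally supplied send(dest, message) transport; all names are illustrative and much of the paper's detail is elided.

```python
# A minimal, single-node view of Lamport's mutual-exclusion algorithm (sketch).
class LamportMutex:
    def __init__(self, pid, peers, send):
        self.pid = pid            # this process's identifier
        self.peers = peers        # identifiers of all other processes
        self.send = send          # send(dest, message): assumed transport layer
        self.clock = 0
        self.queue = {}           # pid -> timestamp of its outstanding request
        self.acks = set()         # peers that have acknowledged our request

    def _tick(self, received=0):
        self.clock = max(self.clock, received) + 1

    def request(self):
        self._tick()
        ts = (self.clock, self.pid)            # (time, pid) breaks ties
        self.queue[self.pid] = ts
        self.acks.clear()
        for p in self.peers:
            self.send(p, ("REQUEST", ts))

    def on_message(self, sender, kind, ts):
        self._tick(ts[0])
        if kind == "REQUEST":
            self.queue[sender] = ts
            self.send(sender, ("ACK", (self.clock, self.pid)))
        elif kind == "ACK":
            self.acks.add(sender)
        elif kind == "RELEASE":
            self.queue.pop(sender, None)

    def can_enter(self):
        # Enter only when our request is the earliest queued request and every
        # other process has replied with something timestamped later than it.
        mine = self.queue.get(self.pid)
        return (mine is not None
                and all(mine < other
                        for p, other in self.queue.items() if p != self.pid)
                and self.acks == set(self.peers))

    def release(self):
        self.queue.pop(self.pid, None)
        self._tick()
        for p in self.peers:
            self.send(p, ("RELEASE", (self.clock, self.pid)))
```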
Physical Clock Synchronization
Later in the paper, Lamport considers real physical clocks and shows how to synchronize them within a bound \(\epsilon\) by bounding message transmission delays and adjusting local clocks to maintain that constraint.
Let \(C_i(t)\) be the clock of process \(P_i\) at real time \(t\), then the condition to maintain is: \(|C_i(t) - C_j(t)| < \epsilon,\) for all \(i\), \(j\).
If clock drift is bounded and messages arrive within known delay bounds, synchronization can be maintained using offset correction.
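The receive-side adjustment can be sketched as follows, assuming a known minimum transmission delay \(\mu\); the function name and the numbers in the example are illustrative.

```python
# On receiving a message stamped T_m, advance the local clock to at least
# T_m + mu, where mu is the assumed minimum transmission delay. Combined
# with bounded drift, this keeps |C_i(t) - C_j(t)| within the bound epsilon.
def on_receive(local_clock, message_timestamp, mu):
    return max(local_clock, message_timestamp + mu)

# Example: local clock reads 100.0, message stamped 100.4, minimum delay 0.1.
print(on_receive(100.0, 100.4, 0.1))  # -> 100.5
```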
This paper formalizes causality and time in distributed systems and introduces tools to make reasoning about distributed algorithms tractable. It is foundational for all work on synchronization, causality tracking, and distributed state consistency.
The Byzantine Generals Problem (1982)
Leslie Lamport, Robert Shostak, and Marshall Pease,
ACM Transactions on Programming Languages and Systems 4(3), 1982.
[pdf]
This foundational paper formalizes the problem of achieving consensus in distributed systems in the presence of arbitrary (Byzantine) failures—those in which components may act maliciously or unpredictably, including sending inconsistent or false messages. Inspired by an allegorical military scenario, the paper sets out conditions under which agreement among non-faulty participants (generals) is possible and proposes a class of algorithms that can solve this problem.
The key result:
No solution exists using only oral (unauthenticated) messages if \(n \le 3f\), where \(n\) is the number of processes and \(f\) is the number of faulty ones. However, when \(n > 3f\), interactive consistency can be achieved, guaranteeing that all loyal nodes agree on the same values, and that values from loyal senders are faithfully received.
This result laid the theoretical groundwork for Byzantine Fault Tolerance (BFT), now foundational in distributed consensus protocols such as PBFT and Tendermint.
Interactive Consistency Requirements
The core formalization is the Interactive Consistency (IC) problem:
- IC1: All loyal lieutenants obey the same order.
- IC2: If the commander is loyal, all loyal lieutenants obey the order sent.
To achieve this, each general recursively relays the messages received from others, building up a complete picture of the communication network’s state. The algorithm is structured in rounds, where each general plays the role of transmitter and relay.
Let:
- n be the total number of generals.
- f be the number of faulty generals.
Oral Messages Protocol: OM(m)
The authors define a recursive algorithm \(\text{OM}(m)\), where \(m\) is the maximum number of traitors to tolerate.
Protocol \(\text{OM}(m)\):
- The commander sends its value to every lieutenant.
- Each lieutenant receives the message and:
- If \(m > 0\), acts as a commander in \(\text{OM}(m-1)\), sending the received value to the other lieutenants.
- If \(m = 0\), uses the received value directly.
- Each lieutenant aggregates received messages using a majority function.
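A toy simulation of the recursion is sketched below, assuming synchronous rounds, reliable channels, and a deliberately simple adversary in which a traitor flips any value it relays; the function names and the tie-breaking default are illustrative choices, not part of the paper.

```python
# A toy simulation of OM(m) with oral messages (sketch, not the paper's code).
from collections import Counter

ATTACK, RETREAT = "ATTACK", "RETREAT"

def majority(values):
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return RETREAT                      # tie -> fixed default order
    return counts[0][0]

def send(sender, value, traitors):
    # One possible adversary: a traitorous sender flips the value it relays.
    return (RETREAT if value == ATTACK else ATTACK) if sender in traitors else value

def om(m, commander, value, lieutenants, traitors):
    """Return {lieutenant: the value it adopts} after running OM(m)."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {lt: send(commander, value, traitors) for lt in lieutenants}
    if m == 0:
        return received                     # Step 2 (m = 0): use it directly.
    # Step 2 (m > 0): each lieutenant acts as commander in OM(m - 1),
    # relaying the value it received to the other lieutenants.
    relayed = {j: om(m - 1, j, received[j],
                     [x for x in lieutenants if x != j], traitors)
               for j in lieutenants}
    # Step 3: lieutenant i takes the majority of its own value and the
    # values obtained from every other lieutenant's relay.
    return {i: majority([received[i]] +
                        [relayed[j][i] for j in lieutenants if j != i])
            for i in lieutenants}

# n = 4 generals (loyal commander G0 + 3 lieutenants), one traitor: n > 3m.
print(om(1, "G0", ATTACK, ["G1", "G2", "G3"], traitors={"G2"}))
```

With four generals and one traitor (\(n > 3m\) for \(m = 1\)), the loyal lieutenants G1 and G3 both adopt the loyal commander's order, as IC1 and IC2 require.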
Correctness Condition
They prove:
\(\text{OM}(m)\) satisfies the interactive consistency conditions whenever \(n > 3m\); conversely, no oral-message protocol can solve the problem when \(n \le 3m\).
Recursive Structure
Using mathematical induction, they show that with each recursive level, a Byzantine process may distort one set of messages, but not all. If the majority function aggregates enough overlapping input, loyal participants can infer the true value.
Formal Guarantees
Let:
- \(v(i)\) be the value proposed by general \(i\).
- \( \text{Maj}(x_1, …, x_k) \) be a majority function.
Then:
For \(m = 0\): All loyal generals use the value received from the commander.
For \(m > 0\): Each lieutenant \(i\) applies
\[v(i) = \text{Maj}\left(v_{i1}, v_{i2}, \dots, v_{i(n-1)}\right)\]
where \(v_{ij}\) is the value lieutenant \(i\) obtains from lieutenant \(j\) through the recursive invocation \(\text{OM}(m-1)\), and \(v_{ii}\) is the value received directly from the commander.
graph TD
Commander --> L1["Lieutenant 1"]
Commander --> L2["Lieutenant 2"]
Commander --> L3["Lieutenant 3"]
L1 --> L2_L1["Relay to L2"]
L1 --> L3_L1["Relay to L3"]
L2 --> L1_L2["Relay to L1"]
L2 --> L3_L2["Relay to L3"]
L3 --> L1_L3["Relay to L1"]
L3 --> L2_L3["Relay to L2"]
Message Tree
In the second round, each lieutenant relays what they received from the commander to the others.
Significance
The Byzantine Generals Problem is one of the most important theoretical results in distributed computing. It introduced the class of Byzantine failures and showed fundamental limits on consensus. Its results continue to influence:
- Blockchain consensus (PBFT, Tendermint)
- Fault-tolerant state machine replication
- Cloud-native resilient architectures
It also sparked extensive follow-up work on authenticated (signed) messages, reducing the necessary quorum size.
Impossibility of Distributed Consensus with One Faulty Process (1985)
Michael J. Fischer, Nancy Lynch, and Michael S. Paterson,
Journal of the ACM 32(2), 1985.
[pdf]
This seminal paper proves a fundamental result in distributed computing: in an asynchronous system, consensus is impossible if even one process can crash.
Known as the FLP impossibility result, the core conclusion is:
No deterministic consensus protocol can guarantee both termination and agreement in an asynchronous system with even one crash-fault.
The result had deep implications, showing that additional assumptions (such as partial synchrony or failure detectors) are required to build robust distributed consensus.
Problem Setting
The consensus problem imposes three requirements:
- Agreement: All non-faulty processes must decide on the same value.
- Validity: If all start with the same value \(v\), they must decide \(v\).
- Termination: All non-faulty processes eventually decide.
The system is:
- Asynchronous: Arbitrary delays in computation and communication.
- Crash-prone: Processes may halt at any time (fail-stop).
- Deterministic: No randomness used in decision-making.
Key Definitions
- Configuration: \[\text{Global system state} = \text{local states} + \text{message buffers}\]
- Event/Step: An atomic message delivery or local state transition.
- Bivalent Configuration: A configuration from which both decisions (0 and 1) are still reachable.
- Univalent Configuration: A configuration from which only one decision is possible.
The FLP Proof Strategy
The proof proceeds by contradiction:
- Existence of bivalence: There exists an initial configuration that is bivalent, because when initial values differ, the decision outcome depends on the order of message delivery and on which process, if any, crashes.
- Bivalence persistence: From a bivalent configuration, it is always possible for the adversary (scheduler) to choose a step that results in another bivalent configuration.
- Non-termination: The adversary avoids committing to a decision by:
- Delivering non-critical messages first.
- Never allowing decisive communication to occur.
Thus, the system never reaches a univalent (decided) configuration ⇒ violates termination.
Formal Result
Let:
- \(n \geq 2\): number of processes,
- 1 crash failure allowed,
- Asynchronous message-passing,
- No timing assumptions.
Then:
No deterministic protocol exists that solves consensus while ensuring termination, agreement, and validity.
graph TD
B0[Bivalent Configuration]
B0 -->|deliver m1| U0[Univalent: 0]
B0 -->|deliver m2| U1[Univalent: 1]
B0 -->|deliver m3| B1[Still Bivalent]
Bivalent Transition Control
By always choosing transitions like m3, the scheduler keeps the system in a bivalent state.
Implications and Legacy
- Consensus requires assumptions beyond pure asynchrony:
- Failure detectors (e.g., \(\Omega\))
- Partial synchrony (e.g., DLS model)
- Randomization (e.g., Ben-Or’s protocol)
- Inspired modern consensus protocols like:
- Paxos
- Raft
- PBFT
FLP showed that distributed consensus is not just hard, but impossible without extra assumptions.
The Part-Time Parliament (1998)
Leslie Lamport, ACM Transactions on Computer Systems 16(2), 1998. [pdf]
In this paper, Lamport presents a rigorous formalization of the Paxos consensus algorithm, disguised as the legislative process of a fictional civilization: the Paxons. The story metaphorically introduces how processes in a distributed system can reach agreement on a sequence of values, even in the presence of partial failures and message delays.
The main contribution is a protocol for reaching consensus in an asynchronous, crash-prone system, ensuring safety and progress using quorum-based voting and ballot numbers. Although the presentation is playful, the underlying protocol has become the canonical solution for replicated state machines in distributed systems.
Paxos ensures that multiple participants (replicas) can agree on a consistent, totally ordered log of commands, provided a majority of participants are available.
Problem: Replicated State Machines
Each node maintains a log of commands applied to a deterministic state machine.
- Goal: All correct nodes agree on the same sequence of commands (decrees).
- Challenge: Nodes may fail, rejoin, or communicate unreliably.
System Model
- Crash-stop failures only (no Byzantine faults).
- Asynchronous network (no timing guarantees).
- Nodes communicate via messages (which can be delayed or duplicated).
Core Concepts
Let:
- \(v\) be a proposed value (e.g., a command or decree).
- \(b \in \mathbb{N}\) be a ballot number (unique and increasing).
- \(Q \subseteq \text{Nodes}\) be a quorum (majority set).
Ballot
A ballot is an attempt to get a value chosen. It consists of:
- A unique ballot number \(b\),
- A quorum \(Q\),
- A proposed value \(v\),
- A record of votes.
Safety and Progress Invariants
- Only one value can be chosen per decree index.
- A value can be chosen only if a majority of acceptors accept it.
- Any two quorums intersect → ensures consistent knowledge.
Paxos Protocol: Phases
sequenceDiagram
participant Proposer
participant Acceptor1
participant Acceptor2
participant Acceptor3
Proposer->>Acceptor1: Prepare(b)
Proposer->>Acceptor2: Prepare(b)
Proposer->>Acceptor3: Prepare(b)
Acceptor1-->>Proposer: Promise(b), prior vote
Acceptor2-->>Proposer: Promise(b), prior vote
Acceptor3-->>Proposer: Promise(b), prior vote
Proposer->>Acceptor1: Accept(b, v)
Proposer->>Acceptor2: Accept(b, v)
Note right of Proposer: Value v chosen if majority accepts
The Paxos Protocol
Phase 1: Prepare
- A proposer sends Prepare(b) to all acceptors.
- Each acceptor responds:
- If \(b\) is greater than any ballot it has seen, it promises not to accept lower-numbered ballots and returns the highest-numbered ballot it has previously voted for (if any) along with its value.
- Otherwise, it ignores or rejects the request.
Phase 2: Propose
- If the proposer receives promises from a quorum, it proposes the value of the highest-numbered prior vote reported; if no prior vote was reported, it proposes its own value \(v\).
- It sends Accept(b, v) to the quorum.
- Acceptors accept the proposal unless they have already promised a higher ballot.
Decision
- Once a quorum has accepted \((b, v)\), the value \(v\) is chosen.
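The two phases can be sketched for a single decree as follows. This is a minimal, non-networked illustration: the class names, in-memory quorum counting, and example values are assumptions made for the sketch, not Lamport's presentation.

```python
# A minimal single-decree Paxos acceptor plus the proposer's value-selection
# rule (sketch: no networking, no persistence, illustrative names).
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised so far
        self.accepted_ballot = 0   # ballot of the last accepted proposal
        self.accepted_value = None # value of the last accepted proposal

    def prepare(self, ballot):
        """Phase 1: promise not to accept ballots lower than `ballot`."""
        if ballot > self.promised:
            self.promised = ballot
            # The promise carries the acceptor's highest prior vote, if any.
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised, None)

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot, self.accepted_value = ballot, value
            return True
        return False


def choose_value(promises, my_value):
    """Proposer rule: adopt the value of the highest-numbered prior vote
    reported by the quorum; only if none exists may it propose its own."""
    prior = [(b, v) for (_tag, b, v) in promises if v is not None]
    return max(prior)[1] if prior else my_value


# Example: three acceptors, one proposer using ballot 1 to propose "x = 7".
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) >= 2:          # quorum of 2 of 3
    v = choose_value(promises, "x = 7")
    accepted = sum(a.accept(1, v) for a in acceptors[:2])  # send to a quorum
    print("chosen" if accepted >= 2 else "not chosen", v)
```

Carrying the prior vote inside each promise is what forces a new proposer to preserve any value that may already have been chosen.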
Progress Mechanism
- Liveness depends on a stable proposer (leader).
- If multiple proposers compete, ballots may clash, but eventual stable leadership allows progress.
- This inspired Multi-Paxos, which amortizes consensus over many log entries once a leader is elected.
Paxos Properties
Property | Guarantee |
---|---|
Safety | No two different values can be chosen |
Consistency | All learners agree on the same value |
Liveness | Guaranteed only under eventual stability |
Fault Tolerance | Tolerates up to \(\left\lfloor \frac{n-1}{2} \right\rfloor\) crash failures |
Significance
- Paxos provides the theoretical backbone for many real-world systems:
- Google's Chubby and Spanner use Paxos directly; ZooKeeper (via ZAB), etcd, and Consul (via Raft) build on Paxos-inspired protocols.
- Introduced quorum-based voting and majority intersection as foundations for consensus.
- While conceptually elegant, Paxos was historically criticized for being difficult to understand, motivating later variants like Raft.
Further Refinements
- Multi-Paxos: Extends Paxos to a stream of decisions (replicated log).
- Cheap Paxos, Fast Paxos, Paxos Made Simple: Variants improving performance or clarity.
Brewer’s Conjecture and the Feasibility of CAP Web Services (2002)
Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2), 2002. [pdf]
This influential paper formally proves the CAP theorem, a conjecture originally posited by Eric Brewer in 2000. The theorem asserts:
It is impossible for a distributed data system to simultaneously provide: 1. Consistency (C), 2. Availability (A), 3. Partition Tolerance (P).
In any system where network partitions are possible, one must choose between consistency and availability. The result became a cornerstone of reasoning about distributed database and storage system design.
Definitions
Let us define the three properties more formally:
- Consistency (C): All nodes see the same data at the same time (linearizability).
- Availability (A): Every request receives a (non-error) response, without guarantee that it contains the most recent write.
- Partition Tolerance (P): The system continues to operate despite arbitrary message loss or partitioning of the network.
The CAP Theorem Statement
Let a system be modeled as an asynchronous network (unbounded message delay, no global clock), and let the system be partition-tolerant. Then:
No algorithm can simultaneously guarantee both availability and consistency in all executions.
This is shown via a proof by contradiction using partitions as adversarial delay.
Proof Sketch
Assume:
- The system is asynchronous,
- There is a network partition between two replicas,
- The system attempts to provide availability and consistency.
Let:
- Replica A receive a write \(w\),
- Replica B later receive a read \(r\), while A and B cannot communicate due to the partition.
Because of availability, B must return a result for \(r\), but because of consistency, the result must reflect \(w\), which B hasn’t seen due to the partition.
Contradiction ⇒ Cannot guarantee both C and A during a partition.
Formalization
Assuming linearizability:
A system that ensures availability and consistency must be able to immediately propagate writes.
But in the presence of a partition, this is not possible. Therefore:
\[ \textbf{Partition Tolerance} \Rightarrow \neg \left(\textbf{Availability} \land \textbf{Consistency}\right) \]
graph TD
C[Consistency]
A[Availability]
P[Partition Tolerance]
C -- can't coexist w/ A during P --> A
A -- can't coexist w/ C during P --> C
P --> C
P --> A
Tradeoff Triangle
Real-World Interpretations
- Systems must pick two out of three under failure:
- CA (no P): systems that fail under partition (e.g., traditional RDBMS).
- CP (sacrifice A): e.g., HBase, MongoDB in strong-consistency mode.
- AP (sacrifice C): e.g., Dynamo, Cassandra, Riak.
- During normal operation (no partition), systems can offer all three.
Misinterpretations Clarified
- CAP constrains behavior only in failure conditions (a partitioned network); it does not restrict normal, partition-free operation.
- Linearizability is a stronger form of consistency than eventual consistency.
- Eventual consistency systems favor A + P, sacrificing strong C.
Significance
The paper gave a formal backbone to Brewer’s conjecture, shifting CAP from an engineering intuition to a rigorous theorem.
Inspired decades of system design, influencing:
- Dynamo, Cassandra, Riak (AP systems),
- Spanner, Zookeeper (CP systems),
- PACELC model (adds latency/consistency tradeoff).
CAP remains one of the most cited results in distributed systems theory.
Consistency Tradeoffs in Modern Distributed Database System Design (2012)
Daniel J. Abadi, IEEE Computer 45(2), 2012. [pdf]
This paper critiques the widespread (and often incorrect) interpretation of the CAP theorem and introduces a more comprehensive model for analyzing distributed database tradeoffs: PACELC.
PACELC states: If a network Partition occurs (P), then a system must trade off between Availability (A) and Consistency (C); Else (E), when the system is running normally, it must trade off between Latency (L) and Consistency (C).
The model captures the reality that tradeoffs are present not only during failure but also in the normal, partition-free operation of distributed systems.
From CAP to PACELC
- CAP Theorem (Gilbert & Lynch, 2002) only applies during network partitions.
- But real systems face another critical design axis: latency vs. consistency in the absence of partitions.
- PACELC addresses both situations:
- P → A or C (during failure)
- E → L or C (during normal operation)
Thus:
\[ \textbf{PACELC} = \text{If P then A or C, Else L or C} \]
PACELC Design Taxonomy
Distributed databases can be classified using PACELC, where the design of a given system prioritizes, under a partition, Availability or Consistency, or, absent a partition, latency or consistency.
System | Under P | Under \(\neg\text{P}\) |
---|---|---|
Dynamo | A over C | L over C |
Cassandra | A over C | L over C |
BigTable | C over A | C over L |
Spanner | C over A | C over L |
PNUTS | C over A | L over C |
MongoDB | A over C | C over L |
HBase | C over A | C over L |
Tradeoff Formalization
Let:
- \(L\): latency (time to complete an operation)
- \(C\): consistency (e.g. linearizability)
- \(A\): availability (response under partition)
Under PACELC, a system is evaluated not only by its fault response, but by its steady-state behavior:
In the absence of partitions, a system can choose: \(L \leftrightarrow C\).
This shifts attention from just “what fails under partition” to “how fast and correct the system is when healthy.”
Latency/Consistency Tension
Example:
- Read-after-write consistency may require coordination across regions (adds latency).
- Geo-replicated systems may offer eventual consistency to avoid wide-area coordination latency.
Latency-Consistency Conflict:
- Immediate responses require giving up global knowledge (weaker consistency).
- Strong consistency requires synchronous coordination (increased latency).
graph TD
Start[System Operating] --> P{Partition Present?}
P -->|Yes| PC[Tradeoff: Availability vs Consistency]
P -->|No| EC[Tradeoff: Latency vs Consistency]
PACELC Logic Flow
Significance
- Extends CAP to reflect real-world tradeoffs faced even in healthy networks.
- Bridges theory and practice in database design.
- Widely adopted to classify NoSQL systems and cloud-native databases.
- Encourages evaluation of systems along both:
- Fault tolerance dimension (P)
- Performance-consistency dimension (E)
Comparison with CAP
Model | Context | Tradeoff |
---|---|---|
CAP | Failure (partition) | Availability vs Consistency |
PACELC | Normal + Partition | Latency vs Consistency + CAP |
CAP explains failure-time design. PACELC explains total behavior over time.
How to Beat the CAP Theorem (2013)
Nathan Marz, Thoughts from the Red Planet (blog), 2013. [pdf]
Nathan Marz introduces a new systems design perspective to address the limitations imposed by the CAP theorem. Instead of avoiding tradeoffs, his proposal isolates and manages them through a principled approach that emphasizes immutability, append-only data, and batch + streaming pipelines.
The central claim: We can “beat” the CAP theorem not by violating it, but by shifting the complexity it causes into system design patterns where it’s more manageable.
This gives rise to the Lambda Architecture, a system architecture that combines batch and real-time computation to ensure correctness, fault tolerance, and low latency.
The Lambda Architecture
The architecture is split into three layers:
- Batch Layer
- Stores the immutable master dataset
- Periodically recomputes results from scratch
- Ensures accuracy and reprocessability
- Implemented via MapReduce, Spark, etc.
- Speed Layer (Real-Time)
- Provides low-latency updates
- Computes incremental views using recent data
- May be eventually consistent or approximate
- Implemented via Storm, Flink, Samza, etc.
- Serving Layer
- Merges batch and speed outputs
- Exposes queryable views (e.g., precomputed tables)
Key Design Principle: Immutable Data
Marz argues that much of the complexity in distributed systems arises from:
- Mutable state
- Incremental, in-place updates
Instead, systems should treat data as immutable and append-only, allowing:
- Auditable logs
- Deterministic recomputation
- Simplified recovery (no “undo” necessary)
This enables the use of pure functions:
\[ \text{Query} = f(\text{All Data}) \]
Where recomputation = recomputing the function \(f\) from scratch.
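A toy sketch of this principle, using a made-up page-view count as \(f\); none of the names below come from Marz's post.

```python
# "Query = f(all data)" with a batch/speed split (illustrative sketch).
from collections import Counter

def batch_view(master_dataset):
    # Batch layer: recompute the view from scratch over the full dataset.
    return Counter(event["page"] for event in master_dataset)

def speed_view(recent_events):
    # Speed layer: incrementally count only events the last batch run missed.
    return Counter(event["page"] for event in recent_events)

def query(page, batch, speed):
    # Serving layer: merge the precomputed batch view with the realtime delta.
    return batch.get(page, 0) + speed.get(page, 0)

master = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
recent = [{"page": "/home"}]                   # arrived after the last batch run
print(query("/home", batch_view(master), speed_view(recent)))  # -> 3
```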
Data and Time
Two key properties of “data” in this architecture:
- Time-indexed: Every record exists at a moment in time.
- Immutable: Once recorded, never altered.
Hence, “updates” are new records, and “deletes” are logical tombstones.
graph TD
RawData[Immutable Raw Data] --> Batch[Batch Layer]
RawData --> Stream[Speed Layer]
Batch --> Merge[Merged Views]
Stream --> Merge
Merge --> Client[Serving Layer / Queries]
Lambda Architecture Flow
Beating the CAP Theorem?
CAP states: During a partition, you must trade off between C and A.
Marz acknowledges this, but argues that the Lambda Architecture sidesteps the pain by:
- Separating correctness (batch layer) from availability (speed layer).
- Embracing eventual consistency in streaming, correctness in batch.
- Designing for human fault-tolerance—reducing likelihood of developer mistakes via immutable data flow.
Practical Consequences
- Reprocessing is easy: re-run the batch job.
- Debugging is easier: full lineage via immutable logs.
- Fault isolation: real-time bugs don’t corrupt historical truth.
Significance
- Inspired widespread adoption of Lambda Architecture in real-time analytics systems.
- Precursor to modern Kappa Architecture (streaming-only).
- Shifted emphasis from correctness-in-place to recomputability and lineage.
- Spark, Flink, Kafka Streams, and others echo these principles.
Criticism and Evolution
- Requires code duplication (batch + streaming logic).
- Complex to operate and maintain two processing paths.
- The Kappa Architecture (Jay Kreps) was proposed as a simplification: unify batch and stream processing in a single stream-processing path.
Questioning the Lambda Architecture (2014)
Jay Kreps, O’Reilly Radar (blog), 2014. [pdf]
Jay Kreps, co-creator of Apache Kafka, critiques the Lambda Architecture—an influential design pattern combining batch and stream processing—by highlighting its operational complexity, code duplication, and unnecessary dual-path logic.
He argues that its goals (fault-tolerance, correctness, reprocessability) can be achieved more cleanly via a unified streaming architecture built on append-only logs, specifically through systems like Kafka and Samza.
“The Lambda Architecture deserves credit for identifying key problems… but running two systems in parallel, with duplicate logic, is just too complex.”
Lambda Architecture Recap
Marz’s Lambda model separates concerns:
- Batch Layer: Immutable historical truth
- Speed Layer: Real-time low-latency computation
- Serving Layer: Merges batch + stream views
Kreps praises:
- Emphasis on immutability
- Attention to reprocessing
- Support for debuggability via raw data retention
But notes that these goals are achievable with a single system designed properly.
Criticism of Lambda
- Code Duplication
- Business logic must be implemented twice (batch + stream).
- Hard to test for correctness equivalence.
- Operational Overhead
- Running and maintaining two stacks is difficult.
- Synchronizing state and logic becomes fragile.
- Stream Limitations Overstated
- Many claim streaming is inherently less powerful.
- Kreps disagrees: stream processors can now offer exactly-once semantics, state management, and windowing.
- Query-Time Merging Is Brittle
- Combining batch and speed results at query time introduces latency, complexity, and inconsistency.
Kreps’s Alternative: Unified Streaming
- Use a log-based architecture (e.g., Apache Kafka) as the system backbone.
- Treat all data as immutable events.
- Build views incrementally via stream processing (e.g., Samza, Flink, Kafka Streams).
- If a bug occurs, reprocess the entire log to rebuild the derived state.
This reflects the Kappa Architecture:
- One path (streaming) for all processing.
- Reprocessing via replaying the log.
graph TB
subgraph Kappa Architecture
Raw2["Raw Data (Append-Only Log)"] --> StreamOnly
StreamOnly --> Output2
end
subgraph Lambda Architecture
Raw[Raw Data] --> Batch
Raw --> Stream
Batch --> Merge
Stream --> Merge
Merge --> Output1
end
Lambda vs. Kappa architectures
Key Principle: Logs as the Central Abstraction
Kreps’s philosophy centers on:
- Immutable, replayable logs
- Streaming as default processing model
- Replay = Reprocessing = Debuggability
\[\text{Views} = f(\text{Event Log}) \]
This removes the need for a separate batch system to correct errors.
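A minimal sketch of this idea, using a plain Python list in place of a real log and an illustrative fold function rather than any Kafka or Samza API.

```python
# Derived views are a pure function of the append-only log: rebuilding state
# means replaying the log from the beginning through the same fold function.
def replay(log, fold, initial_state):
    state = initial_state
    for event in log:                        # same code path for live and replay
        state = fold(state, event)
    return state

def fold_balance(balances, event):
    # Example fold: maintain account balances from signed amounts.
    account, delta = event
    balances[account] = balances.get(account, 0) + delta
    return balances

log = [("alice", +100), ("bob", +50), ("alice", -30)]
print(replay(log, fold_balance, {}))         # {'alice': 70, 'bob': 50}
# Fixing a bug in fold_balance means redeploying it and replaying the log,
# rather than maintaining a separate batch pipeline.
```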
Significance
- Shifted the industry’s mindset toward stream-first architecture.
- Influenced development of systems like:
- Apache Kafka
- Kafka Streams
- Flink
- Materialize
- Popularized the Kappa Architecture, now widely used in modern data platforms.
Kreps’s critique helped refine Marz’s vision into more operationally elegant patterns for real-time systems.
In Search of an Understandable Consensus Algorithm (2014)
Diego Ongaro and John Ousterhout, USENIX Annual Technical Conference, 2014. [pdf]
This paper introduces Raft, a consensus algorithm designed to be more understandable than Paxos while providing equivalent fault-tolerance and correctness guarantees. Raft simplifies distributed consensus by decomposing it into three relatively independent subproblems:
- Leader Election
- Log Replication
- Safety
Raft was developed in response to the well-known difficulty of implementing Paxos correctly and intuitively. The authors validate Raft’s learnability through a controlled user study, demonstrating that students understand Raft significantly better than Paxos.
Problem Domain: Replicated State Machines
Raft solves the problem of maintaining a consistent replicated log across nodes in a distributed system.
Key requirements:
- Safety: No two nodes decide differently.
- Availability: Continue processing if a majority is alive.
- Efficiency: Make progress with a majority and a stable leader.
Raft Architecture
Raft separates consensus into three subcomponents:
- Leader Election
- Each term begins with an election.
- Nodes start as followers, become candidates if no heartbeats are received, and may become leaders if they win a majority.
- Randomized election timeouts prevent split votes.
- Log Replication
- Leader appends entries to its log and sends them to followers.
- Followers accept entries if they match the leader’s log prefix.
- Once a log entry is stored on a majority and committed, it is executed.
- Safety Rules
- Only entries from the current term may be committed via the leader.
- Log matching: entries are only accepted if preceding entries match (index and term).
- Elections can only be won by candidates whose logs are at least as up-to-date as those of a majority of the cluster.
stateDiagram-v2
[*] --> Follower
Follower --> Candidate : Election timeout
Candidate --> Leader : Wins majority vote
Candidate --> Follower : Receives valid heartbeat
Leader --> Follower : Steps down on higher term
Raft Node State Transitions
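The transitions above can be sketched as follows. The sketch omits networking, log replication, and persistence; the randomized 150-300 ms timeout range mirrors the paper's example values, while the class and method names are illustrative.

```python
# Follower/candidate/leader transitions with randomized election timeouts
# (sketch only: no RPCs, no log handling, no persistence).
import random

class RaftNode:
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.state = "follower"
        self.term = 0
        self.votes = 0
        self.reset_election_timeout()

    def reset_election_timeout(self):
        # Randomization (here 150-300 ms) makes split votes unlikely.
        self.election_timeout = random.uniform(0.150, 0.300)

    def on_election_timeout(self):
        # No heartbeat arrived in time: start a new election as a candidate.
        self.state = "candidate"
        self.term += 1
        self.votes = 1                     # vote for itself
        self.reset_election_timeout()      # then request votes from peers

    def on_vote_granted(self):
        self.votes += 1
        if self.state == "candidate" and self.votes > self.cluster_size // 2:
            self.state = "leader"          # majority reached

    def on_heartbeat(self, leader_term):
        # A valid heartbeat (AppendEntries) from an equal or higher term
        # forces the node back to follower and resets its timer.
        if leader_term >= self.term:
            self.term = leader_term
            self.state = "follower"
            self.reset_election_timeout()
```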
Cluster Membership Changes
Raft introduces a joint consensus mechanism:
- Old and new configurations must overlap in quorum.
- Transition proceeds via a two-phase configuration change.
This allows safe reconfiguration while continuing to make progress.
Formal Properties
Raft ensures:
- Leader Completeness: If a log entry is committed in a term, all future leaders have that entry.
- State Machine Safety: No two machines apply different commands at the same log index.
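The Log Matching property rests on the consistency check a follower performs before appending entries; a minimal sketch is shown below, with an assumed entry shape and 0-based indices for brevity.

```python
# AppendEntries consistency check (sketch): a follower accepts new entries
# only if its log already holds an entry with prev_term at prev_index.
def append_entries(log, prev_index, prev_term, entries):
    """log: list of {"term": int, "cmd": ...}; indices are 0-based here."""
    if prev_index >= 0 and (prev_index >= len(log)
                            or log[prev_index]["term"] != prev_term):
        return False                          # mismatch: leader retries earlier
    for i, entry in enumerate(entries):
        idx = prev_index + 1 + i
        if idx < len(log) and log[idx]["term"] != entry["term"]:
            del log[idx:]                     # delete the conflicting suffix
        if idx >= len(log):
            log.append(entry)
    return True

follower_log = [{"term": 1, "cmd": "a"}, {"term": 1, "cmd": "b"}]
ok = append_entries(follower_log, 1, 1, [{"term": 2, "cmd": "c"}])
print(ok, [e["cmd"] for e in follower_log])   # True ['a', 'b', 'c']
```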
Comparison with Paxos
Feature | Paxos | Raft |
---|---|---|
Understandability | Low | High |
Leader-based | Optional | Core to design |
Log replication | Not fully specified | Explicitly integrated |
Membership change | Informal & Complex | Joint consensus mechanism |
Evaluation
- User study (43 students) found:
- Raft was more easily learned than Paxos.
- Students answered more questions correctly about Raft than Paxos.
Significance
- Raft has become the de facto standard for building replicated systems with strong consistency.
- Used in: etcd, Consul, RethinkDB, TiKV, and many others.
- Sparked wide interest in formalizing distributed protocols in a human-accessible way.
Raft’s success illustrates that understandability is not in conflict with rigor or performance.
Discussion
This set of writings collectively defines the core intellectual scaffolding of distributed systems. Lamport’s logical clocks and Paxos, the FLP impossibility, the Byzantine consensus bound, and CAP/PACELC tradeoffs are canonical knowledge. Later works like Raft and critiques of Lambda architecture represent efforts to tame the complexity these theories introduced.
Temporal Ordering and Causality
Lamport’s 1978 paper introduced a conceptual breakthrough: time in distributed systems can be modeled without synchronized clocks by defining a partial ordering of events via the “happened-before” relation. This enabled a formal treatment of causality and laid the groundwork for reasoning about system-wide consistency despite the absence of a global clock. Logical clocks and vector clocks continue to be used to infer causal structure in messaging and version histories.
Consensus: From Impossibility to Implementation
The FLP impossibility result (1985) demonstrated that no deterministic consensus protocol can guarantee termination in a purely asynchronous system with even a single crash failure. This result was both a barrier and a motivator. It forced system designers to either relax assumptions (e.g., allow randomization or partial synchrony) or redefine liveness expectations.
Paxos (1998), while notoriously difficult to understand, provided a workable and rigorous solution to consensus under crash failures and asynchronous assumptions. Raft (2014) offered a simplification by decomposing consensus into more intuitive subproblems (election, log replication, safety), increasing accessibility and correctness in real-world implementations.
The CAP Theorem and Its Refinements
Gilbert and Lynch’s formalization of Brewer’s conjecture (2002) articulated a hard limit: in the presence of partitions, systems cannot simultaneously guarantee availability and consistency. This result reframed systems design around strategic tradeoffs.
PACELC (2010) expanded this model, recognizing that latency-consistency tradeoffs also apply during normal operations. Systems thus face choices not just in failure conditions, but in everyday performance tuning.
Immutability as a Systems Pattern
Nathan Marz’s Lambda Architecture (2013) proposed a conceptual workaround to CAP by decoupling correctness (via batch recomputation) from responsiveness (via real-time computation). The design emphasized immutability, append-only logs, and recomputability—anticipating trends now foundational in event-sourced architectures.
Jay Kreps’ critique (2014) acknowledged the insights of Lambda but argued for a more streamlined approach—Kappa Architecture—where a unified, replayable event log (e.g., Kafka) serves both real-time and recovery purposes. This eliminated code duplication and operational burden, favoring simplicity and auditability.