Distributed Systems Foundations
Distributed systems lie at the heart of modern computing infrastructure, enabling fault-tolerant, scalable services that span data centers, continents, and networks of unreliable components. The challenge of designing such systems is to reconcile correctness and performance under the realities of partial failure, concurrency, and asynchrony.
This literature review traces the intellectual evolution of key ideas that define the distributed systems landscape. It begins with foundational work by Leslie Lamport on the logical structure of time and causality in asynchronous systems, proceeds through impossibility results like the FLP theorem, and culminates in practical consensus algorithms that form the backbone of today’s distributed databases and coordination services.
Alongside consensus theory, we explore the articulation of tradeoff models that formalize the limits of distributed consistency. The CAP theorem, and later PACELC, provide a vocabulary for understanding what can be achieved—and what must be sacrificed—when systems face network partitions or demand low-latency responses.
Finally, we examine architectural responses to these theoretical constraints. The Lambda and Kappa architectures represent two distinct attempts to structure computation in the presence of uncertainty and scale: one by separating batch and stream computation, the other by consolidating them into a unified streaming model.
Evolution of Distributed Systems Theory and Architecture
year | milestone |
---|---|
1978 | Lamport introduces Logical Clocks and the “happened-before” relation |
1982 | Byzantine Generals Problem formalizes consensus under arbitrary failures |
1985 | FLP Impossibility Theorem proves deterministic consensus cannot be guaranteed in an asynchronous system with even one crash failure |
1998 | Paxos protocol published as “The Part-Time Parliament” |
2002 | CAP Theorem formalized by Gilbert & Lynch |
2010 | PACELC model proposed by Abadi, introducing latency-consistency tradeoff |
2013 | Marz proposes the Lambda Architecture with immutable data and dual-path computation |
2014 | Kreps critiques Lambda and introduces the Kappa Architecture as a streaming-first alternative |
2014 | Ongaro & Ousterhout present Raft, emphasizing understandability and implementability |
Several broad themes emerge across the literature:
- Causality and Ordering: Without synchronized clocks, distributed systems rely on logical or physical surrogates to preserve causal reasoning.
- Fault Tolerance Through Redundancy and Agreement: Quorums and majority-based protocols are central to tolerating crashes while ensuring safety.
- Tradeoff-Aware Design: CAP and PACELC enforce humility—designers must consciously choose what to sacrifice under different conditions.
- Architectural Patterns for Scalability: Event logs, immutability, and reprocessing offer powerful techniques to build scalable, resilient applications.
Time, Clocks, and the Ordering of Events in a Distributed System (1978)
Leslie Lamport, Communications of the ACM 21(7), 1978. [pdf]
Leslie Lamport’s seminal 1978 paper introduces the concept of logical clocks to formalize the partial ordering of events in a distributed system. The paper demonstrates that a total ordering can be imposed upon a system of partially ordered events using a simple algorithm for maintaining logical clocks. This ordering is essential for synchronizing distributed operations without relying on synchronized physical clocks, which may drift or be inconsistent. The work laid the groundwork for reasoning about causality and synchronization in distributed environments, influencing virtually every distributed algorithm that followed.
The “Happened-Before” Relation
Lamport defines a partial order among events based on causality, denoted by \(\rightarrow\):
Let \(a\) and \(b\) be events.
\(a \rightarrow b\) if:
- \(a\) and \(b\) occur in the same process and \(a\) precedes \(b\)
- \(a\) is the sending of a message and \(b\) is the receipt of that message
- \(\exists c\) such that \(a \rightarrow c \rightarrow b\)
This relation is a partial ordering. Two events are concurrent if neither \(a \rightarrow b\) nor \(b \rightarrow a\).
graph LR
subgraph Process P1
A1[Event a] --> A2[Send m]
end
subgraph Process P2
B1[Receive m] --> B2[Event b]
end
A2 --> B1
The “happened-before” relationship between events in different processes via message passing.
Logical Clocks
To extend the partial order to a total order, each process maintains a logical clock \(C_i\). The clocks are updated with the following rules:
- Each event at process \(P_i\): increment \(C_i \leftarrow C_i + d\), typically \(d = 1\)
- On sending a message \(m\): send timestamp \(T_m = C_i\)
- On receiving a message \(m\) with timestamp \(T_m\):
- \(C_j \leftarrow \max(C_j, T_m) + 1\)
This ensures that \(a \rightarrow b \Rightarrow C(a) < C(b)\).
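As a concrete illustration, the update rules can be sketched in a few lines of Python, assuming the increment \(d = 1\); the class and method names below are illustrative, not taken from the paper.

```python
# A minimal Lamport logical clock, assuming d = 1 per event.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        # Each local event increments the clock.
        self.time += 1
        return self.time

    def send_event(self):
        # Sending is an event; the returned value is the message timestamp T_m.
        self.time += 1
        return self.time

    def receive_event(self, t_m):
        # On receipt: C_j <- max(C_j, T_m) + 1.
        self.time = max(self.time, t_m) + 1
        return self.time
```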
Total Ordering Using Logical Clocks
To define a total order \(\Rightarrow\) on events:
\[ a \Rightarrow b \iff C(a) < C(b) \ \text{or} \ \bigl(C(a) = C(b) \text{ and } a \prec b\bigr) \]
where \(\prec\) is an arbitrary but fixed total ordering of the processes (e.g., by process identifier), used only to break ties.
This total order respects causality and can be used to serialize distributed operations in a consistent way.
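A small sketch of the tie-breaking comparison; the event representation used here is an assumption made for illustration, not the paper's notation.

```python
# Order events by (C(e), process id); the process id is the arbitrary but
# fixed tie-breaker. The dict shape below is purely illustrative.
def total_order_key(event):
    return (event["clock"], event["pid"])

events = [
    {"clock": 3, "pid": "P2", "op": "write y"},
    {"clock": 3, "pid": "P1", "op": "write x"},
    {"clock": 1, "pid": "P3", "op": "read x"},
]
# Every node that sorts by this key obtains the same serialization.
for e in sorted(events, key=total_order_key):
    print(e["clock"], e["pid"], e["op"])
```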
Application: Mutual Exclusion
Using the logical clock, Lamport proposes a distributed algorithm for mutual exclusion:
- When a process wants access to a resource, it timestamps a request and multicasts it.
- It enters the critical section only when:
- Its own request has the smallest timestamp among all pending requests in its queue, and
- It has received a message (acknowledgement or later request) from every other process with a timestamp greater than its own request's timestamp.
This guarantees safety (no two processes enter the critical section simultaneously) and fairness (requests granted in order).
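A compact sketch of the bookkeeping behind this algorithm is given below. It assumes reliable FIFO channels, a fixed set of peers, and an externally supplied send(dest, message) transport; all names are illustrative and much of the paper's detail is elided.

```python
# A minimal, single-node view of Lamport's mutual-exclusion algorithm (sketch).
class LamportMutex:
    def __init__(self, pid, peers, send):
        self.pid = pid            # this process's identifier
        self.peers = peers        # identifiers of all other processes
        self.send = send          # send(dest, message): assumed transport layer
        self.clock = 0
        self.queue = {}           # pid -> timestamp of its outstanding request
        self.acks = set()         # peers that have acknowledged our request

    def _tick(self, received=0):
        self.clock = max(self.clock, received) + 1

    def request(self):
        self._tick()
        ts = (self.clock, self.pid)            # (time, pid) breaks ties
        self.queue[self.pid] = ts
        self.acks.clear()
        for p in self.peers:
            self.send(p, ("REQUEST", ts))

    def on_message(self, sender, kind, ts):
        self._tick(ts[0])
        if kind == "REQUEST":
            self.queue[sender] = ts
            self.send(sender, ("ACK", (self.clock, self.pid)))
        elif kind == "ACK":
            self.acks.add(sender)
        elif kind == "RELEASE":
            self.queue.pop(sender, None)

    def can_enter(self):
        # Enter only when our request is the earliest queued request and every
        # other process has replied with something timestamped later than it.
        mine = self.queue.get(self.pid)
        return (mine is not None
                and all(mine < other
                        for p, other in self.queue.items() if p != self.pid)
                and self.acks == set(self.peers))

    def release(self):
        self.queue.pop(self.pid, None)
        self._tick()
        for p in self.peers:
            self.send(p, ("RELEASE", (self.clock, self.pid)))
```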
Physical Clock Synchronization
Later in the paper, Lamport considers real physical clocks and shows how to synchronize them within a bound \(\epsilon\) by bounding message transmission delays and adjusting local clocks to maintain that constraint.
Let \(C_i(t)\) be the clock of process \(P_i\) at real time \(t\), then the condition to maintain is: \(|C_i(t) - C_j(t)| < \epsilon,\) for all \(i\), \(j\).
If clock drift is bounded and messages arrive within known delay bounds, synchronization can be maintained using offset correction.
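The receive-side adjustment can be sketched as follows, assuming a known minimum transmission delay \(\mu\); the function name and the numbers in the example are illustrative.

```python
# On receiving a message stamped T_m, advance the local clock to at least
# T_m + mu, where mu is the assumed minimum transmission delay. Combined
# with bounded drift, this keeps |C_i(t) - C_j(t)| within the bound epsilon.
def on_receive(local_clock, message_timestamp, mu):
    return max(local_clock, message_timestamp + mu)

# Example: local clock reads 100.0, message stamped 100.4, minimum delay 0.1.
print(on_receive(100.0, 100.4, 0.1))  # -> 100.5
```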
This paper formalizes causality and time in distributed systems and introduces tools to make reasoning about distributed algorithms tractable. It is foundational for all work on synchronization, causality tracking, and distributed state consistency.
The Byzantine Generals Problem (1982)
Leslie Lamport, Robert Shostak, and Marshall Pease,
ACM Transactions on Programming Languages and Systems 4(3), 1982.
[pdf]
This foundational paper formalizes the problem of achieving consensus in distributed systems in the presence of arbitrary (Byzantine) failures—those in which components may act maliciously or unpredictably, including sending inconsistent or false messages. Inspired by an allegorical military scenario, the paper sets out conditions under which agreement among non-faulty participants (generals) is possible and proposes a class of algorithms that can solve this problem.
The key result:
No solution exists using only oral (unauthenticated) messages if \(n \le 3f\), where \(n\) is the number of processes and \(f\) is the number of faulty ones. However, when \(n > 3f\), interactive consistency can be achieved, guaranteeing that all loyal nodes agree on the same values, and that values from loyal senders are faithfully received.
This result laid the theoretical groundwork for Byzantine Fault Tolerance (BFT), now foundational in distributed consensus protocols such as PBFT and Tendermint.
Interactive Consistency Requirements
The core formalization is the Interactive Consistency (IC) problem:
- IC1: All loyal lieutenants obey the same order.
- IC2: If the commander is loyal, all loyal lieutenants obey the order sent.
To achieve this, each general recursively relays the messages received from others, building up a complete picture of the communication network’s state. The algorithm is structured in rounds, where each general plays the role of transmitter and relay.
Let:
- n be the total number of generals.
- f be the number of faulty generals.
Oral Messages Protocol: OM(m)
The authors define a recursive algorithm \(\text{OM}(m)\), where \(m\) is the maximum number of traitors to tolerate.
Protocol \(\text{OM}(m)\):
- The commander sends its value to every lieutenant.
- Each lieutenant receives the message and:
- If \(m > 0\), acts as a commander in \(\text{OM}(m-1)\), sending the received value to the other lieutenants.
- If \(m = 0\), uses the received value directly.
- Each lieutenant aggregates received messages using a majority function.
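A toy simulation of the recursion is sketched below, assuming synchronous rounds, reliable channels, and a deliberately simple adversary in which a traitor flips any value it relays; the function names and the tie-breaking default are illustrative choices, not part of the paper.

```python
# A toy simulation of OM(m) with oral messages (sketch, not the paper's code).
from collections import Counter

ATTACK, RETREAT = "ATTACK", "RETREAT"

def majority(values):
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return RETREAT                      # tie -> fixed default order
    return counts[0][0]

def send(sender, value, traitors):
    # One possible adversary: a traitorous sender flips the value it relays.
    return (RETREAT if value == ATTACK else ATTACK) if sender in traitors else value

def om(m, commander, value, lieutenants, traitors):
    """Return {lieutenant: the value it adopts} after running OM(m)."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {lt: send(commander, value, traitors) for lt in lieutenants}
    if m == 0:
        return received                     # Step 2 (m = 0): use it directly.
    # Step 2 (m > 0): each lieutenant acts as commander in OM(m - 1),
    # relaying the value it received to the other lieutenants.
    relayed = {j: om(m - 1, j, received[j],
                     [x for x in lieutenants if x != j], traitors)
               for j in lieutenants}
    # Step 3: lieutenant i takes the majority of its own value and the
    # values obtained from every other lieutenant's relay.
    return {i: majority([received[i]] +
                        [relayed[j][i] for j in lieutenants if j != i])
            for i in lieutenants}

# n = 4 generals (loyal commander G0 + 3 lieutenants), one traitor: n > 3m.
print(om(1, "G0", ATTACK, ["G1", "G2", "G3"], traitors={"G2"}))
```

With four generals and one traitor (\(n > 3m\) for \(m = 1\)), the loyal lieutenants G1 and G3 both adopt the loyal commander's order, as IC1 and IC2 require.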
Correctness Condition
They prove:
\(\text{OM}(m)\) satisfies the interactive consistency conditions whenever \(n > 3m\); conversely, no oral-message protocol can solve the problem when \(n \le 3m\).
Recursive Structure
Using mathematical induction, they show that with each recursive level, a Byzantine process may distort one set of messages, but not all. If the majority function aggregates enough overlapping input, loyal participants can infer the true value.
Formal Guarantees
Let:
- \(v(i)\) be the value proposed by general \(i\).
- \( \text{Maj}(x_1, …, x_k) \) be a majority function.
Then:
For \(m = 0\): All loyal generals use the value received from the commander.
For \(m > 0\): Each lieutenant \(i\) applies
\[v(i) = \text{Maj}\left(v_{i1}, v_{i2}, \dots, v_{i(n-1)}\right)\]
where \(v_{ij}\) is the value lieutenant \(i\) obtains from lieutenant \(j\) through the recursive invocation \(\text{OM}(m-1)\), and \(v_{ii}\) is the value received directly from the commander.
graph TD
Commander --> L1["Lieutenant 1"]
Commander --> L2["Lieutenant 2"]
Commander --> L3["Lieutenant 3"]
L1 --> L2_L1["Relay to L2"]
L1 --> L3_L1["Relay to L3"]
L2 --> L1_L2["Relay to L1"]
L2 --> L3_L2["Relay to L3"]
L3 --> L1_L3["Relay to L1"]
L3 --> L2_L3["Relay to L2"]
Message Tree
In the second round, each lieutenant relays what they received from the commander to the others.
Significance
The Byzantine Generals Problem is one of the most important theoretical results in distributed computing. It introduced the class of Byzantine failures and showed fundamental limits on consensus. Its results continue to influence:
- Blockchain consensus (PBFT, Tendermint)
- Fault-tolerant state machine replication
- Cloud-native resilient architectures
It also sparked extensive follow-up work on authenticated (signed) messages, reducing the necessary quorum size.
Impossibility of Distributed Consensus with One Faulty Process (1985)
Michael J. Fischer, Nancy Lynch, and Michael S. Paterson,
Journal of the ACM 32(2), 1985.
[pdf]
This seminal paper proves a fundamental result in distributed computing: in an asynchronous system, consensus is impossible if even one process can crash.
Known as the FLP impossibility result, the core conclusion is:
No deterministic consensus protocol can guarantee both termination and agreement in an asynchronous system with even one crash-fault.
The result had deep implications, showing that additional assumptions (such as partial synchrony or failure detectors) are required to build robust distributed consensus.
Problem Setting
The consensus problem imposes three requirements:
- Agreement: All non-faulty processes must decide on the same value.
- Validity: If all start with the same value \(v\), they must decide \(v\).
- Termination: All non-faulty processes eventually decide.
The system is:
- Asynchronous: Arbitrary delays in computation and communication.
- Crash-prone: Processes may halt at any time (fail-stop).
- Deterministic: No randomness used in decision-making.
Key Definitions
- Configuration: \[\text{Global system state} = \text{local states} + \text{message buffers}\]
- Event/Step: An atomic message delivery or local state transition.
- Bivalent Configuration: A configuration from which both decisions (0 and 1) are still reachable.
- Univalent Configuration: A configuration from which only one decision is possible.
The FLP Proof Strategy
The proof proceeds by contradiction:
- Existence of bivalence: There exists an initial configuration that is bivalent, because when initial values differ, the decision outcome depends on the order of message delivery and on which process, if any, crashes.
- Bivalence persistence: From a bivalent configuration, it is always possible for the adversary (scheduler) to choose a step that results in another bivalent configuration.
- Non-termination: The adversary avoids committing to a decision by:
- Delivering non-critical messages first.
- Never allowing decisive communication to occur.
Thus, the system never reaches a univalent (decided) configuration ⇒ violates termination.
Formal Result
Let:
- \(n \geq 2\): number of processes,
- 1 crash failure allowed,
- Asynchronous message-passing,
- No timing assumptions.
Then:
No deterministic protocol exists that solves consensus while ensuring termination, agreement, and validity.
graph TD
B0[Bivalent Configuration]
B0 -->|deliver m1| U0[Univalent: 0]
B0 -->|deliver m2| U1[Univalent: 1]
B0 -->|deliver m3| B1[Still Bivalent]
Bivalent Transition Control
By always choosing transitions like m3, the scheduler keeps the system in a bivalent state.
Implications and Legacy
- Consensus requires assumptions beyond pure asynchrony:
- Failure detectors (e.g., \(\Omega\))
- Partial synchrony (e.g., DLS model)
- Randomization (e.g., Ben-Or’s protocol)
- Inspired modern consensus protocols like:
- Paxos
- Raft
- PBFT
FLP showed that distributed consensus is not just hard, but impossible without extra assumptions.
The Part-Time Parliament (1998)
Leslie Lamport, ACM Transactions on Computer Systems 16(2), 1998. [pdf]
In this paper, Lamport presents a rigorous formalization of the Paxos consensus algorithm, disguised as the legislative process of a fictional civilization: the Paxons. The story metaphorically introduces how processes in a distributed system can reach agreement on a sequence of values, even in the presence of partial failures and message delays.
The main contribution is a protocol for reaching consensus in an asynchronous, crash-prone system, ensuring safety and progress using quorum-based voting and ballot numbers. Although the presentation is playful, the underlying protocol has become the canonical solution for replicated state machines in distributed systems.
Paxos ensures that multiple participants (replicas) can agree on a consistent, totally ordered log of commands, provided a majority of participants are available.
Problem: Replicated State Machines
Each node maintains a log of commands applied to a deterministic state machine.
- Goal: All correct nodes agree on the same sequence of commands (decrees).
- Challenge: Nodes may fail, rejoin, or communicate unreliably.
System Model
- Crash-stop failures only (no Byzantine faults).
- Asynchronous network (no timing guarantees).
- Nodes communicate via messages (which can be delayed or duplicated).
Core Concepts
Let:
- \(v\) be a proposed value (e.g., a command or decree).
- \(b \in \mathbb{N}\) be a ballot number (unique and increasing).
- \(Q \subseteq \text{Nodes}\) be a quorum (majority set).
Ballot
A ballot is an attempt to get a value chosen. It consists of:
- A unique ballot number \(b\),
- A quorum \(Q\),
- A proposed value \(v\),
- A record of votes.
Safety and Progress Invariants
- Only one value can be chosen per decree index.
- A value can be chosen only if a majority of acceptors accept it.
- Any two quorums intersect → ensures consistent knowledge.
Paxos Protocol: Phases
sequenceDiagram
participant Proposer
participant Acceptor1
participant Acceptor2
participant Acceptor3
Proposer->>Acceptor1: Prepare(b)
Proposer->>Acceptor2: Prepare(b)
Proposer->>Acceptor3: Prepare(b)
Acceptor1-->>Proposer: Promise(b), prior vote
Acceptor2-->>Proposer: Promise(b), prior vote
Acceptor3-->>Proposer: Promise(b), prior vote
Proposer->>Acceptor1: Accept(b, v)
Proposer->>Acceptor2: Accept(b, v)
Note right of Proposer: Value v chosen if majority accepts
The Paxos Protocol
Phase 1: Prepare
- A proposer sends Prepare(b) to all acceptors.
- Each acceptor responds:
- If \(b\) is greater than any ballot it has seen, it promises not to accept lower-numbered ballots and returns the highest-numbered ballot it has previously voted for (if any) along with its value.
- Otherwise, it ignores or rejects the request.
Phase 2: Propose
- If the proposer receives promises from a quorum, it proposes the value of the highest-numbered prior vote reported; if no prior vote was reported, it proposes its own value \(v\).
- It sends Accept(b, v) to the quorum.
- Acceptors accept the proposal unless they have already promised a higher ballot.
Decision
- Once a quorum has accepted \((b, v)\), the value \(v\) is chosen.
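The two phases can be sketched for a single decree as follows. This is a minimal, non-networked illustration: the class names, in-memory quorum counting, and example values are assumptions made for the sketch, not Lamport's presentation.

```python
# A minimal single-decree Paxos acceptor plus the proposer's value-selection
# rule (sketch: no networking, no persistence, illustrative names).
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised so far
        self.accepted_ballot = 0   # ballot of the last accepted proposal
        self.accepted_value = None # value of the last accepted proposal

    def prepare(self, ballot):
        """Phase 1: promise not to accept ballots lower than `ballot`."""
        if ballot > self.promised:
            self.promised = ballot
            # The promise carries the acceptor's highest prior vote, if any.
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised, None)

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot, self.accepted_value = ballot, value
            return True
        return False


def choose_value(promises, my_value):
    """Proposer rule: adopt the value of the highest-numbered prior vote
    reported by the quorum; only if none exists may it propose its own."""
    prior = [(b, v) for (_tag, b, v) in promises if v is not None]
    return max(prior)[1] if prior else my_value


# Example: three acceptors, one proposer using ballot 1 to propose "x = 7".
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) >= 2:          # quorum of 2 of 3
    v = choose_value(promises, "x = 7")
    accepted = sum(a.accept(1, v) for a in acceptors[:2])  # send to a quorum
    print("chosen" if accepted >= 2 else "not chosen", v)
```

Carrying the prior vote inside each promise is what forces a new proposer to preserve any value that may already have been chosen.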
Progress Mechanism
- Liveness depends on a stable proposer (leader).
- If multiple proposers compete, ballots may clash, but eventual stable leadership allows progress.
- This inspired Multi-Paxos, which amortizes consensus over many log entries once a leader is elected.
Paxos Properties
Property | Guarantee |
---|---|
Safety | No two different values can be chosen |
Consistency | All learners agree on the same value |
Liveness | Guaranteed only under eventual stability |
Fault Tolerance | Tolerates up to \(\left\lfloor \frac{n-1}{2} \right\rfloor\) crash failures |
Significance
- Paxos provides the theoretical backbone for many real-world systems:
- Google's Chubby and Spanner use Paxos directly; ZooKeeper (via ZAB), etcd, and Consul (via Raft) build on Paxos-inspired protocols.
- Introduced quorum-based voting and majority intersection as foundations for consensus.
- While conceptually elegant, Paxos was historically criticized for being difficult to understand, motivating later variants like Raft.
Further Refinements
- Multi-Paxos: Extends Paxos to a stream of decisions (replicated log).
- Cheap Paxos, Fast Paxos, Paxos Made Simple: Variants improving performance or clarity.
Brewer’s Conjecture and the Feasibility of CAP Web Services (2002)
Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2), 2002. [pdf]
This influential paper formally proves the CAP theorem, a conjecture originally posited by Eric Brewer in 2000. The theorem asserts:
It is impossible for a distributed data system to simultaneously provide: 1. Consistency (C), 2. Availability (A), 3. Partition Tolerance (P).
In any system where network partitions are possible, one must choose between consistency and availability. The result became a cornerstone of reasoning about distributed database and storage system design.
Definitions
Let us define the three properties more formally:
- Consistency (C): All nodes see the same data at the same time (linearizability).
- Availability (A): Every request receives a (non-error) response, without guarantee that it contains the most recent write.
- Partition Tolerance (P): The system continues to operate despite arbitrary message loss or partitioning of the network.
The CAP Theorem Statement
Let a system be modeled as an asynchronous network (unbounded message delay, no global clock), and let the system be partition-tolerant. Then:
No algorithm can simultaneously guarantee both availability and consistency in all executions.
This is shown via a proof by contradiction using partitions as adversarial delay.
Proof Sketch
Assume:
- The system is asynchronous,
- There is a network partition between two replicas,
- The system attempts to provide availability and consistency.
Let:
- Replica A receive a write \(w\),
- Replica B later receive a read \(r\), while A and B cannot communicate due to the partition.
Because of availability, B must return a result for \(r\), but because of consistency, the result must reflect \(w\), which B hasn’t seen due to the partition.
Contradiction ⇒ Cannot guarantee both C and A during a partition.
Formalization
Assuming linearizability:
A system that ensures availability and consistency must be able to immediately propagate writes.
But in the presence of a partition, this is not possible. Therefore:
\[ \textbf{Partition Tolerance} \Rightarrow \neg \left(\textbf{Availability} \land \textbf{Consistency}\right) \]
graph TD
C[Consistency]
A[Availability]
P[Partition Tolerance]
C -- can't coexist w/ A during P --> A
A -- can't coexist w/ C during P --> C
P --> C
P --> A
Tradeoff Triangle
Real-World Interpretations
- Systems must pick two out of three under failure:
- CA (no P): systems that fail under partition (e.g., traditional RDBMS).
- CP (sacrifice A): e.g., HBase, MongoDB in strong-consistency mode.
- AP (sacrifice C): e.g., Dynamo, Cassandra, Riak.
- During normal operation (no partition), systems can offer all three.
Misinterpretations Clarified
- CAP constrains behavior only in failure conditions (a partitioned network); it does not restrict normal, partition-free operation.
- Linearizability is a stronger form of consistency than eventual consistency.
- Eventual consistency systems favor A + P, sacrificing strong C.
Significance
The paper gave a formal backbone to Brewer’s conjecture, shifting CAP from an engineering intuition to a rigorous theorem.
Inspired decades of system design, influencing:
- Dynamo, Cassandra, Riak (AP systems),
- Spanner, Zookeeper (CP systems),
- PACELC model (adds latency/consistency tradeoff).
CAP remains one of the most cited results in distributed systems theory.
Consistency Tradeoffs in Modern Distributed Database System Design (2012)
Daniel J. Abadi, IEEE Computer 45(2), 2012. [pdf]
This paper critiques the widespread (and often incorrect) interpretation of the CAP theorem and introduces a more comprehensive model for analyzing distributed database tradeoffs: PACELC.
PACELC states: If a network Partition occurs (P), then a system must trade off between Availability (A) and Consistency (C); Else (E), when the system is running normally, it must trade off between Latency (L) and Consistency (C).
The model captures the reality that tradeoffs are present not only during failure but also in the normal, partition-free operation of distributed systems.
From CAP to PACELC
- CAP Theorem (Gilbert & Lynch, 2002) only applies during network partitions.
- But real systems face another critical design axis: latency vs. consistency in the absence of partitions.
- PACELC addresses both situations:
- P → A or C (during failure)
- E → L or C (during normal operation)
Thus:
\[ \textbf{PACELC} = \text{If P then A or C, Else L or C} \]
PACELC Design Taxonomy
Distributed databases can be classified using PACELC, where the design of a given system prioritizes, under a partition, Availability or Consistency, or, absent a partition, latency or consistency.
System | Under P | Under \(\neg\text{P}\) |
---|---|---|
Dynamo | A over C | L over C |
Cassandra | A over C | L over C |
BigTable | C over A | C over L |
Spanner | C over A | C over L |
PNUTS | C over A | L over C |
MongoDB | A over C | C over L |
HBase | C over A | C over L |
Tradeoff Formalization
Let:
- \(L\): latency (time to complete an operation)
- \(C\): consistency (e.g. linearizability)
- \(A\): availability (response under partition)
Under PACELC, a system is evaluated not only by its fault response, but by its steady-state behavior:
In the absence of partitions, a system can choose: \(L \leftrightarrow C\).
This shifts attention from just “what fails under partition” to “how fast and correct the system is when healthy.”
Latency/Consistency Tension
Example:
- Read-after-write consistency may require coordination across regions (adds latency).
- Geo-replicated systems may offer eventual consistency to avoid wide-area coordination latency.
Latency-Consistency Conflict:
- Immediate responses require giving up global knowledge (weaker consistency).
- Strong consistency requires synchronous coordination (increased latency).
graph TD
Start[System Operating] --> P{Partition Present?}
P -->|Yes| PC[Tradeoff: Availability vs Consistency]
P -->|No| EC[Tradeoff: Latency vs Consistency]
PACELC Logic Flow
Significance
- Extends CAP to reflect real-world tradeoffs faced even in healthy networks.
- Bridges theory and practice in database design.
- Widely adopted to classify NoSQL systems and cloud-native databases.
- Encourages evaluation of systems along both:
- Fault tolerance dimension (P)
- Performance-consistency dimension (E)
Comparison with CAP
Model | Context | Tradeoff |
---|---|---|
CAP | Failure (partition) | Availability vs Consistency |
PACELC | Normal + Partition | Latency vs Consistency + CAP |
CAP explains failure-time design. PACELC explains total behavior over time.
How to Beat the CAP Theorem (2013)
Nathan Marz, Thoughts from the Red Planet (blog), 2013. [pdf]
Nathan Marz introduces a new systems design perspective to address the limitations imposed by the CAP theorem. Instead of avoiding tradeoffs, his proposal isolates and manages them through a principled approach that emphasizes immutability, append-only data, and batch + streaming pipelines.
The central claim: We can “beat” the CAP theorem not by violating it, but by shifting the complexity it causes into system design patterns where it’s more manageable.
This gives rise to the Lambda Architecture, a system architecture that combines batch and real-time computation to ensure correctness, fault tolerance, and low latency.
The Lambda Architecture
The architecture is split into three layers:
- Batch Layer
- Stores the immutable master dataset
- Periodically recomputes results from scratch
- Ensures accuracy and reprocessability
- Implemented via MapReduce, Spark, etc.
- Speed Layer (Real-Time)
- Provides low-latency updates
- Computes incremental views using recent data
- May be eventually consistent or approximate
- Implemented via Storm, Flink, Samza, etc.
- Serving Layer
- Merges batch and speed outputs
- Exposes queryable views (e.g., precomputed tables)
Key Design Principle: Immutable Data
Marz argues that much of the complexity in distributed systems arises from:
- Mutable state
- Incremental, in-place updates
Instead, systems should treat data as immutable and append-only, allowing:
- Auditable logs
- Deterministic recomputation
- Simplified recovery (no “undo” necessary)
This enables the use of pure functions:
\[ \text{Query} = f(\text{All Data}) \]
Where recomputation = recomputing the function \(f\) from scratch.
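A toy sketch of this principle, using a made-up page-view count as \(f\); none of the names below come from Marz's post.

```python
# "Query = f(all data)" with a batch/speed split (illustrative sketch).
from collections import Counter

def batch_view(master_dataset):
    # Batch layer: recompute the view from scratch over the full dataset.
    return Counter(event["page"] for event in master_dataset)

def speed_view(recent_events):
    # Speed layer: incrementally count only events the last batch run missed.
    return Counter(event["page"] for event in recent_events)

def query(page, batch, speed):
    # Serving layer: merge the precomputed batch view with the realtime delta.
    return batch.get(page, 0) + speed.get(page, 0)

master = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
recent = [{"page": "/home"}]                   # arrived after the last batch run
print(query("/home", batch_view(master), speed_view(recent)))  # -> 3
```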
Data and Time
Two key properties of “data” in this architecture:
- Time-indexed: Every record exists at a moment in time.
- Immutable: Once recorded, never altered.
Hence, “updates” are new records, and “deletes” are logical tombstones.
graph TD
RawData[Immutable Raw Data] --> Batch[Batch Layer]
RawData --> Stream[Speed Layer]
Batch --> Merge[Merged Views]
Stream --> Merge
Merge --> Client[Serving Layer / Queries]
Lambda Architecture Flow
Beating the CAP Theorem?
CAP states: During a partition, you must trade off between C and A.
Marz acknowledges this, but argues that the Lambda Architecture sidesteps the pain by:
- Separating correctness (batch layer) from availability (speed layer).
- Embracing eventual consistency in streaming, correctness in batch.
- Designing for human fault-tolerance—reducing likelihood of developer mistakes via immutable data flow.
Practical Consequences
- Reprocessing is easy: re-run the batch job.
- Debugging is easier: full lineage via immutable logs.
- Fault isolation: real-time bugs don’t corrupt historical truth.
Significance
- Inspired widespread adoption of Lambda Architecture in real-time analytics systems.
- Precursor to modern Kappa Architecture (streaming-only).
- Shifted emphasis from correctness-in-place to recomputability and lineage.
- Spark, Flink, Kafka Streams, and others echo these principles.
Criticism and Evolution
- Requires code duplication (batch + streaming logic).
- Complex to operate and maintain two processing paths.
- The Kappa Architecture (Jay Kreps) was proposed as a simplification: unify batch and stream processing in a single stream-processing path.
Questioning the Lambda Architecture (2014)
Jay Kreps, O’Reilly Radar (blog), 2014. [pdf]
Jay Kreps, co-creator of Apache Kafka, critiques the Lambda Architecture—an influential design pattern combining batch and stream processing—by highlighting its operational complexity, code duplication, and unnecessary dual-path logic.
He argues that its goals (fault-tolerance, correctness, reprocessability) can be achieved more cleanly via a unified streaming architecture built on append-only logs, specifically through systems like Kafka and Samza.
“The Lambda Architecture deserves credit for identifying key problems… but running two systems in parallel, with duplicate logic, is just too complex.”
Lambda Architecture Recap
Marz’s Lambda model separates concerns:
- Batch Layer: Immutable historical truth
- Speed Layer: Real-time low-latency computation
- Serving Layer: Merges batch + stream views
Kreps praises:
- Emphasis on immutability
- Attention to reprocessing
- Support for debuggability via raw data retention
But notes that these goals are achievable with a single system designed properly.
Criticism of Lambda
- Code Duplication
- Business logic must be implemented twice (batch + stream).
- Hard to test for correctness equivalence.
- Operational Overhead
- Running and maintaining two stacks is difficult.
- Synchronizing state and logic becomes fragile.
- Stream Limitations Overstated
- Many claim streaming is inherently less powerful.
- Kreps disagrees: stream processors can now offer exactly-once semantics, state management, and windowing.
- Query-Time Merging Is Brittle
- Combining batch and speed results at query time introduces latency, complexity, and inconsistency.
Kreps’s Alternative: Unified Streaming
- Use a log-based architecture (e.g., Apache Kafka) as the system backbone.
- Treat all data as immutable events.
- Build views incrementally via stream processing (e.g., Samza, Flink, Kafka Streams).
- If a bug occurs, reprocess the entire log to rebuild the derived state.
This reflects the Kappa Architecture:
- One path (streaming) for all processing.
- Reprocessing via replaying the log.
graph TB
subgraph Kappa Architecture
Raw2["Raw Data (Append-Only Log)"] --> StreamOnly
StreamOnly --> Output2
end
subgraph Lambda Architecture
Raw[Raw Data] --> Batch
Raw --> Stream
Batch --> Merge
Stream --> Merge
Merge --> Output1
end
Lambda vs. Kappa architectures
Key Principle: Logs as the Central Abstraction
Kreps’s philosophy centers on:
- Immutable, replayable logs
- Streaming as default processing model
- Replay = Reprocessing = Debuggability
\[\text{Views} = f(\text{Event Log}) \]
This removes the need for a separate batch system to correct errors.
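A minimal sketch of this idea, using a plain Python list in place of a real log and an illustrative fold function rather than any Kafka or Samza API.

```python
# Derived views are a pure function of the append-only log: rebuilding state
# means replaying the log from the beginning through the same fold function.
def replay(log, fold, initial_state):
    state = initial_state
    for event in log:                        # same code path for live and replay
        state = fold(state, event)
    return state

def fold_balance(balances, event):
    # Example fold: maintain account balances from signed amounts.
    account, delta = event
    balances[account] = balances.get(account, 0) + delta
    return balances

log = [("alice", +100), ("bob", +50), ("alice", -30)]
print(replay(log, fold_balance, {}))         # {'alice': 70, 'bob': 50}
# Fixing a bug in fold_balance means redeploying it and replaying the log,
# rather than maintaining a separate batch pipeline.
```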
Significance
- Shifted the industry’s mindset toward stream-first architecture.
- Influenced development of systems like:
- Apache Kafka
- Kafka Streams
- Flink
- Materialize
- Popularized the Kappa Architecture, now widely used in modern data platforms.
Kreps’s critique helped refine Marz’s vision into more operationally elegant patterns for real-time systems.
In Search of an Understandable Consensus Algorithm (2014)
Diego Ongaro and John Ousterhout, USENIX Annual Technical Conference, 2014. [pdf]
This paper introduces Raft, a consensus algorithm designed to be more understandable than Paxos while providing equivalent fault-tolerance and correctness guarantees. Raft simplifies distributed consensus by decomposing it into three relatively independent subproblems:
- Leader Election
- Log Replication
- Safety
Raft was developed in response to the well-known difficulty of implementing Paxos correctly and intuitively. The authors validate Raft’s learnability through a controlled user study, demonstrating that students understand Raft significantly better than Paxos.
Problem Domain: Replicated State Machines
Raft solves the problem of maintaining a consistent replicated log across nodes in a distributed system.
Key requirements:
- Safety: No two nodes decide differently.
- Availability: Continue processing if a majority is alive.
- Efficiency: Make progress with a majority and a stable leader.
Raft Architecture
Raft separates consensus into three subcomponents:
- Leader Election
- Each term begins with an election.
- Nodes start as followers, become candidates if no heartbeats are received, and may become leaders if they win a majority.
- Randomized election timeouts prevent split votes.
- Log Replication
- Leader appends entries to its log and sends them to followers.
- Followers accept entries if they match the leader’s log prefix.
- Once a log entry is stored on a majority and committed, it is executed.
- Safety Rules
- Only entries from the current term may be committed via the leader.
- Log matching: entries are only accepted if preceding entries match (index and term).
- Elections can only be won by candidates whose logs are at least as up-to-date as those of a majority of the cluster.
stateDiagram-v2
[*] --> Follower
Follower --> Candidate : Election timeout
Candidate --> Leader : Wins majority vote
Candidate --> Follower : Receives valid heartbeat
Leader --> Follower : Steps down on higher term
Raft Node State Transitions
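The transitions above can be sketched as follows. The sketch omits networking, log replication, and persistence; the randomized 150-300 ms timeout range mirrors the paper's example values, while the class and method names are illustrative.

```python
# Follower/candidate/leader transitions with randomized election timeouts
# (sketch only: no RPCs, no log handling, no persistence).
import random

class RaftNode:
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.state = "follower"
        self.term = 0
        self.votes = 0
        self.reset_election_timeout()

    def reset_election_timeout(self):
        # Randomization (here 150-300 ms) makes split votes unlikely.
        self.election_timeout = random.uniform(0.150, 0.300)

    def on_election_timeout(self):
        # No heartbeat arrived in time: start a new election as a candidate.
        self.state = "candidate"
        self.term += 1
        self.votes = 1                     # vote for itself
        self.reset_election_timeout()      # then request votes from peers

    def on_vote_granted(self):
        self.votes += 1
        if self.state == "candidate" and self.votes > self.cluster_size // 2:
            self.state = "leader"          # majority reached

    def on_heartbeat(self, leader_term):
        # A valid heartbeat (AppendEntries) from an equal or higher term
        # forces the node back to follower and resets its timer.
        if leader_term >= self.term:
            self.term = leader_term
            self.state = "follower"
            self.reset_election_timeout()
```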
Cluster Membership Changes
Raft introduces a joint consensus mechanism:
- Old and new configurations must overlap in quorum.
- Transition proceeds via a two-phase configuration change.
This allows safe reconfiguration while continuing to make progress.
Formal Properties
Raft ensures:
- Leader Completeness: If a log entry is committed in a term, all future leaders have that entry.
- State Machine Safety: No two machines apply different commands at the same log index.
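The Log Matching property rests on the consistency check a follower performs before appending entries; a minimal sketch is shown below, with an assumed entry shape and 0-based indices for brevity.

```python
# AppendEntries consistency check (sketch): a follower accepts new entries
# only if its log already holds an entry with prev_term at prev_index.
def append_entries(log, prev_index, prev_term, entries):
    """log: list of {"term": int, "cmd": ...}; indices are 0-based here."""
    if prev_index >= 0 and (prev_index >= len(log)
                            or log[prev_index]["term"] != prev_term):
        return False                          # mismatch: leader retries earlier
    for i, entry in enumerate(entries):
        idx = prev_index + 1 + i
        if idx < len(log) and log[idx]["term"] != entry["term"]:
            del log[idx:]                     # delete the conflicting suffix
        if idx >= len(log):
            log.append(entry)
    return True

follower_log = [{"term": 1, "cmd": "a"}, {"term": 1, "cmd": "b"}]
ok = append_entries(follower_log, 1, 1, [{"term": 2, "cmd": "c"}])
print(ok, [e["cmd"] for e in follower_log])   # True ['a', 'b', 'c']
```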
Comparison with Paxos
Feature | Paxos | Raft |
---|---|---|
Understandability | Low | High |
Leader-based | Optional | Core to design |
Log replication | Not fully specified | Explicitly integrated |
Membership change | Informal & Complex | Joint consensus mechanism |
Evaluation
- User study (43 students) found:
- Raft was more easily learned than Paxos.
- Students answered more questions correctly about Raft than Paxos.
Significance
- Raft has become the de facto standard for building replicated systems with strong consistency.
- Used in: etcd, Consul, RethinkDB, TiKV, and many others.
- Sparked wide interest in formalizing distributed protocols in a human-accessible way.
Raft’s success illustrates that understandability is not in conflict with rigor or performance.
Discussion
This set of writings collectively defines the core intellectual scaffolding of distributed systems. Lamport’s logical clocks and Paxos, the FLP impossibility, the Byzantine consensus bound, and CAP/PACELC tradeoffs are canonical knowledge. Later works like Raft and critiques of Lambda architecture represent efforts to tame the complexity these theories introduced.
Temporal Ordering and Causality
Lamport’s 1978 paper introduced a conceptual breakthrough: time in distributed systems can be modeled without synchronized clocks by defining a partial ordering of events via the “happened-before” relation. This enabled a formal treatment of causality and laid the groundwork for reasoning about system-wide consistency despite the absence of a global clock. Logical clocks and vector clocks continue to be used to infer causal structure in messaging and version histories.
Consensus: From Impossibility to Implementation
The FLP impossibility result (1985) demonstrated that no deterministic consensus protocol can guarantee termination in a purely asynchronous system with even a single crash failure. This result was both a barrier and a motivator. It forced system designers to either relax assumptions (e.g., allow randomization or partial synchrony) or redefine liveness expectations.
Paxos (1998), while notoriously difficult to understand, provided a workable and rigorous solution to consensus under crash failures and asynchronous assumptions. Raft (2014) offered a simplification by decomposing consensus into more intuitive subproblems (election, log replication, safety), increasing accessibility and correctness in real-world implementations.
The CAP Theorem and Its Refinements
Gilbert and Lynch’s formalization of Brewer’s conjecture (2002) articulated a hard limit: in the presence of partitions, systems cannot simultaneously guarantee availability and consistency. This result reframed systems design around strategic tradeoffs.
PACELC (2010) expanded this model, recognizing that latency-consistency tradeoffs also apply during normal operations. Systems thus face choices not just in failure conditions, but in everyday performance tuning.
Immutability as a Systems Pattern
Nathan Marz’s Lambda Architecture (2013) proposed a conceptual workaround to CAP by decoupling correctness (via batch recomputation) from responsiveness (via real-time computation). The design emphasized immutability, append-only logs, and recomputability—anticipating trends now foundational in event-sourced architectures.
Jay Kreps’ critique (2014) acknowledged the insights of Lambda but argued for a more streamlined approach—Kappa Architecture—where a unified, replayable event log (e.g., Kafka) serves both real-time and recovery purposes. This eliminated code duplication and operational burden, favoring simplicity and auditability.