Distributed Data-Intensive Systems. Understanding Replication. Part-2

This is a series of blog posts about distributed data-intensive systems, what they are, how they're designed and the problems, trade-offs and optimizations in their context.

An excellent read on a more general overview of distributed systems and data-intensive applications can be found in the book by Martin Kleppmann. This post derives a lot of theoretical underpinnings from it.

Here is the entire series so far:

Introduction
Read-after-write consistency
Monotic Reads
1. Kafka's approach to Monotonic Reads
Consistent Prefix Reads
Conclusion

Introduction

Replication enables data to be copied across multiple nodes in a distributed system. It is a fundamental technique for fault tolerance, availability, and scalability. In data-intensive systems like databases, replication is used to ensure that data is available even if some nodes fail. It also helps in scaling out read queries by distributing the load across multiple nodes.

In systems that use Leader-based replication, one of the replicas is designated as the leader, and all writes are sent to the leader. The leader then replicates the writes to the followers. This is a common pattern in databases like MySQL, PostgreSQL, and MongoDB. Reads however, can be served by any replica, not just the leader. This is called a read-scaling pattern.

The leader is responsible for ensuring that all writes are replicated to the followers. This is done by sending the writes to the followers and waiting for an acknowledgment from them. Once the leader receives an acknowledgment from a majority of the followers, it can acknowledge the write to the client. This is called a quorum based replication.
As the leader emits changes to the followers asynchronously, the followers may lag behind the leader. This is called replication lag. The lag can be due to network latency, disk I/O, or the processing time of the followers. The lag can be monitored and managed by the system. This is important because if the lag is too high, the followers may not have the latest data, and the system may not be able to provide strong consistency guarantees. The effect of followers lagging behind the leader to eventually converge to the same state is called eventual consistency.

However, this replication lag between the followers and the leader can cause inconsistencies in applications.

Read-after-write consistency

Consider a scenario where a client writes to the leader and then reads from a follower. If the follower has not yet received the write from the leader, the client may not see the write. This is called a read-after-write inconsistency.

In a system with leader based replication, we can use some techniques:

Read from Leader

In case the user has created, updated or deleted some data, the user can read the data from the leader. This ensures that the user reads the latest data along with the modifications made by them. In all the other cases, the user can be directed to read from the followers. However this can lead to undue load on the leader.

Read from Leader for a specific time

In case the user modifies something, they should read from the leader for a specific period of time. After this interval has passed, we can assume that the changes have been propagated to the followers/replicas and the user can read from the followers.

Logical Timestamps

The client can also store the timestamp of its recent modification. The client can then read from the follower and check if the timestamp of the modification is greater than the timestamp of the read. If it is, the client can read from the leader or the query can wait till the follower has caught up with the leader. However, this needs clock synchronization across all the nodes in the system.

Monotic Reads

Another anomaly can be if a user makes multiple reads from different replicas. The user may see different data in each read. This is called a monotonic read anomaly.
When the different replicas are at different states catching up with the leader, the user may see different result sets in each read.
For example, if a user sees a record x = 2 to the leader and then reads it from the follower, if the write has not yet been propagated to the follower, the user might not see the this recent value. A more severe case would be in cases like a social network where a post from a user is not visible to another user when he refreshes the web page and the request is routed to a random server.

Monotic reads is the guarantee that if a user reads a value x from a replica, they should not see a value y from another replica that is older than x. Jepsen defines monotonic reads as:

Monotonic reads ensures that if a process performs read r1, then r2, then r2 cannot observe a state prior to the writes which were reflected in r1; intuitively, reads cannot go backwards.

A simple way to ensure monotonic reads is to always read from the same replica. This is called sticky reads. However, this can lead to uneven load distribution across the replicas. If the replica fails, the queries would be routed to another replica, which may not have the latest data.

Some ways to handle monotonic reads are:

Versioning: Each write is assigned a version number and each read operation checks and records the version number it has read. If the version number of subsequent reads is less than the version number of the previous read, the read operation can be retried or redirected to the leader.
Timestamps: Each write is assigned a timestamp and each read operation checks and records the timestamp it has read. Each read operation considers the timestamp of the last read and ensures that the timestamp of the subsequent read is greater than or equal to the timestamp of the previous read.
Logical Clocks: Logical clocks are used to order events in a distributed system. Clients track the highest logical clock value they have seen and ensure that the subsequent reads have a higher logical clock value.

Kafka's approach to Monotonic Reads

Kafka is a distributed streaming system that uses leader-based replication. It employs various techniques to ensure monotonic reads:

Partition Leadership and Consumer offset tracking

Each partition has a single leader broker
Each consumer tracks the offset of the messages it has read from the partition
Offsets are monotonically increasing and are used to track the messages that have been read by the consumer

Consumers of a single partition will always see messages in an increasing order if they commit the offsets they've processed correctly. Since each consumer consumes from a single partition, they will always see messages in the order they were written.

Consumer Group Coordination

Kafka uses a consumer group to coordinate the consumption of messages across multiple consumers
One consumer consumes from one partition at a time
Ordering is only guaranteed within a partition, not across partitions
Partition assignments are sticky unless a rebalance is triggered

Consistent Prefix Reads

Another anamoly with replication lags happens when there is a causal dependency between operations. That is, the operations make sense only in the order in which they were performed.
Consistent prefix consistency guarantees that read operations see a consistent prefix of the operations that have been performed. That means, if operations occur in the order A, B, C, then a read operation should see the operations in the order A, B, C or A, B but not A, C.

In a system like Kafka, consistent prefix reads are achieved primary through:

Ordering within Partitions

Messages within a partition are ordered
Producers must also make sure that related writes are always written to the same partition by using a consistent partitioning strategy

Replication and In-sync replicas

Topics are replicated across brokers
Each partition has a leader and multiple followers but consumers always consume from the leader for that partition

In distributed databases like CockroachDB, consistent prefix reads are achieved through serializable transactions, MVCC (multi-version concurrency control), and Raft consensus protocol.

Conclusion

Replication is a fundamental technique in distributed systems that enables fault tolerance, availability, and scalability. However, it can lead to inconsistencies in applications. Techniques like read from leader, logical timestamps, versioning, and logical clocks can be used to handle read-after-write inconsistencies and monotonic reads. Consistent prefix reads can be achieved through ordering within partitions and replication across brokers.

Do leave a like if you liked this post and make sure to stay tuned for upcoming posts on distributed data intensive systems.

Understanding Replication in Distributed Data Systems. Part-2