Kafka Replication Configuration: Ensuring Data Durability and Availability

Kafka replication configuration is where a cluster stops being a pile of brokers and starts behaving like a system you can trust during failures. The settings are not complicated on their own: replication factor, in-sync replicas, producer acknowledgments, leader election, and rack placement. The tricky part is that they only make sense together.

A topic with three replicas can still lose acknowledged data if producers use weak acknowledgments. A producer using acks=all can still fail writes if min.insync.replicas is too strict for the number of brokers currently alive. A cluster spread across availability zones can still have a bad day if all replicas for a hot partition land in the same failure domain. Replication is not a single checkbox.

The way I like to think about Kafka replication is simple: for each partition, Kafka keeps several copies, chooses one copy to accept reads and writes, and keeps the other copies close enough that one of them can take over. Your job is to decide how many copies are enough, how many must be caught up before a write is considered successful, and whether the cluster should ever prefer availability over data safety.

A Kafka topic is split into partitions. Each partition has one leader replica and zero or more follower replicas. Producers write to the leader. Consumers normally read from the leader. Followers fetch records from the leader and keep their local logs aligned. If the leader's broker fails, Kafka elects a new leader from replicas that are considered safe candidates.

That safe candidate list is the ISR, short for in-sync replicas. A replica is in the ISR when it is keeping up with the leader closely enough, which Kafka judges by how long the follower has gone without catching up (the broker setting replica.lag.time.max.ms). If a follower stops fetching, falls behind for too long, or the broker disappears, Kafka removes it from the ISR. When it catches up, it can rejoin.

This detail matters because the ISR is what makes Kafka durability more than wishful thinking. With acks=all, the leader does not acknowledge a produce request until every replica currently in the ISR has the record. min.insync.replicas sets the floor on that set: if the ISR is smaller than this value, the leader rejects acks=all writes instead of accepting them. If the topic has replication factor 3 and min.insync.replicas=2, Kafka requires at least two in-sync replicas, counting the leader, before an acks=all write can succeed.

That combination is common in production because it gives you a practical balance. One broker can fail, and the topic can still accept strongly acknowledged writes. If a second broker fails before the first comes back, producers using acks=all should start seeing errors such as NotEnoughReplicas or NotEnoughReplicasAfterAppend. That is annoying during an incident, but it is usually the correct behavior. Kafka is refusing to pretend a write is durable when there are not enough safe copies.
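
The accept-or-reject rule described above can be sketched as a small model. This is a simplification for reasoning, not broker code: the function name and arguments are hypothetical, and it assumes that only acks=all is subject to the min.insync.replicas floor while acks=0 and acks=1 need nothing more than a live leader.

```python
def can_accept_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Simplified model of whether a partition leader accepts a produce request.

    isr_size counts the leader itself. Hypothetical helper, not a Kafka API.
    """
    if isr_size < 1:
        return False  # no leader is available at all
    if acks in ("all", "-1"):
        # acks=all is rejected when the ISR is below min.insync.replicas
        return isr_size >= min_insync_replicas
    return True  # acks=0 and acks=1 only need a leader


# Replication factor 3, min.insync.replicas=2, brokers failing one at a time.
for isr_size in (3, 2, 1):
    print(isr_size, can_accept_write("all", isr_size, 2), can_accept_write("1", isr_size, 2))
```

With an ISR of one, the acks=all producer is refused while the acks=1 producer keeps writing, which is exactly the gap between the two guarantees.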

Here is a typical production baseline for a cluster with three or more brokers, set in each broker's configuration:

default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

Those values do not make every workload safe automatically, but they give you a sane starting point. default.replication.factor=3 means new topics get three copies unless the topic creation command says otherwise. min.insync.replicas=2 means at least two replicas must be in sync for strong writes. unclean.leader.election.enable=false tells Kafka not to elect a stale replica as leader just to keep a partition writable.

Do not set the replication factor higher than your broker count. Kafka cannot place three replicas on three different brokers if only two brokers exist. In small development clusters, a replication factor of 1 is fine because convenience matters more than failure tolerance. In production, a replication factor of 1 means a single broker loss can make data unavailable and can permanently lose records stored only on that broker.

The producer side must match the topic side. For important data, use acks=all. Also enable idempotence unless you have a specific reason not to. In modern Kafka clients, idempotent producers are the normal choice for reducing duplicates caused by retries.

acks=all
enable.idempotence=true
retries=2147483647
max.in.flight.requests.per.connection=5

Do not copy the retry value blindly into every client without understanding your client version and delivery requirements. The important idea is that durable writes to Kafka usually need retries, idempotence, and acks=all together. If you set acks=1, the leader can acknowledge a record before followers have copied it. If that leader dies at the wrong time, an acknowledged record may disappear. That is acceptable for some telemetry streams. It is not acceptable for payments, audit trails, inventory movements, or anything a downstream team treats as a source of truth.
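
One way to keep those producer settings together is a small review-time check. The sketch below is a hypothetical helper, not part of any Kafka client: it takes the producer properties as a plain dict of strings and flags anything that breaks the acks=all, idempotence, and retries combination. It assumes the idempotent producer's usual constraints (retries greater than zero, at most five in-flight requests per connection) and deliberately treats a missing key as a problem instead of guessing at client defaults.

```python
from typing import Dict, List


def durability_problems(config: Dict[str, str]) -> List[str]:
    """Return human-readable problems with a producer config (hypothetical helper)."""
    problems = []
    if config.get("acks") not in ("all", "-1"):
        problems.append("acks should be 'all' for durable writes")
    if config.get("enable.idempotence", "").lower() != "true":
        problems.append("enable.idempotence should be true")
    if int(config.get("retries", "0")) < 1:
        problems.append("retries should be greater than 0")
    in_flight = int(config.get("max.in.flight.requests.per.connection", "0"))
    if not 1 <= in_flight <= 5:
        problems.append("max.in.flight.requests.per.connection should be set between 1 and 5")
    return problems


producer_config = {
    "acks": "all",
    "enable.idempotence": "true",
    "retries": "2147483647",
    "max.in.flight.requests.per.connection": "5",
}
print(durability_problems(producer_config))                   # []
print(durability_problems(dict(producer_config, acks="1")))   # one problem reported
```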

When you create a topic, set the replication choices deliberately instead of relying on whatever broker defaults happen to be present:

kafka-topics.sh --create \
  --bootstrap-server broker1:9092 \
  --topic orders.v1 \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

The partition count is separate from replication. Twelve partitions with replication factor three means thirty-six partition replicas in total. That has storage, network, file handle, and controller metadata costs. Replication improves durability, but it is not free.
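
The arithmetic is worth writing down when sizing a cluster. A minimal sketch, assuming replicas are spread evenly across brokers (real placement is rarely perfectly even) and using a hypothetical function name:

```python
def replica_footprint(partitions: int, replication_factor: int, broker_count: int) -> dict:
    """Rough replica-count arithmetic for one topic (hypothetical helper)."""
    if replication_factor > broker_count:
        raise ValueError("replication factor cannot exceed broker count")
    total = partitions * replication_factor
    return {
        "total_replicas": total,
        # Assumes an even spread, which is an idealization.
        "avg_replicas_per_broker": total / broker_count,
        # Every record is stored once per replica.
        "storage_multiplier": replication_factor,
    }


print(replica_footprint(partitions=12, replication_factor=3, broker_count=3))
# {'total_replicas': 36, 'avg_replicas_per_broker': 12.0, 'storage_multiplier': 3}
```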

For existing topics, changing min.insync.replicas is straightforward:

kafka-configs.sh --alter \
  --bootstrap-server broker1:9092 \
  --entity-type topics \
  --entity-name orders.v1 \
  --add-config min.insync.replicas=2

Changing the replication factor for an existing topic is not a simple config change. kafka-topics.sh --alter can add partitions, but it does not change the replication factor. The usual route is kafka-reassign-partitions.sh: you write a partition reassignment plan that lists the new replica set for each partition, then execute it. Newer releases run the tool against --bootstrap-server, while older clusters ran it against ZooKeeper. Decreasing replication is more sensitive because you are removing copies. Treat it as a planned operation, not a casual command typed during a noisy incident.

A reassignment should be throttled if the topic is large or the cluster is already busy. Replication catch-up reads old data from existing replicas and writes it to new ones. That can steal disk and network capacity from live producers and consumers. A safe runbook usually includes a maintenance window, before-and-after --describe output, reassignment throttles, and a rollback plan.
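
A reassignment plan is a JSON document listing the desired replica set per partition. The sketch below builds one for a replication factor increase. The helper names and the broker-picking rule are illustrative assumptions: it simply appends the first brokers not already holding the partition, which ignores rack placement and broker load, so treat the output as a draft to review.

```python
import json
from typing import Dict, List


def extend_replicas(current: List[int], brokers: List[int], target_rf: int) -> List[int]:
    """Add brokers to a replica list until it reaches target_rf (naive choice)."""
    if target_rf < len(current):
        raise ValueError("this sketch only increases replication")
    if target_rf > len(set(brokers) | set(current)):
        raise ValueError("not enough brokers for the target replication factor")
    replicas = list(current)  # existing replicas stay first, keeping the preferred leader
    for broker in brokers:
        if len(replicas) >= target_rf:
            break
        if broker not in replicas:
            replicas.append(broker)
    return replicas


def build_plan(topic: str, assignment: Dict[int, List[int]],
               brokers: List[int], target_rf: int) -> str:
    """Render a reassignment plan as JSON text."""
    partitions = [
        {"topic": topic, "partition": p, "replicas": extend_replicas(r, brokers, target_rf)}
        for p, r in sorted(assignment.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions}, indent=2)


print(build_plan("orders.v1", {0: [1, 2], 1: [2, 3]}, brokers=[1, 2, 3], target_rf=3))
```

The resulting file is what you would hand to kafka-reassign-partitions.sh with --reassignment-json-file and --execute, ideally with --throttle set, and then check with --verify.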

You can inspect a topic like this:

kafka-topics.sh --describe \
  --bootstrap-server broker1:9092 \
  --topic orders.v1

Look at three fields in the output: Leader, Replicas, and Isr. Replicas is the assigned set. Isr is the set currently caught up. If Replicas is 1,2,3 but Isr is 1,2, broker 3 is behind or unavailable for that partition. If many partitions show a missing broker in ISR, look at that broker's disk, network, process health, and logs. If only a few hot partitions are affected, the leader may be overloaded or the partition may have unusually high traffic.
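
Comparing Replicas against Isr by eye gets tedious on large topics. Here is a minimal sketch that does the comparison on captured --describe text. The sample lines are illustrative, not real cluster output, and the exact layout of --describe varies between Kafka versions (newer releases print extra fields), so the pattern may need adjusting.

```python
import re
from typing import Dict, List

# Matches the per-partition lines; the topic summary line is skipped.
PARTITION_LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def _broker_ids(csv: str) -> List[int]:
    return [int(item) for item in csv.split(",") if item]


def missing_from_isr(describe_output: str) -> Dict[int, List[int]]:
    """Map partition number to assigned brokers that are not in the ISR."""
    result = {}
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue
        replicas = _broker_ids(match.group(3))
        isr = _broker_ids(match.group(4))
        missing = [broker for broker in replicas if broker not in isr]
        if missing:
            result[int(match.group(1))] = missing
    return result


SAMPLE = """\
Topic: orders.v1  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
Topic: orders.v1  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,1
"""
print(missing_from_isr(SAMPLE))  # {1: [3]}
```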

Unclean leader election deserves special care. If all in-sync replicas for a partition are gone, Kafka has two choices. It can leave the partition unavailable until a safe replica returns, or it can elect an out-of-sync replica and risk losing records that were acknowledged on the old leader. Setting unclean.leader.election.enable=false chooses safety. Setting it to true chooses availability at the risk of data loss.
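
The two choices can be modeled in a few lines. This is a simplified sketch of the decision, not the controller's actual code: the function and argument names are hypothetical, and it assumes the new leader is the first live replica in assignment order, preferring ISR members.

```python
from typing import List, Optional, Set, Tuple


def elect_leader(replicas: List[int], isr: Set[int], alive: Set[int],
                 unclean_enabled: bool) -> Tuple[Optional[int], bool]:
    """Return (new_leader, was_unclean). Simplified model, not Kafka's implementation."""
    # Clean election: a live replica that was in sync with the old leader.
    for broker in replicas:
        if broker in isr and broker in alive:
            return broker, False
    # Unclean election: any live replica, even one that is missing records.
    if unclean_enabled:
        for broker in replicas:
            if broker in alive:
                return broker, True
    # Otherwise the partition stays offline until a safe replica returns.
    return None, False


# Last known ISR was {1}; broker 1 is down and only the stale broker 3 is alive.
print(elect_leader([1, 2, 3], isr={1}, alive={3}, unclean_enabled=False))  # (None, False)
print(elect_leader([1, 2, 3], isr={1}, alive={3}, unclean_enabled=True))   # (3, True)
```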

There are workloads where unclean election may be defensible: short-lived clickstream data, disposable metrics, or a pipeline where upstream systems can replay everything. For most business data, leave it disabled. Losing availability for a partition is painful, but silent data loss is worse because consumers may continue as if nothing happened.

Rack-aware replication helps with a different class of failure. If your brokers are split across racks, zones, or hosts with shared power/network paths, tell Kafka where each broker lives:

broker.rack=zone-a

Set the correct value on every broker. Kafka will try to spread replicas across racks so a single zone failure is less likely to remove every copy of a partition. This is not magic. You still need enough brokers in each zone, enough disk, and careful partition placement. But without broker.rack, Kafka has no way to know that two brokers share the same failure domain.
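
A quick audit can catch partitions whose replicas all sit in one failure domain. The sketch below assumes you already have the replica assignment and a broker-to-rack map in hand (both are hypothetical inputs here, for example parsed from --describe output and your inventory).

```python
from typing import Dict, List


def single_rack_partitions(assignment: Dict[int, List[int]],
                           broker_rack: Dict[int, str]) -> List[int]:
    """Return partitions whose replicas all live in the same rack."""
    flagged = []
    for partition, replicas in sorted(assignment.items()):
        racks = {broker_rack[broker] for broker in replicas}
        if len(racks) < 2:
            flagged.append(partition)
    return flagged


broker_rack = {1: "zone-a", 2: "zone-a", 3: "zone-b"}
assignment = {0: [1, 2], 1: [1, 3], 2: [2, 3]}
print(single_rack_partitions(assignment, broker_rack))  # [0]
```

Partition 0 would lose every copy if zone-a went down, even though it has two replicas on two different brokers.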

Monitor replication continuously. The most useful early warning signs are under-replicated partitions, offline partitions, ISR shrink events, and produce errors related to insufficient replicas. On the broker side these surface as the JMX metrics UnderReplicatedPartitions, OfflinePartitionsCount, and IsrShrinksPerSec. In Prometheus-based setups, teams commonly export those metrics and alert on them, then pair the alerts with broker disk, network, and JVM metrics.

A good incident question is: did ISR shrink because a broker died, because replication cannot keep up, or because the network is unreliable? The fix differs. A dead broker needs service recovery. A slow broker may need disk replacement, I/O investigation, or fewer partition leaders. A network problem may show up as repeated disconnects and fetcher lag even when CPU and disk look fine.

Rolling broker restarts are another place replication settings show their value. Restart one broker at a time. Wait for partitions to regain healthy ISR before restarting the next broker. If you restart brokers too quickly with min.insync.replicas=2, producers may begin failing because too few replicas are in sync. That failure is expected, but you can avoid it with patience and monitoring.
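
The restart-then-wait loop can be sketched with the cluster-specific parts injected as callables. Everything here is an assumption for illustration: restart_broker and under_replicated_count are hypothetical hooks you would wire to your own tooling (a service manager, a metrics query), and a real runbook should also confirm the broker actually went down and came back before trusting a zero reading.

```python
import time
from typing import Callable, Iterable


def rolling_restart(brokers: Iterable[int],
                    restart_broker: Callable[[int], None],
                    under_replicated_count: Callable[[], int],
                    poll_seconds: float = 10.0,
                    timeout_seconds: float = 1800.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> None:
    """Restart brokers one at a time, waiting for the ISR to recover in between."""
    for broker in brokers:
        restart_broker(broker)
        deadline = clock() + timeout_seconds
        # Do not touch the next broker until no partition is under-replicated.
        while under_replicated_count() > 0:
            if clock() > deadline:
                raise TimeoutError(f"ISR did not recover after restarting broker {broker}")
            sleep(poll_seconds)


# Dry run with fake hooks: the count drops to zero after two polls.
restarted = []
fake_counts = iter([2, 1, 0, 0])
rolling_restart([1, 2], restarted.append, lambda: next(fake_counts),
                poll_seconds=0, sleep=lambda seconds: None)
print(restarted)  # [1, 2]
```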

The practical checklist is short. Use replication factor three for most production topics. Use min.insync.replicas=2 with producer acks=all for important data. Keep unclean leader election disabled unless the data is explicitly disposable. Spread replicas across failure domains with rack awareness. Watch ISR health, not just broker uptime. And test your assumptions by restarting a broker in a controlled window before a real outage does it for you.

One detail that helps during reviews is separating durability from availability in plain language. Durability asks, "After Kafka says the write succeeded, how many failures can happen before that acknowledged record is at risk?" Availability asks, "Can producers and consumers still use the partition right now?" Strong settings sometimes reduce availability because Kafka will reject writes rather than accept weakly replicated data. That is not a failure of Kafka. That is Kafka honoring the contract you configured.
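
Those two questions reduce to simple arithmetic for acks=all producers. A rough sketch, assuming unclean leader election is disabled and using hypothetical names: write availability tolerates replication factor minus min.insync.replicas broker losses, while an acknowledged record is guaranteed to exist on at least min.insync.replicas brokers, so it survives one fewer loss than that.

```python
def failure_tolerance(replication_factor: int, min_insync_replicas: int) -> dict:
    """Broker losses tolerated for acks=all writes and for already-acknowledged records."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return {
        # Replicas that can be down while acks=all writes still succeed.
        "write_availability": replication_factor - min_insync_replicas,
        # Replica losses an acknowledged record is guaranteed to survive.
        "acked_durability": min_insync_replicas - 1,
    }


for min_isr in (1, 2, 3):
    print(min_isr, failure_tolerance(3, min_isr))
```

With replication factor three, raising min.insync.replicas from 1 to 3 moves the tolerance from (2, 0) to (0, 2): every step trades one broker of write availability for one broker of durability.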

For example, imagine a topic with replication factor three, min.insync.replicas=2, and producers using acks=all. Broker 1 is leader, brokers 2 and 3 are followers. If broker 3 goes down, the ISR becomes 1,2. Writes still succeed because two replicas are in sync. If broker 2 then goes down before broker 3 returns, the ISR shrinks to just broker 1. Writes with acks=all fail. Some teams first see this in production and ask why Kafka is down when the leader is still alive. The answer is that consumers can still read committed records from the leader, but the partition is not safe for strongly acknowledged writes.

You should also think about consumer recovery. Replication protects broker-side copies of records. It does not automatically protect consumer offsets from every workflow mistake. Consumer offsets are stored in Kafka too, usually in __consumer_offsets, so that internal topic also needs healthy replication. Its replication is set by the broker setting offsets.topic.replication.factor when the topic is first created, and transaction.state.log.replication.factor plays the same role for the transaction log. If the user topics are carefully configured but internal topics were created with weak replication in an early cluster build, failover behavior can still be worse than expected. Check internal topic replication as part of a production readiness review.

In multi-tenant clusters, not every topic deserves the same configuration. A throwaway metrics topic with high volume and low business value may use shorter retention and tolerate weaker guarantees. A billing topic should not. The mistake is letting accidental defaults decide that distinction. Put topic classes in writing: critical event streams, replayable telemetry, compacted state topics, temporary development topics. Then map each class to replication, ISR, retention, and producer settings.
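
Putting topic classes in writing can be as simple as a lookup table checked into the same repository as your topic definitions. The class names follow the list above, but every value in this sketch is a hypothetical placeholder meant to show the shape, not a recommendation for your workloads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicPolicy:
    replication_factor: int
    min_insync_replicas: int
    producer_acks: str
    unclean_leader_election: bool


# Illustrative values only; agree on the real numbers with the owning teams.
POLICIES = {
    "critical-events": TopicPolicy(3, 2, "all", False),
    "replayable-telemetry": TopicPolicy(3, 1, "1", True),
    "compacted-state": TopicPolicy(3, 2, "all", False),
    "temporary-dev": TopicPolicy(1, 1, "1", False),
}


def policy_for(topic_class: str) -> TopicPolicy:
    """Look up a policy, failing loudly so no topic falls back to accidental defaults."""
    try:
        return POLICIES[topic_class]
    except KeyError:
        raise ValueError(f"unknown topic class: {topic_class!r}") from None


print(policy_for("critical-events"))
```

Raising on an unknown class is the point of the design: a topic without a declared class should block creation, not inherit whatever the broker defaults happen to be.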

During incidents, avoid changing durability settings just to quiet errors unless everyone understands the tradeoff. Lowering min.insync.replicas from 2 to 1 may get producers moving, but it also means acknowledged writes can live on one broker. Enabling unclean leader election may restore partition availability, but stale replicas can lose records. Sometimes the business may choose that tradeoff. It should be a conscious incident decision, not a hidden operator shortcut.