Mastering Kafka Topic Configuration: A Comprehensive Guide
A practical guide to Kafka topic partitions, replication, retention, compaction, and safe configuration changes.
Kafka topic configuration decides how your data is stored, copied, expired, compacted, and consumed. You can run Kafka with defaults for a while, especially in a development cluster, but production topics need more care. A bad partition count can trap a busy workload. Weak replication can turn a broker failure into data loss. Loose retention can fill disks. Compaction can surprise you if keys are missing or inconsistent.
The useful way to approach Kafka topic configuration is not to memorize every setting. Start with the questions a real system asks: how much parallelism do I need, how long must data remain available, how much data can I afford to store, what happens during a broker failure, and do consumers need a full event history or only the latest value per key?
A topic is split into partitions. Each partition is an ordered log. Kafka preserves order within a partition, not across the whole topic. If all events for a customer must be processed in order, use a stable key such as customer_id so those events land in the same partition. If you key randomly, you may get better distribution but lose per-entity ordering.
Partition count is one of the first choices people regret. More partitions allow more consumer parallelism because, within one consumer group, each partition is consumed by only one group member at a time. If a topic has six partitions, a consumer group can actively use up to six consumers for that topic. A seventh consumer adds no throughput for that topic; it sits idle unless it is also assigned partitions from other topics the group subscribes to.
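The cap on parallelism is easy to see in a small simulation. This is a minimal sketch that assumes a plain round-robin assignment; Kafka's real assignors differ in detail but follow the same one-owner-per-partition rule. `assign_partitions` is a hypothetical helper, not a Kafka API.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions across group members; each partition gets exactly one owner."""
    assignment: dict[str, list[int]] = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Seven group members competing for a six-partition topic.
members = [f"consumer-{i}" for i in range(1, 8)]
assignment = assign_partitions(6, members)

# Members that received no partition do no work for this topic.
idle = [consumer for consumer, partitions in assignment.items() if not partitions]
print("idle members:", idle)
```

Six members own one partition each and the seventh receives nothing, which is why adding consumers beyond the partition count does not help.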
More partitions also cost something. They increase metadata, open files, replication work, leader election work, and recovery time after broker failures. Very high partition counts can make cluster operations slower even if each partition has modest traffic. There is no universal best number. A small internal topic may be fine with three partitions. A busy event stream may need dozens. A very large Kafka installation may use far more, but that should come from measured throughput and operational capacity, not habit.
Create a topic with explicit settings:
kafka-topics.sh --create --bootstrap-server broker1:9092 --topic user-events.v1 --partitions 12 --replication-factor 3 --config min.insync.replicas=2
The topic name should also carry some intent. Names like events or data become useless once the cluster grows. user-events.v1, billing-invoices.v1, or inventory-adjustments.v1 tells future operators what the stream is and gives you room for a breaking schema change later.
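A naming rule only helps if something checks it. The sketch below assumes a hypothetical house convention (lowercase words joined by hyphens, followed by `.v<N>`); the pattern is an illustration, not a Kafka requirement. The 249-character limit is Kafka's own cap on topic name length.

```python
import re

# Hypothetical convention: lowercase words joined by hyphens, then ".v<N>".
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*\.v[1-9][0-9]*$")
MAX_TOPIC_NAME_LENGTH = 249  # Kafka's limit on topic name length


def is_valid_topic_name(name: str) -> bool:
    """Return True when the name fits both Kafka's length limit and the convention."""
    return len(name) <= MAX_TOPIC_NAME_LENGTH and NAME_PATTERN.fullmatch(name) is not None


for candidate in ["user-events.v1", "billing-invoices.v1", "events", "data"]:
    print(candidate, is_valid_topic_name(candidate))
```

A check like this could run in whatever review or provisioning step creates topics, so vague names are caught before they reach the cluster.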
Replication factor controls how many copies Kafka keeps for each partition. In production, 3 is a common default because one broker can fail and two replicas still remain. It does not mean you can ignore producer settings. If producers use acks=1, Kafka acknowledges a record once the leader has written it, before followers have copied it, so a leader failure at the wrong moment can lose acknowledged data. For important topics, pair replication factor 3 with topic-level min.insync.replicas=2 and producer acks=all.
min.insync.replicas is often misunderstood. It does not create replicas. It sets how many in-sync replicas, the leader included, must be available for an acks=all write to succeed. With replication factor 3 and min.insync.replicas=2, the topic can tolerate one broker being unavailable. If only one in-sync replica remains, Kafka rejects acks=all writes with a not-enough-replicas error instead of accepting data with too few safe copies. Producers using acks=1 or acks=0 are not held to this check.
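The rule reduces to one comparison, which makes it easy to tabulate. This is a simplified model of the broker's decision for acks=all writes only; both function names are hypothetical.

```python
def acks_all_write_accepted(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """An acks=all write succeeds only while enough replicas are in sync."""
    return in_sync_replicas >= min_insync_replicas


def tolerated_replica_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """How many replicas can drop out before acks=all writes start failing."""
    return max(replication_factor - min_insync_replicas, 0)


# Replication factor 3 with min.insync.replicas=2, as the in-sync set shrinks.
for isr_size in (3, 2, 1):
    verdict = "accepted" if acks_all_write_accepted(isr_size, 2) else "rejected"
    print(f"ISR size {isr_size}: acks=all write {verdict}")

print("replicas that can be lost:", tolerated_replica_losses(3, 2))
```

Setting min.insync.replicas equal to the replication factor gives zero tolerance: any single replica falling behind blocks acks=all writes.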
Retention settings decide when Kafka can delete old log segments. Time-based retention is controlled by retention.ms at the topic level. Size-based retention is controlled by retention.bytes. Broker-level settings such as log.retention.ms and log.retention.bytes supply the cluster-wide defaults; the topic-level retention.ms and retention.bytes override them for a single topic.
For example, to retain a topic for seven days:
kafka-configs.sh --alter --bootstrap-server broker1:9092 --entity-type topics --entity-name user-events.v1 --add-config retention.ms=604800000
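Millisecond values like the one above are easy to mistype. A small helper, sketched here with the standard library only, derives them from a readable duration; `retention_ms` is a hypothetical name.

```python
from datetime import timedelta


def retention_ms(period: timedelta) -> int:
    """Convert a duration to the whole number of milliseconds Kafka expects."""
    return period // timedelta(milliseconds=1)


print(retention_ms(timedelta(days=7)))   # value used for seven days of retention
print(retention_ms(timedelta(days=14)))  # value for fourteen days
```

Generating the number from `timedelta(days=7)` keeps the intent visible in review, where a bare 604800000 is hard to verify by eye.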
To cap storage per partition, use retention.bytes:
kafka-configs.sh --alter --bootstrap-server broker1:9092 --entity-type topics --entity-name user-events.v1 --add-config retention.bytes=10737418240
Remember that retention.bytes applies per partition, not to the topic as a whole. The value is given in bytes, so 10737418240 is 10 GiB. A topic with twelve partitions and that setting can hold roughly 120 GiB before replication, and roughly 360 GiB with replication factor 3. This is the kind of detail that causes surprise disk alerts.
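The multiplication is worth writing down once so it can be rerun for any topic. This is a rough upper-bound sketch: it ignores indexes and the extra segment Kafka may hold before cleanup runs, so real usage can land somewhat above these figures. The function name is hypothetical.

```python
GIB = 1024 ** 3


def max_retained_bytes(partitions: int, retention_bytes: int, replication_factor: int) -> tuple[int, int]:
    """Return (bytes for one copy of the topic, bytes across all replicas)."""
    single_copy = partitions * retention_bytes
    return single_copy, single_copy * replication_factor


single_copy, all_replicas = max_retained_bytes(
    partitions=12, retention_bytes=10737418240, replication_factor=3
)
print(f"one copy: {single_copy / GIB:.0f} GiB, all replicas: {all_replicas / GIB:.0f} GiB")
```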
Kafka deletes data by log segment, not record by record. If you set a short retention period but large segments, deletion may not happen at the exact minute you expect. Segment settings such as segment.bytes and segment.ms influence when Kafka rolls to a new segment, and only closed segments are eligible for deletion or compaction. Smaller segments can make cleanup more responsive, but they add overhead.
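A toy model shows why old records can outlive the retention period. It follows the simplified rule described above, in which a closed segment becomes deletable once its newest record is older than the retention period; the `Segment` class and all timestamps are invented for illustration.

```python
from dataclasses import dataclass

MINUTE_MS = 60_000


@dataclass
class Segment:
    base_offset: int
    newest_timestamp_ms: int  # timestamp of the most recent record in the segment
    closed: bool              # False for the active segment still receiving writes


def deletable_segments(segments: list[Segment], now_ms: int, retention_ms: int) -> list[Segment]:
    """Closed segments whose newest record has aged past the retention period."""
    return [s for s in segments if s.closed and now_ms - s.newest_timestamp_ms > retention_ms]


now = 600 * MINUTE_MS
segments = [
    Segment(base_offset=0, newest_timestamp_ms=120 * MINUTE_MS, closed=True),
    # Closed 30 minutes ago; its oldest records may be hours old but it stays.
    Segment(base_offset=5000, newest_timestamp_ms=570 * MINUTE_MS, closed=True),
    Segment(base_offset=9000, newest_timestamp_ms=600 * MINUTE_MS, closed=False),
]

expired = deletable_segments(segments, now_ms=now, retention_ms=60 * MINUTE_MS)
print([s.base_offset for s in expired])
```

With one hour of retention, only the first segment goes. Every record in the second segment survives until the segment's newest record expires, which is the lag that smaller segments reduce.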
cleanup.policy decides what Kafka does with old data. The default is delete, which removes old segments based on retention. compact keeps the latest record for each key and eventually removes older records with the same key. You can also use delete,compact for topics that need compaction plus a retention window.
Compaction is useful for state-like streams: user profile updates, feature flag values, account settings, or database change events keyed by primary key. It is a poor fit for event history where every event matters. If you compact an audit log, older events for the same key may eventually disappear. That may be exactly wrong for compliance or debugging.
Compaction also depends on keys. Kafka rejects records with a null key on a compacted topic, and inconsistent keys keep the topic from behaving like a clean key-value changelog. If producers send user updates sometimes keyed by user_id and sometimes keyed by email, Kafka sees different keys. It cannot infer that they represent the same user, so it retains a latest value for each.
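The effect of inconsistent keys shows up clearly in a few lines. This sketch models only the end state of compaction (the last record per key survives); real compaction runs asynchronously in the background and also handles tombstones, which are left out here. The keys and values are invented.

```python
def compact(records: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the last record for each key, preserving the log order of survivors."""
    last_index = {key: index for index, (key, _value) in enumerate(records)}
    return [record for index, record in enumerate(records) if last_index[record[0]] == index]


# The same user, keyed once by user ID and once by email.
log = [
    ("42", "theme=light"),
    ("alice@example.com", "theme=dark"),
    ("42", "theme=blue"),
]

compacted = compact(log)
print(compacted)
```

The older record for key `42` is dropped, but the email-keyed record stays, so a consumer rebuilding state ends up with two conflicting entries for one user.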
Compression can be set by producers, and a topic can define compression.type to control broker behavior. Common values include producer, gzip, snappy, lz4, and zstd, depending on Kafka version. Many teams leave the topic at producer and standardize producer compression. lz4 and zstd are common choices, but the right answer depends on CPU budget, message shape, and network pressure.
You can inspect topic configuration like this:
kafka-configs.sh --describe --bootstrap-server broker1:9092 --entity-type topics --entity-name user-events.v1
And inspect partition placement like this:
kafka-topics.sh --describe --bootstrap-server broker1:9092 --topic user-events.v1
Use both commands. Topic configs tell you retention, compaction, and ISR rules. Topic description tells you leaders, replicas, and ISR state. A topic can have perfect config and still be unhealthy because replicas are out of sync.
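Checking ISR state by eye gets tedious on topics with many partitions. The sketch below scans describe-style text for partitions whose ISR is smaller than the replica list. The sample text is illustrative rather than captured from a real cluster, and the exact output layout varies between Kafka versions, so treat the pattern as an assumption to verify against your own output.

```python
import re

# Assumed line shape: "... Partition: N ... Replicas: a,b,c  Isr: a,b ..."
PARTITION_LINE = re.compile(r"Partition:\s*(\d+).*?Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)")


def under_replicated(describe_output: str) -> list[int]:
    """Return partition numbers whose in-sync set is smaller than the replica set."""
    flagged = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # summary lines and blanks carry no partition detail
        replicas = match.group(2).split(",")
        isr = [broker for broker in match.group(3).split(",") if broker]
        if len(isr) < len(replicas):
            flagged.append(int(match.group(1)))
    return flagged


sample = """\
Topic: user-events.v1  PartitionCount: 2  ReplicationFactor: 3  Configs: min.insync.replicas=2
  Topic: user-events.v1  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: user-events.v1  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3
"""

print(under_replicated(sample))
```

For routine checks, kafka-topics.sh also offers an `--under-replicated-partitions` filter on `--describe`, which avoids parsing altogether.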
Some changes are easy. Retention, compaction policy, min.insync.replicas, and several other topic configs can be altered dynamically. Some changes require more caution. You can increase partition count, but you cannot safely decrease it with a simple command. Increasing partitions also changes key distribution for future records because the partitioning calculation has more target partitions. Existing records stay where they are; new records for the same key may go to a different partition after the increase, depending on the partitioner. If strict per-key ordering across the change matters, plan carefully.
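The remapping effect can be measured before touching a real topic. This sketch uses CRC32 as a stand-in hash; Kafka's default partitioner uses murmur2, so the specific partitions will differ, but the modulo behaviour being illustrated is the same. The key names are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hash-mod partitioning. CRC32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer-{i}" for i in range(1000)]

# Keys whose target partition changes when the topic grows from 12 to 24 partitions.
moved = [key for key in keys if partition_for(key, 12) != partition_for(key, 24)]
print(f"{len(moved)} of {len(keys)} keys map to a different partition after 12 -> 24")
```

Any key that moves has its older records in one partition and its newer records in another, so a consumer can no longer rely on a single partition for that key's full order.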
Replication factor changes are operational work. Increasing replicas for an existing topic means Kafka must copy existing data to new brokers. That can be a lot of I/O. Use reassignment tooling, monitor progress, and throttle if needed. Do not start a large reassignment during peak traffic unless you already know the cluster has enough spare capacity.
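Reassignment plans are usually expressed as a JSON document listing the desired replicas for each partition. The sketch below builds one that adds a third replica to every partition, in the shape that kafka-reassign-partitions.sh accepts through `--reassignment-json-file`. The broker IDs and the current assignment are made up, and the placement rule is deliberately naive; a real plan should also weigh racks and disk usage.

```python
import json


def plan_extra_replica(topic: str, current: dict[int, list[int]], brokers: list[int]) -> dict:
    """Append one broker that does not yet host the partition to each replica list."""
    partitions = []
    for partition, replicas in sorted(current.items()):
        candidates = [broker for broker in brokers if broker not in replicas]
        if not candidates:
            raise ValueError(f"no spare broker for partition {partition}")
        # Rotate through the candidates so new replicas are spread out.
        new_broker = candidates[partition % len(candidates)]
        partitions.append({"topic": topic, "partition": partition, "replicas": replicas + [new_broker]})
    return {"version": 1, "partitions": partitions}


# Hypothetical current state: replication factor 2 on a three-broker cluster.
current_assignment = {0: [1, 2], 1: [2, 3], 2: [3, 1]}
plan = plan_extra_replica("user-events.v1", current_assignment, brokers=[1, 2, 3])
print(json.dumps(plan, indent=2))
```

Existing replicas keep their positions in each list, so the preferred leader does not change; only the copy to the new broker generates traffic.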
For a normal production event topic, a practical starting point might look like this:
kafka-topics.sh --create --bootstrap-server broker1:9092 --topic payments-authorized.v1 --partitions 24 --replication-factor 3 --config min.insync.replicas=2 --config retention.ms=1209600000 --config cleanup.policy=delete
That says: enough partitions for parallelism, three copies for availability, two in-sync replicas required for strong writes, fourteen days of retention, and no compaction because every payment authorization event matters.
For a state topic, the shape is different:
kafka-topics.sh --create --bootstrap-server broker1:9092 --topic user-preferences.v1 --partitions 12 --replication-factor 3 --config min.insync.replicas=2 --config cleanup.policy=compact
That topic should be keyed by user ID. Consumers rebuilding state can read the compacted log and eventually see the latest value for each user. They should not expect every historical preference change to remain forever.
The best topic configuration is boring to operate. It has enough partitions but not thousands without reason. It has replication that matches the value of the data. It has retention that matches recovery and compliance needs. It uses compaction only when keys are meaningful. It is described in code or documentation so another engineer can recreate it without guessing.
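One way to keep a topic "described in code" is a small spec object that renders the create command. This is a sketch, not an established tool: `TopicSpec` is a hypothetical class, and it emits only the kafka-topics.sh flags already used in this guide.

```python
import shlex
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TopicSpec:
    name: str
    partitions: int
    replication_factor: int
    configs: dict[str, str] = field(default_factory=dict)

    def create_command(self, bootstrap_server: str) -> str:
        """Render the kafka-topics.sh invocation that would create this topic."""
        args = [
            "kafka-topics.sh", "--create",
            "--bootstrap-server", bootstrap_server,
            "--topic", self.name,
            "--partitions", str(self.partitions),
            "--replication-factor", str(self.replication_factor),
        ]
        for key, value in sorted(self.configs.items()):
            args += ["--config", f"{key}={value}"]
        return shlex.join(args)


spec = TopicSpec(
    name="user-preferences.v1",
    partitions=12,
    replication_factor=3,
    configs={"min.insync.replicas": "2", "cleanup.policy": "compact"},
)
command = spec.create_command("broker1:9092")
print(command)
```

Keeping specs like this in version control gives every change a reviewer and a history, which is most of what "recreate it without guessing" requires.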
A useful review habit is to write down the consumer story before choosing topic settings. Who reads this topic? Do they need to replay from the beginning? How long would a full rebuild take? Can the source system republish old data? If a consumer is down for three days, should Kafka still have the missed records? Those answers drive retention more honestly than a default seven-day setting.
Consider a fraud detection consumer that reads payment events. If it is down for six hours, you almost certainly want it to catch up from Kafka. If it is down for thirty days, you may expect a separate backfill process from the payment database. That topic might need two weeks of retention, not forever. A security audit topic may have a different requirement, perhaps shipping to object storage for long-term retention while Kafka keeps only the hot replay window.
Message size also belongs in the topic conversation. Kafka can handle larger records when configured for them, but large messages affect producers, brokers, consumers, replication, and fetch memory. If teams start putting multi-megabyte JSON blobs or encoded files into a topic, do not only raise max.message.bytes and move on. Producers need a matching max.request.size, and every consumer and replica fetch has to buffer those records. Ask whether the payload belongs in object storage with a reference in Kafka. Kafka is usually best at moving events, not acting as a blob store.
Schema evolution is not a topic config setting, but it shapes topic design. A topic named with a version suffix, such as orders.v1, gives you an escape hatch when a breaking change is unavoidable. Compatible changes can stay in the same topic if consumers and producers follow a schema policy. Breaking changes should not be slipped into the same topic just because one team controls the producer. Kafka decouples systems, but only if the contract is respected.
Finally, document topic ownership. Every production topic should have an owning team, expected producers, expected consumers, retention reason, and data sensitivity notes. This sounds administrative until disk fills at 02:00 and nobody knows whether a topic can be shortened, deleted, compacted, or throttled. Good topic configuration is partly technical and partly operational memory.
A final check before publishing a topic is to run through a failure scenario. If one broker disappears, can producers still write? If a consumer group is down over the weekend, will retention cover the gap? If a producer sends bad data, can consumers skip, quarantine, or replay safely? If the topic grows twice as fast as expected, which limit protects the cluster: retention time, retention bytes, quotas, or an alert?
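The last question can be answered with arithmetic. The sketch below estimates steady-state size per partition under time-based retention and compares it with retention.bytes to see which limit takes effect first. The ingest rates are invented, and the model assumes a constant write rate; a negative retention.bytes is treated as "no size limit", matching Kafka's default of -1.

```python
def binding_limit(ingest_bytes_per_sec: int, retention_ms: int, retention_bytes: int) -> tuple[str, int]:
    """Return which retention setting bounds a partition and the resulting size in bytes."""
    time_bound_bytes = ingest_bytes_per_sec * retention_ms // 1000
    if retention_bytes < 0 or time_bound_bytes <= retention_bytes:
        return "retention.ms", time_bound_bytes
    return "retention.bytes", retention_bytes


SEVEN_DAYS_MS = 604800000
TEN_GIB = 10737418240

expected = binding_limit(10 * 1024, SEVEN_DAYS_MS, TEN_GIB)  # 10 KiB/s per partition
doubled = binding_limit(20 * 1024, SEVEN_DAYS_MS, TEN_GIB)   # growth at twice the plan
print(expected)
print(doubled)
```

At the planned rate the time limit applies and the size cap is never reached. At double the rate the size cap takes over, which protects the disk but silently shortens the replay window below seven days, so an alert on that condition is worth having.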
Quotas are worth mentioning because topic configuration alone does not protect a shared cluster from a noisy producer. Kafka supports client quotas that can limit produce and fetch rates. If several teams share one cluster, quotas can keep an accidental replay or runaway producer from overwhelming brokers. They should be paired with alerts so teams know they are being throttled instead of silently blaming Kafka.
Do not forget deletion policy. Some clusters disable topic deletion at the broker level (delete.topic.enable=false) to prevent accidents. That can be sensible, but it means abandoned topics must be handled through a controlled cleanup process. A topic inventory review every month or quarter can reclaim a surprising amount of disk, especially in development and staging clusters where experiments leave old topics behind.