Where Kafka Fits, Where It Does Not
Kafka is often called a "queue," but it is more precisely a distributed commit log. Its strengths lie beyond queueing; at the same time, it is overkill for a simple work queue.
1. About Kafka
Kafka is a distributed messaging and log system that began at LinkedIn. It started as an internal project in 2010, was incubated by the Apache Software Foundation in 2011, and became a top-level project in 2012. The 1.0 release came in 2017.
| Event | Year / version |
|---|---|
| Internal development begins (LinkedIn) | 2010 |
| Apache incubation | 2011 |
| Apache top-level project | 2012 |
| Kafka Streams introduced | 0.10 (2016) |
| Exactly-once semantics | 0.11 (2017) |
| 1.0 GA | 2017-11 |
| KRaft (ZooKeeper-free mode) | 3.3 (2022) |
| Non-ZooKeeper as default option | 3.5+ |
The design intent from the start was high throughput, retention that scales by adding disk, and the ability to reprocess. Some describe Kafka not as a generalization of the queue but as the discovery of the distributed log as an abstraction.
2. Topic, partition, consumer group
- Topic — a logical channel for messages.
- Partition — the unit of parallelism and distribution within a topic. Order is guaranteed only within a partition.
- Message — a record with key, value, and headers; the broker assigns an offset on append. The partition is commonly chosen by hashing the key.
- Offset — the position within a partition. Consumers record their progress.
- Consumer group — consumers in the same group split partitions. Within a group, a partition is assigned to only one consumer.
Thanks to this model, different groups can read the same topic at their own pace. Unlike a queue, where a pulled message is gone, Kafka keeps messages on disk until retention expires.
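The model above can be sketched without a broker. This is a toy Python sketch: plain `hash() % N` stands in for Kafka's murmur2-based partitioner, and dictionaries stand in for per-partition logs and committed offsets.

```python
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Kafka's default partitioner hashes the key (murmur2); hash() % N is a stand-in.
    return hash(key) % NUM_PARTITIONS

# The "topic": one append-only log per partition.
topic = defaultdict(list)
for key, value in [("user-1", "created"), ("user-2", "created"), ("user-1", "updated")]:
    topic[partition_for(key)].append((key, value))

# Consumer groups track offsets independently; reading never deletes a record.
offsets = {"analytics": defaultdict(int), "billing": defaultdict(int)}

def poll(group: str, p: int):
    off = offsets[group][p]
    if off >= len(topic[p]):
        return None
    offsets[group][p] += 1
    return topic[p][off]

# Same-key records stay in order within their partition.
p = partition_for("user-1")
assert [v for k, v in topic[p] if k == "user-1"] == ["created", "updated"]

# "analytics" draining partition p leaves "billing" untouched.
while poll("analytics", p):
    pass
assert offsets["billing"][p] == 0
```

The last two assertions are the whole point: order holds per key within a partition, and each group's progress is independent of the others.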
3. Retention policies
- Time-based (retention.ms) — for example, 7 days.
- Size-based (retention.bytes) — for example, 100 GB.
- Compaction (cleanup.policy=compact) — keep only the latest value per key; used for key-value snapshot topics.
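Compaction can be sketched in a few lines of Python. This is a minimal, eager version; real compaction is lazy, runs per log segment, and also handles tombstones (null values that delete a key).

```python
def compact(log):
    """Keep only the last record per key, preserving the order of survivors."""
    last_offset = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset          # later records win
    keep = set(last_offset.values())
    return [rec for off, rec in enumerate(log) if off in keep]

log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"), ("user-1", "v3")]
assert compact(log) == [("user-2", "v1"), ("user-1", "v3")]
```

After compaction the topic reads like a key-value snapshot: one latest value per key.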
4. Delivery guarantees
| Guarantee | Configuration |
|---|---|
| at-most-once | acks=0 (producer does not wait for acknowledgment) plus consumer auto-commit before processing. Loss possible. |
| at-least-once | the common default. acks=all plus manual commit after processing. Duplicates possible. |
| exactly-once | idempotent, transactional producer plus consumer isolation.level=read_committed. Holds only for Kafka-to-Kafka pipelines. When writing to external systems, idempotent consumers are still recommended. |
acks (producer), enable.idempotence, and isolation.level (consumer) are the core settings.
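At-least-once delivery means the consumer must tolerate redelivered duplicates. A minimal sketch of an idempotent consumer that deduplicates on a message id; in a real system the processed-id set would be persisted, ideally in the same transaction as the side effect.

```python
processed_ids = set()  # stand-in for durable dedup state
applied = []           # stand-in for the side effect (DB write, charge, email)

def handle(message) -> bool:
    """Apply the message's effect exactly once; return False for duplicates."""
    msg_id, payload = message
    if msg_id in processed_ids:   # redelivery after a crash or rebalance
        return False
    applied.append(payload)
    processed_ids.add(msg_id)     # record only after the effect is applied
    return True

assert handle((1, "charge $5")) is True
assert handle((1, "charge $5")) is False   # duplicate is ignored
assert applied == ["charge $5"]
```

The same pattern is why the exactly-once row above still recommends idempotent consumers for external systems: Kafka's transactions do not reach into your database.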
5. Storage, replication, KRaft
Each partition is replicated across one leader and several followers; replication.factor is typically 3. Replicas in the ISR (In-Sync Replicas) set are caught up with the leader. On leader failure, one replica from the ISR becomes the new leader.
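The failover rule can be sketched as: only in-sync replicas are eligible to take over, which is what keeps acknowledged writes from being lost. Broker names here are illustrative.

```python
def elect_new_leader(failed: str, isr: list) -> str:
    """Promote an in-sync replica; out-of-sync replicas are never eligible."""
    candidates = [r for r in isr if r != failed]
    if not candidates:
        # No in-sync replica left: the partition is unavailable
        # (unless unclean leader election is enabled, risking data loss).
        raise RuntimeError("no in-sync replica available")
    return candidates[0]

replicas = ["broker-1", "broker-2", "broker-3"]  # replication.factor = 3
isr = ["broker-1", "broker-2"]                   # broker-3 has fallen behind
assert elect_new_leader("broker-1", isr) == "broker-2"
```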
Metadata management long relied on ZooKeeper. Since 2022, KRaft (built on the Raft consensus algorithm) has let Kafka run on its own nodes alone, with no ZooKeeper; many operators report a smaller operational surface.
6. Where Kafka is strong
- Event sourcing and CDC — preservation and replay of every change.
- Places where multiple consumers read the same stream at different speeds — publish once, consume by many groups.
- High-throughput log collection — hundreds of thousands of messages per second.
- Entry to real-time analytics — Flink, Spark Streaming, Kafka Streams.
- Backfill and reprocessing through message retention.
7. Where Kafka is overkill
- Simple work queues (email sending, background processing) — RabbitMQ, Redis, SQS are simpler.
- Short TTL, low throughput — Kafka's operational cost is not justified.
- Workflows where humans want to look at each task — Airflow-family tools fit better.
8. Other candidates
| System | Origin and year | Model | Memo |
|---|---|---|---|
| RabbitMQ | 2007, AMQP 0-9-1 based | queues, exchanges, routing | Routing, round-robin, DLQ. Message persistence and retention are not on Kafka's level. |
| NATS | 2010, Derek Collison | pub/sub, JetStream | Light, low-latency. JetStream (2020) added persistence. |
| Redis Streams | 2018, Redis 5.0 | log + consumer group | A model resembling a scaled-down Kafka. Fits places with small data volume. |
| AWS SQS | 2006 | simple queue | Managed. FIFO queue option. Single message ≤ 256KB. |
| AWS Kinesis | 2013 | stream | Managed with a model similar to Kafka. 24h to 365d retention. |
| Google Pub/Sub | 2015 | pub/sub | Managed. Auto-scaling. Ordering option. |
| Apache Pulsar | 2016, Yahoo (open source) | tiered (broker + bookie) | Multi-tenancy and geo-replication emphasized. |
In practice, the decision usually narrows to one or two of the following.
- Data retention duration (minutes or days).
- Throughput (tens to hundreds of thousands per second).
- Availability of a managed offering.
- Whether routing and filtering is complex (RabbitMQ excels).
- Whether multiple consumers read one topic at different speeds (Kafka-style models fit).
9. Topics, consumers, operations
Topic naming — the format <domain>.<entity>.<event> is common (for example, commerce.orders.created). Separate environments by prefix or by separate cluster. Manage schemas with a Schema Registry (Avro, Protobuf, JSON Schema).
Consumer design — idempotent processing is the baseline. A DLQ (Dead Letter Queue) sends repeatedly failing messages to a separate topic. For transient external dependencies (e.g. API 5xx), bundle retry plus backoff plus DLQ.
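The retry-plus-backoff-plus-DLQ pattern can be sketched as follows. The DLQ is modeled as an in-memory list rather than a real topic, and the attempt counts and delays are illustrative.

```python
import time

dlq = []  # stand-in for a dead-letter topic

def process_with_retry(message, handler, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; park survivors in the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception:
            if attempt == max_attempts:
                dlq.append(message)  # exhausted: park it for later inspection
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# A handler that fails twice with a simulated transient error, then succeeds.
calls = {"n": 0}
def flaky(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 5xx")
    return "ok"

assert process_with_retry("m1", flaky) == "ok"  # recovered on the third try
assert dlq == []
```

Messages that exhaust their retries land in the DLQ topic, where they can be inspected and replayed once the downstream dependency recovers.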
Partition count caps both throughput and the number of consumers in a group. It can be increased later, but doing so changes the key-to-partition mapping and can break ordering assumptions, so choose it with some headroom up front.
Monitoring — consumer lag (how far a group's committed offset trails the partition's log end offset), message rate, and replication lag.
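Consumer lag is just arithmetic over offsets; a sketch, with offsets given as plain per-partition dicts:

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    """Log end offset minus the group's committed offset, summed over partitions."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

# Partition 0 is 10 messages behind; partition 1 is fully caught up.
assert total_lag({0: 100, 1: 250}, {0: 90, 1: 250}) == 10
```

Alerting on lag that grows without bound is the cheapest way to catch a stuck or undersized consumer group.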
10. Common pitfalls
Order assumption — order is guaranteed within a partition, not across the topic. With multiple partitions there is no global order.
Changing partition count — increasing is possible, but the key → partition mapping changes. Messages with the same key may now go to a new partition, which can lead to operational accidents.
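The mapping change is easy to demonstrate. Again `hash() % N` stands in for Kafka's murmur2-based default partitioner; the effect is the same.

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the default partitioner: hash the key, mod partition count.
    return hash(key) % num_partitions

keys = [f"user-{i}" for i in range(1000)]

# Grow the topic from 6 to 12 partitions and count remapped keys.
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 12))

# A large share of keys now land on a different partition, so per-key
# ordering across the resize is broken.
assert moved > 0
```

This is why teams either size partitions generously up front or migrate to a new topic when a resize must preserve per-key ordering.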
Consumer group rebalancing — partition reassignment happens when a new consumer joins or leaves. Processing may pause during that (cooperative rebalancing eases it).
Scope of exactly-once — only within Kafka. Consumers writing to an external DB still need idempotent design.
Operational resources — self-hosted Kafka without a managed offering is a heavy load on a small team. Consider managed offerings like Confluent Cloud, MSK, or Aiven.
Closing thoughts
Kafka is not always the answer to "do we need a queue?" It shines only where retention, reprocessing, and multiple consumer groups truly matter. For a small team, starting with Redis Streams or RabbitMQ and growing from there is the safer operational path.
Next
- pgvector-rag
- supabase
References: Apache Kafka official docs, Kafka design, KRaft guide, Confluent blog, RabbitMQ official, NATS JetStream, Apache Pulsar.