Where Kafka Fits, Where It Does Not
Kafka is often called a "queue," but it is more precisely a distributed commit log. Its strengths lie beyond queueing; at the same time, it is overkill for a simple work queue.
1. About Kafka
Kafka is a distributed messaging and log system that began at LinkedIn. It started as an internal project in 2010, was incubated by the Apache Software Foundation in 2011, and became a top-level project in 2012. The 1.0 release came in 2017.
| Event | Year / version |
|---|---|
| Internal development begins (LinkedIn) | 2010 |
| Apache incubation | 2011 |
| Apache top-level project | 2012 |
| Kafka Streams introduced | 0.10 (2016) |
| Exactly-once semantics | 0.11 (2017) |
| 1.0 GA | 2017-11 |
| KRaft (ZooKeeper-free mode) | 3.3 (2022) |
| Non-ZooKeeper as default option | 3.5+ |
The design intent from the start was high throughput, retention that scales by adding disk, and the ability to reprocess. Some describe Kafka not as a generalization of the queue but as the discovery of the distributed log as an abstraction.
2. Topic, partition, consumer group
- Topic — a logical channel for messages.
- Partition — the unit of parallelism and distribution within a topic. Order is guaranteed only within a partition.
- Message — a record with key, value, and headers; the broker assigns an offset on append. The partition is commonly chosen by hashing the key.
- Offset — the position within a partition. Consumers record their progress.
- Consumer group — consumers in the same group split partitions. Within a group, a partition is assigned to only one consumer.
Thanks to this model, different groups can read the same topic at their own pace. Unlike a queue, where a pulled message is gone, Kafka keeps messages on disk until retention expires.
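The model above can be sketched without a broker. This is a toy Python sketch: plain `hash() % N` stands in for Kafka's murmur2-based partitioner, and dictionaries stand in for per-partition logs and committed offsets.

```python
from collections import defaultdict

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Kafka's default partitioner hashes the key (murmur2); hash() % N is a stand-in.
    return hash(key) % NUM_PARTITIONS

# The "topic": one append-only log per partition.
topic = defaultdict(list)
for key, value in [("user-1", "created"), ("user-2", "created"), ("user-1", "updated")]:
    topic[partition_for(key)].append((key, value))

# Consumer groups track offsets independently; reading never deletes a record.
offsets = {"analytics": defaultdict(int), "billing": defaultdict(int)}

def poll(group: str, p: int):
    off = offsets[group][p]
    if off >= len(topic[p]):
        return None
    offsets[group][p] += 1
    return topic[p][off]

# Same-key records stay in order within their partition.
p = partition_for("user-1")
assert [v for k, v in topic[p] if k == "user-1"] == ["created", "updated"]

# "analytics" draining partition p leaves "billing" untouched.
while poll("analytics", p):
    pass
assert offsets["billing"][p] == 0
```

The last two assertions are the whole point: order holds per key within a partition, and each group's progress is independent of the others.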
3. Retention policies
- Time-based (retention.ms) — for example, 7 days.
- Size-based (retention.bytes) — for example, 100 GB.
- Compaction (cleanup.policy=compact) — keep only the latest value per key; used for key-value snapshot topics.
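Compaction can be sketched in a few lines of Python. This is a minimal, eager version; real compaction is lazy, runs per log segment, and also handles tombstones (null values that delete a key).

```python
def compact(log):
    """Keep only the last record per key, preserving the order of survivors."""
    last_offset = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset          # later records win
    keep = set(last_offset.values())
    return [rec for off, rec in enumerate(log) if off in keep]

log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"), ("user-1", "v3")]
assert compact(log) == [("user-2", "v1"), ("user-1", "v3")]
```

After compaction the topic reads like a key-value snapshot: one latest value per key.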
4. Delivery guarantees
| Guarantee | Configuration |
|---|---|
| at-most-once | acks=0 (producer does not wait for acknowledgment) plus consumer auto-commit before processing. Loss possible. |
| at-least-once | the common default. acks=all plus manual commit after processing. Duplicates possible. |
| exactly-once | idempotent, transactional producer plus consumer isolation.level=read_committed. Holds only for Kafka-to-Kafka pipelines. When writing to external systems, idempotent consumers are still recommended. |
acks (producer), enable.idempotence, and isolation.level (consumer) are the core settings.
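At-least-once delivery means the consumer must tolerate redelivered duplicates. A minimal sketch of an idempotent consumer that deduplicates on a message id; in a real system the processed-id set would be persisted, ideally in the same transaction as the side effect.

```python
processed_ids = set()  # stand-in for durable dedup state
applied = []           # stand-in for the side effect (DB write, charge, email)

def handle(message) -> bool:
    """Apply the message's effect exactly once; return False for duplicates."""
    msg_id, payload = message
    if msg_id in processed_ids:   # redelivery after a crash or rebalance
        return False
    applied.append(payload)
    processed_ids.add(msg_id)     # record only after the effect is applied
    return True

assert handle((1, "charge $5")) is True
assert handle((1, "charge $5")) is False   # duplicate is ignored
assert applied == ["charge $5"]
```

The same pattern is why the exactly-once row above still recommends idempotent consumers for external systems: Kafka's transactions do not reach into your database.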
5. Storage, replication, KRaft
Each partition is replicated across one leader and several followers; replication.factor is typically 3. Replicas in the ISR (In-Sync Replicas) set are caught up with the leader. On leader failure, one replica from the ISR becomes the new leader.
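The failover rule can be sketched as: only in-sync replicas are eligible to take over, which is what keeps acknowledged writes from being lost. Broker names here are illustrative.

```python
def elect_new_leader(failed: str, isr: list) -> str:
    """Promote an in-sync replica; out-of-sync replicas are never eligible."""
    candidates = [r for r in isr if r != failed]
    if not candidates:
        # No in-sync replica left: the partition is unavailable
        # (unless unclean leader election is enabled, risking data loss).
        raise RuntimeError("no in-sync replica available")
    return candidates[0]

replicas = ["broker-1", "broker-2", "broker-3"]  # replication.factor = 3
isr = ["broker-1", "broker-2"]                   # broker-3 has fallen behind
assert elect_new_leader("broker-1", isr) == "broker-2"
```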
Metadata management long relied on ZooKeeper. Since 2022, KRaft (built on the Raft consensus algorithm) has let Kafka run on its own nodes alone, with no ZooKeeper; many operators report a smaller operational surface.
6. Where Kafka is strong
- Event sourcing and CDC — preservation and replay of every change.
- Places where multiple consumers read the same stream at different speeds — publish once, consume by many groups.
- High-throughput log collection — hundreds of thousands of messages per second.
- Entry to real-time analytics — Flink, Spark Streaming, Kafka Streams.
- Backfill and reprocessing through message retention.
7. Where Kafka is overkill
- Simple work queues (email sending, background processing) — RabbitMQ, Redis, SQS are simpler.
- Short TTL, low throughput — Kafka's operational cost is not justified.
- Workflows where humans want to look at each task — Airflow-family tools fit better.
8. Other candidates
| System | Origin and year | Model | Memo |
|---|---|---|---|
| RabbitMQ | 2007, AMQP 0-9-1 based | queues, exchanges, routing | Routing, round-robin, DLQ. Message persistence and retention are not on Kafka's level. |
| NATS | 2010, Derek Collison | pub/sub, JetStream | Light, low-latency. JetStream (2020) added persistence. |
| Redis Streams | 2018, Redis 5.0 | log + consumer group | A model resembling a scaled-down Kafka. Fits places with small data volume. |
| AWS SQS | 2006 | simple queue | Managed. FIFO queue option. Single message ≤ 256KB. |
| AWS Kinesis | 2013 | stream | Managed with a model similar to Kafka. 24h to 365d retention. |
| Google Pub/Sub | 2015 | pub/sub | Managed. Auto-scaling. Ordering option. |
| Apache Pulsar | 2016, Yahoo (open source) | tiered (broker + bookie) | Multi-tenancy and geo-replication emphasized. |
In practice, the decision usually narrows to one or two of the following.
- Data retention duration (minutes or days).
- Throughput (tens to hundreds of thousands per second).
- Availability of a managed offering.
- Whether routing and filtering is complex (RabbitMQ excels).
- Whether multiple consumers read one topic at different speeds (Kafka-style models fit).
9. Topics, consumers, operations
Topic naming — the format <domain>.<entity>.<event> is common (for example, commerce.orders.created). Separate environments by prefix or by separate cluster. Manage schemas with a Schema Registry (Avro, Protobuf, JSON Schema).
Consumer design — idempotent processing is the baseline. A DLQ (Dead Letter Queue) sends repeatedly failing messages to a separate topic. For transient external dependencies (e.g. API 5xx), bundle retry plus backoff plus DLQ.
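The retry-plus-backoff-plus-DLQ pattern can be sketched as follows. The DLQ is modeled as an in-memory list rather than a real topic, and the attempt counts and delays are illustrative.

```python
import time

dlq = []  # stand-in for a dead-letter topic

def process_with_retry(message, handler, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; park survivors in the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception:
            if attempt == max_attempts:
                dlq.append(message)  # exhausted: park it for later inspection
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# A handler that fails twice with a simulated transient error, then succeeds.
calls = {"n": 0}
def flaky(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 5xx")
    return "ok"

assert process_with_retry("m1", flaky) == "ok"  # recovered on the third try
assert dlq == []
```

Messages that exhaust their retries land in the DLQ topic, where they can be inspected and replayed once the downstream dependency recovers.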
Partition count caps both throughput and the number of consumers in a group. It can be increased later, but doing so changes the key-to-partition mapping and can break ordering assumptions, so choose it with some headroom up front.
Monitoring — consumer lag (how far a group's committed offset trails the partition's log end offset), message rate, and replication lag.
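Consumer lag is just arithmetic over offsets; a sketch, with offsets given as plain per-partition dicts:

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    """Log end offset minus the group's committed offset, summed over partitions."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

# Partition 0 is 10 messages behind; partition 1 is fully caught up.
assert total_lag({0: 100, 1: 250}, {0: 90, 1: 250}) == 10
```

Alerting on lag that grows without bound is the cheapest way to catch a stuck or undersized consumer group.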
10. Common pitfalls
Order assumption — order is guaranteed within a partition, not across the topic. With multiple partitions there is no global order.
Changing partition count — increasing is possible, but the key → partition mapping changes. Messages with the same key may now go to a new partition, which can lead to operational accidents.
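The mapping change is easy to demonstrate. Again `hash() % N` stands in for Kafka's murmur2-based default partitioner; the effect is the same.

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the default partitioner: hash the key, mod partition count.
    return hash(key) % num_partitions

keys = [f"user-{i}" for i in range(1000)]

# Grow the topic from 6 to 12 partitions and count remapped keys.
moved = sum(1 for k in keys if partition_for(k, 6) != partition_for(k, 12))

# A large share of keys now land on a different partition, so per-key
# ordering across the resize is broken.
assert moved > 0
```

This is why teams either size partitions generously up front or migrate to a new topic when a resize must preserve per-key ordering.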
Consumer group rebalancing — partition reassignment happens when a new consumer joins or leaves. Processing may pause during that (cooperative rebalancing eases it).
Scope of exactly-once — only within Kafka. Consumers writing to an external DB still need idempotent design.
Operational resources — self-hosted Kafka without a managed offering is a heavy load on a small team. Consider managed offerings like Confluent Cloud, MSK, or Aiven.
Closing thoughts
Kafka is not always the answer to "do we need a queue?" It shines only where retention, reprocessing, and multiple consumer groups truly matter. For a small team, starting with Redis Streams or RabbitMQ and growing from there is the safer operational path.
Next
- pgvector-rag
- supabase
References: Apache Kafka official docs, Kafka design, KRaft guide, Confluent blog, RabbitMQ official, NATS JetStream, Apache Pulsar.