12 · Messaging, Queues & Event-Driven Architecture
What you’ll learn: How asynchronous messaging decouples services, the deep distinction between a queue (RabbitMQ/SQS) and a log (Kafka), and the full set of guarantees — delivery semantics, ordering, exactly-once — with the failure modes that break them in production. By the end you should be able to choose a partition key under pressure, design a dead-letter strategy, reason about backpressure, and defend a Kafka topology in a staff-level design review.
Prerequisites: Read 09-cap-pacelc-and-consistency-models.md (consistency vocabulary), 10-replication-and-consensus.md (ISR/Raft, leader election), and 11-idempotency-and-distributed-transactions.md (idempotency keys, the outbox pattern). This module leans on all three.
1. Why asynchronous messaging exists (intuition)
Imagine a print shop. A customer drops off an order at the counter. The clerk does not make the customer stand there while the order is designed, printed, dried, and boxed. The clerk writes a ticket, drops it in a tray, and the customer leaves. The press operator picks up tickets from the tray when free. The tray is a message queue: it decouples arrival rate from processing rate, survives the operator going on break (durability), and lets you add a second operator to drain the tray faster (horizontal scale).
Synchronous RPC is the customer standing at the counter. It couples availability (if the press is down, the counter is down), couples latency (slow press = slow counter), and couples capacity (one customer at a time). Messaging breaks all three couplings. The price you pay is eventual processing and a whole zoo of new failure modes — duplicates, reordering, poison messages, unbounded backlogs — which is what the rest of this module is about.
| | Synchronous (coupled) | Asynchronous (decoupled) |
|---|---|---|
| Flow | Client ──RPC──▶ Service | Client ─▶ [ QUEUE/LOG ] ─▶ Consumer |
| Caller | Blocks until done | Returns immediately; the consumer drains at its own pace |
| Availability | Fate-shared | Independent |
In Print-Flow-360 terms: an order is placed → an event is published → thumbnail generation (the pdf-service), notification email, inventory decrement, and analytics all react independently. None blocks checkout; none can take checkout down.
2. Two fundamentally different primitives: Queue vs Log
This is the single most misunderstood thing in the space. They look similar (you put messages in, you get messages out) but the data model is opposite.
A queue (RabbitMQ, ActiveMQ, AWS SQS) is destructive read: a broker holds messages, hands each to one consumer, and deletes it on acknowledgment. The broker tracks per-message state. It is fundamentally a smart broker / dumb consumer.
A log (Apache Kafka, AWS Kinesis, Redpanda, Apache Pulsar) is an append-only, replayable ordered sequence. Reads are non-destructive — the consumer tracks its own position (offset). Messages stay until a retention policy evicts them, independent of who read them. It is a dumb broker / smart consumer.
QUEUE (RabbitMQ/SQS) LOG (Kafka)
┌──────────────┐ partition 0: [0][1][2][3][4][5]...→ append
│ msg msg msg │──deliver──▶ C1 ▲ ▲
│ (deleted on │──deliver──▶ C2 consumer-A consumer-B
│ ack) │ (compete) offset=2 offset=5
└──────────────┘ (both read the SAME data independently)
| Dimension | Queue (RabbitMQ / SQS) | Log (Kafka / Kinesis) |
|---|---|---|
| Read model | Destructive (deleted on ack) | Non-destructive (offset cursor) |
| Replay history | No (gone after ack) | Yes (re-seek to old offset) |
| Multiple independent consumers | Needs fan-out (one queue each) | Native (each group has own offsets) |
| Ordering | Per-queue, lost under competing consumers | Strict per partition |
| Throughput ceiling | ~10k–50k msg/s/queue typical | Millions/s (sequential disk + batching) |
| Per-message TTL / priority / delay | First-class (RabbitMQ) | Awkward (no per-msg priority) |
| Retention model | Until consumed | Time/size/compaction based |
| Best for | Task distribution, RPC-style work, priority jobs | Event streaming, event sourcing, replayable pipelines, many subscribers |
Rule of thumb: if the message is a command to do one unit of work once (“generate this thumbnail”), a queue fits. If it’s a fact that many parties care about and may want to re-read (“order #123 was placed”), a log fits. Real systems often use both: Kafka as the event backbone, SQS/Celery for discrete worker jobs.
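To make the destructive-read versus offset-cursor distinction concrete, here is a minimal in-memory sketch. The class names `DestructiveQueue` and `AppendOnlyLog` are hypothetical, and the models ignore durability, networking, and redelivery; they only illustrate who owns the read position.

```python
from collections import deque


class DestructiveQueue:
    """Queue model: a message is removed once a consumer acknowledges it."""

    def __init__(self):
        self._pending = deque()

    def publish(self, msg):
        self._pending.append(msg)

    def receive_and_ack(self):
        # Delivery and ack are collapsed into one step for brevity.
        return self._pending.popleft() if self._pending else None


class AppendOnlyLog:
    """Log model: records stay put; each reader keeps its own offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset):
        return self._records[offset] if offset < len(self._records) else None


queue = DestructiveQueue()
log = AppendOnlyLog()
for event in ("order-1", "order-2"):
    queue.publish(event)
    log.append(event)

# Queue: two competing workers split the messages; nothing is left to replay.
worker_a = queue.receive_and_ack()
worker_b = queue.receive_and_ack()
leftover = queue.receive_and_ack()

# Log: two independent readers each see the full history from offset 0.
email_view = [log.read(offset) for offset in range(2)]
analytics_view = [log.read(offset) for offset in range(2)]
```

In the queue model the broker owns the state (what is still pending); in the log model the reader owns it (which offset to read next), which is why replay comes for free.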
3. Pub/Sub vs Point-to-Point
Two delivery topologies, orthogonal to the queue/log choice:
- Point-to-point (competing consumers): one message → exactly one of N consumers. This is how you parallelize work. N workers pull from one queue; each message handled once. Add workers to scale throughput linearly (until the broker or downstream DB is the bottleneck).
- Publish/subscribe (fan-out): one message → every subscriber gets a copy. This is how you decouple independent reactions to the same fact.
COMPETING CONSUMERS (point-to-point) PUB/SUB (fan-out)
Q ─▶ W1 (gets msg 1,4,7) topic ─▶ EmailSvc (all get every msg)
Q ─▶ W2 (gets msg 2,5,8) ─▶ InventorySvc
Q ─▶ W3 (gets msg 3,6,9) ─▶ AnalyticsSvc
→ scale THROUGHPUT → DECOUPLE reactions
Kafka unifies both with the consumer group: consumers in the same group compete (point-to-point: each partition goes to one member); consumers in different groups each see the full stream (pub/sub). RabbitMQ models it with exchanges and bindings: several consumers on one queue compete for its messages, while a fanout exchange copies each message to every bound queue. SNS→SQS on AWS is the canonical cloud fan-out (SNS publishes, multiple SQS queues subscribe, each drained by its own worker fleet).
4. Kafka internals (the depth interviewers probe)
Topics, partitions, offsets
A topic is a named stream. It’s split into partitions — that’s the unit of parallelism, ordering, and storage. Each partition is an ordered, immutable, append-only log of records, each with a monotonically increasing offset (0, 1, 2, …). A partition lives entirely on one broker’s disk (plus replicas).
Topic "orders" (3 partitions)
P0: [0][1][2][3][4]──▶ (leader on broker-1, replicas on broker-2,3)
P1: [0][1][2]──────▶ (leader on broker-2)
P2: [0][1][2][3]──▶ (leader on broker-3)
Producer hashes key → picks partition. Order guaranteed WITHIN a partition only.
The parallelism ceiling is the partition count. A topic with 6 partitions can be drained by at most 6 active consumers in a group; a 7th consumer sits idle. Pick partition count for your peak future parallelism; you can increase but never decrease, and increasing breaks key→partition stability (see §5).
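The one-owner-per-partition rule can be sketched with a toy assignor. `assign_partitions` is a hypothetical round-robin helper, not Kafka's actual range or sticky assignor, but the ceiling it demonstrates is the same.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over group members; each partition has one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))                          # topic with 6 partitions
consumers = [f"consumer-{n}" for n in range(1, 8)]   # 7 members in one group

assignment = assign_partitions(partitions, consumers)
idle = [consumer for consumer, owned in assignment.items() if not owned]
```

With six partitions and seven members, `idle` holds the one consumer that received nothing; adding more members past the partition count only adds more idle ones.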
Consumer groups, rebalancing, offset commits
Each consumer group gets each partition assigned to exactly one member. The group coordinator (a broker) runs the rebalance protocol when membership changes. Classic (“eager”) rebalancing is stop-the-world: everyone revokes all partitions, then reassigns, and pauses can last seconds. Cooperative incremental rebalancing (KIP-429, available since Kafka 2.4 via the CooperativeStickyAssignor; set partition.assignment.strategy to opt in where it is not already the client default) only moves the partitions that actually need to move. Use static membership (group.instance.id) so a pod restart inside session.timeout.ms doesn’t trigger a rebalance at all.
Consumers commit offsets (to the internal __consumer_offsets topic) to record progress. The semantics hinge on when you commit relative to when you process:
At-least-once: process(msg) ──▶ then commit(offset)
crash between = reprocess on restart = DUPLICATE (safe default)
At-most-once: commit(offset) ──▶ then process(msg)
crash between = message LOST (rarely acceptable)
enable.auto.commit=true commits on a timer regardless of processing — a quiet source of both loss and duplicates. For correctness, disable auto-commit and commit explicitly after processing.
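The two orderings above can be simulated without a broker. This is a sketch under simplifying assumptions: `run_consumer`, `simulate`, and the `Crash` exception are hypothetical names, the committed offset is a plain integer, and the crash is injected between the two steps for one chosen message.

```python
class Crash(Exception):
    """Simulated process crash."""


def run_consumer(messages, state, commit_first, crash_on=None):
    """Consume from the committed offset; optionally crash mid-message.

    state["committed"] is the next offset to read after a restart.
    The crash fires between the two steps (process and commit) of `crash_on`.
    """
    while state["committed"] < len(messages):
        offset = state["committed"]
        msg = messages[offset]
        if commit_first:                       # at-most-once ordering
            state["committed"] = offset + 1
            if msg == crash_on:
                raise Crash(msg)
            state["processed"].append(msg)
        else:                                  # at-least-once ordering
            state["processed"].append(msg)
            if msg == crash_on:
                raise Crash(msg)
            state["committed"] = offset + 1


def simulate(commit_first):
    messages = ["m0", "m1", "m2"]
    state = {"committed": 0, "processed": []}
    try:
        run_consumer(messages, state, commit_first, crash_on="m1")
    except Crash:
        pass                                   # the process restarts below
    run_consumer(messages, state, commit_first)  # resume from committed offset
    return state["processed"]


at_least_once = simulate(commit_first=False)   # ['m0', 'm1', 'm1', 'm2']
at_most_once = simulate(commit_first=True)     # ['m0', 'm2']
```

Process-then-commit handles `m1` twice; commit-then-process never handles it at all. The duplicate is recoverable with an idempotent consumer, the loss is not.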
Replication: ISR, acks, and the durability dial
Each partition has one leader and N-1 followers. The In-Sync Replica (ISR) set is the leader plus the followers that have caught up within replica.lag.time.max.ms. Durability is a producer/broker negotiation:
| acks | Guarantee | Failure cost | Throughput |
|---|---|---|---|
| 0 | Fire-and-forget | Loses data on any hiccup | Highest |
| 1 | Leader persisted | Loses data if leader dies before replication | Medium |
| all (-1) | All ISR persisted | No loss while ≥1 ISR survives | Lowest (still fast) |
acks=all alone isn’t enough: if ISR shrinks to just the leader, “all” means “the one”. Pair it with min.insync.replicas=2 (on a replication-factor-3 topic) so a write fails fast rather than silently risking loss. And set unclean.leader.election.enable=false — otherwise a stale out-of-sync replica can be elected leader and silently truncate committed data (the famous availability-vs-durability knob; see consensus trade-offs in 10-replication-and-consensus.md).
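The interaction between acks and min.insync.replicas can be written down as a small decision function. `write_outcome` is a hypothetical model of the broker's accept/reject decision, not a Kafka API; it assumes the ISR count includes the leader and that min.insync.replicas is only enforced for acks=all.

```python
def write_outcome(acks, isr_size, min_insync_replicas):
    """Return how a produce request fares for the given durability settings.

    isr_size counts the leader too. min.insync.replicas is only checked
    for acks=all; with acks=0 or acks=1 the leader takes the write anyway.
    """
    if acks == "all" and isr_size < min_insync_replicas:
        return "rejected"                      # the write fails fast
    if acks == "all":
        return f"acked once {isr_size} in-sync replica(s) have the write"
    if acks == "1":
        return "acked once the leader has the write"
    return "not acked (fire-and-forget)"


# ISR has shrunk to the leader alone on a replication-factor-3 topic.
silent_risk = write_outcome("all", isr_size=1, min_insync_replicas=1)
fails_fast = write_outcome("all", isr_size=1, min_insync_replicas=2)
```

With min.insync.replicas=1 the shrunken ISR still acknowledges the write on a single copy; with min.insync.replicas=2 the same write is rejected, which is the fail-fast behavior described above.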
Retention vs log compaction
Two eviction policies, often confused:
- Retention (cleanup.policy=delete): drop segments older than retention.ms (default 7 days) or beyond retention.bytes. Suits time-series data and events.
- Compaction (cleanup.policy=compact): keep the latest value per key forever, garbage-collecting superseded versions. Turns the log into a durable, replayable changelog / snapshot, a good fit for “current state of every order keyed by order_id”. Kafka Streams’ state stores and __consumer_offsets rely on it. A null value is a tombstone: it marks the key as deleted, and the tombstone itself is removed after delete.retention.ms.
Compaction with keys u42 and u7:
before: [u42=A][u7=X][u42=B][u42=C][u7=Y]
after: [u42=C][u7=Y] ← only the newest per key survives, at its original offset
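A minimal sketch of the compaction rule, assuming the log is a plain list of (key, value) pairs in offset order. `compact` is a hypothetical helper; real compaction works segment by segment in the background and keeps tombstones around for delete.retention.ms, whereas here they disappear immediately.

```python
def compact(log):
    """Keep only the newest record per key; drop keys whose newest value is None.

    Returns (offset, key, value) tuples so the surviving offsets are visible.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)          # later offsets overwrite earlier
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None                   # None models a tombstone
    ]
    return sorted(survivors)                   # survivors stay in offset order


log = [("u42", "A"), ("u7", "X"), ("u42", "B"), ("u42", "C"), ("u7", "Y")]
compacted = compact(log)                       # [(3, 'u42', 'C'), (4, 'u7', 'Y')]
after_delete = compact(log + [("u7", None)])   # [(3, 'u42', 'C')]
```

Surviving records keep their original offsets, so a consumer that re-reads the compacted topic sees gaps in the offset sequence but still ends up with the latest value for every key.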
5. Ordering guarantees and partition-key choice
Kafka guarantees order only within a partition. Across partitions, all bets are off. So the partition key is the most consequential design decision you’ll make. Records with the same key always land on the same partition (key hash mod partition count), so they’re ordered relative to each other.
- Order matters per entity (all events for order_123 must apply in sequence): key by order_id. Now status transitions never race. The cost: an order with a hot stream of events can hot-spot one partition.
- Order doesn’t matter / you want max parallelism: key by something high-cardinality and uniform (e.g. a random/round-robin key), or null key (round-robin batches).
The partitioning skew trap: keying by tenant_id in a multi-tenant SaaS (like this codebase) means one whale tenant saturates a single partition while others idle — a classic hot-partition incident. Mitigation: composite key (tenant_id:order_id) when per-order ordering is enough, or salt the key.
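The skew can be seen with a few lines of arithmetic. This sketch uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2, so real partition numbers will differ), and the tenant names and traffic split are made up for illustration.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Stable hash of the key, modulo partition count (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# 900 events from one whale tenant, 100 spread over ten small tenants.
events = [("tenant-whale", f"order-{i}") for i in range(900)]
events += [(f"tenant-{i % 10}", f"order-{900 + i}") for i in range(100)]

by_tenant = Counter(partition_for(tenant, 6) for tenant, _ in events)
by_composite = Counter(
    partition_for(f"{tenant}:{order}", 6) for tenant, order in events
)

hottest_by_tenant = max(by_tenant.values())        # at least the whale's 900
hottest_by_composite = max(by_composite.values())  # load spread across partitions

# Same key, same partition, as long as the partition count is unchanged.
stable = partition_for("order-123", 6) == partition_for("order-123", 6)
```

Keying by tenant pins every whale event to one partition; the composite key spreads the same traffic while still keeping each individual order's events together.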
Subtle producer gotcha: even single-partition order can break. With retries>0, a failed batch retried after a later batch succeeded reorders records — unless max.in.flight.requests.per.connection ≤ 5 with the idempotent producer enabled (Kafka then preserves order on retry via sequence numbers). With the idempotent producer off, you need max.in.flight=1 for strict order, which kills pipelining throughput.
6. Delivery semantics: at-most / at-least / exactly-once
| Semantics | Duplicates? | Loss? | Use when |
|---|---|---|---|
| at-most-once | no | possible | metrics you can lose, max throughput |
| at-least-once | POSSIBLE | no | default; pair with idempotent consumers |
| exactly-once | no | no | money, dedup-expensive side effects |
The honest senior take: in a distributed system you get at-least-once delivery, and you engineer effectively-once processing by making consumers idempotent. The Two Generals problem means the producer can never know its message arrived without an ack, and the ack itself can be lost — so it must retry, so duplicates are inevitable on the wire. See 11-idempotency-and-distributed-transactions.md.
Idempotent consumers (the real-world EOS)
Make processing safe to repeat: dedup on a business key (event_id/order_id) stored in a processed_events table with a unique constraint, or design naturally idempotent operations (SET status='paid' is idempotent; balance = balance + 10 is not). This is the technique you reach for 95% of the time.
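A minimal sketch of the processed_events technique, using SQLite from the standard library so it runs anywhere. The `accounts` table, the `handle_payment` function, and the event ids are illustrative assumptions; the key point is that the dedup marker and the side effect commit in the same transaction.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('acct-1', 0)")
db.commit()


def handle_payment(event_id, account_id, amount):
    """Apply a credit once per event_id; return False for a duplicate."""
    try:
        with db:  # one transaction: dedup marker and side effect commit together
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # unique constraint fired: already processed, skip


first = handle_payment("evt-1", "acct-1", 10)
redelivered = handle_payment("evt-1", "acct-1", 10)  # at-least-once duplicate
balance = db.execute(
    "SELECT balance FROM accounts WHERE id = 'acct-1'"
).fetchone()[0]
```

The non-idempotent `balance + ?` update becomes safe to redeliver because the unique constraint rejects the second insert and the transaction rolls back before the credit is applied again.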
Kafka exactly-once (EOS) — what it actually covers
Kafka offers genuine exactly-once within Kafka via two mechanisms:
- Idempotent producer (enable.idempotence=true, default in modern clients): the broker assigns each producer a PID and tracks per-partition sequence numbers, so a retried duplicate batch is dropped broker-side. This kills producer-retry duplicates, but only within a single producer session and per partition.
- Transactions (transactional.id + initTransactions/beginTransaction/sendOffsetsToTransaction/commitTransaction): atomically write to multiple partitions and commit consumer offsets in one transaction. Consumers set isolation.level=read_committed to skip aborted records. This makes the consume→process→produce loop atomic, which is the foundation of Kafka Streams EOS (processing.guarantee=exactly_once_v2).
The crucial caveat: Kafka EOS is exactly-once for the Kafka→Kafka path. The moment your consumer does a side effect outside Kafka — write to Postgres, call Stripe, send an email — Kafka transactions don’t cover it. For that boundary you still need the transactional outbox + idempotency keys (covered in module 11). Don’t let “Kafka has exactly-once” lull you into skipping consumer idempotency on external effects.
7. Backpressure, retries, DLQs, and poison messages
Backpressure
Producers can outrun consumers. What happens to the backlog?
- Log (Kafka): the backlog just sits on disk (cheap; TBs are fine). Backpressure surfaces as consumer lag (offset distance behind the log head). Alert on lag trend, not absolute value: rising lag = consumers losing the race. Kafka has effectively no producer-side backpressure beyond buffer-full blocking (buffer.memory, max.block.ms).
- Queue (RabbitMQ): unbounded queues swell RAM until the broker hits its memory high-watermark and blocks publishers, so backpressure propagates to producers. Use bounded queues / x-max-length with an overflow policy (drop-head or reject-publish), and consumer prefetch (basic.qos) so one greedy consumer doesn’t hoard 10k unacked messages.
- Pull vs push: Kafka consumers pull (natural flow control: fetch when ready). RabbitMQ pushes (must bound with prefetch or you flood the consumer). Pull-based is generally more robust under backpressure.
The cascade failure: an unbounded queue in front of a slow DB hides the problem until the queue’s storage fills, then everything upstream blocks at once — a thundering collapse. Bounded buffers + load shedding fail gracefully instead. “An unbounded queue is a memory leak with extra steps.”
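A bounded buffer with load shedding can be sketched with the standard library's `queue.Queue`. This is an in-process analogy for a reject-on-overflow policy, not broker code; `try_publish` and the buffer size are assumptions for illustration.

```python
import queue

buffer = queue.Queue(maxsize=3)    # bounded: at most 3 messages waiting


def try_publish(msg):
    """Enqueue without blocking; shed the message when the buffer is full."""
    try:
        buffer.put_nowait(msg)
        return True
    except queue.Full:
        return False               # caller can surface an error or retry later


accepted = [try_publish(f"job-{n}") for n in range(5)]

buffer.get_nowait()                # a consumer drains one message...
accepted_after_drain = try_publish("job-5")   # ...so there is room again
```

The producer learns about overload at the moment it happens (two of the first five publishes are refused) instead of discovering it later when an unbounded backlog exhausts memory or disk.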
Retry with exponential backoff + jitter
A transient failure shouldn’t hammer a struggling downstream. Back off exponentially, and always add jitter to avoid the retry thundering herd (10k consumers all retrying at exactly t+1s, t+2s, … synchronize into traffic spikes that re-kill the service).
delay = min(cap, base * 2^attempt) # exponential
delay = random_between(0, delay) # FULL JITTER (AWS-recommended)
# base=100ms, cap=20s, attempt 0..6 → ~100ms,200,400,800,1.6s,3.2s,6.4s (then randomized)
AWS’s “Exponential Backoff and Jitter” article showed full jitter minimizes both contention and completion time vs naive backoff. Cap total attempts; infinite retry of a permanent error is a tight loop in disguise.
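The two-line formula above translates directly into Python. This is a sketch with the same base and cap as the worked numbers; the function names are hypothetical, and the `rng` parameter is only there so a seeded generator can be passed in for reproducible tests.

```python
import random


def backoff_ceiling(attempt, base=0.1, cap=20.0):
    """Exponential ceiling in seconds: min(cap, base * 2**attempt)."""
    return min(cap, base * 2 ** attempt)


def full_jitter_delay(attempt, base=0.1, cap=20.0, rng=random):
    """Pick a delay uniformly in [0, ceiling] (the full-jitter strategy)."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))


ceilings = [backoff_ceiling(attempt) for attempt in range(9)]
delays = [full_jitter_delay(attempt) for attempt in range(9)]
```

The ceilings double from 0.1s up to 12.8s and then hit the 20s cap at attempt 8; each actual delay is drawn somewhere below its ceiling, so clients that failed at the same instant no longer retry at the same instant.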
Dead-letter queues & poison messages
A poison message can never succeed — malformed payload, a referenced row that was deleted, a bug. Retrying it forever blocks the partition/queue (head-of-line blocking) and burns CPU. After N attempts, route it to a Dead-Letter Queue (DLQ) and move on.
main queue ──process──┐ fail
▼ attempt < N? ──yes──▶ retry (backoff queue)
retry? │
│ attempt ≥ N ▼
└──────────────────▶ [ DLQ ] (alert + human/automated triage)
- SQS: native. Set a redrive policy with maxReceiveCount; after N receives the message auto-moves to the DLQ.
- RabbitMQ: x-dead-letter-exchange; messages dead-lettered on reject/nack/TTL-expiry. Delayed-retry tiers via per-queue TTL + DLX chaining.
- Kafka: no native DLQ; you publish failed records to a topic.DLT (Spring Kafka / Kafka Connect do this for you). Critically, never block the partition on a poison record; commit past it and side-line it, or you stall every key behind it.
DLQ discipline: a DLQ with no alarm and no owner is a black hole where data goes to die silently. Monitor DLQ depth, capture the failure reason + stack on the message, and build a redrive path (fix bug → replay DLQ back to main). For Kafka, replayability means you can also just re-seek the offset.
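The retry-then-side-line flow can be sketched as a small loop. `consume`, the example `handler`, and the message shapes are hypothetical; backoff between attempts is left out to keep the sketch short, and the DLQ is just a list.

```python
def consume(messages, handler, max_attempts=3):
    """Try each message up to max_attempts, then side-line it to a DLQ."""
    processed, dead_letters = [], []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                processed.append(msg)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Keep the failure reason with the message for triage.
                    dead_letters.append(
                        {"message": msg, "error": repr(exc), "attempts": attempt}
                    )
    return processed, dead_letters


def handler(msg):
    """Hypothetical handler: rejects payloads without an order_id."""
    if "order_id" not in msg:
        raise ValueError("missing order_id")


messages = [{"order_id": 1}, {"garbage": True}, {"order_id": 2}]
processed, dead_letters = consume(messages, handler)
```

The malformed record lands in `dead_letters` with its error attached after three attempts, and the order behind it is still processed, which is the opposite of head-of-line blocking.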
8. Throughput tuning (concrete knobs & numbers)
Kafka’s speed comes from sequential disk I/O, the OS page cache, zero-copy (sendfile), and batching. The producer-side levers:
| Knob | Effect | Typical |
|---|---|---|
| batch.size | Bytes per partition batch | 64KB–256KB (up from 16KB default) |
| linger.ms | Wait to fill a batch | 5–50ms (trade latency for throughput) |
| compression.type | Shrink payload | lz4/zstd (zstd best ratio) |
| buffer.memory | Producer send buffer | bump for bursty load |
| fetch.min.bytes (consumer) | Wait for N bytes before returning | raise to batch fetches |
Back-of-envelope: at 1KB per message, 100MB/s/broker of sustained sequential write is ~100k msg/s/broker, and a partition tops out around 10–50MB/s. Need 1M msg/s? That is ~1GB/s, so roughly 20–100 partitions spread over at least 10 brokers (more once replication traffic is counted), batched and lz4-compressed. More partitions ≠ free: each adds open file handles, replication overhead, longer rebalances, and higher end-to-end latency; thousands of partitions per broker degrade controller failover. Size partitions for required parallelism plus headroom, not “as many as possible.”
The dominant real bottleneck is rarely the broker — it’s the downstream the consumer writes to (Postgres, an API). Tune consumer concurrency and downstream batching together; a faster broker just fills your DB’s queue sooner.
9. Common pitfalls / war stories
- Auto-commit ate my messages. enable.auto.commit=true committed offsets on its 5s timer mid-batch; a crash skipped the un-processed tail. Fix: manual commit after processing.
- The hot partition. Keyed events by tenant_id; the biggest customer’s traffic pinned one consumer at 100% while five sat idle. Lag climbed only on that tenant’s partition. Fix: composite/salted key.
- Reordered writes despite one partition. Idempotent producer was off and max.in.flight=5; a retried batch overtook a later one, applying an old status after a new one. Fix: enable.idempotence=true.
- The retry storm. A flaky payment API blipped; 50k consumers retried in lockstep (no jitter) and turned a 2s blip into a 20-minute self-inflicted DDoS. Fix: full jitter + circuit breaker.
- The silent DLQ. Poison messages piled into a DLQ nobody watched; finance noticed missing orders a week later. Fix: alarm on DLQ depth, store failure reason, build redrive.
- Treating Kafka like a database. Querying by scanning a topic, or keeping infinite retention as a “source of truth” without compaction or an external store. Kafka is a log, not a queryable DB.
- “We have exactly-once.” Enabled Kafka transactions, then a consumer also charged Stripe — double charges on rebalance reprocessing because the external call wasn’t idempotent. Kafka EOS doesn’t cross the Kafka boundary.
- The dual-write. Service writes to Postgres and publishes to Kafka in two steps; a crash between them leaves DB and stream inconsistent. Fix: the transactional outbox + CDC/Debezium (see module 11).
- Rebalance storms. Aggressive session.timeout.ms + GC pauses caused constant rebalances; throughput collapsed under stop-the-world reassignment. Fix: cooperative rebalancing + static membership + tuned timeouts.
🧩 Case Study: LinkedIn’s creation of Apache Kafka (the central log)
The problem. By 2010 LinkedIn had ~90 million members and a data-integration mess that grew worse every quarter. Dozens of systems needed to know about the same events: a member viewing a profile, sending a connection request, a page impression, a search. The activity-tracking and operational-metrics pipelines fed the newsfeed, relevance/search, ad targeting, fraud detection, the data warehouse, and a fleet of Hadoop jobs. Each producer→consumer link was a bespoke point-to-point integration. With N sources and M sinks you tend toward an O(N×M) spaghetti of pipelines, each with its own format, batching cadence, and failure behavior. Their existing tools didn’t fit: traditional message queues (ActiveMQ) buckled under the volume and, being destructive-read, couldn’t feed both a real-time consumer and a batch Hadoop loader from the same stream; log-aggregation tools (Scribe/Flume) were one-directional toward HDFS and weren’t built for low-latency online consumers. Volumes were already large and grew steeply: billions of messages/day in Kafka’s first years in production, hundreds of billions/day a few years later, and eventually over a trillion/day with petabytes flowing through the cluster.
The insight: make the log the primitive (not the queue). This is exactly the log vs queue distinction from §2. Instead of a smart broker that deletes on ack, LinkedIn built a dumb broker holding an append-only, replayable, non-destructive log, with smart consumers tracking their own offsets. The same event stream could now serve a sub-second newsfeed consumer and a once-an-hour Hadoop batch loader off identical data, each at its own position. The O(N×M) web collapsed into O(N+M): every producer writes once to Kafka; every consumer reads independently. That is the decoupling of producers from consumers the module opens with — neither knows the other exists.
BEFORE: O(N×M) point-to-point pipelines AFTER: O(N+M) via the central log
tracking ─┬─▶ search tracking ─┐
db-CDC ─┼─▶ newsfeed db-CDC ─┤ ┌─▶ search (group, offset)
metrics ─┼─▶ Hadoop metrics ─┼─▶ KAFKA┼─▶ newsfeed (group, offset)
ads ─┴─▶ fraud ...every pair wired ads ─┘ LOG ├─▶ Hadoop (group, offset)
└─▶ fraud (replay any time)
Partitions, consumer groups, offsets — the scaling levers. A topic like PageViewEvent was split into partitions (§4), the unit of parallelism and of ordering. Each consumer group (newsfeed, search-indexer, Hadoop ETL) gets every partition assigned to exactly one member, so members of one group compete (point-to-point parallelism) while different groups each see the whole stream (pub/sub) — the consumer-group duality from §3. Throughput scales by adding partitions and consumers; LinkedIn ran topics with hundreds of partitions to drain a single high-volume stream across many machines.
Ordering per key — chosen deliberately, not globally. LinkedIn did not want a global total order (it doesn’t scale — one partition, one disk). They accepted ordering only within a partition (§5) and chose partition keys to match the entity that needed ordering: events keyed by member or by entity id land on the same partition and stay in sequence, while unrelated events spread across partitions for parallelism. This is the precise trade-off the module hammers: per-key order or unbounded parallelism, picked per stream.
Throughput from boring mechanics. Kafka’s numbers came from the levers in §8 — sequential disk I/O, the OS page cache, zero-copy (sendfile) straight from page cache to socket, and aggressive producer batching with compression. LinkedIn’s published 2014 benchmark pushed ~2 million writes/sec on three cheap commodity brokers; their production cluster carried over a trillion messages/day. Crucially they leaned on retention being cheap on disk — keep days of data and let consumer lag (not a swelling queue) be the backpressure signal (§7).
Replay was the killer feature. Because reads are non-destructive, a consumer that shipped a bug could fix it and re-seek to an old offset to reprocess history — the replay capability from §2/§7. A new derived dataset (say, a fresh search index) could be built by replaying the whole topic from offset 0. This same property turned the log into the backbone of event sourcing and stream processing (later Samza, then Kafka Streams) and of CDC via Debezium feeding the outbox pattern (module 11).
The trade-off they accepted. Kafka gave up the conveniences of a queue: no per-message acknowledgment, priority, TTL, or selective delete, and (originally) only at-least-once delivery — duplicates were the consumer’s problem to dedup (the effectively-once via idempotent consumers stance of §6; transactions/EOS came years later in KIP-98). They also gave up global ordering and strong queryability — Kafka is a log, not a database (§9). In exchange they got linear horizontal throughput, replayability, and one stream feeding many independent consumers. For a data-integration backbone, that bargain was overwhelmingly correct: Kafka was open-sourced (2011), graduated from Apache incubation, and became the de-facto streaming standard, with Confluent spun out to commercialize it.
Lessons
- Pick the log when many parties care about the same facts and may re-read them; pick the queue when it’s one-unit-of-work-once. LinkedIn’s whole win was recognizing their events were facts, not commands — so a replayable, non-destructive log beat a destructive queue.
- A shared log turns O(N×M) integration spaghetti into O(N+M). Standardize on one event backbone with self-tracking consumers instead of wiring every source to every sink.
- Partition key = your ordering-vs-parallelism decision. Order per entity costs you a hot-partition risk; global order costs you scale. Choose per stream, not platform-wide.
- Cheap sequential disk + replay changes what’s possible. Treat the log as durable, re-readable history — it enables reprocessing, new derived views, and CDC — but don’t mistake it for a queryable database or assume duplicates won’t happen.
10. Test yourself
- You need per-order event ordering AND high throughput across millions of orders. What’s the partition key, and what breaks it? Hint: key by order_id for per-entity order; a producer with max.in.flight > 1 and idempotence off can still reorder on retry; increasing partition count later breaks key→partition stability.
- Why is “exactly-once delivery” impossible, and what do we actually build instead? Hint: Two Generals / lost-ack; at-least-once + idempotent (effectively-once) processing.
- acks=all is set but you still lost data. How? Hint: ISR shrank to the leader and min.insync.replicas wasn’t set, or unclean leader election truncated committed records.
- When do you choose RabbitMQ/SQS over Kafka? Hint: discrete task distribution, per-message priority/TTL/delay, no replay needed, modest throughput; vs replayable event stream with many independent subscribers.
- A poison message is stuck at the head of a Kafka partition. What happens and what do you do? Hint: head-of-line blocking stalls every key behind it; commit past it / route to a DLT after N attempts, alert + redrive.
- Why add jitter to exponential backoff? Hint: synchronized retries form a thundering herd that re-kills the recovering service; full jitter spreads load.
- Difference between log compaction and retention, and when do you want compaction? Hint: retention drops old; compaction keeps latest-per-key → durable changelog / current-state snapshot; tombstones delete keys.
- Your consumer does exactly-once into Kafka but also writes Postgres. Is it exactly-once end-to-end? Hint: no — Kafka transactions don’t cover external side effects; need outbox + idempotency keys at that boundary.
11. Further reading
- DDIA (Kleppmann), Designing Data-Intensive Applications — Ch. 11 “Stream Processing” (logs vs queues, exactly-once, fault tolerance) and Ch. 4 (message schemas). The canonical treatment.
- Apache Kafka docs — Design, Replication, Exactly-Once Semantics (KIP-98), Idempotent Producer, Log Compaction. https://kafka.apache.org/documentation/
- Kreps, “The Log: What every software engineer should know about real-time data’s unifying abstraction” — the foundational essay on the log abstraction.
- AWS Architecture Blog, “Exponential Backoff and Jitter” — the definitive empirical case for full jitter. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- KIP-98 (Kafka transactions / EOS) and KIP-429 (cooperative incremental rebalancing) — primary-source design rationale.
- AWS SQS Developer Guide — dead-letter queues, redrive policy, FIFO vs standard queues.
- RabbitMQ docs — Consumer Prefetch, Dead Letter Exchanges, Quorum Queues, Flow Control.
- Confluent, “Transactions in Apache Kafka” and “Designing Event-Driven Systems” (Stopford, free O’Reilly e-book).
- Sibling modules: 09-cap-pacelc-and-consistency-models.md, 10-replication-and-consensus.md, 11-idempotency-and-distributed-transactions.md.