Skip to main content

Building an Idempotent Webhook Consumer for Email Events

Build an idempotent email webhook consumer: dedupe on the provider event id, make side effects safe to repeat, and resolve out-of-order events deterministically.

When your consumer double-counts opens or flips an address from suppressed to active and back, the cause is almost always the same: it treated an at-least-once event stream as if each event arrived exactly once. This guide builds a consumer that produces the same correct state no matter how many times — or in what order — the provider redelivers an event.

The Problem and Its Scope

SendGrid, Amazon SES (via SNS), Postmark, and Mailgun all guarantee at-least-once delivery of webhook events and retry on any non-2xx response or timeout. The practical consequence is that the same event arrives multiple times, and a delivered can land after the bounce for the same message because there is no ordering guarantee. A naive consumer that increments a counter per webhook or runs UPDATE blindly will double-count engagement, re-suppress already-suppressed addresses, or — worse — un-suppress an address when a late delivered overwrites a bounce. This guide applies to any consumer draining email webhooks; examples use PostgreSQL and Node.js with notes on SQS FIFO versus standard queues.

Root Cause: At-Least-Once Plus No Ordering

Two independent provider properties combine into the bug. First, retry semantics: if your receiver returns 503, times out, or the network drops the response, the provider redelivers the identical event — same sg_event_id or SES messageId — minutes or hours later. Second, no ordering guarantee: events are generated by distributed systems and delivered over independent retries, so the open for a message can arrive before its delivered, and a transient deferred after the final delivered. A consumer that assumes "each webhook is a new, ordered fact" is therefore wrong on both axes. Idempotency fixes the first; an explicit ordering rule fixes the second.

The Exact Fix

Four rules, applied together:

  1. Dedupe on a stable provider event id using a unique constraint and an upsert that does nothing on conflict.
  2. Make side effects idempotent — suppression is an upsert, not an insert; counters are derived from the event store, not incremented per webhook.
  3. Return 200 quickly and process asynchronously, so retries are driven by genuine failures, not slowness.
  4. Resolve out-of-order events with the event timestamp and last-write-wins per (message_id, event_type), with terminal states sticky.
Dedup and idempotency flow An event is inserted with on-conflict-do-nothing; a new row drives side effects while a duplicate is skipped, and message status resolves by latest timestamp. Dedup & Idempotency Flow Event off queue has event_id INSERT ... ON CONFLICT DO NOTHING New row run side effect Duplicate skip (no-op) Resolve status Status resolves per (message_id, type) by latest occurred_at — last write wins. Terminal states (bounce, complaint) are sticky: a late delivered cannot override them.
Dedup on the event id makes redelivery a no-op; only a newly inserted row triggers side effects, and message status resolves by latest timestamp.

Schema and consumer

-- email_events: one row per provider event, deduped by a unique event id.
CREATE TABLE email_events (
  event_id     text PRIMARY KEY,        -- SendGrid sg_event_id / Mailgun event-data.id / SES synthetic id
  message_id   text NOT NULL,           -- correlation id back to the original send
  type         text NOT NULL,           -- delivered | open | click | bounce | spamreport | dropped
  recipient    text NOT NULL,
  occurred_at  timestamptz NOT NULL,    -- the provider's event timestamp, used for ordering
  payload      jsonb NOT NULL
);
CREATE INDEX email_events_message_idx ON email_events (message_id, type, occurred_at DESC);

-- message_status: the resolved current state per (message, event type), last-write-wins.
CREATE TABLE message_status (
  message_id   text NOT NULL,
  type         text NOT NULL,
  occurred_at  timestamptz NOT NULL,
  PRIMARY KEY (message_id, type)
);
// consumer.js — idempotent processing of one event.
const { Pool } = require('pg');
const pool = new Pool();

// Terminal states are sticky: once a message bounces or is reported as spam,
// a late-arriving "delivered" must NOT flip it back. Used by the status resolver.
const TERMINAL = new Set(['bounce', 'spamreport', 'dropped']);

async function processEvent(ev) {
  // SendGrid gives sg_event_id; SES has no native event id, so synthesize a stable one.
  const eventId = ev.sg_event_id ?? `${ev.message_id}:${ev.type}:${ev.occurred_at}`;

  // 1. Dedupe: a redelivered event hits the unique PK and inserts zero rows.
  const inserted = await pool.query(
    `INSERT INTO email_events (event_id, message_id, type, recipient, occurred_at, payload)
     VALUES ($1, $2, $3, $4, $5, $6)
     ON CONFLICT (event_id) DO NOTHING
     RETURNING event_id`,
    [eventId, ev.message_id, ev.type, ev.recipient, ev.occurred_at, ev],
  );
  if (inserted.rowCount === 0) return; // duplicate delivery — already processed, no-op

  // 2. Resolve out-of-order status: only advance if this event is newer than what we hold.
  //    The WHERE clause makes a stale, late-arriving event a no-op even though it is "new".
  await pool.query(
    `INSERT INTO message_status (message_id, type, occurred_at)
     VALUES ($1, $2, $3)
     ON CONFLICT (message_id, type) DO UPDATE
       SET occurred_at = EXCLUDED.occurred_at
       WHERE message_status.occurred_at < EXCLUDED.occurred_at`,
    [ev.message_id, ev.type, ev.occurred_at],
  );

  // 3. Idempotent side effect: suppression is an upsert, safe to run again.
  if (TERMINAL.has(ev.type)) {
    await pool.query(
      `INSERT INTO suppressions (email, reason, suppressed_at)
       VALUES ($1, $2, $3)
       ON CONFLICT (email) DO NOTHING`,  // already suppressed? leave the original reason/time
      [ev.recipient, ev.type, ev.occurred_at],
    );
  }
}

The ON CONFLICT (event_id) DO NOTHING is the dedup. The RETURNING plus rowCount === 0 check is what lets every side effect run exactly once across infinite redeliveries. The conditional DO UPDATE ... WHERE on message_status is the out-of-order fix: a delivered that arrives after a bounce will not overwrite the more recent terminal state because its occurred_at is older.

Variant: The Exactly-Once Illusion and Queue Choice

There is no exactly-once delivery over the network — only exactly-once processing, achieved by deduping on a stable id within a window. If your provider does not supply a stable event id (SES via SNS does not), synthesize one from message_id + type + occurred_at, as above, and keep it as the primary key. A bounded dedup window (for example, only keep email_events rows for 30 days) is a practical compromise: redeliveries beyond the window are vanishingly rare, and pruning keeps the table fast.

Queue choice shifts how much ordering work you do:

  • SQS standard is at-least-once and explicitly unordered. It is the right default for email events because your consumer is already idempotent and resolves order from timestamps. Cheaper, higher throughput, no extra config.
  • SQS FIFO offers ordering within a MessageGroupId and built-in five-minute deduplication on MessageDeduplicationId. Setting MessageGroupId = message_id keeps a single message's events in order, which simplifies status resolution — but it caps throughput per group and its dedup window is only five minutes, far shorter than the provider's retry horizon. FIFO complements, but does not replace, the database-level unique constraint.

The pragmatic stance: use standard queues plus database idempotency. Reach for FIFO only when strict per-message ordering materially simplifies a downstream consumer, and never rely on FIFO's five-minute dedup as your sole guard against the provider's hours-long retries.

Property SQS standard SQS FIFO
Delivery At-least-once Exactly-once within the 5-min dedup window
Ordering None (best-effort) Ordered per MessageGroupId
Dedup None (you dedup in the DB) Built-in on MessageDeduplicationId, 5 minutes only
Throughput Effectively unlimited Capped per message group
Right fit for email events Yes — consumer is already idempotent and timestamp-orders Only when a downstream genuinely needs per-message ordering

The throughput cap is the practical reason FIFO is rarely the default for email: if you key MessageGroupId on message_id to get per-message ordering, every message becomes its own group, and high send volume bumps into FIFO's per-group limits. Keying the group more coarsely (e.g. per sending domain) restores throughput but loses the per-message ordering that was the only reason to choose FIFO. Standard queues sidestep the dilemma entirely because the database, not the queue, is the ordering and dedup authority.

A dedup-window strategy

"Dedup forever" is unaffordable — the email_events table would grow without bound and the unique-index lookup would slow down. The realistic guarantee is dedup within a bounded window that comfortably exceeds the provider's maximum retry horizon. The window has to satisfy one inequality: it must be longer than the longest interval over which a provider can redeliver the same event. SendGrid, SES, Postmark, and Mailgun all retry over hours, occasionally up to roughly a day, so a window of 30 days carries a wide safety margin while keeping the table prunable.

Two layers implement it. The cheap, fast layer is a short-lived dedup cache (Redis SET key NX EX) that absorbs the burst of near-simultaneous redeliveries; the authoritative layer is the database unique constraint, which catches a redelivery that arrives after the cache key expired:

// Layer 1: a Redis fast-path dedup absorbs rapid redeliveries cheaply.
// SET ... NX returns null if the key already exists → seen recently, skip the DB round-trip.
const fresh = await redis.set(`evt:${eventId}`, '1', 'NX', 'EX', 3600); // 1h cache window
if (fresh === null) return;          // recent duplicate; the DB would no-op anyway

// Layer 2: the authoritative INSERT ... ON CONFLICT is the real guarantee (window = retention).
const inserted = await pool.query(/* ON CONFLICT (event_id) DO NOTHING RETURNING event_id */);
if (inserted.rowCount === 0) return; // duplicate beyond the cache window — still safe

Prune the authoritative table on the same window so the unique index stays small. Crucially, do the prune and the dedup against the same horizon, because a row deleted while the provider can still redelivery its event would let a duplicate through:

-- Run on a schedule. The retention window MUST exceed the provider's max retry horizon (~24h),
-- so 30 days leaves a wide margin while keeping the unique-index lookup fast.
DELETE FROM email_events WHERE occurred_at < now() - interval '30 days';

For a provider that supplies no native event id (SES via SNS), the synthesized id message_id:type:occurred_at is stable across redeliveries because all three inputs are fixed at event creation — so it deduplicates correctly within the same window, provided occurred_at comes from the provider's event payload and not from your own ingest clock (which differs on each redelivery and would defeat dedup entirely).

Pipeline Integration

This consumer is the fourth stage of the webhook event pipeline. It runs only on events that already passed SendGrid signature verification at the receiver — never verify inside the consumer, because by then the event is already enqueued and trusted. Drain the queue in batches, call processEvent per event, and let failures redrive through the queue's retry path into a dead-letter queue after a finite maxReceiveCount. The suppression writes this consumer performs feed directly into bounce and complaint handling, and the resolved message_status rows back the per-message status your support team queries.

Validation & Deployment Checklist

  • email_events has a unique primary key on the provider event id (synthesized for SES).
  • Inserts use ON CONFLICT DO NOTHING and side effects run only when rowCount === 1.
  • Suppression and other side effects are idempotent upserts, never blind inserts.
  • Status resolution advances only when the incoming occurred_at is newer.
  • Terminal states (bounce, complaint, dropped) are sticky against late delivered events.
  • The consumer is async; the receiver returns 200 before this work runs.
  • A dead-letter queue captures poison messages after a finite retry count.
  • A replayed batch in staging produces zero duplicate counts and identical final state.

← Back to Building Webhook Event Pipelines