Building Webhook Event Pipelines for Transactional Email
Design a webhook event pipeline for transactional email: verify ESP signatures, enqueue fast, run idempotent consumers, and persist a queryable event store.
Every transactional message generates a stream of downstream events — it was delivered, opened, clicked, bounced, marked as spam, or deferred — and your provider reports those events by POSTing webhooks to an endpoint you own. Without a durable pipeline to receive, authenticate, deduplicate, and store those events, you fly blind: suppression lists drift, deliverability dashboards lie, and per-message status is unknowable. This guide builds that pipeline end to end, from the HTTP receiver that must return 200 in milliseconds to the idempotent consumer that turns at-least-once delivery into an accurate event store. It assumes you have already chosen a provider during ESP selection and integration and need to operationalize the feedback loop.
The Event Types ESPs Emit
Before architecting anything, fix a mental model of the payloads you will ingest. Across SendGrid, Amazon SES, Postmark, and Mailgun the vocabulary differs slightly, but the categories converge:
- delivered — the receiving MTA accepted the message. This is the terminal success event, not proof of inbox placement.
- open — a tracking pixel loaded. Massively over-counted by Apple Mail Privacy Protection (MPP), which pre-fetches pixels, and by image proxies in Gmail.
- click — a wrapped link was followed. The most trustworthy engagement signal because it requires intent that a prefetcher does not supply.
- bounce — the message was rejected. Hard bounces (invalid mailbox, 5.1.1) must suppress the address permanently; soft bounces (full mailbox, greylisting, 4.x.x) are transient. The mechanics of separating the two are covered in bounce and complaint handling.
- complaint / spamreport — the recipient hit "report spam", arriving through a feedback loop. A complaint must suppress immediately and should be treated as more severe than a bounce.
- dropped / deferred / blocked — the provider declined to send (recipient already on its internal suppression list, invalid payload) or the remote server asked it to retry later. Deferred is informational; dropped means the message never left.
Each event carries a provider-assigned event id, a timestamp, the recipient address, and a correlation id you can tie back to the original send (SendGrid sg_message_id, SES mail.messageId, Postmark MessageID, Mailgun message-id). Those two fields — the unique event id and the message correlation id — are the keys the entire pipeline pivots on.
Why You Need an Event Store
It is tempting to treat webhooks as fire-and-forget side effects: receive an open, increment a counter, move on. That breaks in production for three reasons.
- Deliverability analytics demand history. Bounce rate over the last 14 days, complaint rate per sending domain, click-through by template version — none of these are answerable without a row per event you can GROUP BY and window over.
- Suppression must be authoritative and auditable. When you suppress an address you need to know which complaint at what time triggered it, so you can defend a re-subscribe request or debug a wrongful suppression. A counter cannot do that.
- Per-message status is a support requirement. "Did the password reset actually reach the customer?" is a question your support team will ask hourly. Only a normalized event store keyed by the message correlation id answers it.
The event store is therefore the spine of the pipeline. Everything upstream exists to get authenticated, deduplicated events into it; everything downstream (suppression writes, analytics rollups, alerting) reads from it.
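As a rough illustration of the "row per event you can GROUP BY" point, the sketch below computes a daily bounce rate over a trailing window. The column names follow the `email_events` INSERT used in the consumer section of this guide; the query text, the helper names (`toBounceRates`, `bounceRateByDay`), and the definition of bounce rate as bounces / (bounces + delivered) are assumptions made for the example, not something the pipeline prescribes.

```javascript
// Daily bounce and delivery counts over a trailing window (assumed query).
const DAILY_COUNTS_SQL = `
  SELECT date_trunc('day', occurred_at) AS day,
         count(*) FILTER (WHERE type = 'bounce')    AS bounces,
         count(*) FILTER (WHERE type = 'delivered') AS delivered
  FROM email_events
  WHERE occurred_at >= now() - make_interval(days => $1::int)
  GROUP BY 1
  ORDER BY 1`;

// node-postgres returns count(*) (a bigint) as a string, hence Number().
function toBounceRates(rows) {
  return rows.map((row) => {
    const bounces = Number(row.bounces);
    const delivered = Number(row.delivered);
    const total = bounces + delivered;
    return { day: row.day, bounceRate: total === 0 ? 0 : bounces / total };
  });
}

// `pool` is any object exposing a pg-style query(text, params) method.
async function bounceRateByDay(pool, windowDays = 14) {
  const { rows } = await pool.query(DAILY_COUNTS_SQL, [windowDays]);
  return toBounceRates(rows);
}
```

A counter-based design cannot answer this question after the fact, because the window and the grouping are chosen at query time rather than at ingest time.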
Pipeline Architecture
The pipeline is five stages, and the ordering is not negotiable. Signature verification happens before parsing; enqueue happens before any business logic; the consumer is the only component that writes the event store and triggers side effects.
Core Implementation: The Webhook Receiver
The single most common production incident with email webhooks is the receiver doing too much work synchronously, timing out, and triggering a retry storm. Providers retry on any non-2xx response or slow response, so a receiver that writes to the database inline will, under load, fall behind, return 503, get retried, and fall further behind. The fix is structural: the receiver verifies the signature and enqueues the raw bytes, nothing more.
```javascript
// receiver.js: Express receiver for the SendGrid Event Webhook.
// CRITICAL: capture the RAW body. Signature verification fails against
// re-serialized JSON because key order and whitespace change the bytes.
const express = require('express');
const { EventWebhook } = require('@sendgrid/eventwebhook'); // SendGrid signature helper
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const app = express();
const sqs = new SQSClient({ region: 'us-east-1' });
const QUEUE_URL = process.env.EVENTS_QUEUE_URL;
const verifier = new EventWebhook();
const publicKey = verifier.convertPublicKeyToECDSA(process.env.SENDGRID_WEBHOOK_PUBLIC_KEY);

app.post('/webhooks/sendgrid',
  express.raw({ type: 'application/json' }), // keep bytes intact for SendGrid signing
  async (req, res) => {
    const signature = req.get('X-Twilio-Email-Event-Webhook-Signature');
    const timestamp = req.get('X-Twilio-Email-Event-Webhook-Timestamp');

    // Verify BEFORE parsing. req.body is a Buffer here, exactly as SendGrid signed it.
    // Missing headers or a malformed signature count as a failed verification.
    let valid = false;
    try {
      valid = Boolean(signature && timestamp) &&
        verifier.verifySignature(publicKey, req.body, signature, timestamp);
    } catch (err) {
      valid = false;
    }
    if (!valid) {
      return res.status(403).send('bad signature'); // never enqueue unauthenticated events
    }

    // Enqueue the raw payload and return immediately. Do NOT touch the DB here.
    try {
      await sqs.send(new SendMessageCommand({
        QueueUrl: QUEUE_URL,
        MessageBody: req.body.toString('utf8'),
      }));
    } catch (err) {
      // Express 4 does not catch rejections from async handlers, so answer explicitly.
      // A 500 makes SendGrid redeliver the batch; the idempotent consumer absorbs it.
      return res.status(500).send('enqueue failed');
    }

    // 200 within milliseconds keeps SendGrid from retrying and starting a storm.
    return res.status(200).send('ok');
  });

app.listen(3000);
```
The receiver returns 200 after a single network round-trip to the queue. If the queue write itself fails, returning 500 is correct — the provider will redeliver the whole batch, and because the consumer is idempotent (next section) the redelivery is harmless.
Second Pattern: The Idempotent Consumer
The consumer is where events become durable state. Because every provider delivers at-least-once, the consumer must assume each event id will arrive more than once and must produce the same result every time. The mechanism is a unique constraint on the provider event id plus an upsert that does nothing on conflict.
```javascript
// consumer.js: drains the queue and writes the event store idempotently.
// Provider event ids: SendGrid sg_event_id, SES eventType+mail.messageId+timestamp,
// Postmark RecordType+MessageID, Mailgun event-data.id.
// suppress() is assumed to be defined elsewhere as an idempotent upsert.
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from the standard PG* environment variables

async function handleBatch(records) {
  for (const record of records) {
    const events = JSON.parse(record.body); // SendGrid posts an array per webhook
    for (const ev of events) {
      const eventId = ev.sg_event_id;     // stable, globally unique per SendGrid event
      const messageId = ev.sg_message_id; // correlation id back to the original send
      const type = ev.event;              // delivered | open | click | bounce | spamreport ...

      // Insert-or-skip on the unique event id. A duplicate delivery is a no-op,
      // so re-counting opens or re-suppressing addresses cannot happen.
      const { rowCount } = await pool.query(
        `INSERT INTO email_events (event_id, message_id, type, recipient, occurred_at, payload)
         VALUES ($1, $2, $3, $4, to_timestamp($5), $6)
         ON CONFLICT (event_id) DO NOTHING`,
        [eventId, messageId, type, ev.email, ev.timestamp, ev],
      );

      // Only fire side effects when the row was actually new (rowCount === 1).
      // On a duplicate, rowCount === 0 and we skip; this is what makes it idempotent.
      if (rowCount === 1 && (type === 'bounce' || type === 'spamreport' || type === 'dropped')) {
        await suppress(ev.email, type, messageId); // suppression write is itself an upsert
      }
    }
  }
}
```
The pairing of a unique index and ON CONFLICT DO NOTHING is the entire idempotency guarantee. The deeper handling — out-of-order resolution, dedup windows, and SQS FIFO versus standard — is built out in the dedicated guide on building an idempotent webhook consumer.
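To make the redelivery behaviour concrete without a database, here is a minimal in-memory sketch. A `Map` keyed by event id stands in for the unique index, and `insertIfNew` mimics the row count returned by an insert with ON CONFLICT DO NOTHING. The names `makeStore`, `insertIfNew`, and `consume` are hypothetical and exist only for this illustration.

```javascript
// In-memory stand-in for email_events: the Map key plays the role of UNIQUE (event_id).
function makeStore() {
  const rows = new Map();
  const suppressed = new Set();
  return {
    rows,
    suppressed,
    // Returns 1 when the event id is new, 0 when it was already stored.
    insertIfNew(ev) {
      if (rows.has(ev.sg_event_id)) return 0;
      rows.set(ev.sg_event_id, ev);
      return 1;
    },
  };
}

// Processes a batch and returns how many side effects actually fired.
function consume(store, events) {
  let sideEffects = 0;
  for (const ev of events) {
    const rowCount = store.insertIfNew(ev);
    const suppressing = ['bounce', 'spamreport', 'dropped'].includes(ev.event);
    if (rowCount === 1 && suppressing) {
      store.suppressed.add(ev.email); // a Set add is itself idempotent
      sideEffects += 1;
    }
  }
  return sideEffects;
}

// Usage: the same batch delivered twice fires its side effect once.
const store = makeStore();
const batch = [
  { sg_event_id: 'e1', event: 'delivered', email: 'a@example.com' },
  { sg_event_id: 'e2', event: 'bounce', email: 'b@example.com' },
];
const firstPass = consume(store, batch);
const secondPass = consume(store, batch); // simulated redelivery
console.log(firstPass, secondPass, store.rows.size);
```

The second pass changes nothing: no new rows, no new suppressions. That is the property the real consumer relies on when a provider retries a batch.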
Provider Constraint Table
Each ESP authenticates and shapes its webhooks differently. Wire the receiver to the exact mechanism your provider uses; do not assume parity.
| Provider | Transport / format | Authentication | Notable constraint |
|---|---|---|---|
| Amazon SES | SNS notification (JSON) to HTTPS or SQS | SNS message signature (X.509 cert from SigningCertURL) | Must confirm the SubscriptionConfirmation once; verify the cert URL is HTTPS on an SNS host (sns.<region>.amazonaws.com) before trusting it, since a bare amazonaws.com suffix also matches other AWS-hosted domains |
| SendGrid | Event Webhook, batched JSON array | ECDSA signed (X-Twilio-Email-Event-Webhook-Signature + Timestamp) | Signature is over timestamp + raw body; you must use the raw bytes |
| Postmark | Per-event webhooks (separate URL per event type) | HTTPS + optional HTTP Basic auth | No payload signature; restrict by Basic auth and IP allowlist; one event per request |
| Mailgun | JSON event, single event per request | HMAC SHA-256 over timestamp + token | Reject if timestamp is older than a few minutes to block replay |
The SES path is worth flagging: because notifications arrive via SNS, the cleanest topology is SES → SNS → SQS, which lets you skip the HTTP receiver entirely and have the consumer poll SQS directly. The SES-specific bounce flow is detailed in handling hard bounces with Amazon SES SNS, and the SendGrid signing scheme has its own walkthrough in verifying SendGrid Event Webhook signatures.
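If you do receive SNS notifications over HTTP, the certificate URL check can be sketched as below. This is a minimal example under stated assumptions: it assumes the signing certificate is always served over HTTPS from a regional SNS endpoint of the form sns.<region>.amazonaws.com as a .pem file, which is deliberately stricter than a plain amazonaws.com suffix match. The function name `isTrustedSigningCertUrl` is hypothetical, and the check is only the first step; the signature itself still has to be verified against the fetched certificate.

```javascript
// Accept only regional SNS hosts, e.g. sns.us-east-1.amazonaws.com (assumed pattern).
const SNS_HOST = /^sns\.[a-z0-9-]{3,}\.amazonaws\.com(\.cn)?$/;

function isTrustedSigningCertUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl); // throws on malformed input
  } catch (err) {
    return false;
  }
  return (
    url.protocol === 'https:' &&
    SNS_HOST.test(url.hostname) && // hostname is already lower-cased by URL
    url.pathname.endsWith('.pem')
  );
}

// Usage
console.log(isTrustedSigningCertUrl(
  'https://sns.us-east-1.amazonaws.com/SimpleNotificationService-abc123.pem'));
console.log(isTrustedSigningCertUrl('https://attacker-bucket.s3.amazonaws.com/cert.pem'));
```

The second call shows why the suffix alone is not enough: a bucket hosted under amazonaws.com passes a suffix check but is not an SNS endpoint.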
Numbered Pipeline Integration Steps
Wire the pipeline in this order. Each step is independently deployable and testable.
1. Provision the queue first. Create the SQS queue (or SNS topic for SES) and a dead-letter queue with a maxReceiveCount of 5 before writing any code. The DLQ is your safety net for poison messages.
2. Stand up the receiver behind HTTPS. Deploy the /webhooks/<provider> endpoint with TLS terminated. ESPs refuse to POST to plain HTTP and to self-signed certs.
3. Wire signature verification with the keys from the provider dashboard. Pull the SendGrid verification key, SNS signing cert handling, or Mailgun signing key into config — never hardcode.
4. Enqueue the raw body and return 200. Confirm with a load test that the p99 receiver latency stays under ~250 ms so the provider never times out.
5. Build the consumer against the unique-event-id schema. Create the email_events table with UNIQUE (event_id) and the upsert path before processing real traffic.
6. Attach side effects downstream of the insert. Suppression, analytics rollups, and alerting all read the event store or hang off the new-row branch, never off the raw webhook.
7. Register the endpoint in the ESP dashboard and send a test event. Use the provider's "send test" button and confirm a row lands in the event store with the right correlation id.
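The provisioning pieces of the steps above (the dead-letter queue wiring and the table with its unique constraint) can be sketched as follows. The helper `buildQueueAttributes`, the example ARN, and the column types in the DDL are assumptions for illustration; only the column names and the UNIQUE constraint on event_id come from this guide.

```javascript
// Builds the Attributes map for an SQS queue whose failures go to a DLQ.
// SQS expects RedrivePolicy as a JSON string, not a nested object.
function buildQueueAttributes(dlqArn, maxReceiveCount = 5) {
  return {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: String(maxReceiveCount),
    }),
  };
}

// Assumed column types; adjust to your own conventions.
const CREATE_EMAIL_EVENTS = `
  CREATE TABLE IF NOT EXISTS email_events (
    event_id    text        NOT NULL UNIQUE,
    message_id  text        NOT NULL,
    type        text        NOT NULL,
    recipient   text        NOT NULL,
    occurred_at timestamptz NOT NULL,
    payload     jsonb       NOT NULL
  )`;

// Usage (the ARN is a placeholder). With the AWS SDK v3 this map would be passed as
//   new CreateQueueCommand({ QueueName: 'email-events', Attributes: attrs })
const attrs = buildQueueAttributes('arn:aws:sqs:us-east-1:123456789012:email-events-dlq');
console.log(attrs.RedrivePolicy);
```

Keeping both definitions in version control makes step 1 and step 5 repeatable across environments.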
Debugging: Named Symptoms, Cause, Exact Fix
Symptom: the same open is counted two or three times. Cause: the consumer increments before checking for an existing row, so at-least-once redelivery double-counts. Fix: gate every side effect on rowCount === 1 from the ON CONFLICT DO NOTHING insert, as in the consumer above. Never count from the raw webhook.
Symptom: a delivered event is recorded after a bounce for the same message, flipping status back to delivered. Cause: out-of-order delivery; providers do not guarantee ordering. Fix: resolve status per message_id using the event's own occurred_at timestamp, so the newest event wins, and treat terminal states (bounce, complaint) as sticky. The resolution rule is implemented in the idempotent consumer guide.
Symptom: every webhook is rejected with a bad signature. Cause: a body-parser middleware ran before verification and re-serialized the JSON, changing the bytes. Fix: mount express.raw() on the webhook route only and verify against the Buffer. This is the single most common SendGrid integration failure.
Symptom: a flood of duplicate events, sometimes thousands in minutes. Cause: the receiver returned a 5xx (or timed out), so the ESP entered exponential-backoff retry and kept resending the batch. Fix: make the receiver do nothing but verify-and-enqueue, push slow work to the consumer, and add the DLQ so genuinely broken messages stop being retried forever.
Symptom: SES notifications never arrive. Cause: the SNS subscription was never confirmed. Fix: handle the SubscriptionConfirmation message type in the receiver and visit (or GET) the SubscribeURL once.
Symptom: a message shows status delivered even though the customer reports it bounced. Cause: the delivered arrived after the bounce and overwrote it because status was resolved by arrival order. Fix: resolve by the event's own occurred_at and make terminal states sticky, as in the message_status upsert in the Retry-Storm and Out-of-Order Handling section below.
Symptom: Postmark events are silently rejected with 401. Cause: Postmark has no payload signature, so the receiver's signature step has nothing to check and falls through to a deny, or the Basic-auth credentials drifted. Fix: branch Postmark to Basic-auth + IP-allowlist verification instead of a signature check, and confirm the credentials match the webhook config.
Symptom: open counts are wildly inflated for Apple Mail recipients. Cause: Apple Mail Privacy Protection pre-fetches the tracking pixel, generating open events with no human behind them. Fix: do not gate logic on opens; use clicks for engagement, and tag MPP-style opens (often arriving in a burst right after send from Apple proxy IPs) so analytics can exclude them.
Symptom: the dead-letter queue fills with the same payload shape. Cause: a consumer bug throws on a specific event type (e.g. an unsubscribe the schema does not handle), so every such event exhausts maxReceiveCount and lands in the DLQ. Fix: inspect one DLQ message, add the missing branch, and redrive the DLQ back to the main queue once fixed.
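For the Postmark case, where there is no payload signature to check, the Basic-auth branch might look like the sketch below. It assumes the credentials configured on the webhook URL are available to the receiver as two strings; the function names `checkBasicAuth` and `safeEqual` are hypothetical. Both values are hashed before comparison so that `crypto.timingSafeEqual` always receives equal-length buffers.

```javascript
const crypto = require('node:crypto');

// Constant-time string comparison: hash first so both buffers have equal length.
function safeEqual(a, b) {
  const ha = crypto.createHash('sha256').update(String(a)).digest();
  const hb = crypto.createHash('sha256').update(String(b)).digest();
  return crypto.timingSafeEqual(ha, hb);
}

// Returns true only for a well-formed "Basic <base64(user:pass)>" header that matches.
function checkBasicAuth(authorizationHeader, expectedUser, expectedPass) {
  if (typeof authorizationHeader !== 'string') return false;
  const [scheme, encoded] = authorizationHeader.split(' ');
  if (!/^basic$/i.test(scheme) || !encoded) return false;
  const decoded = Buffer.from(encoded, 'base64').toString('utf8');
  const sep = decoded.indexOf(':');
  if (sep === -1) return false;
  // Evaluate both comparisons before combining, to avoid short-circuiting on the user.
  const userOk = safeEqual(decoded.slice(0, sep), expectedUser);
  const passOk = safeEqual(decoded.slice(sep + 1), expectedPass);
  return userOk && passOk;
}

// Usage with placeholder credentials
const header = 'Basic ' + Buffer.from('hook-user:s3cret').toString('base64');
console.log(checkBasicAuth(header, 'hook-user', 's3cret'));
```

In an Express receiver this check would run in place of the signature step for the Postmark route, with a 401 returned when it fails.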
A Deeper Event-Type Taxonomy
The six categories above are the buckets; production code branches on the provider's exact strings, and the mapping is not one-to-one. A single internal bounce may correspond to two or three provider events, and an event that looks like a delivery success on one provider is a no-op on another.
| Internal type | SendGrid event | Amazon SES eventType | Postmark RecordType | Mailgun event | Drives a side effect? |
|---|---|---|---|---|---|
| delivered | delivered | Delivery | Delivery | delivered | No (status only) |
| open | open | Open | Open | opened | Analytics only |
| click | click | Click | Click | clicked | Analytics only |
| hard bounce | bounce (type=bounce) | Bounce (Permanent) | Bounce (HardBounce) | failed (permanent) | Suppress |
| soft bounce | deferred / bounce (type=blocked) | Bounce (Transient) | Bounce (Transient) | failed (temporary) | Retry, count toward cap |
| complaint | spamreport | Complaint | SpamComplaint | complained | Suppress (severe) |
| dropped | dropped | Reject / Rendering Failure | n/a | rejected | Suppress (already dead) |
| unsubscribe | group_unsubscribe / unsubscribe | n/a (managed in app) | SubscriptionChange | unsubscribed | Suppress for that stream |
Two normalization rules fall out of this. First, collapse the soft/hard split early: SendGrid signals a soft failure as deferred (and sometimes bounce with type: blocked), while SES uses one Bounce event with bounceType, so the normalizer must read a second field, not just the event name — the hard-vs-soft separation logic itself lives in bounce and complaint handling. Second, treat opens as untrustworthy: Apple Mail Privacy Protection pre-fetches the tracking pixel for every message regardless of whether the human opened it, so an open row is evidence the message reached an Apple Mail account, not that anyone read it. Use clicks, not opens, as the engagement signal that feeds re-engagement or sunset logic.
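A normalizer built from the table above might look like the following sketch. The internal type names (`hard_bounce`, `soft_bounce`, and so on) and the function names are assumptions for illustration. The second-field lookups (`type` for SendGrid, `bounce.bounceType` for SES, `Type` for Postmark, `severity` for Mailgun) reflect the providers' payloads as commonly documented, and should be checked against your provider's current schema. For Mailgun, `ev` is assumed to be the `event-data` object. Anything that is not explicitly a hard bounce is treated as soft, which is a conservative choice made for this example.

```javascript
// One-to-one mappings, keyed by each provider's own event string.
const SIMPLE = {
  sendgrid: { delivered: 'delivered', open: 'open', click: 'click', deferred: 'soft_bounce',
    spamreport: 'complaint', dropped: 'dropped',
    unsubscribe: 'unsubscribe', group_unsubscribe: 'unsubscribe' },
  ses: { Delivery: 'delivered', Open: 'open', Click: 'click', Complaint: 'complaint',
    Reject: 'dropped', 'Rendering Failure': 'dropped' },
  postmark: { Delivery: 'delivered', Open: 'open', Click: 'click',
    SpamComplaint: 'complaint', SubscriptionChange: 'unsubscribe' },
  mailgun: { delivered: 'delivered', opened: 'open', clicked: 'click',
    complained: 'complaint', rejected: 'dropped', unsubscribed: 'unsubscribe' },
};

// The event name that signals "some kind of bounce" for each provider.
const BOUNCE_NAME = { sendgrid: 'bounce', ses: 'Bounce', postmark: 'Bounce', mailgun: 'failed' };

function eventName(provider, ev) {
  if (provider === 'ses') return ev.eventType;
  if (provider === 'postmark') return ev.RecordType;
  return ev.event; // sendgrid and mailgun
}

// Reads the provider-specific second field that separates hard from soft.
function isHardBounce(provider, ev) {
  if (provider === 'sendgrid') return ev.type === 'bounce';
  if (provider === 'ses') return Boolean(ev.bounce) && ev.bounce.bounceType === 'Permanent';
  if (provider === 'postmark') return ev.Type === 'HardBounce';
  if (provider === 'mailgun') return ev.severity === 'permanent';
  return false;
}

function normalizeType(provider, ev) {
  const table = SIMPLE[provider];
  if (!table) return 'unknown';
  const name = eventName(provider, ev);
  if (name === BOUNCE_NAME[provider]) {
    return isHardBounce(provider, ev) ? 'hard_bounce' : 'soft_bounce';
  }
  return Object.prototype.hasOwnProperty.call(table, name) ? table[name] : 'unknown';
}

// Usage
console.log(normalizeType('ses', { eventType: 'Bounce', bounce: { bounceType: 'Permanent' } }));
console.log(normalizeType('sendgrid', { event: 'bounce', type: 'blocked' }));
```

Returning `unknown` instead of throwing keeps an unrecognized event type from becoming a poison message; the consumer can still store the raw payload and skip side effects.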
The Fast-200 Plus Async Queue Pattern, in Detail
The receiver returning 200 in milliseconds is not a nicety — it is the load-bearing decision that keeps the whole pipeline stable. ESPs treat your endpoint as a dependency with a timeout: SendGrid, SES (via SNS), and Mailgun all expect a 2xx within a small number of seconds, and they interpret a slow response identically to an error — they retry. So a receiver that writes to Postgres inline has coupled its availability to the database's: a slow query, a lock, or a failover turns into receiver timeouts, which turn into provider retries, which pile more load onto the already-slow database. This is the retry storm, and it is self-amplifying.
The structural fix is to make the receiver's only synchronous dependency a durable buffer (SQS, SNS→SQS, Kafka, or even a Postgres inbox table written with a single fast insert) and to do all parsing, validation beyond the signature, suppression, and analytics in an asynchronous consumer that can fall behind without ever touching the provider's view of your health.
```javascript
// The receiver's entire job: verify, enqueue, 200. Note what is ABSENT:
// no JSON.parse of the events, no DB write of business data, no suppression call.
// verifySignature() and enqueueRaw() are per-provider helpers assumed to be defined elsewhere.
app.post('/webhooks/:provider', express.raw({ type: '*/*' }), async (req, res) => {
  if (!verifySignature(req.params.provider, req)) {
    return res.status(403).send('bad signature'); // never enqueue unauthenticated bytes
  }
  try {
    await enqueueRaw(req.params.provider, req.body); // single fast write to SQS
  } catch (err) {
    // A 500 here is CORRECT: the ESP redelivers, and the idempotent consumer
    // (keyed on event id) makes the redelivery a no-op. Better to be retried than to lose the event.
    return res.status(500).send('enqueue failed');
  }
  return res.status(200).send('ok'); // p99 must stay well under the provider timeout
});
```
The decoupling buys two more properties for free: the consumer can be scaled, deployed, and even taken down for a migration without the ESP ever seeing an error (events simply accumulate in the queue), and a poison event that crashes the consumer is isolated to the queue and the dead-letter queue rather than blocking the HTTP path.
Retry-Storm and Out-of-Order Handling
Retry storms start when a transient receiver failure flips the provider into exponential-backoff redelivery. Because the redelivered batch is identical, the only safe defenses are the ones already in this pipeline: verify-and-enqueue keeps the receiver fast, the dead-letter queue stops a genuinely poison message from being retried forever, and consumer idempotency makes the inevitable duplicate deliveries harmless. Add a fourth: monitor queue depth and consumer lag, because a rising backlog is the earliest signal that the consumer — not the receiver — has become the bottleneck, well before the DLQ starts filling.
Out-of-order delivery is the subtler failure. Providers generate events in distributed systems and deliver them over independent retries, so the delivered for a message can arrive after its bounce, and a stale open can arrive after a click. A consumer that does last-write-by-arrival will flip a bounced message back to "delivered." The fix is to resolve status by the event's own occurred_at timestamp, not by arrival order, and to make terminal states (bounce, complaint, dropped) sticky so a late success cannot override them:
```sql
-- Resolve per-message status by the event's OWN timestamp, not arrival order.
-- A late-arriving "delivered" (older occurred_at) cannot overwrite a newer "bounce".
INSERT INTO message_status (message_id, type, occurred_at)
VALUES ($1, $2, $3)
ON CONFLICT (message_id) DO UPDATE
  SET type = EXCLUDED.type, occurred_at = EXCLUDED.occurred_at
  WHERE message_status.occurred_at < EXCLUDED.occurred_at  -- only advance forward in time
    AND message_status.type NOT IN ('bounce','spamreport','dropped');  -- terminal states are sticky
```
The full treatment — synthesizing event ids for SES, dedup windows, and SQS FIFO versus standard queues — is in the idempotent consumer guide.
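The same resolution rule can be exercised in memory, which is handy for unit-testing the ordering logic without a database. This sketch mirrors the two conditions of the upsert (advance only forward in time, never leave a terminal state); the names `applyStatus` and `TERMINAL` and the use of plain numbers for timestamps are assumptions for the example.

```javascript
const TERMINAL = new Set(['bounce', 'spamreport', 'dropped']);

// Applies one event to the per-message status map. Returns true if the status changed.
function applyStatus(statusByMessage, ev) {
  const current = statusByMessage.get(ev.messageId);
  if (!current) {
    statusByMessage.set(ev.messageId, { type: ev.type, occurredAt: ev.occurredAt });
    return true;
  }
  const isNewer = current.occurredAt < ev.occurredAt; // compare event time, not arrival time
  const isSticky = TERMINAL.has(current.type);        // terminal states never change
  if (isNewer && !isSticky) {
    statusByMessage.set(ev.messageId, { type: ev.type, occurredAt: ev.occurredAt });
    return true;
  }
  return false;
}

// Usage: the bounce arrives first, the older "delivered" arrives late.
const status = new Map();
applyStatus(status, { messageId: 'm1', type: 'bounce', occurredAt: 200 });
applyStatus(status, { messageId: 'm1', type: 'delivered', occurredAt: 100 });
console.log(status.get('m1').type);
```

One consequence worth knowing: under this rule a terminal event that carries an older timestamp than the stored row is ignored, because the time check applies to every event type. Whether that is acceptable depends on how reliable your providers' timestamps are.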
Fuller Provider Signing and Verification Reference
Wiring the receiver wrong on signing is the most common reason a correct pipeline rejects every legitimate event. The exact header names, the signed material, and the key handling differ per provider:
| Provider | Signature header(s) | Signed material | Key / secret | Verification gotcha |
|---|---|---|---|---|
| Amazon SES (via SNS) | x-amz-sns-message-type, signature in JSON body (Signature, SigningCertURL) | SNS canonical string of selected message fields | X.509 cert fetched from SigningCertURL | Validate that the cert URL is HTTPS on an SNS host (sns.<region>.amazonaws.com) before fetching, since a bare .amazonaws.com suffix also matches other AWS-hosted domains; cache the cert; confirm SubscriptionConfirmation once |
| SendGrid | X-Twilio-Email-Event-Webhook-Signature, X-Twilio-Email-Event-Webhook-Timestamp | timestamp + raw body, ECDSA | Base64 public key from dashboard, converted to ECDSA | Must verify against raw bytes; re-serialized JSON breaks it (see the SendGrid signature guide) |
| Postmark | none (no payload signature) | n/a | HTTP Basic auth credentials you set | Restrict by Basic auth + IP allowlist; one event per request, not a batch |
| Mailgun | signature object (timestamp, token, signature) | HMAC-SHA256 of timestamp + token | Webhook signing key from dashboard | Reject if timestamp is older than a few minutes; the token is single-use, so cache seen tokens to block replay |
The SES row is the structural outlier and is best handled by skipping HTTP entirely: subscribe an SQS queue to the SNS topic (SES → SNS → SQS) and have the consumer poll, which is detailed in handling hard bounces with Amazon SES and SNS.
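The Mailgun row translates into code fairly directly. The sketch below computes the HMAC-SHA256 of timestamp + token with the webhook signing key, compares it to the hex signature in constant time, rejects stale timestamps, and refuses tokens it has already seen. The function name `verifyMailgun`, the five-minute window, and the in-process `Set` used as the token cache are assumptions; a multi-instance receiver would need a shared cache with expiry instead.

```javascript
const crypto = require('node:crypto');

// sig is the webhook's `signature` object: { timestamp, token, signature }.
function verifyMailgun(signingKey, sig, seenTokens,
  nowSeconds = Math.floor(Date.now() / 1000), maxAgeSeconds = 300) {
  const { timestamp, token, signature } = sig || {};
  if (!timestamp || !token || !signature) return false;

  // Replay defence 1: reject timestamps outside the allowed window.
  const age = Math.abs(nowSeconds - Number(timestamp));
  if (!Number.isFinite(age) || age > maxAgeSeconds) return false;

  // Replay defence 2: each token may be accepted only once.
  if (seenTokens.has(token)) return false;

  const expected = crypto
    .createHmac('sha256', signingKey)
    .update(String(timestamp) + String(token))
    .digest('hex');
  const a = Buffer.from(expected, 'utf8');
  const b = Buffer.from(String(signature), 'utf8');
  // timingSafeEqual throws on unequal lengths, so check length first.
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) return false;

  seenTokens.add(token); // remember only tokens that verified
  return true;
}
```

A receiver would call this with the parsed `signature` object from the request body and return 403 on `false`, before enqueueing anything.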
Validation & Deployment Checklist
- Receiver verifies the provider signature (SES SNS cert, SendGrid ECDSA, Mailgun HMAC) before any parsing.
- Receiver reads the raw request body for signature checks, not re-serialized JSON.
- Receiver enqueues and returns 200 in under ~250 ms at p99 under load.
- Queue has a dead-letter queue with a finite maxReceiveCount.
- email_events table has a UNIQUE constraint on the provider event id.
- Consumer uses ON CONFLICT DO NOTHING and fires side effects only on a newly inserted row.
- Out-of-order events resolve by occurred_at with terminal states sticky.
- Suppression writes are idempotent upserts, not blind inserts.
- Endpoint registered in the ESP dashboard and a test event lands in the store with the correct correlation id.
- Replay protection rejects stale timestamps (Mailgun, SendGrid) older than a few minutes.
- Queue depth and consumer lag are monitored, with an alert before the DLQ starts filling.
- Status resolution makes terminal states (bounce, complaint, dropped) sticky against late delivered.
- Opens are treated as untrusted (Apple MPP); engagement logic keys on clicks.
Frequently Asked Questions
Why not just write each webhook straight to the database in the receiver? Because it couples your endpoint's response time to your database's, and the ESP interprets a slow response as a failure and retries — which adds load to the already-slow database. The verify-and-enqueue pattern decouples them so the consumer can fall behind without the provider ever seeing an error.
Do I need a queue, or is a pg inbox table enough? A single fast insert into an inbox table is a perfectly good buffer for low-to-moderate volume; the requirement is only that the receiver's one synchronous write is cheap and that a separate worker drains it. SQS earns its place at higher volume and when you want a managed dead-letter queue and back-pressure for free.
Should I trust delivered as proof the email reached the inbox? No. delivered means the receiving MTA accepted the message; it says nothing about inbox versus spam folder placement. For placement you need seed-list testing or aggregate signals like Gmail Postmaster Tools, covered in processing feedback loops and complaints.
How do I handle a provider that sends one event per request and another that batches? Normalize at the consumer: SendGrid posts a JSON array (iterate it), while Postmark and Mailgun post one event per request (wrap it in a single-element array). After normalization, the rest of the pipeline processes a uniform stream of single events.
What is the right maxReceiveCount for the dead-letter queue? Five is a common default: enough to ride out transient consumer or database blips, few enough that a genuinely poison message is quarantined quickly instead of being retried indefinitely.
Related
- Verifying SendGrid Event Webhook Signatures — the ECDSA verification step in depth
- Building an Idempotent Webhook Consumer — dedup, ordering, and FIFO versus standard queues
- Bounce and Complaint Handling — what to do with the bounce and complaint events this pipeline ingests
- ESP Selection and Integration — choosing the provider whose webhooks you are wiring