Visual Regression Email Testing with Playwright

A structural assertion can pass while the rendered email is visibly broken: a collapsed table cell, a logo that shifted 40px, a background color that flipped to white. This guide shows how to render built HTML emails in a real browser, screenshot them at fixed viewports, and fail a build when pixels drift from an approved baseline.

Why DOM snapshots miss visual breakage

The DOM-and-text approach behind automated snapshot testing compares the compiled HTML string against a stored copy. That catches markup mutations, but it cannot judge what a change looks like, and it is blind to anything the markup does not literally spell out. A CSS change in a <style> block, a swapped web font, an image that now returns a 404 and renders as a broken-image box, or a max-width that no longer constrains a table: some of these never alter the serialized HTML at all, and the ones that do appear as a small string diff that says nothing about the rendered result. All of them change what a recipient sees.

The root cause is layering: text snapshots operate on source, but visual breakage happens at the render layer, after the browser has resolved cascade, box model, font metrics, and image loading. To catch render-layer regressions you have to test at the render layer. That means rasterizing the email in an engine and comparing the resulting bitmap, which is exactly what Playwright's toHaveScreenshot() does.
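To make "comparing the resulting bitmap" concrete, here is a minimal sketch of a pixel comparison over two RGBA buffers. It is a deliberate simplification: it uses exact channel equality, whereas Playwright's comparator applies a perceptual color threshold, and the name `diffRatio` is hypothetical rather than part of any library.

```javascript
// Simplified illustration only; this is not Playwright's comparison algorithm.
// Both inputs are RGBA byte buffers of the same dimensions (4 bytes per pixel).
function diffRatio(baseline, actual) {
  if (baseline.length !== actual.length || baseline.length % 4 !== 0) {
    throw new Error('buffers must be RGBA and the same size');
  }
  const totalPixels = baseline.length / 4;
  let differing = 0;
  for (let i = 0; i < baseline.length; i += 4) {
    // A pixel counts as different if any of its four channels differs.
    if (
      baseline[i] !== actual[i] ||
      baseline[i + 1] !== actual[i + 1] ||
      baseline[i + 2] !== actual[i + 2] ||
      baseline[i + 3] !== actual[i + 3]
    ) {
      differing += 1;
    }
  }
  return differing / totalPixels;
}

// A 2x2 image: all white in the baseline, one pixel flipped to black afterwards.
const white = [255, 255, 255, 255];
const black = [0, 0, 0, 255];
const baseline = Uint8Array.from([...white, ...white, ...white, ...white]);
const actual = Uint8Array.from([...white, ...black, ...white, ...white]);
console.log(diffRatio(baseline, actual)); // 0.25
```

The point of the sketch is the layer it works on: nothing in it looks at markup, so it reacts to any change that reaches the rendered pixels.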

Each tested email rides the same loop: render, screenshot, diff against a committed baseline, then gate the build on the threshold.

Project layout and configuration

Install Playwright and its browsers as development dependencies:

npm i -D @playwright/test
npx playwright install --with-deps chromium webkit  # webkit approximates Apple Mail's rendering engine

Define two browser projects. Chromium is your general desktop/web-client proxy; webkit shares the engine family that powers Apple Mail and iOS Mail, so it surfaces font-metric and -webkit- quirks that chromium hides.

// playwright.config.js
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests/email',
  // Fail the run if a test accidentally left .only in the source on CI.
  forbidOnly: !!process.env.CI,
  // No retries: a screenshot diff is deterministic; retrying only hides flakiness.
  retries: 0,
  // Write the HTML report so CI can publish actual/diff images as artifacts.
  reporter: [['html', { outputFolder: 'playwright-report', open: 'never' }]],
  use: {
    // Attach a screenshot of the page to the report whenever a test fails.
    // Pixel tolerances for the comparison itself are set under `expect` below.
    screenshot: 'only-on-failure',
  },
  expect: {
    toHaveScreenshot: {
      // maxDiffPixelRatio tolerates sub-pixel antialiasing without hiding real breakage.
      // 0.002 = 0.2% of pixels may differ before the assertion fails.
      maxDiffPixelRatio: 0.002,
      // threshold is the per-pixel YIQ color distance (0..1) below which pixels are "equal".
      threshold: 0.15,
    },
  },
  projects: [
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'webkit', // webkit ~= Apple Mail / iOS Mail rendering family
      use: { ...devices['Desktop Safari'] },
    },
  ],
});
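A ratio like 0.002 is easier to reason about as a pixel count. The sketch below does that arithmetic for two example capture sizes; `pixelBudget` is a hypothetical helper, and the result is only a rough guide because a fullPage capture can be taller than the viewport, which raises the budget proportionally.

```javascript
// Rough arithmetic only: the ratio applies to the pixels of the captured image,
// which for fullPage screenshots may be taller than the viewport height used here.
function pixelBudget(width, height, maxDiffPixelRatio) {
  return Math.floor(width * height * maxDiffPixelRatio);
}

console.log(pixelBudget(600, 1200, 0.002)); // 1440
console.log(pixelBudget(375, 1200, 0.002)); // 900
```

For scale, a 40px by 40px block that changes color is 1600 pixels, which already exceeds the 1440-pixel budget of a 600 by 1200 capture, while scattered antialiasing noise usually stays well below it.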

The spec: render, screenshot desktop and mobile

Point the test at the email your build already produces. The example reads a compiled HTML file from disk, but you can equally fetch it from a local preview server so the test exercises the exact bytes your SMTP transport would send.

// tests/email/welcome.spec.js
import { test, expect } from '@playwright/test';
import { readFileSync } from 'node:fs';
import { fileURLToPath } from 'node:url';

// Load the BUILT html (post-inline, post-MJML-compile), not the template source.
const html = readFileSync(
  fileURLToPath(new URL('../../dist/welcome.html', import.meta.url)),
  'utf8'
);

// Two widths: 600px is the classic desktop email column; 375px = iPhone-class mobile.
const VIEWPORTS = [
  { label: 'desktop', width: 600, height: 1200 },
  { label: 'mobile', width: 375, height: 1200 },
];

for (const vp of VIEWPORTS) {
  test(`welcome email — ${vp.label}`, async ({ page }) => {
    // Fixed viewport: any layout shift caused by a CSS change shows up as a pixel diff.
    await page.setViewportSize({ width: vp.width, height: vp.height });

    // setContent renders the raw email markup; waitUntil:'networkidle' lets remote
    // images and web fonts finish loading so they are baked into the screenshot.
    await page.setContent(html, { waitUntil: 'networkidle' });

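    // Emulate prefers-reduced-motion so emails that honor that media query render
    // their static variant. Animations themselves are frozen by the screenshot option below.
    await page.emulateMedia({ reducedMotion: 'reduce' });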

    // fullPage captures below-the-fold rows; the baseline name is keyed by viewport
    // here, and Playwright's default snapshot naming appends the project
    // (chromium/webkit) and the platform automatically.
    await expect(page).toHaveScreenshot(`welcome-${vp.label}.png`, {
      fullPage: true,
      // animations:'disabled' freezes any CSS keyframes Playwright can detect.
      animations: 'disabled',
    });
  });
}

The first run has no baseline, so Playwright writes one and fails the test, reporting that the snapshot was missing and has just been created. Inspect the generated PNGs, confirm they look correct, and commit them. Baselines live next to the spec in a folder Playwright names welcome.spec.js-snapshots/, with one file per project and viewport, for example welcome-desktop-chromium-linux.png.
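The naming scheme described above can be spelled out to see which files a complete baseline set should contain. This is a sketch assuming the default naming shown in the example file name; a custom snapshotPathTemplate in the config would change it, and `baselineFileName` is a hypothetical helper.

```javascript
// Mirrors the naming shown above: <name>-<project>-<platform>.png.
// Assumes default naming; a custom snapshotPathTemplate would produce other names.
function baselineFileName(snapshotName, projectName, platform) {
  const stem = snapshotName.replace(/\.png$/, '');
  return `${stem}-${projectName}-${platform}.png`;
}

const viewports = ['desktop', 'mobile'];
const projects = ['chromium', 'webkit'];

// Every viewport/project pair should end up with one committed baseline.
const expected = viewports.flatMap((vp) =>
  projects.map((project) => baselineFileName(`welcome-${vp}.png`, project, 'linux'))
);
console.log(expected);
```

Two viewports across two projects give four baseline files for this one spec, which is a quick sanity check when reviewing the first commit.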

Masking dynamic regions

Transactional emails contain volatile content — order numbers, names, expiry timestamps — that will diff on every run. Rather than disabling the screenshot, mask the volatile elements so their pixels are excluded from comparison. Tag those regions in the template with a stable selector.

// Mask dynamic regions so changing data never trips the visual diff.
await expect(page).toHaveScreenshot('receipt-desktop.png', {
  fullPage: true,
  // Each masked element is painted as a solid box before diffing, so its
  // contents are ignored. Use selectors you control in the template markup.
  mask: [
    page.locator('[data-vrt-mask="order-id"]'),   // e.g. #A1024-9981
    page.locator('[data-vrt-mask="customer-name"]'),
    page.locator('[data-vrt-mask="expiry"]'),      // relative timestamps
  ],
  // maskColor must be a color the email never uses, so a mask leak is obvious.
  maskColor: '#FF00FF',
});

Keep masks tight: mask the smallest element that contains the volatile text, not the whole row, or you blind yourself to layout shifts around it.
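Masks only help while their selectors still match. If a data-vrt-mask attribute is renamed or dropped in the template, the mask likely matches nothing and the volatile text starts failing the diff for a confusing reason. A minimal sketch of a guard that checks the built HTML for the expected mask names follows; `findMaskNames` and `missingMasks` are hypothetical helpers, and the regular expression assumes double-quoted attribute values.

```javascript
// Hypothetical helper: list the data-vrt-mask values present in built HTML.
// Assumes attributes are written with double quotes, as in the selectors above.
function findMaskNames(html) {
  const names = new Set();
  const pattern = /data-vrt-mask="([^"]+)"/g;
  for (const match of html.matchAll(pattern)) {
    names.add(match[1]);
  }
  return [...names];
}

// Report any expected mask region that the built email no longer contains.
function missingMasks(html, expectedNames) {
  const present = new Set(findMaskNames(html));
  return expectedNames.filter((name) => !present.has(name));
}

const sampleHtml = `
  <td>Order <span data-vrt-mask="order-id">#A1024-9981</span></td>
  <td>Hi <span data-vrt-mask="customer-name">Sam</span></td>
`;
console.log(missingMasks(sampleHtml, ['order-id', 'customer-name', 'expiry']));
// [ 'expiry' ]
```

A check like this could run at the top of the spec, before the screenshot assertion, so a missing region is reported by name instead of as an unexplained pixel diff.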

The limitation you must design around

A headless browser is not every email client. Critically, it is not Outlook on Windows (the classic desktop app), which renders with Microsoft Word's HTML engine, a completely different code path that ignores margin on <td>, drops max-width, mishandles background images, and needs VML and MSO conditional comments to work around those gaps. Chromium and webkit will happily render an email that Word will mangle. Treat Playwright as a fast, deterministic guard against web-engine regressions, and pair it with a real client farm for true cross-client fidelity. Submit the same build to a service such as Litmus or Email on Acid to capture actual Outlook, Gmail, and dark-mode renders: the browser screenshots gate fast pre-merge, and the client farm verifies before you ship.

Pipeline integration

Run the suite in CI and publish artifacts so reviewers can see what changed when a diff fails.

# .github/workflows/email-vrt.yml
name: Email Visual Regression
on: [pull_request]

jobs:
  vrt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'
      - run: npm ci
      # Build the emails first so /dist holds the exact bytes under test.
      - run: npm run build:emails
      - run: npx playwright install --with-deps chromium webkit
      - run: npx playwright test
      # Always upload report so the actual/diff PNGs are downloadable on failure.
      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: playwright-report/
          retention-days: 14

When a diff is intentional — you redesigned the header on purpose — regenerate the baselines with npx playwright test --update-snapshots, review the new PNGs in the diff, and commit them in the same change as the template edit so the baseline update is auditable. Pin the Playwright version in package.json; a browser engine bump can shift antialiasing enough to require a baseline refresh, so upgrade it in a dedicated change.

Because baselines are platform-specific (Linux rasterizes fonts differently from macOS), always generate them on the same OS that CI runs. The simplest path is to update baselines inside the CI environment: run the suite there once with the update flag, publish the regenerated -snapshots/ folders as an artifact, download them, and commit them. The workflow above uploads only playwright-report/, so this needs an additional upload path. Local-vs-CI font rendering then never causes phantom failures.
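One way to keep baselines from the wrong OS out of the repository is to check their file names. The sketch below assumes the default naming, in which the platform appears as a suffix (for example -linux or -darwin); `foreignBaselines` is a hypothetical helper, not a Playwright feature.

```javascript
// Hypothetical guard: flag baselines whose platform suffix is not the CI one.
// Assumes the default "<name>-<project>-<platform>.png" naming.
function foreignBaselines(fileNames, ciPlatform) {
  const suffix = `-${ciPlatform}.png`;
  return fileNames.filter(
    (name) => name.endsWith('.png') && !name.endsWith(suffix)
  );
}

const committed = [
  'welcome-desktop-chromium-linux.png',
  'welcome-desktop-webkit-linux.png',
  'welcome-mobile-chromium-darwin.png', // generated on a macOS laptop
];
console.log(foreignBaselines(committed, 'linux'));
// [ 'welcome-mobile-chromium-darwin.png' ]
```

Fed with the file names from the snapshot folders, a check like this could fail a pre-commit hook or a CI step before a mismatched baseline is merged.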

Chromium vs webkit: what each project actually catches

Running both engines is not redundant — they diverge on exactly the properties email developers lean on. Keeping the two projects from the config separated lets you attribute a failure to an engine and decide whether it matters for your audience.

Property under test	chromium (Desktop Chrome)	webkit (Desktop Safari)	Why it matters for email
System/web-font metrics	Blink font shaping; subtle line-height rounding	WebKit shaping ~= Apple Mail / iOS Mail	Apple Mail clients bake in webkit metrics; a 1px line shift only webkit shows mirrors real iOS Mail wrapping
`-webkit-` prefixed CSS	Supports many prefixed properties as legacy aliases, but not all	Native implementation; closest to Apple Mail	Catch `-webkit-text-size-adjust` and gradient quirks before they hit iOS Mail
Glyph antialiasing	Blink's text rasterization; output varies with the OS	WebKit's text rasterization; output varies with the OS	Drives most "phantom" diffs; tune `threshold` rather than dropping the webkit project
Emoji / fallback glyphs	Noto-style fallback	Apple-style fallback	A missing glyph box appears in only one engine, exposing a real font-stack gap

Neither engine is Gmail's web client, which sanitizes CSS (dropping <style> blocks in some situations) and rewrites class names, nor Samsung Email, which runs its own WebView. Use chromium as the broad web-render guard and webkit as the Apple Mail / iOS Mail proxy; everything else still needs a real-client farm.
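Attributing a failure to an engine can be sketched as a small grouping step. The input below is a hypothetical, flattened result shape (title, project, passed); Playwright's own reporters expose a richer structure, so treat this as an illustration of the triage logic rather than a parser for real report files.

```javascript
// Hypothetical, flattened result shape: { title, project, passed }.
// Playwright's reporters use a richer structure; adapt before using on real data.
function classifyFailures(results) {
  const failedProjects = new Map();
  for (const { title, project, passed } of results) {
    if (!failedProjects.has(title)) failedProjects.set(title, new Set());
    if (!passed) failedProjects.get(title).add(project);
  }
  const summary = {};
  for (const [title, projects] of failedProjects) {
    if (projects.size === 0) summary[title] = 'pass';
    else if (projects.has('chromium') && projects.has('webkit')) summary[title] = 'both engines';
    else if (projects.has('webkit')) summary[title] = 'webkit only';
    else summary[title] = 'chromium only';
  }
  return summary;
}

const verdicts = classifyFailures([
  { title: 'welcome desktop', project: 'chromium', passed: true },
  { title: 'welcome desktop', project: 'webkit', passed: false },
  { title: 'welcome mobile', project: 'chromium', passed: false },
  { title: 'welcome mobile', project: 'webkit', passed: false },
]);
console.log(verdicts);
```

A failure in both engines points at the template or CSS itself, while a webkit-only failure is the case worth weighing against how much of your audience reads in Apple Mail or iOS Mail.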

The baseline-update workflow, step by step

A baseline is an approval artifact, so updating it must be deliberate, not a reflex when CI goes red. Treat a red visual run as a question — "did I mean to change this?" — and resolve it in one of two ways.

Unintended diff: download the playwright-report artifact, open the side-by-side actual/diff PNGs, and fix the template or CSS. Do not touch the baseline.
Intended redesign: regenerate baselines in the same change as the markup edit so the approval is reviewable in one diff.

# Regenerate ONLY the spec you changed, on the OS that CI uses (Linux here),
# so font rasterization matches the committed baselines exactly.
npx playwright test tests/email/welcome.spec.js --update-snapshots

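# Review the regenerated PNGs, then stage the baselines together with the
# template files you edited (their location depends on your project layout).
git add tests/email/welcome.spec.js-snapshots/
git add path/to/edited-template   # placeholder: substitute your template source path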
git commit -m "redesign welcome header; refresh visual baselines"

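Because Linux, macOS, and Windows rasterize fonts differently, a baseline is tied to the OS that produced it. With Playwright's default naming, a baseline generated on a developer's macOS laptop carries a -darwin suffix, so Ubuntu CI reports a missing snapshot instead of comparing against it; if the platform suffix is removed through a custom path template, the same baseline phantom-fails on font rasterization. The durable fix is to generate baselines in the CI environment: run the suite there once with --update-snapshots, publish the regenerated -snapshots/ folders as an artifact, download them, and commit them, so the bytes always originate from the same engine and OS that will judge them. Pin the Playwright version in package.json: a browser engine bump can shift antialiasing past the threshold, so upgrade Playwright in its own dedicated change with a deliberate baseline refresh rather than letting it ride along with a template edit.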
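Whether the version is actually pinned can be checked mechanically. The sketch below treats only a plain x.y.z string (optionally with a prerelease tag) as pinned and rejects ranges such as ^ or ~; `isPinned` is a hypothetical helper, and the version numbers are placeholders rather than a recommendation.

```javascript
// Hypothetical check: is a package.json version range an exact version?
// Only plain "x.y.z" (optionally with a prerelease tag) counts as pinned.
function isPinned(range) {
  return /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/.test(range.trim());
}

// The version numbers below are placeholders, not a recommendation.
console.log(isPinned('1.2.3'));  // true
console.log(isPinned('^1.2.3')); // false
console.log(isPinned('~1.2.3')); // false
```

Applied to the @playwright/test entry under devDependencies, a check like this makes an accidental range show up in review instead of as a surprise baseline refresh.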

Validation checklist

Both chromium and webkit projects are defined and run in CI.
Screenshots are taken at a desktop width (600px) and a mobile width (375px).
The test loads the built HTML (post-inline), not the template source.
waitUntil: 'networkidle' is set so images and web fonts are loaded before capture.
Volatile regions (IDs, names, timestamps) are masked, not snapshotted.
Baselines are committed and were generated on the same OS as CI.
On failure, the run uploads actual/diff PNG artifacts for review.
Intentional changes update baselines in the same commit as the template edit.
A real-client farm is wired up separately to cover Outlook's Word engine.

Automated snapshot testing — the structural-diff layer this visual check complements
Litmus and Email on Acid workflows — true cross-client rendering, including Outlook's Word engine
Local email preview servers — serve the built email so the test renders the exact dispatched bytes
Email testing & QA workflows — how pixel diffing fits the wider QA gate
