Guide

Agent tests that hit the live model are flaky, paid, and nondeterministic: a temperature wobble or a provider hiccup turns a green suite red, and every CI run burns tokens. tapedeck records each model call once against the real API, writes the result to a cassette that you commit, and replays it on every subsequent run: deterministic, offline, free, and stream-accurate.

tapedeck is a Vercel AI SDK companion. It plugs in at the wrapLanguageModel middleware layer (model spec v3), so it is provider-agnostic and stream-aware by construction — no HTTP proxy, no mock to hand-write, no infra. Switch behaviour with one env var; nothing else in your code changes.

Install

npm install -D @nkwib/tapedeck
# or
pnpm add -D @nkwib/tapedeck

Requires the ai package as a peer dependency (>=6.0.0 <7). tapedeck has zero runtime dependencies beyond that peer; @ai-sdk/provider is a type-only dev dependency.

Quickstart

Wrap your model once and read the mode from an env var. That is the whole integration:

import { openai } from '@ai-sdk/openai';
import { generateText, wrapLanguageModel } from 'ai';
import { cassetteMiddleware } from '@nkwib/tapedeck';

const model = wrapLanguageModel({
  model: openai('gpt-4o'),
  middleware: cassetteMiddleware({
    mode: process.env.CASSETTE_MODE ?? 'live', // record | replay | live
    cassetteDir: './cassettes',
    redact: ['apiKey', 'authorization', /token/i],
  }),
});

// CASSETTE_MODE=record → hits the live API, writes a cassette.
// CASSETTE_MODE=replay → offline, deterministic, free.
const { text } = await generateText({ model, prompt: 'Say hi' });

The recommended workflow is live in development, record to capture a fixture once, and replay in CI: run the test once with CASSETTE_MODE=record, commit the cassette, and set CASSETTE_MODE=replay in CI.

record once, then replay

# 1. record once — hits the live API, writes a cassette
$ CASSETTE_MODE=record pnpm test
[vitest] checkout agent > runs the checkout flow
[tapedeck] record: openai/gpt-4o -> cassettes/a1b2c3.cassette.json
 ✓ checkout agent > runs the checkout flow (1284ms)
Test Files  1 passed (1)

# 2. commit the cassette, then replay forever — offline, deterministic, free
$ git add cassettes/ && git commit -m "record checkout cassette"
$ CASSETTE_MODE=replay pnpm test
[tapedeck] replay: openai/gpt-4o <- cassettes/a1b2c3.cassette.json
 ✓ checkout agent > runs the checkout flow (3ms)
Test Files  1 passed (1)

Modes

Mode	Behaviour
`record`	Calls the real model, serializes request + response to a cassette, returns the live result.
`replay`	Looks up the cassette by hash, serves it. A miss throws — a changed prompt or tool schema fails the test, forcing a re-record.
`live`	Passthrough. No recording, no lookup.

In record mode the call flows out to the provider and the response is captured on the way back. In replay mode the cassette is resolved by hash and served straight back, so the provider is never touched.

One env var, no code change

An invalid mode string fails fast with CassetteModeError the moment the middleware is constructed, so a typo'd CASSETTE_MODE never silently falls through to a live call.
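
The fail-fast check can be pictured with a small validator. This is a minimal sketch, not tapedeck's implementation: parseMode and SketchModeError are hypothetical names, standing in for whatever the middleware does internally before it throws CassetteModeError.

```typescript
// Hypothetical stand-in for the mode check; all names here are illustrative only.
type Mode = 'record' | 'replay' | 'live';

const MODES: readonly Mode[] = ['record', 'replay', 'live'];

class SketchModeError extends Error {}

// Accepts the raw env value and returns a validated mode.
// An unset variable falls back to 'live'; any other unknown value throws immediately.
function parseMode(raw: string | undefined): Mode {
  if (raw === undefined) return 'live';
  const found = MODES.find((m) => m === raw);
  if (found === undefined) {
    throw new SketchModeError(
      `invalid cassette mode "${raw}"; expected one of ${MODES.join(', ')}`,
    );
  }
  return found;
}
```

Validating at construction time, rather than on the first model call, is what keeps a misspelled value such as "repaly" from ever reaching the provider.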

cassetteMiddleware

cassetteMiddleware(options?) returns an AI SDK LanguageModelV3Middleware that intercepts both doGenerate (one-shot) and doStream (streaming).

Option	Type	Default	Description
`mode`	`'record' | 'replay' | 'live'`	`'live'`	Operating mode.
`cassetteDir`	`string`	`'./cassettes'`	Directory cassettes are read from / written to.
`redact`	`(string | RegExp)[]`	`[]`	Extra key matchers, merged with the built-in defaults.
`cassetteName`	`string`	—	Force a specific filename instead of hash-addressing. Named cassettes are multi-interaction (keyed by hash). Mostly used internally by `withCassette`.
`store`	`CassetteStore`	filesystem	Storage backend. Use `memoryCassetteStore()` on edge runtimes.
`tracer`	`TapedeckTracer`	—	OTel-compatible tracer; emits `tapedeck.generate` / `tapedeck.stream` spans.

tapedeck also exports lower-level primitives for direct use — hashing (computeCassetteHash, stableStringify, normalizeTools), cassette I/O (loadCassette / saveCassette, parseCassette / serializeCassette, isMultiCassette), diff/merge (diffCassettes, diffCassetteFiles, mergeCassetteDirs), storage (fileCassetteStore, memoryCassetteStore), telemetry (withSpan), and the constants CASSETTE_VERSION, MULTI_CASSETTE_VERSION, cassetteFilename(hash), REDACTED, DEFAULT_REDACT. See the API reference for all of them.

Streaming

Streaming is a first-class feature. In record mode tapedeck drains the live stream, captures the ordered stream parts, and re-serves them so your code still receives the response. In replay mode the recorded parts are served as a genuine ReadableStream via the SDK's own simulateReadableStream, so streamText, UI message streams, and tool-call streaming all see the same surface they would see against the live model.

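import { streamText } from 'ai';

// `model` is the wrapped model from the Quickstart.
// streamText returns its result synchronously, so no await is needed before iterating.
const { textStream } = streamText({ model, prompt: 'Tell me a story' });
for await (const delta of textStream) {
  process.stdout.write(delta);
}
// Identical output whether the model is live or replayed from a cassette.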

Cassette format

Cassettes are pretty-printed JSON, keyed by a stable hash, designed to diff cleanly in PRs:

{
  "version": "tapedeck@0.1.0",
  "hash": "sha256:abc123…",
  "recordedAt": "2026-06-10T12:00:00Z",
  "request": {
    "modelProvider": "openai",
    "modelId": "gpt-4o",
    "prompt": [ ],
    "tools": [ ],
    "temperature": 0.7
  },
  "response": {
    "type": "stream",
    "chunks": [
      { "type": "text-delta", "id": "0", "delta": "I'll" },
      { "type": "text-delta", "id": "0", "delta": " help" },
      { "type": "tool-call", "toolCallId": "call_123", "toolName": "search", "input": "{\"query\":\"t-shirts\"}" }
    ]
  }
}

A one-shot generateText produces a "type": "generate" response holding the recorded content array, finish reason, and usage instead of chunks.
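
To make the stream shape concrete, here is a rough TypeScript model of the response in the sample above, plus a helper that rebuilds the text from recorded chunks. The type names and the textFromChunks helper are assumptions for illustration. Only the field names visible in the JSON sample come from the format, and the generate variant is left as a bare tag because its exact fields are not shown here.

```typescript
// Illustrative types only: field names mirror the JSON sample, type names are made up.
type SketchChunk =
  | { type: 'text-delta'; id: string; delta: string }
  | { type: 'tool-call'; toolCallId: string; toolName: string; input: string };

type SketchResponse =
  | { type: 'stream'; chunks: SketchChunk[] }
  | { type: 'generate' }; // content, finish reason, and usage omitted in this sketch

// Concatenates text-delta chunks in recorded order; other chunk kinds are skipped.
function textFromChunks(chunks: SketchChunk[]): string {
  let text = '';
  for (const chunk of chunks) {
    if (chunk.type === 'text-delta') text += chunk.delta;
  }
  return text;
}

const recorded: SketchResponse = {
  type: 'stream',
  chunks: [
    { type: 'text-delta', id: '0', delta: "I'll" },
    { type: 'text-delta', id: '0', delta: ' help' },
    { type: 'tool-call', toolCallId: 'call_123', toolName: 'search', input: '{"query":"t-shirts"}' },
  ],
};

const replayedText = recorded.type === 'stream' ? textFromChunks(recorded.chunks) : '';
console.log(replayedText); // prints "I'll help"
```

Note that the tool-call input is stored as a JSON string, so a consumer would run JSON.parse on it before reading the arguments.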

Named cassettes (from withCassette / cassetteName) use the v2 multi-interaction format: one file holding every call the test makes, keyed by hash — generate and stream interactions can mix freely:

{
  "version": "tapedeck@0.3.0",
  "recordedAt": "2026-06-10T12:00:00Z",
  "interactions": [
    { "hash": "sha256:abc…", "request": { }, "response": { "type": "generate" } },
    { "hash": "sha256:def…", "request": { }, "response": { "type": "stream", "chunks": [ ] } }
  ]
}

Legacy v1 single-interaction named cassettes still replay (served as-is); hash-addressed cassettes always use the single format.

Hash algorithm

The hash is a SHA-256 of the canonicalized, sorted JSON of:

{ modelProvider, modelId, prompt, toolSchemas, maxOutputTokens, temperature, topP }

Tool schemas are normalized (descriptions stripped, keys sorted) so cosmetic doc changes don't invalidate a cassette — but a changed prompt, tool input schema, or sampling param does. That is the point: a behavioural change fails CI loudly instead of replaying stale data.
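
The canonicalize-then-hash idea can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind computeCassetteHash, stableStringify, or normalizeTools: the sketch-prefixed names are hypothetical, the tool objects are assumed to carry a description field, and stripping every description key at any depth is a deliberate simplification.

```typescript
import { createHash } from 'node:crypto';

// Serializes with object keys sorted at every level, so key order never affects the output.
function sketchStableStringify(value: unknown): string {
  if (value === undefined) return 'null';
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) {
    return `[${value.map((item) => sketchStableStringify(item)).join(',')}]`;
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).filter((k) => obj[k] !== undefined).sort();
  const body = keys.map((k) => `${JSON.stringify(k)}:${sketchStableStringify(obj[k])}`);
  return `{${body.join(',')}}`;
}

// Drops every `description` key at any depth. Simplification: this would also drop
// a schema property that happens to be named "description".
function sketchStripDescriptions(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => sketchStripDescriptions(item));
  if (value === null || typeof value !== 'object') return value;
  const out: Record<string, unknown> = {};
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    if (key !== 'description') out[key] = sketchStripDescriptions(child);
  }
  return out;
}

interface SketchHashInput {
  modelProvider: string;
  modelId: string;
  prompt: unknown;
  toolSchemas: unknown[];
  maxOutputTokens?: number;
  temperature?: number;
  topP?: number;
}

// SHA-256 over the canonical JSON of the hash inputs listed above.
function sketchHash(input: SketchHashInput): string {
  const canonical = sketchStableStringify({
    ...input,
    toolSchemas: sketchStripDescriptions(input.toolSchemas),
  });
  return 'sha256:' + createHash('sha256').update(canonical).digest('hex');
}
```

With this shape, rewording a tool description or reordering keys leaves the hash unchanged, while changing temperature from 0.7 to 0.2 produces a different hash and therefore a replay miss.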

A changed prompt fails CI on purpose

When the inputs to the hash change, replay misses with CassetteMissError and the test fails. Re-record, eyeball the cassette diff in the PR, and commit the new fixture.

Secret redaction

Redaction is key-name based and runs at record time, so secrets never reach disk:

Default matchers: apiKey, authorization, x-api-key, bearer, token (case-insensitive).
Configurable via redact: (string | RegExp)[] — strings match field / header names case-insensitively; RegExps test the raw key. Your matchers are merged with the built-in defaults.
Replaying a cassette that still contains a value a matcher would strip throws CassetteSecretError — a committed secret fails the build instead of leaking.

cassetteMiddleware({
  mode: 'record',
  // 'apiKey' and 'authorization' are already built-in defaults; /secret/i is the extra matcher.
  redact: ['apiKey', 'authorization', /secret/i],
});
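
Key-name based redaction can be sketched as a recursive walk that replaces the value under any matching key. The code below is a hedged illustration, not tapedeck's redactor: redactByKey, keyMatches, and the '[REDACTED]' placeholder are hypothetical, and it assumes a string matcher must equal the whole key (case-insensitively) rather than match a substring.

```typescript
type Matcher = string | RegExp;

// Hypothetical marker; the value of the library's REDACTED constant may differ.
const PLACEHOLDER = '[REDACTED]';

// Strings compare against the whole key, ignoring case; RegExps test the raw key.
// Avoid the `g` flag on matchers: RegExp.prototype.test is stateful with it.
function keyMatches(key: string, matchers: Matcher[]): boolean {
  return matchers.some((m) =>
    typeof m === 'string' ? m.toLowerCase() === key.toLowerCase() : m.test(key),
  );
}

// Returns a deep copy in which every value stored under a matching key is replaced.
function redactByKey(value: unknown, matchers: Matcher[]): unknown {
  if (Array.isArray(value)) return value.map((item) => redactByKey(item, matchers));
  if (value === null || typeof value !== 'object') return value;
  const out: Record<string, unknown> = {};
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    out[key] = keyMatches(key, matchers) ? PLACEHOLDER : redactByKey(child, matchers);
  }
  return out;
}
```

Because the walk runs before anything is serialized, the secret value never exists in the string that would be written to the cassette.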

Vitest helper

@nkwib/tapedeck/vitest exports withCassette(name, testFn, options?), which pins a test to a named cassette and forces replay mode for its duration:

import { describe, it, expect } from 'vitest';
import { withCassette } from '@nkwib/tapedeck/vitest';

describe('checkout agent', () => {
  it('runs the checkout flow', async () => {
    await withCassette('checkout-flow.json', async () => {
      // runAgent is your own agent entry point; it should call a model wrapped with cassetteMiddleware.
      const result = await runAgent({ prompt: 'buy a t-shirt' });
      expect(result.steps).toHaveLength(3);
    });
  });
});

Any cassetteMiddleware instance active inside the callback picks up the named cassette automatically — via an AsyncLocalStorage context — and tears down on exit, so there is no global setup/teardown to wire up. Pass options.mode to override the forced replay, or options.cassetteDir to point at a different directory.

The named cassette is multi-interaction: if the agent above makes three model calls, all three are recorded into checkout-flow.json keyed by request hash, and each call replays its own response — in any order. Re-recording a test starts the file fresh, so stale interactions never linger.
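
The AsyncLocalStorage hand-off described above can be sketched as follows. This is a minimal illustration of the mechanism, not the helper's source: withSketchCassette, activeCassetteName, and the context shape are hypothetical names chosen for the example.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical context shape carried along the async call chain.
interface SketchContext {
  cassetteName: string;
  mode: 'record' | 'replay' | 'live';
}

const cassetteContext = new AsyncLocalStorage<SketchContext>();

// Runs `fn` with a named cassette bound to the current call chain.
// The binding disappears on its own once `fn` (and anything it awaits) has finished.
function withSketchCassette<T>(cassetteName: string, fn: () => T): T {
  return cassetteContext.run({ cassetteName, mode: 'replay' }, fn);
}

// What a middleware could call on each request to discover the active cassette, if any.
function activeCassetteName(): string | undefined {
  return cassetteContext.getStore()?.cassetteName;
}
```

Since the store is scoped to the callback rather than to the process, two tests running concurrently would each see only their own cassette name.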

@nkwib/tapedeck/vitest also exports the toFollowRoute() matcher. Pair it with toolroute to assert that the replayed trajectory only makes transitions your router allows: register the matcher with expect.extend({ toFollowRoute }), then call expect(result.steps).toFollowRoute(router).

CLI

The package ships a tapedeck bin for the record/replay workflow:

npx tapedeck record ./scripts/demo.mjs    # run with CASSETTE_MODE=record
npx tapedeck replay ./scripts/demo.mjs    # run with CASSETTE_MODE=replay
npx tapedeck record pnpm test             # non-file args run as commands on PATH

npx tapedeck ls ./cassettes               # kind, model, recordedAt per cassette
npx tapedeck diff a.json b.json           # semantic field-level diff (exit 1 on difference)
npx tapedeck merge ./from-ci ./cassettes  # merge directories; --force overwrites conflicts

diff pinpoints which fields diverged and ignores recordedAt; merge skips identical files and fails on conflicts unless --force is passed.
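
A field-level diff that ignores recordedAt can be pictured like this. It is a rough sketch of the idea only: diffPaths is a hypothetical function, and the dotted-path output format is an assumption rather than what the CLI or diffCassettes actually prints.

```typescript
// Returns the dotted paths at which two JSON-like values differ.
// Keys listed in `ignore` are skipped at the top level only.
function diffPaths(
  a: unknown,
  b: unknown,
  ignore: string[] = ['recordedAt'],
  path = '',
): string[] {
  const isObj = (v: unknown): v is Record<string, unknown> =>
    v !== null && typeof v === 'object';
  const here = path === '' ? '(root)' : path;

  // Primitives (or a primitive against an object) are compared directly.
  if (!isObj(a) || !isObj(b)) return a === b ? [] : [here];
  // An array never equals a plain object, whatever the contents.
  if (Array.isArray(a) !== Array.isArray(b)) return [here];

  const keys = Array.from(new Set(Object.keys(a).concat(Object.keys(b)))).sort();
  const out: string[] = [];
  for (const key of keys) {
    if (path === '' && ignore.indexOf(key) !== -1) continue;
    const childPath = path === '' ? key : `${path}.${key}`;
    out.push(...diffPaths(a[key], b[key], ignore, childPath));
  }
  return out;
}
```

A wrapper script could map an empty result to exit code 0 and a non-empty one to exit code 1, matching the exit behaviour noted for diff above.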

Telemetry

Pass any OTel-compatible tracer and every record/replay emits a span. The tracer option is typed structurally, so tapedeck keeps zero runtime dependencies:

import { trace } from '@opentelemetry/api';

cassetteMiddleware({ mode: 'replay', tracer: trace.getTracer('tapedeck') });

Spans (tapedeck.generate / tapedeck.stream) carry mode, hash, cassette path, model, hit/miss, and chunk-count attributes; a miss records the exception with an error status, so a failing CI replay shows up in traces.

Storage & edge runtimes

Cassette I/O goes through a CassetteStore (read/write/list). The default is the filesystem (loaded lazily); pass memoryCassetteStore() — or a KV/R2-backed store — on edge runtimes. The core never imports node:fs, node:path, or node:crypto statically; the one remaining Node builtin is node:async_hooks, which Cloudflare Workers provides under the nodejs_compat flag. See Compatibility for the caveats.
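
For a sense of what a custom backend involves, here is a toy in-memory store. Only the three method names (read, write, list) come from the text above. The SketchStore interface, its signatures, and createMemoryStore are assumptions, and the methods are synchronous purely for brevity; a KV- or R2-backed store would almost certainly be asynchronous, so check the real CassetteStore type before implementing one.

```typescript
// Assumed shape: method names from the guide, signatures guessed for illustration.
interface SketchStore {
  read(name: string): string | undefined;
  write(name: string, contents: string): void;
  list(): string[];
}

// Keeps serialized cassettes in a Map instead of on disk, so no node:fs is needed.
function createMemoryStore(): SketchStore {
  const files = new Map<string, string>();
  return {
    read: (name) => files.get(name),
    write: (name, contents) => {
      files.set(name, contents);
    },
    list: () => Array.from(files.keys()).sort(),
  };
}
```

Keeping the store behind a narrow interface like this is what lets the core avoid a static filesystem import.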

Errors

Error	When
`CassetteMissError`	`replay` mode, no cassette matches the hash. Message includes the hash and the path searched.
`CassetteSecretError`	A replayed cassette still contains unredacted secrets. Lists the offending field paths.
`CassetteCorruptError`	Invalid JSON, unknown version, or a malformed / mismatched response shape.
`CassetteModeError`	An invalid mode string was supplied.

All extend CassetteError, so you can catch the whole family with one instanceof CassetteError.
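
The single-instanceof pattern looks like this in practice. The classes below are local stand-ins that only mimic the documented hierarchy so the snippet is self-contained; in real code you would import CassetteError and its subclasses from the package instead. The classify helper is hypothetical.

```typescript
// Local stand-ins for the documented hierarchy; the real classes come from the package.
class SketchCassetteError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class SketchMissError extends SketchCassetteError {}
class SketchCorruptError extends SketchCassetteError {}

// Separates cassette-related failures from everything else with one instanceof check.
function classify(error: unknown): 'cassette' | 'other' {
  return error instanceof SketchCassetteError ? 'cassette' : 'other';
}
```

A test harness could use a check like this to print a "re-record the cassette" hint for cassette failures while letting unrelated errors propagate unchanged.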

Roadmap

Everything deferred from the first cut — OTel spans, the CLI, diff/merge tooling, the edge-safe core, the toFollowRoute() matcher, and multi-interaction named cassettes — has shipped as of 0.3.0. Still ahead:

Deployed Cloudflare Workers smoke test in CI — edge support is designed-for, not yet CI-verified.
Interaction-level merge for multi-cassettes (merge is file-level today).

See the changelog for the full release history.