Guide
Agent tests that hit the live model are flaky, paid, and nondeterministic — a temperature wobble or a provider hiccup turns a green suite red, and every CI run burns tokens. tapedeck records that call once against the real API, commits the result as a cassette, and replays it on every run after: deterministic, offline, free, and stream-accurate.
tapedeck is a Vercel AI SDK companion. It plugs in at
the wrapLanguageModel middleware layer (model spec v3), so it is
provider-agnostic and stream-aware by construction — no HTTP proxy, no mock to
hand-write, no infra. Switch behaviour with one env var; nothing else in your
code changes.
Install
npm install -D @nkwib/tapedeck
# or
pnpm add -D @nkwib/tapedeck
Requires the ai peer (>=6.0.0 <7). tapedeck has zero runtime dependencies
beyond that peer; @ai-sdk/provider is a type-only dev dependency.
Quickstart
Wrap your model once and read the mode from an env var. That is the whole integration:
import { openai } from '@ai-sdk/openai';
import { generateText, wrapLanguageModel } from 'ai';
import { cassetteMiddleware } from '@nkwib/tapedeck';
const model = wrapLanguageModel({
model: openai('gpt-4o'),
middleware: cassetteMiddleware({
mode: process.env.CASSETTE_MODE ?? 'live', // record | replay | live
cassetteDir: './cassettes',
redact: ['apiKey', 'authorization', /token/i],
}),
});
// CASSETTE_MODE=record → hits the live API, writes a cassette.
// CASSETTE_MODE=replay → offline, deterministic, free.
const { text } = await generateText({ model, prompt: 'Say hi' });
The recommended workflow: live in development, record to capture a fixture
once, replay in CI. Run the test against the live API a single time, commit
the cassette, and flip CI to replay.
Modes
| Mode | Behaviour |
|---|---|
| record | Calls the real model, serializes request + response to a cassette, returns the live result. |
| replay | Looks up the cassette by hash and serves it. A miss throws, so a changed prompt or tool schema fails the test and forces a re-record. |
| live | Passthrough. No recording, no lookup. |
In record mode the call flows out to the provider and the response is
captured on the way back.
In replay mode the cassette is resolved by hash and served straight back;
the provider is never touched.
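The three modes can be pictured as a small dispatcher. The sketch below is illustrative only and is not tapedeck's implementation: `parseMode`, `callWithCassette`, the plain `Error` instances, and the in-memory `Map` standing in for the cassette directory are all hypothetical.

```typescript
// Illustrative sketch of the record / replay / live split.
// All names here are hypothetical stand-ins, not tapedeck exports.
type Mode = 'record' | 'replay' | 'live';

const MODES: readonly Mode[] = ['record', 'replay', 'live'];

// Validate eagerly so a mistyped mode cannot degrade into a live call.
function parseMode(raw: string | undefined): Mode {
  const value = raw ?? 'live';
  if (!MODES.includes(value as Mode)) {
    throw new Error(`Invalid cassette mode: ${JSON.stringify(value)}`);
  }
  return value as Mode;
}

// A Map stands in for the cassette directory; callModel stands in for the provider.
function callWithCassette(
  mode: Mode,
  hash: string,
  cassettes: Map<string, string>,
  callModel: () => string,
): string {
  if (mode === 'live') return callModel(); // passthrough: no recording, no lookup
  if (mode === 'replay') {
    const hit = cassettes.get(hash);
    if (hit === undefined) throw new Error(`No cassette for ${hash}`);
    return hit; // the provider is never called
  }
  const result = callModel(); // record: call, store, return the live result
  cassettes.set(hash, result);
  return result;
}
```

Recording once and then replaying with the same hash returns the stored value without invoking `callModel` again.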
An invalid mode throws CassetteModeError the moment the middleware is constructed, so a mistyped CASSETTE_MODE never silently falls through to a live call.
cassetteMiddleware
cassetteMiddleware(options?) returns an AI SDK LanguageModelV3Middleware that intercepts both doGenerate (one-shot) and doStream (streaming).
| Option | Type | Default | Description |
|---|---|---|---|
| mode | 'record' \| 'replay' \| 'live' | 'live' | Operating mode. |
| cassetteDir | string | './cassettes' | Directory cassettes are read from / written to. |
| redact | (string \| RegExp)[] | [] | Extra key matchers, merged with the built-in defaults. |
| cassetteName | string | none | Force a specific filename instead of hash-addressing. Named cassettes are multi-interaction (keyed by hash). Mostly used internally by withCassette. |
| store | CassetteStore | filesystem | Storage backend. Use memoryCassetteStore() on edge runtimes. |
| tracer | TapedeckTracer | none | OTel-compatible tracer; emits tapedeck.generate / tapedeck.stream spans. |
tapedeck also exports lower-level primitives for direct use — hashing
(computeCassetteHash, stableStringify, normalizeTools), cassette I/O
(loadCassette / saveCassette, parseCassette / serializeCassette, isMultiCassette), diff/merge (diffCassettes, diffCassetteFiles, mergeCassetteDirs), storage (fileCassetteStore, memoryCassetteStore),
telemetry (withSpan), and the constants CASSETTE_VERSION, MULTI_CASSETTE_VERSION, cassetteFilename(hash), REDACTED, DEFAULT_REDACT. See the API reference for all of them.
Streaming
Streaming is first-class — not a non-goal. In record mode tapedeck drains the
live stream, captures the ordered stream parts, and re-serves them so your code
still receives the response. In replay mode the recorded parts are replayed as
a genuine ReadableStream via the SDK's own simulateReadableStream, so streamText, UI message streams, and tool-call streaming all see the same surface
they would see live.
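The replay half of that description can be sketched with the standard ReadableStream API. This is a hand-rolled stand-in, not the SDK's simulateReadableStream and not tapedeck's code; `replayParts`, `collectText`, and the simplified `RecordedPart` type are assumptions made for illustration.

```typescript
// Hypothetical sketch: serve recorded stream parts back through a ReadableStream.
type RecordedPart = { type: 'text-delta'; id: string; delta: string };

// Emit the recorded parts in their original order, then close.
function replayParts<T>(parts: readonly T[]): ReadableStream<T> {
  let index = 0;
  return new ReadableStream<T>({
    pull(controller) {
      if (index < parts.length) {
        controller.enqueue(parts[index]);
        index += 1;
      } else {
        controller.close();
      }
    },
  });
}

// Drain a replayed stream and concatenate its text deltas.
async function collectText(stream: ReadableStream<RecordedPart>): Promise<string> {
  const reader = stream.getReader();
  let text = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return text;
    text += value.delta;
  }
}
```

Because the consumer reads from an ordinary stream, it cannot tell whether the parts came from a provider or from a recorded array.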
import { streamText } from 'ai';
const { textStream } = await streamText({ model, prompt: 'Tell me a story' });
for await (const delta of textStream) process.stdout.write(delta);
// Identical output whether the model is live or replayed from a cassette.
Cassette format
Cassettes are pretty-printed JSON, keyed by a stable hash, designed to diff cleanly in PRs:
{
"version": "tapedeck@0.1.0",
"hash": "sha256:abc123…",
"recordedAt": "2026-06-10T12:00:00Z",
"request": {
"modelProvider": "openai",
"modelId": "gpt-4o",
"prompt": [ ],
"tools": [ ],
"temperature": 0.7
},
"response": {
"type": "stream",
"chunks": [
{ "type": "text-delta", "id": "0", "delta": "I'll" },
{ "type": "text-delta", "id": "0", "delta": " help" },
{ "type": "tool-call", "toolCallId": "call_123", "toolName": "search", "input": "{\"query\":\"t-shirts\"}" }
]
}
}
A one-shot generateText produces a "type": "generate" response holding the
recorded content array, finish reason, and usage instead of chunks.
Named cassettes (from withCassette / cassetteName) use the v2
multi-interaction format: one file holding every call the test makes, keyed by
hash — generate and stream interactions can mix freely:
{
"version": "tapedeck@0.3.0",
"recordedAt": "2026-06-10T12:00:00Z",
"interactions": [
{ "hash": "sha256:abc…", "request": { }, "response": { "type": "generate" } },
{ "hash": "sha256:def…", "request": { }, "response": { "type": "stream", "chunks": [ ] } }
]
}
Legacy v1 single-interaction named cassettes still replay (served as-is); hash-addressed cassettes always use the single format.
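Looking up a call inside a multi-interaction file amounts to a search by hash, which is why replay order does not matter. The types below are hypothetical mirrors of the JSON shape shown above, not the library's exported types, and `findInteraction` is an illustrative name.

```typescript
// Hypothetical types mirroring the multi-interaction JSON shape.
type Interaction = {
  hash: string;
  request: unknown;
  response: { type: 'generate' | 'stream' };
};

type MultiCassetteSketch = {
  version: string;
  recordedAt: string;
  interactions: Interaction[];
};

// Resolve one call by its request hash; position in the file is irrelevant.
function findInteraction(
  cassette: MultiCassetteSketch,
  hash: string,
): Interaction | undefined {
  return cassette.interactions.find((entry) => entry.hash === hash);
}
```

A caller would treat an `undefined` result as a miss and fail the test.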
Hash algorithm
The hash is a SHA-256 of the canonicalized, sorted JSON of:
{ modelProvider, modelId, prompt, toolSchemas, maxOutputTokens, temperature, topP }
Tool schemas are normalized (descriptions stripped, keys sorted) so cosmetic doc changes don't invalidate a cassette, but a changed prompt, tool input schema, or sampling param does. That is the point: a behavioural change fails CI loudly instead of replaying stale data.
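A minimal sketch of canonical hashing under these rules follows. It is not tapedeck's computeCassetteHash: the `Sketch`-suffixed names are hypothetical, it uses node:crypto for brevity, and it simplifies normalization by dropping every key named "description" at any depth.

```typescript
import { createHash } from 'node:crypto';

// Stand-in functions for illustration; not the library's exported primitives.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so key order never affects the output.
function stableStringifySketch(value: Json): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringifySketch).join(',')}]`;
  const keys = Object.keys(value).sort();
  const body = keys.map((k) => `${JSON.stringify(k)}:${stableStringifySketch(value[k])}`);
  return `{${body.join(',')}}`;
}

// Simplification: remove every "description" key, wherever it appears.
function stripDescriptions(value: Json): Json {
  if (value === null || typeof value !== 'object') return value;
  if (Array.isArray(value)) return value.map(stripDescriptions);
  const out: { [key: string]: Json } = {};
  for (const [key, inner] of Object.entries(value)) {
    if (key !== 'description') out[key] = stripDescriptions(inner);
  }
  return out;
}

// SHA-256 over the canonical form, prefixed like the hashes in the cassette examples.
function hashRequestSketch(request: { [key: string]: Json }): string {
  const canonical = stableStringifySketch(stripDescriptions(request));
  return `sha256:${createHash('sha256').update(canonical).digest('hex')}`;
}
```

Two requests that differ only in key order or in a tool description hash identically here, while a changed temperature produces a different hash.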
When no cassette matches the hash, replay throws CassetteMissError and the test fails. Re-record, review the
cassette diff in the PR, and commit the new fixture.
Secret redaction
Redaction is key-name based and runs at record time, so secrets never reach disk:
- Default matchers: apiKey, authorization, x-api-key, bearer, token (case-insensitive).
- Configurable via redact: (string | RegExp)[]. Strings match field / header names case-insensitively; RegExps test the raw key. Your matchers are merged with the built-in defaults.
- Replaying a cassette that still contains a value a matcher would strip throws CassetteSecretError, so a committed secret fails the build instead of leaking.
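Key-name redaction and the replay-time check can be sketched as two recursive walks. This is an illustration, not tapedeck's code: the function names and the placeholder string are assumptions, and string matchers are assumed to match the whole key name rather than a substring.

```typescript
// Hypothetical sketch of key-name redaction; names and placeholder are illustrative.
type Matcher = string | RegExp;
const PLACEHOLDER = '[REDACTED]';

// Strings compare case-insensitively against the whole key; RegExps test the raw key.
function keyMatches(key: string, matchers: readonly Matcher[]): boolean {
  return matchers.some((m) =>
    typeof m === 'string' ? m.toLowerCase() === key.toLowerCase() : m.test(key),
  );
}

// Record side: replace the value under every matching key before writing to disk.
function redactSketch(value: unknown, matchers: readonly Matcher[]): unknown {
  if (Array.isArray(value)) return value.map((v) => redactSketch(v, matchers));
  if (value === null || typeof value !== 'object') return value;
  const out: Record<string, unknown> = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = keyMatches(key, matchers) ? PLACEHOLDER : redactSketch(inner, matchers);
  }
  return out;
}

// Replay side: list paths whose key matches but whose value was never redacted.
function findUnredacted(value: unknown, matchers: readonly Matcher[], path = '$'): string[] {
  if (Array.isArray(value)) {
    return value.flatMap((v, i) => findUnredacted(v, matchers, `${path}[${i}]`));
  }
  if (value === null || typeof value !== 'object') return [];
  return Object.entries(value).flatMap(([key, inner]) =>
    keyMatches(key, matchers) && inner !== PLACEHOLDER
      ? [`${path}.${key}`]
      : findUnredacted(inner, matchers, `${path}.${key}`),
  );
}
```

A replay guard built this way would throw whenever `findUnredacted` returns a non-empty list, reporting the offending paths.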
cassetteMiddleware({
mode: 'record',
redact: ['apiKey', 'authorization', /secret/i],
});
Vitest helper
@nkwib/tapedeck/vitest exports withCassette(name, testFn, options?), which pins a
test to a named cassette and forces replay mode for its duration:
import { describe, it, expect } from 'vitest';
import { withCassette } from '@nkwib/tapedeck/vitest';
describe('checkout agent', () => {
it('runs the checkout flow', async () => {
await withCassette('checkout-flow.json', async () => {
const result = await runAgent({ prompt: 'buy a t-shirt' });
expect(result.steps).toHaveLength(3);
});
});
});
Any cassetteMiddleware instance active inside the callback picks up the named
cassette automatically, via an AsyncLocalStorage context, and tears down on
exit, so there is no global setup/teardown to wire up. Pass options.mode to
override the forced replay, or options.cassetteDir to point at a different
directory.
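The scoping technique itself can be shown with Node's AsyncLocalStorage. The sketch below is not withCassette: `withScope`, `currentScope`, and the `CassetteScope` shape are hypothetical names used to illustrate how a value set around a callback is visible to code running inside it and gone afterwards.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical scope object; the real context's shape is not documented here.
type CassetteScope = { name: string; mode: 'record' | 'replay' | 'live' };

const scope = new AsyncLocalStorage<CassetteScope>();

// Run fn with a named cassette in scope; replay is the default, as in the text.
function withScope<T>(
  name: string,
  fn: () => T,
  mode: CassetteScope['mode'] = 'replay',
): T {
  return scope.run({ name, mode }, fn);
}

// What a middleware instance would consult on each call.
function currentScope(): CassetteScope | undefined {
  return scope.getStore();
}
```

Inside the callback, including after an `await`, `currentScope()` returns the scope; outside it returns `undefined`, so no explicit teardown is needed.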
The named cassette is multi-interaction: if the agent above makes three
model calls, all three are recorded into checkout-flow.json keyed by request
hash, and each call replays its own response — in any order. Re-recording a
test starts the file fresh, so stale interactions never linger.
@nkwib/tapedeck/vitest also exports the toFollowRoute() matcher: pair with toolroute to assert that the replayed
trajectory only makes transitions your router allows
(expect(result.steps).toFollowRoute(router) after expect.extend({ toFollowRoute })).
CLI
The package ships a tapedeck bin for the record/replay workflow:
npx tapedeck record ./scripts/demo.mjs # run with CASSETTE_MODE=record
npx tapedeck replay ./scripts/demo.mjs # run with CASSETTE_MODE=replay
npx tapedeck record pnpm test # non-file args run as commands on PATH
npx tapedeck ls ./cassettes # kind, model, recordedAt per cassette
npx tapedeck diff a.json b.json # semantic field-level diff (exit 1 on difference)
npx tapedeck merge ./from-ci ./cassettes # merge directories; --force overwrites conflicts
diff pinpoints which fields diverged and ignores recordedAt; merge skips identical files and fails on conflicts unless --force is passed.
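To make "field-level diff that ignores recordedAt" concrete, here is one way such a comparison could work. It is a hypothetical sketch, not the CLI's algorithm or the exported diffCassettes; `diffFields` and its dotted-path output format are assumptions.

```typescript
// Hypothetical field-level diff: returns the dotted paths where two values diverge.
// Top-level keys listed in `ignore` (recordedAt by default) are skipped.
function diffFields(
  a: unknown,
  b: unknown,
  path = '',
  ignore: readonly string[] = ['recordedAt'],
): string[] {
  if (Object.is(a, b)) return [];
  const bothObjects =
    typeof a === 'object' && typeof b === 'object' && a !== null && b !== null;
  if (!bothObjects || Array.isArray(a) !== Array.isArray(b)) {
    return [path || '(root)'];
  }
  const left = a as Record<string, unknown>;
  const right = b as Record<string, unknown>;
  const keys = Array.from(new Set([...Object.keys(left), ...Object.keys(right)])).sort();
  return keys
    .filter((key) => !(path === '' && ignore.includes(key)))
    .flatMap((key) => diffFields(left[key], right[key], path ? `${path}.${key}` : key, ignore));
}
```

An empty result corresponds to "no difference"; a CLI wrapper could map a non-empty result to exit code 1.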
Telemetry
Pass any OTel-compatible tracer and every record/replay emits a span — typed structurally, so tapedeck keeps zero runtime dependencies:
import { trace } from '@opentelemetry/api';
cassetteMiddleware({ mode: 'replay', tracer: trace.getTracer('tapedeck') });
Spans (tapedeck.generate / tapedeck.stream) carry mode, hash, cassette
path, model, hit/miss, and chunk-count attributes; a miss records the
exception with an error status, so a failing CI replay shows up in traces.
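"Typed structurally" means the tracer only has to have the right shape. The sketch below declares the few OTel span methods it needs as local interfaces and wraps a call in a span. It is not tapedeck's withSpan: `TracerLike`, `SpanLike`, `tracedSketch`, and the attribute key used in the test are assumptions.

```typescript
// Structural stand-ins for the parts of an OTel tracer this sketch uses.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

const STATUS_ERROR = 2; // numeric value of OTel's SpanStatusCode.ERROR

// Run fn inside a span; on failure record the exception and mark the span as errored.
function tracedSketch<T>(
  tracer: TracerLike | undefined,
  name: string,
  attributes: Record<string, string | number | boolean>,
  fn: () => T,
): T {
  if (!tracer) return fn(); // no tracer configured: plain call
  const span = tracer.startSpan(name);
  for (const [key, value] of Object.entries(attributes)) span.setAttribute(key, value);
  try {
    return fn();
  } catch (error) {
    span.recordException(error instanceof Error ? error : new Error(String(error)));
    span.setStatus({ code: STATUS_ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```

Because the interfaces are declared locally, no telemetry package is imported; any object with these methods, including a real OTel tracer, satisfies them.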
Storage & edge runtimes
Cassette I/O goes through a CassetteStore (read/write/list). The
default is the filesystem (loaded lazily); pass memoryCassetteStore() — or a
KV/R2-backed store — on edge runtimes. The core never imports node:fs, node:path, or node:crypto statically; the one remaining Node builtin is node:async_hooks, which Cloudflare Workers provides under the nodejs_compat flag. See Compatibility for the caveats.
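A custom backend only has to provide read, write, and list. The text names those operations but not their signatures, so the interface below is an assumption, and `mapStoreSketch` is a hypothetical Map-backed example rather than the library's memoryCassetteStore.

```typescript
// Assumed shape: string keys, string contents, async methods.
interface StoreSketch {
  read(key: string): Promise<string | undefined>;
  write(key: string, contents: string): Promise<void>;
  list(prefix?: string): Promise<string[]>;
}

// Map-backed store that touches no filesystem or Node-only API.
function mapStoreSketch(seed: Record<string, string> = {}): StoreSketch {
  const files = new Map<string, string>(Object.entries(seed));
  return {
    async read(key) {
      return files.get(key);
    },
    async write(key, contents) {
      files.set(key, contents);
    },
    async list(prefix = '') {
      return Array.from(files.keys())
        .filter((key) => key.startsWith(prefix))
        .sort();
    },
  };
}
```

A KV-backed variant would keep the same three methods and swap the Map for calls to the platform's key-value binding.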
Errors
| Error | When |
|---|---|
| CassetteMissError | replay mode, no cassette matches the hash. Message includes the hash and the path searched. |
| CassetteSecretError | A replayed cassette still contains unredacted secrets. Lists the offending field paths. |
| CassetteCorruptError | Invalid JSON, unknown version, or a malformed / mismatched response shape. |
| CassetteModeError | An invalid mode string was supplied. |
All extend CassetteError, so you can catch the whole family with one instanceof CassetteError.
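Catching the family with a single instanceof check might look like the sketch below. The classes here are local stand-ins so the example runs on its own; in real code they would be the error classes exported by the package. `classify` and its return labels are hypothetical.

```typescript
// Local stand-ins for the error family; not the package's own classes.
class CassetteErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof reliable on older targets
  }
}
class CassetteMissSketch extends CassetteErrorSketch {}
class CassetteCorruptSketch extends CassetteErrorSketch {}

// Check the specific subclass first, then fall back to the shared base class.
function classify(error: unknown): 're-record' | 'fix-fixture' | 'rethrow' {
  if (error instanceof CassetteMissSketch) return 're-record';
  if (error instanceof CassetteErrorSketch) return 'fix-fixture';
  return 'rethrow';
}
```

Errors outside the family fall through to `'rethrow'`, so unrelated failures are not swallowed by the cassette handling.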
Roadmap
Everything deferred from the first cut — OTel spans, the CLI, diff/merge
tooling, the edge-safe core, the toFollowRoute() matcher, and
multi-interaction named cassettes — has shipped as of 0.3.0. Still ahead:
- Deployed Cloudflare Workers smoke test in CI — edge support is designed-for, not yet CI-verified.
- Interaction-level merge for multi-cassettes (merge is file-level today).
See the changelog for the full release history.