Before / after — Vercel AI SDK
A real agent test on the Vercel AI SDK. Same test, same model call. The
only difference is whether CI hits the live API — or replays a committed
cassette: record once, replay forever.
import { describe, it, expect } from 'vitest';
import { openai } from '@ai-sdk/openai';
import { runCheckoutAgent } from '../src/agent';
// Option A: let the test call the live model in CI.
describe('checkout agent', () => {
  it('runs the checkout flow', async () => {
    // Needs OPENAI_API_KEY wired into CI secrets: auth hygiene on
    // every runner. Each run bills real tokens: $ per push, per retry.
    const result = await runCheckoutAgent({
      model: openai('gpt-4o'),
      prompt: 'buy a t-shirt',
    });
    // The model is nondeterministic. It picked three steps on your laptop;
    // tonight it adds a clarifying turn and this assertion flips red.
    expect(result.steps).toHaveLength(3);
    // 3am: an upstream latency spike times the request out. The build
    // goes red, nobody touched the code. "Works on my machine." Re-run.
  });
});
// Option B: hand-write a MockLanguageModelV3 for every turn instead.
// No network — but you author each chunk by hand, it doesn't replay a
// real stream, and the next `ai` SDK bump rewrites the part shapes out
// from under you. Brittle boilerplate that rots the day you stop looking.

The test calls the real model on every run: flaky on upstream
latency, billed per push, and asserting against nondeterministic
output. The only escape, hand-writing a MockLanguageModelV3 per turn, is brittle boilerplate that collapses on the next SDK bump.
cassetteMiddleware + withCassette
import { describe, it, expect } from 'vitest';
import { openai } from '@ai-sdk/openai';
import { wrapLanguageModel } from 'ai';
import { withCassette } from '@nkwib/tapedeck/vitest';
import { cassetteMiddleware } from '@nkwib/tapedeck';
// Wrap the model once. Behaviour switches on one env var — nothing else.
const model = wrapLanguageModel({
  model: openai('gpt-4o'),
  middleware: cassetteMiddleware({
    mode: process.env.CASSETTE_MODE ?? 'live', // record | replay | live
    cassetteDir: './cassettes',
    redact: ['apiKey', 'authorization', /token/i],
  }),
});
describe('checkout agent', () => {
  it('runs the checkout flow', async () => {
    // Recorded once with CASSETTE_MODE=record against the live API.
    // withCassette pins this fixture and forces replay for its duration.
    await withCassette('checkout-flow.json', async () => {
      const result = await runCheckoutAgent({ model, prompt: 'buy a t-shirt' });
      // Deterministic, offline, free, and stream-accurate: the recorded
      // parts replay as a genuine ReadableStream, so streamText sees what
      // it would live, down to the chunk boundaries.
      expect(result.steps).toHaveLength(3);
    });
  });
});
// Change the prompt or a tool schema and the hash changes — replay
// misses and the test fails loudly:
// CassetteMissError: no cassette for sha256:abc123… in ./cassettes
// Re-record, commit the new cassette, move on.

Wrap the model once and record a cassette once. Every run after that is
deterministic, offline, free, and stream-accurate, replayed as a real ReadableStream. Change a prompt or tool schema and the
hash changes: replay misses and CI fails loudly with a CassetteMissError.
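To make the "hash changes, replay misses" behaviour concrete, here is a minimal sketch of how a cassette key could be derived from a request. It is an illustration only and does not describe tapedeck's internals: the `CassetteRequest` shape and the `canonicalize` and `cassetteKey` helpers are hypothetical names, and the real middleware may hash different fields.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical request shape (an assumption for this sketch); the real
// middleware may include more or fewer fields in its hash.
interface CassetteRequest {
  modelId: string;
  prompt: string;
  tools?: Record<string, unknown>;
}

// Serialize with sorted object keys so that property order never
// changes the resulting hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(',')}}`;
  }
  // JSON.stringify returns undefined for undefined input; map that to null.
  return JSON.stringify(value) ?? 'null';
}

// Derive a stable key: identical requests map to the same key, and any
// change to the prompt or a tool schema produces a different one.
function cassetteKey(request: CassetteRequest): string {
  const digest = createHash('sha256').update(canonicalize(request)).digest('hex');
  return `sha256:${digest}`;
}
```

Under this scheme, editing the prompt from 'buy a t-shirt' to anything else yields a new key, so a replay lookup finds no matching cassette and the test fails rather than silently replaying a stale recording.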
The wedge, in a few lines of diff
- const result = await runCheckoutAgent({
-   model: openai('gpt-4o'),
-   prompt: 'buy a t-shirt',
- });
+ const model = wrapLanguageModel({
+   model: openai('gpt-4o'),
+   middleware: cassetteMiddleware({ mode: process.env.CASSETTE_MODE ?? 'live' }),
+ });
+ await withCassette('checkout-flow.json', async () => {
+   const result = await runCheckoutAgent({ model, prompt: 'buy a t-shirt' });
+   expect(result.steps).toHaveLength(3);
+ });

Try it
Wrap your model, run the test once with CASSETTE_MODE=record against the live API, and commit the cassette. Set CASSETTE_MODE=replay in CI — your tests are now offline,
deterministic, and free.
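
Because the whole switch hangs on one environment variable, it may be worth validating that variable before handing it to the middleware. The sketch below is one possible way to do that; `CassetteMode` and `resolveCassetteMode` are hypothetical helpers written for this example, not part of tapedeck, and how the library itself treats an unrecognized value is not covered here.

```typescript
// The three modes named in the example above.
type CassetteMode = 'record' | 'replay' | 'live';

const MODES: readonly CassetteMode[] = ['record', 'replay', 'live'];

// Hypothetical helper: turn a raw env value into a validated mode.
// Unset or blank values fall back to the default; anything else that is
// not a known mode throws, so a typo fails fast instead of being passed on.
function resolveCassetteMode(
  raw: string | undefined,
  fallback: CassetteMode = 'live',
): CassetteMode {
  const normalized = (raw ?? '').trim().toLowerCase();
  if (normalized === '') {
    return fallback;
  }
  const match = MODES.find((mode) => mode === normalized);
  if (match === undefined) {
    throw new Error(
      `Invalid CASSETTE_MODE "${raw}"; expected one of: ${MODES.join(', ')}`,
    );
  }
  return match;
}
```

In the test file this would replace the inline `process.env.CASSETTE_MODE ?? 'live'` expression with `resolveCassetteMode(process.env.CASSETTE_MODE)`, which also narrows the value from a plain string to the three-mode union.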