Write the Tests. The Agent Handles the Rest.
"In theory, there is no difference between theory and practice. In practice, there is." — Jan L.A. van de Snepscheut
The last article established the machinery: neural networks are pure functions, agents are loops, context is the input. That's the theory. Now let's talk about what actually happens when you sit down to build with these things.
Most AI coding failures aren't model failures. They're input failures. The model did exactly what it would do given what you gave it — which is to say, it navigated the space of reasonable interpretations and picked one. If the space was large, you got something in the general neighborhood of what you wanted. If it was small and well-defined, you got what you wanted.
That's the whole game. Make the space small and well-defined.
Context Is the Only Lever You Have
You can't change the model's weights. You can't change what it was trained on. The only thing you control is what goes into the context window — and that turns out to be enough.
The goal isn't more context. It's better context. Anthropic's engineering team defines it as "the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome." Minimal doesn't mean short — it means free of low-signal content.
Here's the same task, two ways:
# Verbose, meandering
Hi! I need you to look at this code. So basically what's happening is we have this
function called processOrder and I think there might be a bug in it, but I'm not
totally sure. The function is supposed to handle order validation but sometimes it
returns null instead of throwing an error, I think. Can you take a look and maybe
fix it if there is a bug? Also the tests might be failing but I haven't run them recently.
# Precise, terse
Bug: processOrder() returns null instead of throwing on invalid input.
Expected: throw OrderValidationError with message.
File: src/orders/processor.ts:45
Verify: npm test -- --grep "processOrder"
Same underlying request. The first is full of hedges and approximations — "I think," "maybe," "I haven't run them recently." Each one is an invitation for the model to fill in a blank from training data, not from what you actually know. The second leaves almost no blanks. It's a specification.
Why Noise Is Worse Than Absence
Bad context isn't just less helpful — it's actively harmful. Research across 18 frontier models found that every one of them degrades as context grows, even when the window is far from full. Three mechanisms drive this:
Lost-in-the-middle effect. Information buried in the middle of a long context gets 30%+ worse recall than information at the start or end. Transformer attention isn't uniform: models weight the beginning (primacy) and the end (recency) of the sequence more heavily, so material in the middle gets less of it.
Attention dilution. At 100K tokens, there are ten billion pairwise attention relationships. The signal gets spread thin.
Distractor interference. This one is counterintuitive: content that is semantically similar to the target but wrong is more harmful than content that's completely unrelated. Adjacent concepts activate overlapping embedding clusters, so the model can't cleanly separate close-but-wrong from close-and-right. If you're debugging a payment validation function and include an old version of that same function in context, you're actively working against yourself.
The implication: cut ruthlessly. Not just for brevity — for signal. Include the task, the relevant files, and the active constraints. Cut the conversation history from last week and the design doc that no longer applies. When in doubt, leave it out.
Sub-Agents: Context Windows Are Cheap
The agent loop accumulates state. Every tool result, every intermediate step, every correction gets appended to the context. By the time you're deep into a complex task, the context is carrying a lot of weight — and not all of it is signal.
Sub-agents solve this by giving each subtask its own fresh context window.
The parent agent hands the sub-agent exactly what it needs to do one specific job — nothing more. The sub-agent does the work, accumulates whatever intermediate state it needs, and returns a compressed result. The parent never sees the mess. Its context stays focused on the top-level task.
Say you're building a code review pipeline: a parent orchestrates the work, delegating to one sub-agent that pulls the relevant diff, another that checks for security issues, and another that runs the tests. Each starts with a fresh context. The parent synthesizes three clean summaries — not three full context windows of accumulated tool output.
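That pipeline can be sketched in a few lines. This is a minimal illustration, not a framework: `callModel` stands in for a real LLM API call (here stubbed to return a canned string), and the prompts are placeholders.

```typescript
// Each sub-agent's result is compressed before the parent ever sees it.
type SubAgentResult = { task: string; summary: string };

// Stand-in for a real model call; a real implementation would hit an LLM API.
async function callModel(prompt: string): Promise<string> {
  return `summary of: ${prompt.slice(0, 40)}`; // stubbed for illustration
}

// A sub-agent gets ONLY its own brief — a fresh context window.
// Intermediate tool output and dead ends stay inside this function.
async function runSubAgent(task: string, brief: string): Promise<SubAgentResult> {
  const summary = await callModel(brief);
  return { task, summary }; // the parent sees only the compressed result
}

async function reviewPipeline(prNumber: number): Promise<string> {
  // Independent subtasks run in parallel, each in isolation.
  const results = await Promise.all([
    runSubAgent("diff", `Summarize the diff for PR #${prNumber}.`),
    runSubAgent("security", `List security issues in PR #${prNumber}.`),
    runSubAgent("tests", `Report test results for PR #${prNumber}.`),
  ]);
  // The parent synthesizes three clean summaries, not three full transcripts.
  const briefs = results.map(r => `${r.task}: ${r.summary}`).join("\n");
  return callModel(`Write a review from:\n${briefs}`);
}
```

The structural point is in the types: the parent's context only ever contains `SubAgentResult` summaries, never the sub-agents' working state.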
Anthropic's multi-agent research system — Claude Opus 4 orchestrating Claude Sonnet 4 sub-agents — reported more than 90% performance improvement over single-agent setups on open-ended research tasks. The gains were attributed to spreading reasoning across independent context windows rather than scaling a single one. Those numbers are specific to their internal system and workload; yours will vary. The mechanism, however, is sound.
Sub-agents also unlock parallelism. Independent subtasks can run simultaneously — no scheduling meetings required.
The cost: coordination overhead. More API calls, more latency, more surface area for failures to compound. Use sub-agents when context accumulation is the actual bottleneck — not as a default pattern for every task.
When to Start Fresh
Sub-agents handle context isolation automatically for subtasks. For the top-level session itself, that's your job.
Long sessions accumulate noise: dead ends, corrections, exploratory reasoning that looked relevant at the time. The model doesn't benefit from seeing how you got here. It benefits from a clean brief of where you are now. At natural inflection points — after a major task completes, before switching to something unrelated, when outputs start drifting in ways that are hard to explain — compact the context or start a new session.
The previous article described the model as a contractor who gets a fresh briefing document every time you talk. There's no shame in handing them a clean summary instead of the full transcript of everything that happened before they arrived. A good brief beats a long history.
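What "a clean summary instead of the full transcript" means mechanically can be shown with a toy compaction function. The `Event` shape and the keep/drop rules here are illustrative assumptions, not a real agent framework's API:

```typescript
// Toy session event log: what a long agent session accumulates.
type Event = { kind: "task" | "dead_end" | "correction" | "decision"; text: string };

// Compaction: distill the transcript into a brief for the next session.
// Keep durable decisions and the current task; drop dead ends and
// superseded corrections — the model doesn't need to see how you got here.
function compact(history: Event[]): string {
  const decisions = history
    .filter(e => e.kind === "decision")
    .map(e => `- ${e.text}`);
  const currentTask = [...history].reverse().find(e => e.kind === "task");
  return [
    `Current task: ${currentTask?.text ?? "none"}`,
    "Decisions so far:",
    ...decisions,
  ].join("\n");
}
```

A real compaction step would usually be a model call ("summarize this session for a fresh instance of yourself"), but the filtering principle is the same: state survives, process doesn't.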
Chaining Agents: Pipelines and Their Failure Modes
A chain of agents is a data pipeline where each stage is a model call. The same principles apply: clean interfaces between stages, validation at boundaries, clear contracts.
Four patterns show up across most multi-agent systems:
| Pattern | Shape | Best for |
|---|---|---|
| Prompt chaining | A → B → C | Fixed, sequential sub-steps |
| Orchestrator-workers | Hub → N spokes | Parallel specialized tasks |
| Evaluator-optimizer | Generator ↔ Critic loop | Well-defined quality criteria |
| Parallelization | N agents → merge | Multi-perspective analysis |
The failure mode to watch for is cascading bad output. If stage A produces something subtly wrong and stage B accepts it uncritically, stage C is now reasoning from bad premises. By the time you see the final output, the original error is buried under layers of downstream processing. This isn't a model problem — it's a pipeline design problem. Every ETL engineer has had this conversation.
The fix is validation at boundaries. The receiving agent checks output quality before passing it downstream. Failed validation halts and retries or escalates — it does not silently propagate.
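A boundary check is a small amount of code. This sketch assumes a generic `Stage` function type; the validator and retry policy are placeholders for whatever your pipeline actually checks (schema conformance, non-empty output, a critic model's verdict):

```typescript
// A pipeline stage: takes the previous stage's output, returns its own.
type Stage = (input: string) => Promise<string>;

// Run a stage, validating its output before it moves downstream.
// Failed validation retries; exhausted retries escalate — never
// silently propagate.
async function runStage(
  stage: Stage,
  input: string,
  validate: (output: string) => boolean,
  maxRetries = 2,
): Promise<string> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const output = await stage(input);
    if (validate(output)) return output; // only validated output moves on
  }
  throw new Error("Stage output failed validation; escalating instead of propagating.");
}
```

The error path matters more than the happy path: throwing here is the design. A stage that can't produce valid output should stop the pipeline, because the alternative is stage C reasoning from stage A's mistake.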
Start simple. A single-agent loop solves more than you'd expect. Add orchestration only when a simpler setup demonstrably falls short, and even then, add one layer at a time.
The System of Record Problem
Models drift.
The model you're prompting today won't be the model you're prompting in six months. Providers update weights, sometimes without version guarantees. The model doesn't file a changelog. Behavioral drift — gradual shifts in tone, adherence, and reliability even when your application code is unchanged — is real enough that Anthropic's own engineering team cites it as the primary reason to invest in evals: teams without them face weeks of manual testing every time a model changes; teams with them know in minutes. Your prompt history is not a durable spec.
Neither is the agent's output, on its own. Generated code does what it does on the day it ran, against the model it ran against. No contract, no guarantee. Run the same prompt against a different model version and you might get something subtly different. You won't know unless you have something to compare against.
The previous article made the case that structured, formally defined formats are more durable than natural language for anything that needs to mean the same thing next year as it does today — the same logic that drives every system of record from Sumerian accounting tablets to double-entry bookkeeping. The same argument applies here, and it points at a concrete target: your behavioral specifications should live in your test suite.
Tests in formally specified languages, running against versioned runtimes, are the durable record of what your system is supposed to do. They're executable. They're version-controlled. They don't depend on which model is behind the API today. When the model changes and something breaks, your tests tell you exactly what broke and what the expectation was. Without them, you're doing manual QA and hoping.
Recent research formalizes this as Test-Driven AI Agent Definition: write the behavioral spec as executable tests first, then let the agent generate code that passes them. The tests are the brief. The agent fills in the implementation. Give acceptance criteria up front — not after the fact, not "fix it until it feels right." Fair warning: writing a complete, correct behavioral spec is hard. It's arguably the hardest part of software development regardless of whether an agent is involved. The agent can only be as good as the tests you hand it. Garbage spec, garbage code — just a different kind of input failure.
The practical workflow looks like this:
// 1. Write the spec
// (import path assumes the test sits next to src/orders/processor.ts)
import { processOrder, OrderValidationError } from "./processor";

it("throws OrderValidationError on invalid input", () => {
  expect(() => processOrder({ items: [] })).toThrow(OrderValidationError);
});

// 2. Give the agent the test as context
// "Make this test pass. File: src/orders/processor.ts"

// 3. Run the tests. Feed results back into context. Repeat.
One caveat: the tests need to specify what the system does, not how the model achieves it. Tests tied to exact model phrasing or specific output format will break when the model changes even if the behavior is correct. Test the contract — observable inputs and outputs — not the implementation details.
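A concrete way to test the contract is to assert on what you can extract from the output, not on its exact wording. `extractTotal` below is a hypothetical helper, not part of any earlier example; the scenario assumes a model-generated order summary whose phrasing may shift between versions:

```typescript
// Contract: the summary must contain a dollar total we can parse.
// The exact sentence around it is the model's business, not ours.
function extractTotal(modelOutput: string): number {
  const match = modelOutput.match(/\$([0-9]+(?:\.[0-9]{2})?)/);
  if (!match) throw new Error("No total found in model output");
  return Number(match[1]);
}

// Both phrasings satisfy the contract, even though the wording differs:
extractTotal("The order total is $42.50."); // → 42.5
extractTotal("Total: $42.50 (3 items).");   // → 42.5
// A test pinned to either exact sentence would break on the other.
```

Tests written against `extractTotal`-style contracts survive model swaps; tests written with `toBe("The order total is $42.50.")` do not.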
The agent writes the code. Your tests tell it whether the code is right. You own the spec. The model is the execution engine.
The Short Version
Make the context small and precise. Distractor interference means close-but-wrong is more harmful than unrelated. Write a spec, not a stream of consciousness. When in doubt, cut it.
Use sub-agents for complex tasks. Not because it's elegant, but because each subtask gets fresh context instead of inheriting the accumulated noise of everything before it. The tradeoff is coordination overhead — worth it when context bloat is the actual bottleneck.
Validate between pipeline stages. Bad output doesn't stay local — it propagates, compounds, and buries the original error by the time the pipeline finishes. Catch it at the boundary.
Write the tests first. The model doesn't have a version contract — your tests run against the same runtime next year, your prompt runs against whatever the provider deployed last Tuesday. Behavioral specs belong in code.
LLMs are good at generating things. They are not good at remembering, maintaining consistency over time, or keeping your system's behavioral contract. Those are your job. The agent handles the rest.