The Boring Truth About AI Coding Agents
Updated March 23, 2026
"Any sufficiently advanced technology is indistinguishable from magic." — Arthur C. Clarke
You describe what you want. The agent writes code, edits files, runs tests — sometimes across dozens of steps without you touching the keyboard. When it works, it feels like magic.
The problem with magic is that you can't debug it. When the output is wrong and your mental model is "smart AI thing does smart AI stuff," you're stuck — tweaking prompts, crossing your fingers, hoping. If it works, you don't know why. If it doesn't, you don't know what to change.
The hype compounds this. A machine that does something surprisingly useful invites a much larger claim: that it can do everything an experienced developer can do. That's a big leap. These tools are powerful and they will surprise you — but they're still machines, with specific architectures, specific strengths, and real limitations. Surprising capability is not the same as general capability.
None of this is as new as the hype suggests. Every major tech wave has a few genuinely powerful ideas surrounded by noise. JavaScript frameworks come and go, but they're all built on the DOM and the observer pattern — ideas that predate all of them. Cloud computing rewired how we think about infrastructure, but at its core it's someone else's computer with an API. Autocomplete, refactoring, static analysis — each wave of IDE tooling felt transformative and each turned out to be a tool that makes you faster at what you already know. Prior knowledge didn't become obsolete. It became the foundation.
AI coding tools are no different. You don't need to know how to build a database engine to use one effectively — but knowing how indexes work changes how you write queries, and knowing how transactions work helps you avoid consistency bugs. The internals aren't required knowledge; they're leverage. The same is true here. You don't need to implement a neural network to use an LLM, but understanding what the layers do is what turns it from a black box into a force multiplier. And at the bottom of those layers, the interface is just a function call.
We're going to start at the bedrock and build up through the layers — from neural networks at the foundation to the way you interact with agents at the surface. Each layer will look familiar. They're all built from things you already know.
The Stack, Before We Ascend
Before we start climbing, here's the map of what we're ascending through:

- Layer 5: Multiple agents — function composition
- Layer 4: The agent — a loop around the model
- Layer 3: Context — the input
- Layer 2: The LLM — deterministic layers plus intentional randomness
- Layer 1: The neural network — a pure function

When you write a prompt, you're at the top of this stack. Everything below is machinery that processes that input and produces a response.
One thing worth noting before we begin: the neural network at the foundation is where all the interesting things happen. The layers above it — agents, tools, multi-agent orchestration — are mostly interfaces and glue. Powerful glue, but glue. Everything above the network is there to manage what goes in and what comes out. Keep that in mind as we build upward.
Layer 1: The Neural Network Is a Function
You don't need an ML background for what follows. If you've ever worked with functions, you already have the right mental model.
At the bottom of the stack is a trained neural network. And a trained neural network is a pure function.
f(input) → output
Feed in the same input, get the same output. Every time. No side effects. No hidden state that changes between calls. It's a composition of mathematical transformations stacked in layers — the kind of thing that looks imposing on a whiteboard but is conceptually no different from any other function you've worked with.
Yes, it's a complex function — a lot of computation happening inside. But complexity of implementation doesn't change what something is at the interface level. You call bcrypt.hash() without knowing the Blowfish cipher internals. You call zlib.compress() without understanding Deflate. You use database drivers, rendering engines, and cryptography libraries every day without reading their source. The neural network is the same: a function from a library you didn't write, doing something sophisticated under the hood, callable with an input and returning an output.
Each of those layered transformations has numeric values baked into it — model parameters (or weights). Before training, all those parameters are unknowns — variables. When people say "this model has hundreds of billions of parameters," they mean the network has that many knobs to set before it can do anything useful. An untrained network is a function that needs them all:
// Before training: needs weights *and* input
function network(weights: number[], input: string): string { ... }
Training is the search for the weight values that make the network perform well. Once found, those parameters are fixed — frozen into the function definition. The function goes from needing billions of arguments down to one:
// Training locks in the weights — like partial application
const trainedModel = (input: string) => network(trainedWeights, input);
// At inference time, you supply exactly one argument: your prompt
trainedModel("What is the capital of France?");
For an LLM specifically, that one remaining parameter is the context window — your full input prompt. That's it.
The key distinction is between training (finding and fixing parameters) and inference (calling the fixed function):
At inference time, the model doesn't "think." It evaluates. The weights are frozen — what you're calling is a fixed function. This is counterintuitive because the outputs feel generative and open-ended. But a sufficiently complex deterministic function can produce outputs that feel unbounded. A neural network with hundreds of billions of parameters is an enormous function. That doesn't make it non-deterministic — it makes it unpredictable, which is a different thing. More on that shortly.
One question worth addressing before going further: if a neural network is just doing math on numbers, how does it process text? The same way your computer always has — by mapping text to numbers. Just like UTF-8 maps every character to a numerical code, an LLM has a vocabulary: a fixed mapping of words, word fragments, and punctuation to numerical IDs called tokens. Your prompt gets converted to a sequence of token IDs, the network processes those numbers, and the output numbers get mapped back to text. The encoding concept is identical — just a different table.
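To make the mapping concrete, here's a toy sketch. The vocabulary and token IDs below are invented, and the greedy longest-match loop is a simplification of how real subword tokenizers (byte-pair encoding and friends) segment text — but the shape of the operation is the same: text in, numbers out.

```typescript
// Toy vocabulary. Real models learn ~100,000 entries; these IDs are invented.
const vocab = new Map<string, number>([
  ["Hello", 1], [",", 2], [" world", 3], ["!", 4], [" wor", 5], ["ld", 6],
]);

// Greedy longest-match tokenization: at each position, consume the longest
// vocabulary entry that matches, and emit its numeric ID.
function tokenize(text: string): number[] {
  const ids: number[] = [];
  let i = 0;
  while (i < text.length) {
    let match = "";
    for (const piece of vocab.keys()) {
      if (text.startsWith(piece, i) && piece.length > match.length) match = piece;
    }
    if (match === "") throw new Error(`no token for input at position ${i}`);
    ids.push(vocab.get(match)!);
    i += match.length;
  }
  return ids;
}

tokenize("Hello, world!"); // → [1, 2, 3, 4]
```

Note that " world" wins over the shorter " wor" because the match is greedy — the same reason real tokenizers produce fewer, longer tokens for common words.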
If you want to go much deeper on how these networks are actually built, Build a Large Language Model from Scratch is the best hands-on reference.
Training: Parameter Search
If the trained network is a pure function, training is the process of finding the right constants to bake into it.
Think of the parameters as knobs. Each knob can be set to some value, and the combination of all knob settings determines how the model behaves. Training is the search for the knob settings that make the model perform well.
How do you measure "perform well"? With a reward function — a score that tells you how good the model's outputs are relative to expected outputs.
A simple example: you show the model the prompt "Paris is the capital of ___" and the expected answer is "France." If the model outputs "France," the reward is high. If it outputs "Germany," the reward is low. Train across millions of examples like this, and the parameters that consistently produce high rewards are the ones you want.
On a simple enough problem, you could try every possible combination of knob values, score each one, and pick the best. But real models have hundreds of billions of parameters, each a floating-point number that can take billions of distinct values. Even if you restricted each parameter to just a handful of candidate values, the number of possible combinations would dwarf the number of atoms in the observable universe. Enumeration isn't just impractical; it's not a real option.
One practical approach is stochastic gradient descent (SGD). Imagine you're trying to find the top of a mountain in thick fog. You can't see the peak, but you can feel the slope under your feet. SGD says: find the direction that's steepest uphill, take a step that way, repeat. You don't evaluate every possible path — you follow the slope. (In practice, training minimizes a loss rather than maximizing a reward — descending into a valley rather than climbing a peak — but it's the same search, mirrored.) The "stochastic" (random) part is in how the slope is estimated: at each step it's computed from a small random sample of the training data rather than the whole corpus, so every estimate is a little noisy. Two runs on the same data will take different paths through parameter space.
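Here's the fog-covered mountain in one dimension. Everything below is invented for illustration — real training adjusts billions of parameters against a loss computed over batches of data — but the loop structure is the real thing: estimate the slope, step along it, repeat.

```typescript
// One-dimensional "mountain": the reward -(x - 3)² peaks at x = 3.
// Its derivative tells us which way is uphill from any point.
const slope = (x: number) => -2 * (x - 3);

function sgd(steps: number, learningRate: number): number {
  let x = 0; // arbitrary starting point in parameter space
  for (let i = 0; i < steps; i++) {
    // "Stochastic": the slope is estimated from a random data sample, so
    // each estimate is the true slope plus noise — simulated directly here.
    const noisySlope = slope(x) + (Math.random() - 0.5);
    x += learningRate * noisySlope; // step uphill
  }
  return x; // lands near the peak at x ≈ 3, never exactly on it
}
```

The noise is why two runs land in slightly different places, and the step size (learning rate) is why training is a sequence of small corrections rather than one jump to the answer.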
One implication of the training-on-a-corpus approach is easy to overlook: the model's understanding of language is frozen at the time it was trained. Natural language is a moving target — slang evolves, phrases shift meaning, new terms emerge. Your kids use expressions your parents never heard. The corpus a model trains on captures a snapshot of language at a moment in time, and that shapes how it interprets your prompts. This means context that works well with one model may not work with another trained on different data, and context that works today may not work with a future model trained on a different corpus.
This has a practical implication for anything you intend as long-lived model input: use formally defined languages where you can. Programming languages have fixed, versioned semantics. JSON schemas, API specifications, configuration formats — these mean exactly what they say, regardless of when or where they're read. Dijkstra made this argument about natural language programming in 1978 and it applies just as well here: formal notation rules out the ambiguity that natural language makes almost impossible to avoid. If you're writing prompts or context intended to drive consistent behavior across models and time, lean on structured formats for the parts where precision matters.
The AI ecosystem is already voting with its feet on this. Structured tool calling — where the model outputs a typed JSON schema describing what it wants, rather than asking in plain English — is now standard across every major provider. MCP (Model Context Protocol) formalizes this further: a structured protocol for connecting models to external systems, explicitly designed so that intent is machine-readable and unambiguous. The pattern is consistent: natural language for open-ended reasoning, structured formats wherever precision matters.
This isn't a new insight. Systems of record have always been structured — not arbitrary prose. The earliest known writing wasn't literature or correspondence; it was Sumerian accounting tablets, recording quantities of grain and livestock in compact, unambiguous notation. Double-entry bookkeeping, formalized in 15th-century Venice, applied the same principle: a structure explicit enough that the records mean the same thing next year as they do today, to anyone who reads them. The goal — reduce ambiguity, keep records compact and durable — is identical. Computers didn't invent this tradeoff. They just made the stakes higher.
Layer 2: The LLM Adds Intentional Randomness
An LLM's trained layers are deterministic. Same input, same output — for those layers.
But LLMs deliberately add a randomness step on top.
Here's how generation works: the network processes your input and produces a probability distribution across its vocabulary — typically around 100,000 tokens (words, word fragments, punctuation). Each of those tokens gets a probability of being the next one. Common, contextually appropriate tokens get high probabilities. Unusual or off-topic ones get low ones. The network's job ends there — it doesn't pick a word, it runs the popularity contest. Then a separate sampling step picks the actual next token from that distribution.
That sampling step is where the non-determinism lives. The main control is temperature:
- Low temperature is like looking up a definition. The model strongly favors its highest-probability option. Given the same input, you'll get nearly the same output every time.
- High temperature is like picking words out of a hat. The probabilities are flattened, and lower-probability tokens get more of a chance. Useful when you want diverse, creative responses.
- Temperature = 0 skips the lottery entirely — always pick the most probable token. Same input → same output, every time, at least in principle: real serving stacks can still show slight run-to-run variation from batching and floating-point effects.
To make that concrete: given the prompt "The best way to fix a bug is to," a low-temperature model reliably returns something like "identify the root cause, write a failing test, then fix the code." Crank the temperature up and you might get "delete the feature. No feature, no bug." Technically not wrong.
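The mechanics fit in a few lines. The three candidate tokens and their scores below are invented; the temperature-scaled softmax is the standard formulation.

```typescript
// Toy next-token scores. A real model scores ~100,000 tokens at once.
const logits: [string, number][] = [
  ["identify", 2.0], ["reproduce", 1.0], ["delete", 0.1],
];

// Temperature-scaled softmax: divide each score by T, exponentiate, normalize.
// T < 1 sharpens the distribution; T > 1 flattens it.
function probabilities(temperature: number): [string, number][] {
  const weights = logits.map(([, score]) => Math.exp(score / temperature));
  const total = weights.reduce((a, b) => a + b, 0);
  return logits.map(([token], i) => [token, weights[i] / total] as [string, number]);
}

function sample(temperature: number): string {
  if (temperature === 0) {
    // Skip the lottery: always take the highest-scoring token.
    return logits.reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];
  }
  // Spin the weighted roulette wheel over the scaled distribution.
  let r = Math.random();
  for (const [token, p] of probabilities(temperature)) {
    r -= p;
    if (r <= 0) return token;
  }
  return logits[logits.length - 1][0]; // guard against rounding
}
```

At T = 0.5, "identify" dominates; at T = 2, "delete the feature" starts getting real odds. The network's output never changed — only the lottery did.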
Now, about that unpredictability mentioned in Layer 1. LLMs are chaotic in the technical sense: research treating LLMs as dynamical systems shows that prompts that are nearly identical in meaning can diverge rapidly in output. Change a single word, rephrase a sentence, reorder a list — and the response can shift dramatically. The sensitivity is what makes the model flexible and generative, but it cuts both ways: prompts are specifications, not suggestions. Vague input produces unpredictable output — not because the model is misbehaving, but because you've given it a large space of reasonable interpretations to roam around in. This is also why temperature exists: at zero, you get the most probable path, but "most probable" isn't always the most useful. A controlled amount of randomness lets the model explore adjacent paths, which is often exactly what you want.
Layer 3: Context Is the Input
Context is the complete input to the model at any given call. System prompt, conversation history, retrieved documents, tool results — all of it is assembled into a single block of text and handed to the model. That block is the context window.
Think of it like briefing a contractor who has no memory of your previous conversations. Every time you talk, you hand them a fresh document containing everything they need to know — background, prior decisions, current task, constraints. The model is that contractor. The context window is that document. And like any briefing document, the quality of the output is directly proportional to the quality of the input: it needs to be succinct, comprehensive, and unambiguous. Too sparse and the model fills in the gaps with guesses. Too verbose and the signal gets buried in noise. Contradictory and you get a coin flip.
The model has no memory outside the context window. What isn't in the context doesn't exist, as far as the model is concerned. No persistent state. No long-term memory. No grudges.
This has a clean implication: context engineering is input engineering. If the model isn't behaving the way you want, the answer is in the context. Treating a bad output as mysterious or random is a dead end — treat it as a debugging problem and look at the input.
Context windows have a size limit, measured in tokens. Current models support anywhere from tens of thousands to a few million tokens. This is a real constraint; it forces tradeoffs in long-running tasks and is a core reason why being succinct in your context matters.
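A sketch of what "assembling the briefing document" looks like under a token budget. The part labels and priorities are illustrative, and the token estimate is a crude word split standing in for a real tokenizer:

```typescript
interface ContextPart { label: string; text: string; priority: number }

// Crude stand-in for a tokenizer — real token counts come from the
// model's own vocabulary, not a whitespace split.
const estimateTokens = (text: string) => text.split(/\s+/).length;

// Assemble the single input block the model will see, dropping the
// lowest-priority parts first once the budget is exceeded.
function assembleContext(parts: ContextPart[], budget: number): string {
  const kept: ContextPart[] = [];
  let used = 0;
  for (const part of [...parts].sort((a, b) => b.priority - a.priority)) {
    const cost = estimateTokens(part.text);
    if (used + cost <= budget) { kept.push(part); used += cost; }
  }
  return kept.map((p) => `## ${p.label}\n${p.text}`).join("\n\n");
}

const context = assembleContext([
  { label: "System prompt", text: "You are a careful code reviewer.", priority: 3 },
  { label: "Retrieved docs", text: "API reference for the payments module.", priority: 1 },
  { label: "User request", text: "Review this diff for consistency bugs.", priority: 2 },
], 50);
```

The priority ordering is the interesting design decision: when the budget bites, something has to go, and deciding what goes is exactly the tradeoff the paragraph above describes.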
Layer 4: An Agent Is a Loop
An agent wraps an LLM with a loop and a set of tools. That's it. An agent is a while loop with delusions of grandeur.
The loop looks roughly like this:
let done = false;
while (!done) {
  const response = await llm.call(context);
  if (response.hasToolCall()) {
    const result = await executeTool(response.toolCall);
    context.append(result);
  } else {
    done = true;
  }
}
That's the core agent architecture. Not much to it.
The response.hasToolCall() check deserves unpacking, because the LLM doesn't actually call functions. What it does is output specially formatted text — typically JSON — that describes a function call: the name of the function and the arguments to pass. Something like:
{ "tool": "read_file", "args": { "path": "src/index.ts" } }
The application code detects this pattern, extracts the name and arguments, calls the actual function, and appends the result back into the context for the next LLM call. The model never leaves the input-output model. It produces output; your code does the work; the result comes back as more input.
What counts as a tool? CLI applications, HTTP APIs, databases, filesystems — any external system the agent can interact with. A tool call might translate to "run this shell command," "make this HTTP request," or "query this database." The agent itself has none of these capabilities built in; the agent framework handles the translation from JSON to actual system call, and pipes the result back into context.
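That translation layer can be sketched in a few lines. The tool names and handlers here are illustrative stand-ins — a real `read_file` would hit the filesystem — but the detect-extract-execute shape is the whole trick:

```typescript
type ToolCall = { tool: string; args: Record<string, string> };

// Registry mapping tool names to actual functions — the framework's
// translation from JSON to real system calls. These handlers are fakes.
const tools: Record<string, (args: Record<string, string>) => string> = {
  read_file: (args) => `contents of ${args.path}`,
  run_shell: (args) => `ran: ${args.command}`,
};

// The model "calls" a tool by emitting JSON; the application does the rest.
function dispatch(modelOutput: string): string {
  const call: ToolCall = JSON.parse(modelOutput);
  const handler = tools[call.tool];
  if (!handler) return `Error: unknown tool "${call.tool}"`;
  return handler(call.args); // this result gets appended back into context
}

dispatch('{ "tool": "read_file", "args": { "path": "src/index.ts" } }');
// → "contents of src/index.ts"
```

Note the error path: an unknown tool name produces a result string like any other, which flows back into context so the model can see what went wrong and correct itself.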
This matters for how you design tools. The model reads your tool schemas as part of its context — they need to be as clear as any other part of that briefing document. Here's what that looks like in practice:
// Too vague — the model has to guess what this does and when to use it
{
  "name": "process",
  "description": "processes the input",
  "parameters": {
    "x": { "type": "string" }
  }
}

// Clear — the model knows exactly what it does, when to use it, and what to pass
{
  "name": "run_tests",
  "description": "Run the test suite and return stdout/stderr. Use this to verify that code changes don't break existing behavior.",
  "parameters": {
    "directory": {
      "type": "string",
      "description": "Directory to run tests in. Defaults to the project root."
    }
  }
}
Tool design is API design. A poorly named function with an ambiguous description will mislead the model the same way it would mislead a human developer skimming the docs.
Closing the Loop: Code Correctness
Writing code is one thing. Knowing whether it's correct is another.
An agent can generate a function, a test, a migration script — but it has no idea if any of it actually works until it runs it. The loop only closes when the execution results come back into context. That's when the model can assess whether the code did what it was supposed to do, and if not, why.
This means the quality of the feedback you feed back in is critical. For the model to diagnose and fix a problem, it needs to see:
- Return values and output — what the code actually produced
- Test failure messages — the specific assertion that failed, not just "tests failed"
- Application logs — what happened during execution
- Effects on external systems — what was sent to them, what state they're in now, what your code read back
The model's ability to fix a problem scales directly with the richness of this context. Some failures are common enough to be in the training data — the model recognizes them immediately and knows the fix. Others aren't. For those, the execution output is the information. Without it, the model is guessing.
Two things compound the value here enormously: automated tests and observability. Granular tests give the model precise, machine-readable feedback on exactly what broke and where. Good observability surfaces what's happening inside the system during execution — logs, traces, state — giving the model the context it needs to reason about failures it's never seen before. The better your test coverage and the richer your instrumentation, the more effective an agent is at the debugging loop. It's not a coincidence that the same practices that make systems maintainable for humans make them debuggable for agents.
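One concrete shape this can take — formatting test results into feedback specific enough for the model to act on. The `TestResult` type and field names are illustrative, not any particular test framework's output:

```typescript
interface TestResult { name: string; passed: boolean; message?: string; logs?: string[] }

// Turn raw results into feedback the model can act on: the specific
// failing assertions and their logs, not just "tests failed".
function formatFeedback(results: TestResult[]): string {
  const failures = results.filter((r) => !r.passed);
  if (failures.length === 0) return "All tests passed.";
  return failures
    .map((f) => [
      `FAIL ${f.name}`,
      `  assertion: ${f.message ?? "(no message)"}`,
      ...(f.logs ?? []).map((line) => `  log: ${line}`),
    ].join("\n"))
    .join("\n\n");
}
```

The difference between feeding this back versus a bare exit code is the difference between the model diagnosing the failure and the model guessing at it.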
Layer 5: Multiple Agents Are Function Composition
A multi-agent system is multiple agents chained together, where the output of one becomes the input to the next.
agent_1(context_A) → output_A
agent_2(context_B + output_A) → output_B
agent_3(context_C + output_B) → final_output
It's function composition — the same concept as piping commands in a shell or chaining transformations in a data pipeline.
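The chain above can be sketched directly. Each stage gets its own base context plus the previous stage's output, matching the agent_n(context_n + output_n−1) shape:

```typescript
type Agent = (context: string) => Promise<string>;

// Compose agents like async functions: each stage's own base context is
// combined with the previous stage's output before the call.
function pipeline(stages: { baseContext: string; agent: Agent }[]): Agent {
  return async (input: string) => {
    let carry = input;
    for (const { baseContext, agent } of stages) {
      carry = await agent(`${baseContext}\n${carry}`);
    }
    return carry;
  };
}
```

The composed pipeline is itself just another `Agent` — a function from context to output — which is why these systems nest so naturally.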
Why use multiple agents at all? This is worth unpacking carefully, because the naive answer — "more agents = better results" — is wrong.
Identical agents given the same starting context will produce similar outputs. The sampling layer introduces some variation (different random token draws each run), but it's shallow diversity: the same model reasoning from the same position, exploring nearby paths in the probability space. For easy tasks, that's often enough. For hard ones, it isn't. You're just running the same search multiple times from the same starting point.
The more powerful pattern is multistart optimization — a technique from search and optimization theory. In any search problem, starting from a single point means you'll find the nearest local optimum, which may not be the best one overall. Start from many different points, and you improve your chances of finding a genuinely better solution. In agentic systems, "different starting point" means different context, different framing, different role.
Concretely: Agent A gets the system prompt "You are a skeptical code reviewer — your job is to find every assumption that could be wrong." Agent B gets "You are a pragmatic implementer — find the simplest path to a working solution." Same user request, completely different starting positions, different reasoning paths. A third agent (or the human) evaluates both. That's structured diversity — not sampling noise.
Recent research confirms the value of this: two agents with diverse configurations can match or outperform sixteen homogeneous ones. Scaling identical agents hits diminishing returns fast, because correlated agents make correlated errors. Diversity is the variable that matters.
Sub-agents and Context Isolation
A related pattern is the sub-agent: an agent spawned by a parent agent to handle a specific subtask, with its own isolated context.
Why isolated? Because context influences output — everything in it. A main agent working on a complex task accumulates a long context window: prior tool results, intermediate reasoning, dead ends, corrections. Handing all of that to a sub-agent doesn't help it do its job; it just gives it a bunch of noise to reason around. Worse, the sub-agent's own intermediate work — its false starts and exploratory output — would pollute the main agent's context if allowed to flow back in wholesale.
The clean solution is to give the sub-agent a focused briefing: exactly what it needs to do its job, nothing more. It completes the task, returns a result, and that result gets appended to the main agent's context. Same input-output model as everything else. The isolation is deliberate — a clean context is a more predictable context.
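In code, the isolation boundary is just function scope. A minimal sketch, with all names illustrative:

```typescript
type SubAgent = (briefing: string) => Promise<string>;

// Delegation with context isolation: the sub-agent sees only its briefing,
// never the parent's accumulated history, and only its final result
// flows back into the parent's context.
async function delegate(
  subAgent: SubAgent,
  briefing: string,        // exactly what the subtask needs, nothing more
  parentContext: string[], // the main agent's running context
): Promise<string[]> {
  const result = await subAgent(briefing); // isolated run: no parent noise
  parentContext.push(`Sub-agent result: ${result}`); // only the result returns
  return parentContext;
}
```

The sub-agent's false starts and intermediate output live and die inside that one call — the parent never sees them.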
Coming Back Up
We started at the bedrock and we're back at the surface. The question is: what do you actually do differently?
Diagnose output problems as input problems. The model isn't having a bad day. Something in the context is incomplete, ambiguous, or wrong. Find it and fix it — the same way you'd debug any function that's producing bad output given bad input.
Be a good brief-writer. Context is the only lever you have. Make it succinct, comprehensive, and unambiguous. Assume the model has read it exactly once and has to act on it immediately — because that's exactly what happens.
Use temperature intentionally. For deterministic tasks (extract, classify, format, verify), keep it low. For generative tasks (brainstorm, synthesize, propose alternatives), let it run. Temperature = 0 is not always the right answer, but it should always be a conscious choice.
Design tools like you're writing docs for a careful but literal developer. Clear names, unambiguous parameter descriptions, explicit guidance on when to use each tool. That schema is part of the context — it's subject to the same rules.
When using multiple agents, give them genuinely different starting contexts. Sampling variation alone won't get you there for hard problems. Different roles, different framings, different constraints. Start the search from different positions.
The machinery underneath agentic systems isn't new. It's functions, inputs, outputs, and composition — the same primitives you've been working with since you wrote your first program. The scale is different. The inputs are stranger. But the model is the same.
It's functions all the way down.