The first time I sent a task to a background agent and closed my laptop, it felt wrong. Like leaving the stove on. I had spent two years learning to watch AI tools work, reading every diff as it appeared, ready to hit escape the moment it went sideways. Now I was supposed to describe a task, walk away, and come back to a finished pull request.
It worked. The agent cloned the repo, ran the setup, made the change across four files, ran the tests, and opened a draft PR with a summary of what it did. I reviewed it over coffee twenty minutes later. That was the moment the shape of my work changed.
This is the part of agentic development that people are still catching up to. Not “the AI writes code in my editor” but “the AI writes code somewhere else while I am not looking.” Background agents are a different tool with a different mental model, and using them well is mostly about knowing what to hand off and what to keep close.
What a Background Coding Agent Actually Is
Let me be precise, because the terminology is a mess right now.
A foreground agent runs where you are. You give it a task, you watch it work, you correct it in real time. This is the agentic coding loop most developers got comfortable with over the last eighteen months. It runs in your terminal or your IDE, against your actual filesystem, and you are in the loop the whole time.
A background agent runs somewhere else, usually a fresh cloud VM, and reports back when it is done. You describe the task, it spins up an isolated environment, clones your repo at the current HEAD of a branch, runs your setup commands, makes the change, runs your checks, and opens a draft pull request. You were not watching. You find out it finished when the PR notification lands.
The session length data tells the story. Average coding agent sessions went from about 4 minutes in early 2025 to 23 minutes in early 2026. Sessions got longer because agents stopped needing you to babysit every step. Multi-file edits went from 34% of sessions to 78% in the same window. The work got bigger and more autonomous at the same time.
Every major tool now ships some version of this. GitHub Copilot has a coding agent that opens draft PRs from issues. Cursor has cloud agents that run in isolated VMs. OpenAI’s Codex has cloud tasks. Claude Code has async background runs. The category went from “experimental” to “default expectation” in about a year.
The Three Tiers I Actually Use
I think about agent work in three tiers now, and the tier determines where the work runs.
Tier one is interactive. This is foreground work where I am present and correcting in real time. I use it for anything I do not fully understand yet, anything touching code I am nervous about, and anything where the feedback loop matters more than the throughput. Debugging a weird production issue lives here. So does any change to auth, billing, or data migrations.
Tier two is parallel sprints. This is where I fire off two or three background agents on independent tasks and let them run while I work on something else in the foreground. The key word is independent. If the tasks touch the same files, I am setting myself up for a merge nightmare, so I only parallelize work with clean boundaries.
Tier three is the overnight backlog drain. This is the one that still feels slightly magical. I queue up a batch of well-scoped, low-risk tasks before I stop for the day, and review the resulting draft PRs in the morning. Documentation patches, test coverage gaps, dependency bumps, small refactors that follow an existing pattern. Boring work that adds up.
Most developers I know who use background agents seriously use all three tiers. The mistake is treating background agents as a replacement for foreground work rather than a different tool for a different kind of task.
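If it helps to see the tiers as a decision rather than a vibe, here is a rough sketch of how I route a task. Everything in it is an assumption made for illustration: the `Tier` names, the area labels, and the four yes/no inputs are my own shorthand, not something any tool exposes, and real routing involves more judgment than four flags.

```python
from enum import Enum


class Tier(Enum):
    INTERACTIVE = "foreground, with me watching"
    PARALLEL = "background, alongside foreground work"
    OVERNIGHT = "background, queued for morning review"


# Areas that stay in the foreground no matter what. The labels are made up.
SENSITIVE_AREAS = {"auth", "billing", "data-migration"}


def choose_tier(understood: bool, areas: set[str],
                independent: bool, low_risk: bool) -> Tier:
    """One possible reading of the three tiers as a routing rule."""
    # Anything I do not fully understand, or that touches a sensitive
    # area, stays interactive.
    if not understood or areas & SENSITIVE_AREAS:
        return Tier.INTERACTIVE
    # Well-understood, low-risk work can wait for the overnight batch.
    if low_risk:
        return Tier.OVERNIGHT
    # Riskier but independent work runs in parallel while I am around.
    if independent:
        return Tier.PARALLEL
    # Work that overlaps with other tasks is not worth parallelizing.
    return Tier.INTERACTIVE
```

For example, `choose_tier(True, {"docs"}, True, True)` lands in the overnight tier, while the same call with `{"billing"}` stays interactive.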
What Makes a Task Cloud-Ready
This is the whole game. The difference between background agents saving you hours and background agents creating a pile of broken PRs is almost entirely about task selection.
A task is cloud-ready when you can describe it completely in a prompt. That means it has five things:
- A clear scope. One feature, one fix, one refactor. Not “improve the codebase.”
- The files or area it touches. Either you name them or the agent can find them from a clear description.
- A setup path. The setup commands that get a fresh machine to a working state, usually your install and build steps.
- A test command. Something the agent can run to know whether it succeeded.
- A stop condition. A definition of done the agent can actually evaluate.
If a task is missing any of these, it is not ready for a background agent. The honest test I use: if I cannot explain the task in a prompt with scope, files, setup, checks, and a stop condition, I either keep it local or I ask the agent to produce a plan first and I review that before letting it run.
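The five-item checklist can be written down as a small pre-flight check. This is a minimal sketch, not any tool's real task format: `TaskSpec`, its field names, and the example commands are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task description; the field names are illustrative."""
    scope: str = ""                                    # one feature, fix, or refactor
    files: list[str] = field(default_factory=list)     # paths or areas it touches
    setup_commands: list[str] = field(default_factory=list)
    test_command: str = ""                             # how the agent checks success
    stop_condition: str = ""                           # definition of done


def missing_fields(task: TaskSpec) -> list[str]:
    """Return the names of the five required items that are still empty."""
    required = {
        "scope": task.scope.strip(),
        "files": task.files,
        "setup_commands": task.setup_commands,
        "test_command": task.test_command.strip(),
        "stop_condition": task.stop_condition.strip(),
    }
    return [name for name, value in required.items() if not value]


def is_cloud_ready(task: TaskSpec) -> bool:
    """A task is ready to hand off only when nothing is missing."""
    return not missing_fields(task)
```

A spec like `TaskSpec(scope="improve the codebase")` fails the check with four missing fields. It also shows the limit of the sketch: the check sees that a scope was written, not whether it is a good one. That part is still on you.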
Tasks that fit this shape beautifully: writing tests for existing code, refactors that follow a known pattern, documentation updates, fixing a well-described bug, security review follow-ups, dependency upgrades, and turning a detailed issue into a PR-ready branch. These all have clear acceptance criteria and do not depend on the messy state of your laptop.
Tasks that do not fit: anything exploratory, anything where the requirements emerge as you work, anything that depends on local services or browser state, and anything where you would not be able to tell from the diff alone whether it is correct.
When to Keep It Local Instead
Background agents are not the answer to everything, and the failure mode of over-delegating is real. Some work belongs on your machine, in the foreground, where you can see it.
Keep it local when the work depends on your current filesystem state, uncommitted changes, local services, desktop browser state, or private tools the cloud VM cannot reach. The classic example is the bug that only reproduces on your machine at 11:47 at night. A background agent in a clean VM will never see it. You need to be there, running the same commands against the same broken state.
Keep it local when you need a tight inspect-run-edit loop. Some debugging is a conversation. You run, you look, you tweak, you run again, ten times in five minutes. Shipping that to a cloud VM that takes a minute to spin up each time is slower, not faster. The latency of the round trip kills you.
Keep it local when you do not yet understand the problem. Background agents are for executing well-understood work, not for figuring out what the work is. If you cannot write the spec, you cannot delegate the task. This is the same reason context engineering matters so much: what the agent knows going in determines what comes out, and a background agent gets exactly one shot at the context you gave it.
There is a decision framework I keep coming back to for this question, the vibe ceiling: the point where AI help stops making you faster and starts making you slower on your own mature codebase. Background agents raise the ceiling for well-scoped work and lower it for everything else. Know which side of the line your task is on.
Setting Up So Agents Do Not Fight Each Other
The most productive background agent setup is boring in the best way. Reproducible setup scripts, clear instructions, small tasks, and non-overlapping file ownership.
The single highest-leverage thing you can do is write a good instructions file. Most tools read an AGENTS.md or equivalent at the repo root. This is where you put the stuff a new contributor would need to know: how to install, how to run tests, the conventions you care about, the things that always break. A background agent starting in a fresh VM has none of your accumulated context. The instructions file is how you give it some.
The second thing is non-overlapping file ownership when you run agents in parallel. If two agents are editing the same module, you are going to get conflicting diffs and waste the time you thought you were saving. I scope parallel tasks so each one owns a distinct slice of the codebase. One agent on the API layer, one on the docs, one on tests for an untouched module. They never collide.
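Before firing off a parallel batch, the ownership rule is easy to check mechanically. The sketch below assumes each task declares the repo-relative paths it owns; the task names and paths are invented, and it only catches declared overlaps, not an agent wandering outside its slice.

```python
from itertools import combinations
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """True when the paths are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_conflicts(
    ownership: dict[str, list[str]],
) -> list[tuple[str, str, str, str]]:
    """Return (task_a, path_a, task_b, path_b) for every overlapping pair."""
    conflicts = []
    for (task_a, paths_a), (task_b, paths_b) in combinations(ownership.items(), 2):
        for path_a in paths_a:
            for path_b in paths_b:
                if paths_overlap(path_a, path_b):
                    conflicts.append((task_a, path_a, task_b, path_b))
    return conflicts
```

With `{"api": ["src/api"], "docs": ["docs"], "tests": ["tests/billing"]}` the result is an empty list. Add a fourth task that owns `src/api/routes.py` and it is flagged against the `api` task before either agent starts.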
The third thing is small tasks. A background agent given a huge vague task will produce a huge vague PR that takes longer to review than the work would have taken to do. Small, well-scoped tasks produce small, reviewable PRs. The merge rate on tight tasks is dramatically higher than on “go improve this” prompts.
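One way to keep yourself honest about "small" is to put a number on it. This sketch reads the tab-separated output of `git diff --numstat` (added, deleted, path, with `-` for binary files) and compares it to a review budget. The thresholds of 10 files and 400 lines are arbitrary assumptions, not a recommendation from any study.

```python
def diff_size(numstat: str) -> tuple[int, int]:
    """Return (files_changed, lines_changed) from `git diff --numstat` text."""
    files = 0
    lines = 0
    for row in numstat.splitlines():
        if not row.strip():
            continue
        added, deleted, _path = row.split("\t", 2)
        files += 1
        # Binary files report "-" for both counts: count the file, not lines.
        if added != "-":
            lines += int(added) + int(deleted)
    return files, lines


def within_review_budget(numstat: str, max_files: int = 10,
                         max_lines: int = 400) -> bool:
    """True when the diff is small enough to review in one sitting."""
    files, lines = diff_size(numstat)
    return files <= max_files and lines <= max_lines
```

A PR that blows the budget is usually a sign the task should have been split, not that the reviewer should try harder.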
If you are coordinating more than a couple of agents at once, you are basically doing orchestration, and the tradeoffs there are their own topic. I went deep on the multi-agent versus single-agent question separately, because more agents is not automatically better and the coordination cost is real.
The Review Problem Nobody Warns You About
Here is the thing that bit me, and bites most people who lean into background agents. The bottleneck does not disappear when you delegate the writing. It moves to review.
Teams using AI heavily merge far more pull requests than they used to, but their PR review time has gone up, not down, and PR size has ballooned. When agents can produce ten PRs in a weekend, the question stops being “can I write this code” and becomes “can I review this code fast enough to trust it.” I wrote about this shift in detail in the piece on the AI code review bottleneck, and background agents make it sharper because the volume goes up while your review capacity stays exactly the same.
A few things keep me sane here.
Treat every agent PR as a draft from a junior contributor. Background agents open draft PRs for a reason. The draft is a proposal, not a decision. The developer who assigned the task owns reviewing every line, the same way you would own a junior’s PR. Delegating the writing does not delegate the responsibility.
Keep your quality gates in charge. Branch protection, required human review, required status checks, security scans that must pass before merge. These are non-negotiable with background agents. The agent works inside the same gates as everyone else. It does not get to merge its own work, ever. This is the structural protection that lets you delegate without losing control.
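In practice your hosting platform's branch protection enforces these gates. The sketch below only spells out the logic so the rule is unambiguous. The `PullRequest` shape, the account names, and the check names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a pull request."""
    author: str
    is_draft: bool
    approvals: set[str] = field(default_factory=set)       # accounts that approved
    checks: dict[str, bool] = field(default_factory=dict)  # check name -> passed


def merge_blockers(pr: PullRequest, required_checks: set[str],
                   agent_accounts: set[str]) -> list[str]:
    """Return the reasons this PR may not merge; an empty list means it may."""
    blockers = []
    if pr.is_draft:
        blockers.append("still a draft")
    # Approvals from agents, or from the PR's own author, do not count.
    human_approvals = pr.approvals - agent_accounts - {pr.author}
    if not human_approvals:
        blockers.append("no human approval")
    # A required check that never ran is treated the same as a failed one.
    for name in sorted(required_checks):
        if not pr.checks.get(name, False):
            blockers.append(f"required check not passing: {name}")
    return blockers
```

The line that matters is the set subtraction: an agent approving its own PR leaves `human_approvals` empty, so the merge stays blocked.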
Review the diff, not the explanation. Agents write convincing PR summaries. The summary tells you what the agent thinks it did. The diff tells you what it actually did. These are not always the same thing, and the gap is exactly where bugs live. I read the diff first and the summary second.
Test the change, do not just read it. AI-generated code has measurably higher bug density, and reading is not the same as verifying. I have a whole process for testing AI-generated code that catches the failure modes a visual scan misses, and it applies double to background work where you were not watching it happen.
The New Class of Technical Debt
There is a slower problem that does not show up in any single PR review. When you ship a lot of agent-written code fast, you accumulate a specific kind of debt.
The code works. The tests pass. It merges. But it is verbose, it duplicates patterns that already existed elsewhere, and nobody on the team has a complete mental model of it because nobody wrote it line by line. Studies through 2026 have been blunt about this: heavy agent adoption correlates with rising code complexity, more duplicated code, and climbing change failure rates, even as raw throughput goes up.
Background agents amplify this because the work is even further removed from human authorship. With a foreground agent, at least you watched it happen and have some memory of the decisions. With a background agent, you get a finished diff and a summary, and if you rubber-stamp it, the code enters your codebase as a black box.
I think about this as a new kind of technical debt that traditional review processes were not designed to catch. The defense is not to stop using background agents. It is to keep tasks small enough that each PR is genuinely reviewable, to enforce that agent code follows existing patterns rather than inventing new ones, and to accept that review is now the expensive part of your workflow and budget time for it accordingly.
The other half of the defense is what happens after merge. When agent-written code ships and breaks, you need to find out before your users do. Production observability for a solo developer is a different skill from review, and it is the one most people leaning into background agents are skipping. Fast shipping plus blind spots in production is how a quiet bug becomes a lost week.
Security and Access, Because Cloud VMs Are Not Free of Risk
A background agent runs in an environment with access to your code, your setup secrets, and a network. That is a real surface, and it is easy to be careless because the convenience is so high.
Limit repository access to only what the agent needs. Do not grant org-wide access because it was easier to click the broad permission. The agent working on one service does not need read access to forty repos.
Be deliberate about secrets in the agent environment. A fresh VM that needs to run your setup may need certain environment variables, but it does not need your production credentials to run unit tests. Scope what the agent can reach to what the task requires, and assume the environment is less trusted than your laptop.
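A simple way to be deliberate is an allowlist: the agent environment gets only the variables the task names, and everything else is dropped by default. This is a minimal sketch with invented variable names. How the filtered values actually reach the VM depends on your tool's own secrets settings.

```python
def scoped_env(full_env: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Keep only the explicitly allowed variables; drop the rest."""
    return {name: value for name, value in full_env.items() if name in allowed}


def dropped_names(full_env: dict[str, str], allowed: set[str]) -> list[str]:
    """List what was withheld, so the scoping decision is visible in review."""
    return sorted(name for name in full_env if name not in allowed)
```

Given an environment holding `DATABASE_URL`, `NODE_ENV`, and `PROD_API_KEY`, allowing only the first two hands the agent what it needs to run tests and reports `PROD_API_KEY` as withheld. An allowlist fails closed: a secret you forgot about stays out unless you add it on purpose.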
This is a subset of the broader picture for running agents safely. If background agents are getting real access to your systems, the full set of practices in securing AI agents in production applies, especially least privilege and credential management. The cloud VM being isolated does not mean the access you handed it is.
How I Actually Work Now
Putting it together, here is what the rhythm of a normal day looks like.
Morning, I review the draft PRs from any overnight tier-three tasks. Most merge with light edits. A couple get sent back with comments. One occasionally gets closed because the agent went the wrong direction, which is fine, because it cost me nothing to run.
During the day, the hard, interesting, ambiguous work stays in the foreground where I can drive it. That is the work I actually want to be doing anyway, the system design and the judgment calls and the gnarly bugs. The agentic loop handles the execution and I handle the direction.
When I hit a chunk of independent, well-scoped work, I scope it as a background task and fire it off, then keep working. Tests for a module I just finished. A refactor I have been putting off. Documentation that drifted out of date. The agent handles it in a clean VM while I stay in flow on the main thread.
End of day, I queue whatever boring well-defined work is sitting in the backlog and let it run overnight.
The throughput change is real, but it is not the headline. The headline is that the work split cleanly into “things only I should do” and “things I can describe well enough to delegate,” and that split made both halves better. I spend more time on the half that needs judgment and less on the half that needs typing.
The Honest Bottom Line
Background coding agents are not a magic productivity multiplier you bolt on and forget. They are a delegation tool, and delegation has always been a skill. The developers getting the most out of them are not the ones with the best prompts. They are the ones who got good at deciding what to hand off.
The mechanics are easy. Every major tool ships a competent background mode now, and they all work roughly the same way: describe the task, isolated VM, draft PR, you review. The skill that separates useful from chaotic is task selection and review discipline. Scope tightly. Keep the ambiguous work local. Never let an agent merge its own code. Read the diff, not the summary. Test what shipped.
Do that, and async agents quietly take a real load off your day. Skip it, and you get a firehose of plausible-looking PRs that take longer to review than the work would have taken to write. The tool is the same either way. The difference is entirely in how you use it.
Two years ago I was nervous about closing my laptop while an agent worked. Now it is just Tuesday. The stove was never on.