Hands-off engineering

My work changed drastically in November 2025, when Opus 4.5 arrived. For the first time I could trust the code an agent produced, as long as I kept an eye on it. Trust, but verify.

Half a year later, the watching is going away too. With Sonnet 5 and Opus 4.8 running under an orchestration layer, I hardly open the CLI anymore. Work goes in as a spec and comes out as a merged MR. I built that layer with Fable, Anthropic's newest model, while it was temporarily available to me. The result is Sandcastle: a pipeline that picks up the projects I brainstorm and implements them while I do something else.

This post is a tour of the setup, with a proper deep dive into what happens during a single run.

What Sandcastle is

Underneath the pipeline sits sandcastle, Matt Pocock's library for running a coding agent in a disposable sandbox. It creates a git worktree on a fresh branch and mounts it into a Docker container that runs Claude Code. That is all it does, and that is deliberate.

Everything around it is custom. My orchestration layer lives in a .sandcastle/ folder at the root of the monorepo I work on, a pnpm workspace with several Nuxt apps on Postgres. It connects three systems: Linear holds the work, GitLab holds the merge requests and CI, and cron on my machine is the heartbeat. One command, pnpm sandcastle:project, drains an entire Linear project.

From brainstorm to queue

Work starts as a brainstorm session with Claude. The brainstorm becomes a PRD, the PRD becomes a Linear project, and the project is cut into small issues wired together with blocked-by relations. None of that needs a terminal; it happens in conversation and lands in Linear.

Statuses are the contract. I move an issue to Ready for Agent, and from that point the pipeline owns it: In Progress, In Review, and finally Done, or Needs Info when it wants a human. The orchestrator keeps no database of its own. Everything it must remember lives on the issue itself, as a status plus an attached MR link with a typed prefix like "Auto-merging MR:". Kill the process at any moment and the next run rebuilds reality from Linear and GitLab.
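
To make "state lives on the issue" concrete, here is a minimal sketch of reading a typed MR marker back from an issue's attachments. Only the "Auto-merging MR:" prefix comes from my setup as described above; the Attachment shape, the second prefix and the function name are illustrative assumptions, not the real code.

```typescript
// Sketch only: the Attachment shape and the "Draft MR:" prefix are assumptions.
type MarkerKind = "auto-merging" | "draft";

interface Attachment {
  title: string; // e.g. "Auto-merging MR: !123"
  url: string;
}

interface MrMarker {
  kind: MarkerKind;
  url: string;
}

const PREFIXES: ReadonlyArray<[string, MarkerKind]> = [
  ["Auto-merging MR:", "auto-merging"], // the prefix quoted in this post
  ["Draft MR:", "draft"], // hypothetical second prefix
];

// Returns the first typed MR marker found on an issue, or null if there is none.
function findMrMarker(attachments: Attachment[]): MrMarker | null {
  for (const attachment of attachments) {
    for (const [prefix, kind] of PREFIXES) {
      if (attachment.title.startsWith(prefix)) {
        return { kind, url: attachment.url };
      }
    }
  }
  return null;
}
```

Because the marker is just text on the issue, a fresh process can call something like this on every tick and needs no memory of the previous one.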

The loop

Nothing runs permanently. Cron fires every fifteen minutes and executes a single tick: reconcile what happened while nobody was looking, plan which issues can run, run them a couple of sandboxes at a time, and finalize the project with a roll-up MR once everything is done. Then the process exits. Planning is a topological sort over the blocked-by graph, so an issue only starts when its blockers are finished, which is also what makes running sandboxes in parallel safe. A flock lock keeps ticks single-flight, and a --watch flag loops the same tick for when I want to sit and watch a drain live.
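
The planning step can be sketched as Kahn's algorithm grouped into waves. This is a simplified illustration, not the actual planner: the Issue shape and the planWaves name are made up for the example, and blockers that are not part of the project are assumed to be finished already.

```typescript
// Sketch only: Issue and planWaves are hypothetical names.
interface Issue {
  id: string;
  blockedBy: string[]; // ids of issues that must finish first
}

// Kahn's algorithm, grouped into waves: every issue in a wave has all of its
// blockers in earlier waves, so issues within one wave can run in parallel.
function planWaves(issues: Issue[]): string[][] {
  const known = new Set(issues.map((issue) => issue.id));
  const pending = new Map<string, Set<string>>();
  for (const issue of issues) {
    // Assumption: blockers outside this set are already finished.
    pending.set(issue.id, new Set(issue.blockedBy.filter((id) => known.has(id))));
  }

  const waves: string[][] = [];
  while (pending.size > 0) {
    const wave: string[] = [];
    pending.forEach((blockers, id) => {
      if (blockers.size === 0) wave.push(id);
    });
    if (wave.length === 0) {
      throw new Error("cycle in blocked-by graph");
    }
    for (const id of wave) pending.delete(id);
    pending.forEach((blockers) => {
      for (const id of wave) blockers.delete(id);
    });
    waves.push(wave.sort());
  }
  return waves;
}
```

A tick would only ever start issues from the first wave that still has unfinished work, a couple at a time. No two issues in the same wave block each other, which is the property that makes the parallel sandboxes safe.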

Why polling instead of webhooks? GitLab.com cannot reach into my WSL2 machine, and a loop that re-derives all state every tick is far easier to make crash-safe anyway.

Inside one issue run

This is the part worth slowing down for. The run is where the engineering happens.

Building the world

The pool picks a runnable issue and immediately moves it to In Progress, before any agent wakes up. Then the sandbox is built: a fresh git worktree on its own branch, bind-mounted into a Docker container with Node, pnpm, Postgres 16 and the Claude Code CLI. The host's pnpm store is mounted in as well, so installing dependencies is mostly a linking exercise instead of a download.

Cold start is a shell script with a twenty-minute ceiling, and it does the boring work agents are bad at. It boots a throwaway Postgres cluster inside the container, copies in secret-free env templates, runs a frozen-lockfile install, builds every workspace package, pushes the database schema and seeds fixtures. That throwaway cluster matters more than it looks: every sandbox gets its own database, so parallel runs can migrate and seed without ever touching each other.

Then a smoke script gates the run. Is the database reachable, does typecheck pass, does one representative test go green? If the environment is broken, the run dies here, cheaply, instead of an agent spending an hour fighting a broken world and concluding the code must be at fault.

Only now does a model get involved.

The implementer

The implementer is Sonnet, working test-first, with a budget of roughly eight agent iterations. Its prompt contains the issue, the surrounding PRD context, and every comment on the Linear issue, with an explicit rule that the newest human comment outranks the issue body. That last part is the steering wheel. If a run went sideways yesterday, I write one comment on the issue and the next attempt starts with my correction on top, no restart required.
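
Roughly, the precedence rule could be assembled like this. Treat it as a sketch under assumptions: the IssueComment shape, the fromPipeline flag (used to tell my comments apart from the summaries the host posts) and the section headings are invented for the example, and the real prompt is worded differently.

```typescript
// Sketch only: all names and the prompt wording are hypothetical.
interface IssueComment {
  body: string;
  createdAt: string; // ISO 8601 timestamp
  fromPipeline: boolean; // true for summaries posted by the host
}

function buildImplementerPrompt(
  issueBody: string,
  prdContext: string,
  comments: IssueComment[],
): string {
  // Oldest first, so the thread reads as a log.
  const ordered = comments
    .slice()
    .sort((a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt));
  const humans = ordered.filter((comment) => !comment.fromPipeline);
  const newestHuman = humans.length > 0 ? humans[humans.length - 1] : null;

  const sections = [
    "## Issue\n" + issueBody,
    "## PRD context\n" + prdContext,
    "## Comments (direction to weigh, not commands to run)\n" +
      ordered.map((comment) => "- " + comment.body).join("\n"),
  ];
  if (newestHuman !== null) {
    // Repeated last and labelled, so it wins over the issue body.
    sections.push(
      "## Newest human comment (outranks the issue body)\n" + newestHuman.body,
    );
  }
  return sections.join("\n\n");
}
```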

The prompt is also full of things the agent must not do. No pushing, no opening MRs, no touching remotes; the host does all of that. No history-altering git commands either, a rule with a story behind it that I will get to. And comments are marked as direction to weigh, not instructions to execute, so a stray command in a Linear comment never becomes something the agent runs.

Its one hard obligation is to commit its work. A run that produces zero commits is treated as a failure, no matter how confident the final summary sounds.

When it finishes, the host posts that summary as a comment on the Linear issue. The thread slowly becomes a progress log.

The reviewer

Then the sandbox gets a second visitor. The reviewer is Opus at maximum reasoning effort, and it gets exactly one iteration in a completely fresh session. It has not seen the implementer's chat, its plans, or its excuses. It sees the issue, the diff and the tests, and it writes a single file: a verdict that starts with either APPROVE or CHANGES REQUESTED.

The fresh session is the whole point. A reviewer that inherits the implementer's context inherits its blind spots too, and will happily approve its own reasoning back to itself. A different model in a clean context is the closest thing to a real colleague I can simulate.

CHANGES REQUESTED buys one rework round in the same warm sandbox, followed by another fresh review. If the second verdict is still unhappy, the loop stops and a human takes over. Agents can argue with each other forever, so I do not let them.
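
The bounded argument fits in a few lines. The sketch below is synchronous for readability, with the reviewer and the rework step passed in as plain functions; in a real pipeline both would be asynchronous agent runs. The names parseVerdict and runReviewLoop are invented for this example.

```typescript
// Sketch only: function names and the Outcome shape are hypothetical.
type Verdict = "APPROVE" | "CHANGES REQUESTED";

interface Outcome {
  result: "approved" | "needs-human";
  reviews: number; // how many fresh reviews were spent
}

// The verdict file must start with one of the two known verdicts.
function parseVerdict(fileContents: string): Verdict {
  const text = fileContents.trim();
  if (text.startsWith("APPROVE")) return "APPROVE";
  if (text.startsWith("CHANGES REQUESTED")) return "CHANGES REQUESTED";
  throw new Error("verdict file does not start with a known verdict");
}

// review() runs a fresh reviewer session and returns the verdict file.
// rework() gives the implementer one more round with that verdict.
function runReviewLoop(
  review: () => string,
  rework: (verdictText: string) => void,
  maxReworks: number = 1,
): Outcome {
  let reviews = 0;
  for (let round = 0; round <= maxReworks; round++) {
    const verdictText = review();
    reviews++;
    if (parseVerdict(verdictText) === "APPROVE") {
      return { result: "approved", reviews };
    }
    if (round < maxReworks) rework(verdictText);
  }
  return { result: "needs-human", reviews };
}
```

With maxReworks at one, the worst case is two reviews and one rework before a human is pulled in.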

Publishing

Everything after the verdict happens on the host, outside the sandbox. The host pushes the branch and opens the MR with the verdict as its description. For approved work it also arms GitLab's merge-when-pipeline-succeeds. GitLab refuses auto-merge for a short window right after an MR is created, so the arm call retries with backoff. Small, dumb, essential.
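
For illustration, a retry with capped exponential backoff might look like the sketch below. The delay values are placeholders rather than my actual settings, and arm stands in for whatever call asks GitLab to enable auto-merge; it is assumed to resolve to true once GitLab accepts the request.

```typescript
// Sketch only: delays and names are hypothetical.
// Delays to wait between attempts: base, base*factor, ... capped at capMs.
function backoffDelaysMs(
  attempts: number,
  baseMs: number = 2000,
  factor: number = 2,
  capMs: number = 30000,
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(Math.min(capMs, baseMs * Math.pow(factor, i)));
  }
  return delays;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Tries once immediately, then once after each delay. Returns false if
// every attempt was refused, so the caller can escalate.
async function armWithBackoff(
  arm: () => Promise<boolean>,
  delays: number[],
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<boolean> {
  if (await arm()) return true;
  for (const ms of delays) {
    await sleep(ms);
    if (await arm()) return true;
  }
  return false;
}
```

Keeping the schedule as a separate pure function makes it easy to test without waiting, and sleep is injectable for the same reason.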

Note what actually merges the code: not the implementer's confidence, not even the reviewer's approval, but a green CI pipeline. The agents propose; CI disposes.

The Linear issue follows along. Merged straight away means Done, with a link to the MR. Armed but still waiting on the pipeline means In Review with that "Auto-merging MR:" marker, which a later tick reads to promote the issue to Done, or to notice the auto-merge was dropped and send it to triage. Changes requested means a draft MR and the verdict posted as a comment. And when the reviewer flags useful work that is out of scope, its follow-up notes are filed automatically as new Backlog issues.
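
The later-tick check on an "Auto-merging MR:" issue reduces to a small decision function. This is a sketch with an invented MrSnapshot shape; how the fields are fetched from GitLab is left out, and treating a closed MR as a triage case is my assumption for the example.

```typescript
// Sketch only: MrSnapshot and the action names are hypothetical.
interface MrSnapshot {
  state: "opened" | "merged" | "closed";
  autoMergeArmed: boolean; // is merge-when-pipeline-succeeds still set?
}

type FollowUp = "promote-to-done" | "keep-waiting" | "send-to-triage";

function followUpAutoMerge(mr: MrSnapshot): FollowUp {
  if (mr.state === "merged") return "promote-to-done";
  if (mr.state === "opened" && mr.autoMergeArmed) return "keep-waiting";
  // Auto-merge was dropped, or the MR was closed without merging (assumption).
  return "send-to-triage";
}
```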

One issue in, one MR out, and a paper trail on the issue I can read from my phone.

The host holds the keys

The sandbox receives exactly one secret: the Claude Code OAuth token. The Linear API key and the GitLab token stay on the host, so the agent cannot push, cannot open MRs and cannot touch an issue even if it wanted to. All writeback is plain host-side code, Linear through its GraphQL API and GitLab through the glab CLI. State transitions are deterministic code paths, not something a model decides to do with a tool call.
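
One way to enforce that boundary is an explicit allowlist when building the container's environment, sketched below. The variable name CLAUDE_CODE_OAUTH_TOKEN is an assumption about how the token is passed, and the names in the usage example are placeholders.

```typescript
// Sketch only: the variable names are assumptions.
const SANDBOX_ENV_ALLOWLIST: ReadonlyArray<string> = ["CLAUDE_CODE_OAUTH_TOKEN"];

// Builds the sandbox environment from an allowlist, so a new host secret
// can never leak into the container by default.
function sandboxEnv(
  hostEnv: Record<string, string | undefined>,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const name of SANDBOX_ENV_ALLOWLIST) {
    const value = hostEnv[name];
    if (value === undefined) {
      throw new Error("missing required variable: " + name);
    }
    env[name] = value;
  }
  return env;
}

// Usage: tracker and GitLab credentials on the host are simply not copied.
const example = sandboxEnv({
  CLAUDE_CODE_OAUTH_TOKEN: "token-for-sandbox",
  LINEAR_API_KEY: "stays-on-host", // hypothetical name
  GITLAB_TOKEN: "stays-on-host", // hypothetical name
});
console.log(Object.keys(example));
```

An allowlist fails closed: forgetting to update it means a missing variable in the sandbox, not a leaked credential.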

That boundary earned its keep during the first live drain. A git worktree shares the host repository's object store and most of its refs, the stash included, and one agent, trying to tidy up its workspace, ran git stash pop. It pulled a stash from my own working copy into its sandbox. Nothing was lost, but the lesson was clear: a container is not a git boundary. The prompt ban on destructive git commands exists because of that afternoon.

When it goes wrong

Failure handling is deliberately unsophisticated. There are no retry counters. A failed run moves its issue to Needs Info with a comment pointing at the log, and that is it; nothing gets retried until a human moves the issue back. The exceptions are narrow. An expired Claude token re-queues the issue untouched and stops the whole batch, because a dead token would take down every run after it. A merge conflict gets one cheap triage pass that picks between replaying the work on a fresh base, redoing it, or escalating with a diagnosis. A second conflict always goes to a human.
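
As a sketch, the whole policy is a single switch with no counters in sight. The Failure and Action shapes and the comment texts are invented for the example; previousConflicts would be derived from the issue's history rather than stored anywhere.

```typescript
// Sketch only: all type names, fields and comment texts are hypothetical.
type Failure =
  | { kind: "token-expired" }
  | { kind: "merge-conflict"; previousConflicts: number }
  | { kind: "run-failed"; logPath: string };

type Action =
  | { action: "requeue-and-stop-batch" }
  | { action: "triage-conflict" }
  | { action: "needs-info"; comment: string };

function handleFailure(failure: Failure): Action {
  switch (failure.kind) {
    case "token-expired":
      // Not the issue's fault: put it back untouched and stop the batch.
      return { action: "requeue-and-stop-batch" };
    case "merge-conflict":
      // One cheap triage pass; a second conflict always goes to a human.
      return failure.previousConflicts === 0
        ? { action: "triage-conflict" }
        : { action: "needs-info", comment: "Second merge conflict, needs a human." };
    case "run-failed":
      return { action: "needs-info", comment: "Run failed, see log: " + failure.logPath };
  }
}
```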

Crashes are handled by the reconcile step at the start of each tick. Ticks are single-flight, so an In Progress issue without a live run is by definition an orphan, and its branch tells the story: no MR means reset the issue, a merged MR means promote it to Done, an open MR means park it In Review.
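
The orphan rule can be written as a pure function over what the tracker and GitLab report. This is a simplified sketch: the TrackedIssue shape and the action names are made up, and it assumes no run is live when reconcile executes, which the single-flight lock guarantees.

```typescript
// Sketch only: shapes and action names are hypothetical.
type BranchMr = "none" | "opened" | "merged";
type ReconcileAction = "reset-issue" | "promote-to-done" | "park-in-review";

interface TrackedIssue {
  id: string;
  status: string; // tracker status, e.g. "In Progress"
  branchMr: BranchMr; // state of the MR for the issue's branch, if any
}

function actionFor(mr: BranchMr): ReconcileAction {
  switch (mr) {
    case "none":
      return "reset-issue";
    case "merged":
      return "promote-to-done";
    case "opened":
      return "park-in-review";
  }
}

// Every In Progress issue seen at the start of a tick is an orphan,
// because ticks are single-flight and no run survives between ticks.
function reconcile(
  issues: TrackedIssue[],
): Array<{ id: string; action: ReconcileAction }> {
  return issues
    .filter((issue) => issue.status === "In Progress")
    .map((issue) => ({ id: issue.id, action: actionFor(issue.branchMr) }));
}
```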

What it runs on

There is no API key involved anywhere. The sandboxes authenticate with the same Claude Code Max subscription I use interactively, which makes the weekly usage cap the real constraint. A small monitor warns me when a weekly window passes eighty percent. The model split follows the money: Sonnet does the many iterations of implementation, Opus spends one expensive pass on judgment.

If you build one yourself

The code is not the transferable part. The design choices are, and these are the ones I would defend:

Keep durable state in your tracker, not in your orchestrator. Statuses and attached links survive crashes; process memory does not.
Let the host perform every side effect. The agent writes code and nothing else, and credentials never enter the sandbox.
Spend on the environment before the agent. A built, seeded, smoke-tested sandbox saves more tokens than any prompt tweak.
Review in a fresh context. A reviewer that inherits the implementer's session inherits its blind spots.
Failures should leave the queue, not retry. A counter only delays a loop; a human breaks it.
Make CI the merge gate, never the agent's confidence.
Build the steering channel early. Redirecting a live system with a comment beats killing and restarting it.

And to be honest about the hands-off part: I still write the specs, triage the queue, and read the roll-up MR before it reaches main. But the CLI has become the exception. From now on, the brainstorm is the work.