So You Want To Do Agentic Development

We’re three months into 2026, and coding agents have been a big part of my time since the year began–things have definitely intensified, and what I expected has already panned out: agents are everywhere.

Yes, I love this picture, and I won't apologize for it

My advice for people getting into this remains the same:

Choose Mature Tooling

In the music hobby, there’s a thing called GAS–Gear Acquisition Syndrome–where people get obsessed with buying the latest gear even if they don’t know how to use it. I see a lot of that in the agent space right now, so I’d rather recommend starting with mature, well-supported tools:

  • VS Code with GitHub Copilot is still the best entry point–you can compare Claude, GPT and Gemini side by side, and it affords real control over the agent’s environment (plus it’s designed for enterprise use).
  • Mistral Vibe and Gemini CLI both have daily free tiers with enough fail-safes to experiment safely. (I still recommend sandboxing, but it’s less critical than it was a few months ago.)
  • OpenCode is the fully free route, but the models have fewer guardrails and can take unexpected actions, so definitely sandbox this one.

I can’t in good conscience recommend spending hundreds on Anthropic or OpenAI subscriptions right now–the market is saturated, and both are shipping desktop tools (Claude Code, Codex) that will likely come with cheaper tiers. The “use boring technology” adage applies here too.

Of course you can run your own models by now and keep everything under your control, but that’s a long-term project, and I don’t think it’s the best starting point for most people.

Sandboxing

I never run agent tools on a machine with personal data–that’s why I built agentbox in the first place. You may not need to be as paranoid as I am, but VS Code supports dev containers on any platform, and both Anthropic and OpenAI are shipping sandboxes with their tools, so there’s really no excuse.
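If you just want a starting point, a bare-bones dev container is enough to keep the agent away from the host filesystem. This is only a sketch–the image tag and extension IDs are examples, so check the current names before using them:

```jsonc
{
  // run the agent inside a container with only the workspace mounted
  "name": "agent-sandbox",
  "image": "mcr.microsoft.com/devcontainers/python:3.12",
  "customizations": {
    "vscode": {
      // the agent and chat surface live inside the container too
      "extensions": ["GitHub.copilot", "GitHub.copilot-chat"]
    }
  }
}
```

The point is isolation by default: nothing outside the workspace folder is visible to whatever the agent decides to run.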

Privacy and Security

There are rather a lot of misconceptions about data privacy, and things like OpenClaw don’t help matters (I’m still gobsmacked people give it direct access to their e-mail). Even with enterprise-hosted models that don’t train on your data, “don’t run untrusted code on a machine with personal data” is a principle worth keeping.

Like I was quipping the other day, AI is the new digital advertising–and yet people are giving it more access to their data than they give ad networks, which is just baffling.

The Local Fallacy

And speaking of OpenClaw–the “local AI” fallacy needs addressing. None of these things are really “local” in any meaningful sense–the gap between local and cloud models is huge, and even tens of thousands of dollars in hardware won’t get you close to frontier capabilities.

The agentic loop is inescapable

“Fast, Good, Cheap: pick two” still applies, and it’s easy to get bitten by technology advances: the hardware I bought for local inference is already obsolete.

And there is an almost weekly hype cycle around local models that I think is totally unwarranted in practice. For instance, Qwen is promising, but the local quantizations match last year’s cloud models at best, and the gap keeps widening.

Workflow

I keep coming across people who say AI generates rubbish code, and I think it’s usually one of two things: wrong tools (a proper agent harness does much more than provide a UI–it curates context and feeds models your code structure), or wrong approach (they’re quite literally “holding it wrong”). I wrote about this two years ago and the fundamentals haven’t changed.

Part of it is inflated expectations, of course. Frontier models like Opus 4.6 and GPT-5.4 are very capable, but they need skill to use effectively, and they never produce perfect code on the first try. You have to know how to use them, and that takes practice.

I’ve been refining how I work with them since then, and although things have evolved quite a bit, the core principles remain the same.

SPEC.md

Every project starts with a SPEC.md that I refine with the model, 20-questions style, until it covers the essentials–goals, functional specs, non-functional specs, technical specs, and acceptance criteria.

I prefer SPEC.md over PRD.md because it emphasises specification over requirements–I want the agent to follow it, not interpret it freely.

This isn’t a prompt–it’s a living document that evolves with the project. And agents can digest surprisingly dense material–feeding one an actual ECMA-376 spec document got me to 60% compliance in days, with no hallucinated APIs.
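For what it’s worth, the skeleton I keep converging on looks like this–the headings are my own convention, not any standard:

```markdown
# SPEC: widget-service

## Goals
- One-line statement of what the project is for.

## Functional Specs
- Behaviours the system must exhibit, phrased as testable statements.

## Non-Functional Specs
- Performance, reliability, and security constraints.

## Technical Specs
- Languages, frameworks, data formats, external dependencies.

## Acceptance Criteria
- Concrete checks that tell the agent (and me) when it's done.
```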

SKILL.md

I complement specs with SKILL.md files–guidelines for coding, tooling, or domain-specific tasks. I have a growing collection in agentbox, and every new project starts with a make init that copies the relevant ones in.
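A skill file doesn’t need to be elaborate–mine are short markdown documents along these lines (the structure and the packaging example here are illustrative, not a fixed format):

```markdown
# SKILL: Python packaging

## When to use
- Any project that ships a pyproject.toml.

## Guidelines
- Use a src/ layout and declare all dependencies explicitly.
- Run the linter and test suite before declaring a task done.

## Examples
- Point at a known-good project layout the agent can copy.
```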

You can also fold these into .github/copilot-instructions.md (which Copilot picks up automatically), but standalone skills are tool-agnostic.

The properly interesting bit is that agents can now write their own skills–piclaw built its own hot-reload, backup, and web scraping skills after I’d guided it through the process a few times. Early days, but that’s where this is headed.

Skills have their limits, though–models struggle to chain skills together, whereas MCP narrows context and presents clear next steps. But for teaching how rather than what, skills are still invaluable.

The PLAN.md Loop

After doing the prep work, I go into a loop:

The workflow

From the SPEC.md, I create a PLAN.md–not a flat TODO checklist, but a structured breakdown the agent can reason about (what’s done, what’s blocked, why). The agent updates the plan as it goes, which also refreshes model context. No reliance on built-in planning tools (which are patchy across models), and the plan is always in the repo for me to review.
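To make that concrete, a PLAN.md entry carries state and rationale rather than just checkboxes–something like this sketch (the project steps are hypothetical):

```markdown
# PLAN

## 1. Scaffolding (done)
- Repo layout, Makefile, CI all in place.

## 2. Data model (in progress)
- Core entities defined; migrations still pending.
- Blocked on: deciding whether deletes are soft or hard (see SPEC.md).

## 3. API (not started)
- Depends on the data model landing first.
```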

The loop itself is pretty simple:

  • I break down work into focused chunks–scaffolding, data model, API, etc.
  • The agent writes code, lints, tests, documents, and updates the PLAN.md.
  • I review and steer–correcting, feeding it more context, or pointing at examples.
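The agent can only lint and test if those targets exist, so part of the scaffolding step is giving it one obvious entry point for each. A minimal Makefile along these lines works (ruff and pytest-cov are just example tools–substitute your own):

```make
.PHONY: lint test coverage

lint:      # static analysis the agent runs after every change
	ruff check src tests

test:      # failures here become steering input for the next turn
	pytest -q

coverage:  # enforce a bar so "tests pass" actually means something
	pytest -q --cov=src --cov-fail-under=80
```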

Steering

The most important bit of the loop–and the one most people get wrong at first. Effective steering isn’t about reprompting and hoping for the best; it’s about funnelling the agent’s attention to the right context:

  • Tests first–describe expected behaviour, ask the agent to make them pass. My most reliable workflow, especially when porting across languages.
  • Linting and static analysis via Makefile or CI–the agent self-corrects. I aim for 80% coverage as a quality bar.
  • Steering by example–pointing at existing code that demonstrates the right approach.
  • Claude still writes dodgy tests (ok, fine, it forgets all sorts of corner cases), so I use Codex models for test and security audits.

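Tests-first steering in practice looks something like this: I write the behaviour down as tests for a function that doesn’t exist yet (slugify here is a made-up example, and the implementation is the sort of thing the agent comes back with):

```python
import re

def slugify(title: str) -> str:
    """What the agent eventually produces to satisfy the tests below."""
    # lowercase, collapse runs of non-alphanumerics into single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# the part I actually write first: behaviour, expressed as tests
def test_basic():
    assert slugify("Hello World") == "hello-world"

def test_punctuation_collapses():
    assert slugify("C'est la vie!") == "c-est-la-vie"

def test_edges_are_trimmed():
    assert slugify("  --Trim me--  ") == "trim-me"
```

The tests are the spec; the agent iterates until they pass, and any dodgy corner cases it forgets show up as new test cases in the next loop.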
Yes, that’s proper work. But it’s no different from managing humans, and it gets easier with practice.

Language Matters

Some languages are inherently better for agents–not in terms of popularity (any agent can do a decent job in the mainstream ones), but because strong types and annotations help models understand intent and self-correct.

In my experience, strongly typed languages work much better than loosely typed ones (too many revisions, too-opinionated frameworks). Explicit function and interface references make steering much easier, and good tooling for enforcing sound practices lets the agent self-correct–which is why I’ve been leaning on typed stacks a lot more recently.

What’s Next

My workflow is working well, and I don’t see myself changing it anytime soon. But I’ve been slowly extending my agents’ reach–piclaw, which started as a weekend hack, is now a permanent part of my setup, and I’ve been giving it more autonomy as I learn to trust the guardrails.

The next frontier (boy, is this a pompous term, but I guess pomposity is sneaking in this late in the evening) is getting agents to collaborate–sharing context and skills to work as a group. I have some ideas about how to do that, but that’s a matter for another post.