So You Want To Do Agentic Development

We’re three months into 2026, and coding agents have been a big part of my time since the year began–things have definitely intensified, and what I expected has already panned out: agents are everywhere.

Yes, I love this picture, and I won't apologize for it

My advice for people getting into this remains the same:

Choose Mature Tooling

In the music hobby, there’s a thing called GAS–Gear Acquisition Syndrome–where people get obsessed with buying the latest gear even if they don’t know how to use it. I see a lot of that in the agent space right now, so I’d rather recommend starting with mature, well-supported tools:

  • VS Code with GitHub Copilot is still the best entry point–you can compare Claude, GPT and Gemini side by side, and it affords real control over the agent’s environment (plus it’s designed for enterprise use).
  • Mistral Vibe and Gemini CLI both have daily free tiers with enough fail-safes to experiment safely. (I still recommend sandboxing, but it’s less critical than it was a few months ago.)
  • OpenCode is the fully free route, but the models have fewer guardrails and can take unexpected actions, so definitely sandbox this one.

I can’t in good conscience recommend spending hundreds on Anthropic or OpenAI subscriptions right now–the market is saturated, and both are shipping desktop tools (Claude Code, Codex) that will likely come with cheaper tiers. The “use boring technology” adage applies here too.

Of course you can run your own models by now and keep everything under your control, but that’s a long-term project, and I don’t think it’s the best starting point for most people.

Sandboxing

I never run agent tools on a machine with personal data–that’s why I built agentbox in the first place. You may not need to be as paranoid as I am, but VS Code supports dev containers on any platform, and both Anthropic and OpenAI are shipping sandboxes with their tools, so there’s really no excuse.
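If you just want a starting point, a bare-bones dev container is enough to keep the agent away from the host filesystem. This is only a sketch–the image tag and extension IDs are examples, so check the current names before using them:

```jsonc
{
  // run the agent inside a container with only the workspace mounted
  "name": "agent-sandbox",
  "image": "mcr.microsoft.com/devcontainers/python:3.12",
  "customizations": {
    "vscode": {
      // the agent and chat surface live inside the container too
      "extensions": ["GitHub.copilot", "GitHub.copilot-chat"]
    }
  }
}
```

The point is isolation by default: nothing outside the workspace folder is visible to whatever the agent decides to run.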

Privacy and Security

There are rather a lot of misconceptions about data privacy, and things like OpenClaw don’t help matters (I’m still gobsmacked people give it direct access to their e-mail). Even with enterprise-hosted models that don’t train on your data, “don’t run untrusted code on a machine with personal data” is a principle worth keeping.

Like I was quipping the other day, AI is the new digital advertising–and yet people are giving it more access to their data than they give ad networks, which is just baffling.

The Local Fallacy

And speaking of OpenClaw–the “local AI” fallacy needs addressing. None of these things are really “local” in any meaningful sense–the gap between local and cloud models is huge, and even tens of thousands of dollars in hardware won’t get you close to frontier capabilities.

The agentic loop is inescapable

“Fast, Good, Cheap: pick two” still applies, and it’s easy to get bitten by technology advances: the hardware I bought for local inference is already obsolete.

And there is an almost weekly hype cycle around local models that I think is totally unwarranted in practice. For instance, Qwen is promising, but the local quantizations match last year’s cloud models at best, and the gap keeps widening.

Workflow

I keep coming across people who say AI generates rubbish code, and I think it’s usually one of two things: wrong tools (a proper agent harness does much more than provide a UI–it curates context and feeds models your code structure), or wrong approach (they’re quite literally “holding it wrong”). I wrote about this two years ago and the fundamentals haven’t changed.

Part of it is inflated expectations, of course. Frontier models like Opus 4.6 and GPT-5.4 are very capable, but they need skill to use effectively, and they never produce perfect code on the first try. You have to know how to use them, and that takes practice.

I’ve been refining how I work with them since then, and although things have evolved quite a bit, the core principles remain the same.

SPEC.md

Every project starts with a SPEC.md that I refine with the model, 20-questions style, until it covers the essentials–goals, functional specs, non-functional specs, technical specs, and acceptance criteria.

I prefer SPEC.md over PRD.md because it emphasises specification over requirements–I want the agent to follow it, not interpret it freely.

This isn’t a prompt–it’s a living document that evolves with the project. And agents can digest surprisingly dense material–feeding one an actual ECMA-376 spec document got me to 60% compliance in days, with no hallucinated APIs.
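For what it’s worth, the skeleton I keep converging on looks like this–the headings are my own convention, not any standard:

```markdown
# SPEC: widget-service

## Goals
- One-line statement of what the project is for.

## Functional Specs
- Behaviours the system must exhibit, phrased as testable statements.

## Non-Functional Specs
- Performance, reliability, and security constraints.

## Technical Specs
- Languages, frameworks, data formats, external dependencies.

## Acceptance Criteria
- Concrete checks that tell the agent (and me) when it's done.
```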

SKILL.md

I complement specs with SKILL.md files–guidelines for coding, tooling, or domain-specific tasks. I have a growing collection in agentbox, and every new project starts with a make init that copies the relevant ones in.
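A skill file doesn’t need to be elaborate–mine are short markdown documents along these lines (the structure and the packaging example here are illustrative, not a fixed format):

```markdown
# SKILL: Python packaging

## When to use
- Any project that ships a pyproject.toml.

## Guidelines
- Use a src/ layout and declare all dependencies explicitly.
- Run the linter and test suite before declaring a task done.

## Examples
- Point at a known-good project layout the agent can copy.
```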

You can also fold these into .github/copilot-instructions.md (which Copilot picks up automatically), but standalone skills are tool-agnostic.

The properly interesting bit is that agents can now write their own skills–piclaw built its own hot-reload, backup, and web scraping skills after I’d guided it through the process a few times. Early days, but that’s where this is headed.

Skills have their limits, though–models struggle to chain skills together, whereas MCP narrows context and presents clear next steps. But for teaching how rather than what, skills are still invaluable.

The PLAN.md Loop

After doing the prep work, I go into a loop:

The workflow

From the SPEC.md, I create a PLAN.md–not a flat TODO checklist, but a structured breakdown the agent can reason about (what’s done, what’s blocked, why). The agent updates the plan as it goes, which also refreshes model context. No reliance on built-in planning tools (which are patchy across models), and the plan is always in the repo for me to review.
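To make that concrete, a PLAN.md entry carries state and rationale rather than just checkboxes–something like this sketch (the project steps are hypothetical):

```markdown
# PLAN

## 1. Scaffolding (done)
- Repo layout, Makefile, CI all in place.

## 2. Data model (in progress)
- Core entities defined; migrations still pending.
- Blocked on: deciding whether deletes are soft or hard (see SPEC.md).

## 3. API (not started)
- Depends on the data model landing first.
```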

The loop itself is pretty simple:

  • I break down work into focused chunks–scaffolding, data model, API, etc.
  • The agent writes code, lints, tests, documents, and updates the PLAN.md.
  • I review and steer–correcting, feeding it more context, or pointing at examples.
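The agent can only lint and test if those targets exist, so part of the scaffolding step is giving it one obvious entry point for each. A minimal Makefile along these lines works (ruff and pytest-cov are just example tools–substitute your own):

```make
.PHONY: lint test coverage

lint:      # static analysis the agent runs after every change
	ruff check src tests

test:      # failures here become steering input for the next turn
	pytest -q

coverage:  # enforce a bar so "tests pass" actually means something
	pytest -q --cov=src --cov-fail-under=80
```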

Steering

The most important bit of the loop–and the one most people get wrong at first. Effective steering isn’t about reprompting and hoping for the best; it’s about funnelling the agent’s attention to the right context:

  • Tests first–describe expected behaviour, ask the agent to make them pass. My most reliable workflow, especially when porting across languages.
  • Linting and static analysis via Makefile or CI–the agent self-corrects. I aim for 80% coverage as a quality bar.
  • Steering by example–pointing at existing code that demonstrates the right approach.
  • Claude still writes dodgy tests (ok, fine, it forgets all sorts of corner cases), so I use Codex models for test and security audits.

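Tests-first steering in practice looks something like this: I write the behaviour down as tests for a function that doesn’t exist yet (slugify here is a made-up example, and the implementation is the sort of thing the agent comes back with):

```python
import re

def slugify(title: str) -> str:
    """What the agent eventually produces to satisfy the tests below."""
    # lowercase, collapse runs of non-alphanumerics into single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# the part I actually write first: behaviour, expressed as tests
def test_basic():
    assert slugify("Hello World") == "hello-world"

def test_punctuation_collapses():
    assert slugify("C'est la vie!") == "c-est-la-vie"

def test_edges_are_trimmed():
    assert slugify("  --Trim me--  ") == "trim-me"
```

The tests are the spec; the agent iterates until they pass, and any dodgy corner cases it forgets show up as new test cases in the next loop.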
Yes, that’s proper work. But it’s no different from managing humans, and it gets easier with practice.

Language Matters

Some languages are inherently better for agents–not in terms of popularity (any agent can do a decent job in the mainstream ones), but because strong types and annotations help models understand intent and self-correct.

In my experience, strongly typed languages work much better than loosely typed ones (too many revisions, too-opinionated frameworks). Explicit function and interface references make steering much easier, and good tooling for enforcing sound practices lets the agent self-correct–which is why I’ve been leaning on typed stacks a lot more recently.

What’s Next

My workflow is working well, and I don’t see myself changing it anytime soon. But I’ve been slowly extending my agents’ reach–piclaw, which started as a weekend hack, is now a permanent part of my setup, and I’ve been giving it more autonomy as I learn to trust the guardrails.

The next frontier (boy, is this a pompous term, but I guess pomposity is sneaking in this late in the evening) is getting agents to collaborate–sharing context and skills to work as a group. I have some ideas about how to do that, but that’s a matter for another post.