Notes for February 2-7

Half my working week was spent at an internal company thing, so I decided to make the best of my weekend and start cutting down on coding a bit.

I still want to spend most of my free time building things, but there is a associated with the current state of the agent ecosystem and I need to get back to building physical things.

Polishing The Grindstone

But I just can’t stop myself from polishing stuff, and I was having a niggling issue with webterm and pyte, so I dove into ANSI land again and fixed alternate-screen handling so that full-screen redraws don’t overlay stale content in the SVG snapshots, and finally figured out that Ink (the React TUI framework–yes, people are using React for TUIs…) had some quirks around screen clearing, so I ended up monkey-patching pyte to handle those.

The highlight was that I got some repeatable skills out of it–I used itself as a debugging tool, and then got Copilot to write down the screenshot debugging workflow as a SKILL.md it can reuse later.
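The alternate-screen bookkeeping mentioned above can be sketched roughly like this. It is plain Python rather than pyte's actual API, AltScreenTracker and its methods are hypothetical names, and it assumes the application toggles the alternate buffer with the xterm private modes 47, 1047 or 1049:

```python
import re

# xterm private modes that switch to/from the alternate screen buffer.
ALT_SCREEN = re.compile(rb"\x1b\[\?(47|1047|1049)([hl])")


class AltScreenTracker:
    """Tracks which buffer is active so a snapshot never mixes the two."""

    def __init__(self):
        self.in_alt = False
        self.main = bytearray()  # bytes written to the normal screen
        self.alt = bytearray()   # bytes written to the alternate screen

    def feed(self, data: bytes) -> None:
        pos = 0
        for match in ALT_SCREEN.finditer(data):
            self._write(data[pos:match.start()])
            entering = match.group(2) == b"h"
            if entering and not self.in_alt:
                # Simplification: every entry starts from an empty buffer
                # (xterm does this for mode 1049).
                self.alt.clear()
            self.in_alt = entering
            pos = match.end()
        self._write(data[pos:])

    def _write(self, chunk: bytes) -> None:
        (self.alt if self.in_alt else self.main).extend(chunk)

    def snapshot_source(self) -> bytes:
        """Bytes that should be rendered for the current snapshot."""
        return bytes(self.alt if self.in_alt else self.main)
```

A real implementation also has to cope with escape sequences split across reads, which this sketch ignores.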

vibes got some ACP protocol fixes and now restarts the agent more consistently–I’ve been using it to update link tables in the wiki and add TODOs for myself, using Copilot and gpt-5-mini as a back-end, and I’m starting to question whether I need to keep developing python-steward–I kept poking at it, but I just don’t have the time to fully kit out an agent harness.

But I made sure to add some of the key missing pieces: It can scan .github/skills, load Copilot-style instructions, preserve SKILL frontmatter correctly, and structure the context correctly for caching, so I learned a lot about the “plumbing” of agent workflows, and I can always add more features later if I decide to keep going with it.
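As an illustration of the skill-scanning part, here is a minimal sketch, not python-steward's actual code. It assumes one SKILL.md per directory under .github/skills and frontmatter delimited by `---` lines; split_frontmatter and scan_skills are hypothetical names:

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a SKILL.md into (frontmatter, body), leaving both verbatim.

    Returns an empty frontmatter if the file does not open with a '---' line.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "".join(lines[1:index]), "".join(lines[index + 1:])
    return "", text  # unterminated frontmatter: treat everything as body


def scan_skills(root: Path) -> dict[str, tuple[str, str]]:
    """Map each skill directory name to its (frontmatter, body) pair."""
    skills = {}
    for path in sorted((root / ".github" / "skills").glob("*/SKILL.md")):
        skills[path.parent.name] = split_frontmatter(path.read_text(encoding="utf-8"))
    return skills
```

Keeping the frontmatter as an untouched string, instead of parsing and re-serializing it, is one simple way to make sure it survives a round trip unchanged.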

agentbox now ships with more built-in SKILL.md files and all the little conventions (how I structure workspaces, how I debug things, how I ship), so every new project starts with a decent set of skills to get the ball rolling, and I can just ask Copilot to adapt them to the project at hand.

The Fun Bits

To help me , I built daisy, a live disk usage sunburst visualizer, completely on a whim on my Linux laptop, and then ported it to the Mac (with a native CLI version as well) because I wanted something with live updating and better performance than the existing options.

It’s also a nice demo for something I’ve been arguing (and occasionally annoying people with): SPEC-driven, agent-assisted development can be ridiculously fast when the problem is well-scoped and the feedback loop is tight.

Since I would be spending two days at the office, I decided to fix one of my gripes with (the fact that it can’t resolve .local hostnames when you’re on the road) and ended up building mdnsbridge, an mDNS-to-DNS bridge that runs on a node that can see the LAN, queries avahi-daemon for .local names, and answers normal DNS queries with the results using split DNS feature.

This also reduces an important bit of friction for me: I can use .local URLs and machine names for everything from everywhere, without having to worry about weird DNS names or tunneling into the LAN with a full remote desktop session just to get access to local-only services.
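The core of a bridge like this is small. The following is a rough sketch, not mdnsbridge's actual implementation: it assumes lookups go through the avahi-resolve command-line tool (talking to avahi-daemon over D-Bus is the other obvious route), leaves out the DNS server itself, and uses the hypothetical names avahi_lookup and answer_query:

```python
import subprocess
from typing import Callable, Optional


def avahi_lookup(name: str) -> Optional[str]:
    """Resolve a .local name through avahi-daemon via the avahi-resolve CLI."""
    result = subprocess.run(
        ["avahi-resolve", "--name", name],
        capture_output=True, text=True, timeout=5, check=False,
    )
    # On success avahi-resolve prints "<name>\t<address>" on stdout.
    fields = result.stdout.split()
    return fields[1] if result.returncode == 0 and len(fields) >= 2 else None


def answer_query(
    qname: str, lookup: Callable[[str], Optional[str]] = avahi_lookup
) -> Optional[str]:
    """Return an address for a DNS question, or None if there is no answer."""
    name = qname.rstrip(".").lower()
    if not name.endswith(".local"):
        return None  # only bridge mDNS names; everything else is not ours
    return lookup(name)
```

Passing the lookup function in as a parameter keeps the name-handling logic testable on a machine without Avahi installed.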

Another thing I’ve been tinkering with is apfelstrudel, an “AI Agent for music coding” that uses the strudel.cc engine.

It's not very musical on its own, but I'm trying to make it so

And, finally, I published go-busybox, a busybox clone written in Go, because I wanted a sandboxable, WASM-ready set of Unix utilities that I can use as a base for building more complex agents and tools without having to worry about the security implications of running arbitrary binaries.

As usual, the icon was the most fun part to design, AI or not:

All those cute, industrious Go Gophers

I think I have enough side projects for now, so I’m going to focus on circling back to some of the stuff I started during the holiday season and get it to a more polished state. But before that, I want to spend some time doing some electronics work and building some physical things…

Accelerando, But Janky

The past couple of weeks have been sheer madness in the AI hype space, enough that I think it’s worthwhile capturing the moment for posterity.

The Madness

A key insight I’ve had this Thursday is that I think I’m regretting , because everything is so loud there. I went back to follow Salvatore, Armin and Mario, so my “For You” feed is now all , all the time, and although I can tune out the FOMO it induces, the pace is exhausting, particularly around .

might be a blip, but right now it has spawned another Cambrian explosion of DIY agents, with everyone and their lobster creating minimal/more secure/tailored versions of it. If sandboxing was already high on everyone’s mind, the fact that people are actually giving API keys and “money” to a set of convoluted digital noodles running in a JavaScript runtime with administrative privileges drove everyone over the edge in various ways.

In short, it’s the Wild West, and there are no stable, clear patterns emerging yet.

Which is why I’ve decided that is going to stay the way it is for the foreseeable future. I’m still developing a sandboxable/WASM-ready busybox clone, but I’m going to wait it out until proper reusable patterns emerge (spoiler: I don’t think we’ll get to a consensus this year, unless sticking to containers is a consensus).

The Frontier

Both Anthropic and OpenAI launched their incremental updates this week, pretty much on schedule and confirming that the diminishing-returns phase of LLMs is being offset by relentless optimization–but not by amazing breakthroughs.

Still, both are pretty decent upgrades. Since I have pretty big projects built with their previous iterations, I’ve let both loose on those codebases (I have taken to asking for a code smell/best practices/logic and security audit, including writing fuzzing tests) and both Opus 4.6 and Codex 5.3 spotted a few things.

What they both lack, sadly, is taste. Claude models tend to be much better at creating UIs than Codex but write absolutely shit tests, and Codex will often design API surfaces that make sense but are cumbersome to use–and neither of those traits (nor the annoying “personality” twists some stupid product managers insist on instilling in the models) was fixed by this week’s updates.

The demos are, of course, amazing, but what should matter at this point is accuracy (against specs), correctness (of code), and speed (which is what Codex 5.3 improved for me). I don’t particularly care about other people’s use cases, and neither should you.

Engineering Skills

I’ve been sticking to the new GitHub Copilot CLI because I can get any frontier model on it, so I’ve been isolated from the /minimal agent hype. I do like Pi and its minimal, shell-like workflow and approach, but I am still convinced that we need higher-level tooling (because, again, my ultimate goal is not to build coding agents).

Regardless of tools, people have finally cottoned on to —and although I think there is a relapse to the voodoo/cargo-culting of early prompt engineering approaches, there are a few useful nuggets out there worth collecting and adapting.

My take on it has been to fold them into a skel folder in agentbox, and ask Copilot to "take the skills, workflows and instructions from this URL and adapt them to the scope of our SPEC.md" for every new project - both Codex and Claude are smart enough to not just duplicate the structure, but also to rewrite the skills to better suit the project, which is delightful.

Then during the project I will often ask Copilot to capture a specific workflow into a new skill (for instance, this week I got some feedback on development, and that is now codified in this file).

And after things stabilize a bit, I can take any truly new skills or useful updates and put them back into my little archive, rather than collecting random stuff off the Internet that might never really suit my workflow and tooling.

Media

One of the few worthwhile outcomes of dipping into Twitter/X is that I can gauge the mass market impact of image and video generation with a broader view than just hanging around r/StableDiffusion, and this week was quite interesting in that regard.

Kling has been seeding social media with pretty amazing (but still detectable) AI shorts. They claim Hollywood is dead, but realistically I’d say video advertising (where impactful short content rules and there is maximal return) is going to be revolutionized, because good creatives will certainly know what to do with it.

That is worrying on several fronts, especially considering that even official sources seem to be using AI-generated media these days.

But I am cautiously optimistic that once visual inconsistencies are sorted out (or at least minimized and papered over by human SFX editors) we might see some actually good content coming out of it.

Notes for January 26 - February 1

I’ve had some feedback that my last few weekly notes (especially ) have been a bit too long and that I should try to keep them shorter and more focused on a single topic.

Well, this week I wrote a lot thanks to the inclement weather and some insomnia, so I broke most of it out into separate posts:

This is the age of TikTok, after all, and attention spans are shorter than ever, so I might as well try to adapt.

But there was a lot more going on this week, so here’s a quick roundup of two other things I worked on:

go-rdp Improvements

My web-based RDP client got UDP transport support—experimental, gated behind a --udp flag, but apparently functional. So far it’s going well, even if audio support is not all there yet (and yes, I know it’s a bit much for a web client).

This is one of my “things that should exist” projects that I’ve been using to experiment with how to spoon-feed agents with highly complex protocol specs, and it’s been working out great so far, largely because RDP is so well documented.

The test suite now includes Microsoft Protocol Test Suite validation tests for RDPEUDP, RDPEMT, RDPEDISP, and RDPRFX. It’s the kind of spec compliance work that’s tedious but essential, and that I am increasingly convinced can be massively sped up with (after all, it took me less than a working week of continuous effort to do the whole thing, instead of the months it would have taken me otherwise).

go-ooxml from OO to… Hero?

As another data point, I decided to build my own Office Open XML library in this week, called go-ooxml. It too is meant to be a clean-room implementation, and I intend it to be a comprehensive replacement for the various format-specific libraries out there–I’ve already been using as an accelerator for in things like pysdfCAD, so I know it works.

And it also fits into the “things that I think should exist” category, because existing libraries for this are either abandoned, incomplete, or have APIs that make me want to cry.

And on the front the ECMA-376 spec is perfect:

  • It is a real thing people used for other implementations, and progress has been ridiculous–agents are genuinely useful for this kind of spec-driven implementation work.
  • In about two days (less than four hours of “actual work”), I went from initial project setup to enough to handle basic document creation (mostly Word, which is the most complex)

Is it production-ready? Of course not.

But it’s already more complete than many alternatives, and having the foundation right means I can add features as I need them without fighting the architecture.

Agents make the breadth possible, but they don’t remove the need for taste: deciding what the public API should look like, what’s testable, and what’s going to be maintainable six months from now is, of course, the hard part still to come…

At least I’m aware of it, I guess.

Thoughts on AI-Assisted Software Development in 2026

A few things I jotted down during –i.e., while building out my agentbox and webterm setups and other things.

Agents love specs. My go-ooxml project went from nothing to 60% spec compliance in days because I fed the agents the actual ECMA-376 spec documents and told them to implement against those. No hallucination about what XML elements should be called, no invented APIs—just spec-compliant code.

Mobile is an afterthought until it isn’t. Half the webterm fixes this week were iOS/iPad edge cases. If you’re building tools you’ll use on multiple devices, test on all of them early, because agents can only help you with things that they can test for autonomously.

The unglamorous work matters. I did a lot of CI/CD cleanup jobs, release automation, pipelines, and invested quite a few hours in creating solid SKILL.md scaffolding–none of this is exciting, but it’s what separates a tool you can rely on from a tool that occasionally bites you, and right now, for me, at least, it’s what makes agents genuinely useful.

There’s going to be more software. With such a low barrier to entry into new languages, tools or frameworks, any decent programmer is soon going to realize that their skills are transferable to the point where they can take on work in any technology.

There’s going to be more shitty software because, well, there are a lot of overconfident people out there, for starters, and the law of averages is inevitably going to kick in at some point. I am acutely aware that I am treading a fine line between “productive developer leveraging AI” and “architecture astronaut”, but my focus is always on shipping self-contained, small tools that solve real problems (for me, at least), so I hope I can avoid that pitfall.

The number of truly gifted developers is going to stay roughly the same, because programming, like any form of engineering, is a mindset much more than it is a skill.

Some of these are debatable, of course, but they are my current take on things. Let’s see how they hold up over time.

Vibing with the Agent Client Protocol

Although most of my actual work , I have been wanting an easy way to talk to the newfangled crop of agents from my iPhone.

So I spent a good chunk of this week building out a Slack-like web interface for chatting with agents via the Agent Client Protocol (ACP) called vibes.

Right now, it looks like this:

Just a web view, right? Well, not quite.

The UX

I blame for planting the seed of a chat-like interface for agents in my mind–this is not replacing my terminal-based workflow, but it’s a nice complement, especially for quick check-ins or when I want to give an agent a task and review the results later from my phone.

The web layer is “just” Preact and SSE, with enough CSS for it to work nicely on small screens and with touch input, and the main timeline view shows messages from me and the agent, with support for rich content like code blocks, KaTeX formulae, images, and resource links.

But the key thing is the tool permission flow: when the agent wants to call a tool, it shows a modal with an explanation of what the tool does (fetched from the ACP server), and I can approve or deny it with a tap–that is the key part of ACP that I wanted to leverage, and that so far I’ve only seen in CLI/TUI clients.

The Back-End

How things tie together

One thing that isn’t on the diagram above is the database. I love for small projects like this, and all the more so now that I learned the tricks around using JSON columns for flexible data storage. And, of course, you get full text search support out of the box, which is perfect for searching what I intend to be an infinite timeline.
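A sketch of the JSON-column-plus-full-text-search pattern, assuming SQLite with the FTS5 extension compiled in. The table names, the payload shape and the add_post and search helpers are made up for illustration and are not the actual vibes schema:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, data TEXT NOT NULL);
    CREATE VIRTUAL TABLE posts_fts USING fts5(content);
""")


def add_post(payload: dict) -> int:
    """Store the whole message as JSON and index only its text."""
    cur = db.execute("INSERT INTO posts (data) VALUES (?)", (json.dumps(payload),))
    db.execute(
        "INSERT INTO posts_fts (rowid, content) VALUES (?, ?)",
        (cur.lastrowid, payload.get("text", "")),
    )
    return cur.lastrowid


def search(term: str) -> list[tuple[int, str]]:
    """Full-text search, returning (id, role) pulled from the JSON column."""
    return db.execute(
        """SELECT posts.id, json_extract(posts.data, '$.role')
           FROM posts_fts JOIN posts ON posts.id = posts_fts.rowid
           WHERE posts_fts MATCH ? ORDER BY posts.id""",
        (term,),
    ).fetchall()
```

The JSON column means new message fields need no schema migration, while the separate FTS table keeps the search index limited to the text that is actually worth searching.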

Hacking In ACP

Like , I am . ACP has many of the same flaws, except that now you also have to deal with the ambiguity of how to surface all of the interactivity you’d have in a TUI in a chat timeline.

Content parsing went through several iterations to handle all the edge cases: tool calls, thinking panels, resource links, embedded resources with annotations, live updates from the agent, etc.

And I had to test it with multiple ACP servers, since each implementation has its own quirks. Right now, vibes works reasonably well with my python-steward, Mistral’s vibe and GitHub Copilot CLI, but all of them have small differences in how they implement the spec.

If I had to do it again, I would have probably built a proper acp client library in or first, but since I was building both the client and server sides at the same time, I just kept iterating on the wire format until everything worked.

But Why?

It’s not just the convenience of having a cute web app on my phone–having a low-friction review loop is essential when working with agents (which is why I was keen on leveraging ACP in the first place), but I also wanted persistent history and richer rendering than what a terminal can provide, because I want to give my agents more complex tasks that involve multiple steps and outputs.

Everyone and their dog seems to be thinking that agents only have to have bash (Armin Ronacher makes some excellent points), but I am trying to strike a balance when designing steward: Give it all the tools it should need for most use cases, a little scripting engine (QuickJS) for extensibility, and extensive SKILL.md support so I can teach it to do new things.

The Sandboxing Endgame

I am pretty sure that my endgame will eventually involve WASM (maybe tinygo in a sandbox or a Cloudflare-like V8 isolate) and I’m actually hedging my bets by looking at porting a subset of busybox to , but for the moment I want to keep things simple and give agents access to higher-level tools that can do complex things without needing to script them from scratch.

Because, well, I don’t want to write coding agents. There’s a special kind of myopia around their incredible success, but I think there should be some balance in the Force.

Crustaceans are cool and all, but sometimes you just want to vibe with your agent about something as prosaic as scheduling a meeting or searching your vault.

Seizing The Means Of Production (Again)

Since , I’ve been hardening my agentbox and webterm setup through sheer friction. The pattern is still the same:

Small, UNIX-like, loosely coupled pieces which I then glue together with labels, WebSockets and enough duct tape to survive , and the goal is still the same: to reduce friction and cognitive load when managing multiple agents across what is now almost a dozen development sandboxes.

My take on tooling is, again, that a good personal stack is boring. You should be able to understand it end-to-end, swap parts in and out, and keep it running even when upstreams change direction. The constraint is always the same: my time is finite, so any complexity I add needs to pay rent.

And boy, has there been complexity this week.

Going Full-On WASM and WebGL

I rewrote a huge chunk of webterm this week.

I started out with the excellent Textual scaffolding (i.e., xterm.js + a thin server), but I kept having weird glitches (mis-aligned double-width characters, non-working mobile input, theme handling, etc.).

So, being myself, I decided to reinvent that particular wheel, and serendipitously I stumbled onto a build of Ghostty that is pretty amazing–it can render using WebGL, fixed all of my performance issues with xterm.js, and… well, it was a bit of a challenge to deal with, but only because of a few incomplete features.

In the pre- days, I would have stopped there, but this week it took me under an hour to create a patched fork of ghostty-web that filled in the gaps I wanted and that I could just drop into webterm.

Then came the boring part–ensuring the font stack worked properly across platforms, fixing a few rendering glitches, replacing the entire screenshot capture stack (which is what I loved about Textual) with pyte, and… a lot of mobile testing.

Still, the end result is totally worth it:

The prettiest thing I did all week

There were a lot of little quality-of-life improvements that came out of this rewrite:

  • The dashboard got typeahead search, so I can quickly find the right sandbox among many.
  • And the most satisfying cosmetic fix: dashboard screenshots now use each session’s actual theme palette.
  • PWA support landed, so the iPad can treat it like a proper “app”.
  • The WebSocket plumbing got a proper send queue so slow clients couldn’t freeze other sessions.
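The send queue idea can be sketched with asyncio as follows. This is an assumption about the general shape, not webterm's actual code: ClientSender is a hypothetical name, `send` stands in for whatever coroutine writes a frame to the socket, and dropping the oldest frame is just one possible overflow policy:

```python
import asyncio


class ClientSender:
    """Per-client outbound queue drained by its own task."""

    def __init__(self, send, max_pending: int = 256):
        self._send = send  # coroutine that writes one frame to the socket
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
        self.dropped = 0

    def enqueue(self, frame: bytes) -> None:
        """Never blocks the producer: if the client is too slow, drop the oldest frame."""
        if self._queue.full():
            self._queue.get_nowait()
            self.dropped += 1
        self._queue.put_nowait(frame)

    async def run(self) -> None:
        """Drain the queue; a stalled socket only stalls this task."""
        while True:
            frame = await self._queue.get()
            await self._send(frame)
```

Because the terminal session only ever calls enqueue, which returns immediately, one slow browser tab cannot hold up output for the others. For a terminal stream, disconnecting the client or forcing a full redraw on overflow may be a better policy than dropping frames.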

I would have rewritten this in , but as it happens the Go equivalent of pyte didn’t seem to be good enough yet, and running half a dozen sessions at a time for a single person isn’t a load-sensitive setup anyway.

Again, this is all about reducing friction: Color helps me recognize the project, and typeahead find makes it trivial to, well… find. The less mental overhead I have to deal with when switching contexts, the more likely I am to actually use the tools I’ve built.

Mobile Woes

But getting it to work properly on mobile was a pain:

  • Mobile keyboard handling was a mess. You can’t customize the onscreen keyboard in the browser, and modifier keys were especially problematic.
  • To make mobile usable for real work (not just htop screenshots), webterm now pops up a draggable keybar with Esc/Ctrl/Shift/Tab and arrow keys, which are “sticky” so you can tap out proper Ctrl/Shift arrow sequences–and Ctrl+C, which is kind of essential.
  • Focus was a big problem. is incredibly finicky about browser input events–and if you test on an iPad with a keyboard attached, you miss half the problems. The “solution” was to monkeypatch input via a hidden textarea that captures all input events and forwards them to the terminal renderer–and that still breaks in weird, unpredictable ways.
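What the sticky modifiers have to produce can be sketched as follows. The real keybar lives in the browser, so this Python version only illustrates the byte sequences involved; encode_key is a hypothetical name, and it assumes the terminal is in normal (not application) cursor mode and follows the xterm modifier encoding:

```python
ARROWS = {"up": "A", "down": "B", "right": "C", "left": "D"}


def encode_key(key: str, ctrl: bool = False, shift: bool = False) -> bytes:
    """Translate a keybar tap plus sticky modifiers into terminal input bytes."""
    if key in ARROWS:
        # xterm encodes modifiers as 1 + (1 for Shift) + (4 for Ctrl).
        modifier = 1 + (1 if shift else 0) + (4 if ctrl else 0)
        if modifier == 1:
            return f"\x1b[{ARROWS[key]}".encode()
        return f"\x1b[1;{modifier}{ARROWS[key]}".encode()
    if ctrl and len(key) == 1 and key.isascii() and key.isalpha():
        # Ctrl+letter maps to the control codes 0x01..0x1a (Ctrl+C is 0x03).
        return bytes([ord(key.upper()) - 0x40])
    return key.encode()
```

So tapping Ctrl and then C sends the single byte 0x03, and Ctrl plus the left arrow sends ESC [ 1 ; 5 D.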

I might have gone a bit overboard with testing–I don’t have an Android tablet, so I decided to test on my Oculus Quest 2 headset browser, which is almost Android with a head strap:

Testing `webterm` on the Oculus Quest 2 browser--it works surprisingly well!

ANSI Turtles All The Way Down

Then came even weirder rendering bugs, since, well, terminals are terminals. And for such a simple concept, the stack is surprisingly complex:

Each and every one of those arrows gave me a headache

For instance, you’ll notice in the diagram above that there is a PTY layer and in the mix. That means there are two layers of terminal emulation happening, and both need to be configured properly to avoid glitches.

For instance, I kept getting 1;10;0c when I connected, which led me down the weirdness of ANSI escape codes and nested terminal emulators (something I hadn’t done since running emacs to wrap VAX sessions…). sends DA2 queries, but my wrapper ended up having to filter more than DA1 responses without messing up UTF-8 sequences.
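The stray 1;10;0c looks like the tail of a Device Attributes reply leaking through as text. A minimal sketch of filtering such replies, assuming they arrive whole within a single chunk (strip_da_responses is a hypothetical name, not the wrapper's actual code):

```python
import re

# Device Attributes replies: primary is ESC [ ? ... c, secondary is ESC [ > ... c.
DA_RESPONSE = re.compile(rb"\x1b\[[?>][0-9;]*c")


def strip_da_responses(chunk: bytes) -> bytes:
    """Remove DA1/DA2 replies from a byte stream, leaving everything else untouched."""
    return DA_RESPONSE.sub(b"", chunk)
```

Matching on raw bytes instead of decoded text is what keeps multi-byte UTF-8 sequences intact: the pattern only ever removes pure ASCII escape sequences, so it cannot cut a character in half. A production filter would also need to buffer a reply that is split across two reads.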

Then I realized that the Copilot CLI sends a bunch of semi-broken escape sequences that pyte couldn’t handle properly, which led to all sorts of rendering glitches in the screenshots, and another round of patches, and another…

Scaffolding The Future

I also spent a good chunk of time this week improving the agentbox Docker setup, adding better release automation, cleaning up old artifacts, and generally making it easier to spin up new sandboxes with the right tools and my secret weapon:

A set of starter SKILL.md files that teach the bundled agents how to manage the environment, use how I prefer to develop, and generally be useful and run through proper code/lint/test/fix cycles without me having to babysit them.

Right now I’m at a point where I can just go into any of my git repositories, run make init (or, if it’s an old project, point Copilot at the skel files and tell it to read and adapt them according to the local SPEC.md), and have a fully functional AI agent sandbox ready to go.

That I can do that and the infra for it in under a minute, with proper workspace mappings, RDP/web terminal access, and to get the results back out, is just… chef’s kiss.

Ah well. At least now I have a pretty solid UX that even works on my ageing iPad Mini 5 snappily enough (as long as I don’t try to open too many tabs), and I can finally start focusing on other stuff.

Which I sort of did, all at once…

TIL: Apple Broke Time Machine Again On Tahoe

So… Here we are again.

Today, after a minor disaster with my vault, I decided to restore from Time Machine, and… I realized that it had silently broken across both my Tahoe machines. I use a NAS as Time Machine target, exporting the share over and that has worked flawlessly for years, but this came as a surprise because I could have sworn it was working fine a couple of months ago–but no, it wasn’t.

For clarity: It just stopped doing backups, silently. No error messages, no notifications, nothing. Just no backups for around two months. On my laptop, I only noticed because I was trying to restore a file and the latest backup was from December. On my desktop, I had a Thunderbolt external drive as a secondary backup.

After some research, I found out that the issue is with unilateral decision to change their SMB defaults (without apparently notifying anyone), and came across a few possible fixes.

What Seems To Be Working Now

I found this gist, which I am reproducing here for posterity, that seems to be working for me, but which entails editing the nsmb.conf file on the Mac itself–which is not exactly ideal, since I’m pretty sure Apple will break this again in the future.

sudo nano /etc/nsmb.conf # I used vim, of course

…and adding the following lines (the file should be empty):

[default]
signing_required=yes
streams=yes
soft=yes
dir_cache_max_cnt=0
protocol_vers_map=6
mc_prefer_wired=yes

The explanation here is that Tahoe changed the default from signing_required=no to stricter control, and NAS devices with relaxed SMB settings cannot handle this without explicit configuration.

Another common pitfall is name encoding issues in machine names, so you should remove non-ASCII characters from the .sparsebundle name (that wasn’t an issue for me, but YMMV).

On the side, the recommendation was to go to Control Panel > File Services > SMB > Advanced and set:

  • Maximum SMB protocol: SMB3
  • Enable Opportunistic Locking: Yes
  • Enable SMB2 Lease: Yes
  • Enable SMB Durable Handles: Yes
  • Server signing: No (or “Auto”)
  • Transport encryption: Disabled

That doesn’t quite match my DSM UI, but it’s close enough, and my settings now look like this:

My SMB settings, as of DSM 7.3.2-86009-1

My Backup Backup Plan

Since I’m tired of Apple breaking Time Machine every few years and the lack of transparency around this (it’s not ’s fault), I have decided to implement a more robust solution that doesn’t depend on Synology’s SMB implementation.

I already have that has an LXC container running Samba for general file sharing, so I decided to look into that as a possible Time Machine target.

As it happens, mbentley/timemachine is a image specifically designed for this purpose, and it seems to be well-maintained, so I’m testing it like this:

services:
  timemachine:
    image: mbentley/timemachine:smb
    container_name: timemachine
    restart: always
    network_mode: host
    environment:
      - TM_USERNAME=timemachine
      - TM_GROUPNAME=timemachine
      - PASSWORD=timemachine
      - TM_UID=65534 # 'nobody' user
      - TM_GID=65534 # 'nobody' group
      - SET_PERMISSIONS=false
      - VOLUME_SIZE_LIMIT=0
    volumes:
      # this is a pass-though mountpoint to the ZFS volume in Proxmox
      - /mnt/shares/timemachine:/opt/timemachine
    tmpfs:
      - /run/samba

Right now the first option seems to be working, but I will probably switch to the solution in the near future, since it gives me more control over the implementation and avoids relying on ’s software.

But if anyone from Apple is reading this: please, stop breaking Time Machine every few years. It’s a critical piece of infrastructure for many users, and the lack of communication around these changes is frustrating.

The Third Way: Borg Backup

I have been using Borg for some time now on , and I am considering using it for my Macs as well. seems decent, I just haven’t tried it yet.

A Minor, Yet Annoying, Additional Problem

Plus I’m annoyed enough that earlier this morning I tried to set up a new device and the infamous Restore in Progress: An estimated 100 MB will be downloaded… bug (which has bitten me repeatedly over the last six years) is still there.

The usual fix was hitting Reset Network Settings and a full hardware reboot, plus reconnecting to Wi-Fi… But this time it took three attempts.

Come on, Apple, get your act together. Hire people who care about the OS experience, not just .

Notes for January 19-25

Since , I’ve been heads-down building a coding agent setup that works for me and using it to build a bunch of projects, and I think I’ve finally nailed it. A lot more stuff has happened since then, but I wanted to jot down some notes before I forget everything, and my next weekly post will probably be about the other projects I’ve been working on.

Seizing The Means Of Production

I have now achieved coding agent nirvana–I am running several instances of my agentbox code agent container in a couple of VMs (one trusted, another untrusted), and am using my textual-webterm front-end to check in on them with zero friction:

My trusted set of agents

This is all browser-based, so one click on those screenshots (which update automatically based on terminal activity) opens the respective terminal in a new tab, ready for me to review the work, pop into vim for fixes, etc. Since the agents themselves expend very little CPU or RAM and I’ve capped each container to half a CPU core, a 6-core VM can run literally dozens of agents in parallel, although the real limitation is my ability to review the code.

But it’s turned out to be a spectacularly productive setup – a very real benefit for me is having the segregated workspaces constantly active, which saves me hours of switching between them in , and another is being able to just “drop in” from my laptop, desktop, iPad, etc.

As someone who is constantly juggling dozens of projects and has to deal with hundreds of context switches a day, the less friction I have when coming back to a project the better, and this completely fixes that. Although I had this mostly working last week, getting the pty screen capture to work “right” was quite the pain, and I had to guide the LLM through various ANSI and double-width character scenarios–that would be worth a dedicated post on its own if I had the time, but anyone who’s worked with terminal emulators will know what I’m talking about.

You Wanted Sandboxing? You Got Sandboxing

Another benefit of this approach is that none of the agents are running locally and can’t possibly harm any of my personal data.

The whole thing (minus , which is how I connect everything securely) looks like this:

I had to explain this to a few people already, so here's the detailed diagram

I have several levels of sandboxing in place:

  • Each container is an agentbox instance with its own /workspace folder
  • Containers are capped in both CPU and RAM (although that only impacts their ability to run builds and tests–but even Playwright testing works fine)
  • The containers are running in a full VM inside (capped at six cores and 16GB) and one of my ARM boards (more cores, but just 8GB of physical RAM)
  • The “untrusted” agents use LiteLLM to access Azure OpenAI, so they never have production keys and can be capped in various ways
  • Each setup runs a instance that syncs the workspace contents back to my Mac so I can do final reviews, testing and commits–that’s the only way any of the code reaches my own machine.

As to the actual agent TUI inside the agent containers, I’m using the new GitHub Copilot CLI (which gives me access to both Anthropic’s Claude Opus 4.5 and OpenAI’s GPT-5.2-Codex models), Gemini (for kicks) and Mistral Vibe (which has been surprisingly capable).

I relegated OpenCode to the “untrusted” tier, and I also have my own toy coding assistant (based on python-steward, and focused on testing custom tooling) there.

KISS

A good part of the initial effort was bootstrapping this, of course, but since I did it the UNIX way (simple tools that work well together), I’ve avoided the pitfall that most agent harnesses/sandboxing tools fall into, which is building full-blown, heavily integrated environments that take forever to set up and are a pain to maintain.

I don’t care about that, and prefer to keep things nice and modular. Here’s an example of my docker compose file:

---
x-env: &env
  DISPLAY: ":10"
  TERM: xterm-256color
  PUID: "${PUID:-1000}"
  PGID: "${PGID:-1000}"
  TZ: Europe/Lisbon

x-agent: &agent
  image: ghcr.io/rcarmo/agentbox:latest
  environment:
    <<: *env
  restart: unless-stopped
  deploy:
    resources:
      limits:
        cpus: "${CPU_LIMITS:-2}"
        memory: "${MEMORY_LIMITS:-4G}"
  privileged: true # Required for Docker-in-Docker
  networks:
    - the_matrix

services:
  syncthing:
    image: syncthing/syncthing:latest
    container_name: agent-syncthing
    hostname: sandbox
    environment:
      <<: *env
      HOME: /var/syncthing/config
      STGUIADDRESS: 0.0.0.0:8384
      GOMAXPROCS: "2"
    volumes:
      - ./workspaces:/workspaces
      - ./config:/var/syncthing/config
    network_mode: host
    restart: unless-stopped
    cpuset: "0"
    cpu_shares: 2
    healthcheck:
      test: curl -fkLsS -m 2 127.0.0.1:8384/rest/noauth/health | grep -o --color=never OK || exit 1
      interval: 1m
      timeout: 10s
      retries: 3

  # ... various agent containers ...

  guerite:
    <<: *agent
    container_name: agent-guerite
    hostname: guerite
    environment:
      <<: *env
      ENABLE_DOCKER: "true" # this one needs nested Docker
    labels:
      webterm-command: docker exec -u agent -it agent-guerite tmux new -As0 \; attach -d
    volumes:
      - config:/config
      - local:/home/agent/.local
      - ./workspaces/guerite:/workspace

  go-rdp:
    <<: *agent
    container_name: agent-go-rdp
    hostname: go-rdp
    ports:
      - "4000:3000" # RDP service proxy
    labels:
      webterm-command: docker exec -u agent -it agent-go-rdp tmux new -As0 \; attach -d
    volumes:
      - config:/config
      - local:/home/agent/.local
      - ./workspaces/go-rdp:/workspace

# ... more agent containers ...

volumes:
  config:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ./home
  local:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ./home/.local

networks:
  the_matrix:
    driver: bridge

You’ll notice the labels, which are what textual-webterm uses to figure out which containers to talk to.
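As a rough illustration of label-driven discovery (a sketch only, not how textual-webterm is actually implemented), the hypothetical helper below pulls every webterm-command label out of a compose file with awk. It assumes the two-space service indentation used in the file above and does no real YAML parsing.

```shell
#!/bin/sh
# list_webterm_commands FILE
# Hypothetical helper: prints "service<TAB>command" for each
# webterm-command label found in a compose file.
list_webterm_commands() {
  awk '
    # A key indented by exactly two spaces with nothing after the colon
    # is treated as a service name (assumed layout, not real YAML parsing).
    /^  [A-Za-z0-9_-]+:[ \t]*$/ { svc = $1; sub(/:$/, "", svc) }
    # Any webterm-command label belongs to the last service name seen.
    /^[ \t]+webterm-command:/ {
      cmd = $0
      sub(/^[ \t]+webterm-command:[ \t]*/, "", cmd)
      printf "%s\t%s\n", svc, cmd
    }
  ' "$1"
}
```

Calling `list_webterm_commands` on the file above would print one line per labelled container, which is enough to build a list of terminals to offer.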

The Outputs

It’s been insane, since this setup lets me drop back into each project at the click of a link and guide the agents for a couple of minutes at a time, or take notes and write specs in a separate window. Either of those fits well with my workflow and doesn’t require me to fire up a bloated IDE and load a project folder (which can take quite a long time on its own).

So I now have the ability to create a bunch of things that I think should exist:

  • I now have my web-based RDP client working with a back-end that uses tinygo and to do high-performance decoding in the browser (which is something I’ve always wanted), and I decided to push it to the limit against the public test suites because I think a Go-based RDP client is something that should exist.
  • I took the existing pysdfCAD implementation of signed distance functions and replaced the slow marching cubes implementation it was using to render STL meshes with a Go-based backend that renders meshes much faster and with better quality (when it works–I need to sort out some bugs).
  • I built two (for now) extensions for mind-mapping and Kanban that match what I currently need from (and will be looking at enhancing Foam to match the editor soon)
  • I’m taking a couple of years of hacky scripts and building a writing agent that is going to help me do the automated conversion and re-tagging of the 4000+ legacy pages of this site that are still in format (the name editor is taken for building a WYSIWYG editor to replace with )
  • I ported a bunch of my own stuff (and a few fun things, like Salvatore Sanfilippo’s embedding model) to .
  • I started packaging my own servers as Azure App Services, so I can use the basic techniques to accelerate .

Lessons Learned

I’ve read about the Ralph Wiggum Loop, and it’s not for me (I find it to be the kind of thing you’d do if you’re an irresponsible adolescent rich kid with an inexhaustible supply of money/tokens and don’t really care about the quality of the results, and that’s all I’m going to say about it for now).

  • (write a SPEC.md, instruct the agents to run full lint/test cycles and aim for 80% test coverage, then go back, review and write TODO.md files for them to base their internal planning on and work in batches) still works the best as far as final code quality is concerned. I still have to ask for significant refactorings now and then, but since my specs are usually very detailed (what libraries to use, which should be vendored, what code organization I want, and what specific test scenarios I believe to be the quality bar) things mostly work out fine.
  • Switching between models for coding and auditing/testing is still key. Claude (even Opus) has a tendency to be overly creative in tests, so I typically ask for test and security audits with GPT-5.2 that catch dozens of obviously stupid things that the Anthropic models did. Gemini is still a grey area, since I’m just using the free tier for it (although it seems unsurprisingly good at architecting packages).
  • Switching between frontier and small(ish) models for coding and testing also works great. gpt-5-mini, sonnet, haiku, mistral and gemini flash do a very adequate job of running and fixing most test cases, as well as front-end coding.
  • really doesn’t like when agents create virtualenvs or install npm packages, so I routinely have to tell the agents that they are in a containerized environment and that it’s fine to install pip and npm packages globally (i.e., outside the workspace mount point).
  • a little while back, is still the way to go for deterministic results with tools. Support for (and SKILL.md) is very uneven across all the current agentic TUIs, but with a few strategically placed symlinks I can have a workspace setup that works well across and the remote agents.
  • Having a shared set of tooling and skills across as many of your agents as possible really cuts down on the amount of prompting and scaffolding agents need to create per project. In that regard, umcp has probably been the best bang for the buck (or line of code) that I wrote in 2025, because I use it all the time.
  • Claude Code and Gemini have a bunch of teething issues with . Fortunately both Mistral Vibe and the new Copilot CLI work pretty well, and clipboard support is flawless even when using them inside both and textual-webterm.
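The symlink trick mentioned in the list above can be sketched as follows. This is a minimal example under stated assumptions: `link_skills` is a hypothetical helper, and the directory names in the usage note are illustrative, not the exact layout used here. It creates relative links, so they keep working when the workspace is mounted at a different path inside a container.

```shell
#!/bin/sh
# link_skills WORKSPACE SOURCE_DIR TARGET...
# Hypothetical helper: symlink one shared skills directory into each
# agent-specific location. All paths are relative to WORKSPACE.
link_skills() {
  ws=$1
  src=$2
  shift 2
  for target in "$@"; do
    mkdir -p "$ws/$(dirname "$target")"
    # Build one "../" per directory level of the target.
    rel=""
    d=$(dirname "$target")
    while [ "$d" != "." ]; do
      rel="../$rel"
      d=$(dirname "$d")
    done
    # -n replaces an existing symlink instead of descending into it.
    ln -sfn "$rel$src" "$ws/$target"
  done
}
```

For example, `link_skills /workspace .github/skills .claude/skills` would make `.claude/skills` point at `../.github/skills` (both directory names are assumptions for illustration).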

And, finally, coding agents are like crack. My current setup is so addictive I find myself reviewing work and crafting TODOs for the agents from my iPad before I go to bed instead of easing myself into sleep with a nice book, which is something I really need to put some work into.

But I have a huge, decades-long list of ideas and projects I think should exist, and after three years of hallucinations and false starts, we’re finally at an inflection point where, for someone with my particular set of skills and experience, LLMs are a tremendous force multiplier for getting (some) stuff done, provided you have the right setup and workflow.

They’re still very far from perfect, still very unreliable without the right guardrails and guidance, and still unable to replace a skilled programmer (let alone an accountant, a program manager or even your average call center agent), but in the right hands, they’re not a bicycle for the mind–they’re a motorcycle.

Or a wingsuit. Just mind the jagged cliffs zipping past you at horrendous speeds, and make sure you carry a parachute.

The NestDisk

This one took me a while (for all the reasons you’ll be able to read elsewhere in recent posts), but the NestDisk has been quietly running beside my desktop for a month now, and it’s about time I do it justice.


Notes for January 1-18

Return to work happened mostly as expected–my personal productivity instantly tanked, but I still managed to finish a few things I’d started during the holiday break–and started entirely new ones, which certainly didn’t help my ever-growing backlog.


My Rube Goldberg RSS Pipeline

Like everybody else on the Internet, I routinely feel overwhelmed by the volume of information I “have” to keep track of.


Notes on SKILL.md vs MCP

Like everyone else, I’ve been looking at SKILL.md files and tried converting some of my tooling into that format. While it’s an interesting approach, I’ve found that it doesn’t quite work for me as well as does, which is… intriguing.


When OpenCode decides to use a Chinese proxy

So here’s my cautionary tale for 2026: I’ve been testing toadbox, my very simple, quite basic coding agent sandbox, with various .


Lisbon Film Orchestra

Great start to the show
A little while ago, in a concert hall not that far away…

How I Manage My Personal Infrastructure in 2026

As regular readers would know, I’ve been on the homelab bandwagon for a while now. The motivation for that was manifold, starting with the pandemic and a need to have a bit more stuff literally under my thumb.


Notes for December 25-31

OK, this was an intense few days, for sure. I ended up going down around a dozen different rabbit holes and staying up until 3AM doing all sorts of debatably fun things, but here are the most notable successes and failures.


TIL: Restarting systemd services on sustained CPU abuse

I kept finding avahi-daemon pegging the CPU in some of my LXC containers, and I wanted a service policy that behaves like a human would: limit it to 10%, restart immediately if pegged, and restart if it stays above 5%.

Well, turns out systemd already gives us 90% of this, but the documentation for that is squirrely, and after poking around a bit I found that the remaining 10% is just a tiny watchdog script and a timer.

Setup

First, contain the daemon with CPUQuota:

sudo systemctl edit avahi-daemon
[Service]
CPUAccounting=yes
CPUQuota=10%
Restart=on-failure
RestartSec=10s
KillSignal=SIGTERM
TimeoutStopSec=30s

Then create a generic watchdog script at /usr/local/sbin/cpu-watch.sh:

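#!/bin/bash
set -euo pipefail

UNIT="$1"
INTERVAL=30   # seconds; keep in sync with OnUnitActiveSec in the timer
QUOTA_PCT=10  # keep in sync with CPUQuota in the service override

# Policy thresholds, in CPU nanoseconds per interval
PEGGED_NS=$((INTERVAL * 1000000000 * QUOTA_PCT / 100 * 9 / 10)) # 90% of the quota
SUSTAINED_NS=$((INTERVAL * 1000000000 * 5 / 100))               # 5% CPU

STATE="/run/cpu-watch-${UNIT}.state"

current=$(systemctl show "$UNIT" -p CPUUsageNSec --value)
# No usable counter (e.g. "[not set]"): nothing to compare
[[ "$current" =~ ^[0-9]+$ ]] || exit 0

previous=""
[[ -f "$STATE" ]] && previous=$(cat "$STATE")
echo "$current" > "$STATE"

# Skip the first run and counter resets (the counter restarts with the service)
if [[ ! "$previous" =~ ^[0-9]+$ ]] || (( current < previous )); then
  exit 0
fi

delta=$((current - previous))

# Restart if pegged (hitting CPUQuota)
if (( delta >= PEGGED_NS )); then
  logger -t cpu-watch "CPU pegged for $UNIT (${delta}ns), restarting"
  systemctl restart "$UNIT"
  exit 0
fi

# Restart if above 5% over the whole interval
if (( delta >= SUSTAINED_NS )); then
  logger -t cpu-watch "Sustained CPU abuse for $UNIT (${delta}ns), restarting"
  systemctl restart "$UNIT"
fi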

…and mark it executable: sudo chmod +x /usr/local/sbin/cpu-watch.sh

It’s not ideal to have hard-coded thresholds or to hit storage frequently, but in most modern systems /run is a tmpfs or similar, so for a simple watchdog this is acceptable.

The next step is to figure out how to use it via systemd templates:

# cat /etc/systemd/system/cpu-watch@.service
[Unit]
Description=CPU watchdog for %i
After=%i.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/cpu-watch.sh %i.service
# cat /etc/systemd/system/cpu-watch@.timer
[Unit]
Description=Periodic CPU watchdog for %i

[Timer]
OnBootSec=2min
OnUnitActiveSec=30s
AccuracySec=5s

[Install]
WantedBy=timers.target

The trick I learned today was how to enable it with the target service name:

sudo systemctl daemon-reload
sudo systemctl enable --now cpu-watch@avahi-daemon.timer

You can check it’s working with:

sudo systemctl list-timers | grep cpu-watch
# this should show the script restart messages, if any:
sudo journalctl -t cpu-watch -f

Why This Works

The magic, according to Internet lore and a bit of LLM spelunking, is in using CPUUsageNSec deltas over a timer interval, which has a few nice properties:

  • Short CPU spikes are ignored, since the timer provides natural hysteresis
  • Sustained abuse (>5%) triggers restart
  • Pegged at quota (90% of 10%) triggers immediate restart
  • Runaway loops are contained by CPUQuota
  • Everything is systemd-native and auditable via journalctl
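To make the arithmetic behind that list concrete, here is a small sketch that classifies one sampling window. `classify_delta` is a hypothetical helper (not part of the watchdog script), and it takes the interval and quota as parameters rather than reading them from systemd.

```shell
#!/bin/sh
# classify_delta DELTA_NS INTERVAL_S QUOTA_PCT
# Hypothetical helper: prints "pegged", "sustained" or "ok" for the CPU
# time (in nanoseconds) consumed during one timer interval.
classify_delta() {
  delta=$1
  interval=$2
  quota=$3
  window_ns=$((interval * 1000000000))
  pegged_ns=$((window_ns * quota / 100 * 9 / 10))  # 90% of the quota
  sustained_ns=$((window_ns * 5 / 100))            # 5% of one CPU
  if [ "$delta" -ge "$pegged_ns" ]; then
    echo pegged
  elif [ "$delta" -ge "$sustained_ns" ]; then
    echo sustained
  else
    echo ok
  fi
}
```

With a 30-second interval and a 10% quota, the service can burn at most 3 seconds of CPU per window, so "pegged" starts at 2.7 seconds (2700000000 ns) and "sustained" at 1.5 seconds (1500000000 ns).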

It’s not perfect, but at least I got a reusable pattern/template out of this experiment, and I can adapt this to other services as needed.

Ovo

Yeah, I don’t know what the grasshoppers want with the egg either
Another great evening spent in the company of Cirque du Soleil