Field Notes From The AI Battlefield

Since today is a bank holiday for me, I decided to consolidate a few more of my notes into a post. What follows is a set of guiding “principles” that I’ve found useful over the past year or so and that I’ve codified into various bits of scaffolding I reuse across my projects.

As usual, I’ve tried to strip away all of the hype and fuzziness and stick to facts, but everyone has their own way of leveraging AI, so your mileage may vary.

However, unlike most of what I read online about AI these days, I am not pitching any specific tooling, although all of this is based on my experience.

Full Disclaimer: I and have a personal Codex account that OpenAI provided for my OSS work, as well as access to random Tier 2 providers that I use to test piclaw.

If you like this, you might be interested on , a minor rant about and my .

Do Not Blindly Trust AI-generated Code

A great example I usually point out is that if you ask an LLM to do extensive error handling on a piece of code, it will almost invariably (at least in ) generate empty catch(){} blocks and call that “error handling”.

Another is when I asked it to optimize a particular tree traversal function for an edge case and it just hard coded the result.

And this applies to nearly everything you ask any LLM to do–but code can be validated, and tested, and measured in various dimensions, and you can turn some of its foibles against it.

In the case of the first example above, a linter will catch that, and you can force the AI to turn those empty catches into something useful (like warning messages in logs).

The second one is nastier, but it too can be fixed through proper test fixtures (dynamic but non-repetitive).

Which is why I invariably wrap all my AI-driven projects into several layers of deterministic testing and automation.

Automate Everything Away from the Model

The ground rule I follow is that even SOTA models are inherently unreliable, so when I set up a project or after the first few days of goofing around with a prototype, I try to make sure everything runs on rails.

I typically start with putting together a Makefile because it works/is preinstalled everywhere, is extremely familiar to LLMs, and means I have to do zero thinking myself when running steps manually, but you can use whatever you want.

The important thing is that it must cover the entire development and release cycle, because your agent will inevitably start drifting off and forget how it should do things.

I set it up like this:

  • Makefile targets to do everything (that way there is no “secret sauce” only the model “knows” to do tests, a build, etc.)
    • linting/static analysis (go vet is great, but you should also prepare for typical LLM “lazy” idioms like empty catch blocks, which should be considered critical errors)
    • tests (unit/fuzzing/functional)
    • builds
    • packaging
    • upstream dependency updates (packages and vendored files)
  • One or more SKILL.md file(s) that explain how to use the Makefile and cover the dev/test/debug/release workflows. You should make sure those are referenced from AGENTS.md or use the .github/copilot conventions (insert your flavor of choice here).

The key thing is to always aim for reproducible steps. The model will always go off into the weeds seeking an adventure regardless of how many admonitions you put in AGENTS.md or equivalent, especially when debugging things, but the Makefile (or equivalent) should be your ground truth.

The SKILL.md files are… Well, of dubious value, really. I’ve found to have made them less effective since unlike gpt-5.3-codex newer models often don’t even read the files, but your mileage may vary.

Keep An Eye On Tests

In short, LLM-written tests are generally crap. Anthropic models, in particular, just plain cheat at writing them, so if you ask your LLM to write them, make sure you actually read them.

Unit tests written by LLMs very seldom do anything beyond the obvious, miss edge cases, etc. The only models that write halfway decent tests (as of mid-2026) are the Codex family of GPT models, and even vanilla 5.4/5.5 regressed on that from my standpoint, so my usual tactics are:

  • Build a set of prompts to have different models refactor tests without looking at the internals of your code (i.e., focus on contracts).
  • Treat tests as a black box that outputs a report, so that the session you are coding in does not see the tests and the session that runs and writes the tests does not see the code. You can call these different agents if you want–I call it separation of concerns.
  • Set up CI/CD flows that run all of the tests with zero agent intervention, but have CI/CD generate concise Markdown reports the agents can consume.

The last point is critical, so set it up as soon as you can–it frees up time on your machine and any decent agent can use gh (or equivalent) to fetch CI/CD artifacts, review the results and file issues for itself.

Use LLMs to Fast-Track User Stories

This is where SOTA models shine. Even Sonnet, bless its little stupid heart, can take a set of requirements and distill them into user stories and feature files much faster than formal committee-style BDD processes, and the quality and coverage (so far) seems to be better than humans’.

If you work with customers, this last bit is very important–humans will want to describe the user stories that matter to them in exquisitely irrelevant detail while completely skimping on the ones they don’t care about, whereas LLMs won’t care if they are describing boring bits or not, and they won’t quibble at the details–they will just do it.

The resulting user stories need to be reviewed, of course, but piping UX requirements through an LLM and Gherkin typically generates pretty decent scripted tests, especially if the LLM can look at your Preact/Vue/etc. code and build corresponding Playwright scripts.

This will save you weeks of work, and catch dozens of inevitable regressions as LLMs subtly break your front-end code en passant while implementing new features.

Ask me how I know.

Again, Never Let The LLM Run Tests

Mind that I never rely on the LLM to run Playwright for the actual tests directly - it will either cheat, be creative about how it inputs things, refresh the page to see if the DOM changes and break test state, etc. – it’s fine to use it to explore an app and draft the scripts, but when you run these things in CI/CD, you want them to be extremely deterministic.

And you want evidence of all functional tests, so I have a little toolkit to gather that evidence:

  • Playwright for web testing
  • tmux for TUI testing (rmux is also a thing now, but if you work in regulated industries the paperwork to get it baked into an image will likely outweigh the benefits)
  • A custom VNC harness for my retro emulators (using tesseract for OCR, which is surprisingly capable)
  • And, sometimes, a webcam or an USB video capture adapter (plus a sub-agent that only describes what it sees)

As a bonus, besides a Markdown report, I also generate a PDF report with screenshots and logs for the failing cases–and an override switch to screenshot all the tests for occasional audits.

Again, ask me why.

Do Not Let The Models Edit Freely

LLMs will always mangle long files, regardless of how big the model or context window is. Anthropic models (as of mid-2026) are particularly prone to that for some reason (as well as “drive by shootings” where they mangle tangentially related files).

You need to decrease your exposure to this kind of risk and do some proactive damage control by decreasing the impact of any such errors. It is not a matter of if, it is a matter of when, and it will nearly always manifest as weird regressions a few days down the line.

What I do:

  • If possible in your harness, disable full-file write tooling and force the model to use edit or diff for focused edits. The added friction will typically prevent it from mangling entire files.
  • Set strict caps on file sizes and (depending on the kind of package) guidelines for breaking up functionality.
  • Review changes to see if unexpected files were touched (I have been meaning to create a SKILL.md for doing this automatically, but eyeballing by listing uncommitted files it is just easier).

Sometimes I wish I could just make unrelated files read-only before letting the LLM loose on React/Preact code, so I am looking into LSPs and static analysis to see if I can do the coding equivalent of raycasting–projecting out which files would be related to a specific change.

Aggressively Refactor at Every Opportunity

Every few sessions. stop and refactor the code. Most technical debt from AI use comes from letting it literally piss all over your nice module structure.

In particular, I’ve found that LLMs like to define redundant types and duplicate code pretty much at random because they can’t see across your entire code base. If they’re operating in one part of the tree, they’ll be completely oblivious to the rest.

What I do is that once I have implemented one feature (or a sequence of features) and tests pass, I aggressively go in and review every single type, helper and filename.

Models can do baseline audits (the trope about OpenAI models fixing code Anthropic ones wrote is very much true in my experience), and you can trust the outlines of the audits, but with some caveats:

  • They will always cut short the depth to which they analyze code
  • They will often stop at module or dependency boundaries
  • They will only try to merge or remove duplicate code if it is blatantly obvious (and even then it is not a guarantee)

I do use models for audits, but only as a starting point. Then I go in and:

  • Point out where there was feature creep or duplication of code/responsibilities in the module structure
  • Enforce things like centralized logging
  • Manually flag duplicates and give instructions by adding TODO comments to the code

In (which I have sort of gravitated to recently due to the balance of great profiling and refactoring tools and less cognitive overhead than ), gopls can significantly help the model do most file splitting/refactoring automatically and without any chance for the model to mess things up, so every so often I fire up a dedicated session, hand it a prebaked set of guidelines and do a full-on refactoring pass.

Prune Abstractions

Models have a tendency to follow “best practices” to a point where they create untenable messes of nested abstractions, very much like the sort of people who write Python as if they were cosplaying at writing Java–classes, accessors and factories everywhere, etc. You know what I’m talking about.

This is something that initial SPECs and system prompts actually help with, until the context window is so full that those guidelines are “forgotten”.

Weed those out ruthlessly. By all means define reusable contracts and use strong typing ( is a godsend in that regard), but expect your linter and LSP to catch your LLM red-handed.

Learn To Walk Away

There are many ways to work with AI, and none of them work for everyone, but there are some basic tenets I follow:

  • Shorter Sessions = more attrition. One-shotting features will just create more pain and technical debt down the line, and they foster an illusion of progress, not stuff you can actually rely on.
  • Make sure you are willing to put in the design and spec effort. The more you think and plan yourself, the more grounding you can provide to an agent to keep it on track.
  • Leaving the agent to its own devices for an hour or so will give you time to ponder–yes, it might be risky token-wise if you haven’t specced out the work well enough, but that is part of the challenge here.

I think Ralph loops are profoundly stupid and wasteful, but am very much a fan of writing a SPEC, chunking it into a plan.md (or your harness’ equivalent) that includes clear directions for testing and then using things like /goal complete the plan.md file, because that provides the agent with a clear cut set of steps.

Goal seeking of various forms (, performance optimizations, etc.) can be extremely effective and reliable, but only if you’ve stacked up most of the previous tricks written above (and even then I’ve caught LLMs cheating at benchmarks in the most egregious way: “the simplest option is to not execute the query” is a real thing that actually happened).

Aim For Reproducible Everything

Again, do not trust any of the code the agent puts out. And even if it works, keep track of how it works–in a sentence, instrument the crap out of everything:

  • Enforce structured logging as soon as possible, and have automated checks to ensure that errors/exceptions/etc. are logged.
  • Maintain a set of benchmarking/regression tests that output actual metrics (if you don’t use OpenTelemetry, try to at least have a text file with key metrics)
  • Be very thorough about regression testing. Taking the time to rebuild and run last week’s version will often show that you’ve missed either testing for something or measuring something important.

Again, CI/CD is your friend here, and a lot of my time, even on personal projects, has been spent on building test and smoke harnesses of various kinds:

  • Mock up external APIs and write various failure modes into the mocks so that the LLM will have to deal with “errors” from the start.
  • When doing emulation/JIT work, create a test harness for each specific operation that you can gdb through (LLMs can actually do this pretty well), then a smoke harness that you can compare with QEMU, etc.
  • When doing microcontroller work, build and test subroutines separately in the host machine before assuming they will work in the microcontroller.
  • When doing inference optimizations (like in go-pherence), cross-check similar kernels across back-ends and architectures to ensure they all provide the same results

The list goes on, but the key thing is that everything should be automatable and outside the control of the LLM.

Is all the above hard work? Yes. But can you take most of it along with you when you start a new project? Also pretty much yes–and the icing on the cake is that once you’ve gotten the basics down, the principles are all transferrable across stacks/environments/runtimes and the thought process will keep your wits sharp.

Not to mention these things will save you a bunch of time.

Notes for May 24–31

Today I realised that I could just spend the day doing essentially nothing and that nobody would hold it against me (at least in Western nations), so… I might well do just that, with a few caveats:

Wi-Fi Fallout

Something very weird happened after I published – it made it to Hacker News (a day or so after I submitted it myself, because, as usual, most of my self-submitted links still appear to be shadow-banned despite 30K+ karma–and no, I don’t understand that either), and it was very popular among the usual band of armchair networking experts.

But then something really weird happened: I got an alert from Cloudflare that the lowercase-rewrite worker I’d deployed as a fallback for incorrect linking was exceeding the free-tier limit (100,000 runs, if I recall correctly), which made me curious enough to dig into the analytics:

Cloudflare page views control chart showing two out-of-control spikes reaching ~70,000 views/hour on 30 May
The control chart doesn't lie. Those orange dots are not normal.

I have CF’s anti-bot crawling settings active, I turned on CAPTCHAs again after the initial peak, and yet… 70,000 views in an hour, twice? Has to be crawlers. And how did CF let them through and count them?

So I went and plotted Clarity’s chart of “human” visitors (always an undercount, since it only captures people without JS or ad-blocking, but useful as a sanity check):

Microsoft Clarity unique visitors chart showing the genuine HN-driven spike to ~8,000 unique visitors on 29 May, with traffic returning to normal shortly after
The real HN spike was Thursday. Everything after is noise.

Definitely bots after the initial HN flood. I have to wonder why, why now, and whether Cloudflare’s free tier is still even marginally effective at blocking them.

go-pherence

The most interesting work this week was grafting speaker diarization onto go-pherence. Whisper tells you what was said; knowing who said it is a separate problem, and the standard answer is SpeechBrain plus a Python subprocess plus a fairly heavy PyTorch dependency. I did not want any of that. Instead I ported ECAPA-TDNN – the speaker embedding model SpeechBrain uses – to Go, and it all now mostly works with zero Python, even if it still needs a lot of tweaking.

There’s a speakercheck validation harness that runs spot-checks against windowed audio segments, scores against expected speaker labels, and outputs JSON reports, and a diarize-vtt command that accepts an optional ECAPA model and emits speaker-tagged VTT output. I expect to drop this onto one of my current hardware test subjects soon.

In Other News

I’ve been tinkering with more new hardware, but some things just take time and I’m still putting together my notes on those.

On the other hand, I am still very much impressed with the running , and I’m enjoying building little plugins for it as I go:

Niri display layout plugin showing the Kuycon P20 external display and built-in DSI screen arranged in a stacked layout
A Niri plugin to manage display layout, because of course I wrote one.

I will eventually publish these somewhere…

Mildly Parboiled

Allergy season is finally fading (at least for me), but today was the first time I had to turn on the AC in the office, and it was great to realize that and almost four years of potential HomeKit foibles, my is still working perfectly.

Those minor joys aside, I’ve been actively trying to get out of the house to do some exercise at least one hour a day and it is clearly not going to happen at lunchtime anymore–well, not every day, at least, so I’m starting to get cabin fever.

All of this to say that I’m feeling as if I am starting down the slippery slope to both physical and mental burnout again, and this time I’m backing off as early as possible.

For starters, I am currently profoundly annoyed at my current working arrangements, since my days of wall-to-wall meetings with completely random 15 minute breaks are both utterly destroying my health and eroding my ability to focus. Sometimes, and despite being remote for many, many years, I would really prefer to be back working at an office, if only because I miss walking about and using stairs to go and talk to people.

Turns out my closest project team are now in Madrid (plus Belgium, Sweden, Canada, etc.), so that isn’t going to happen. And, truth be told, online meetings are now so stupefyingly more productive (as meetings go) that actual work is still best done remote–as long as you can cut through the tremendous amount of AI-augmented cruft that a meeting now entails.

I, as usual, have been pragmatic about it and crafted my own agent to summarize meetings the way I want them, and to craft terse, minimalist works of corporate obeisance that avoid the walls of text I get by default and focus on the stuff I need to do instead of spouting corporate cheerleading (it has become ).

Anyway, my priority is now, again, my well-being. But I feel like my entire lifestyle is in dire need of an intervention, and the obvious life hacks most people suggest like exercising in the early morning (when I am trying to do my daily reading and research) or at the end of the day (when I am just bog tired) just don’t work for me, so the upshot of all this is that I am currently trying to carve out slots throughout the week to just get out of the house for 30 minutes.

Which is completely stupid.

This has to change (somehow). In the meantime, part of that carve-out is also going to be about mental health–I’m phasing out Twitter/X again, as well as a bunch of other “social” distractions and hypefests like HN.

Indoor Wi-Fi Roaming with OpenWRT

A few months after writing up the units and moving the house over to , I ended up revisiting the one bit I had deliberately waved away as “good enough”: roaming.

A real house, with a mix of phones, tablets, laptops and a few stubborn IoT things that insist on staying in 2016, has… issues. But they’re not always obvious, and given we’d both upgraded the 5GHz band and changed the locations of the access points, it took a while to figure out where the new rough spots were.

If you’re just tuning in, I have a hard split between a legacy 2.4GHz network and the modern 5GHz one. I already had client-managed roaming and basic handoff guidance, but now I added usteer, 802.11k neighbour reports (because hostapd was not cooperating), and things are now pretty much perfect.

The long version is below, with anonymised data and enough detail for future me to remember why I did this.

Why I Did Not Merge The SSIDs

The obvious advice for roaming is “use one SSID everywhere”, and that is often correct if you’re running Wi-Fi in an office, a public venue, or generally somewhere where you don’t have (or care about) legacy devices. It is also not what I did, because the 2.4GHz side needs to remain friendly to older and slightly terrible IoT devices, which means WPA2 compatibility and a conservative setup.

The 5GHz side is where the more modern clients live, and despite losing 5GHz access for a couple of things, I was happy to move it to WPA3. So this is what things look like from a high level:

  • 2.4GHz: legacy-compatible WPA2-ish network for IoT and old clients.
  • 5GHz: modern client network with WPA3/SAE
  • 2.5GbE backhaul across four “dumb” APs
  • Zero cloud management or vendor-specific software. Nada. Zilch. Non-negotiable.

User Feedback

However, I got a few complaints that when moving about the house, iPhones, iPads and MacBooks would not switch to another AP. Since our flat is wrapped around a couple of elevator shafts and there are a few spots (like the kitchen) where tiling, pipes and tiny RF nuisances like fridges were prevalent, that sort of tended to happen a lot–and Apple devices are notorious for being opinionated about that base station they want to stick to.

The baseline seemed fine. All four APs had 802.11r/k/v-related options enabled. Fast Transition was also demonstrably happening–the AP logs had auth_alg=ft entries that showed fast transition was happening, I had installed wpad-mbedtls for “mesh” support, but roaming clearly needed to be improved.

And my setup meant it had to be improved within each band/SSID, not across bands. Cross-band roaming is the client’s job, and many clients are not especially good at it.

Adding usteer

But two things stood out:

  • There was no steering daemon installed. Clients were making all roaming decisions on their own, which usually means they hang on to a far-away AP until their signal is frankly embarrassing.
  • rrm_nr_list was empty on every radio. In other words, even though 802.11k was enabled, hostapd was not exposing neighbour reports to clients, so… no real way to steer anything.

So I installed usteer and its LuCI companion package on all four APs, enabled it, and left the initial configuration at defaults:

opkg update
opkg install usteer luci-app-usteer
/etc/init.d/usteer enable
/etc/init.d/usteer restart

The default configuration is minimal: LAN gossip, syslog enabled, IPv6 disabled for the daemon (because, for reasons, I don’t trust our current ISP router to do anything reliably except act as an ONT), and a moderate debug level. That was enough for all APs to see one another and exchange client data, which is exactly what I wanted.

However, the 802.11k neighbour list wasn’t being populated. After poking through the OpenWRT forums, I realized the missing piece was static-neighbor-reports, which is one of those tiny OpenWRT packages that does exactly what it says and nothing more.

Each AP can generate its own 802.11k neighbour report element via:

ubus call hostapd.<iface> rrm_nr_get_own

But clients only get useful neighbour lists if each AP is told about the other APs. So I generated per-band lists and installed them per AP:

opkg install static-neighbor-reports
/etc/init.d/static-neighbor-reports enable
/etc/init.d/static-neighbor-reports restart

The important detail is that the reports are band-specific: 2.4GHz radios only advertise 2.4GHz peers, and 5GHz radios only advertise 5GHz peers. No cross-band mixing, because the two networks intentionally have different SSIDs and security settings.

After that, every AP had three neighbours per radio, usteer had AP/client state, and hostapd has explicit 802.11k neighbour data to hand to clients that ask for it.

What Changed

The first comparison is a little boring, but useful. Here is the 2.4GHz SNR before and after the change (this, like the other charts here, was generated from data):

2.4GHz SNR over the week
2.4GHz SNR over the week

2.4GHz SNR: pre-rollout vs latest
2.4GHz SNR: pre-rollout vs latest

There is no miracle here. 2.4GHz remains 2.4GHz–crowded, noisy, full of junk devices and crowded by all my neighbors. Two of the APs improved or stayed roughly level, two got worse in the sampling window, and I have zero expectations about ever clearing this kind of congestion without moving to the countryside.

The 5GHz side is more encouraging, even if you do need to know when we were near which AP at what time when you look at active bitrates:

5GHz bitrate over the week
5GHz bitrate over the week

The interesting part, though, is that at least between two APs, there was a noticeable shift in usage–which seems to reflect where clients should be registered in practice:

5GHz bitrate: pre-rollout vs latest
5GHz bitrate: pre-rollout vs latest

But the best sanity check is the sticky-client view, because that is what started this in the first place:

Sticky-client check
Sticky-client check

The number of merely weak clients did not disappear–one extra client fell below -75dBm in the later sample–but the very weak clients went away. That is the bit I care about: the previous -90dBm-ish sticky associations were gone in the later check, which seems to indicate clients are not getting hung up on their previous AP and are indeed roaming.

Caveats

A single sample is not science, and Wi-Fi is a swamp of client decisions, radio noise and domestic entropy. I also saw one new Fast Transition log entry after the rollout:

FT: Missing required pairwise in pull response from a peer AP

That happened once in the latest check. It is not enough to call the setup broken, but it is worth watching–especially because SAE and FT have enough moving parts that I would rather trust logs than assumptions.

Going Forward

I will be keeping an eye on this over the next few weeks… somehow. I got an LLM to do the Graphite queries and chart scripting for me, and ain’t nobody got time to build dashboards only I would look at, but the metrics aren’t going to go away and the stable config lives in my local instance now, so there’s really no excuse not to do a spot check in a few months.

But I really like my Cudy APs. No cloud controller, no meshing, no mobile app and no secret sauce. Just OpenWRT, collectd/Graphite, and the odd ssh session to check configs.

That is still the main thing I like about this setup: when it gets weird, it gets weird in ways I can inspect.

Notes for May 17-24

My sinuses are still giving me grief, but this week was much more successful at pretending to be enjoyable, at least. For starters, we watched Project Hail Mary, and it was every bit as good as I would expect it to be, which is very rare in movies these days.

Meetings Suck More In Summer

Insomnia seems to be fading, but as the weather improves, the time windows for leaving the house and enjoying exercise before the heat kicks in have become narrower and are in full-on collision with typical meeting schedules, and that has become a major drag on my optimism since I have to wonder why, as an industry, we haven’t really solved meetings.

The technology is fine–it’s a culture problem. Stand-ups, project syncs, account planning, everything requires far too many unproductive meetings that just accrete overhead because a) people don’t really prepare for them and b) people don’t have time to prepare for the meetings that matter because of all the other meetings.

And, of course, everyone thinks their meetings are the ones that matter.

Couch Time

Either way, I’ve finally started having more enjoyment off-work. A good deal of it stems from the fact that I can now use piclaw as an interactive notebook across all of my projects and just scribble on a tablet screen (including annotating images and text to feed back into the agent).

Using piclaw on the couch
Using piclaw on the couch

I have already gotten most of the annotation experience to work on my as well (and with a local agent to boot), so I’m starting to wonder when OpenAI or Anthropic will pick up on this (neither of them has a decent tablet UX, and they clearly don’t seem to care about that).

In the meantime, I’m looking for an Android tablet that would be at least as good as a Samsung one, but without any of their UI junk–the TCL NEXPaper ones seem very interesting, but it’s apparently impossible to reach any of their marketing people…

Joking Around

One of the things I’ve been playing with a la longue is Joker, my souped-up version of a runtime for . Well, go-joker now has a proper notebook interface–cells with run states, rich outputs, inline SVG rendering, WASM-backed bitmap demos, and a parallelised Mandelbrot cell that renders fast enough to feel interactive.

This is another step towards the -for-code thing I a few weeks ago, except it’s running in a Clojure interpreter that I developed in another notebook-like interface:

go-joker notebook with Mandelbrot rendering
go-joker notebook with Mandelbrot rendering

The irony of constantly working on notebooks within notebooks is not lost on me, but it does look very good right now.

Inference Hardware

I just got a SpacemiT K3 board to test, which is both my and a refreshing take on the ecosystem, because a) it was zero hassle to set up b) came with 32GB of RAM and c) has a promising (if weird) NPU arrangement that I fully intend to exploit, even if (as usual) source code and documentation is a little sparse.

On the GPU side, I’ve been trying to shoehorn a Qwen model with MTP and KV cache optimizations into my 12GB 3060 in parallel (without any real usable solution yet), so alternative hardware is even if (at least right now) it poses a completely different set of problems to solve.

Emulation Progress

My long-delayed build draws near–after pondering my options I ordered the mini-macintosh PCBs and parts (5 of them, even though I only have 2 Maclocks) and have been poking at the Mac JITed emulators a bit, but I got sidetracked into getting the MMU to work in previous-jit and… I haven’t really paid much attention to any of the other bits.

I did try to get ios-linuxkit to run faster through a variety of strategies, but the truth is that performance work on interpreters is humbling–most ideas that sound good measure worse, and none of it panned out except some iOS fixes–terminal input latency, soft keyboard lag, DNS fallback, and iPhone canvas scaling.

The gap between “works on my iPad Pro” and “works on an iPhone” is always wider than expected, and in this case I am actually considering removing ghostty-web from the iPhone version given the added overhead.

Logitech Combo Touch: Four Years Later

I think it’s time for an update on my iPad Pro M1 and, most importantly, the Logitech Combo Touch I got for it. Think of it as a long term review of sorts.

In short, I bought another Combo Touch–the old one was falling apart.

Disclaimer: I paid for this with my own money, as I did the first one, but Logitech did offer me a discount. As usual, this article follows my .

The Good Bits

I had originally chosen the “sand” color, which was a sort of calculated bet–I wanted something different from the traditional black, and mentally prepared myself for it to accrue stains or dirt over time.

Guess what, it really didn’t. I guess it will look slightly darker and dingy if put alongside a new one, but I have zero complaints about the fabric-like parts and can only find a very small (sub 5-mm) stain if I look really hard. Maybe I was lucky, but those bits still look great.

I have also had zero issues with the keyboard. Yes, it has short travel, but it is effectively full size, the international English layout is excellent for coding, and it has been extremely reliable over the past four years. The only key with a (cosmetic) issue is my S key, which was slightly marred by a stray solder blob.

And the trackpad is simply sublime–it is the best non-Apple trackpad I have across all my hardware, not to mention it is luxuriously large for a tablet trackpad.

The Bits That Fell Apart (Literally)

Over the years, the speaker slots (which are effectively thin strips of rubbery plastic) started deforming. First subtly, then to the point where they are now either broken or completely deformed:

Deformed speaker slots on the old Combo Touch
Deformed speaker slots on the old Combo Touch

This does coincide with how I hold it for writing in both landscape and portrait mode (the inner cover edge is also flaking off on the bottom left side in portrait orientation), but… I’m at a bit of a loss as to why this wasn’t factored into the design somehow.

Buying Another One

Unfortunately, Logitech does not offer the possibility to buy only the cover, otherwise I would have kept my current keyboard.

And there were no refurbished ones shippable to Europe either (for whatever reason), so I ended up reaching out to support and then buying an entirely new “Oxford grey” one (which was effectively the only color available).

Oxford grey Combo Touch next to the old sand one
Oxford grey Combo Touch next to the old sand one

The new one is physically identical as far as I can tell–same connector, same kickstand, same key layout, same excellent trackpad.

Which means everything I still applies, and I won’t repeat it here. What I’m more interested in this time is whether this one will last longer without deformation.

I have my doubts, of course.

TIL: Noctalia Shell Lock on Suspend

This is a little bit of follow-up to my – I keep using it routinely (especially when we travel for leisure) and love the little thing to bits, but I’ve been wanting to run it mostly on power saving mode to reap the most benefit out of the hardware (and battery, of course), so I started looking at desktop environment alternatives.

Yes, I could already get a full afternoon (and then some) out of it, but Apple Silicon has spoiled me as far as battery life expectations go, and has a little bit too much baggage for that kind of extended use.

Since I spend 90% of my time on it writing or coding and still have a penchant for keyboard-driven desktops, I initially switched to Fedora Sway Atomic (gotta love being able to swap environments with a single command…), but later installed Niri and Noctalia Shell because I really like both the idea of a scrolling window environment and the sheer polish of the whole thing–even if there are some rough edges here and there.

I am very happy with it, and writing plugins for it is trivial:

I hacked together a Bing Wallpaper plugin in 30m
I hacked together a Bing Wallpaper plugin in 30m

The one thing that annoyed me to no end, though, was locking on suspend, which Noctalia Shell should do but apparently doesn’t in , so I had to resort to two hacks:

Locking on Lid Close

The first was adding a switch-events block to the Niri config to trigger the lock screen when the lid closes:

switch-events {
    lid-close {
        spawn "qs" "-c" "noctalia-shell" "ipc" "call" "lockScreen" "lock"
    }
}

Idle Lock via swayidle

The second was setting up a swayidle systemd user service to lock after 5 minutes of inactivity and suspend after 10:

[Unit]
Description=SwayIdle Service
After=graphical-session.target

[Service]
Type=simple
ExecStart=/usr/sbin/swayidle -w \
    timeout 300 'qs -c noctalia-shell ipc call lockScreen lock' \
    timeout 600 'qs -c noctalia-shell ipc call sessionMenu lockAndSuspend'
Restart=on-failure
TimeoutSec=30

[Install]
WantedBy=graphical-session.target

This last one feels extremely gauche and I hope to find a better way, but I guess this comes with the territory. I don’t really care about having a trendy Wayland desktop (I just want a dead simple one with a bit of polish), but I hope this kind of hacks won’t be necessary for much longer.

Oh, and of course I set gsettings set org.gnome.desktop.wm.preferences button-layout 'close,minimize,maximize:appmenu' to match macOS decorations.

Apple Papercuts

I know this blog has strayed a fair distance from its Mac-centric origins, but I’ve been keeping a mental list of all the things that are broken, missing or inexplicably neglected in ’s software, and it’s gotten long enough that writing it down feels like a public service1.

This isn’t about or grand design failures–those are well documented . This is about the small stuff. The papercuts that, individually, you learn to live with, and collectively make you wonder whether anyone at Apple actually uses their software.

Despite the somewhat surprising length of this post after stitching together all the notes, I’m actually focusing on the things I hit every week (not trying to put together an exhaustive catalogue), and others will have their own lists–and that’s part of the problem.

Mail

is the first app open every day and the one I find hardest to defend, and I’ve been defending it for twenty years (longer if you remember the original NeXT mail client).

The broader story is one of abandonment. used to be extensible–there was a plugin API that third parties used to build genuinely useful tools (GPGMail, SpamSieve, Act-On, all manner of filing and productivity helpers), and I used it to, among other things, have HJKL keybindings.

Apple deprecated that API, replaced it with a (much more restrictive) MailKit surface in 2021, and proceeded to lock MailKit down so hard that barely anyone shipped an extension.

And then they quietly stopped mentioning it. The result is that Mail is now less extensible than it was in 2010.

In particular, in this age of desktop AI agents, I come time and again across the fact that support in Mail has been left to rot. I wrote about via AppleScript years ago, and even then it was a workaround for missing functionality.

Today the dictionary is unchanged, the bugs are unchanged, and the “Apply Rules” menu option–which used to let you re-run rules on selected messages–no longer works consistently on multiple selections, if it works at all.

And searching for messages is such a mockery of a user experience that I’m not even sure how to describe it–suffice it to say that it never searches solely inside the folder I’m in and that it often fails to find messages that I know are there, even with the most basic criteria.

Mail on iOS Is Just Consistently Worse

And then there are the basics that have simply never arrived on iOS:

  • There is no way to filter messages on an . Not “limited filtering”–none. You cannot create a rule, you cannot sort by sender, you cannot batch-select by criteria.
  • Smart folders don’t exist on any version (no, the stupid Categories thing doesn’t count). They’ve been on the Mac since… 2004?
  • And, of course, there is no way to have Mail rules sync from the Mac to iOS. For a company that talks endlessly about ecosystem coherence, this is bizarre.
  • Download progress is opaque. When Mail is pulling thousands of messages from an IMAP server, the feedback is either nothing or a tiny spinner.
  • in Mail amounts to a summary button that occasionally produces useful one-liners.

There’s no smart filing, no suggested rules, no priority inbox–nothing that would actually reduce the cognitive load of managing email. had most of this a decade ago.

Time Machine

I wrote , and if I had the patience, I could probably write twice as much.

But I’ll just add that the performance is abysmal if you have thousands (or millions) of small files, and that things like asimov (or manually setting the right extended attributes manually for excluding development folders, something I routinely forget to do) shouldn’t exist, because it should work properly in the first place:

  • It should have much more transparent progress indications
  • It should never fail silently
  • It should recover gracefully from failures
  • It really should suggest automatic exclusions and have a proper UI that is not “Add this huge top-level folder” for exclusions

Again, this isn’t rocket science. I installed Borg Backup the other day on some of my Linux VMs, and it is so good that it defies explanation how Apple still hasn’t gotten this right.

Craig Hockenberry recently wrote up an experience that captures the problem perfectly: his iPhone’s Spotlight index corrupted, search stopped working across App Library, , Notes, Messages and Settings, and after trying every remedy he could find online–forced restarts, language changes, toggling Siri, developer mode reindexing–the only “fix” was a full device backup and restore.

Which took hours, broke Apple Pay, reset FaceID for two dozen apps, wiped TestFlight builds, and generally made his life miserable for days.

On the Mac, rebuilding the Spotlight index is a one-line terminal command that somehow I keep not memorizing despite needing it once a month. On iOS, that affordance doesn’t exist.

“It just works, my ass” was Craig’s summary, and it’s hard to improve on it.

Search on is slow, inconsistent, and returns incomplete results across every app that relies on it. On it’s marginally better but still loses to most third-party tools, solely because Spotlight completely made a mess of the user experience and Finder, well, can’t even find itself sometimes.

Calendar

This, again, is something that I come across every single time I need to manage personal time, and that is essential if we want any form of serious AI assistants to work (or integrate with Apple stuff).

But I’ll cut right to the point: the app has barely changed since iOS 7, and the parts that have changed are worse.

  • Event metadata parsing is broken. If someone sends you a calendar invite with a video call link, Calendar will sometimes pick it up, sometimes not, and sometimes create a phantom “location” that’s actually a URL fragment.
  • There’s no way to see a compact list of upcoming events without also seeing the full calendar grid.
  • Calendar sharing within a family is functional but graceless.
  • support is just… not there. It sort of works, but ever since Apple decided to move both Calendar and Reminders to CloudKit (or whatever), all you will get (for Reminders, at least) are the leftover entries that they left in the store before the migration.

Oh, and need I mention that Siri is terrible at calendar operations, including the extremely basic “at what time did my wife book dinner”?

Automation

I know. Most of the parts about some apps above are also about automation, and I did post about this in my , but it deserves a dedicated entry because in this age of Codex and Claude being able to control your desktop, it rankles.

  • actions break between OS versions. Not occasionally–routinely.
  • is unmaintained, and despite what I wrote earlier, is now presumed dead.
  • There is no cross-platform automation story whatsoever. No, Shortcuts is not useful there, save for the laudable exception of being able to use my iPhone to automate switching watchfaces (which is something very few people are likely to use).
  • Accessibility sort of works, but it is so clunky in practice that some of the workarounds I’ve seen implemented in Claude and Codex border on the hilarious.

The bottom line, for me, is that Siri Shortcuts integration is shallow compared to what offers through intents, or what Windows offers through COM automation (or even Win32, which surprisingly still works so well that it took me 15 minutes to do an agent tool).

Virtualisation

In keeping with Apple’s inability to make the iPad truly useful, has no hypervisor support today–it was removed in iOS 16.4, and nothing has effectively replaced it since. The result is that you can’t run a Linux VM on an iPad, and you can’t run Docker containers on it either, which means that the entire ecosystem of local LLMs, coding agents, development environments and monitoring tools that I rely on for work and play is completely inaccessible on the iPad.

has had Hypervisor.framework since… Yosemite, and Apple Silicon Macs run VMs beautifully–but on and , the entire concept doesn’t exist, and we are forced to run half-assed emulators like (which I’ve been banging on for a month as a way to prove my point).

This matters to me because a huge amount of the software I use daily–local LLMs, coding agents, development environments, monitoring tools–runs in containers or lightweight VMs. I can do all of this on an EUR 50 ARM board running . I cannot do any of it on an EUR 1,400 iPad Pro with an M4 chip, without jumping through hoops to get AltStore to run on it so that can pretend it has proper virtualization.

I know that Apple doesn’t care about this now that they feel buoyed by the ’s runaway success, but I am actually looking forward to trying out a solely because Google has reasonably decent support for running Linux userlands on ChromeOS and Android, and I want to see how that compares to the iPad’s non-existent support.

Home Automation

I could possibly write a book about this by now, considering that I’ve been at this . could be so much better, but it is also a part of the Apple experience where the gap between promise and reality is most painful.

Yes, is coming, etc., etc., but a new protocol will never solve any of the shortcomings of the Home app:

  • Scene chaining doesn’t exist.
  • If-this-then-that logic is barely functional.
  • Presence detection is flaky and not granular enough for room-level logic.
  • There is no scripting layer. can trigger HomeKit actions, but HomeKit automations can’t call Shortcuts.
  • Adaptive lighting is half-baked.
  • Multi-home support is a mess.

I’ve papered over most of that with and Homebridge, and of course Home Assistant can do all of the above, but, again, my main point is that it shouldn’t need to exist for people who’ve bought into the Apple ecosystem.

At this point, Apple should just buy Homey and can their entire HomeKit stack.

Apple Watch

The Watch deserves its own entry because it’s the device where Apple’s failure to prioritise timekeeping is most absurd, and with the rebirth of , I was reminded of how awesome smartwatch UX can be and how Apple never even got close.

In particular, the “Smart” Stack (the thing you get when you swipe up from the bottom) is never aligned with what I actually want to see, or what is up on my calendar.

The ’s timeline view remains the high-water mark for watch UX–one button tap, chronological day view, no widget carousel.

Apple’s Calendar app on the Watch tries to replicate the iPhone calendar grid on a 45mm screen, which is about as useful as reading a newspaper through a keyhole.

A watch should be the single best device for time-aware context. Instead of building a timeline, Apple built a widget carousel.

iCloud and CloudKit

I once spent a week building a client to talk to iCloud Reminders and Calendar, and the experience was a masterclass in Apple’s backwards-compatibility approach: it works, except when it doesn’t.

  • Newer Reminders lists silently migrate to CloudKit and disappear from CalDAV entirely.
  • Apple Notes is completely gone from IMAP–all content is now behind CloudKit’s protobuf CRDT format, which Eric Migicovsky recently reverse engineered
  • Calendar event recurrence expansion doesn’t work properly through CalDAV.
  • App-specific passwords are required if you want to have third party clients sort of work, but limitations are documented nowhere.

The pattern mirrors a lot of my gripes about the original iCloud services: Apple builds new infrastructure, migrates data silently, leaves old APIs running but progressively useless, and provides no supported path for third-party access.

Terminal

Yes, it got updated recently. No, it is neither good nor fast nor consistent when you use daily, and that is why I use . Like a lot of other core Mac tools, I have feelings about it, some of which I cannot express politely.

Developer Experience

I write because I have to, not because Apple makes it easy.

The language itself has been through enough breaking revisions that code from three years ago often won’t compile without changes. is worse–views that worked on iOS 17 already behaved differently on 18 and now seem broken in 26, and the abstraction leaks the moment you need anything beyond a list and a navigation stack.

The result is a UI framework that feels modern in tutorials and feels like debugging a black box in production. I’ve lost count of how many times I’ve had to drop to UIKit to work around a SwiftUI layout bug that, once I start searching for it, I realize has been reported for years and yet nobody at Apple acknowledges.

And then there’s the $99/year developer fee, which Apple charges you for the privilege of running your own code on your own hardware. Not to publish on the App Store–just to run an app on a device you already paid for. The certificate expires annually, and if you don’t renew, your sideloaded apps stop launching. In 2026, on hardware I own, I need a subscription to run my own software.

The App Store itself is a whole separate set of papercuts–review delays, opaque rejections, the 30% cut, the inability to distribute updates outside the store–but those are well-documented grievances.

The one that gets me is simpler: the entire developer toolchain assumes you are building a product for sale, not a tool for yourself. doesn’t have a “just let me run this on my phone” mode that doesn’t involve provisioning profiles, entitlements, and a certificate chain.

Until I started using , every personal project started with ten minutes of ceremony. Now I never even open .

Phone Size

I still have an in a drawer, and every time I pick it up I’m reminded of what a phone that fits in your hand actually feels like. It’s delightful to hold–thin, light, one-handable without gymnastics, and the screen is perfectly usable for everything I actually do on a phone.

Every iPhone since has been bigger, heavier, and harder to use one-handed, and the Max/Plus variants are actively hostile to anyone with normal-sized hands or normal-sized pockets. Apple keeps making the screens taller and the bezels thinner, but the fundamental ergonomic regression–that phones stopped being things you hold comfortably and became things you grip–has never been acknowledged, let alone reversed.

The iPhone SE was the last concession to people who wanted a small phone, and Apple killed it. The Mini lasted two generations before being quietly shelved. The message is clear: you will hold the slab and you will like it.

The Pattern

Every one of these is fixable. Most have been fixable for years. The pattern isn’t technical inability–it’s neglect.

Apple has the engineers, the money, and the platform control. They’ve chosen not to, repeatedly, and I suspect writing about it won’t make any difference, but as someone who has been using Macs since the System 6 days and writing about OSX here since the very beginning, I like to keep a scorecard.

And right now, it’s neither looking good nor reassuringly future-proof, unless, of course, you happen to love Liquid Glass.


  1. And, as it happens, two weeks of insomnia and allergies provided both the time and the inclination to write it all down… ↩︎

Notes for May 10-17

The weather has gone a tad cloudy again, which provided me some relief from my allergies–but not enough for proper overnight rest, so yet again I arrived at Friday afternoon totally exhausted.

Read More...

Announcing ios-linuxkit: Linux on iPad, the Hard Way

I’m done waiting for Apple to fix things. And one of the things I think should exist is a decent way to run Linux binaries on my iPad.

Read More...

Unexpected Synology Woes

Last weekend my decided, for some unfathomable reason, to stop working after I took it out of the closet, dusted it and put it back, and I have feelings about it.

Read More...

The Siri For Families Apple Will Never Build

The got me thinking about the one thing I keep wishing would build and almost certainly never will: a family-scoped AI assistant that actually works across all our devices.

Read More...

I Think I Figured Out What an AI IDE Looks Like

I’ve been mulling the UX arc I’ve been going through over the past couple of years, and I think it was mostly the same for everybody:

Read More...

Notes for May 3-10

This was a weird week, both because I keep waking up at 5AM with my sinuses clogged, and because I feel like I’m losing momentum. Feeling almost permanently cotton-headed, sleepy due to sheer exhaustion or because of antihistamines certainly has something to do with it, but .

Read More...

The Local AI Moat

Regular readers will know that I’ve spent most of the past two years shoehorning LLMs into single-board computers, partly as a learning exercise and partly because there are lots of local/”edge” applications where semantic reasoning (no matter how limited) and “interpretation” of sensor data are actually useful.

Read More...

Notes on GPT 5.x Model Regressions

I’ve been getting annoyed at constant code regressions in piclaw for the past few weeks. Something was off–even after bumping the test suite to the point where it catches most mechanical errors, gpt-5.5 kept making unrelated edits to code that should have been left alone, and I was getting really annoyed at babysitting it.

Read More...

Notes for April 27 – May 3

This was an absurdly productive week, at least on a personal level. I’m not sure whether to be pleased or worried about the number of projects that moved forward simultaneously, but here we are.

Read More...

Lessons on Building MCP Servers

I’ve been building servers for a while now–I wrote about last year, started out by creating umcp, and I’ve recently opened up an Office server that’s been battered by enough models against enough real documents that the patterns have settled.

Read More...

App Notes: Web App Viewer

I got annoyed enough with Safari Web Apps to write my own replacement.

Read More...

Notes for April 20-26

Amidst the chaos brought upon my usual seasonal allergies, work turned out to be calmer than usual–the usual industry churn and constant rumors of layoffs have made “calmer” a relative term, though–so most of my evenings went to projects.

Read More...

Notes for April 13-19

This was a pretty decent week despite my allergies having kicked in to a point where I have constant headaches, but at least I had quite a bit of fun with my projects.

Read More...

Archives3D Site Map