I’ve been building MCP servers for a while now–I wrote about the general approach last year, started out by creating umcp, and I’ve recently opened up an Office server that’s been battered by enough models against enough real documents that the patterns have settled.
I’m still not a fan of MCP, but what follows is what I’ve learned about making tool chains actually work, condensed from swearing at logs rather than reading papers.
Disclaimer: This is a condensed version of CHAINING.md, which was itself stapled together from a bunch of notes in my Obsidian vault. The full version has more code examples and a techniques inventory table that Opus just _had_ to add, and I’ve since beaten that out of it and restored most of the original text (minus typos).
The short version: the MCP servers I design do most of the work, while the model walks breadcrumbs.
They look at the conversation, scan the tool list, and grab whatever looks most probable. That’s it. There is no hidden planner. If you want chains that finish somewhere sensible, the server has to make the next call blindingly obvious at every step.
After a year or so, I have pared down my approach into these three things, roughly in order of how much pain they save you:
A small named core verb set covering most intents
Output that suggests the next call
An addressing scheme that survives between calls–anchors, IDs, paths, anything but line numbers.
The Office server exposes over 100 tools. Its get_instructions() funnels models toward eight:
…start with office_help, then prefer office_read, office_inspect, office_patch, office_table, office_template, office_audit, and word_insert_at_anchor. Treat specialised tools as fallback, diagnostic, legacy-compatibility, or expert tools when the core flow is insufficient.
That single sentence does an outsized amount of work–it tells the model there is a recommended path, that the path is verb-shaped (help -> read -> inspect -> patch -> audit), and that everything else is opt-in.
Without it, models cheerfully reach for word_parse_sow_template when office_read would do, and you end up with five-call detours for one-call jobs.
So I quickly realized that I needed to be ruthless about which tools to surface and when. The specialised ones still ship–hidden under a “for experts” framing, and a handful of legacy ones filtered out of tools/list entirely.
I also make liberal use of activation sets–the surface the model sees is small; the surface it can reach is large.
Again, models chain whatever is most likely (or rhymes), and the most effective tactic, for me, has been taking advantage of that.
All Word tools are word_*, all Excel excel_*, all unified office_*. A model that just called office_inspect will reach for office_patch next, not word_patch_with_track_changes, because the prefix matches.
This particular server also makes liberal use of annotations and a little intent-inferrer hack that reads those prefixes to assign readOnlyHint/destructiveHint automatically, so naming discipline turns into safety metadata for free.
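The inferrer itself is tiny–something like this sketch (the verb sets are invented for illustration; the hint names are MCP’s tool annotations):

```python
# Derive MCP safety annotations from tool names alone (verb sets invented).
READ_VERBS = ("read", "inspect", "audit", "help")
DESTRUCTIVE_VERBS = ("patch", "delete", "replace")

def infer_annotations(tool_name: str) -> dict:
    # e.g. word_insert_at_anchor -> surface "word", verb "insert"
    _, rest = tool_name.split("_", 1)
    verb = rest.split("_", 1)[0]
    return {
        "readOnlyHint": verb in READ_VERBS,
        "destructiveHint": verb in DESTRUCTIVE_VERBS,
    }
```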
The prefix is the plan. The verb is the step. If you take one thing from this entire post, I’d suggest this notion…
This was the single change that made things behave on smaller models. The big ones will plan a chain from a tool list and a goal; the wee ones won’t–they grab the first plausible tool and stop.
The fix is stupid simple: every response ends with a breadcrumb dictionary of hints to follow. At minimum next_tools: [...], plus usage: "<exact call>" whenever the current tool produced a value the next one needs.
A model that can’t assemble arguments from a schema can copy the usage string verbatim. In fact, it will copy it–the string is the most likely continuation as the model fills in tokens–and so those usage hints funnel the path it takes.
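To make that concrete, here’s a minimal sketch of what such a response could look like–the office_* names and the next_tools/usage fields are from the server described above, but the payload shape and the find_anchors() helper are invented for illustration:

```python
# Sketch of a breadcrumbed tool response (payload shape illustrative).
def find_anchors(path: str) -> list[str]:
    return ["Introduction", "Scope"]  # stand-in for real document scanning

def office_inspect(path: str) -> dict:
    anchors = find_anchors(path)
    return {
        "status": "ok",
        "anchors": anchors,
        # Breadcrumbs: what to reach for next...
        "next_tools": ["office_patch", "word_insert_at_anchor"],
        # ...and a literal call a weaker model can copy verbatim
        "usage": f'office_patch(path="{path}", anchor="{anchors[0]}", mode="dry_run")',
    }
```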
Another thing I hit upon was that signposting needed to be curated.
Borrowing a page from intent mapping, office_help(goal=...) returns a structured record–recommended chain with rationale, fallbacks, diagnostic strings to watch for, one imperative next_step sentence. Not prose. Not a README, not skills. Data the model can act on without reading comprehension.
Called with no arguments, it returns the catalogue. Called with an unknown goal, it returns the supported set rather than an error, which turns a potential workflow-stopper into a useful catalogue.
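A hedged sketch of what that record might look like–the field names follow the description above, the concrete values are invented:

```python
# office_help as data, not prose (values illustrative).
CATALOGUE = {
    "replace_section": {
        "recommended_chain": ["office_read", "office_inspect", "office_patch"],
        "rationale": "inspect first so the patch targets a live anchor",
        "fallbacks": ["word_insert_at_anchor"],
        "diagnostics": ["unmatched_targets"],  # exact strings to watch for
        "next_step": "Call office_read on the document now.",
    },
}

def office_help(goal: str | None = None) -> dict:
    if goal is None or goal not in CATALOGUE:
        # No arguments, or an unknown goal: return the supported set,
        # never a workflow-stopping error.
        return {"supported_goals": sorted(CATALOGUE)}
    return CATALOGUE[goal]
```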
The biggest reason simple models can’t follow chains is losing the thread between calls. “Insert a paragraph after the introduction” is fine in English but catastrophic if you expect the model to remember a byte offset across three tool calls.
In this particular scenario I cheated: since most Office documents have headings (or cells, or internal structured paths inside OOXML), I used either verbatim text from the document or immovable coordinates (which was particularly hard in PowerPoint, by the way).
So besides suggestions and hints, return identifiers your tools will later accept as input. If you find yourself returning data the model has to describe back to you in natural language, you’ve made a chain that will misfire on a Tuesday afternoon when you’re not watching.
I started out with individual editing tools per format, which made automated testing easy but was incredibly wasteful of context, so at one point I decided to make initial discovery much simpler–and since I needed to make all outputs auditable, I tagged the available sub-operations by risk.
office_patch is the same code path whether you ask for dry_run, best_effort, safe, or strict. One tool, four modes, one entry in tools/list.
Discovery cost scales with tool count, not mode count. And dry_run -> safe -> strict is an escalation chain the model figures out on its own without being told.
If you have N tools that differ only in how cautious they are, collapse them. You’re wasting everyone’s context budget.
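In schema terms, the collapse could look something like this (a sketch, not the server’s actual definition–inputSchema is the standard MCP field, the rest is illustrative):

```python
# One tool, four modes, one entry in tools/list.
OFFICE_PATCH = {
    "name": "office_patch",
    "description": "Apply edits to a document. Start with mode='dry_run'.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "targets": {"type": "array", "items": {"type": "string"}},
            # The escalation chain lives in one enum instead of four tools
            "mode": {
                "type": "string",
                "enum": ["dry_run", "best_effort", "safe", "strict"],
                "default": "dry_run",
            },
        },
        "required": ["path", "targets"],
        "additionalProperties": False,  # reject unknown arguments strictly
    },
}
```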
Linear chains are easy. Real chains have loops, and loops only happen when the server invites the model back in. Every mutating tool returns a standard envelope with status, matched_targets, unmatched_targets, and next_tools.
The model then branches on a small set of options “locally”, without needing to go back over the entire context, and if you name the diagnostic fields with the exact strings the model will see again in your instructions, the two reinforce each other.
In this particular case, again, I cheated. I figured out that the models were starting to call tools at random because they couldn’t introspect the document well enough, and ended up breaking files–so I always gave them at least one read-only tool, and the penalty for “I’m confused, let me look again” is one extra round-trip, not a destructive cock-up.
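Sketched out, the envelope and its escape hatch might look like this (the field names are the ones described above, the logic is illustrative):

```python
# Standard mutation envelope with a read-only recovery path.
def patch_envelope(matched: list[str], unmatched: list[str]) -> dict:
    ok = not unmatched
    return {
        "status": "ok" if ok else "partial",
        "matched_targets": matched,
        "unmatched_targets": unmatched,  # the exact strings cited in instructions
        # Success moves forward; confusion costs one read-only round-trip
        "next_tools": ["office_audit"] if ok else ["office_inspect", "office_patch"],
    }
```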
Pick five to ten core verbs and name them in get_instructions() or your local equivalent
Use consistent prefixes by surface
Provide a discovery tool that returns recommendations as data, not prose
Make the discovery tool browseable–no-arg returns the catalogue, unknown input returns the supported set
Embed forward breadcrumbs in every tool response
Provide a map/anchors tool so addresses survive between calls
Give every mutating tool a mode enum including dry_run
Return named diagnostic fields and cite the recovery tools
Standardise the mutation envelope. If one tool changes something in a specific way, make sure the others are consistent (arguments, semantics, etc.)
Reject unknown arguments strictly (this is much easier in some runtimes than others)
Provide an audit tool so the model has somewhere to land
Cache anything the recovery loop calls more than once, because, well, it will get called dozens of times even if you carefully curate paths through your tooling with hints.
Make repeat calls safe–models retry, and they should be allowed to (idempotence is hard, and often impossible).
Do the boring work in the schema and the descriptions. The model will happily do the clever bit if you stop making it guess.
I got annoyed enough with Safari Web Apps to write my own replacement.
It took about five minutes to get the core working, and maybe another hour of incremental tweaks spread over a day or so. That ratio–five minutes for the thing, an hour for the polish–tells you something about the state of the problem it solves.
Web App Viewer is a tiny native macOS shell that opens a URL in a WebKit window with no browser chrome. No address bar, no tab strip, no toolbar, no Safari-style fullscreen frame. One web page, one native window, as little visible UI as macOS will reasonably allow once a page is loaded (it hides traffic lights and scrollbars when the mouse is away).
You can drop URLs onto its Dock icon, send them from the Share sheet, open a .webloc file, or use a custom webappviewer:// URL scheme.
Safari’s “Add to Dock” Web Apps have been around for a while now, and the idea is sound–pin a website as a standalone app, give it its own icon, get it out of the browser tab pile. The execution, though, is maddening: it has always been broken across the board, and on macOS it is horrendous.
The resulting windows still carry persistent browser chrome I can’t hide, and the whole flow of creating one (find the menu item, wait, hope it picks up the right icon, hope it doesn’t break on the next Safari update) feels like an afterthought rather than a feature anyone at Apple actually uses.
This is one of dozens of Apple papercuts that accumulate into a kind of low-grade daily friction, and I have a growing list of them that I intend to write about at some point. But this one was fixable before dinner, so I fixed it.
I fired up Codex with the kind of detailed mini-spec I described in my agentic development piece–what the window should look like, how URLs should be accepted, what the drag behaviour should be–and told it to reuse the window styles and approach from Daisy and the USB Video Viewer (another small Swift project I built to test SBCs via USB capture without adding more monitors to an already cluttered desk).
Disclosure: OpenAI provided me with a 6-month trial of Codex for my Open Source work (which has also helped me fully isolate that from work), but you could probably do this with a brick-brained open-source local model (even if Swift is a mess and under-represented in LLM training sets, which is a problem even with SOTA models).
The core is just WKWebView in a native window with chrome that fades in on hover. The Share Extension, the macOS Service, and the URL scheme were bits I tacked on after, and all the scaffolding (Makefile, signing, etc.) was AI-generated, because there is absolutely no reason to do that by hand in 2026.
There were, however, two things that were a right pain:
Adding an invisible drag strip needed a nudge from memory, but Codex was useless there. I knew how I’d have done it in Objective-C and just guided it through the Swift equivalent until it worked. Everything else was straightforward.
Web manifest icon detection in Swift was… oh boy. Swift still does not have a sane async model (at least not the one I would expect), so the app would poke at the page and its web manifests but fail to wait for the bigger icons to load–it took me a few tries to get right.
But it was totally worth it. I now have six instances of this running, and I found (and fixed) subtle bugs when trying to create each one of them, so I’m pretty much calling it “done” other than some manual UX tweaks I want to do to the menus and dialogs.
The original motivation was wrapping Piclaw’s web UI as a frameless native-feeling app, and that works exactly as I wanted. But the nicer surprise has been dropping other self-hosted URLs into it–Grafana dashboards, Proxmox consoles, internal tools–and getting a clean, chromeless window for each. It turns out that removing the browser frame makes everything feel lighter.
And I am casting one of them to an Android device via AirPlay (more on that later when I get that one stable), and the lack of browser chrome makes it… just great. Zero wasted pixels, no distractions, just the content.
But the way it really improves on what Apple didn’t do for me is usability and practicality. Drop in a URL, check it out, then hit Cmd+I and a new copy is installed to my ~/Applications folder, ready to launch from Spotlight, without cluttering the Dock or trying to figure out where they hid it in the sharing pane.
I was a happy Fluid user years ago, and I know there are paid apps that do roughly this. But the uncomfortable truth for Apple indie developers in the age of AI is that there is zero reason to pay for any of them when I can build a tailored version for my own needs this fast.
That’s not a criticism of those apps. It’s a warning sign about what AI-assisted development does to the economics of small, focused utilities–and, in the context of Mac apps, which were always a tiny cottage industry, is going to be worrisome for many.
But the real lesson here, I think, should be about what Apple ought to have just built into macOS instead of shipping the half-baked Web App support that provoked all of this in the first place.
A brisk, brilliantly coded tutorial on vector quantisation: how far you can push compression on model KV caches and embeddings without breaking what matters. The interactive sliders and diagrams do the teaching before the maths catches up.
I built vibes in January, then turned it into piclaw in February, because I wanted an agent interface that actually fits around my day. Federico Viticci’s Remodex review is a neat reminder of why this matters: yes, a Codex remote is useful, but building around a single vendor app still feels a little brittle.
I have a Codex Pro plan and I do enjoy it; I just can’t see myself using it exclusively on the Mac when the same workflow should follow me everywhere.
Amidst the chaos brought on by my usual seasonal allergies, work turned out to be calmer than usual–the usual industry churn and constant rumors of layoffs have made “calmer” a relative term, though–so most of my evenings went to projects.
I also re-read Project Hail Mary–partly because I needed something absorbing that wasn’t a screen, and partly because Weir is one of the few authors who makes engineering problem-solving feel like a page-turner. It holds up, and I can’t wait to see the movie.
Last week’s PPC detour is, surprisingly, working much better than the 68k JIT, and has already paid off: my naïve take on memory layouts meant I hit one of the banes of modern emulation very fast–ASLR on aarch64 Linux was randomising addresses that the JIT needed to stay fixed–and now I understand a lot of the issues I was having with the 68k version.
The fix for now was to have the binary disable its own ASLR at startup via personality(ADDR_NO_RANDOMIZE) and re-exec, which is ugly but works and is the sort of thing nobody documents. And after doing that on the BasiliskII side as well, a lot of issues went away.
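For the record, the pattern itself is tiny. The emulators do this in C++, but the same trick sketched in Python looks like this (ADDR_NO_RANDOMIZE is 0x0040000 in <linux/personality.h>):

```python
# Disable our own ASLR and re-exec (Linux-only; illustrative sketch).
import ctypes
import os
import sys

ADDR_NO_RANDOMIZE = 0x0040000  # from <linux/personality.h>
libc = ctypes.CDLL(None, use_errno=True)
libc.personality.argtypes = [ctypes.c_ulong]
libc.personality.restype = ctypes.c_int

# personality(0xffffffff) queries the current persona without changing it
current = libc.personality(0xFFFFFFFF)
if not (current & ADDR_NO_RANDOMIZE):
    libc.personality(current | ADDR_NO_RANDOMIZE)
    # Re-exec ourselves: the fresh process image gets fixed addresses
    os.execv(sys.executable, [sys.executable] + sys.argv)
```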
Both JITs now have proper Makefile workflows with tmux targets, which means I can build, test, run and kill either emulator from a single command–which I’ve been doing with my iPad, from the comfort of my couch.
As to the Cydintosh, it is not assembled, because the resistive touch screens I have are borderline unusable for precise tapping (so good thing I only 3D printed a test fit with old filament). I ordered a couple of larger capacitive ones and a bunch of other ESP32 stuff, so I expect to come back to that next weekend.
My little Proxmox hack has been working great–although I had to fix a few things after upgrading one of my nodes (regression testing is the bane of my existence these days), pve-microvm now supports all the operating systems I care about, plus a few I had never considered using, and, other than the fact that I am creatively patching Proxmox’s interface, it has been pretty stable, which was unexpected.
I got piclaw to hack in a custom OCI dialog to replace the Create VM wizard, an xterm.js console tab for microVMs (noVNC makes zero sense for serial-only machines), and a bunch of other features.
And of course it broke when Proxmox shipped a patch release, but since I have a z83ii as a sacrificial node I can contain the blast radius of any upgrades. Mostly.
But right now I’m converting most of my LXCs to microVMs, and it’s been a blast–the speed is fantastic, and the fact that I can run Plan 9 in a microVM is just icing on the cake.
Like I wrote above, regressions are the bane of my existence, and I am getting really annoyed at TypeScript because despite all the nice tooling, it can still pass most linting and “compiling” and fail spectacularly at runtime. And since the upstream Pi packages have been undergoing considerable churn and breaking changes, a lot of piclaw broke in various ways, and experimenting with different models really doesn’t help.
Even as I’m typing this, I am (yet) again waiting for an OpenAI model to audit some UI breakage that Anthropic’s models caused, because they sometimes just drop chunks of the code when editing it, and I am getting really annoyed at fixing things three times in a row…
And yet, the flexibility of Pi and its extension model is pretty amazing–I decided to adopt it wholesale and have started breaking off pieces of piclaw into a piclaw-addons repository, into which I can throw all the mad experiments I want–for instance, yesterday I hacked together a “cheapskate” addon (a cost-conscious model router) that lets you use a bunch of free tiers across various providers, something that would be impossible to do in most harnesses…
And yet, I think it’s time to have a backup. So I created gi, a Go harness inspired by Pi and designed for extensibility, but where all the extensions are externalized to the point where they can’t (hopefully) break the core, and where I want to try to rewind the clock to the simpler times of LISP machines–take your workspace, copy a state dump to another machine, and just carry on.
So I designed it as a single Go binary that can pack everything into a single SQLite database, and that binary embeds both a Clojure dialect (via Joker) and a JavaScript engine that can hook into the state machine–so extensions can be written in either and live inside the SQLite blob alongside everything else.
And in true belt and suspenders style, I’m going to pack both a TUI and a web UI in the same binary.
But, most importantly, I’m taking a completely different approach to dependencies and testing–starting by bringing together most of my previous stuff in various forms, and writing a functional test suite, not just a code one. It’s still missing tool execution, keychain and workspace indexing–but it’s at the point where I can sit down and have a conversation with it.
Yeah, I know. Another project. But I realized that I needed to remind myself of how to bootstrap a kernel on bare metal before I even try to get Haiku running outside QEMU, so I started poking at porting 9front to one of my ARM SBCs.
Plan 9’s ideas about distributed computing and per-process namespaces have been rattling around in my head since the 90s, but, more to the point, it is a very simple system, and it shifts the bulk of the effort into getting U-Boot and hardware bootstrapping to work instead of trying to figure out everything at once.
As a fun detour from that, I ended up creating a simple USB Video viewer to pull up video output from a USB capture card to watch things crash spectacularly.
While I was at it, I finally got around to refreshing rcarmo.github.io–my open source landing page, which had been accumulating a decade of pixel dust while I was off doing other things.
It’s nothing fancy: a single page that groups some of my repositories by topic (AI agents, cloud, hardware, infrastructure, libraries, macOS, terminal stuff) with one-line descriptions for each, and acts as a sane front door for anyone who stumbles onto my GitHub profile and doesn’t fancy scrolling through 380-something repos.
The refreshed landing page, sorted by topic and (slightly) opinionated about what's worth highlighting.
The rest of the week’s GitHub activity was the usual scattering: a small go-ai update (the unified LLM client I’m using inside gi), some ground-init and mdnsbridge cleanups, a zmk-config-totem tweak for the split keyboard I’ve been slowly getting used to, and a couple of apfelstrudel commits–because if I’m going to break my brain on emulators all week, I might as well let an AI agent help me make some weird music every now and then.
Flint, my “very stable” agent, kept earning its keep on the side: I finally split out MLX and embeddings as their own AI subsections (consolidating entries that had been awkwardly squatting in the language tables) and tucked away a couple of agentic odds and ends–notably pi-draw and a baloney detection kit–into the relevant pages.
None of this is glamorous, but the resource pages have been drifting for a while, and having an agent do the boring sorting (and ask me sensible questions about edge cases) is exactly the kind of thing to deal with chores I’ve been putting off for years.
And yeah, I know it’s too much, and that I’m spreading myself too thin.
So this is happening. Cook moves to executive chairman, Ternus takes… his turn at the helm.
Cook turned Apple into the most efficient manufacturing and logistics company on the planet–something I’ve been reading about in detail via Patrick McGee’s Apple in China, which makes a painfully convincing case for just how deep that dependency runs. He also built a massive services and content business on top of it.
But despite all of that, the soul of the company has felt increasingly bland, and the accumulating faux pas in software quality–culminating in the Liquid Glass debacle and the general state of macOS and iPadOS–have tested even the most faithful.
Ternus is a hardware guy, and very likely deeply involved in the MacBook Neo. My hope is that he has a better feel for what good product actually looks like, and can drive the kind of change that has been overdue for a while now.
I’d start with fixing macOS and iPadOS, preferably in a way that matches what people actually expect from their devices rather than what a design committee thinks looks modern.
Whether that happens is another question entirely. But at least the new CEO isn’t from the services side.
(It would also be nice if Apple realized that remote work is a thing, but I think that boat has sailed)
This was a pretty decent week despite my allergies having kicked in to the point where I have constant headaches, but at least I had quite a bit of fun with my projects.
Yeah, I find Opus’s sycophancy and its traits obnoxious, but this time it’s right–I was trying to get Cydintosh to work with my particular flavor of Cheap Yellow Display, and having so much trouble matching screen corruption and flipped colors (and bits) to the display code that, after I finally managed to get at least a stable (if broken) boot picture on screen, I thought to myself… why not let piclaw sort this out for me?
So I plugged the CYD and a Logitech Brio 4K into the Orange Pi 6+, and… I got the most surreal ESP32 closed loop debugging setup going:
I ended up moving the camera farther away to get better focus
Five minutes later, I had all the display bugs fixed except for touch input, which was still rotated–a fair bargain.
I was looking at smolvm and going through my notes on Firecracker and other sandboxing mechanisms when I realized I had come across QEMU microVMs a few months ago while looking at agent sandboxing mechanisms and the old QEMU JIT.
Now, I actually think that microVMs are way overrated, but I was literally in the shower when I realized that, for me, Proxmox would be the perfect way to manage them–I have zero interest in running microVMs on my laptop, and just as little in running another exotic hypervisor.
So I did a little spelunking, and… it worked. Badly, but it worked. I took my terminal session, added a few notes, and asked piclaw to investigate whether it was possible to patch the UI–and guess what, it was a pretty simple patch. I then got the agent to flesh out a Debian package and turn my hacks into a CI/CD workflow that builds and packs a suitable kernel into the .deb, and now I have a nice VM template, decent integration of microVMs into the web UI, the works.
pve-microvm patches qemu-server to add the machine type, ships a template workflow that pulls OCI container images and converts them to PVE disk images, and redirects serial to the web console so you get a proper terminal in the UI. There’s also init support and a balloon device (as well as qemu-agent support), but the OCI images are so barebones that I haven’t yet sorted out all of the ergonomics about using them to automatically deploy stuff.
Proxmox microVM integration in action
This looks like a very low-impact addition to Proxmox so far and I would love to upstream it, but I’m not holding my breath, since maintainers aren’t trivial to reach and the old-style “join our developer mailing-list” approach is… just too effort-intensive given how much stuff I have to do these days.
The macemu work took an unexpected turn–I shifted from BasiliskII (68k) to SheepShaver (PowerPC), and things moved a lot faster than I expected. To make a long story short, it was Friday and I idly asked piclaw to do a comparative source analysis between both emulators, hoping for something that I’d missed in the quagmire of ROM patches I’ve been wading through.
Turns out it told me there was no real JIT support, then did a comparative analysis of opcode coverage, ending with “there are, however, much less opcodes to translate in the RISC architecture. Do you want me to set up a quick opcode test harness for PPC?”
Uh… yeah? By Friday evening, every opcode family except AltiVec had native ARM64 codegen and the thing was booting to the Welcome to Macintosh screen (and crashing, but this was comparatively 100x faster than the 68k work). Then yesterday afternoon, after some back and forth about creating a second harness (effectively a headless Mac with no hardware, to skip problematic ROM regions), I got it to do AltiVec via NEON (which the Orange Pi 6 Plus supports–I’ve yet to devise a fallback path for older chips).
The process was straightforward: point piclaw at an opcode group, have it implement the native codegen, run the harness, iterate on whatever broke, then, once an opcode group was “done”, smoke test it on the headless Mac harness. The AltiVec stuff was the most satisfying part–mapping NEON intrinsics to AltiVec semantics is tedious but tractable, exactly the kind of work where AI earns its keep and the harness catches every subtle difference.
SheepShaver now boots Mac OS to a desktop with VNC input working. There’s still a long way to go because I have done zero hardware testing (it’s got no audio, only VNC input and, more importantly, no network or graphics acceleration), but a from-scratch PPC JIT on ARM64 booting to a desktop in around 24h is… not nothing.
I wish I could finish the 68k JIT, though–the register allocation strategy I guided the agent towards and the weird ROM patches BasiliskII does just don’t get along.
The fun part for me has been that a lot of this has been done on an iPad on my couch, using the Apple Pencil or iOS voice typing to scratch out instructions. After an outing yesterday, I had the idea to just swipe between agents, and… oh boy.
The idea is simple–swipe left or right on the timeline to switch between agents–but making it feel right in an iOS PWA required far too many weird CSS and JS hacks. And the one real problem I’m having is that AI, no matter how many times you specify in painful detail what you want and how many actual code samples you give it, is still too prone to breaking very intricate UX–I’m getting really tired of weird regressions every time I add another feature.
I’m not an Anthropic customer (besides GitHub Copilot’s model selection, which now also includes the new, lobotomized Opus 4.7, I have a personal Codex subscription for OSS work), but so many people seem to have been caught by their ban on third-party coding harnesses that I decided to dust off Vibes, start porting it to Go (which was already in my backlog), and turn it into an ACP-only wrapper so that people can use Claude with a nice web UI.
I think it’s the least I can do, and also gives me a decent web UI to drop in for my own work when I absolutely have to use Copilot.
And, of course, since I have far too many projects already, I decided to see if I could get Haiku to boot on ARM64. I don’t particularly care about doing AI for salesy startupy business stuff, but I love using it to build things I think should exist, and I have quite a few more I’d like to make happen…
I have a soft spot for tiny Macintosh projects, and this one pushes all the right buttons–an ESP32 Cheap Yellow Display board running a Mac Plus emulator inside a 3D-printed case. I haven’t finished hacking my Maclock yet, but it’s a perfect fit with my ESP8266 hackery, not to mention the collection of vintage emulation hacks I keep filing away and my never-ending ARM64 JIT for BasiliskII, so I had to link to it.
The utterly brilliant part is that it doesn’t stop at getting System 3.2 onto a small screen–it adds little Retro68 utilities for weather, Wi-Fi status and hardware control, which turns the whole thing into equal parts retrocomputing in-joke, embedded hack and practical home automation gadget.
It’s already printing (in the obligatory platinum-like PLA I keep around for special occasions), and I am so going to plug this into HomeKit somehow…
Mistral published a 52-minute read on how Europe should build an independent AI stack–talent pipelines, single-market scale, local infrastructure, sovereign compute, the lot. It reads like a policy brief dressed up as a manifesto, and while it has a glaring flaw and some of the proposals are predictably self-serving (Mistral is, after all, the company that would benefit most from “buy European AI” procurement rules), the underlying analysis is hard to argue with.
The five pillars–attract talent, scale the single market, drive adoption in the real economy, build local infrastructure, and secure sovereign AI capacity–are all sensible, and the specific measures (an EU AI talent visa, streamlined regulation, public procurement mandates, European cloud infrastructure) are concrete enough to be actionable rather than the usual Brussels hand-wringing. The 40% figure they cite for Europe’s share of global AI research output versus its minuscule share of commercialisation is the kind of stat that should make policymakers uncomfortable.
Where it completely falls apart, though, is that Mistral, even as a European company, currently doesn’t hire remotely in Europe–so the whole thing feels a tad insulting if, like me, you’re actually in the industry and not in politics.
There is some merit to it, though, and whether any of this actually happens is a different question entirely. Europe’s track record on turning common sense into working industrial policy is, to put it generously, mixed–and the current geopolitical climate makes “digital sovereignty” feel less like an aspiration and more like an urgent necessity that nobody has quite figured out how to fund.
Thanks to a bit of spillover from Easter break, this was a calmer, more satisfying week where I could actually get stuff done and even have a bit of fun.
My idea of fun, apparently, is to do 3D visualizations in piclaw
Now that piclaw is in cruise mode, I’ve started focusing on actually using it.
So I created an instance called Flint, which manages not only my Obsidian vault but also all of my personal pursuits and most of my homelab: I gave it the API tokens for my Proxmox cluster and Portainer, and over the past week it’s been busy:
It re-tagged most of my notes and drafts (as well as adding reference URLs for ongoing drafts), quizzing me on what to do with specific notes as it went
It rebuilt and redeployed my GPU sandbox (which I broke last week): recreated the VM, mounted the Ubuntu ISO, prompted me to run the installer, and installed the latest NVIDIA drivers, nvidia-docker and a baseline set of utilities.
I then asked it to look at the Portainer stacks in my gitea instance, my Obsidian notes, and what needed to be set up, and it installed the Portainer agent and brand new versions of the stacks with tweaked network and volume settings, updated my notes, and upgraded the pinned image versions (troubleshooting as it went).
It developed and published an OPDS server and an EPUB read-later service so I can fetch interesting web pages and read them later on the XteInk X4, including monitoring the CI pipeline and redeploying the containers.
It audited my Cudy OpenWRT config and set up centralized stats collection in Graphite, which I had been meaning to do for ages (and I intend to have it set up Telegraf on other machines to collect metrics).
So far, Flint is a resounding success (it’s using GPT-5.4, a fairly sensible and stable model), but it doesn’t just do notetaking and operations.
Flint has also become quite useful to help me tidy up my workflow—I was already using a piclaw instance to convert ancient Textile and raw HTML posts into Markdown in batches, but there are a few things that have been nagging at me for years and that I can finally make significant progress on:
Adding links to my resource pages
Drafting link blog entries
Streamlining static site builds
I’ve had Shortcuts to do the first two for ages, but they both relied on adding bits of text to Reminders that were then post-processed and added to git using either the CLI or WorkingCopy. That worked OK for a while, but my iPad mini’s increasing slowness has made them quite frustrating, especially since I tend to do that kind of quick posting over breakfast and it was taking up too much time.
As it happens, GitHub has a REST API for Git Trees, and what that means in practice is that I can update a JSON changeset with these minor changes, let it accumulate over breakfast, and then apply them in batches–or, rather, have Flint do that, with all the guidance and steps in a SKILL.md file.
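For the curious, the whole trick fits in a few REST calls. This is a minimal sketch–the repo, branch and changeset contents are hypothetical, but the endpoints are GitHub’s documented Git Data API:

```python
# Batch-apply an accumulated changeset via GitHub's Git Trees API.
import requests

API = "https://api.github.com"
REPO = "user/site"  # hypothetical
BRANCH = "master"
HEADERS = {"Authorization": "Bearer <token>",
           "Accept": "application/vnd.github+json"}

changeset = {  # path -> new content, accumulated over breakfast
    "links/ai.md": "...updated resource page...",
    "posts/linkblog-entry.md": "...new draft...",
}

# 1. Find the commit the branch currently points at
ref = requests.get(f"{API}/repos/{REPO}/git/ref/heads/{BRANCH}",
                   headers=HEADERS).json()
base = ref["object"]["sha"]

# 2. One tree holding every change (inline content becomes blobs)
tree = requests.post(f"{API}/repos/{REPO}/git/trees", headers=HEADERS, json={
    "base_tree": base,
    "tree": [{"path": p, "mode": "100644", "type": "blob", "content": c}
             for p, c in changeset.items()],
}).json()

# 3. Commit the tree and move the branch–no clone, no git binary
commit = requests.post(f"{API}/repos/{REPO}/git/commits", headers=HEADERS,
                       json={"message": "Breakfast batch",
                             "tree": tree["sha"], "parents": [base]}).json()
requests.patch(f"{API}/repos/{REPO}/git/refs/heads/{BRANCH}",
               headers=HEADERS, json={"sha": commit["sha"]})
```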
So my new breakfast workflow is to just send links to Flint using the iOS sharing pane or a bookmarklet (still experimenting with both), have it create a JSON changeset for links, and occasionally ask it to screenshot a page and create a blank Markdown document for linkblog posts. That is pre-filled with a title, likely tags and the appropriate image reference, and I just pop open the built-in editor tab in piclaw, finish the post and ask it to add the files to the changeset and post them via the API.
So far, it’s been going swimmingly: zero git fetches/commits/pushes, all handled server side, and very little friction–and it works on my iPad mini, albeit still slowly.
Another thing I’ve been working on is porting the Python site builder to Go for both speed and maintainability—the current codebase has some 20-year old hangovers that I wanted to get rid of, and some kind of reset has been long overdue, so I have been slowly poking at this for the past few months.
As it happens, the overall indexing and rendering process was pretty trivial—the real challenge has been to make sure that it looks exactly the same, especially given that my engine has some pretty specific Wiki-linking rules and I’ve accumulated a bunch of rendering helpers and custom plugins over the years.
Plus everything related to HTML rendering has changed: parsing, link resolution, templating, the works. And that’s enough to juggle already, so I don’t want to change the front-end design at all (yet).
I decided to be ambitious and aim for full rendering parity. So what did my little army of AI helpers do?
It converged on doing visual diffs over randomly sampled pages: take a locally rendered version, look at the public page, and generate an image that it can easily rate as “close” or “broken” just by counting the ratio of red pixels:
This is both brilliant and scary at the same time
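Mechanically, the comparison is simple enough to sketch (the filenames and thresholds here are invented; this is the gist, not the actual pipeline):

```python
# Red-pixel visual diff between a local render and the public page.
from PIL import Image, ImageChops

local = Image.open("local/page.png").convert("RGB")
public = Image.open("public/page.png").convert("RGB").resize(local.size)

# Grayscale difference, thresholded to ignore anti-aliasing noise
diff = ImageChops.difference(local, public).convert("L")
mask = diff.point(lambda px: 255 if px > 24 else 0)

# Paint differing pixels red over the local render for eyeballing
overlay = local.copy()
overlay.paste((255, 0, 0), mask=mask)
overlay.save("diff/page.png")

ratio = sum(1 for px in mask.getdata() if px) / (mask.width * mask.height)
print("close" if ratio < 0.02 else "broken", f"({ratio:.2%} red)")
```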
The process is greatly streamlined: sample 100 pages out of the nearly 10,000 we have now, render, batch compare, show me the worst ones, and then discuss and generalize the fixes (which is the only part the LLM is actively involved in). I could probably use autoresearch to automate this, but some of the fixes have to do with legacy rendering logic that no AI could ever figure out.
Still, this has converged very quickly to minor typography and spacing differences, and once I’m happy with the engine I’ll start looking at optimizing the actual blob uploading part–which I aim to standardize via rclone to remove my current dependency on Azure storage accounts, but greatly optimize with deltas.
It turns out that if you tell an AI that empty catch blocks are forbidden, the thing will just… go and add comments inside them, instead of doing something useful like a warning log message…
I’m now doing another code audit pass over the entire piclaw codebase, and this kind of mechanical fix is trivial to set up and do reliably with autoresearch:
An autoresearch session doing a code audit pass
Now to see if I can get some reading and 3D printing done as well, since the whole point of using AI in the first place was to have more free time… right?
I have been having feelings about Apple lately. This blog may have drifted a fair way from its original focus on macOS, but I am still, first and foremost, an Apple user – just not an exclusively Apple user, and perhaps not even a particularly obedient one anymore, since I use both Windows and Linux every day and have grown used to judging platforms by what they let me get done rather than by whatever story they are trying to tell about themselves.
That makes the current moment a little awkward. Apple is still extraordinarily good at making hardware I want to pick up and use, and still more coherent than most of the industry in the broad strokes, but it also feels increasingly prone to sanding off the wrong edges, reinventing the UX wheel, and constantly adding paper cuts to their software.
The iPhone is probably the clearest example of that tension. It is still the phone I would rather carry, and the one whose hardware I trust most, but iOS has become steadily more fussy without becoming proportionally more capable.
A lot of it has been the constant UI friction and pointless balkanization of features like screen mirroring, which I would very much like to have – I see zero point in using Messages on my Mac or futzing around with Handoff and AirDrop when I could just, you know, pull up a window into my phone and type stuff in.
And I know Apple could indeed engineer a way to make those features DMA-compliant if it really wanted to – I suppose breaking the user experience across the board with Liquid Glass had enough priority to preempt allocating engineering resources to, you know, proper features.
Sharing things, moving files around, background activity, browser limitations, the endless little inconsistencies in system UI and the ungainly bloat in Settings – that friction accumulates. None of it is fatal on its own, but the aggregate effect is that the platform feels far less light than it used to, even while Apple keeps insisting that everything is becoming more seamless.
I’m going to say it outright: I found Liquid Glass insulting. Not just visually, but also because it tells me that instead of fixing glaring gaps in things like automation (Shortcuts is definitely not in good health, and AppleScript is pretty much dead)–gaps that could actually have put Apple at the forefront of automation and AI (never mind the miserable failures in Siri and Apple Intelligence)–someone at Apple actually decided that breaking visual affordances took priority over stability and providing consistent application intents and hooks across the board.
Even then, macOS is in a better place than iOS, but mostly because it still retains enough of its older character to be workable. Remember, I can just patch the visual inconsistencies away.
There is still a proper filesystem, there is still a shell (even if Apple seems intent on breaking the userland in very small increments across releases), there are still enough escape hatches to route around bad decisions, and Apple Silicon has papered over a remarkable amount of software bloat simply by being absurdly fast and power-efficient.
But the cracks are visible there too. System Settings remains a mess, cross-platform application quality keeps declining, and the old Mac assumption – that a user might actually want to understand how their machine works – seems to matter less every year. Meanwhile iOS keeps borrowing bits of the Mac’s vocabulary without acquiring the Mac’s actual flexibility, which leaves both platforms feeling oddly misaligned.
The iPad remains the device I most want to use more than I actually do. I may pick one up every morning to read the news and get drafts started, but the Neo nullifies any interest I might still have in upgrading my iPad Pro. The hardware is excellent, the battery life is still absurd, the pencil is useful, and for reading, sketching, note-taking and casual browsing it remains hard to beat. Fine.
But every time I try to push it into being a serious general-purpose computer, it reminds me that Apple still has not decided what it wants the iPad to be. It can approximate a laptop for stretches at a time – and sometimes very convincingly – but the moment you need proper peripheral support, predictable file handling or sustained tool switching, the abstraction turns into safety glass – and I’m back to my long-held opinion that the only good iPad is the iPad mini.
That’s what I intend to upgrade this year, even if Apple comes out with a decent foldable iPhone (and, by the way, I really like the “leaked” form factor, because phones have become stupidly tall and unwieldy).
And this is where Fedora comes in, because it has become my most useful point of comparison. Linux on the desktop is still Linux on the desktop – gloriously inconsistent, occasionally infuriating, and always willing to expose its plumbing at the worst possible moment – but my experience over the past few years is very conclusive: Fedora has reached a point where, for a lot of everyday work, it is simply easier to reason about than either macOS or iOS.
That does not make it better in every respect. It is not. But it does mean that a lot of the breakage in Apple software now has a reference point, and even considering I was always a UNIX user and deeply technical, the creature comforts that Linux now provides give me a lot more confidence than Apple’s software.
If Qualcomm weren’t so obtuse about only supporting Windows and ARM laptops were more open, things would be very interesting indeed.
I still like the hardware, still prefer the overall ecosystem in a number of places, and still find myself evaluating a lot of the rest of the industry by standards Apple set years ago.
But I also think it is getting harder to ignore how much of the original appeal has been traded away due to sheer mismanagement of software QA and Apple’s refusal to acknowledge the gaps across iPad, macOS core applications, and a consistent user experience.
This was a long one–I spent a fair bit of time with the Orange Pi 6 Plus over the past few months, and what I expected to be a quick look at another fast ARM board turned into one of those test runs where the hardware looks promising on paper, the software is wonky in exactly the wrong places, and you end up diving far more into boot chains, vendor GPU blobs and inference runtimes than you ever intended.
The Orange Pi 6+ on a corner of my desk
Unlike most of the ARM boards I’ve reviewed until now, this one is not an RK3588 board: The Orange Pi 6 Plus uses the CIX P1 (CD8180/CD8160), with 12 CPU cores, a Mali G720 GPU, a dedicated NPU and a wild set of specs for the form factor. Boards like this promise everything at once–homelab, edge AI, dual 5GbE, low power–but they only matter if the software gets out of the way.
Disclaimer: Orange Pi supplied me with a 6 Plus free of charge, and, as usual, this article follows my review policy.
And, for a change, I decided to make sure the software did exactly that, and made it my concern from the start–i.e., I built my own OS images for it (a fork of orangepi-build) and went in a bit deeper than usual, spending around two months taking notes, benchmark logs and even Graphite telemetry as I went along.
One of the reasons I wanted to test this board is that the SoC is the CIX P1, which Orange Pi bills as a 12-core part with a combined 45 TOPS across CPU, GPU and NPU. The machine I tested came with:
CIX P1 (CD8180/CD8160), 4×Cortex-A520 plus 8×Cortex-A720 cores
16GiB of RAM (roughly 14GiB visible to Linux)
dual Realtek RTL8126 5GbE
Realtek RTL8852BE Wi-Fi and Bluetooth card
Mali G720 / Immortalis-class GPU
A three-core Zhouyi NPU
And if you’ve been paying attention to all my homelab testing, those two 5GbE ports alone make this more interesting than most hobbyist SBCs. But, of course, there is a lot more to this board than that:
The CPU is interesting in itself–the fastest A720 cluster reaches about 2.6GHz, while the A520s top out around 1.8GHz, so like many other big.LITTLE ARM architectures you get asymmetric clusters rather than a uniform twelve-core machine.
lspci is a bit more revealing, especially because you get to see where the dual 5GbE setup and the Wi-Fi controller are placed–each seems to get its own PCI bridge.
Nothing exotic, which I rather like. And, by the way, the board ships with Cix Technology Group UEFI, version 1.3, so setting up boot devices and managing (very) basic settings was trivial.
This is where I took a very large detour from my usual approach: I decided early on that I wasn’t going to use a vendor image for this board.
Vendor images for SBCs like this always tend to be good enough to boot, occasionally good enough to do basic benchmarks, and almost never something I want to build on–especially if I’m doing local AI work, host-native services, or anything that requires me to trust package sources, first-boot behaviour and upgrade paths.
I wanted a server-first layout, reproducible fixes and a place to bake in GPU/NPU prerequisites, so I forked orangepi-build and started from there, with a fairly high bar:
I wanted a fully reproducible Debian 13 / Trixie build with features like /dev/kvm present, not a vendor image with stale software and missing features.
The build needed to stop treating Ubuntu as the only real target–add-apt-repository, PPA logic and software-properties-common had to be cleaned out.
Boot fixes had to be baked in from the start, not applied as post-flash rituals.
First boot had to be deterministic. If the root filesystem resize requires me nearby with serial and patience, the image isn’t finished.
I needed a clean place to stage GPU firmware, vendor userspace and NPU packages.
The Orange Pi repository included kernel 6.6.89-cix, so a lot of the above was already “there”–I just needed to hack at it, but instead of doing it entirely by hand I got piclaw to set things up on an Ubuntu 22.04 VM.
Over a few weeks (this took a while), the above list translated into a fairly concrete set of changes in the build tree:
added Trixie configs under external/config/{cli,desktop,distributions}/trixie
patched scripts/distributions.sh for Debian 13 support
fixed the board config to allow trixie under DISTRIB_TYPE_NEXT
removed Ubuntu-only dependencies from the package lists
forced standard Debian mirrors
made the kernel build non-interactive
started baking in GPU/NPU prerequisites and development tooling for later testing
The package side needed archaeological work too. I patched orangepi-config to stop behaving as though it were on Ubuntu, removed software-properties-common from the Trixie dependency chain, forced regeneration of cached packages, and went hunting through component_cix-next for whatever vendor bits still existed and matched my kernel, taking notes throughout.
My first boot-related note on this board was short: I flashed my custom Trixie image, got as far as GRUB, and it fell over because the EFI stub was wrong. The image did contain the right DTBs (sky1-orangepi-6-plus.dtb and friends), but the build scripts had somehow commented out the useful menu entries and the default pointed at the ACPI path.
But getting past GRUB was only half the battle. The first real boot surfaced another annoying issue: the partition resize worked, the root filesystem resize didn’t, and the machine failed to reboot cleanly at the handoff. I had piclaw trace the resize helper, found it was disabling itself before the second stage could run, and patched that too.
The whole thing made for a pretty intensive couple of weeks:
Build and fix timeline
In parallel, I made sure to include GPU/NPU support:
firmware symlink so panthor could find mali_csffw.bin
baked in cix-noe-umd and cix-npu-onnxruntime
and a big pile of dev tooling so the board could bootstrap AI experiments without turning into a scavenger hunt
Once the image was booting reliably, I wanted the board off SD entirely. I had a 512GB NVMe drive sitting about, so I had piclaw handle the migration–even though it had just finished patching orangepi-config, the actual cutover was done manually: partition the NVMe into EFI, root and swap, rsync everything across, patch grub.cfg to point at the new PARTUUID, reboot, verify, remove the SD card.
So, to recap, I had to fix these things for my custom image:
Boot chain: initially broken because GRUB defaulted to the wrong path; stable once DTB boot was forced
GPU / Vulkan: initially llvmpipe fallback or panvk failure; working with vendor Vulkan ICD on mali_kbase
OpenCL: not useful at first, functional once the vendor userspace was in place
NPU kernel side: visible from the beginning, probe messages reporting three cores
NPU userspace: present only in fragments, inconsistent package references, a lot of manual validation needed
But after the first few steps were done, I had zero issues installing or building software on this–GCC 14.2 from Trixie, Bun as the primary scripting runtime, and the usual complement of build-essential, cmake, clang and ninja for C/C++ projects.
Python 3 and pip are present for the inevitable bits that still need them, and Docker runs cleanly, plus I made sure I had /dev/kvm available for virtualised workloads–and with the CIX patches for the P1 SoC, everything went swimmingly. The kernel is PREEMPT-enabled, which is pleasant for interactive work and inference latency, though I haven’t tested RT workloads.
I even got Proxmox to run reliably on this with zero issues (including creating ARM VMs on it) before wiping the NVMe to do some AI testing.
The one area where the software story gets awkward is the vendor-specific GPU and NPU userspace–covered in the next two sections. Everything else about running Debian on this board is unremarkable, which is a compliment.
Out of the box, the Linux graphics story was absent. The kernel side was in a half-state that looked superficially encouraging–/dev/dri/* present, both panthor and mali_kbase around, the system clearly aware of a Mali GPU, etc.
But Vulkan fell back to llvmpipe, and forcing the Mesa Panfrost ICD produced Unknown gpu_id (0xc870) errors. So I had piclaw go through the Orange Pi and component_cix-next package sources and find the missing pieces: vendor userspace for the CIX stack–cix-gpu-umd, cix-libglvnd, cix-libdrm, cix-mesa and a Vulkan ICD pointing at libmali.so.
Installing those got me partway–the userspace reported No mali devices found, because the board was still on the wrong kernel path. Once I rebound the GPU from panthor to the vendor mali/mali_kbase stack, /dev/mali0 appeared and Vulkan reported actual hardware:
deviceName = Mali-G720-Immortalis
driverID = DRIVER_ID_ARM_PROPRIETARY
OpenCL also came up correctly afterwards, again via the vendor path.
This was pretty good news as far as typical SBC testing goes, since it means you can get decent (if vendor-specific) GPU support working–but getting there involved driver rebinding, vendor package archaeology and a persistent module policy to keep the machine on the right stack across reboots.
The NPU story was, if anything, even more typical of this class of hardware.
Linux clearly knew there was an NPU–dmesg reported three cores during probe–but the userspace was absent or incomplete and the package references inconsistent enough that I had to validate URLs by hand. One package version was simply gone, another worked, and I only reached a coherent install because component_cix-next still had enough usable artifacts lying about.
Not to say the NPU is fake or useless–it isn’t. But the tooling has that familiar feeling of being assembled by several teams who weren’t speaking to each other as often as they ought–and if your interest in a board like this is local AI, that matters more than any TOPS figure on a product page.
This is where the board started being interesting.
Since I have been getting more and more involved in low-level AI work, I spent most of my time testing local inference–the Orange Pi 6 Plus is not a universally good AI box, but it is surprisingly usable within a narrow envelope of models and runtimes.
And to make it usable for a few use cases, I needed a model-and-runtime combination that felt like an actual working stack rather than a demo. I ended up trying four inference runtimes–PowerInfer, ik_llama.cpp (a CPU-optimized fork of llama.cpp), vanilla llama.cpp, and my own Vulkan-patched build of llama.cpp for the Orange Pi 6 Plus’s GPU (the NPU, alas, like many other ARM SoC NPUs, is designed more for vision processing than LLM work, and I spent a few evenings trying).
I ended up running well over a dozen different combinations of models and runtimes, and these five were the ones I invested the most time in, since I wanted a model that was powerful enough for “production” use even if it was a little slow in practice:
Inference performance by model and runtime
The dark bars are generation speed, the lighter bars are prompt processing. The verdicts on the right reflect what happened when I pushed each model through a real agent pipeline with tool calls, not just a short benchmark prompt–and that is where the gap between “fast on paper” and “actually works” showed up.
The Liquid models posted impressive raw tok/s figures but broke down in practice with blank responses and formatting failures. The 35B sparse model was surprisingly fast under ik_llama.cpp but ate all available RAM and failed roughly 40% of the time.
The only setup I would actually leave running–and the best all-round result–was Qwen3.5 4B Q4_K_M on Vulkan:
| Metric | Value |
| --- | --- |
| Runtime | llama.cpp Vulkan |
| Prompt t/s | 8.4 |
| Generation t/s | 9.7 |
| Typical response time | 6-25s |
| RSS | ~5.3GB |
| Stability | 10/10 pass at -ub 8 |
Not desktop-GPU territory, but enough to move the board from “cute” to “useful”. More importantly, it was stable–it followed my coding assistant’s AGENTS.md prompt correctly, handled tool calls, and didn’t chew through all available memory.
The production configuration I eventually settled on was the Vulkan build of llama.cpp running Qwen3.5 4B Q4_K_M with -ub 8. Every flag has a story–especially -ub, the micro-batch size, which controls how many tokens llama.cpp tries to process per Vulkan dispatch.
It turns out that the Mali Vulkan backend had a descriptor-set exhaustion issue that needed patching upstream before it stopped crashing (yes, I spent a while debugging Vulkan…), and I ran a set of benchmarks specifically for that:
Vulkan micro-batch tuning sweep
Bigger batches should mean better GPU utilisation and faster prompt ingestion, but the Mali G720’s Vulkan driver has a hard limit on descriptor sets–exceed it and the backend either crashes or silently degrades.
The green bars are stable configurations, the orange ones are not–and the dashed box marks where I landed for production. At -ub 16, prompt speed collapsed because the driver was already struggling; at 64+ it fell over entirely.
The tuning sweep showed where the practical ceiling was rather than the theoretical one:
At -ub 2, the setup was stable but underwhelming: about 4.3 prompt tok/s and 9.7 generation tok/s.
At -ub 4, prompt speed improved to 5.9 tok/s with the same 9.7 generation rate.
At -ub 8, which is where I eventually landed, prompt speed climbed to 8.4 tok/s and generation stayed at 9.7 tok/s.
At -ub 16, the whole thing became temperamental and prompt throughput actually collapsed to around 2.0 tok/s.
At -ub 32, it could survive a test run, but not in a way that inspired confidence.
At 64+, it was simply crashy.
So the practical production setting was not some elegant theoretical optimum–it was simply the highest value that stopped the Vulkan backend from crashing. That, in a sentence, sums up a fair bit of the experience of using this board.
Where each runtime landed:
- llama.cpp on Vulkan was the best all-round practical setup, but only after patching and tuning.
- llama.cpp on CPU was useful as a baseline and for sanity checks, but too slow once model size started to climb.
- ik_llama.cpp on CPU turned out to be dramatically better for some 2-bit and sparse-ish workloads than I had expected, to the point where it occasionally made GPU offload look silly.
- [PowerInfer] remained interesting mostly in theory; in practice it was too awkward and too far behind the other options to matter.
GPU offload was not always the right answer. A lot of the marketing gravity around boards like this points you toward the GPU or NPU as the only interesting path, but once you start timing things, the answer is much more conditional.
Qwen3.5 35B-A3B IQ2_XXS was instructive. Under stock llama.cpp, far too slow. Under ik_llama.cpp, dramatically faster on CPU–to the point where it occasionally behaved like a real system rather than a cry for help. But it had a roughly 40% empty-response rate, consumed nearly all RAM and swap, and was slow enough end-to-end that I would only call it “working” in the same tone one might describe a vintage British car that has just completed a short journey without shedding visible parts.
For that model, the runtime comparison was actually rather stark:
- Upstream llama.cpp on pure CPU (-ngl 0) managed about 0.63 prompt tok/s and 1.07 generation tok/s, taking 76.67s end to end.
- Upstream llama.cpp with a token amount of offload (-ngl 8) was, if anything, slightly worse at 80.03s total.
- ik_llama.cpp on CPU was the surprise winner by a ridiculous margin: 16.24 prompt tok/s, 5.24 generation tok/s and 12.75s total.
- ik_llama.cpp with -ngl 8 promptly ruined that advantage and fell back to a miserable 71.33s total.
That is one of the more useful things I learned here: for some quantized models on this machine, CPU inference with the right runtime was not just competitive with GPU offload, it was much better.
The Liquid models were interesting for a different reason. LFM2 8B-A1B Q4_K_M managed roughly 46.7 tok/s prompt and ~32 tok/s generation on Vulkan–objectively impressive for the active parameter count–and LFM2.5 1.2B pushed generation to around 45 tok/s. On paper, these look like the hidden sweet spot. In practice both failed when pushed through the full agent pipeline: blank output, formatting failures, over-eager obedience to internal conventions. Useful to know, but not deployable.
For reference, the ranking I ended up with:
- Qwen3.5 4B Q4_K_M on llama.cpp Vulkan at 9.7 generation tok/s was the only setup that felt production-usable.
- Qwen3.5 35B-A3B IQ2_XXS on ik_llama.cpp CPU at roughly 5.3 generation tok/s was the most surprising result–impressive, but too flaky and memory-hungry to trust.
- LFM2 8B-A1B Q4_K_M on Vulkan at roughly 32 tok/s generation posted a great benchmark number but broke down in real agent use.
- LFM2.5 1.2B Q4_K_M on Vulkan at roughly 45 tok/s generation was quick but not dependable enough to matter.
- Qwen3.5 0.8B Q4_K_M on CPU at about 46 tok/s sounds good until you ask it to cope with a full agent prompt.
So yes, the board can run local models. It cannot run all of them well, and a distressing amount of the work lies in sorting out which bits of the stack are broken on any given day, but it was a much better experience than with Rockchip boards, and I intend to try out Gemma 4 and more recent models soon.
While the above was going on, I kept tabs on both thermals and memory, since I expected sustained GPU or inference workloads to need active airflow. But I had to deal with the fan first: the Orange Pi 6 Plus ships with a pretty beefy cooling solution that is, sadly, very much on the loud side.
And there’s no fan curve–all you get with the CIX kernel is a sysfs interface via cix-ec-fan with three modes:
- mute
- normal
- performance
The first lets the CPU reach fairly high temperatures under even moderate load, the last is unbearably loud, and normal ranges from moderately quiet to annoying–so for most of the testing I moved the board to my server closet.
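Switching modes is just a sysfs write. The node path varies, so the one below is a stand-in–locate the real attribute first:

```bash
# Find whatever node cix-ec-fan actually exposes (the path below is a stand-in)
find /sys -name '*fan*' 2>/dev/null | grep -i cix

# Then write one of the three modes to it
echo normal | sudo tee /sys/devices/platform/cix-ec-fan/mode   # illustrative path
```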
Again, the CIX P1 has 12 cores, but they are not equal–four low-power Cortex-A520 cores clocked at 1.8GHz and eight faster Cortex-A720 cores spread across four clusters at different peak speeds (2.2 to 2.6GHz). The kernel’s cpufreq subsystem treats each cluster independently, which means that it takes a bit of effort to max out all the cores:
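Nothing CIX-specific is required, though–the standard cpufreq sysfs interface exposes one policy per cluster, so pinning everything to the performance governor before a benchmark run is a short loop:

```bash
# One cpufreq policy per cluster – set the performance governor on all of them
for p in /sys/devices/system/cpu/cpufreq/policy*; do
  echo performance | sudo tee "$p/scaling_governor"
done
```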
With that sorted, the sbc-bench numbers were encouraging:
- sbc-bench reported no throttling during its run.
- The aggregate 7-Zip score landed around 33k, with the best single A720 core around 3874 and the A520 cluster way behind at about 1617–a nice reminder that workload placement matters on this SoC.
- Memory bandwidth on the A720 cores was respectable: libc memcpy in the 15-17 GB/s range, memset often 35-47 GB/s.
- The A520 results were dramatically lower across the board.
Memory Bandwidth
An interesting twist I lost some time exploring is that you can actually see per-cluster differences, which is new to me on ARM machines:
Memory bandwidth by CPU cluster
Blue bars are memcpy (read-then-write), red bars are memset (pure write). The A520 cluster is roughly half the bandwidth of the A720s across both. This matters for inference because memory access patterns land on whichever cores the scheduler picks, and a hot path pinned to the efficiency cluster is immediately noticeable.
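The practical consequence: if a workload is bandwidth-sensitive, pin it to the A720s explicitly. The core numbering below is an assumption–check the actual cluster layout on your unit first:

```bash
# Assumption: cpu0-3 are the A520 efficiency cores and cpu4-11 the A720s;
# verify with lscpu before pinning anything
lscpu -e=CPU,MAXMHZ
taskset -c 4-11 ./llama-server -m models/qwen3.5-4b-q4_k_m.gguf -t 8
```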
Thermals
On a quiescent system, sensor readings were good–most blocks hovered in the high twenties to low thirties Celsius:
- GPU_AVE: 29°C
- NPU: 30°C
- CPU_M1: 30°C
- CPU_B0: 32°C
- PCB_HOT: 33°C
The thermal logs during the benchmarks were more reassuring than I expected:
- idle and light-load readings sat mostly around 29-33°C across GPU, NPU and CPU blocks
- under the longer benchmark runs, board and package sensors generally rose into the mid-30s to about 40°C range, which is very good (but, as you’d expect, audibly noticeable from outside the closet)
- frequency traces showed the active cluster spending long stretches pinned at its target clocks before later dropping back, which looked much more like workload phase changes than panicked throttling
One benchmark artifact I largely ignored was the iozone run, because it was aimed at /tmp and therefore mostly measuring the memory-backed path rather than telling me anything meaningful about persistent storage.
Here’s a new chart that tries to capture thermals and frequency a little better than my old ones:
Thermal and frequency trace during sbc-bench run
The above covers the full sbc-bench session–roughly 40 minutes of mixed workloads.
The three shaded phases correspond to what was running at the time: a short iozone burst (memory-backed, not interesting), the main sbc-bench battery (OpenSSL, 7-Zip single and multi-threaded, tinymembench across all clusters), and the trailing cooldown.
The key thing to notice is that frequency stayed pinned at target clocks throughout the heavy phases and only dropped back during transitions–there was no thermal throttling, which is pretty amazing.
Temperature peaked around 43°C during the sustained multi-threaded 7-Zip run, which is well within spec for a board with active cooling. The idle baseline was around 29°C, and it settled back there fairly quickly once the load came off.
One thing I could not track was fan speed: the cix-ec-fan interface does not expose current RPM or duty cycle, so I had no way to correlate the thermal curve with what the fan was actually doing at each point. I could hear it spin up and settle, but I have no real data to overlay–I considered setting up a dB meter, but never got around to it.
All of the above covers the first week or so. But I’ve been running this board as an always-on machine since March 8, and by now have a month’s data on what it’s like to live with.
The board now hosts a piclaw instance (my personal assistant) that I’ve been using for development and model testing, since I realized LFM2-8B-A1B makes for a faster model to experiment with (31 t/s generation, 47 t/s prompt on Vulkan), even if it’s effectively not that “smart”.
Alongside the assistant work, I’ve been using the board for a real development project: porting the BasiliskII classic Mac emulator’s JIT to AArch64.
Over the past month that has meant a good deal of compilation, linking, automated experiment runs and testing. The JIT now executes real 68k ROM code with basic optimisations–interrupt delivery and display rendering are the active frontier, but it boots to a Mac OS desktop every now and then. The AArch64 JIT bugs I kept hitting (broken optflag inline asm bindings, various register allocation and flag bugs in codegen_arm64.cpp, VM_MAP_32BIT allocation failures, repeated attempts at fixing emulated 68k interrupt delivery) drove constant rebuilds–genuine low-level work that exercised the board’s toolchain and memory subsystem in ways no synthetic benchmark would, and the board has handled it great.
One thing that came up in every review of the CIX P1 I read–[Jeff Geerling’s Orion O6 writeup][jg] being the most prominent–is power draw, and I have a month’s worth of data to confirm that it runs higher than average: 15.5W over the month, rather than the roughly 13W I see quoted elsewhere:
Orange Pi 6 Plus wall power over 30 days
The flat zeros on the left are the setup period when I was reflashing and debugging offline. Once it came up as an always-on machine the power draw settled into a consistent daily pattern.
Orange Pi 6 Plus wall power over 7 days
Zooming into the last week at 15-minute resolution, the daily idle/load cycle is clearly visible–overnight the board drops to about 15-16W, and during the day it hovers around 20-27W depending on what I am doing. Compilation and inference bursts push it briefly toward 30W; the rest of the time it sits comfortably in the low twenties.
That said, the idle floor of 15-16W is noticeably higher than what I am used to from other SBCs. A Raspberry Pi 5 idles around 3-4W, an RK3588 board typically settles around 5-8W, and even a Mini PC with an N100 can idle below 10W.
The Orange Pi 6 Plus never really gets below 15W even with nothing running, and that appears to be a common trait of the CIX P1 reference design rather than anything specific to this board–the Radxa Orion O6 (same SoC) shows a very similar baseline in the reports I have seen.
Whether that is down to the memory controller, the 5GbE PHYs, the always-on fan or some combination of all three, I cannot say for certain. But it does mean the board is less attractive as a low-traffic always-on appliance than the raw compute-per-watt numbers might suggest. At 15W idle you are paying about 130 kWh/year just to keep it breathing, which is not terrible but is not nothing either.
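The arithmetic is just the idle floor multiplied out:

```bash
# 15W drawn continuously for a year, expressed in kWh
echo "15 * 24 * 365 / 1000" | bc -l   # ≈ 131.4
```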
Orange Pi 6 Plus current draw over 7 days
I checked, and current draw mirrors the power profile and stays well under 0.2A on the 230V circuit. The board’s power supply is not doing anything exotic.
Mains voltage on the office circuit over 7 days
The voltage trace is mostly here for completeness–Lisbon mains hovering around 230-232V with the usual overnight sag and daytime recovery. Nothing that would stress any reasonable power supply, and useful as a sanity check that the power readings are not being skewed by wild grid swings.
Reboots over the month: essentially none that weren’t my doing. The board has been stable in a way I did not expect from the early boot-chain experience.
After all of this, the Orange Pi 6 Plus fits a fairly specific set of roles:
- local inference experiments with carefully chosen models
- edge-side telemetry or monitoring
- compact Linux services that benefit from dual 5GbE
- infrastructure roles where you want something denser and lower-power than x86 but more capable than the usual toy SBC
I wouldn’t use it as a general-purpose desktop, and I wouldn’t trust the NPU story for anything LLM-related without more soak time. But I would keep it around for the sort of edge-AI and systems work I usually get drawn into–enough real capability to justify the effort, even if that effort is, right now, unreasonably high.
Even considering that I cut a lot of corners on the software side to get to a usable state, the hardware is still very much ahead of the software.
The GPU works, the NPU stack exists in some recognisable form, and local AI is not only possible but occasionally good. I like what this board can do, even if the power consumption and fan noise are higher than I would like for a board in this class. And compared to Rockchip’s offerings it’s a much more polished experience–the fact that I can get it to do useful work at all by myself, with my own OS image, is a testament to the progress ARM boards have made in the last couple of years.
The Wii is, indeed, a PowerPC machine, but getting Mac OS X to boot on it still requires a fair amount of kernel hacking–never mind the real-life altitude it was actually written at, although it does confirm that flight time can, indeed, be used productively.
This was a shorter work week partly due to the Easter weekend and partly because I book-ended it with a couple of days off in an attempt to restore personal sanity–only to catch a cold and remain stuck at home.
I got an Xteink X4 this week, and my first reaction was somewhere between amusement and nostalgia–it is absurdly small, feels a lot better made than I expected for the price, and the form factor harks back to the times when I was reading e-books on Palm PDAs and the original iPod Touch.
Work ate the week again. I’m exhausted, running on fumes, and daylight saving time stole an hour of sleep I could not afford–the biannual clock shuffle is one of those vestigial absurdities that nobody can be bothered to abolish, and I’m starting to take it personally.
This is absolutely hilarious. The infuriating window corner roundness in Tahoe has been bugging me too–and this is a brilliant take on the problem.
Instead of disabling SIP and patching system apps to remove the rounded corners (which is the usual approach), this simply forces a consistent corner radius across all third-party apps via a DYLD-injected dynamic library.
It’s a small thing, but inconsistency in UI chrome is the kind of detail that, once you notice it, you can never un-notice. The fact that Safari has different corner radii from other apps is inexcusable–and that’s before the Liquid Glass disaster made everything look like a Fisher-Price toy dipped in vaseline. I appreciate the “if you can’t beat them, at least make them all equally ugly” philosophy here.
The implementation is old-timey, straightforward Objective-C method swizzling on NSThemeFrame–nothing exotic, but the approach of skipping com.apple.* bundles and only touching third-party apps means you don’t need to mess with SIP at all. That alone makes it worth bookmarking.
Mar 25th 2026 · 1 min read · #ai #arm #chip #cpu #hardware #inference
The fact that ARM, whose entire business model has revolved around licensing CPU designs, has decided to actually go and build their own chips is remarkable by itself, but the design specs (and power envelope) are very interesting.
I have been keeping tabs on the dedicated inference hardware space ever since I got wind of Cerebras, and I like the idea of special-purpose/optimized CPU designs that would remove (or at least lessen) our dependency on NVIDIA (and GPUs in general) for running AI models–that is the way to make them cheaper, less power-hungry and, eventually, desktop-sized.
I do find it stupid to refer to this as an AGI CPU, though.
Mar 22nd 2026 · 2 min read · #agents #ai #balance #bun #dev #life #notes #piclaw #typescript #weekly #windows #work
This week’s update is going to be short, largely because work was hell and I ended up spending my Saturday evening poring through my meeting-notes backlog until 2AM today–I have a splitting headache to show for it.
Well, there went another work week. Slightly better (to a degree, although I got some discouraging news regarding a potential change), and another week where piclaw ate most of my evenings–it went from v1.3.0 to v1.3.16 in seven days, which is frankly absurd even by my standards.
I went to a local mall yesterday and chanced upon a couple of MacBook Neos on display at our local (monopolistic) Apple retailer1, and spent half an hour playing with them.
We’re three months into 2026, and coding agents have been a big part of my time since last year–things have definitely intensified, and one of my predictions has already panned out: agents are everywhere.
This was a frankly absurd week work-wise, with some pretty long days and a lot of late-night hacking on my projects (which is not exactly a new thing, but at least now I am asking piclaw to do it during the day time, which is a small improvement).
This is just lovely. If, like me, you grew up with the LEGO Space collection and loved the artwork on those pieces, and do 3D printing, this 10:1 scale recreation with a Mac mini and a 7 inch display will make your day.
I’m just a bit sad that the cabling is still very visible, but you can grab the files from Makerworld and give them your own spin.
Mar 4th 2026 · 1 min read · #a18 #apple #hardware #mac
I know a bunch of people will disagree, but this is the most relevant Mac announcement in years for two reasons:
1. It’s the first new Mac model in a while that isn’t just a spec bump, but rather a new product line with a clear target audience and a pretty aggressive price point (by Apple standards, that is).
2. It’s not running on an M-series chip, which is a bold move that could have significant implications for Apple’s product strategy and the broader Mac ecosystem.
The fact that it has “only” 8GB of RAM and 256GB of storage (which is OK if you think of it as a school machine) is going to be widely maligned. I would focus instead on the missed opportunity to make it even more portable by shipping a 12” display instead of 13” (probably some sort of golden-ratio thing), and on the unbelievable stinginess of shipping with a USB-C 2.0 port.
What? You couldn’t afford a USB-C 3.0 port? Really? I mean, I get that this is an entry-level machine, but come on, Apple.
Update: this seems to be a limitation of the A18 chipset’s I/O setup, from what I’m reading. There’s a lot of chip information out there now, including breakdowns of the new M5 lineup that are worth perusing as well.
That said, I would swap my iPad Pro for it in a flash (if it had a 12” display, that is). And that is probably exactly why it is that big.
Mar 1st 2026 · 5 min read · #agents #ai #dev #golang #notes #security #weekly
This is a great round-up, and it isn’t hard to spot the main themes–great hardware, and absolutely damning feedback on software quality across so many fronts (from the Liquid Glass Tsunami to people outright avoiding installing Tahoe) that I cannot help but agree (especially considering my current travails).
The best possible outcome from this is that Apple backtracks on the mess they created last year.
The most likely one is that they will simply carry on without acknowledging any of it publicly and discreetly patch the most critical issues–they are still making tons of cash on hardware and services, and software quality really hasn’t been a priority in half a decade.
At this point, I am even starting to question if they still have the talent (or the ability to retain it), especially considering that the people from most startups they’ve acquired over the years keep leaving. And I know for a fact that they stopped recruiting remotely a few years ago, which definitely hasn’t helped.
Feb 21st 2026 · 2 min read · #agents #ai #automation #home #notes #siri #weekly #wellness
This week I did something different: I took a wellness break from work and generally tried to tune out all the noise and messiness I have been experiencing there. It ate a chunk out of my PTO, but was mostly worth it.
I have no idea what is happening–I can’t even find any decent logs in Console.app–but it seems that the latest update to macOS Tahoe (26.3) has a serious bug.