Lessons on Building MCP Servers

I’ve been building servers for a while now–I wrote about last year, started out by creating umcp, and I’ve recently opened up an Office server that’s been battered by enough models against enough real documents that the patterns have settled.

I’m still not a fan of , but what follows is what I’ve learned about making tool chains actually work, condensed from swearing at logs rather than reading papers.

Disclaimer: This is a condensed version of CHAINING.md, which was itself stapled together from a bunch of notes in my vault. The full version has more code examples and a techniques inventory table that Opus just _had_ to add, and I’ve since beaten that out of it and restored most of the original text (minus typos).

The short version: the MCP servers I design do most of the work, while the model walks breadcrumbs.

Models don’t plan

They look at the conversation, scan the tool list, and grab whatever looks most probable. That’s it. There is no hidden planner. If you want chains that finish somewhere sensible, the server has to make the next call blindingly obvious at every step.

After a year or so, I have pared my approach down to these three things, roughly in order of how much pain they save you:

  • A small, named core set of verbs covering most intents
  • Output that suggests the next call
  • An addressing scheme that survives between calls–anchors, IDs, paths, anything but line numbers.

Core verbs beat surface area

The Office server exposes over 100 tools. Its get_instructions() funnels models toward eight:

…start with office_help, then prefer office_read, office_inspect, office_patch, office_table, office_template, office_audit, and word_insert_at_anchor. Treat specialised tools as fallback, diagnostic, legacy-compatibility, or expert tools when the core flow is insufficient.

That single sentence does an outsized amount of work–it tells the model there is a recommended path, that the path is verb-shaped (help -> read -> inspect -> patch -> audit), and that everything else is opt-in.

Without it, models cheerfully reach for word_parse_sow_template when office_read would do, and you end up with five-call detours for one-call jobs.

So I quickly realized that I needed to be ruthless about which tools to surface and when. The specialised ones still ship–hidden under a “for experts” framing, and a handful of legacy ones filtered out of tools/list entirely.

I also make liberal use of activation sets–the surface the model sees is small; the surface it can reach is large.

Naming is the chain

Again, models chain whatever is most likely (or rhymes), and the most effective tactic, for me, has been taking advantage of that.

All Word tools are word_*, all Excel excel_*, all unified office_*. A model that just called office_inspect will reach for office_patch next, not word_patch_with_track_changes, because the prefix matches.

This particular server also makes liberal use of annotations and a little intent-inferrer hack that reads those prefixes to assign readOnlyHint/destructiveHint automatically, so naming discipline turns into safety metadata for free.
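As an illustration, here is a minimal sketch of how such a prefix-driven inferrer could work; the verb lists and the infer_hints() helper are my own invention, not the server’s actual code:

```python
# Hypothetical sketch: derive MCP tool annotations from naming discipline.
# The verb lists and infer_hints() are made up for illustration.
READ_VERBS = ("read", "inspect", "audit", "help", "list", "get")
DESTRUCTIVE_VERBS = ("delete", "overwrite", "purge", "replace")

def infer_hints(tool_name: str) -> dict:
    """Map a prefixed tool name like 'office_patch' to safety annotations."""
    _, _, verb = tool_name.partition("_")  # strip the surface prefix (word_, excel_, office_)
    read_only = any(verb.startswith(v) for v in READ_VERBS)
    destructive = any(verb.startswith(v) for v in DESTRUCTIVE_VERBS)
    return {"readOnlyHint": read_only, "destructiveHint": destructive}
```

The payoff is that a new tool named with discipline gets sane safety metadata without anyone touching a registry.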

The prefix is the plan. The verb is the step. If you take one thing from this entire post, I’d suggest this notion…

Every response nominates the next call

This was the single change that made things behave on smaller models. The big ones will plan a chain from a tool list and a goal; the wee ones won’t–they grab the first plausible tool and stop.

The fix is stupid simple: every response ends with a breadcrumb dictionary of hints to follow. At minimum next_tools: [...], plus usage: "<exact call>" whenever the current tool produced a value the next one needs.

A model that can’t assemble arguments from a schema can copy the usage string verbatim. In fact, it will copy it, because that is still the most likely continuation as it fills in tokens, and those usage hints funnel the path the model takes.

Discovery as a tool, not documentation

Another thing I hit upon was that signposting needed to be curated.

Borrowing a page from intent mapping, office_help(goal=...) returns a structured record–recommended chain with rationale, fallbacks, diagnostic strings to watch for, one imperative next_step sentence. Not prose. Not a README, not skills. Data the model can act on without reading comprehension.

Called with no arguments, it returns the catalogue. Called with an unknown goal, it returns the supported set rather than an error, which turns a potential workflow-stopping failure into an actually useful catalogue.

Addressing: anchors, not offsets

The biggest reason simple models can’t follow chains is that they lose the thread between calls. “Insert a paragraph after the introduction” is fine in English but catastrophic if you expect the model to remember a byte offset across three tool calls.

In this particular scenario, I cheated: since most Office documents have headings (or cells, or internal structured paths inside OOXML), I used either verbatim text from the document or immovable coordinates as anchors (which was particularly hard in PowerPoint, by the way).

So besides suggestions and hints, return identifiers your tools will later accept as input. If you find yourself returning data the model has to describe back to you in natural language, you’ve made a chain that will misfire on a Tuesday afternoon when you’re not watching.
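A minimal sketch of the idea, with invented names; the point is that the anchor string is both output and future input:

```python
# Hypothetical sketch: return durable anchors instead of offsets, and accept
# the same identifiers back as input on the mutating side. Names are mine.
def map_anchors(headings: list[str]) -> list[dict]:
    """Turn document headings into identifiers that survive between calls."""
    return [
        {"anchor": f"h:{i}", "text": text}  # verbatim text plus a stable ID
        for i, text in enumerate(headings)
    ]

anchors = map_anchors(["Introduction", "Scope", "Deliverables"])
# A later call can then say insert_after(anchor="h:0") rather than
# "after the introduction" or a byte offset.
```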

Modes turn one tool into four

I started out with individual editing tools per format, which made automated testing easy but was incredibly wasteful of context. At some point I collapsed them to make initial discovery much simpler, and since I needed all outputs to be auditable, I tagged the available sub-operations by risk.

office_patch is the same code path whether you ask for dry_run, best_effort, safe, or strict. One tool, four modes, one entry in tools/list.

Discovery cost scales with tool count, not mode count. And dry_run -> safe -> strict is an escalation chain the model figures out on its own without being told.

If you have N tools that differ only in how cautious they are, collapse them. You’re wasting everyone’s context budget.
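A sketch of what collapsing those N tools into one mode enum might look like; office_patch here is a toy stand-in, not the real implementation:

```python
# Hypothetical sketch of one tool with a mode enum instead of four tools.
# The signature and behaviour are toy stand-ins for illustration.
MODES = ("dry_run", "best_effort", "safe", "strict")

def office_patch(targets: list[str], found: set[str], mode: str = "dry_run") -> dict:
    if mode not in MODES:
        # Reject unknown arguments, but cite the supported set.
        return {"status": "error", "supported_modes": list(MODES)}
    matched = [t for t in targets if t in found]
    unmatched = [t for t in targets if t not in found]
    if mode == "strict" and unmatched:
        # strict refuses to touch anything unless every target resolves.
        return {"status": "rejected", "unmatched_targets": unmatched}
    applied = [] if mode == "dry_run" else matched  # dry_run never mutates
    return {"status": "ok", "applied": applied, "unmatched_targets": unmatched}
```

One code path, one entry in tools/list, and the dry_run -> safe -> strict escalation falls out of the enum ordering.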

Diagnostics as the back-edge

Linear chains are easy. Real chains have loops, and loops only happen when the server invites the model back in. Every mutating tool returns a standard envelope with status, matched_targets, unmatched_targets, and next_tools.

The model then branches on a small subset of options “locally”, without needing to go over the entire context, and if you name the diagnostic fields with the exact strings the model will see again in your instructions, every call just reinforces them.

In this particular case, again, I cheated. I figured out that the models were starting to call tools at random because they couldn’t introspect the document well enough and ended up breaking files, so I always gave them at least one read-only tool, so the penalty for “I’m confused, let me look again” is one extra round-trip, not a destructive cock-up.
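A minimal sketch of such an envelope, assuming the field names above; the status values and the builder itself are my own:

```python
# Hypothetical sketch of the standard mutation envelope; the field names
# follow the post, the builder and status values are made up.
def envelope(matched: list[str], unmatched: list[str]) -> dict:
    status = "ok" if not unmatched else ("partial" if matched else "failed")
    return {
        "status": status,
        "matched_targets": matched,
        "unmatched_targets": unmatched,
        # The back-edge: confusion routes to a read-only tool, never a mutator.
        "next_tools": ["office_audit"] if not unmatched else ["office_inspect"],
    }
```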

My MCP Design Checklist

  • Pick five to ten core verbs and name them in get_instructions() or your local equivalent
  • Use consistent prefixes by surface
  • Provide a discovery tool that returns recommendations as data, not prose
  • Make the discovery tool browseable–no-arg returns the catalogue, unknown input returns the supported set
  • Embed forward breadcrumbs in every tool response
  • Provide a map/anchors tool so addresses survive between calls
  • Give every mutating tool a mode enum including dry_run
  • Return named diagnostic fields and cite the recovery tools
  • Standardise the mutation envelope. If one tool changes something in a specific way, make sure the others are consistent (arguments, semantics, etc.)
  • Reject unknown arguments strictly (this is much easier in some runtimes than others)
  • Provide an audit tool so the model has somewhere to land
  • Cache anything the recovery loop calls more than once, because, well, it will get called dozens of times even if you carefully curate paths through your tooling with hints.
  • Make repeat calls safe–models retry, and they should be allowed to (idempotence is hard, and often impossible).

Do the boring work in the schema and the descriptions. The model will happily do the clever bit if you stop making it guess.

App Notes: Web App Viewer

I got annoyed enough with Safari Web Apps to write my own replacement.

It took about five minutes to get the core working, and maybe another hour of incremental tweaks spread over a day or so. That ratio–five minutes for the thing, an hour for the polish–tells you something about the state of the problem it solves.

Web App Viewer is a tiny native macOS shell that opens a URL in a WebKit window with no browser chrome. No address bar, no tab strip, no toolbar, no Safari-style fullscreen frame. One web page, one native window, as little visible UI as macOS will reasonably allow once a page is loaded (it hides traffic lights and scrollbars when the mouse is away).

You can drop URLs onto its Dock icon, send them from the Share sheet, open a .webloc file, or use a custom webappviewer:// URL scheme.

This is it. This is the whole app

Why

Safari’s “Add to Dock” Web Apps have been around for a while now, and the idea is sound–pin a website as a standalone app, give it its own icon, get it out of the browser tab pile. The execution, though, is maddening, and it has always been broken across the board, but on macOS it is horrendous.

The resulting windows still carry persistent browser chrome I can’t hide, and the whole flow of creating one (find the menu item, wait, hope it picks up the right icon, hope it doesn’t break on the next Safari update) feels like an afterthought rather than a feature anyone at Apple actually uses.

This is one of dozens of papercuts that accumulate into a kind of low-grade daily friction, and I have a growing list of them that I intend to write about at some point. But this one was fixable before dinner, so I fixed it.

How

I fired up Codex with the kind of detailed mini-spec I described in –what the window should look like, how URLs should be accepted, what the drag behaviour should be–and told it to reuse the window styles and approach from Daisy and the USB Video Viewer (another small project I built to test SBCs via USB capture without adding more monitors to an already cluttered desk).

Disclosure: OpenAI provided me with a 6-month trial of Codex for my Open Source work (which has also helped me fully ), but you could probably do this with a brick-brained open-source local model (even if is a mess and under-represented in LLM training sets, which is a problem even with SOTA models).

The core is just WKWebView in a native window with chrome that fades in on hover. The Share Extension, the macOS Service, and the URL scheme were bits I tacked on after, and all the scaffolding (Makefile, signing, etc.) was AI-generated, because there is absolutely no reason to do that by hand in 2026.

There were, however, two things that were a right pain:

  • Adding an invisible drag strip needed a nudge from memory, but Codex was useless there. I knew how I’d have done it in and just guided it through the equivalent until it worked. Everything else was straightforward.
  • Web manifest icon detection in was… oh boy. The fact that still does not have a sane async model (at least not the one I would expect) meant it would poke at the page and web manifests but fail to wait for the bigger icons to load, and that took me a few tries to get right.

But it was totally worth it. I now have six instances of this running, and I found (and fixed) subtle bugs when trying to create each one of them, so I’m pretty much calling it “done” other than some manual UX tweaks I want to do to the menus and dialogs.

What I Use It For

The original motivation was wrapping Piclaw’s web UI as a frameless native-feeling app, and that works exactly as I wanted. But the nicer surprise has been dropping other self-hosted URLs into it–Grafana dashboards, consoles, internal tools–and getting a clean, chromeless window for each. It turns out that removing the browser frame makes everything feel lighter.

And I am casting one of them to an Android device via AirPlay (more on that later when I get that one stable), and the lack of browser chrome makes it… just great. Zero wasted pixels, no distractions, just the content.

But the way it really improves on what Apple didn’t do for me is usability and practicality. Drop in a URL, check it out, then hit Cmd+I and a new copy is installed to my ~/Applications folder, ready to launch from Spotlight, without cluttering the Dock or trying to figure out where they hid it in the sharing pane.

Bliss.

The Uncomfortable Bit

I was a happy user years ago, and I know there are paid apps that do roughly this. But the uncomfortable truth for Apple indie developers in the age of is that there is zero reason to pay for any of them when I can build a tailored version for my own needs this fast.

That’s not a criticism of those apps. It’s a warning sign about what -assisted development does to the economics of small, focused utilities–and, in the context of Mac apps, which were always a tiny cottage industry, is going to be worrisome for many.

But the real lesson here, I think, should be about what Apple ought to have just built into macOS instead of shipping the half-baked Web App support that provoked all of this in the first place.

I will have more words on that.

Notes for April 20-26

Amidst the chaos brought on by my usual seasonal allergies, work turned out to be calmer than usual–the usual industry churn and constant rumors of layoffs have made “calmer” a relative term, though–so most of my evenings went to projects.

I also re-read Project Hail Mary–partly because I needed something absorbing that wasn’t a screen, and partly because Weir is one of the few authors who makes engineering problem-solving feel like a page-turner. It holds up, and I can’t wait to see the movie.

Mac Retro-Hackery

Rocketing away

The PPC detour is, surprisingly, working much better than the 68k JIT, and it has already paid off: my naïve take on memory layouts meant I hit one of the banes of modern emulation very fast–ASLR on aarch64 Linux was randomising addresses that the JIT needed to stay fixed–but now I understand a lot of the issues I was having with the 68k version.

The fix for now was to have the binary disable its own ASLR at startup via personality(ADDR_NO_RANDOMIZE) and re-exec, which is ugly but works and is the sort of thing nobody documents. And after doing that on the BasiliskII side as well, a lot of issues went away.

Both JITs now have proper Makefile workflows with tmux targets, which means I can build, test, run and kill either emulator from a single command–which I’ve been doing with my iPad, from the comfort of my couch.

As to the , it is not assembled, because the resistive touch screens I have are borderline unusable for precise tapping (so good thing I only 3D printed a test fit with old filament). I ordered a couple of larger capacitive ones and a bunch of other ESP32 stuff, so I expect to come back to that next weekend.

PVE microVMs

So tiny

My little hack has been working great–although I had to fix a few things after upgrading one of my nodes (regression testing is the bane of my existence these days), pve-microvm now supports all the operating systems I care about, a few I had never considered using, and other than the fact that I am creatively patching ’s interface, it has been pretty stable, which was unexpected.

I got piclaw to hack in a custom OCI dialog to replace the Create VM wizard, an xterm.js console tab for microVMs (noVNC makes zero sense for serial-only machines), and a bunch of other features.

And of course it broke when shipped a patch release, but since I have a as a sacrificial node I can contain the blast radius of any upgrades. Mostly.

But right now I’m converting most of my LXCs to microVMs, and it’s been a blast–the speed is fantastic, and the fact that I can run in a microVM is just icing on the cake.

The Churning piclaw

Like I wrote above, regressions are the bane of my existence, and I am getting really annoyed at because despite all the nice tooling, it can still pass most linting and “compiling” and fail spectacularly at runtime. And since the upstream packages have been undergoing considerable churn and breaking changes, a lot of piclaw broke in various ways, and experimenting with different models really doesn’t help.

Even as I’m typing this, I am (yet) again waiting for an OpenAI model to audit some UI breakage that Anthropic’s models caused, because they sometimes just drop chunks of the code when editing it, but I am getting really annoyed at fixing things three times in a row…

And yet, the flexibility of and its extension model is pretty amazing–I decided to adopt it wholesale and have started breaking off pieces of piclaw into a piclaw-addons repository, into which I can throw all the mad experiments I want–for instance, yesterday I hacked together a “cheapskate” addon (a cost-conscious model router) that lets you use a bunch of free tiers across various providers, something that would be impossible to do in most harnesses…

Gi

Yes, another cute gopher

And yet, I think it’s time to have a backup. So I created gi, a harness inspired by and designed for extensibility, but where all the extensions are externalized to the point where they can’t (hopefully) break the core, and where I want to try to rewind the clock to the simpler times of LISP machines–take your workspace, copy a state dump to another machine, and just carry on.

So I designed it as a single binary that can pack everything into a single database, and that binary embeds both a dialect (via Joker) and a engine that can hook into the state machine–so extensions can be written in either and live inside the SQLite blob alongside everything else.

And in true belt and suspenders style, I’m going to pack both a TUI and a web UI in the same binary.

But, most importantly, I’m taking a completely different approach at dependencies and testing–starting with bringing together most of my previous stuff in various forms, and writing a functional test suite and not just a code one. Still missing tool execution, keychain, workspace indexing–but it’s at the point where I can sit down and have a conversation with it.

9front on ARM

9front literally "on" ARM

Yeah, I know. Another project. But I realized that I needed to remind myself of how to bootstrap a kernel on bare metal before I even try to get Haiku running outside QEMU, so I started poking at porting 9front to one of my ARM SBCs.

’s ideas about distributed computing and per-process namespaces have been rattling around in my head since the 90s, but more to the point it is a very simple system, and shifts the bulk of the effort into getting uboot and hardware bootstrapping to work instead of trying to figure out everything at once.

As a fun detour from that, I ended up creating a simple USB Video viewer to pull up video output from a USB capture card to watch things crash spectacularly.

Keeping an eye on things

Yet Another Website

While I was at it, I finally got around to refreshing rcarmo.github.io–my open source landing page, which had been accumulating a decade of pixel dust while I was off doing other things.

It’s nothing fancy: a single page that groups some of my repositories by topic (AI agents, cloud, hardware, infrastructure, libraries, macOS, terminal stuff) with one-line descriptions for each, and acts as a sane front door for anyone who stumbles onto my GitHub profile and doesn’t fancy scrolling through 380-something repos.

rcarmo.github.io project landing page
The refreshed landing page, sorted by topic and (slightly) opinionated about what's worth highlighting.

The rest of the week’s GitHub activity was the usual scattering: a small go-ai update (the unified LLM client I’m using inside gi), some ground-init and mdnsbridge cleanups, a zmk-config-totem tweak for the split keyboard I’ve been slowly getting used to, and a couple of apfelstrudel commits–because if I’m going to break my brain on emulators all week, I might as well let an AI agent help me make some weird music every now and then.

Site Cleanups

Flint, my “very stable” agent, kept earning its keep on the side: I finally split out and as their own subsections (consolidating entries that had been awkwardly squatting in the language tables) and tucked away a couple of odds and ends–notably and a –into the relevant pages.

None of this is glamorous, but the resource pages have been drifting for a while, and having an agent do the boring sorting (and ask me sensible questions about edge cases) is exactly the right way to deal with chores I’ve been putting off for years.

And yeah, I know it’s too much, and that I’m spreading myself too thin.

Notes for April 13-19

This was a pretty decent week despite my allergies having kicked in to a point where I have constant headaches, but at least I had quite a bit of fun with my projects.

“Now I Have the Full Picture”

Yeah, I find Opus’s sycophancy and its traits obnoxious, but this time it’s right–I was trying to get to work with my particular flavor of Cheap Yellow Display, and I was having so much trouble matching screen corruption and flipped colors (and bits) to the display code that, after I finally managed to get at least a stable (if broken) boot picture on screen, I thought to myself… why not let piclaw sort this out for me?

So I plugged the CYD and a Logitech Brio 4K into the , and… I got the most surreal ESP32 closed loop debugging setup going:

I ended up moving the camera farther away to get better focus

Five minutes later, I had all the display bugs fixed except for touch input, which was still rotated–a fair bargain.

Proxmox microVMs

I was looking at smolvm and going through my notes on Firecracker and other sandboxing mechanisms, when I realized I had come across microVMs a few months ago when looking at agent sandboxing mechanisms and the old QEMU JIT.

Now, I actually think that microVMs are way overrated, but I was literally in the shower when I realized that, for me (since I have zero interest in running microVMs on my laptop), would be the perfect way to manage them (also since I have zero interest in running another exotic hypervisor).

So I did a little spelunking, and… It worked. Badly, but it worked. I took my terminal session, added a few notes, and asked piclaw to investigate if it was possible to patch the UI–and guess what, it was a pretty simple patch–I got the agent to flesh out a Debian package, turn my hacks into a CI/CD workflow that builds and packs a suitable kernel into the .deb, and now I have a nice VM template, decent integration of microVMs into the web UI, the works.

pve-microvm patches qemu-server to add the machine type, ships a template workflow that pulls OCI container images and converts them to PVE disk images, and redirects serial to the web console so you get a proper terminal in the UI. There’s also init support and a balloon device (as well as qemu-agent support), but the OCI images are so barebones that I haven’t yet sorted out all of the ergonomics about using them to automatically deploy stuff.

Proxmox microVM integration in action

This looks like a very low impact addition to so far and I would love to upstream it, but I’m not holding my breath since maintainers aren’t trivial to reach and the old-style “join our developer mailing-list” approach is… just too effort-intensive as I have so much stuff to do these days.

We Now Do PowerPC JITs Too

The macemu work took an unexpected turn–I shifted from (68k) to SheepShaver (PowerPC), and things moved a lot faster than I expected. To make a long story short, it was Friday and I idly asked piclaw to do a comparative source analysis between both emulators, hoping for something that I’d missed in the quagmire of ROM patches I’ve been wading through.

Turns out that it told me there was no real JIT support, and it did a comparative analysis of opcode coverage, ending with “there are, however, much less opcodes to translate in the RISC architecture. Do you want me to set up a quick opcode test harness for PPC”?

Uh… yeah? By Friday evening, every opcode family except AltiVec had native ARM64 codegen and the emulator was booting to the Welcome to Macintosh screen (and crashing, but this was comparatively 100x faster than the 68k work), and yesterday afternoon, after some back and forth about creating a second harness (effectively a headless Mac with no hardware, to skip problematic ROM regions), I got it to do AltiVec via NEON (which the supports–I’ve yet to devise a fallback path for older chips).

The process was straightforward: point piclaw at an opcode group, have it implement the native codegen, run the harness, iterate on whatever broke, then once an opcode group was “done”, smoke test it on the headless Mac harness. The AltiVec stuff was the most satisfying part–mapping NEON intrinsics to AltiVec semantics is tedious but tractable, exactly the kind of work where AI earns its keep and the harness catches every subtle difference.

SheepShaver now boots to a desktop with VNC input working. There’s still a long way to go because I have done zero hardware testing (it’s got no audio, only VNC input and, more importantly, no network or graphics acceleration), but a from-scratch PPC JIT on ARM64 booting to a desktop in around 24h is… not nothing.

I wish I could finish the 68k JIT, though; the register allocation strategy I guided the agent towards and the weird ROM patches BasiliskII does just don’t get along.

Lounge About Agentic Computing

The fun part for me has been that a lot of this has been done on an iPad on my couch, using the Apple Pencil or iOS voice typing to scratch out instructions. After an outing yesterday, I had the idea to just swipe between agents, and… oh boy.

The idea is simple–swipe left or right on the timeline to switch between agents–but making it feel right on an iOS PWA required far too many weird CSS and JS hacks, and the one real problem I’m having is that AI, no matter how many times you specify in painful detail what you want and how many actual code samples you give it, is still too prone to breaking very intricate UX–I’m getting really tired of weird regressions every time I add another feature.

I’m Not In Thrall To Anthropic, But I Can Help

I’m not an Anthropic customer (besides GitHub Copilot’s model selection, which now also includes the new, lobotomized Opus 4.7, I have a personal Codex subscription for OSS work), but so many people seem to have been caught by their ban on third-party coding harnesses that I decided to dust off Vibes, start porting it to (which I already had in my backlog) and turning it into an ACP-only wrapper so that people can use Claude with a nice web UI.

I think it’s the least I can do, and also gives me a decent web UI to drop in for my own work when I absolutely have to use Copilot.

Haiku on ARM64

And, of course, since I have far too many projects already, I decided to see if I could get Haiku to boot on ARM64. I don’t particularly care about doing for salesy startupy business stuff, but I love using it to build things I think should exist, and I have quite a few more I’d like to make happen…

Notes for April 6-12

Thanks to a bit of spillover from Easter break, this was a calmer, more satisfying week where I could actually get stuff done and even have a bit of fun.

My idea of fun, apparently, is to do 3D visualizations in piclaw

Getting Organized

Now that piclaw is in cruise mode, I’ve started focusing on actually using it.

So I created an instance called Flint, which manages not only my vault but also all of my personal pursuits and most of my homelab: I gave it the API tokens for my cluster and , and over the past week it’s been busy:

  • It re-tagged most of my notes and drafts (as well as adding reference URLs for ongoing drafts), quizzing me on what to do with specific notes as it went
  • It rebuilt and redeployed my GPU sandbox (which I broke last week): recreated the VM, mounted the Ubuntu ISO, prompted me to run the installer, and installed the latest NVIDIA drivers, nvidia-docker and a baseline set of utilities.
  • I then asked it to look at the stacks in my gitea instance, my notes, and what needed to be set up, and it installed the agent and brand new versions of the stacks with tweaked network and volume settings, updated my notes, and upgraded the pinned image versions (troubleshooting as it went).
  • It developed and published an OPDS server and an EPUB read-later service so I can fetch interesting web pages and read them later on the XteInk X4, including monitoring the CI pipeline and redeploying the containers
  • It audited and set up centralized stats collection in , which I had been meaning to do for ages (and I intend to have it set up Telegraf on other machines to collect metrics).

So far, Flint is a resounding success (it’s using GPT-5.4, a fairly sensible and stable model), but it doesn’t just do notetaking and operations.

Site Hackery

Flint has also become quite useful to help me tidy up my workflow—I was already using a piclaw instance to convert ancient and raw HTML posts into in batches, but there are a few things that have been nagging at me for years and that I can finally make significant progress on:

  • Adding links to my resource pages
  • Drafting link blog entries
  • Streamlining static site builds

I’ve been doing the first two for ages, but they both relied on adding bits of text to Reminders that were then post-processed and added to git using either the CLI or WorkingCopy. That worked OK for a while, but my iPad mini’s increasing slowness has made them quite frustrating, especially since I tend to do that kind of quick posting over breakfast and it was taking up too much time.

As it happens, GitHub has a REST API for Git Trees, and what that means in practice is that I can update a JSON changeset with these minor changes, let it accumulate over breakfast, and then apply them in batches–or, rather, have Flint do that, with all the guidance and steps in a SKILL.md file.
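A sketch of how such a batched changeset could map onto the Git Trees API; the endpoints mentioned are GitHub’s real ones, but the changeset shape and the helper are my assumptions, not the actual SKILL.md workflow:

```python
# Hypothetical sketch: turn an accumulated changeset into the payload for
# GitHub's Git Data API (POST /repos/{owner}/{repo}/git/trees). The
# changeset shape (path -> contents) is my assumption.
def build_tree_payload(changeset: dict[str, str]) -> list[dict]:
    """Map path -> file contents into Git tree entries for one batched commit."""
    return [
        {
            "path": path,
            "mode": "100644",    # regular file
            "type": "blob",
            "content": content,  # the trees endpoint can create blobs inline
        }
        for path, content in sorted(changeset.items())
    ]

tree = build_tree_payload({"links/reading.md": "- [Example](https://example.com)\n"})
# One POST to /git/trees with {"base_tree": <head tree sha>, "tree": tree},
# one POST to /git/commits, one PATCH to /git/refs/heads/<branch>: no local clone.
```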

So my new breakfast workflow is to just send links to Flint using the iOS sharing pane or a bookmarklet (still experimenting with both), have it create a JSON changeset for links, and occasionally ask it to screenshot a page and create a blank Markdown document for linkblog posts. That is pre-filled with a title, likely tags and the appropriate image reference, and I just pop open the built-in editor tab in piclaw, finish the post and ask it to add the files to the changeset and post them via the API.

So far, it’s been going swimmingly: zero git fetches/commits/pushes, all handled server side, and very little friction–and it works on my iPad mini, albeit still slowly.

A New Hope

Another thing I’ve been working on is porting the site builder to for both speed and maintainability—the current codebase has some 20-year old hangovers that I wanted to get rid of, and some kind of reset has been long overdue, so I have been slowly poking at this for the past few months.

As it happens, the overall indexing and rendering process was pretty trivial—the real challenge has been to make sure that it looks exactly the same, especially given that my engine has some pretty specific Wiki-linking rules and I’ve accumulated a bunch of rendering helpers and custom plugins over the years.

Plus everything related to HTML rendering has changed: parsing, link resolution, templating, the works. And that’s enough to juggle already, so I don’t want to change the front-end design at all (yet).

I decided to be ambitious and aim for full rendering parity. So what did my little army of AI helpers do?

It converged on doing visual diffs of randomly sampled pages: take a locally rendered version, fetch the public page, and generate a difference image it can rate as “close” or “broken” just by counting the ratio of red pixels:

This is both brilliant and scary at the same time

The process is greatly streamlined: sample 100 pages out of the nearly 10,000 we have now, render, batch compare, show me the worst ones, and then discuss and generalize the fixes (which is the only part the LLM is actively involved in). I could probably use autoresearch to automate this, but some of the fixes have to do with legacy rendering logic that no AI could ever figure out.
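The scoring step is simple enough to sketch in a few lines of stdlib Python. The tolerance value and the list-of-RGB-tuples representation are my assumptions, not the actual implementation:

```python
# Sketch of the "ratio of red pixels" rating: a pixel counts as broken
# (red in the diff image) when any channel differs by more than `tol`.
# Both inputs are equal-length lists of (r, g, b) tuples.
def broken_ratio(a, b, tol=16):
    assert len(a) == len(b), "renders must be the same size"
    red = sum(
        1 for p, q in zip(a, b)
        if any(abs(x - y) > tol for x, y in zip(p, q))
    )
    return red / len(a)

same = [(255, 255, 255)] * 99 + [(0, 0, 0)]
drifted = [(255, 255, 255)] * 99 + [(200, 0, 0)]
print(broken_ratio(same, drifted))  # -> 0.01
```

The point of collapsing everything to a single scalar is that the LLM (or a plain threshold) can triage hundreds of pages and only surface the worst ones for discussion.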

Still, this has converged very quickly to minor typography and spacing differences, and once I’m happy with the engine I’ll start looking at optimizing the actual blob uploading part–which I aim to standardize on rclone to remove my current dependency on storage accounts, and to greatly speed up with deltas.

Remember, AIs Are Still Dumb

It turns out that if you tell an AI that empty catch blocks are forbidden, the thing will just… go and add comments inside them, instead of doing something useful like a warning log message…

I’m now doing another code audit pass over the entire piclaw codebase, and this kind of mechanical fix is trivial to set up and do reliably with autoresearch:

An autoresearch session doing a code audit pass

Now to see if I can get some reading and 3D printing done as well, since the whole point of using AI in the first place was to have more free time… right?

Apple, Still

I have been having feelings about lately. This blog may have drifted a fair way from its original focus on , but I am still, first and foremost, an Apple user – just not an exclusively Apple user, and perhaps not even a particularly obedient one anymore, since I use both Windows and every day and have grown used to judging platforms by what they let me get done rather than by whatever story they are trying to tell about themselves.

That makes the current moment a little awkward. Apple is still extraordinarily good at making hardware I want to pick up and use, and still more coherent than most of the industry in the broad strokes, but it also feels increasingly prone to sanding off the wrong edges, reinventing the UX wheel, and constantly adding paper cuts to their software.

The iPhone

The is probably the clearest example of that tension. It is still the phone I would rather carry, and the one whose hardware I trust most, but has become steadily more fussy without becoming proportionally more capable.

A lot of it has been the constant UI friction and pointless balkanization of features like screen mirroring, which I would very much like to have – I see zero point in using Messages on my Mac or futzing around with Handoff and AirDrop when I could just, you know, pull up a window into my phone and type stuff in.

And I know Apple could indeed engineer a way to make those features DMA-compliant if it really wanted to – I suppose breaking the user experience across the board with Liquid Glass had enough priority to preempt allocating engineering resources to, you know, proper features.

Sharing things, moving files around, background activity, browser limitations, the endless little inconsistencies in system UI and the ungainly bloat in Settings – that friction accumulates. None of it is fatal on its own, but the aggregate effect is that the platform feels far less light than it used to, even while Apple keeps insisting that everything is becoming more seamless.

Where The Cracks Show

I’m going to say it outright: I found insulting. Not just visually, but also because of what it says about priorities: instead of fixing glaring gaps in things like automation ( is definitely not in good health, and is pretty much dead)–gaps that could actually have put Apple at the forefront of automation and AI (never mind the miserable failures of Siri and Apple Intelligence)–someone at Apple decided that breaking visual affordances took priority over stability and over providing consistent application intents and hooks across the board.

Even then, is in a better place than , but mostly because it still retains enough of its older character to be workable. Remember, I can just .

There is still a proper filesystem, there is still a shell (even if Apple seems intent on breaking the userland in very small increments across releases), there are still enough escape hatches to route around bad decisions, and Apple Silicon has papered over a remarkable amount of software bloat simply by being absurdly fast and power-efficient.

But the cracks are visible there too. System Settings remains a mess, cross-platform application quality keeps declining, and the old Mac assumption – that a user might actually want to understand how their machine works – seems to matter less every year. Meanwhile keeps borrowing bits of the Mac’s vocabulary without acquiring the Mac’s actual flexibility, which leaves both platforms feeling oddly misaligned.

The iPad

The remains the device I most want to use more than I actually do. I may pick one up every morning to read the news and get drafts started, but . The hardware is excellent, the battery life is still absurd, the pencil is useful, and for reading, sketching, note-taking and casual browsing it remains hard to beat. Fine.

But every time I try to push it into being a serious general-purpose computer, it reminds me that Apple still has not decided what it wants the to be. It can approximate a laptop for stretches at a time – and sometimes very convincingly – but the moment you need proper peripheral support, predictable file handling or sustained tool switching, the abstraction turns into safety glass – and I’m back to my long-held opinion that the only good iPad is the iPad mini.

That’s what I intend to upgrade this year, even if Apple comes out with a decent foldable (and, by the way, I really like the “leaked” form factor, because phones have become stupidly tall and unwieldy).

Fedora, Oddly Enough

And this is where comes in, because it has become my most useful point of comparison. on the desktop is still Linux on the desktop – gloriously inconsistent, occasionally infuriating, and always willing to expose its plumbing at the worst possible moment – but my experience is very conclusive: has reached a point where, for a lot of everyday work, it is simply easier to reason about than either macOS or iOS.

That does not make it better in every respect. It is not. But it does mean that a lot of the breakage in Apple software now has a reference point, and even though I was always a UNIX user and deeply technical, the creature comforts that Linux now provides give me a lot more confidence than Apple’s software does.

If Qualcomm weren’t so obtuse about supporting only Windows and ARM laptops were more open, things would be very interesting indeed.

Still an Apple User

I still like the hardware, still prefer the overall ecosystem in a number of places, and still find myself evaluating a lot of the rest of the industry by standards set years ago.

But I also think it is getting harder to ignore how much of the original appeal has been traded away due to sheer mismanagement of software QA and Apple’s refusal to acknowledge the gaps across , core applications, and a consistent user experience.

Come on, Tim, get your people in line.

The Orange Pi 6 Plus

This was a long one–I spent a fair bit of time with the Orange Pi 6 Plus over the past few months, and what I expected to be a quick look at another fast ARM board turned into one of those test runs where the hardware looks promising on paper, the software is wonky in exactly the wrong places, and you end up diving far more into boot chains, vendor GPU blobs and inference runtimes than you ever intended.

The Orange Pi 6+ on a corner of my desk

Unlike most of the ARM boards I’ve reviewed so far, this one is not an RK3588 board: the Orange Pi 6 Plus uses the CIX P1 (CD8180/CD8160), with 12 CPU cores, a Mali G720 GPU, a dedicated NPU and a wild set of specs for the form factor. Boards like this promise everything at once–homelab, edge AI, dual 5GbE, low power–but they only matter if the software gets out of the way.

Disclaimer: Orange Pi supplied me with a 6 Plus free of charge, and, as usual, this article follows my .

And, for a change, I decided to make sure the software did exactly that, and made it my concern from the start–i.e., I built my own OS images for it (a fork of orangepi-build) and went in a bit deeper than usual, spending around two months taking notes, benchmark logs and even Graphite telemetry as I went along.

Hardware

The Orange Pi 6 Plus, top view
The Orange Pi 6 Plus board (image: Orange Pi)

One of the reasons I wanted to test this board is that the SoC is the CIX P1, which Orange Pi bills as a 12-core part with a combined 45 TOPS across CPU, GPU and NPU. The machine I tested came with:

  • CIX P1 (CD8180/CD8160), 4×Cortex-A520 plus 8×Cortex-A720 cores
  • 16GiB of RAM (roughly 14GiB visible to Linux)
  • dual Realtek RTL8126 5GbE
  • Realtek RTL8852BE Wi-Fi and Bluetooth card
  • Mali G720 / Immortalis-class GPU
  • A three-core Zhouyi NPU

And if you’ve been paying attention to all my homelab testing, those two 5GbE ports alone make this more interesting than most hobbyist SBCs. But, of course, there is a lot more to expandability than that:

Orange Pi 6 Plus annotated board layout
Annotated board layout showing ports, headers and key components (image: Orange Pi)

Hardware Info

The CPU is interesting in itself–the fastest A720 cluster reaches about 2.6GHz while the A520s top out around 1.8GHz, so as with many other big.LITTLE ARM designs you get asymmetric clusters rather than a uniform twelve-core machine:

Architecture:                         aarch64
CPU op-mode(s):                       64-bit
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            ARM
Model name:                           Cortex-A520
Core(s) per socket:                   4
CPU max MHz:                          1799.9980
CPU min MHz:                          799.9990
...
Model name:                           Cortex-A720
Core(s) per socket:                   8
CPU max MHz:                          2600.1980
CPU min MHz:                          799.8400

lspci is a bit more revealing, especially because you get to see where the dual 5GbE setup and Wi-Fi controller are placed–each seems to get its own PCI bridge:

0000:60:00.0 PCI bridge: CIX Technology Group Co., Ltd. CIX P1 CD8180 PCI Express Root Port
0000:61:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8126 5GbE Controller
0001:30:00.0 PCI bridge: CIX Technology Group Co., Ltd. CIX P1 CD8180 PCI Express Root Port
0001:31:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8126 5GbE Controller
0002:00:00.0 PCI bridge: CIX Technology Group Co., Ltd. CIX P1 CD8180 PCI Express Root Port
0002:01:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE PCIe 802.11ax Wireless Network Controller

In the same way, the USB bus is entirely ordinary (this is with it plugged into one of my KVMs):

Bus 007 Device 002: ID 05e3:0610 Genesys Logic, Inc. Hub
Bus 007 Device 005: ID 04d9:0006 Holtek Semiconductor, Inc. Wired Keyboard (78/79 key) [RPI Wired Keyboard 5]
Bus 007 Device 006: ID 093a:2510 Pixart Imaging, Inc. Optical Mouse
Bus 011 Device 002: ID 05e3:0761 Genesys Logic, Inc. Genesys Mass Storage Device
Bus 012 Device 002: ID 0bda:b85b Realtek Semiconductor Corp. Bluetooth Radio

Nothing exotic, which I rather like. And, by the way, the board ships with Cix Technology Group UEFI, version 1.3, so setting up boot devices and managing (very) basic settings was trivial.

Building the Image

This is where I took a very large detour from my usual approach: I decided early on that I wasn’t going to use a vendor image for this board.

Vendor images for SBCs like this always tend to be good enough to boot, occasionally good enough to do basic benchmarks, and almost never something I want to build on–especially if I’m doing local AI work, host-native services, or anything that requires me to trust package sources, first-boot behaviour and upgrade paths.

I wanted a server-first layout, reproducible fixes and a place to bake in GPU/NPU prerequisites, so I forked orangepi-build and started from there, with a fairly high bar:

  • I wanted a fully reproducible Debian 13 / Trixie build with features like /dev/kvm present, not a vendor image with stale software and missing features I wanted.
  • The build needed to stop treating Ubuntu as the only real target–add-apt-repository, PPA logic and software-properties-common had to be cleaned out.
  • Boot fixes had to be baked in from the start, not applied as post-flash rituals.
  • First boot had to be deterministic. If the root filesystem resize requires me nearby with serial and patience, the image isn’t finished.
  • I needed a clean place to stage GPU firmware, vendor userspace and NPU packages.

The Orange Pi repository included kernel 6.6.89-cix, so a lot of the above was already “there”–I just needed to hack at it, but instead of doing it entirely by hand I got piclaw to set things up on an Ubuntu 22.04 VM.

Over a few weeks (this took a while), the above list translated into a fairly concrete set of changes in the build tree:

  • added Trixie configs under external/config/{cli,desktop,distributions}/trixie
  • patched scripts/distributions.sh for Debian 13 support
  • fixed the board config to allow trixie under DISTRIB_TYPE_NEXT
  • removed Ubuntu-only dependencies from the package lists
  • forced standard Debian mirrors
  • made the kernel build non-interactive
  • started baking in GPU/NPU prerequisites and development tooling for later testing

The package side needed archaeological work too. I patched orangepi-config to stop behaving as though it were on Ubuntu, removed software-properties-common from the Trixie dependency chain, forced regeneration of cached packages, and went hunting through component_cix-next for whatever vendor bits still existed and matched my kernel, taking notes throughout.

Getting To First Boot

My first boot-related note on this board was short: I flashed my custom Trixie image, got as far as GRUB, and it fell over because the EFI stub was wrong. The image did contain the right DTBs (SKY1-ORANGEPI-6-PLUS.DTB and friends), but the build scripts somehow commented out useful menu entries and the default pointed at the ACPI path.

But getting past GRUB was only half the battle. The first real boot surfaced another annoying issue: the partition resize worked, the root filesystem resize didn’t, and the machine failed to reboot cleanly at the handoff. I had piclaw trace the resize helper, found it was disabling itself before the second stage could run, and patched that too.

The whole thing made for a pretty intensive couple of weeks:

Build and fix timeline

In parallel, I made sure to include GPU/NPU support:

  • firmware symlink so panthor could find mali_csffw.bin
  • baked in cix-noe-umd and cix-npu-onnxruntime
  • and a big pile of dev tooling so the board could bootstrap AI experiments without turning into a scavenger hunt

NVMe and Swap

Once the image was booting reliably, I wanted the board off SD entirely. I had a 512GB NVMe drive sitting about, so I had piclaw plan the migration–but even though it had just finished patching orangepi-config, the actual cutover was done manually: partition the NVMe into EFI, root and swap, rsync everything across, patch grub.cfg to point at the new PARTUUID, reboot, verify, remove the SD card.

Storage migration: SD to NVMe

Software

So, to recap, I had to fix these things for my custom image:

  • Boot chain: initially broken because GRUB defaulted to the wrong path; stable once DTB boot was forced
  • GPU / Vulkan: initially llvmpipe fallback or panvk failure; working with vendor Vulkan ICD on mali_kbase
  • OpenCL: not useful at first, functional once the vendor userspace was in place
  • NPU kernel side: visible from the beginning, probe messages reporting three cores
  • NPU userspace: present only in fragments, inconsistent package references, a lot of manual validation needed

But after the first few steps were done, I had zero issues installing or building software on this–GCC 14.2 from Trixie, Bun as the primary scripting runtime, and the usual complement of build-essential, cmake, clang and ninja for C/C++ projects.

Python 3 and pip are present for the inevitable bits that still need them, and Docker runs cleanly, plus I made sure I had /dev/kvm available for virtualised workloads–and with the CIX patches for the P1 SoC, everything went swimmingly. The kernel is PREEMPT-enabled, which is pleasant for interactive work and inference latency, though I haven’t tested RT workloads.

I even got to run reliably on this with zero issues (including creating ARM VMs on it) before wiping the NVMe to do some AI testing.

The one area where the software story gets awkward is the vendor-specific GPU and NPU userspace–covered in the next two sections. Everything else about running Debian on this board is unremarkable, which is a compliment.

GPU

Out of the box, the Linux graphics story was effectively absent. The kernel side was in a half-state that looked superficially encouraging–/dev/dri/* present, both panthor and mali_kbase around, the system clearly aware of a Mali GPU, etc.

But Vulkan fell back to llvmpipe, and forcing the Mesa Panfrost ICD produced Unknown gpu_id (0xc870) errors. So I had piclaw go through the Orange Pi and component_cix-next package sources and find the missing pieces: vendor userspace for the CIX stack–cix-gpu-umd, cix-libglvnd, cix-libdrm, cix-mesa and a Vulkan ICD pointing at libmali.so.

Installing those got me partway–the userspace reported No mali devices found, because the board was still on the wrong kernel path. Once I rebound the GPU from panthor to the vendor mali/mali_kbase stack, /dev/mali0 appeared and Vulkan reported actual hardware:

  • deviceName = Mali-G720-Immortalis
  • driverID = DRIVER_ID_ARM_PROPRIETARY

OpenCL also came up correctly afterwards, again via the vendor path.

This was pretty good news as far as typical SBC testing goes, since it means you can get decent (if vendor-specific) GPU support working–but getting there involved driver rebinding, vendor package archaeology and a persistent module policy to keep the machine on the right stack across reboots.
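The persistent module policy can be as small as a modprobe blacklist. This is a sketch, assuming the module names reported above (verify against lsmod on your own kernel before committing to it):

```shell
# /etc/modprobe.d/mali-vendor.conf
# Keep the open panthor driver from grabbing the GPU at boot so the
# vendor mali/mali_kbase stack binds instead (and /dev/mali0 appears).
blacklist panthor

# Rebuild the initramfs afterwards so the policy applies at early boot:
#   sudo update-initramfs -u
```

With that in place the rebind survives reboots, which is the part that usually regresses on boards like this.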

GPU driver stack: open path vs vendor path

NPU

The NPU story was, if anything, even more typical of this class of hardware.

Linux clearly knew there was an NPU–dmesg reported three cores during probe–but the userspace was absent or incomplete and the package references inconsistent enough that I had to validate URLs by hand. One package version was simply gone, another worked, and I only reached a coherent install because component_cix-next still had enough usable artifacts lying about.

Not to say the NPU is fake or useless–it isn’t. But the tooling has that familiar feeling of being assembled by several teams who weren’t speaking to each other as often as they ought–and if your interest in a board like this is local AI, that matters more than any TOPS figure on a product page.

NPU stack status

Performance

This is where the board started being interesting.

Since I have been getting more and more involved in low level AI work, I spent most of my time testing local inference–the Orange Pi 6 Plus is not a universally good AI box, but it is surprisingly usable within a narrow envelope of models and runtimes.

And to make it usable for a few use cases, I needed a model-and-runtime combination that felt like an actual working stack rather than a demo. I ended up trying four inference runtimes–[PowerInfer], [ik_llama][ikl] (a CPU-optimized fork of llama.cpp), vanilla llama.cpp, and my own version of llama.cpp patched for the Orange Pi 6 Plus’s GPU via Vulkan (the NPU, alas, like many other ARM SoC NPUs, is designed more for vision processing than LLM work, and I spent a few evenings trying).

I ended up running well over a dozen different combinations of models and runtimes, and these five were the ones I invested the most time in, since I wanted a model that was powerful enough for “production” use even if it was a little slow in practice:

Inference performance by model and runtime

The dark bars are generation speed, the lighter bars are prompt processing. The verdicts on the right reflect what happened when I pushed each model through a real agent pipeline with tool calls, not just a short benchmark prompt–and that is where the gap between “fast on paper” and “actually works” showed up.

The Liquid models posted impressive raw tok/s figures but broke down in practice with blank responses and formatting failures. The 35B sparse model was surprisingly fast under ik_llama.cpp but ate all available RAM and failed roughly 40% of the time.

Only Qwen 4B on Vulkan held up as something I would actually leave running, making Qwen3.5 4B Q4_K_M the best all-round result:

  • Runtime: llama.cpp Vulkan
  • Prompt t/s: 8.4
  • Generation t/s: 9.7
  • Typical response time: 6-25s
  • RSS: ~5.3GB
  • Stability: 10/10 pass at -ub 8


Not desktop-GPU territory, but enough to move the board from “cute” to “useful”. More importantly, it was stable–it followed my coding assistant’s AGENTS.md prompt correctly, handled tool calls, and didn’t chew through all available memory.

The production configuration I eventually settled on was:

llama.cpp -m qwen3.5-4b-q4_k_m.bin \
  -c 32768 \
  -ngl 99 \
  -ub 8 \
  -np 1 \
  --reasoning-budget 0 \
  --jinja \
  --cache-ram 0

Every flag has a story–especially -ub, the micro-batch size, which controls how many tokens llama.cpp tries to process per Vulkan dispatch.

It turns out that the Mali Vulkan backend had a descriptor-set exhaustion issue that needed patching upstream before it stopped crashing (yes, I spent a while debugging Vulkan…), and I ran a set of benchmarks specifically for that:

Vulkan micro-batch tuning sweep

Bigger batches should mean better GPU utilisation and faster prompt ingestion, but the Mali G720’s Vulkan driver has a hard limit on descriptor sets–exceed it and the backend either crashes or silently degrades.

The green bars are stable configurations, the orange ones are not–and the dashed box marks where I landed for production. At -ub 16, prompt speed collapsed because the driver was already struggling; at 64+ it fell over entirely.

The tuning sweep showed where the practical ceiling was rather than the theoretical one:

  • At -ub 2, the setup was stable but underwhelming: about 4.3 prompt tok/s and 9.7 generation tok/s.
  • At -ub 4, prompt speed improved to 5.9 tok/s with the same 9.7 generation rate.
  • At -ub 8, which is where I eventually landed, prompt speed climbed to 8.4 tok/s and generation stayed at 9.7 tok/s.
  • At -ub 16, the whole thing became temperamental and prompt throughput actually collapsed to around 2.0 tok/s.
  • At -ub 32, it could survive a test run, but not in a way that inspired confidence.
  • At 64+, it was simply crashy.

So the practical production setting was not some elegant theoretical optimum–it was simply the highest value that stopped the Vulkan backend from crashing. That, in a sentence, sums up a fair bit of the experience of using this board.
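The selection rule is mechanical enough to write down. A sketch over the stable data points from the sweep above (prompt tok/s and a stability flag per -ub value):

```python
# Sweep results from the text: -ub -> (prompt tok/s, stable?).
# 32 and 64+ omitted since they never passed reliably.
sweep = {
    2: (4.3, True),
    4: (5.9, True),
    8: (8.4, True),
    16: (2.0, False),
}

def best_ub(results):
    # The production rule: not the theoretically optimal batch size,
    # just the largest one that is still stable.
    stable = [ub for ub, (_, ok) in results.items() if ok]
    return max(stable)

print(best_ub(sweep))  # -> 8
```

Note that here "largest stable" and "fastest prompt processing" happen to coincide at -ub 8; on a different driver they might not, and stability should win.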

Runtime Rankings

The runtime matters almost as much as the model:

  • llama.cpp on Vulkan was the best all-round practical setup, but only after patching and tuning.
  • llama.cpp on CPU was useful as a baseline and for sanity checks, but too slow once model size started to climb.
  • ik_llama.cpp on CPU turned out to be dramatically better for some 2-bit and sparse-ish workloads than I had expected, to the point where it occasionally made GPU offload look silly.
  • [PowerInfer] remained interesting mostly in theory; in practice it was too awkward and too far behind the other options to matter.

GPU offload was not always the right answer. A lot of the marketing gravity around boards like this points you toward the GPU or NPU as the only interesting path, but once you start timing things, the answer is much more conditional.

Qwen3.5 35B-A3B IQ2_XXS was instructive. Under stock llama.cpp, far too slow. Under ik_llama.cpp, dramatically faster on CPU–to the point where it occasionally behaved like a real system rather than a cry for help. But it had a roughly 40% empty-response rate, consumed nearly all RAM and swap, and was slow enough end-to-end that I would only call it “working” in the same tone one might describe a vintage British car that has just completed a short journey without shedding visible parts.

For that model, the runtime comparison was actually rather stark:

  • Upstream llama.cpp on pure CPU (-ngl 0) managed about 0.63 prompt tok/s, 1.07 generation tok/s and took 76.67s end to end.
  • Upstream llama.cpp with a token amount of offload (-ngl 8) was, if anything, slightly worse at 80.03s total.
  • ik_llama.cpp on CPU was the surprise winner by a ridiculous margin: 16.24 prompt tok/s, 5.24 generation tok/s and 12.75s total.
  • ik_llama.cpp with -ngl 8 promptly ruined that advantage and fell back to a miserable 71.33s total.

Model Rankings

That is one of the more useful things I learned here: for some quantized models on this machine, CPU inference with the right runtime was not just competitive with GPU offload, it was much better.

The Liquid models were interesting for a different reason. LFM2 8B-A1B Q4_K_M managed roughly 46.7 tok/s prompt and ~32 tok/s generation on Vulkan–objectively impressive for the active parameter count–and LFM2.5 1.2B pushed generation to around 45 tok/s. On paper, these look like the hidden sweet spot. In practice both failed when pushed through the full agent pipeline: blank output, formatting failures, over-eager obedience to internal conventions. Useful to know, but not deployable.

For reference, the ranking I ended up with:

  • Qwen3.5 4B Q4_K_M on llama.cpp Vulkan at 9.7 generation tok/s was the only setup that felt production-usable.
  • Qwen3.5 35B-A3B IQ2_XXS on ik_llama.cpp CPU at roughly 5.3 generation tok/s was the most surprising result–impressive, but too flaky and memory-hungry to trust.
  • LFM2 8B-A1B Q4_K_M on Vulkan at roughly 32 tok/s generation posted a great benchmark number but broke down in real agent use.
  • LFM2.5 1.2B Q4_K_M on Vulkan at roughly 45 tok/s generation was quick but not dependable enough to matter.
  • Qwen3.5 0.8B Q4_K_M on CPU at about 46 tok/s sounds good until you ask it to cope with a full agent prompt.

So yes, the board can run local models. It cannot run all of them well, and a distressing amount of the work lies in sorting out which bits of the stack are broken on any given day, but it was a much better experience than with Rockchip boards, and I intend to try out Gemma 4 and more recent models soon.

Fan Noise

While the above was going on, I kept tabs on both thermals and memory, since I expected sustained GPU or inference workloads to need active airflow. But I had to deal with the fan first, since the Orange Pi 6 Plus ships with a pretty beefy cooling solution that is, sadly, very much on the loud side.

And there’s no fan curve–all you get with the CIX kernel is a sysfs interface via cix-ec-fan with three modes:

  • mute
  • normal
  • performance

The first leads to the CPU reaching fairly high temperatures under even moderate load, the last is unbearably loud, and the normal setting ranges from moderately quiet to annoying, so for most of the testing I moved the board to my server closet.

Benchmarks

Again, the CIX P1 has 12 cores, but they are not equal–four low-power Cortex-A520 cores clocked at 1.8GHz and eight faster Cortex-A720 cores spread across four clusters at different peak speeds (2.2 to 2.6GHz). The kernel’s cpufreq subsystem treats each cluster independently, which means that it takes a bit of effort to max out all the cores:

  • sbc-bench reported no throttling during its run, which was encouraging.
  • The aggregate 7-Zip score landed around 33k, with the best single A720 core around 3874 and the A520 cluster way behind at about 1617–a nice reminder that workload placement matters on this SoC.
  • Memory bandwidth on the A720 cores was respectable: libc memcpy in the 15-17 GB/s range, memset often 35-47 GB/s.
  • The A520 results were dramatically lower across the board.

Memory Bandwidth

An interesting twist I lost some time exploring is that you can actually see per-cluster differences in memory bandwidth, which is new to me on ARM machines:

Memory bandwidth by CPU cluster

Blue bars are memcpy (read-then-write), red bars are memset (pure write). The A520 cluster is roughly half the bandwidth of the A720s across both. This matters for inference because memory access patterns land on whichever cores the scheduler picks, and a hot path pinned to the efficiency cluster is immediately noticeable.

Thermals

On a quiescent system, sensor readings were good–most blocks hovered in the high twenties to low thirties Celsius:

  • GPU_AVE: 29°C
  • NPU: 30°C
  • CPU_M1: 30°C
  • CPU_B0: 32°C
  • PCB_HOT: 33°C

The thermal logs during the benchmarks were more reassuring than I expected:

  • idle and light-load readings sat mostly around 29-33°C across GPU, NPU and CPU blocks
  • under the longer benchmark runs, board and package sensors generally rose into the mid-30s to about 40°C range, which is very good (but, as you’d expect, audibly noticeable from outside the closet)
  • frequency traces showed the active cluster spending long stretches pinned at its target clocks before later dropping back, which looked much more like workload phase changes than panicked throttling
  • One benchmark artifact I largely ignored was the iozone run, because it was aimed at /tmp and therefore mostly measuring the memory-backed path rather than telling me anything meaningful about persistent storage.

Here’s a new chart that tries to capture thermals and frequency a little better than my old ones:

Thermal and frequency trace during sbc-bench run

The above covers the full sbc-bench session–roughly 40 minutes of mixed workloads.

The three shaded phases correspond to what was running at the time: a short iozone burst (memory-backed, not interesting), the main sbc-bench battery (OpenSSL, 7-Zip single and multi-threaded, tinymembench across all clusters), and the trailing cooldown.

The key thing to notice is that frequency stayed pinned at target clocks throughout the heavy phases and only dropped back during transitions–there was no thermal throttling, which is pretty amazing.

Temperature peaked around 43°C during the sustained multi-threaded 7-Zip run, which is well within spec for a board with active cooling. The idle baseline was around 29°C, and it settled back there fairly quickly once the load came off.

One thing I could not track was fan speed, since the cix-ec-fan interface does not expose current RPM or duty cycle, and I had no way to correlate the thermal curve with what the fan was actually doing at each point. I could hear it spin up and settle, but I have no real data to overlay, and even though I considered setting up a dB meter, I never got around to it.

Living with it

All of the above covers the first week or so. But I’ve been running this board as an always-on machine since March 8, and I now have a month’s worth of data on what it’s like to live with.

The board now hosts a piclaw instance (my personal assistant) that I’ve been using for development and model testing, since I realized that LFM2-8B-A1B makes for a faster model to experiment with (31 t/s generation, 47 t/s prompt processing on Vulkan), even if it’s effectively not that “smart”.

Alongside the assistant work, I’ve been using the board for a real development project: porting the BasiliskII classic Mac emulator’s JIT to AArch64.

Over the past month that has meant a good deal of compilation, linking, automated experiment runs and testing. The JIT now executes real 68k ROM code with basic optimisations–interrupt delivery and display rendering are the active frontier, but it boots to a Mac OS desktop every now and then. The AArch64 JIT bugs I hit (broken optflag inline asm bindings, various register allocation and flag bugs in codegen_arm64.cpp, VM_MAP_32BIT allocation failures, repeated attempts at fixing emulated 68k interrupt delivery) were genuine low-level issues, and the constant rebuilds they forced exercised the board’s toolchain and memory subsystem in ways no synthetic benchmark would. The board handled all of it without complaint.

Power Consumption

One thing that came up in every review of the CIX P1 I read–[Jeff Geerling’s Orion O6 writeup][jg] being the most prominent–is power draw, and I now have a month’s worth of data confirming that it is higher than average: 15.5W in my case, rather than the 13W I usually see quoted elsewhere:

Orange Pi 6 Plus wall power over 30 days

The flat zeros on the left are the setup period when I was reflashing and debugging offline. Once it came up as an always-on machine the power draw settled into a consistent daily pattern.

Orange Pi 6 Plus wall power over 7 days

Zooming into the last week at 15-minute resolution, the daily idle/load cycle is clearly visible–overnight the board drops to about 15-16W, and during the day it hovers around 20-27W depending on what I am doing. Compilation and inference bursts push it briefly toward 30W; the rest of the time it sits comfortably in the low twenties.

That said, the idle floor of 15-16W is noticeably higher than what I am used to from other SBCs. A Raspberry Pi 5 idles around 3-4W, an RK3588 board typically settles around 5-8W, and even a Mini PC with an N100 can idle below 10W.

The Orange Pi 6 Plus never really gets below 15W even with nothing running, and that appears to be a common trait of the CIX P1 reference design rather than anything specific to this board–the Radxa Orion O6 (same SoC) shows a very similar baseline in the reports I have seen.

Whether that is down to the memory controller, the 5GbE PHYs, the always-on fan or some combination of all three, I cannot say for certain. But it does mean the board is less attractive as a low-traffic always-on appliance than the raw compute-per-watt numbers might suggest. At 15W idle you are paying about 130 kWh/year just to keep it breathing, which is not terrible but is not nothing either.
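The arithmetic behind that yearly figure is straightforward, and it doubles as a sanity check on the current chart below (the 30 W burst value is taken from my own measurements above):

```python
# Back-of-the-envelope numbers for an always-on board with a 15 W idle floor.
idle_watts = 15
kwh_per_year = idle_watts * 24 * 365 / 1000
print(f"{kwh_per_year:.0f} kWh/year")   # → 131 kWh/year

# Matching current on a 230 V circuit: even a 30 W compilation
# burst stays comfortably under 0.2 A.
print(f"{30 / 230:.2f} A")              # → 0.13 A
```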

Orange Pi 6 Plus current draw over 7 days

I checked, and current draw mirrors the power profile and stays well under 0.2A on the 230V circuit. The board’s power supply is not doing anything exotic.

Mains voltage on the office circuit over 7 days

The voltage trace is mostly here for completeness–Lisbon mains hovering around 230-232V with the usual overnight sag and daytime recovery. Nothing that would stress any reasonable power supply, and useful as a sanity check that the power readings are not being skewed by wild grid swings.

Reboots over the month: essentially none that weren’t my doing. The board has been stable in a way I did not expect from the early boot-chain experience.

Conclusion

After all of this, the Orange Pi 6 Plus fits a fairly specific set of roles:

  • local inference experiments with carefully chosen models
  • edge-side telemetry or monitoring
  • compact Linux services that benefit from dual 5GbE
  • infrastructure roles where you want something denser and lower-power than x86 but more capable than the usual toy SBC

I wouldn’t use it as a general-purpose desktop, and I wouldn’t trust the NPU story for anything LLM-related without more soak time. But I would keep it around for the sort of edge-AI and systems work I usually get drawn into–enough real capability to justify the effort, even if that effort is, right now, unreasonably high.

Even considering that I cut a lot of corners on the software side to get to a usable state, the hardware is still very much ahead of the software.

The GPU works, the NPU stack exists in some recognisable form, and local AI is not only possible but occasionally good. The power consumption and fan noise are higher than I would like for a board in this class, but compared to Rockchip’s offerings this is a much more polished experience–and the fact that I can get it to do useful work at all by myself, with my own OS image, is a testament to the progress ARM boards have made in the last couple of years.

Notes for March 30 – April 5

This was a shorter work week partly due to the Easter weekend and partly because I book-ended it with a couple of days off in an attempt to restore personal sanity–only to catch a cold and remain stuck at home.

Read More...

The Xteink X4

I got an Xteink X4 this week, and my first reaction was somewhere between amusement and nostalgia–it is absurdly small, feels a lot better made than I expected for the price, and the form factor harks back to the times when I was reading e-books on Palm PDAs and the original iPod Touch.

Read More...

Hans Zimmer

At least they aren’t from Behringer
Modular synths on stage. Who would have thought?

Notes for March 23–29

Work ate the week again. I’m exhausted, running on fumes, and daylight saving time stole an hour of sleep I could not afford–the biannual clock shuffle is one of those vestigial absurdities that nobody can be bothered to abolish, and I’m starting to take it personally.

Read More...

Notes for March 16–22

This week’s update is going to be short, largely because work was hell and I ended up spending my Saturday evening poring over my meeting notes backlog until 2AM today, and I have a splitting headache to show for it.

Read More...

Notes for March 9–15

Well, there went another work week. Slightly better (to a degree, although I got some discouraging news regarding a potential change), and another week where piclaw ate most of my evenings–it went from v1.3.0 to v1.3.16 in seven days, which is frankly absurd even by my standards.

Read More...

MacBook Neo Impressions

I went to a local mall yesterday and chanced upon a couple of MacBook Neos on display at our local (monopolistic) retailer, and spent half an hour playing with them.

Read More...

So You Want To Do Agentic Development

We’re three months into 2026, and coding agents have been a big part of my time since –things have definitely intensified, and has already panned out: agents are everywhere.

Read More...

Notes for March 2–8

This was a frankly absurd week work-wise, with some pretty long days and a lot of late-night hacking on my projects (which is not exactly a new thing, but at least now I am asking piclaw to do it during the day time, which is a small improvement).

Read More...

Notes for February 23–March 1

Well, going back to work after a week off was rough.

Read More...

Notes for February 16-21

This week I did something different: I took a wellness break from work and generally tried to tune out all the noise and messiness I have been experiencing there. It ate a chunk out of my PTO, but was mostly worth it.

Read More...

macOS Tahoe 26.3 is Broken

I have no idea what is happening, since I can’t even find any decent logs in Console.app, but it seems that the latest update to macOS Tahoe (26.3) has a serious bug.

Read More...
