This is a fascinating box, so much so that after almost three weeks of playing with it I had amassed enough material to nearly split my review into two parts. In the end I decided to condense it a bit and post a longer piece than usual, even if that means almost half of it is a fairly wide-ranging exploration of how to get AI workloads running on it.
The MilkV Jupiter 2 in its metal case
Spoiler: We’re tantalizingly close to having usable non-GPU inference on SBCs, and surprisingly enough, RISC-V is more interesting than ARM right now.
I’ve tested a lot of ARM boards over the past few years, but only a couple of RISC-V machines–and the MilkV Jupiter 2 is quite a substantial system: sixteen cores (with a twist), a refreshingly roomy 32GB of RAM, a 10GbE SFP+ cage, Wi-Fi 6 and a GPU with actual DRM nodes, all in a Pico ITX form factor.
Disclaimer: my contacts at Radxa supplied me with a Jupiter 2 free of charge, and as usual, this article follows my review policy.
On paper, this is the first RISC-V board that doesn’t feel like a science project.
In person, and unlike most of the SBCs I get, the Jupiter 2 is a finished product, and came in a neat little box, fully assembled and contained in an unassuming metal case with external antennae as the only extra parts. No power brick, but since it has a USB-C PD port, I had zero trouble powering it from one of my monitors.
After some careful disassembly, the board itself is pretty dense: 1× DP out, 1× eDP ribbon, 1× USB-C PD power input, 3× USB-A 3.0, 1× GbE RJ-45, 1× 10GbE SFP+ cage, an M.2 slot and what looks like a second M.2 for storage. There are also MIPI/eDP ribbon connectors I haven’t tested.
The board is dwarfed on the top side by the cooler, which I dared not remove
The SoC is SpacemiT’s K3–a big.LITTLE style arrangement with 8×A100 cores at 2GHz and 8×X100 cores at 2.4GHz, which makes it the first RISC-V chip I’ve handled that has asymmetric core clusters. And since there are a few other devices out there with the same reference design, I will henceforth refer to the Jupiter as the K3 for short.
If you’ve never come across SpacemiT’s stuff before (I had only a bare inkling of the K1), I heartily recommend the public SpacemiT K3 documentation and their GitHub repository since the architecture is laid out there, and it was fairly easy to get a high level grasp. In particular, the K3 SoC datasheet has a pretty good overview:
Block Diagram from the K3 Technical Brief
A key thing to take into account is that the A100 cores are fundamentally different from the X100 ones. They have extended vector instruction sets, dedicated tightly-coupled memory (TCM), and, well… AI.
That documentation also seems to be the original source of the marketing claims that the K3 provides 60 TOPS of AI compute and can run 30B models at over 10 tokens/s. Well, sort of: as another spoiler, I can share that I hit a practical cap at roughly 3B active parameters, but we’ll get there…
One of the nice things about this box is that it comes with a 10GbE Realtek NIC. I wasn’t able to test that at full speed yet since my 10GbE interfaces are all in my server closet, but the 802.11ax reported below worked flawlessly with my Wi-Fi 6 setup:
That sda (model TY7B-128) initially fooled me into thinking it was a SATA SSD–but there’s no SATA controller on this board, and the 3.4 GB/s reads I measured later are well past anything SATA III can do (~600 MB/s). It’s actually 128GB of onboard UFS, which rides the kernel’s SCSI layer and so enumerates as sda exactly like a SATA disk would (NVMe would be nvme0n1, eMMC mmcblk*). The mtdblock devices are the 8 MB NOR flash partitions (bootinfo, FSBL, env, eSOS, OpenSBI, U-Boot).
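As a rough illustration of that naming logic (a sketch, not a tool used for this review), the snippet below maps kernel block device names to the layer that most likely owns them and applies the same sanity check against the approximate SATA III ceiling. The function names and the simplified mapping are assumptions made for the example.

```python
import re

# Kernel naming prefixes mentioned in the text; the mapping is deliberately simplified.
_PATTERNS = [
    (re.compile(r"^nvme\d+n\d+$"), "NVMe namespace"),
    (re.compile(r"^mmcblk\d+$"), "eMMC / SD (MMC layer)"),
    (re.compile(r"^mtdblock\d+$"), "MTD flash partition (e.g. NOR)"),
    (re.compile(r"^sd[a-z]+$"), "SCSI-layer disk (SATA, USB storage or UFS)"),
]

SATA3_LIMIT_MBPS = 600  # approximate SATA III ceiling quoted in the text


def classify_block_device(name: str) -> str:
    """Return a coarse label for a whole-disk block device name."""
    for pattern, label in _PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"


def could_be_sata(name: str, measured_read_mbps: float) -> bool:
    """An sdX disk can only be SATA if its measured reads fit under the SATA III ceiling."""
    is_scsi_layer = classify_block_device(name).startswith("SCSI")
    return is_scsi_layer and measured_read_mbps <= SATA3_LIMIT_MBPS
```

With the figures above, `could_be_sata("sda", 3400)` is false, which is the same conclusion reached by hand: the name alone says "SCSI layer", and the throughput rules out SATA.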
The sensors output is a bit weird, but it does cover all the CPU cores (A100 are clusters 0 and 1, X100 are 2 and 3). And I will have a bit more to say about the fan.
But I’m ahead of myself here–these were gathered after plugging it in, obviously, and it’s worth rewinding and going over that part:
This was a first-class experience, and I wish all SBCs worked this way: I plugged the DP port into my ancient LG Ultrafine, powered on the monitor, and got a Bianbu first-boot wizard in less than 5 seconds after the initial logo.
Clicked through it–language, timezone, user account–and landed on a working accelerated desktop. That’s it. No GRUB patching, no DTB hunting, no resize-filesystem bugs, no serial console required. The smoothest first boot I’ve had with an SBC all year.
The board ships with Bianbu 4.0 (“Resolute Raccoon”)–a Debian-based distribution from SpacemiT, which, unlike most ARM boards I’ve used recently, is actually running a modern 6.18.3 kernel.
MilkV Jupiter 2 LXQt on Wayland - note how only the first 8 cores are active
The desktop runs LXQt on Wayland, SDDM as the display manager, and the whole thing felt responsive enough that I didn’t immediately reach for the terminal. That is not something I say about SBC desktops often, and even though I then spent most of the past three weeks accessing it via ssh, I would likely have zero issues using it.
Standard apt works (repos seem to be at spacemit.com), the Debian toolchain is present, and the kernel command line includes some interesting RISC-V-specific hints: unaligned_scalar_speed=fast and unaligned_vector_speed=fast, which, as far as I can tell, make the kernel treat misaligned scalar and vector accesses as fast instead of probing their speed at boot.
I dug around a bit more and the boot chain goes through NOR flash (OpenSBI + U-Boot) → UFS, which is cleaner than the SD-card-based setups on most SBCs I’ve tested, and it was able to update itself without any issues:
Not UEFI, but compared to the U-Boot-on-SD-card experience that most ARM SBCs inflict on you, having a proper NOR flash boot chain with OpenSBI → U-Boot → onboard UFS is a step up, because it means you can brick the OS partition and still recover without reflashing an SD card on another machine (and yes, Rockchip, I’m looking at you).
And since it all worked out of the box, I did not try adding an NVMe (there’s an M.2 M-Key slot for one) or booting from it (yet), although since there is official Ubuntu support I fully intend to try that out in the future.
Developer tooling for RISC-V will be foremost on most of my readers’ minds, so I can tell you right away that I am currently making extensive use of these:
GCC 15.2 (riscv64)
Go 1.25.7 – works out of the box, which is significant for me
Python 3.14.3
Make 4.4.1
Sadly (for me), Bun isn’t available, since there’s no official riscv64 build available yet, but node works OK. I focused mostly on Go, though.
To get started, I ran a small battery of tests to get a feel for where this sits relative to the Orange Pi 6 Plus (CIX P1, 12 ARM cores) I’ve been living with for months.
Note that these benchmarks only ran on the X100 cluster (cores 0–7). The A100 cores (8–15) are kernel-fenced for AI work–htop shows them sitting idle, and sched_setaffinity silently refuses to pin anything there from a normal shell. The reasons for that are various and fascinating, and I’ll get into them below.
The sysbench single-thread number is the interesting one here: 2,329 versus 2,800. That’s only a 1.2× gap per X100 core. The 7-Zip figures (17.5k vs 42.3k MIPS) look damning until you realize that the A100 cores weren’t used at all, so the Jupiter 2 is really running 8 general-purpose threads against the P1’s 12.
The real gap shows up in Go and Python (4-5×), which probably says more about how young the riscv64 runtime backends are than about the hardware itself.
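The per-thread reading of the 7-Zip and sysbench figures above can be checked with a few lines of arithmetic. This sketch assumes the numbers are quoted K3 first and CIX P1 second, and that dividing by thread count is a fair first-order normalisation (7-Zip does not scale perfectly linearly, so treat the result as indicative only).

```python
def per_thread(score: float, threads: int) -> float:
    """Naive per-thread score: total divided by the number of threads used."""
    return score / threads


# Figures quoted in the text (assumed order: K3, then CIX P1).
k3_mips, k3_threads = 17_500, 8    # only the 8 X100 cores were used
p1_mips, p1_threads = 42_300, 12

raw_gap = p1_mips / k3_mips
per_thread_gap = per_thread(p1_mips, p1_threads) / per_thread(k3_mips, k3_threads)
single_thread_gap = 2_800 / 2_329  # sysbench single-thread figures

print(f"raw 7-Zip gap:        {raw_gap:.2f}x")
print(f"per-thread 7-Zip gap: {per_thread_gap:.2f}x")
print(f"sysbench 1T gap:      {single_thread_gap:.2f}x")
```

Normalising by thread count shrinks the 7-Zip gap from about 2.4× to about 1.6×, which sits much closer to the 1.2× single-thread figure than the raw totals suggest.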
I went back and ran this in parallel on the CIX P1, and the K3’s memory bandwidth is much lower–roughly a fifth for reads. This is likely the biggest single performance gap and puts an upper cap on whatever the CPU can do regardless of how much it packs into each cycle. For inference workloads that are memory-bound, this matters a lot. The K3 has a few workarounds, though, as we’ll see later.
The built-in UFS storage is very nice–NVMe-class speeds, better than what I saw on the Orange Pi 6 Plus’s NVMe setup with my own (underused) PCIe 4 SSD. No complaints here.
The board stays well-behaved under sustained 8-core stress-ng:
Idle: 59-64°C, fan at 45% / 2335 RPM
Full load (30s sustained): 62-68°C, fan ramps to 60% / 3194 RPM
No throttling observed, which made my usual CPU/thermal charts kind of pointless
Again, stress-ng --cpu 0 ran on the 8 available X100 cores, but even when I ran both CPU and AI loads that used the A100 cores, the fan was audible but not objectionable–noticeably quieter than the Orange Pi 6 Plus’s cix-ec-fan in quiet mode, and the fan controller API is much saner.
Since I had a few tussles with the Orange Pi 6 Plus’s fan controller limitations, I let an LLM loose on /sys/devices, and it found out that the Jupiter’s fan is managed by a CrosEC controller over eSPI (/sys/devices/platform/soc/cac8c000.espi/84000000.ec). That exposes a standard hwmon interface with fan1_input and (surprisingly) fan1_fault that standard Linux utilities can read (and the built-in cooler does seem to have the right number of wires to provide fan sensing, which is a nice touch).
There’s also a separate pwm-fan platform device at /sys/devices/platform/pwm-fan/hwmon/hwmon8/pwm1 that accepts values 0-255 for direct duty-cycle control, with pwm1_enable=1 when thermal management is active, with a pwm-fan cooling device linked to thermal_zone0. In practice, you never need to touch any of this–the board keeps itself at 60-68°C under sustained load with the fan barely audible, even when using all 16 cores and at an ambient temperature of nearly 28°C in my office.
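For anyone who does want to poke at those sysfs files, here is a minimal sketch. It assumes the standard hwmon attribute names mentioned above (fan1_input, fan1_fault) and takes the hwmon directory as a parameter, since the exact hwmon index under the EC device is not given here and may change between boots. The helper names are hypothetical.

```python
from pathlib import Path


def percent_to_pwm(percent: float) -> int:
    """Convert a duty cycle in percent to the 0-255 range the pwm1 file accepts."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(percent * 255 / 100)


def read_hwmon_int(hwmon_dir: str, attribute: str) -> int:
    """Read one integer attribute (e.g. fan1_input) from a hwmon directory."""
    return int(Path(hwmon_dir, attribute).read_text().strip())


def fan_status(hwmon_dir: str) -> dict:
    """Return fan speed and fault flag from a hwmon directory that exposes both."""
    return {
        "rpm": read_hwmon_int(hwmon_dir, "fan1_input"),
        "fault": bool(read_hwmon_int(hwmon_dir, "fan1_fault")),
    }
```

The 45% idle duty cycle reported above corresponds to `percent_to_pwm(45)`, i.e. a raw value of 115. Writing to pwm1 needs root and fights the thermal governor, so reading is the safer experiment.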
I stuck a USB PD power monitor between the PSU and the K3, and the figures were pretty stable: 11W idle, an oddly symmetrical 22W under load. I suspect using an SFP for networking will add significantly to that, but most of my testing was actually done by ssh over Wi-Fi.
Unlike the Orange Pi 6 Plus, where the GPU required driver rebinding and vendor package archaeology, the Jupiter 2’s PowerVR GPU works out of the box.
No module loading, no blacklisting, no package hunting. I ran vulkaninfo and got a conformant Vulkan 1.3 device on the first try, although I am not sure how far I can go with Vulkan compute on this board yet since I explored other avenues.
The hardware is an IMG PowerVR B-Series BXM-4-64 MC1, and Vulkan reports it cleanly:
deviceName = PowerVR B-Series BXM-4-64 MC1
driverID = DRIVER_ID_IMAGINATION_PROPRIETARY
apiVersion = 1.3.277
driverVersion = 1.588.1135 (24.2@6603887)
conformanceVersion = 1.3.8.1
Doing the usual barrel-scraping YouTube influencer “testing” of firing up a 4K video in the browser is… absurdly fluid, really, since the K3 has a dedicated video decode unit (/dev/video-dec0, V4L2 “mvx” driver–decode only, no hardware encode that I can find) and that seems to be properly stitched together on the Bianbu packages.
OpenCL 3.0 is also present, with cl_khr_fp16 and cl_khr_integer_dot_product – the latter suggesting hardware support for int8 dot products, which is exactly what you want for basic vision processing. I tried poking at it with my Vulkan tooling, and the Vulkan side exposes shaderFloat16 and shaderInt8, 16KB shared memory, and 2 compute queues.
In short, I had zero issues with desktop acceleration, and I expect the K3 to be well supported going forward. I do intend to explore Vulkan on this a bit more, but as you’ll see below, I got completely sidetracked by the ISA and how it does vector compute…
The device tree shows an Arm China Linlon V5 (Zhouyi AIPU) at c0500000, status okay.
Okay, then, but… the device-tree lacked the obvious NPU plumbing I am sort of used to from ARM:
/proc/device-tree/soc/linlon-v5@c0500000/compatible says arm china,linlon-v5
there are no /dev/aipu*, /dev/npu*, /dev/linlon* or /dev/zhouyi* nodes
there are no aipu, linlon or zhouyi kernel modules under /lib/modules/6.18.3-generic
dmesg is silent for those names
web searches for linlon-v5, arm china,linlon-v5, Zhouyi AIPU and SpacemiT K3 NPU drivers turned up no public driver or SDK that matches this node
The Linlon V5 block is effectively opaque–no driver, no SDK, no kernel module. So it’s a dead end for now, although I suspect there are drivers for it somewhere.
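The checks in that list are easy to script. The sketch below is an illustration rather than the exact commands used: it globs a device directory for the node names listed above and decodes a device-tree compatible property, which is stored as NUL-terminated strings. The function names are made up for the example.

```python
import glob
import os

# Device node name patterns taken from the list above.
NODE_PATTERNS = ("aipu*", "npu*", "linlon*", "zhouyi*")


def find_accelerator_nodes(dev_root: str = "/dev", patterns=NODE_PATTERNS) -> list:
    """Return every path under dev_root whose name matches one of the patterns."""
    hits = []
    for pattern in patterns:
        hits.extend(glob.glob(os.path.join(dev_root, pattern)))
    return sorted(hits)


def read_compatible(node_dir: str) -> list:
    """Decode a device-tree 'compatible' property (one or more NUL-terminated strings)."""
    with open(os.path.join(node_dir, "compatible"), "rb") as f:
        return [s.decode() for s in f.read().split(b"\0") if s]
```

On this board, `find_accelerator_nodes()` would be expected to return an empty list, while `read_compatible("/proc/device-tree/soc/linlon-v5@c0500000")` should return the string quoted above.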
What is interesting is what’s hiding in Bianbu’s apt repository: a SpacemiT ONNX Runtime stack (spacemit-onnxruntime, python3-spacemit-ort) and a spacemit-tcm package. The latter ships libspine_tcm.so, spacemit-tcm-smi and a public spine_tcm.h, and it talks to /dev/tcm rather than to a classic /dev/npu device. That’s not an NPU path at all–it’s targeting the A100 RISC-V cores and their tightly-coupled memory directly.
After the first evening of poking around, I decided to do what most people would do and read some actual documentation–which wasn’t hard to come by.
The CPU chapter in SpacemiT’s documentation gave me a few hints: the A100 cores run SpacemiT-IME (Inference Matrix Engine), a set of custom RISC-V vector extensions for quantised matrix arithmetic, with a programming model that gave me a bit of a flashback to my FORTRAN and VAX days–matrices in registers, explicit tiling and core synchronisation–but as a crash course in what RISC-V vector extensions can actually do, it made for a fun read.
The short version, if you’re in a hurry, is that this is a “unified memory” RISC-V system where the CPU itself can do some interesting quasi-GPU math:
The long version is that this is almost tailor-made for go-pherence, my pet inference library. I’ve mostly been doing MLX-like FP16 work with it, but my real intent is non-GPU inference, and even though AVX2 and NEON are interesting, I was completely nerd-sniped by the idea of using this RISC-V RVV variant to do “proper” inference.
And Codex was able to sort out how to map this to useful steps and identify parts of the instruction set that could do just that:
The custom instructions (vmadotsu.hp, vmadotu.hp, vnpack4.vv, vupack.vv, vpack.vv) perform fused int4×int8 dot products with FP16 accumulation. Each vmadot dispatch processes 128 bytes of activation against 512 bytes of 4-bit weights, producing 32 partial results. The data layout treats VS1 as copies×(M, K) matrices and VS2 as copies×(K, N) matrices, with the result stored across VD(L) and VD(H).
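To make the arithmetic concrete, here is a scalar reference for a packed int4 × int8 dot product. It is only a sketch of the math, not the IME register layout: it assumes two signed 4-bit weights per byte with the low nibble first, signed 8-bit activations, and plain integer accumulation instead of the FP16 accumulation described above. All of those are assumptions for illustration.

```python
def unpack_int4(packed: bytes) -> list:
    """Unpack signed 4-bit values, two per byte, low nibble first (assumed layout)."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Interpret the nibble as two's complement: 8..15 map to -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def dot_int4_int8(packed_weights: bytes, activations: list) -> int:
    """Dot product of packed int4 weights against int8 activations."""
    weights = unpack_int4(packed_weights)
    if len(weights) != len(activations):
        raise ValueError("need exactly two weights per packed byte")
    return sum(w * a for w, a in zip(weights, activations))
```

The sizes line up with the text: `unpack_int4(bytes(512))` yields 1024 weights, which is what 512 bytes of 4-bit data should hold. How the hardware tiles those against 128 activation bytes is defined by the ISA document, not by this sketch.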
The “hard” part was to map this to Go assembler, but, again, Codex had no trouble churning out code for vector operations by just lining up the right bits:
I had some trouble figuring out how this mapped to the TCM memory device that I had found, but a few more pages into the ISA doc it became clear:
TCM is 3 MB of on-chip SRAM (8 × 384 KB blocks), meant as a low-latency scratchpad for the IME2 matrix engine. According to the docs, both sets of cores can access it in pairs:
From the X100 cores (VLEN=256), TCM reads at 1.14 GB/s (uncacheable device memory)
From the A100 cores (VLEN=1024), it reads at 5.4 GB/s via a direct SRAM path for wide vector loads
This is a pretty dramatic difference from the RAM bandwidth I measured earlier, and even more so if you consider that the A100 cores can read it nearly five times faster than the X100 cores (5.4 vs 1.14 GB/s). And there’s more:
Cores are organised in pairs sharing TCM blocks, so they can exchange results much faster
I later found that SpacemiT’s own reference code uses paired-worker barriers to overlap DMA (weight prefetch from DRAM into TCM) with compute on the partner core
If you’ve ever done double-buffering, well, this is it applied to vector compute.
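For readers who have not, the double-buffering pattern described above looks roughly like this. The sketch uses a Python thread as a stand-in for the partner core doing the prefetch, so it only illustrates the scheduling, not the performance; `load_tile` and `compute` are hypothetical callables.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tiles(load_tile, compute, n_tiles: int) -> list:
    """Compute on tile i while tile i+1 is being loaded in the background."""
    results = []
    if n_tiles == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(load_tile, 0)      # prime the first buffer
        for i in range(n_tiles):
            tile = pending.result()                    # wait until tile i is ready
            if i + 1 < n_tiles:
                # Kick off the next load before computing, so the two overlap.
                pending = prefetcher.submit(load_tile, i + 1)
            results.append(compute(tile))
    return results
```

A toy call such as `process_tiles(lambda i: list(range(i * 4, i * 4 + 4)), sum, 3)` returns `[6, 22, 38]`. On the K3 the "load" would be a weight prefetch from DRAM into TCM and the "compute" a matrix tile.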
Armed with this knowledge, I distilled it into a SPEC and went to town on the K3 with Codex to see if we could port some of the go-pherence SIMD inference kernels, but there was a serious kink: I couldn’t for the life of me figure out how to schedule code on the A100 cores.
So I asked Codex to get out Capstone and disassemble the TCM libraries. Turns out getting a thread onto the A100 cores requires a two-step handshake:
write the thread’s TID to /proc/set_ai_thread (a kernel interface that unlocks scheduling on cores 8–15 for that specific thread)
then call sched_setaffinity to pin it.
Without the registration the kernel silently refuses the affinity change–those cores are fenced off from normal userspace entirely (which explains the oddities in the early benchmarking).
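The two steps above can be sketched as follows. This is a hedged illustration, not SpacemiT's code: the text only says the TID is written to /proc/set_ai_thread, so writing it as a plain decimal string is an assumption, and the function name and its injectable parameters exist only to make the sketch testable off the board.

```python
import os
import threading

AI_THREAD_PROC = "/proc/set_ai_thread"  # path taken from the text above
A100_CORES = frozenset(range(8, 16))    # cores 8-15, per the text above


def register_and_pin(cores=A100_CORES, proc_path=AI_THREAD_PROC, set_affinity=None) -> int:
    """Register the calling thread as an AI thread, then pin it to the given cores."""
    if set_affinity is None:
        set_affinity = os.sched_setaffinity  # Linux-only standard library call
    tid = threading.get_native_id()
    # Step 1: tell the kernel which thread may run on the fenced cores.
    # ASSUMPTION: the interface accepts the TID as a decimal string.
    with open(proc_path, "w") as f:
        f.write(str(tid))
    # Step 2: pin the calling thread (pid 0 means "the caller").
    set_affinity(0, set(cores))
    return tid
```

After calling `register_and_pin()` on the board, `os.sched_getaffinity(0)` is a quick way to confirm whether the mask actually changed, since the refusal is silent.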
SpacemiT’s own llama.cpp fork (PR #22863) uses this pattern: six pthreads permanently pinned to cores 8–13, synchronised with spine_barrier_t (an atomic spinlock barrier), sitting in a persistent work loop that processes matrix tiles from a shared queue.
The workers never return to the OS scheduler between operations–barriers replace dispatch overhead entirely. I later realized that a) this is how the K3 can hit 35–40 tok/s on Qwen3-0.6B Q4_K_M b) Go scheduling has a lot more overhead.
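The persistent-worker idea can be illustrated in a few lines. This is a simplified sketch under stated assumptions: it uses Python's blocking `threading.Barrier` rather than a spinning barrier, and it splits tiles by static striding rather than a shared queue, so it shows the shape of the pattern and none of its speed.

```python
import threading


def run_persistent_workers(batches, n_workers: int, kernel) -> list:
    """Start n_workers once; they walk every batch and meet at a barrier after each one."""
    results = [[None] * len(tiles) for tiles in batches]
    barrier = threading.Barrier(n_workers)

    def worker(rank: int) -> None:
        for b, tiles in enumerate(batches):
            # Each worker takes every n_workers-th tile of the current batch.
            for i in range(rank, len(tiles), n_workers):
                results[b][i] = kernel(tiles[i])
            # Nobody starts batch b+1 until everyone has finished batch b.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The point of the design is that threads are created once and only synchronise at the barrier, instead of being dispatched per operation. That is the overhead a goroutine-per-task approach pays and the pinned workers avoid.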
Disassembling the ONNX runtime I’d found (SpaceMITExecutionProvider) showed it used the same cores with SPACEMIT_EP_* settings for thread count, profiling, and operator filtering.
So where does this leave us in terms of usable inference? Well, a lot of people like speed, and if you want speed, you can install llama.cpp-tools-spacemit 0.0.8 and run TinyLlama 1.1B Chat Q2_K (which is just 459MiB) with 8 threads:
| Test | Result |
|---|---|
| Prompt processing pp128 | 137.47 ± 0.05 t/s |
| Token generation tg64 | 36.60 ± 0.01 t/s |
This is pretty impressive as SBCs go, and it’s no wonder I am starting to see YouTube videos demoing it: it fills up a screen impressively fast if you do a one-shot prompt, but a model that small is fundamentally useless.
The more interesting question is whether the K3 can host a usable local coding endpoint, so I worked through a spread of current models on a fork of the SpacemiT llama.cpp tree, all at Q4_K_M with f16/f16 KV and 8 threads.
I cranked out a Pi session and had it draft a realistic agentic coding turn: a system prompt with tool definitions, a prior read tool call, the file returned as context, and a request to produce an edit tool call - roughly 700-900 prompt tokens in, 700 generated out.
The results were… interesting. And slow to achieve, not just because of the turn times but also because I had to patch llama.cpp to match minor changes in the Bianbu libraries:
| Model | Type / active | RAM | Prefill (t/s) | Decode (t/s) | Overall† (t/s) | Turn |
|---|---|---|---|---|---|---|
| Qwen3.6-28B-REAP-A3B | MoE / A3B | 17.3 GB | 29.1 | 6.5 | 11.5 | 140s |
| Gemma 4 E4B | dense / 4B | 4.9 GB | 28.9 | 5.7 | 9.5 | 147s |
| Gemma 4 E2B QAT UD-Q4_K_XL | dense / 2B-ish | 2.5 GB | 99.6 | 12.9 | - | 18s/128 tok |
| Gemma 4 26B-A4B | MoE / A4B | 16.9 GB | 38.8 | 5.1 | 9.1 | 154s |
| Qwen 3.5-9B | dense / 9B | 5.6 GB | 22.5 | 4.5 | 8.2 | 195s |
| Gemma 4 12B | dense / 12B | 7.3 GB | 18.7 | 2.46 | 4.3 | 322s |
| Gemma 4 12B QAT UD-Q4_K_XL | dense / 12B | 6.3 GB | 25.0 | 3.6 | 4.2 | ~86s/300 tok |
†Overall = (prompt + completion tokens) ÷ total compute time - blends prefill and decode for the turn.
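The footnote's formula is easy to reproduce. The sketch below assumes a turn is just prefill followed by decode with no other overhead, and the 900/700 token split is an illustrative pick from the "roughly 700-900 in, 700 out" range mentioned earlier, not a logged value.

```python
def turn_stats(prompt_tokens: int, completion_tokens: int,
               prefill_tps: float, decode_tps: float) -> tuple:
    """Return (total seconds, overall tokens/s) for one prefill + decode turn."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = completion_tokens / decode_tps
    total_s = prefill_s + decode_s
    overall_tps = (prompt_tokens + completion_tokens) / total_s
    return total_s, overall_tps


# Qwen3.6-28B-REAP-A3B rates from the table, with an assumed 900-in / 700-out turn.
total_s, overall_tps = turn_stats(900, 700, prefill_tps=29.1, decode_tps=6.5)
print(f"{total_s:.0f}s per turn, {overall_tps:.1f} t/s overall")
```

With those inputs the turn comes out at about 139 seconds and 11.5 t/s overall, in line with the table's first row. Decode accounts for roughly 108 of those seconds, which is why the decode rate is the number that matters.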
So yes, it can run fairly decent models, but at slightly over 2 minutes a turn, not in a usable way. That doesn’t mean it can’t run LLMs, just that it can’t run moderately serious ones at speed (still, I’m pretty sure you can stuff a smaller Qwen variant in there and do simple things like home automation).
Since I happened to be playing with a few of these models on my RTX3060 (where they work at 4-8x the speed, making them quite usable), I copied the weights across and had Codex script out the same run across them with a few variations in settings:
| Model | Note | Prefill t/s | Decode t/s |
|---|---|---|---|
| Qwen3.6-28B-REAP + ngram spec | copy-heavy task, 81% accept | 29 | 15.5 (2× peak) |
| Qwen3.6-28B-REAP @ 64K ctx | light context | 33.1 | 7.8 |
| Qwen3.6-28B-REAP @ 262K ctx | full native context | 21.5 | 9.8 |
| Qwen3 0.6B | tiny model | 293 | 55 |
| Qwen3.6-28B Q4_0 (requant) | deep 9K ctx | 21.7 | 3.5 |
| Qwen3.6-35B-REAP + MTP | non-viable on this CPU backend | - | - (stalled) |
The pattern is reasonably clear: on this memory-bandwidth-bound board, decode rate scales roughly with 1 / active parameters. Sparse mixture-of-experts and sub-4B dense models “work”, but anything with more than about 3B active parameters just doesn’t, really. And multi-token prediction (MTP), which I had gotten to work pretty well on my 3060 under go-pherence, stalls completely.
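A back-of-the-envelope model shows why decode speed follows active parameters on a bandwidth-bound machine: every generated token has to stream the active weights out of memory once. The sketch below is a rough approximation that ignores KV-cache traffic and caching effects, and the 4.5 bits per weight figure is an assumed average for a 4-bit K-quant, not a measured value.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Weight bytes that must be read to generate one token."""
    return active_params * bits_per_weight / 8


def decode_ceiling_tps(bandwidth_bytes_s: float, active_params: float,
                       bits_per_weight: float) -> float:
    """Upper bound on decode rate if weight streaming is the only cost."""
    return bandwidth_bytes_s / bytes_per_token(active_params, bits_per_weight)


def implied_bandwidth(decode_tps: float, active_params: float,
                      bits_per_weight: float) -> float:
    """Effective bandwidth implied by an observed decode rate."""
    return decode_tps * bytes_per_token(active_params, bits_per_weight)


# ~3B active parameters at an ASSUMED 4.5 bits/weight, decoding at 6.5 t/s.
effective = implied_bandwidth(6.5, 3e9, 4.5)
print(f"implied effective bandwidth: {effective / 1e9:.1f} GB/s")
```

Under those assumptions a 3B-active model at 6.5 t/s implies about 11 GB/s of effective weight streaming, and doubling the active parameters halves the ceiling. That is the 1 / active-parameters relationship in its simplest form.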
Since Gemma 4 just came out (again, for what, the third time?) with QAT, I also tried both its MTP and QAT variants by patching llama.cpp a bit further (by this time I was really hooked).
And splitting the workload across core types actually “worked”: sticking drafters on the slower X100 and the rest on A100 was feasible, but… there’s no fast memory exchange between core types, so it was (verifiably) useless:
Gemma 4 E2B QAT was, however, “useful”, for a rather slow definition of it (~13 t/s), and it is technically multimodal, but on the K3… not usable either. I tossed a 224×224 test image into it, which took roughly 39–47 seconds just to process through the projector, and even though it could identify a solid red square, an equally simple red square/blue circle image came back as “yellow and white”. Might be my code, but by this time I had already started eating into my vacation days, so I decided to call it quits.
The numbers are interesting, though, and make me wonder what a K3-like CPU can do with other kinds of models:
| Model / route | Params (M) | RAM / file | Prefill t/s | Decode t/s | Notes |
|---|---|---|---|---|---|
| Qwen 3 0.6B | 596 | 373 MB | 37.5 | 43.5 | Great demo, zero substance |
| Gemma 4 E2B QAT | 4,630 | 2.5 GB | 99.6 | 12.9 | decent prose/code without toy-model speedups; can use tools, will keep it around |
| Qwen3.6-28B-REAP-A3B | 28,240 | 17.3 GB | 28.9 | 7.15 | quality anchor; large context and actual coding |
| Gemma 4 E4B | 7,520 | 4.9 GB | 27.5 | 6.01 | twice as big as E2B, and twice as slow |
| Gemma 4 12B QAT UD-Q4_K_XL | 11,910 | 6.3 GB | 25.0 | 3.6 | sort of worked, but unusable |
In practice, none of the model, quant, or speculative tricks break the ~7 t/s decode wall for genuinely useful generation on the quality models, and we’re stuck shuffling ~3B of active weights per token out of LPDDR.
I was able to get very close to the C numbers with the E2B QAT, so I will be playing with that a bit more–in fact, I think that the Gemma 4 models are the most interesting thing out there if, like me, you’re stuck with an RTX 3060 and a 36GB RAM M3 Mac as your top inference hardware…
I am quite taken by the K3, to the point where I just got Whisper going on it using go-pherence and am now trying to shoehorn various other things into it, but the summary is as follows:
Software maturity is surprisingly good – Bianbu 4.0 has a modern kernel (6.18), modern tooling, and has had zero papercuts (so far). This is not the “barely boots” RISC-V experience from two years ago, when Go barely worked out of the box on riscv64 without a full rebuild. And no, I didn’t try Rust yet, but I will, eventually.
Thermals are well-managed – never exceeded 68°C even under sustained 16-core load, fan ramps smoothly and is much quieter than the CIX P1.
The shipping storage is great – NVMe-class speeds from the onboard UFS (didn’t feel the need to add an NVMe), no SD card in sight
GPU and video decoding story is better than expected – PowerVR Vulkan 1.3 and hardware decoding both work out of the box.
The “normal” CPU cores (and memory bandwidth) are middling – per-core IPC is roughly half an A720, but the A100 RVV sort of makes up for it.
And even if it can’t do “real” LLMs, I am pretty sure the K3 can handle standard image recognition swimmingly. YOLOv5 has just come out even as I am putting this post together, so I haven’t tested it, but the key thing is that RISC-V is really interesting as a CPU platform now (at least for me). Of course, it has to come with enough RAM (and those 32GB RAM are probably the bare minimum any AI SBC should have for any realistic use), and times are tough, but I look forward to testing its descendant(s).
Right now I’ve embarked on the rather quixotic quest of getting Ideogram 4 on it (and yes, I know it won’t really “work”, but I wanted to have a go and have another “working” implementation besides the RTX back-end), and I expect I will spend a bit of time trying to tweak Qwen or Gemma 4 on it to see if I can have a permanent “house LLM” that doesn’t suck and can do basic automation (even if slowly) – and I’ll update this post (or add a link to it at the bottom) with any positive results.
Jun 8th 2026 · 3 min read · #ai #apple #automation #design #ios #macos #opinion #siri #wwdc
This was the weirdest WWDC26 keynote in a while, and some of the past ones were visibly phoned in. It was rife with weirdness and flashbacks.
To my surprise, a few of my wish list items actually made it. Naming the next macOS “Golden Gate” was not on my bingo card, though; a little too trippy and a lot too lofty for what is, by Apple’s own tacit admission, a Snow Leopard year: catching up rather than charging ahead.
The self-deprecating tone ran through the whole thing, from a hippy bus that was equal parts weird and funny to the unmistakable sense of a company that spent the past year watching the industry sprint past it on AI and is now not running, but sedately pacing, to catch up.
Much to my surprise, two of my top annoyances got airtime: they’re tackling Spotlight and Mail search, the exact failures I called out, although whether either works once it ships is anyone’s guess.
They’re also doubling down on automation, at least superficially, with vibecoded Shortcuts and a renewed push for third-party Actions. Vibecoding Safari extensions and Shortcuts is the genuinely interesting part: it points at automation rather than novelty, which is more than I can say for yet another Image Playground. None of it erases the brittleness and legacy gaps that made me want a real platform to begin with, but it’s at least pointing the right way. Tab grouping and change detection in Safari are a fun party trick, no more.
And yes, there’s a new, as-yet-unproven Siri (with a completely pointless AI moniker) you summon by holding the power button (part Spotlight, part walkie-talkie, plus a floating gelatinous orb in Vision Pro), and a Siri app trying to be a catch-all bucket for every interaction.
The new voice struck me as a little cringe and overly American, which is an odd note to land on when you want me talking to my machines all day. The feature set is fuzzy: on paper it can touch far more of my data, and moving photos to the shared library by voice would be neat if it works. But Siri has been stuck at “if it works” for fifteen years, and the one thing I actually want (for it to handle my mail and calendar properly) wasn’t demoed in any useful detail.
I wondered whether the automation push would reach HomeKit, and the answer is a shrug: the new camera detection is cute, but a YOLO model has done exactly that for a decade, and the automation logic I actually need stays vague. The rest of my list didn’t show at all: no hypervisor on the iPad, no running my own code without the annual toll, nothing on iCloud sync, the Watch, or SwiftUI. Maybe the sessions turn something up (which is why this is an early read), but my expectations haven’t budged.
The framing around Apple Foundation Models was the bigger tell: we already know there’s Gemini underneath, which leaves me wondering how much Apple is adding beyond the wrapper. Liquid Glass got the same treatment by being walked back in the most face-saving way imaginable, with the old Accessibility transparency slider re-warmed and trotted out as an improvement. Disingenuous is the word, twice over.
Update: Also much to my surprise, they actually mentioned unifying the corner radii, which I completely missed. I must have tuned it out after the 300 random percentage performance improvements they quoted against… no real baseline, really.
Anyway, Apple heard the parts of everyone’s complaints that a) did not force them to walk back Liquid Glass and b) fit the AI story it needed to tell, and stayed quiet on a lot of the boring structural stuff that’s been broken for years. Yes, they are committing to improving performance and fixing some of the most egregious issues, and that’s not nothing; hearing Spotlight and Mail search admitted out loud is more than I expected. But it is mostly Apple’s technical debt catching up with them and, of course, Apple catching up with everyone else where AI is concerned, on its own terms and at its own pace.
Oh, and they deprecated pretty much all of my hardware, too. Kind of expected, much like the usual geographical restrictions, which mean a good chunk of this may not reach Portugal for a year, if at all.
I’m going to give it a couple of days until the dust settles, watch the Platforms State of the Union tomorrow, and then mull things over a bit more. And maybe, somehow, we can chalk up this WWDC as a sort of a win, in the long run.
Jun 7th 2026 · 2 min read · #ai #calibre #mcp #niri #noctalia #notes #weekly
I decided to take a couple of days off and generally tune out, thanks to a few strategically placed bank holidays – which meant my usual mix of relaxing and dealing with a few chores.
For starters, I replaced the battery on our A1466 MacBook Air, which just keeps on trucking – it’s now on its third battery (I swapped the factory one some four, or was it five, years ago). For around EUR 80, keeping that rather nice keyboard/screen/trackpad combination in use was a no-brainer, and it too now runs a Niri desktop, having been converted to Fedora a few months ago.
I’ve been automating away a fairly large chunk of VM and container management – I have a dedicated agent that knows how to manage my Portainer stacks and version them in Gitea, for instance – but as it turns out, LLMs are also pretty good at a few other things, like setting up emulators under Steam (creating nice icons, fixing controller input mappings, tuning upscaling and shaders, and the rest of it).
But I hadn’t let an LLM loose on my Calibre and music collections yet, and – with the right safeguards – it’s been awesome at tidying up metadata. I had dozens of ancient books with slightly broken Calibre metadata, so I’ve been putting together an MCP server that sits next to my library to fix them – mostly because I don’t want to give a model full filesystem access to my NAS, and this way I can snapshot the database whenever it tries anything more extensive. I may well make something more generic, given time.
Jun 5th 2026 · 4 min read · #apple #automation #ios #ipad #macos #rant #wwdc
Michael Tsai’s annual roundup of WWDC wish lists went up this week, and the thing that struck me most wasn’t any single request–it was the mood. There seem to be fewer wish lists than last year, several people openly admitted they couldn’t be bothered to write one, and the lists that did appear are pretty much bereft of any “aspirational” wishes.
In short, most Apple developers seem resigned to their fate, and echoed the same weary plea for a “Snow Leopard” year where Apple fixes things instead of shipping more, er… “liquid” junk.
One thing that is clearly apparent even to me (even though I am not doing a lot of Mac or iOS development save ios-linuxkit) is that we haven’t even got stability in the 26s yet (John Siracusa has a rather mordant take on that in the latest ATP episode), and in a couple of weeks we’ll get betas of the 27s piling bugs on top of bugs.
I already wrote my catalogue of what’s broken last month, so consider this the constructive inverse–roughly the same list, reframed as things I’d actually like to see fixed next week.
None of these are moonshots. Most have been fixable for years, and a fair few were working better a decade ago.
What’s changed for me is the agentic-era stakes: I now point Codex and Claude at almost every tool I use during the day, and Apple’s software is, conspicuously, the part that fights back hardest (although I can’t really go on about it much, this week’s MS Build is chock full of examples where Microsoft is way ahead of Apple in working AI integration, and it’s… just sad to me personally).
My expectations are effectively rock-bottom by now. Apple has become a hardware company where software seems to have been tacked on as a somewhat under-maintained afterthought. But I can’t help but keep a scorecard, so here’s what I’m hoping for–in rough order of how often it ruins my week.
I want Mail to be automatable again. Not necessarily the full plugin API they killed, but an AppleScript dictionary that isn’t frozen in amber and a MailKit surface that can file, tag and search without ceremony–because the one app I live in all day is the one black box I can’t point an agent at. While they’re at it, smart folders and rules that sync from the Mac should finally arrive on iOS, roughly twenty years late.
Spotlight should simply find things that exist. I’d settle for that alone–no AI, no reinvention–just reliable, complete results and the one-line reindex affordance the Mac has had for years made available on iOS, so a corrupted index doesn’t mean a multi-hour restore that breaks Apple Pay and FaceID along the way.
In the agentic era, automation needs to be a first-class platform, not an afterthought. Like many others, I wish for a way to programmatically create and modify Shortcuts; I also want Shortcuts that don’t break between OS releases, a genuine cross-platform story, and the MCP-style hooks that OpenAI and Anthropic have to keep reinventing to automate anything in macOS. Windows still does COM and Win32 automation so well that I built an agent tool against it in fifteen minutes–Apple should be embarrassed by that comparison.
Give the iPad back a hypervisor. Hypervisor.framework has been on the Mac since Yosemite and Apple Silicon runs Linux VMs beautifully, yet a EUR 1,400 iPad Pro with an M4 can’t run a container or a VM that a EUR 50 ARM board handles without breaking a sweat. The entire local-LLM and coding-agent ecosystem I depend on is locked out of the most powerful tablet I own.
HomeKit needs a scripting layer and real logic. Scene chaining, granular presence, if-this-then-that that actually works, and–for the love of everything–let HomeKit automations call Shortcuts, not just the reverse. I’ve papered over all of it with Node-RED and Home Assistant, but none of that should be necessary for someone who bought into the ecosystem.
Make iCloud sync trustworthy and give us Sync Now buttons across the core apps, the way Messages already has (for now, until they notice and remove it). Stop silently migrating data to CloudKit and leaving the CalDAV and IMAP paths to rot–document third-party access properly instead of letting Reminders and Notes quietly vanish from open protocols. Apple has never exposed any APIs worth using, and that needs to change.
The Watch should be the best time-aware device Apple makes, and instead it’s a widget carousel. I want a Pebble-style chronological timeline, a Smart Stack that’s actually aligned with my calendar, and the Watch independence Imthaz Ahamed asked for–let it pair with more than one phone.
Let me run my own code on my own hardware without an annual EUR 99 toll. I don’t want App Store distribution–I want a “just run this on my phone” mode in Xcode that doesn’t involve certificate chains that expire and silently brick my sideloaded apps.
Stabilise SwiftUI or admit it’s a research project. Views that worked on iOS 17 behave differently on 18 and seem broken on 26, and I lose hours dropping to UIKit to dodge layout bugs reported years ago. Steve Troughton-Smith’s dream of a real cross-platform successor to UIKit and AppKit is the one I’d trade everything else on this list for if I had to write iOS apps for a living.
And no, I’m not going to complain about Liquid Glass again. I don’t think anyone at Apple will ever own up to how much of a failure it was (even down to controls that provide user feedback but don’t register clicks at the very edge of them), and some of it was an improvement (the other 80% of spattering controls atop application content wasn’t).
Every one of these is within Apple’s reach. They have the engineers, the money, and total control of the platform, which is precisely why the pattern grates: this isn’t technical inability, it’s a decade of chosen neglect dressed up as focus, whether you look at it from the pure platform side or if you think about it in terms of the (utterly absent) third-party API integration surface.
This is, unashamedly, a bit of a rant. I’ve been using Macs since System 6 and writing here since the OS X betas, and I’ve watched the company get richer and more capable while the software I use every day gets quietly worse at the boring, essential things. No wonder I have gradually started using other platforms, to the point where most people don’t even consider this a Mac blog.
But I am deeply indebted to Apple for making the platforms that have kept me sane over multiple decades, and I do care about the ecosystem, so… Here we are.
I’d love to be proved wrong next week. I won’t hold my breath–but the scorecard is open, the pen is out, and if all we get is another year of razzle over the dazzle, at least I’ll have a checklist to tick off.
Jun 4th 2026 · 9 min read
·
#agentic
#ai
#anthropic
#codex
#coding
#copilot
#llms
#openai
#workflow
Since today is a bank holiday for me, I decided to consolidate a few more of my notes into a post. What follows is a set of guiding “principles” that I’ve found useful over the past year or so and that I’ve codified into various bits of scaffolding I reuse across my projects.
As usual, I’ve tried to strip away all of the hype and fuzziness and stick to facts, but everyone has their own way of leveraging AI, so your mileage may vary.
However, unlike most of what I read online about AI these days, I am not pitching any specific tooling, although all of this is based on my experience.
Full Disclaimer: I work at Microsoft and have a personal Codex account that OpenAI provided for my OSS work, as well as access to random Tier 2 providers that I use to test piclaw.
A great example I usually point out is that if you ask an LLM to do extensive error handling on a piece of code, it will almost invariably (at least in TypeScript) generate empty catch(){} blocks and call that “error handling”.
Another is when I asked it to optimize a particular tree traversal function for an edge case and it just hard coded the result.
And this applies to nearly everything you ask any LLM to do–but code can be validated, and tested, and measured in various dimensions, and you can turn some of its foibles against it.
In the case of the first example above, a linter will catch that, and you can force the AI to turn those empty catches into something useful (like warning messages in logs).
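To make that concrete, here is a minimal sketch of such a check, written in Python for brevity. The regex and the `find_empty_catches` name are illustrative assumptions; a real linter rule (ESLint's `no-empty`, for instance) works on the parsed syntax tree rather than on raw text.

```python
import re

# Matches `catch (e) { }` or `catch { }` whose body contains only whitespace.
# Assumption: a regex is good enough for a sketch; it will not understand
# comments, strings or nested templates the way a real parser does.
EMPTY_CATCH = re.compile(r"catch\s*(\([^)]*\))?\s*\{\s*\}")


def find_empty_catches(source: str) -> list[int]:
    """Return the 1-based line numbers where an empty catch block starts."""
    return [
        source.count("\n", 0, match.start()) + 1
        for match in EMPTY_CATCH.finditer(source)
    ]


sample = """try {
  risky();
} catch (err) {
}
try {
  risky();
} catch (err) {
  console.warn("risky failed", err);
}
"""

# Only the first catch is empty; the second one logs a warning.
print(find_empty_catches(sample))
```

Wired into a `lint` target that exits non-zero on any hit, this turns the lazy idiom into a build failure rather than a review comment.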
The second one is nastier, but it too can be fixed through proper test fixtures (dynamic but non-repetitive).
Which is why I invariably wrap all my AI-driven projects into several layers of deterministic testing and automation.
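For the hard-coded-result case, "dynamic but non-repetitive" can be sketched as randomly generated inputs checked against a slow reference implementation. Everything below is a hypothetical stand-in (the tree shape, `optimized_sum`, the round count); the point is only that the expected answer is never a constant the model can memorise.

```python
import random


def reference_sum(tree):
    """Slow but obviously correct: recursive sum over a (value, children) tree."""
    value, children = tree
    return value + sum(reference_sum(child) for child in children)


def optimized_sum(tree):
    """Stand-in for the function under test (iterative traversal)."""
    total, stack = 0, [tree]
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)
    return total


def random_tree(rng, depth=0, max_depth=4):
    """Build a random tree; shape and values differ from one tree to the next."""
    n_children = 0 if depth >= max_depth else rng.randint(0, 3)
    children = [random_tree(rng, depth + 1, max_depth) for _ in range(n_children)]
    return (rng.randint(-100, 100), children)


def check(seed, rounds=200):
    """Compare both implementations on `rounds` random trees."""
    rng = random.Random(seed)  # the seed is reported so a failure can be replayed
    for i in range(rounds):
        tree = random_tree(rng)
        expected, actual = reference_sum(tree), optimized_sum(tree)
        assert expected == actual, f"seed={seed} round={i}: {actual} != {expected}"
    return rounds
```

In CI the seed would come from something like `random.SystemRandom().randrange(2**32)` and be printed in the report, so every run exercises different trees but any failure stays reproducible.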
The ground rule I follow is that even SOTA models are inherently unreliable, so when I set up a project or after the first few days of goofing around with a prototype, I try to make sure everything runs on rails.
I typically start with putting together a Makefile because it works/is preinstalled everywhere, is extremely familiar to LLMs, and means I have to do zero thinking myself when running steps manually, but you can use whatever you want.
The important thing is that it must cover the entire development and release cycle, because your agent will inevitably start drifting off and forget how it should do things.
I set it up like this:
Makefile targets to do everything (that way there is no “secret sauce” only the model “knows” to do tests, a build, etc.)
linting/static analysis (go vet is great, but you should also prepare for typical LLM “lazy” idioms like empty catch blocks, which should be considered critical errors)
tests (unit/fuzzing/functional)
builds
packaging
upstream dependency updates (packages and vendored files)
One or more SKILL.md file(s) that explain how to use the Makefile and cover the dev/test/debug/release workflows. You should make sure those are referenced from AGENTS.md or use the .github/copilot conventions (insert your flavor of choice here).
The key thing is to always aim for reproducible steps. The model will always go off into the weeds seeking an adventure regardless of how many admonitions you put in AGENTS.md or equivalent, especially when debugging things, but the Makefile (or equivalent) should be your ground truth.
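One cheap way to keep that ground truth honest is a check that fails when the Makefile stops covering the whole cycle. The sketch below is an assumption-laden illustration: the target names in `REQUIRED_TARGETS` are hypothetical, and the regex only handles plain explicit rules, not pattern rules or includes.

```python
import re

# Hypothetical target names; substitute whatever your Makefile actually defines.
REQUIRED_TARGETS = {"lint", "test", "build", "package", "deps"}

# A rule line starts at column 0 with one or more target names followed by a
# colon that is not part of a ":=" assignment. Recipe lines start with a tab
# and therefore never match.
TARGET_LINE = re.compile(r"^([A-Za-z0-9_.\- ]+):(?!=)")


def makefile_targets(text: str) -> set[str]:
    """Collect the target names declared by explicit rules."""
    targets = set()
    for line in text.splitlines():
        match = TARGET_LINE.match(line)
        if match:
            targets.update(match.group(1).split())
    return targets


def missing_targets(text: str, required=REQUIRED_TARGETS) -> list[str]:
    """Return the required targets the Makefile does not define, sorted."""
    return sorted(set(required) - makefile_targets(text))


sample = (
    ".PHONY: lint test build\n"
    "VERSION := 1.0\n"
    "lint:\n"
    "\tgo vet ./...\n"
    "test: lint\n"
    "\tgo test ./...\n"
    "build:\n"
    "\tgo build ./...\n"
)
print(missing_targets(sample))
```

Running something like this in CI means a session that quietly deletes or renames a target gets caught before the next session starts improvising its own steps.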
The SKILL.md files are… Well, of dubious value, really. I’ve found recent regressions in GPT 5.x to have made them less effective since unlike gpt-5.3-codex newer models often don’t even read the files, but your mileage may vary.
In short, LLM-written tests are generally crap. Anthropic models, in particular, just plain cheat at writing them, so if you ask your LLM to write them, make sure you actually read them.
Unit tests written by LLMs very seldom do anything beyond the obvious, miss edge cases, etc. The only models that write halfway decent tests (as of mid-2026) are the Codex family of GPT models, and even vanilla 5.4/5.5 regressed on that from my standpoint, so my usual tactics are:
Build a set of prompts to have different models refactor tests without looking at the internals of your code (i.e., focus on contracts).
Treat tests as a black box that outputs a report, so that the session you are coding in does not see the tests and the session that runs and writes the tests does not see the code. You can call these different agents if you want–I call it separation of concerns.
Set up CI/CD flows that run all of the tests with zero agent intervention, but have CI/CD generate concise Markdown reports the agents can consume.
The last point is critical, so set it up as soon as you can–it frees up time on your machine and any decent agent can use gh (or equivalent) to fetch CI/CD artifacts, review the results and file issues for itself.
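What "concise" might look like is sketched below: counts first, then one line per failure, with everything else dropped. The input shape (a list of dicts with `name`, `status` and `message`) is an assumption made for illustration, not the output of any particular test runner.

```python
def markdown_report(results, max_failures=20):
    """Render test results as a short Markdown summary.

    results: list of dicts with 'name', 'status' ('pass' or 'fail') and an
    optional 'message' (hypothetical shape, adapt to your runner's output).
    """
    failed = [r for r in results if r["status"] != "pass"]
    lines = [
        "# Test report",
        "",
        f"- total: {len(results)}",
        f"- passed: {len(results) - len(failed)}",
        f"- failed: {len(failed)}",
    ]
    if failed:
        lines += ["", "## Failures", ""]
        for r in failed[:max_failures]:
            # Keep only the first line of the message so tracebacks
            # do not flood the agent's context window.
            first = (r.get("message") or "").strip().splitlines()[:1]
            lines.append(f"- `{r['name']}`: {first[0] if first else 'no message'}")
        if len(failed) > max_failures:
            lines.append(f"- ... and {len(failed) - max_failures} more")
    return "\n".join(lines) + "\n"
```

The full logs can still be uploaded as a separate artifact; the Markdown file is the part the agent reads first.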
This is where SOTA models shine. Even Sonnet, bless its little stupid heart, can take a set of requirements and distill them into user stories and feature files much faster than formal committee-style BDD processes, and the quality and coverage (so far) seems to be better than humans’.
If you work with customers, this last bit is very important–humans will want to describe the user stories that matter to them in exquisitely irrelevant detail while completely skimping on the ones they don’t care about, whereas LLMs won’t care if they are describing boring bits or not, and they won’t quibble at the details–they will just do it.
The resulting user stories need to be reviewed, of course, but piping UX requirements through an LLM and Gherkin typically generates pretty decent scripted tests, especially if the LLM can look at your Preact/Vue/etc. code and build corresponding Playwright scripts.
This will save you weeks of work, and catch dozens of inevitable regressions as LLMs subtly break your front-end code en passant while implementing new features.
Mind that I never rely on the LLM to run Playwright for the actual tests directly–it will cheat, be creative about how it inputs things, refresh the page to see if the DOM changes and break test state, etc. It’s fine to use it to explore an app and draft the scripts, but when you run these things in CI/CD, you want them to be extremely deterministic.
And you want evidence of all functional tests, so I have a little toolkit to gather that evidence:
Playwright for web testing
tmux for TUI testing (rmux is also a thing now, but if you work in regulated industries the paperwork to get it baked into an image will likely outweigh the benefits)
A custom VNC harness for my retro emulators (using tesseract for OCR, which is surprisingly capable)
And, sometimes, a webcam or a USB video capture adapter (plus a sub-agent that only describes what it sees)
As a bonus, besides a Markdown report, I also generate a PDF report with screenshots and logs for the failing cases–and an override switch to screenshot all the tests for occasional audits.
LLMs will always mangle long files, regardless of how big the model or context window is. Anthropic models (as of mid-2026) are particularly prone to that for some reason (as well as “drive by shootings” where they mangle tangentially related files).
You need to decrease your exposure to this kind of risk and do some proactive damage control by decreasing the impact of any such errors. It is not a matter of if, it is a matter of when, and it will nearly always manifest as weird regressions a few days down the line.
What I do:
If possible in your harness, disable full-file write tooling and force the model to use edit or diff for focused edits. The added friction will typically prevent it from mangling entire files.
Set strict caps on file sizes and (depending on the kind of package) guidelines for breaking up functionality.
Review changes to see if unexpected files were touched (I have been meaning to create a SKILL.md for doing this automatically, but eyeballing it by listing uncommitted files is just easier).
Sometimes I wish I could just make unrelated files read-only before letting the LLM loose on React/Preact code, so I am looking into LSPs and static analysis to see if I can do the coding equivalent of raycasting–projecting out which files would be related to a specific change.
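Short of true read-only protection, the "were unexpected files touched" review can be approximated with a small script. This is a sketch under assumptions: it parses the short format printed by `git status --porcelain`, and the allowed globs are whatever you decide a given change is permitted to touch.

```python
import subprocess
from fnmatch import fnmatch


def changed_paths(porcelain: str) -> list[str]:
    """Parse `git status --porcelain` output into a list of paths."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]              # skip the two status letters and the space
        if " -> " in path:           # renames are printed as "old -> new"
            path = path.split(" -> ", 1)[1]
        paths.append(path.strip('"'))
    return paths


def unexpected(paths, allowed_globs):
    """Return the paths that match none of the allowed glob patterns."""
    return [p for p in paths if not any(fnmatch(p, g) for g in allowed_globs)]


def uncommitted() -> str:
    """Ask git for the current uncommitted changes (needs a git checkout)."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Typical use would be `unexpected(changed_paths(uncommitted()), ["src/ui/*"])` right after a session that was only supposed to touch the front-end; anything it returns deserves a closer look before committing.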
Every few sessions, stop and refactor the code. Most technical debt from AI use comes from letting it literally piss all over your nice module structure.
In particular, I’ve found that LLMs like to define redundant types and duplicate code pretty much at random because they can’t see across your entire code base. If they’re operating in one part of the tree, they’ll be completely oblivious to the rest.
What I do is that once I have implemented one feature (or a sequence of features) and tests pass, I aggressively go in and review every single type, helper and filename.
Models can do baseline audits (the trope about OpenAI models fixing code Anthropic ones wrote is very much true in my experience), and you can trust the outlines of the audits, but with some caveats:
They will always cut short the depth to which they analyze code
They will often stop at module or dependency boundaries
They will only try to merge or remove duplicate code if it is blatantly obvious (and even then it is not a guarantee)
I do use models for audits, but only as a starting point. Then I go in and:
Point out where there was feature creep or duplication of code/responsibilities in the module structure
Enforce things like centralized logging
Manually flag duplicates and give instructions by adding TODO comments to the code
In Go (which I have sort of gravitated to recently due to the balance of great profiling and refactoring tools and less cognitive overhead than Rust), gopls can significantly help the model do most file splitting/refactoring automatically and without any chance for the model to mess things up, so every so often I fire up a dedicated session, hand it a prebaked set of guidelines and do a full-on refactoring pass.
Models have a tendency to follow “best practices” to a point where they create untenable messes of nested abstractions, very much like the sort of people who write Python as if they were cosplaying at writing Java–classes, accessors and factories everywhere, etc. You know what I’m talking about.
This is something that initial SPECs and system prompts actually help with, until the context window is so full that those guidelines are “forgotten”.
Weed those out ruthlessly. By all means define reusable contracts and use strong typing (TypeScript is a godsend in that regard), but expect your linter and LSP to catch your LLM red-handed.
There are many ways to work with AI, and none of them work for everyone, but there are some basic tenets I follow:
Shorter Sessions = more attrition. One-shotting features will just create more pain and technical debt down the line, and they foster an illusion of progress, not stuff you can actually rely on.
Make sure you are willing to put in the design and spec effort. The more you think and plan yourself, the more grounding you can provide to an agent to keep it on track.
Leaving the agent to its own devices for an hour or so will give you time to ponder–yes, it might be risky token-wise if you haven’t specced out the work well enough, but that is part of the challenge here.
I think Ralph loops are profoundly stupid and wasteful, but am very much a fan of writing a SPEC, chunking it into a plan.md (or your harness’ equivalent) that includes clear directions for testing and then using things like /goal complete the plan.md file, because that provides the agent with a clear cut set of steps.
Goal seeking of various forms (autoresearch, performance optimizations, etc.) can be extremely effective and reliable, but only if you’ve stacked up most of the previous tricks written above (and even then I’ve caught LLMs cheating at benchmarks in the most egregious way: “the simplest option is to not execute the query” is a real thing that actually happened).
Again, do not trust any of the code the agent puts out. And even if it works, keep track of how it works–in a sentence, instrument the crap out of everything:
Enforce structured logging as soon as possible, and have automated checks to ensure that errors/exceptions/etc. are logged.
Maintain a set of benchmarking/regression tests that output actual metrics (if you don’t use OpenTelemetry, try to at least have a text file with key metrics)
Be very thorough about regression testing. Taking the time to rebuild and run last week’s version will often show that you’ve missed either testing for something or measuring something important.
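The "text file with key metrics" fallback can be as simple as `name value` lines compared against a stored baseline. The sketch below assumes that format, assumes every metric is positive and "lower is better", and uses an arbitrary 10% tolerance; the metric names in the usage note are made up.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name value' lines; blank lines and '#' comments are ignored."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        metrics[name] = float(value)
    return metrics


def regressions(baseline, current, tolerance=0.10):
    """List metrics that got worse by more than `tolerance`, or disappeared.

    Assumption: higher values are worse (latency, memory, startup time).
    """
    problems = []
    for name, old in sorted(baseline.items()):
        new = current.get(name)
        if new is None:
            # A metric that silently vanished is as suspicious as a slow one.
            problems.append(f"{name}: missing from current run")
        elif new > old * (1 + tolerance):
            problems.append(f"{name}: {old:g} -> {new:g}")
    return problems
```

Comparing a baseline of `p95_ms 120`, `rss_mb 300`, `startup_ms 40` against a run that reports `p95_ms 150` and `rss_mb 305` would flag the latency jump and the missing startup metric, while the small memory increase stays within tolerance.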
Again, CI/CD is your friend here, and a lot of my time, even on personal projects, has been spent on building test and smoke harnesses of various kinds:
Mock up external APIs and write various failure modes into the mocks so that the LLM will have to deal with “errors” from the start.
When doing emulation/JIT work, create a test harness for each specific operation that you can gdb through (LLMs can actually do this pretty well), then a smoke harness that you can compare with QEMU, etc.
When doing microcontroller work, build and test subroutines separately in the host machine before assuming they will work in the microcontroller.
When doing inference optimizations (like in go-pherence), cross-check similar kernels across back-ends and architectures to ensure they all provide the same results
The list goes on, but the key thing is that everything should be automatable and outside the control of the LLM.
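As an illustration of the first item, here is a sketch of a mock that bakes failure modes into its script. `MockWeatherAPI`, its method and its payload fields are all hypothetical; the idea is just that the code under test meets timeouts, server errors and malformed payloads on its very first run.

```python
import itertools


class MockWeatherAPI:
    """Hypothetical external API mock that cycles through scripted failure modes."""

    def __init__(self, script=("ok", "timeout", "http_500", "malformed")):
        self._script = itertools.cycle(script)

    def get_temperature(self, city):
        mode = next(self._script)
        if mode == "timeout":
            raise TimeoutError(f"no response for {city}")
        if mode == "http_500":
            raise RuntimeError("HTTP 500 from upstream")
        if mode == "malformed":
            return {"city": city}  # payload is missing the 'temp_c' field
        return {"city": city, "temp_c": 21.5}


def safe_temperature(api, city):
    """Code under test: must survive every failure mode and report what happened."""
    try:
        payload = api.get_temperature(city)
        return payload["temp_c"], None
    except (TimeoutError, RuntimeError, KeyError) as exc:
        return None, f"{type(exc).__name__}: {exc}"
```

Because the script is deterministic, a test can call the client four times and assert on each outcome in order, which keeps the whole thing outside the LLM's control.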
Is all the above hard work? Yes. But can you take most of it along with you when you start a new project? Also pretty much yes–and the icing on the cake is that once you’ve gotten the basics down, the principles are all transferrable across stacks/environments/runtimes and the thought process will keep your wits sharp.
Not to mention these things will save you a bunch of time.
Jun 2nd 2026 · 1 min read
·
#agents
#ai
#microsoft
After years of rumors, NVIDIA is finally shipping an Arm chip for Windows PCs, and the part that interests me isn’t the GPU–it’s the up to 128GB of unified LPDDR5x memory sitting behind it, something that Qualcomm never really went for.
The RTX Spark is essentially a consumer rebrand of the DGX Spark dev box (which I’ve been trying unsuccessfully to get my hands on, by the way), pairing a 20-core Grace CPU (co-designed with MediaTek, all big and “medium” cores, no efficiency cores) with up to 6,144 Blackwell cores, roughly a desktop RTX 5070’s worth of GPU inside an 80W envelope.
Might be a little toasty for a laptop, and will have to be very power efficient if they really want to compete with Apple Silicon… But there are zero actual specs anywhere on the PR, and pricing is sure to be… interesting.
But it’s nice to see them chasing the same unified-memory architecture that makes Apple’s M5 Pro/Max and the Framework Desktop genuinely useful for running local models, since 100GB+ of addressable VRAM is a lot more useful than the insulting 8-12GB you get on a discrete 5070.
And the gaming angle also makes it pretty interesting. Prism translation has finally gotten good enough that productivity work feels indistinguishable from native, but gaming remains a minefield of kernel-level anti-cheat drivers that simply refuse to run. Qualcomm never “fixed” that (or pricing, or efficiency, for that matter).
If it didn’t feel like the end times for computer hardware right now, this would be amazing.
May 31st 2026 · 3 min read
·
#go-pherence
#hardware
#networking
#niri
#notes
#weekly
Today I realised that I could just spend the day doing essentially nothing and that nobody would hold it against me (at least in Western nations), so… I might well do just that, with a few caveats:
Something very weird happened after I published my notes on last week’s Wi-Fi tweak – it made it to Hacker News (a day or so after I submitted it myself, because, as usual, most of my self-submitted links still appear to be shadow-banned despite 30K+ karma–and no, I don’t understand that either), and it was very popular among the usual band of armchair networking experts.
But then something really weird happened: I got an alert from Cloudflare that the lowercase-rewrite worker I’d deployed as a fallback for incorrect linking was exceeding the free-tier limit (100,000 runs, if I recall correctly), which made me curious enough to dig into the analytics:
The control chart doesn't lie. Those orange dots are not normal.
I have CF’s anti-bot crawling settings active, I turned on CAPTCHAs again after the initial peak, and yet… 70,000 views in an hour, twice? Has to be crawlers. And how did CF let them through and count them?
So I went and plotted Clarity’s chart of “human” visitors (always an undercount, since it only captures people with JS enabled and no ad-blocking, but useful as a sanity check):
The real HN spike was Thursday. Everything after is noise.
Definitely bots after the initial HN flood. I have to wonder why, why now, and whether Cloudflare’s free tier is still even marginally effective at blocking them.
The most interesting work this week was grafting speaker diarization onto go-pherence. Whisper tells you what was said; knowing who said it is a separate problem, and the standard answer is SpeechBrain plus a Python subprocess plus a fairly heavy PyTorch dependency. I did not want any of that. Instead I ported ECAPA-TDNN – the speaker embedding model SpeechBrain uses – to Go, and it all now mostly works with zero Python, even if it still needs a lot of tweaking.
There’s a speakercheck validation harness that runs spot-checks against windowed audio segments, scores against expected speaker labels, and outputs JSON reports, and a diarize-vtt command that accepts an optional ECAPA model and emits speaker-tagged VTT output. I expect to drop this onto one of my current hardware test subjects soon.
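For readers who have not seen speaker-tagged VTT before, the sketch below shows the general shape using WebVTT voice spans (`<v Name>`). It is written in Python rather than Go, the segment tuple layout is an assumption, and it is not the actual output format of diarize-vtt.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def speaker_vtt(segments) -> str:
    """Render (start_s, end_s, speaker, text) tuples as WebVTT cues.

    Each cue carries its speaker in a voice span; the closing </v> tag is
    omitted because the span covers the whole cue text.
    """
    out = ["WEBVTT", ""]
    for start, end, speaker, text in segments:
        out.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        out.append(f"<v {speaker}>{text}")
        out.append("")
    return "\n".join(out)


print(speaker_vtt([
    (0.0, 2.5, "Speaker 1", "Whisper tells you what was said."),
    (2.5, 5.0, "Speaker 2", "Diarization tells you who said it."),
]))
```

Players that ignore voice spans still show the text, so tagging cues this way degrades gracefully.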
Allergy season is finally fading (at least for me), but today was the first time I had to turn on the AC in the office, and it was great to realize that despite the recent Wi-Fi changes and almost four years of potential HomeKit foibles, my ESP32 hack is still working perfectly.
Those minor joys aside, I’ve been actively trying to get out of the house to do some exercise at least one hour a day and it is clearly not going to happen at lunchtime anymore–well, not every day, at least, so I’m starting to get cabin fever.
All of this to say that I’m feeling as if I am starting down the slippery slope to both physical and mental burnout again, and this time I’m backing off as early as possible.
For starters, I am currently profoundly annoyed at my current working arrangements, since my days of wall-to-wall meetings with completely random 15 minute breaks are both utterly destroying my health and eroding my ability to focus. Sometimes, and despite being remote for many, many years, I would really prefer to be back working at an office, if only because I miss walking about and using stairs to go and talk to people.
Turns out my closest project team are now in Madrid (plus Belgium, Sweden, Canada, etc.), so that isn’t going to happen. And, truth be told, online meetings are now so stupefyingly more productive (as meetings go) that actual work is still best done remote–as long as you can cut through the tremendous amount of AI-augmented cruft that a meeting now entails.
I, as usual, have been pragmatic about it and crafted my own agent to summarize meetings the way I want them, and to craft terse, minimalist works of corporate obeisance that avoid the walls of text I get by default and focus on the stuff I need to do instead of spouting corporate cheerleading (it has become a surprisingly popular hack).
Anyway, my priority is now, again, my well-being. But I feel like my entire lifestyle is in dire need of an intervention, and the obvious life hacks most people suggest like exercising in the early morning (when I am trying to do my daily reading and research) or at the end of the day (when I am just bog tired) just don’t work for me, so the upshot of all this is that I am currently trying to carve out slots throughout the week to just get out of the house for 30 minutes.
Which is completely stupid.
This has to change (somehow). In the meantime, part of that carve-out is also going to be about mental health–I’m phasing out Twitter/X again, as well as a bunch of other “social” distractions and hypefests like HN.
May 26th 2026 · 6 min read
·
#cudy
#hardware
#homelab
#networking
#openwrt
#usteer
#wifi
A few months after writing up the Cudy AX3000 units and moving the house over to OpenWRT, I ended up revisiting the one bit I had deliberately waved away as “good enough”: roaming.
My sinuses are still giving me grief, but this week was much more successful at pretending to be enjoyable, at least. For starters, we watched Project Hail Mary, and it was every bit as good as I would expect it to be, which is very rare in movies these days.
I think it’s time for an update on my iPad Pro M1 and, most importantly, the Logitech Combo Touch I got for it. Think of it as a long term review of sorts.
This is a little bit of follow-up to my MiniBook X review – I keep using it routinely (especially when we travel for leisure) and love the little thing to bits, but I’ve been wanting to run it mostly on power saving mode to reap the most benefit out of the hardware (and battery, of course), so I started looking at desktop environment alternatives.
Yes, I could already get a full afternoon (and then some) out of it, but Apple Silicon has spoiled me as far as battery life expectations go, and GNOME has a little bit too much baggage for that kind of extended use.
Since I spend 90% of my time on it writing or coding and still have a penchant for keyboard-driven desktops, I initially switched to Fedora Sway Atomic (gotta love being able to swap environments with a single command…), but later installed Niri and Noctalia Shell because I really like both the idea of a scrolling window environment and the sheer polish of the whole thing–even if there are some rough edges here and there.
I am very happy with it, and writing plugins for it is trivial:
I hacked together a Bing Wallpaper plugin in 30m
The one thing that annoyed me to no end, though, was locking on suspend, which Noctalia Shell should do but apparently doesn’t in Fedora, so I had to resort to two hacks:
This last one feels extremely gauche and I hope to find a better way, but I guess this comes with the territory. I don’t really care about having a trendy Wayland desktop (I just want a dead simple one with a bit of polish), but I hope this kind of hack won’t be necessary for much longer.
Oh, and of course I set gsettings set org.gnome.desktop.wm.preferences button-layout 'close,minimize,maximize:appmenu' to match macOS decorations.
I know this blog has strayed a fair distance from its Mac-centric origins, but I’ve been keeping a mental list of all the things that are broken, missing or inexplicably neglected in Apple’s software, and it’s gotten long enough that writing it down feels like a public service.
The weather has gone a tad cloudy again, which provided me some relief from my allergies–but not enough for proper overnight rest, so yet again I arrived at Friday afternoon totally exhausted.
Last weekend my DS1019+ decided, for some unfathomable reason, to stop working after I took it out of the closet, dusted it and put it back, and I have feelings about it.
The Ternus announcement got me thinking about the one thing I keep wishing Apple would build and almost certainly never will: a family-scoped AI assistant that actually works across all our devices.
I am very late to this party (it was announced at I/O last week and I’ve been buried in other things), but Google is replacing Chromebooks with “Googlebooks”–Android-based laptops with Gemini baked in, designed to sync with your phone.
On the face of it, this looks like yet another Google rebranding exercise, and considering what happened with the Pixel and Google’s penchant for unveiling “category defining” devices they never actually sell worldwide, my first reaction was “meh”.
But with Android’s recent support for desktop windowing, resizable apps and Linux sandboxing, this is actually very interesting for me–because it means you can have a laptop that runs Android apps natively, has a proper desktop shell, and can spin up a Linux container for development work. All on ARM hardware.
If they get the desktop UX right (which is a big “if” given Google’s track record with consistency–I’ve set up Android 16 on a Pi and it sucks), this could be a genuinely compelling alternative to both Chromebooks and cheap Windows laptops–especially for people who already live in the Android ecosystem and want something that doesn’t fight them the way iPadOS does.
May 12th 2026 · 1 min read
·
#agents
#ai
#ide
#ipad
#opinion
#piclaw
#ux
This was a weird week, both because I keep waking up at 5AM with my sinuses clogged, and because I feel like I’m losing momentum. Feeling almost permanently cotton-headed, sleepy due to sheer exhaustion or because of antihistamines certainly has something to do with it, but I am not exactly enthusiastic this weekend.
Regular readers will know that I’ve spent most of the past two years shoehorning LLMs into single-board computers, partly as a learning exercise and partly because there are lots of local/”edge” applications where semantic reasoning (no matter how limited) and “interpretation” of sensor data are actually useful.
I genuinely did not see this coming. Cloudflare has been building one of the more coherent AI developer platforms out there–Workers AI, AI Gateway, Vectorize, their edge inference stack–all sitting on top of the same network they’ve been quietly expanding for years. They’ve been making real moves in the agentic space, not just slapping an LLM API on top of existing products, and I thought they were doing it in a way that would require more people, not fewer.
And yet: 1,100 jobs gone–roughly 20% of their workforce–with the explanation being that internal AI adoption changed what the company actually needs. I can follow the logic even if I find the timing jarring. I have friends in the Lisbon office–one of their larger European engineering bases, and one of the better things to happen to the local tech scene in recent years–and I’m genuinely hoping they’re alright.
May 7th 2026 · 2 min read
·
#agents
#ai
#anthropic
#codex
#coding
#llm
#openai
#opinion
I’ve been getting annoyed at constant code regressions in piclaw for the past few weeks. Something was off–even after bumping the test suite to the point where it catches most mechanical errors, gpt-5.5 kept making unrelated edits to code that should have been left alone, and I was getting really annoyed at babysitting it.
This was an absurdly productive week, at least on a personal level. I’m not sure whether to be pleased or worried about the number of projects that moved forward simultaneously, but here we are.
Of all the Maclock mods I’ve seen since I wrote up my review, this is probably the best all-round solution. It uses a custom PCB to drive a properly fitted display from a Pi Zero 2W running SheepShaver, and the result is a clean, self-contained build with none of the cable-routing bodges that plague most of these projects–and still uses the battery, which is great.
I’ve been deliberately not finishing my own Maclock mod, and this is serendipitous–sometimes waiting yields the exact solution you’d have spent weeks converging on from a worse starting point. The custom PCB is the key bit: it solves the button re-use and display connection in one go, which is the part I kept stalling on.
I’ll be using my own macemu build on it, now that the ARM64 JIT work has made SheepShaver fast enough to make even a regular Pi Zero feel snappy…
May 3rd 2026 · 1 min read · #apps #ghostty #ios #ipad #mosh #ssh #terminal
This has pretty much replaced Blink for me. rootshell is a Metal-accelerated terminal for iPhone, iPad, Vision Pro (ha!) and, surprisingly, the Mac, built on Ghostty’s rendering engine.
It is buttery smooth, and has proper mosh support, which means my sessions survive Wi-Fi handoffs and network changes without dropping.
The Ghostty bit matters because it means the rendering is fast, the font handling is good (it has Fira Code Nerd, which has become my default), and the whole thing feels like a proper terminal rather than the usual iOS compromise. There’s also a built-in AI assistant that can execute shell commands locally, which sounds gimmicky but is surprisingly useful for one-off tasks when you can’t be bothered to type out a long find or awk invocation on a phone keyboard (I got it to work with a Gemini API key).
The one thing I’m missing is the ability to install my own commands like A-Shell lets you–local binaries, custom scripts, that sort of thing. A-Shell’s approach of bundling a minimal Unix userland inside the app sandbox is still unmatched for offline tinkering, but it’s nice to have alternatives.
I’ve been building MCP servers for a while now–I wrote about the general approach last year, started out by creating umcp, and I’ve recently opened up an Office server that’s been battered by enough models against enough real documents that the patterns have settled.
A brisk, brilliantly coded tutorial on vector quantisation: how far you can push compression on model KV caches and embeddings without breaking what matters. The interactive sliders and diagrams do the teaching before the maths catches up.