The MilkV Jupiter 2/SpacemiT K3

This is a fascinating box–so much so that after almost three weeks of playing with it I had amassed enough material to consider splitting this review into two parts. In the end I condensed it a bit and posted a longer piece than usual, even if that means almost half of it is a fairly wide-ranging exploration of how to get AI workloads running on it.

The MilkV Jupiter 2 in its metal case

Spoiler: We’re tantalizingly close to having usable non-GPU inference on SBCs, and surprisingly enough, RISC-V is more interesting than ARM right now.

I’ve tested a lot of ARM boards, but only a couple of RISC-V machines–and the MilkV Jupiter 2 is quite a substantial system: sixteen cores (with a twist), a refreshingly roomy 32GB of RAM, a 10GbE SFP+ cage, Wi-Fi 6, a GPU with actual DRM nodes, all in a Pico ITX form factor.

Disclaimer: my contacts at Radxa supplied me with a Jupiter 2 free of charge, and as usual, this article follows my review policy.

On paper, this is the first RISC-V board that doesn’t feel like a science project.

In person, and unlike most of the SBCs I get, the Jupiter 2 is a finished product, and came in a neat little box, fully assembled and contained in an unassuming metal case with external antennae as the only extra parts. No power brick, but since it has a USB-C PD port, I had zero trouble powering it from one of my monitors.

Hardware

After some careful disassembly, the board itself is pretty dense: 1× DP out, 1× eDP ribbon, 1× USB-C PD power input, 3× USB-A 3.0, 1× GbE RJ-45, 1× 10GbE SFP+ cage, an M.2 slot and what looks like a second M.2 for storage. There are also MIPI/eDP ribbon connectors I haven’t tested.

The board is dwarfed on the top side by the cooler, which I dared not remove

The SoC is SpacemiT’s K3–a big.LITTLE style arrangement with 8×A100 cores at 2GHz and 8×X100 cores at 2.4GHz, which makes it the first RISC-V chip I’ve handled that has asymmetric core clusters. And since there are a few other devices out there with the same reference design, I will henceforth refer to the Jupiter as the K3 for short.

Specs

The machine I’m testing has a nice assortment of features:

  • 16 RISC-V cores (8× Spacemit A100 + 8× Spacemit X100)
  • 32GB RAM
  • 128GB UFS
  • RTL8852BE Wi-Fi 6 + Bluetooth
  • 1 GbE RJ-45 + 10 GbE SFP+ (RTL8127 10GbE via PCIe)
  • An IMG (PowerVR) GPU
  • NOR flash for bootloader (SPI, 8MB: bootinfo + FSBL + env + eSOS + OpenSBI + U-Boot)
  • PWM fan
  • Pico ITX form factor

The ISA

If you’ve never come across SpacemiT’s stuff before (I had only a bare inkling of the K1), I heartily recommend the public SpacemiT K3 documentation and their GitHub repository since the architecture is laid out there, and it was fairly easy to get a high level grasp. In particular, the K3 SoC datasheet has a pretty good overview:

Block Diagram from the K3 Technical Brief

A key thing to take into account is that the A100 cores are fundamentally different from the X100 ones: they have extended vector instruction sets, dedicated tightly-coupled memory (TCM), and, well… AI.

That documentation also seems to be the original source of the marketing claims that the K3 provides 60 TOPS of AI compute and can run 30B models at over 10 tokens/s. Well, sort of–as another spoiler, I can share that I hit a hard cap at an effective 3B (which seemed to be the practical limit), but we’ll get there…

Hardware Info

The board identifies itself as “SpacemiT K3 Pico ITX” in the device tree, and cores are reported like so:

Architecture:                            riscv64
Byte Order:                              Little Endian
CPU(s):                                  16
Vendor ID:                               0x710
Model name:                              Spacemit(R) A100
  Thread(s) per core:                    1
  Core(s) per socket:                    8
  CPU max MHz:                           2000.0000
  CPU min MHz:                           614.4000
Model name:                              Spacemit(R) X100
  Thread(s) per core:                    1
  Core(s) per socket:                    8
  CPU max MHz:                           2400.0000
  CPU min MHz:                           614.4000
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                10 MiB (4 instances)

One of the nice things about this box is that it comes with a 10GbE Realtek NIC. I wasn’t able to test that at full speed yet since my 10GbE interfaces are all in my server closet, but the 802.11ax reported below worked flawlessly with my Wi-Fi 6 setup:

# lspci
0000:00:00.0 PCI bridge: SpacemiT X100 PCIe Root Complex (rev 01)
0002:00:00.0 PCI bridge: SpacemiT X100 PCIe Root Complex (rev 01)
0002:01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8127 10GbE Controller (rev 08)
0004:00:00.0 PCI bridge: SpacemiT X100 PCIe Root Complex (rev 01)
0004:01:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852BE PCIe 802.11ax Wireless Network Controller

There isn’t a lot to report on the USB front (most of the below is what is plugged into my LG Ultrafine):

# lsusb
Bus 005 Device 002: ID 043e:9a46 LG Electronics USA, Inc. USB2.1 Hub
Bus 005 Device 003: ID 043e:9a48 LG Electronics USA, Inc.
Bus 005 Device 004: ID 043e:9a42 LG Electronics USA, Inc. USB Audio
Bus 005 Device 009: ID 046d:085e Logitech, Inc. BRIO Ultra HD Webcam
Bus 005 Device 010: ID 043e:9a40 LG Electronics USA, Inc. USB Controls
Bus 007 Device 004: ID 04d9:0006 Holtek Semiconductor, Inc. Wired Keyboard
Bus 007 Device 005: ID 093a:2510 Pixart Imaging, Inc. Optical Mouse

The flash storage it ships with is also sensibly organized:

# lsblk
NAME        SIZE TYPE MOUNTPOINT MODEL
sda       119.3G disk            TY7B-128
├─sda1      256M part
├─sda2      256M part /boot
└─sda3    118.8G part /
mtdblock0     8M disk
mtdblock1   128K disk
mtdblock2   512K disk
mtdblock3    64K disk
mtdblock4     1M disk
mtdblock5   384K disk
mtdblock6   5.9M disk

That sda (model TY7B-128) initially fooled me into thinking it was a SATA SSD–but there’s no SATA controller on this board, and the 3.4 GB/s reads I measured later are well past anything SATA III can do (~600 MB/s). It’s actually 128GB of onboard UFS, which rides the kernel’s SCSI layer and so enumerates as sda exactly like a SATA disk would (NVMe would be nvme0n1, eMMC mmcblk*). The mtdblock devices are the 8 MB NOR flash partitions (bootinfo, FSBL, env, eSOS, OpenSBI, U-Boot).

# sensors
pwmfan-isa-0000
Adapter: ISA adapter
pwm1: 60% MANUAL CONTROL

thermal_cluster3-virtual-0
Adapter: Virtual device
temp1: +60.0°C

thermal_cluster1-virtual-0
Adapter: Virtual device
temp1: +60.0°C

thermal_gpu-virtual-0
Adapter: Virtual device
temp1: +63.0°C

thermal_top-virtual-0
Adapter: Virtual device
temp1: +62.0°C

cros_ec-isa-000c
Adapter: ISA adapter
fan1: 3208 RPM

thermal_cluster2-virtual-0
Adapter: Virtual device
temp1: +63.0°C

thermal_cluster0-virtual-0
Adapter: Virtual device
temp1: +64.0°C

thermal_vpu-virtual-0
Adapter: Virtual device
temp1: +60.0°C

The sensors output is a bit weird, but it does cover all the CPU cores (A100 are clusters 0 and 1, X100 are 2 and 3). And I will have a bit more to say about the fan.

But I’m getting ahead of myself here–these were obviously gathered after plugging it in, and it’s worth rewinding and going over that part:

First Boot

This was a first-class experience, and I wish all SBCs worked this way: I plugged the DP port into my ancient LG Ultrafine, powered on the monitor, and got a Bianbu first-boot wizard in less than 5 seconds after the initial logo.

Clicked through it–language, timezone, user account–and landed on a working accelerated desktop. That’s it. No GRUB patching, no DTB hunting, no resize-filesystem bugs, no serial console required. The smoothest first boot I’ve had with an SBC all year.

The board ships with Bianbu 4.0 (“Resolute Raccoon”)–a Debian-based distribution from SpacemiT, which, unlike most ARM boards I’ve used recently, is actually running a modern 6.18.3 kernel.

MilkV Jupiter 2 LXQt on Wayland - note how only the first 8 cores are active

The desktop runs LXQt on Wayland, SDDM as the display manager, and the whole thing felt responsive enough that I didn’t immediately reach for the terminal. That is not something I say about SBC desktops often, and even though I then spent most of the past three weeks accessing it via ssh, I would likely have zero issues using it.

Standard apt works (repos seem to be at spacemit.com), the Debian toolchain is present, and the kernel command line includes some interesting RISC-V-specific hints: unaligned_scalar_speed=fast and unaligned_vector_speed=fast, which, as far as I can tell, make the kernel treat misaligned scalar and vector accesses as fast instead of probing their speed at boot.

I dug around a bit more and the boot chain goes through NOR flash (OpenSBI + U-Boot) → UFS, which is cleaner than the SD-card-based setups on most SBCs I’ve tested, and it was able to update itself without any issues:

Setting up spacemit-ec-firmware (1:0.0.22)
spacemit-ec-firmware: payload installed successfully.
Current EC firmware 'SPACEMIT_PICO_ITX-V00.14' is older than packaged firmware 'SPACEMIT_PICO_ITX-V00.16'.
Starting automatic EC firmware update during package installation...

[INFO] Automatic EC firmware update triggered during package installation.
[INFO] Current RW: SPACEMIT_PICO_ITX-V00.14
[INFO] Target FW : SPACEMIT_PICO_ITX-V00.16
[WARN] Do not remove power, reset the system, or interrupt the tool while flashing.
[INFO] 1. Erasing flash region... Erasing 262144 bytes at offset 0... done.
[ OK ] Erase completed.
[INFO] 2. Writing firmware image... Reading 242688 bytes from /lib/firmware/k3-pico-itx/ec.bin... Writing to offset 0... Writing: [########################################] 100% done.
[ OK ] Write completed.
[INFO] 3. Reading back flash contents for MD5 verification... Reading 242688 bytes at offset 0... Reading: [########################################] 100% done.
[ OK ] MD5 verification passed.
[ OK ] Automatic EC firmware update finished successfully.
[INFO] This reboots the EC firmware only. Linux is not rebooted automatically.
[WARN] The power LED may blink and ectool may be unavailable briefly after the command.
[INFO] Sending EC reboot command...
[ OK ] EC reboot command sent.
[INFO] Waiting 10 seconds for EC reboot to settle...
[WARN] Reboot Linux manually now to restore EC communication.
[INFO] After Linux reboots, verify with: ectool version
Automatic EC firmware update completed.

Not UEFI, but compared to the U-Boot-on-SD-card experience that most ARM SBCs inflict on you, having a proper NOR flash boot chain with OpenSBI → U-Boot → onboard UFS is a step up, because it means you can brick the OS partition and still recover without reflashing an SD card on another machine (and yes, Rockchip, I’m looking at you).

And since it all worked out of the box, I did not try adding an NVMe (there’s an M.2 M-Key slot for one) or booting from it (yet), although since there is official Ubuntu support I fully intend to try that out in the future.

Toolchains

Developer tooling for RISC-V will be foremost on most of my readers’ minds, so I can tell you right away that I am currently making extensive use of these:

  • GCC 15.2 (riscv64)
  • Go 1.25.7 – works out of the box, which is significant for me
  • Python 3.14.3
  • Make 4.4.1

Sadly (for me), Bun isn’t an option, since there’s no official riscv64 build yet, but node works OK. I focused mostly on Go, though.

Performance

To get started, I ran a small battery of tests to get a feel for where this sits relative to the Orange Pi 6 Plus (CIX P1, 12 ARM cores) I’ve been testing.

CPU

Test                     | MilkV Jupiter 2 (16× RISC-V) | Orange Pi 6 Plus (12× ARM) | Notes
7-Zip multi-thread       | 17,547 MIPS                  | 42,346 MIPS                | ARM is 2.4× total
sysbench CPU (1 thread)  | 2,329 ev/s                   | 2,800 ev/s                 | ARM 1.2× per-core IPC
sysbench CPU (all cores) | 16,980 ev/s (8 usable)       | 25,746 ev/s (12t)          | 7.3× vs 9.2× scaling
fib(42) GCC -O2          | 1.110s                       | 0.649s                     | ARM 1.7× faster
Go 50M trig ops          | 2.68s                        | 0.483s                     | ARM 5.5× (Go arm64 mature)
Python 10M loop          | 4.74s                        | 1.07s                      | ARM 4.4×

Note that these benchmarks only ran on the X100 cluster (cores 0–7). The A100 cores (8–15) are kernel-fenced for AI work–htop shows them sitting idle, and sched_setaffinity silently refuses to pin anything there from a normal shell. The reasons for that are various and fascinating, and I’ll get into them below.

The sysbench single-thread number is the interesting one here: 2,329 versus 2,800. That’s only a 1.2× gap per X100 core. The 7-Zip figures (17.5k vs 42.3k MIPS) look damning until you realize that the A100 cores weren’t used at all, so the Jupiter 2 is really running 8 general-purpose threads against the P1’s 12.

The real gap shows up in Go and Python (4-5×), which probably says more about how young the riscv64 runtime backends are than about the hardware itself.
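The "7.3× vs 9.2× scaling" note is easier to read as a fraction of ideal linear scaling. This is a small worked calculation in Go using only the sysbench figures from the table; the function name is illustrative, and treating all twelve P1 cores as equals is a simplification on my part.

```go
package main

import "fmt"

// scaling returns the multi-thread speed-up over a single thread and the
// fraction of ideal linear scaling that represents for the given core count.
func scaling(single, multi float64, cores int) (speedup, efficiency float64) {
	speedup = multi / single
	return speedup, speedup / float64(cores)
}

func main() {
	// sysbench events/s from the table above.
	k3Speedup, k3Eff := scaling(2329, 16980, 8)
	p1Speedup, p1Eff := scaling(2800, 25746, 12)
	fmt.Printf("K3 (8 X100 cores): %.1fx speed-up, %.0f%% of linear\n", k3Speedup, k3Eff*100)
	fmt.Printf("P1 (12 cores):     %.1fx speed-up, %.0f%% of linear\n", p1Speedup, p1Eff*100)
}
```

By this measure the eight X100 cores scale at roughly 91% of linear against roughly 77% for the P1's twelve, so the multi-thread gap comes mostly from core count and per-core speed rather than from poor scaling on the K3.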

Memory Bandwidth

Test                  | MilkV Jupiter 2         | Orange Pi 6 Plus (A720 best)
sysbench memory read  | 3,051 MiB/s (~3.2 GB/s) | 15-17 GB/s (libc memcpy)
sysbench memory write | 2,694 MiB/s (~2.8 GB/s) | 35-47 GB/s (memset)

I went back and ran this in parallel on the CIX P1, and the K3’s memory bandwidth is much lower–roughly a fifth for reads and less than a tenth for writes. This is likely the biggest single performance gap, and it caps whatever the CPU can do regardless of how much it packs into each cycle. For inference workloads that are memory-bound, this matters a lot. The K3 has a few workarounds, though, as we’ll see later.

Storage

Test             | MilkV Jupiter 2
Sequential write | 1.2 GB/s
Sequential read  | 3.4 GB/s
4K random write  | 113 MB/s (~28K IOPS)

The built-in UFS storage is very nice–NVMe-class speeds, better than what I saw on the Orange Pi 6 Plus’s NVMe setup with my own (underused) PCIe 4 SSD. No complaints here.

Thermals under Load

The board stays well-behaved under sustained 8-core stress-ng:

  • Idle: 59-64°C, fan at 45% / 2335 RPM
  • Full load (30s sustained): 62-68°C, fan ramps to 60% / 3194 RPM
  • No throttling observed, which made my usual CPU/thermal charts kind of pointless

Again, stress-ng --cpu 0 ran on the 8 available X100 cores, but even when I ran both CPU and AI loads that used the A100 cores, the fan was audible but not objectionable–noticeably quieter than the Orange Pi 6 Plus’s cix-ec-fan in quiet mode, and the fan controller API is much saner.

Since I had a few tussles with the Orange Pi 6 Plus’s fan controller limitations, I let an LLM loose on /sys/devices, and it found out that the Jupiter’s fan is managed by a CrosEC controller over eSPI (/sys/devices/platform/soc/cac8c000.espi/84000000.ec). That exposes a standard hwmon interface with fan1_input and (surprisingly) fan1_fault that standard Linux utilities can read (and the built-in cooler does seem to have the right number of wires to provide fan sensing, which is a nice touch).

There’s also a separate pwm-fan platform device at /sys/devices/platform/pwm-fan/hwmon/hwmon8/pwm1 that accepts values 0-255 for direct duty-cycle control. pwm1_enable reads 1 when thermal management is active, and a pwm-fan cooling device is linked to thermal_zone0. In practice, you never need to touch any of this–the board keeps itself at 60-68°C under sustained load with the fan barely audible, even when using all 16 cores and at an ambient temperature of nearly 28°C in my office.

Power Consumption

I stuck a USB PD power monitor between the PSU and the K3, and the figures were pretty stable: 11W idle, an oddly symmetrical 22W under load. I suspect using an SFP for networking will add significantly to that, but most of my testing was actually done by ssh over Wi-Fi.

GPU

Unlike the Orange Pi 6 Plus, where the GPU required driver rebinding and vendor package archaeology, the Jupiter 2’s PowerVR GPU works out of the box.

No module loading, no blacklisting, no package hunting. I ran vulkaninfo and got a conformant Vulkan 1.3 device on the first try, although I am not sure how far I can go with Vulkan compute on this board yet since I explored other avenues.

The hardware is an IMG PowerVR B-Series BXM-4-64 MC1, and Vulkan reports it cleanly:

  • deviceName = PowerVR B-Series BXM-4-64 MC1
  • driverID = DRIVER_ID_IMAGINATION_PROPRIETARY
  • apiVersion = 1.3.277
  • driverVersion = 1.588.1135 (24.2@6603887)
  • conformanceVersion = 1.3.8.1

Doing the usual barrel-scraping YouTube influencer “testing” of firing up a 4K video in the browser is… absurdly fluid, really, since the K3 has a dedicated video decode unit (/dev/video-dec0, V4L2 “mvx” driver–decode only, no hardware encode that I can find) and that seems to be properly stitched together on the Bianbu packages.

OpenCL 3.0 is also present, with cl_khr_fp16 and cl_khr_integer_dot_product – the latter suggesting hardware support for int8 dot products, which is exactly what you want for basic vision processing. I tried poking at it with my Vulkan tooling, and the Vulkan side exposes shaderFloat16 and shaderInt8, 16KB shared memory, and 2 compute queues.

In short, I had zero issues with desktop acceleration, and I expect the K3 to be well supported going forward. I do intend to explore Vulkan on this a bit more, but as you’ll see below, I got completely sidetracked by the ISA and how it does vector compute…

NPU Vs A100 CPU Cores

The device tree shows an Arm China Linlon V5 (Zhouyi AIPU) at c0500000, status okay.

Okay, then, but… the device-tree lacked the obvious NPU plumbing I am sort of used to from ARM:

  • /proc/device-tree/soc/linlon-v5@c0500000/compatible says arm china,linlon-v5
  • there are no /dev/aipu*, /dev/npu*, /dev/linlon* or /dev/zhouyi* nodes
  • there are no aipu, linlon or zhouyi kernel modules under /lib/modules/6.18.3-generic
  • dmesg is silent for those names
  • web searches for linlon-v5, arm china,linlon-v5, Zhouyi AIPU and SpacemiT K3 NPU drivers turned up no public driver or SDK that matches this node

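The Linlon V5 block is effectively opaque–no driver, no SDK, no kernel module that I could find. So it’s a dead end for now, although I suspect there are drivers for it somewhere. (One caveat: as far as I know, Linlon is Arm China’s video processor line and Zhouyi is its NPU line, so this node may well be the video decode unit mentioned above rather than an NPU.)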

What is interesting is what’s hiding in Bianbu’s apt repository: a SpacemiT ONNX Runtime stack (spacemit-onnxruntime, python3-spacemit-ort) and a spacemit-tcm package. The latter ships libspine_tcm.so, spacemit-tcm-smi and a public spine_tcm.h, and it talks to /dev/tcm rather than to a classic /dev/npu device. That’s not an NPU path at all–it’s targeting the A100 RISC-V cores and their tightly-coupled memory directly.

The ISA, Again

After the first evening of poking around, I decided to do what most people would do and read some actual documentation–which wasn’t hard to come by.

The CPU chapter in SpacemiT’s documentation gave me a few hints: the A100 cores run SpacemiT-IME (Inference Matrix Engine), a set of custom RISC-V vector extensions for quantised matrix arithmetic, with a programming model that gave me a bit of a flashback to my FORTRAN and VAX days–matrices in registers, explicit tiling and core synchronisation–but as a crash course in what RISC-V vector extensions can actually do, it made for a fun read.

The short version, if you’re in a hurry, is that this is a “unified memory” RISC-V system where the CPU itself can do some interesting quasi-GPU math:

A page from the docs

Go-ing Places

The long version is that this is almost tailor-made for go-pherence, my pet inference library. I’ve mostly been doing MLX-like FP16 work with it, but my intent is to cover non-GPU inference as well, and even though AVX2 and NEON are interesting, I was completely nerd-sniped by the idea of using this RISC-V RVV variant to do “proper” inference.

And Codex was able to sort out how to map this to useful steps and identify parts of the instruction set that could do just that:

The custom instructions (vmadotsu.hp, vmadotu.hp, vnpack4.vv, vupack.vv, vpack.vv) perform fused int4×int8 dot products with FP16 accumulation. Each vmadot dispatch processes 128 bytes of activation against 512 bytes of 4-bit weights, producing 32 partial results. The data layout treats VS1 as copies×(M, K) matrices and VS2 as copies×(K, N) matrices, with the result stored across VD(L) and VD(H).
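To make "int4×int8 dot product" concrete, here is a scalar reference in Go of the arithmetic such an instruction fuses. It is only a sketch under loud assumptions: two signed 4-bit weights per byte with the low nibble first, integer accumulation, and made-up scale factors. The real IME register layout, signedness variants and FP16 accumulation are defined by SpacemiT's documentation, not by this code.

```go
package main

import "fmt"

// signExtend4 turns a 4-bit two's-complement nibble into an int8 in [-8, 7].
func signExtend4(nibble byte) int8 {
	return int8(nibble<<4) >> 4
}

// dotInt4Int8 multiplies packed 4-bit weights against int8 activations.
// ASSUMPTION: two signed weights per byte, low nibble first. This is an
// illustrative packing, not the layout the K3's vector registers use.
func dotInt4Int8(packed []byte, acts []int8) int32 {
	var acc int32
	for i, b := range packed {
		lo := signExtend4(b & 0x0F)
		hi := signExtend4(b >> 4)
		acc += int32(lo) * int32(acts[2*i])
		acc += int32(hi) * int32(acts[2*i+1])
	}
	return acc
}

func main() {
	// Weights 3, -2, 7, -8 packed two per byte.
	packed := []byte{0xE3, 0x87}
	acts := []int8{10, 20, -5, 1}
	// Hypothetical per-block scales, as most 4-bit quantisation schemes carry.
	const weightScale, actScale = 0.05, 0.1

	raw := dotInt4Int8(packed, acts)
	fmt.Println("integer dot product:", raw) // 30 - 40 - 35 - 8 = -53
	fmt.Println("dequantised:", float32(raw)*weightScale*actScale)
}
```

The hardware does this for whole tiles per dispatch; the scalar loop is just the thing a vectorised kernel has to agree with when you test it.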

The “hard” part was to map this to Go assembler, but, again, Codex had no trouble churning out code for vector operations by just lining up the right bits:

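#include "textflag.h"

// func rvvMulVecVec(a *float32, b *float32, out *float32, n int)
// Element-wise out[i] = a[i] * b[i], strip-mined over n float32 values.
TEXT ·rvvMulVecVec(SB), NOSPLIT, $0-32
    MOV  a+0(FP), X10           // a0 = a
    MOV  b+8(FP), X11           // a1 = b
    MOV  out+16(FP), X12        // a2 = out
    MOV  n+24(FP), X13          // a3 = elements left
    WORD $0x012072d7            // vsetvli t0, zero, e32, m4, tu, mu
loop:
    BEQ  X13, X0, done          // nothing left to process
    WORD $0x0126f2d7            // vsetvli t0, a3, e32, m4, tu, mu (t0 = elements this pass)
    WORD $0x02056007            // vle32.v v0, (a0)
    WORD $0x0205e207            // vle32.v v4, (a1)
    WORD $0x92021057            // vfmul.vv v0, v0, v4
    WORD $0x02066027            // vse32.v v0, (a2)
    SLL  $2, X5, X6             // t1 = t0 * 4 (bytes consumed)
    ADD  X6, X10, X10           // advance a
    ADD  X6, X11, X11           // advance b
    ADD  X6, X12, X12           // advance out
    SUB  X5, X13, X13           // n -= t0
    JMP  loop
done:
    RET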

All it needed was this page (and a couple of others):

The instruction format page
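For readers who don't speak RVV, the loop in that assembly is a strip-mining pattern, and it can be mirrored in plain Go. This sketch assumes vsetvli grants min(remaining, VLMAX) elements per pass, which the vector spec permits and most cores do but does not strictly require; the VLEN values are the ones quoted for the X100 and A100 cores elsewhere in this post, and the function names are illustrative.

```go
package main

import "fmt"

// vlmax returns the number of elements in one vector register group for a
// given VLEN (bits), element width (bits) and LMUL.
func vlmax(vlenBits, sewBits, lmul int) int {
	return vlenBits / sewBits * lmul
}

// mulVecVec mirrors the assembly loop: each pass takes up to vlMax elements
// (the vsetvli step), multiplies them, then advances and decrements.
// It returns how many passes the loop needed.
func mulVecVec(a, b, out []float32, vlMax int) int {
	passes := 0
	remaining := len(out)
	for off := 0; remaining > 0; passes++ {
		vl := remaining // ASSUMPTION: vl = min(remaining, VLMAX)
		if vl > vlMax {
			vl = vlMax
		}
		for i := off; i < off+vl; i++ {
			out[i] = a[i] * b[i] // vle32.v + vle32.v + vfmul.vv + vse32.v
		}
		off += vl
		remaining -= vl
	}
	return passes
}

func main() {
	const n = 100
	a, b, out := make([]float32, n), make([]float32, n), make([]float32, n)
	for i := range a {
		a[i], b[i] = float32(i), 2
	}
	for _, vlen := range []int{256, 1024} {
		vm := vlmax(vlen, 32, 4) // e32, m4 as in the vsetvli encoding
		passes := mulVecVec(a, b, out, vm)
		fmt.Printf("VLEN=%d: VLMAX=%d, %d passes, out[99]=%v\n", vlen, vm, passes, out[99])
	}
}
```

With e32 and m4, a 256-bit VLEN gives 32 elements per pass and a 1024-bit VLEN gives 128, so the same binary loop needs four passes on one core type and a single pass on the other for 100 elements.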

TCM (Tightly Coupled Memory)

I had some trouble figuring out how this mapped to the TCM memory device that I had found, but a few more pages into the ISA doc it became clear:

TCM is 3 MB of on-chip SRAM (8 × 384 KB blocks), meant as a low-latency scratchpad for the IME2 matrix engine. According to the docs, both sets of cores can access it in pairs:

  • From the X100 cores (VLEN=256), TCM reads at 1.14 GB/s (uncacheable device memory)
  • From the A100 cores (VLEN=1024), it reads at 5.4 GB/s via a direct SRAM path for wide vector loads

This is a pretty dramatic difference from the RAM bandwidth I measured earlier, and even more so if you consider that the A100 cores can read it nearly five times faster than the X100 cores. And there’s more:

  • Cores are organised in pairs sharing TCM blocks, so they can exchange results much faster
  • I later found that SpacemiT’s own reference code uses paired-worker barriers to overlap DMA (weight prefetch from DRAM into TCM) with compute on the partner core

If you’ve ever done double-buffering, well, this is it applied to vector compute.
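For anyone who hasn't, this is a minimal Go sketch of the idea using two buffers and a pair of channels. It is a generic illustration only: goroutines stand in for the paired cores, a slice stands in for a TCM block, and none of the names correspond to SpacemiT's API.

```go
package main

import "fmt"

// tile stands in for one block of weights staged in fast memory.
type tile struct {
	id   int
	data []float32
}

// loadTile simulates the slow step (copying weights in from DRAM).
func loadTile(t *tile, id, size int) {
	t.id = id
	t.data = t.data[:0]
	for i := 0; i < size; i++ {
		t.data = append(t.data, float32(id+1))
	}
}

// sumTile simulates the compute step on an already-staged tile.
func sumTile(t *tile) float32 {
	var s float32
	for _, v := range t.data {
		s += v
	}
	return s
}

// process streams nTiles through two buffers: while the consumer computes
// on one, the producer goroutine is already filling the other.
func process(nTiles, size int) float32 {
	free := make(chan *tile, 2)  // buffers available for loading
	ready := make(chan *tile, 2) // buffers loaded and waiting for compute
	free <- &tile{}
	free <- &tile{}

	go func() {
		for id := 0; id < nTiles; id++ {
			t := <-free // blocks while both buffers are still in use
			loadTile(t, id, size)
			ready <- t
		}
		close(ready)
	}()

	var total float32
	for t := range ready {
		total += sumTile(t)
		free <- t // hand the buffer back for the next prefetch
	}
	return total
}

func main() {
	fmt.Println(process(4, 8)) // 8*(1+2+3+4) = 80
}
```

The important property is that only two buffers ever exist, so memory use stays fixed no matter how many tiles stream through.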

Armed with this knowledge, I distilled it into a SPEC and went to town on the K3 with Codex to see if we could port some of the go-pherence SIMD inference kernels, but there was a serious kink: I couldn’t for the life of me figure out how to schedule code on the A100 cores.

Thread Scheduling Weirdness

So I asked Codex to get out Capstone and disassemble the TCM libraries. Turns out getting a thread onto the A100 cores requires a two-step handshake:

  • write the thread’s TID to /proc/set_ai_thread (a kernel interface that unlocks scheduling on cores 8–15 for that specific thread)
  • then call sched_setaffinity to pin it.

Without the registration the kernel silently refuses the affinity change–those cores are fenced off from normal userspace entirely (which explains the oddities in the early benchmarking).
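In Go, the handshake might look roughly like the sketch below (Linux only). Two things here are assumptions rather than facts from the disassembly: that /proc/set_ai_thread takes the TID as plain decimal text, and that a single 64-bit mask is enough for sched_setaffinity on this board. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"syscall"
	"unsafe"
)

// cpuMask builds a 64-bit affinity mask covering CPUs first..last inclusive.
func cpuMask(first, last int) uint64 {
	var mask uint64
	for cpu := first; cpu <= last; cpu++ {
		mask |= uint64(1) << uint(cpu)
	}
	return mask
}

// pinToAICores registers the calling OS thread, then pins it.
// ASSUMPTION: the proc file accepts the TID as decimal text.
func pinToAICores(procPath string, first, last int) error {
	// Keep this goroutine on one OS thread from here on; the lock is never
	// released, which suits a worker that lives for the whole process.
	runtime.LockOSThread()
	tid := syscall.Gettid()

	// Step 1: register the thread so the kernel will allow the pin.
	f, err := os.OpenFile(procPath, os.O_WRONLY, 0)
	if err != nil {
		return fmt.Errorf("open %s: %w", procPath, err)
	}
	defer f.Close()
	if _, err := f.WriteString(strconv.Itoa(tid)); err != nil {
		return fmt.Errorf("register thread %d: %w", tid, err)
	}

	// Step 2: pin the thread with a raw sched_setaffinity call.
	mask := cpuMask(first, last)
	_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
		uintptr(tid), unsafe.Sizeof(mask), uintptr(unsafe.Pointer(&mask)))
	if errno != 0 {
		return fmt.Errorf("sched_setaffinity: %w", errno)
	}
	return nil
}

func main() {
	if err := pinToAICores("/proc/set_ai_thread", 8, 15); err != nil {
		fmt.Println("not pinned:", err) // expected on anything that is not a K3
		return
	}
	fmt.Println("running on the A100 cluster")
}
```

The runtime.LockOSThread call matters: without it the Go scheduler is free to move the goroutine to a different OS thread, and the registration would then apply to the wrong TID.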

SpacemiT’s own llama.cpp fork (PR #22863) uses this pattern: six pthreads permanently pinned to cores 8–13, synchronised with spine_barrier_t (an atomic spinlock barrier), sitting in a persistent work loop that processes matrix tiles from a shared queue.

The workers never return to the OS scheduler between operations–barriers replace dispatch overhead entirely. I later realized that (a) this is how the K3 can hit 35–40 tok/s on Qwen3-0.6B Q4_K_M and (b) Go scheduling has a lot more overhead.
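The persistent-worker pattern is easy to sketch in Go, with the caveat that this is a generic generation-counting spin barrier and not a reconstruction of spine_barrier_t, whose internals I am not describing here. Goroutines stand in for the pinned pthreads, and an atomic add stands in for a tile of matrix work.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// spinBarrier is a reusable barrier that busy-waits instead of sleeping.
type spinBarrier struct {
	parties    int32
	waiting    int32
	generation uint32
}

// wait spins until all parties have arrived for the current generation.
func (b *spinBarrier) wait() {
	gen := atomic.LoadUint32(&b.generation)
	if atomic.AddInt32(&b.waiting, 1) == b.parties {
		atomic.StoreInt32(&b.waiting, 0)   // reset for the next round first...
		atomic.AddUint32(&b.generation, 1) // ...then release the spinners
		return
	}
	for atomic.LoadUint32(&b.generation) == gen {
		runtime.Gosched() // yield so the sketch also works with few OS threads
	}
}

// runRounds starts workers that stay in their loop for every round, meeting
// at the barrier instead of being dispatched again each time.
func runRounds(workers, rounds int) int64 {
	b := &spinBarrier{parties: int32(workers)}
	var total int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for r := 0; r < rounds; r++ {
				atomic.AddInt64(&total, int64(id+1)) // stand-in for one tile of work
				b.wait()                             // nobody starts the next round early
			}
		}(w)
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(runRounds(6, 100)) // (1+2+...+6) * 100 = 2100
}
```

On dedicated cores a real implementation would spin without yielding; the runtime.Gosched call is there because goroutines share OS threads, which is exactly the overhead that pinned pthreads avoid.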

Disassembling the ONNX runtime I’d found (SpaceMITExecutionProvider) showed it used the same cores with SPACEMIT_EP_* settings for thread count, profiling, and operator filtering.

The Actual AI Bit

So where does this leave us in terms of usable inference? Well, a lot of people like speed, and if you want speed, you can install llama.cpp-tools-spacemit 0.0.8 and run TinyLlama 1.1B Chat Q2_K (which is just 459MiB) with 8 threads:

Test                    | Result
Prompt processing pp128 | 137.47 ± 0.05 t/s
Token generation tg64   | 36.60 ± 0.01 t/s

This is pretty impressive as SBCs go, and it’s no wonder I am starting to see YouTube videos demoing it–it fills up a screen very quickly if you do a one-shot prompt, but is fundamentally useless.

Running Real Models

The more interesting question is whether the K3 can host a usable local coding endpoint, so I worked through a spread of current models on a fork of the SpacemiT llama.cpp tree, all at Q4_K_M with f16/f16 KV and 8 threads.

I cranked out a Pi session and had it draft a realistic agentic coding turn: a system prompt with tool definitions, a prior read tool call, the file returned as context, and a request to produce an edit tool call - roughly 700-900 prompt tokens in, 700 generated out.

The results were… Interesting. And slow to achieve, not just because of the turn times but also because I had to patch llama.cpp to match minor changes in the Bianbu libraries:

Model                      | Type / active  | RAM     | Prefill (t/s) | Decode (t/s) | Overall† (t/s) | Turn
Qwen3.6-28B-REAP-A3B       | MoE / A3B      | 17.3 GB | 29.1          | 6.5          | 11.5           | 140s
Gemma 4 E4B                | dense / 4B     | 4.9 GB  | 28.9          | 5.7          | 9.5            | 147s
Gemma 4 E2B QAT UD-Q4_K_XL | dense / 2B-ish | 2.5 GB  | 99.6          | 12.9         | -              | 18s/128 tok
Gemma 4 26B-A4B            | MoE / A4B      | 16.9 GB | 38.8          | 5.1          | 9.1            | 154s
Qwen 3.5-9B                | dense / 9B     | 5.6 GB  | 22.5          | 4.5          | 8.2            | 195s
Gemma 4 12B                | dense / 12B    | 7.3 GB  | 18.7          | 2.46         | 4.3            | 322s
Gemma 4 12B QAT UD-Q4_K_XL | dense / 12B    | 6.3 GB  | 25.0          | 3.6          | 4.2            | ~86s/300 tok

†Overall = (prompt + completion tokens) ÷ total compute time - blends prefill and decode for the turn.
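As a sanity check on how those columns relate, here is the arithmetic as a small Go function. It assumes a turn is just prefill time plus decode time with nothing else in between, and it uses 900 prompt tokens, the top of the range quoted above, since exact per-model token counts are not listed.

```go
package main

import "fmt"

// turn models one request: prefill the prompt, then decode the completion.
// It returns the wall time in seconds and the blended overall rate in t/s.
func turn(promptTok, genTok, prefillRate, decodeRate float64) (seconds, overall float64) {
	seconds = promptTok/prefillRate + genTok/decodeRate
	return seconds, (promptTok + genTok) / seconds
}

func main() {
	// Qwen3.6-28B-REAP-A3B row: 29.1 t/s prefill, 6.5 t/s decode.
	secs, overall := turn(900, 700, 29.1, 6.5)
	fmt.Printf("turn: %.0fs, overall: %.1f t/s\n", secs, overall)

	// How much of the turn is spent decoding.
	decodeShare := (700 / 6.5) / secs
	fmt.Printf("decode share of the turn: %.0f%%\n", decodeShare*100)
}
```

That lands within a second or two of the 140s turn in the table, and it shows that more than three quarters of the turn goes to decode, which is why the decode column is the one that matters.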

So yes, it can run fairly decent models, but at two to five minutes a turn, not in a usable way. That doesn’t mean it can’t run LLMs, just that it can’t run moderately serious ones at speed (still, I’m pretty sure you can stuff a smaller Qwen variant in there and do simple things like home automation).

Since I happened to be playing with a few of these models on my RTX 3060 (where they work at 4-8x the speed, making them quite usable), I copied the weights across and had Codex script out the same run across them with a few variations in settings:

Model                          | Note                            | Prefill t/s | Decode t/s
Qwen3.6-28B-REAP + ngram spec  | copy-heavy task, 81% accept     | 29          | 15.5 (2× peak)
Qwen3.6-28B-REAP @ 64K ctx     | light context                   | 33.1        | 7.8
Qwen3.6-28B-REAP @ 262K ctx    | full native context             | 21.5        | 9.8
Qwen3 0.6B                     | tiny model                      | 293         | 55
Qwen3.6-28B Q4_0 (requant)     | deep 9K ctx                     | 21.7        | 3.5
Qwen3.6-35B-REAP + MTP         | non-viable on this CPU backend  | -           | - (stalled)

The pattern is fairly clear: on this memory-bandwidth-bound board, decode rate is roughly inversely proportional to the number of active parameters. Sparse mixtures-of-experts and sub-4B dense models “work”, but anything above roughly 3B active just doesn’t, really. And multi-token prediction (MTP), which I had gotten to work pretty well on my 3060 under go-pherence, stalls completely.

Since Gemma 4 just came out (again, for what, the third time?) with QAT, I also tried both its MTP and QAT variants by patching llama.cpp a bit further (by this time I was really hooked).

And splitting the workload across core types actually “worked”: sticking drafters on the slower X100 and the rest on A100 was feasible, but… there’s no fast memory exchange between core types, so it was (verifiably) useless:

Gemma 4 E4B run                  | Thread placement                        | Prefill t/s | Decode t/s
No drafter                       | target on A100 8-15                     | 26.36       | 5.99
Assistant MTP, 4 draft threads   | drafter on X100 0-7, target on A100 8-15 | 26.35       | 5.99
Assistant MTP, 8 draft threads   | drafter on X100 0-7, target on A100 8-15 | 26.30       | 5.97

QAT

Gemma 4 E2B QAT was, however, “useful”, for a rather slow definition of it (~13t/s), and it is technically multimodal, but on the K3… not usable either. I tossed a 224×224 test image into it, which took roughly 39–47 seconds just to process through the projector, and even though it could identify a solid red square, an equally simple red square/blue circle image came back as “yellow and white”. Might be my code, but by this time I had already started eating into my vacation days and I decided to call it quits.

The numbers are interesting, though, and make me wonder what a K3-like CPU can do with other kinds of models:

Model / route              | Params (M) | RAM / file | Prefill t/s | Decode t/s | Notes
Qwen 3 0.6B                | 596        | 373 MB     | 37.5        | 43.5       | Great demo, zero substance
Gemma 4 E2B QAT            | 4,630      | 2.5 GB     | 99.6        | 12.9       | decent prose/code without toy-model speedups; can use tools, will keep it around
Qwen3.6-28B-REAP-A3B       | 28,240     | 17.3 GB    | 28.9        | 7.15       | quality anchor; large context and actual coding
Gemma 4 E4B                | 7,520      | 4.9 GB     | 27.5        | 6.01       | twice as big as E2B, and twice as slow
Gemma 4 12B QAT UD-Q4_K_XL | 11,910     | 6.3 GB     | 25.0        | 3.6        | sort of worked, but unusable

In practice, none of the model, quant, or speculative tricks break the ~7 t/s decode wall for genuinely useful generation on the quality models, and we’re stuck shuffling ~3B of active weights per token out of LPDDR.
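The wall can be turned into a back-of-the-envelope model: if decode is purely bandwidth-bound, tokens per second is memory bandwidth divided by the bytes of weights touched per token. The Go sketch below works that through. The 4.8 bits per weight is an assumed average for a 4-bit K-quant rather than a measured figure, and the 10 GB/s used for the ceiling is a hypothetical round number, not something measured on the K3.

```go
package main

import "fmt"

// bytesPerToken estimates how many weight bytes one decoded token has to
// pull from memory, given the active parameter count and average bit width.
func bytesPerToken(activeParams, bitsPerWeight float64) float64 {
	return activeParams * bitsPerWeight / 8
}

// decodeCeiling is the best-case tokens/s if decode were limited purely by
// memory bandwidth (no compute, cache reuse or KV traffic considered).
func decodeCeiling(bandwidthBytesPerSec, bytesPerTok float64) float64 {
	return bandwidthBytesPerSec / bytesPerTok
}

func main() {
	const bitsPerWeight = 4.8 // ASSUMPTION: rough average for a 4-bit K-quant

	perTok := bytesPerToken(3e9, bitsPerWeight) // ~3B active weights per token
	fmt.Printf("weights touched per token: %.1f GB\n", perTok/1e9)
	fmt.Printf("bandwidth implied by 7 t/s: %.1f GB/s\n", 7*perTok/1e9)

	// Hypothetical: what a flat 10 GB/s of effective bandwidth would allow.
	fmt.Printf("ceiling at 10 GB/s: %.1f t/s\n", decodeCeiling(10e9, perTok))
}
```

Under those assumptions, 7 t/s on 3B active weights implies something like 12 to 13 GB/s of effective read bandwidth across the cores doing the work. That is well above the sysbench figure from earlier, so that benchmark probably isn't showing what the whole chip can pull when several cores stream at once.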

I was able to get very close to the C numbers with the E2B QAT, so I will be playing with that a bit more–in fact, I think that the Gemma 4 models are the most interesting thing out there if, like me, you’re stuck with an RTX 3060 and a 36GB RAM M3 Mac as your top inference hardware…

Where This Leaves Me

I am quite taken by the K3, to the point where I just got Whisper going on it using go-pherence and am now trying to shoehorn various other things into it, but the summary is as follows:

  • Software maturity is surprisingly good – Bianbu 4.0 has a modern kernel (6.18), modern tooling, and has had zero papercuts (so far). This is not the “barely boots” RISC-V experience from two years ago, when barely worked out of the box on riscv64 without a full rebuild. And no, I didn’t try yet, but I will, eventually.
  • Thermals are well-managed – never exceeded 68°C even under sustained 16-core load, fan ramps smoothly and is much quieter than the CIX P1.
  • The shipping storage is great – NVMe-class speeds from the onboard UFS (didn’t feel the need to add an NVMe), no SD card in sight.
  • GPU and video decoding story is better than expected – PowerVR Vulkan 1.3 and hardware decoding both work out of the box.
  • The “normal” CPU cores (and memory bandwidth) are middling – the X100s trail an A720 by roughly 1.2–1.7× per core in compiled code, but the A100 RVV sort of makes up for it.

And even if it can’t do “real” LLMs, I am pretty sure the K3 can handle standard image recognition swimmingly. YOLOv5 has just come out even as I am putting this post together, so I haven’t tested it, but the key thing is that RISC-V is really interesting as a CPU platform now (at least for me). Of course, it has to come with enough RAM (and those 32GB RAM are probably the bare minimum any AI SBC should have for any realistic use), and times are tough, but I look forward to testing its descendant(s).

Right now I’ve embarked on the rather quixotic quest of getting Ideogram 4 on it (and yes, I know it won’t really “work”, but I wanted to have a go and have another “working” implementation besides the RTX back-end), and I expect I will spend a bit of time trying to tweak Qwen or Gemma 4 on it to see if I can have a permanent “house LLM” that doesn’t suck and can do basic automation (even if slowly) – and I’ll update this post (or add a link to it at the bottom) with any positive results.

WWDC26: Early Impressions

This was the weirdest WWDC26 keynote in a while, and some of the past ones were visibly phoned in. It was rife with weirdness and flashbacks.

To my surprise, a few of my items actually made it. Naming the next macOS “Golden Gate” was not on my bingo card, though; a little too trippy and a lot too lofty for what is, by Apple’s own tacit admission, a Snow Leopard year: catching up rather than charging ahead.

The self-deprecating tone ran through the whole thing, from a hippy bus that was equal parts weird and funny to the unmistakable sense of a company that spent the past year watching the industry sprint past it on AI and is now, not running but sedately pacing, to catch up.

Moderately Likely To Work

Much to my surprise, two of my top annoyances got airtime: they’re tackling Spotlight and Mail search, the exact failures I called out, although whether either works once it ships is anyone’s guess.

They’re also doubling down on automation, at least superficially, with vibecoded and a renewed push for third-party Actions. Vibecoding Safari extensions and Shortcuts is the genuinely interesting part: it points at automation rather than novelty, which is more than I can say for yet another Image Playground. None of it erases the brittleness and legacy gaps that made me want a real platform to begin with, but it’s at least pointing the right way. Tab grouping and change detection in Safari are a fun party trick, no more.

Siri AI

And yes, there’s a new, as-yet-unproven Siri (with a completely pointless AI moniker) you summon by holding the power button (part Spotlight, part walkie-talkie, plus a floating gelatinous orb in Vision Pro), and a Siri app trying to be a catch-all bucket for every interaction.

The new voice struck me as a little cringe and overly American, which is an odd note to land on when you want me talking to my machines all day. The feature set is fuzzy: on paper it can touch far more of my data, and moving photos to the shared library by voice would be neat if it works. But Siri has been stuck at “if it works” for fifteen years, and the one thing I actually want (for it to handle my and calendar properly) wasn’t demoed in any useful detail.

Reheated, Or Absent

I wondered whether the automation push would reach , and the answer is a shrug: the new camera detection is cute, but a YOLO model has done exactly that for a decade, and the automation logic I actually need stays vague. The rest of my list didn’t show at all: no hypervisor on the iPad, no running my own code without the annual toll, nothing on iCloud sync, the Watch, or SwiftUI. Maybe the sessions turn something up (which is why this is an early read), but my expectations haven’t budged.

The framing around Apple Foundation Models was the bigger tell: we already know there’s Gemini underneath, which leaves me wondering how much Apple is adding beyond the wrapper. got the same treatment by being walked back in the most face-saving way imaginable, with the old Accessibility transparency slider re-warmed and trotted out as an improvement. Disingenuous is the word, twice over.

Update: Also much to my surprise, they actually mentioned unifying the corner radii, which I completely missed. I must have tuned it out after the 300 random percentage performance improvements they quoted against… no real baseline, really.

Anyway, Apple heard the parts of everyone’s complaints that a) did not force it to walk back Liquid Glass and b) fit the AI story it needed to tell, and stayed quiet on a lot of the boring structural stuff that’s been broken for years. Yes, they are committing to improving performance and fixing some of the most egregious issues, and that’s not nothing; hearing Spotlight and Mail search admitted out loud is more than I expected. But it is mostly Apple’s technical debt catching up with them and, of course, Apple catching up with everyone else where AI is concerned, on its own terms and at its own pace.

Oh, and they deprecated pretty much all of my hardware, too. Kind of expected, much like the usual geographical restrictions, which mean a good chunk of this may not reach Portugal for a year, if at all.

I’m going to give it a couple of days until the dust settles, watch the Platforms State of the Union tomorrow, and then mull things over a bit more. And maybe, somehow, we can chalk up this WWDC as a sort of a win, in the long run.

Notes for June 1–7

I decided to take a couple of days off and generally tune out, thanks to a few strategically placed bank holidays – which meant my usual mix of relaxing and dealing with a few chores.

For starters, I replaced the battery on our A1466 MacBook Air, which just keeps on trucking – it’s now on its third battery (I swapped the factory one some four, or was it five, years ago). For around EUR 80, keeping that rather nice keyboard/screen/trackpad combination in use was a no-brainer, and it too now runs a Niri desktop, having been a few months ago.

Putting Pi on the Desktop

And since I quite like having an AI assistant that can actually do something useful on my desktop, I did a quick hack to wire Pi into Noctalia:

The Pi assistant panel running inside the Noctalia desktop shell
Pi running inside Noctalia -- here, mid-task on the shell plugin itself.

This took around 30 minutes to become useful, and gave me a couple of ideas for improvements to piclaw’s UX – I had forgotten how flexible QML is.

AI Can Be Entertaining Too

I’ve been automating away a fairly large chunk of VM and container management – I have a dedicated agent that knows how to manage my Portainer stacks and version them in Gitea, for instance – but as it turns out, LLMs are also pretty good at a few other things, like setting up emulators under Steam (creating nice icons, fixing controller input mappings, tuning upscaling and shaders, and the rest of it).

But I hadn’t let an LLM loose on my Calibre and music collections yet, and – with the right safeguards – it’s been awesome at tidying up metadata. I had dozens of ancient books with slightly broken Calibre metadata, so I’ve been putting together an server that sits next to my library to fix them – mostly because I don’t want to give a model full filesystem access to my NAS, and this way I can snapshot the database whenever it tries anything more extensive. I may well make something more generic, given time.

My WWDC 26 Wish List

Michael Tsai’s annual roundup of WWDC wish lists went up this week, and the thing that struck me most wasn’t any single request–it was the mood. There seem to be fewer wish lists than last year, several people openly admitted they couldn’t be bothered to write one, and the ones that did are pretty much bereft of any “aspirational” wishes.

In short, most Apple developers seem resigned to their fate, and echoed the same weary plea for a “Snow Leopard” year where Apple fixes things instead of shipping more, er… “liquid” junk.

One thing that is clearly apparent even to me (even though I am not doing a lot of Mac or iOS development save ) is that we haven’t even got stability in the 26s yet (John Siracusa has a rather mordant take on that in the latest ATP episode), and in a couple of weeks we’ll get betas of the 27s piling bugs on top of bugs.

I already wrote my catalogue of last month, so consider this the constructive inverse–roughly the same list, reframed as things I’d actually like to see fixed next week.

None of these are moonshots. Most have been fixable for years, and a fair few were working better a decade ago.

What’s changed for me is the agentic-era stakes: I now point Codex and Claude at almost every tool I use during the day, and Apple’s software is, conspicuously, the part that fights back hardest (although I can’t really , this week’s MS Build is chock full of examples where Microsoft is way ahead of Apple in working AI integration, and it’s… just sad to me personally).

My expectations are effectively rock-bottom by now. Apple has become a hardware company where software seems to have been tacked on as a somewhat under-maintained afterthought. But I can’t help but keep a scorecard, so here’s what I’m hoping for–in rough order of how often it ruins my week.

  • I want to be automatable again. Not necessarily the full plugin API they killed, but an dictionary that isn’t frozen in amber and a MailKit surface that can file, tag and search without ceremony–because the one app I live in all day is the one black box I can’t point an agent at. While they’re at it, smart folders and rules that sync from the Mac should finally arrive on , roughly twenty years late.
  • Spotlight should simply find things that exist. I’d settle for that alone–no AI, no reinvention–just reliable, complete results and the one-line reindex affordance the Mac has had for years made available on , so a corrupted index doesn’t mean a multi-hour restore that breaks Apple Pay and FaceID along the way.
  • In the agentic era, automation needs to be a first-class platform, not an afterthought. Like many others, I wish for a way to programmatically create and modify ; I also want Shortcuts that don’t break between OS releases, a genuine cross-platform story, and the MCP-style hooks that OpenAI and Anthropic have to keep reinventing to automate anything in macOS. Windows still does COM and Win32 automation so well that I built an agent tool against it in fifteen minutes–Apple should be embarrassed by that comparison.
  • Give the iPad back a hypervisor. Hypervisor.framework has been on the Mac since Yosemite and Apple Silicon runs Linux VMs beautifully, yet an EUR 1,400 iPad Pro with an M4 can’t run a container or a VM that a EUR 50 ARM board handles without breaking a sweat. The entire local-LLM and coding-agent ecosystem I depend on is locked out of the most powerful tablet I own.
  • needs a scripting layer and real logic. Scene chaining, granular presence, if-this-then-that that actually works, and–for the love of everything–let HomeKit automations call , not just the reverse. I’ve papered over all of it with Node-RED and Home Assistant, but none of that should be necessary for someone who bought into the ecosystem.
  • Make iCloud sync trustworthy and give us Sync Now buttons across the core apps, the way Messages already has (for now, until they notice and remove it). Stop silently migrating data to CloudKit and leaving the CalDAV and IMAP paths to rot–document third-party access properly instead of letting Reminders and Notes quietly vanish from open protocols. Apple has never exposed any APIs worth using, and that needs to change.
  • The Watch should be the best time-aware device Apple makes, and instead it’s a widget carousel. I want a -style chronological timeline, a Smart Stack that’s actually aligned with my calendar, and the Watch independence Imthaz Ahamed asked for–let it pair with more than one phone.
  • Let me run my own code on my own hardware without an annual EUR 99 toll. I don’t want App Store distribution–I want a “just run this on my phone” mode in that doesn’t involve certificate chains that expire and silently brick my sideloaded apps.
  • Stabilise or admit it’s a research project. Views that worked on iOS 17 behave differently on 18 and seem broken on 26, and I lose hours dropping to UIKit to dodge layout bugs reported years ago. Steve Troughton-Smith’s dream of a real cross-platform successor to UIKit and AppKit is the one I’d trade everything else on this list for if I had to write iOS apps for a living.

And no, I’m not going to complain about again. I don’t think anyone at Apple will ever own up to how much of a failure it was (even down to controls that provide user feedback but don’t register clicks at the very edge of them), and some of it was an improvement (the other 80% of spattering controls atop application content wasn’t).

Every one of these is within Apple’s reach. They have the engineers, the money, and total control of the platform, which is precisely why the pattern grates: this isn’t technical inability, it’s a decade of chosen neglect dressed up as focus, whether you look at it from the pure platform side or if you think about it in terms of the (utterly absent) third-party API integration surface.

This is, unashamedly, a bit of a rant. I’ve been using Macs since System 6 and writing here since the OS X betas, and I’ve watched the company get richer and more capable while the software I use every day gets quietly worse at the boring, essential things. No wonder I have gradually started using other platforms, to the point where most people don’t even consider this a Mac blog.

But I am deeply indebted to Apple for making the platforms that have kept me sane over multiple decades, and I do care about the ecosystem, so… Here we are.

I’d love to be proved wrong next week. I won’t hold my breath–but the scorecard is open, the pen is out, and if all we get is another year of razzle over the dazzle, at least I’ll have a checklist to tick off.

Field Notes From The AI Battlefield

Since today is a bank holiday for me, I decided to consolidate a few more of my notes into a post. What follows is a set of guiding “principles” that I’ve found useful over the past year or so and that I’ve codified into various bits of scaffolding I reuse across my projects.

As usual, I’ve tried to strip away all of the hype and fuzziness and stick to facts, but everyone has their own way of leveraging AI, so your mileage may vary.

However, unlike most of what I read online about AI these days, I am not pitching any specific tooling, although all of this is based on my experience.

Full Disclaimer: I and have a personal Codex account that OpenAI provided for my OSS work, as well as access to random Tier 2 providers that I use to test piclaw.

If you like this, you might be interested in , a minor rant about and my .

Do Not Blindly Trust AI-generated Code

A great example I usually point out is that if you ask an LLM to do extensive error handling on a piece of code, it will almost invariably (at least in ) generate empty catch(){} blocks and call that “error handling”.

Another is when I asked it to optimize a particular tree traversal function for an edge case and it just hard coded the result.

And this applies to nearly everything you ask any LLM to do–but code can be validated, and tested, and measured in various dimensions, and you can turn some of its foibles against it.

In the case of the first example above, a linter will catch that, and you can force the AI to turn those empty catches into something useful (like warning messages in logs).
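
To make the linter point concrete, here is a minimal sketch of such a check. It is a Python analogue (so it looks for `except` blocks whose body is only `pass`, `...` or a stray string, rather than an empty `catch(){}`), and `find_empty_handlers`, `_is_noop` and the `SAMPLE` snippet are names I made up for illustration; an off-the-shelf linter rule would do the same job.

```python
import ast


def _is_noop(body: list[ast.stmt]) -> bool:
    """True when a block contains only `pass`, `...` or bare constants."""
    for stmt in body:
        if isinstance(stmt, ast.Pass):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue
        return False
    return True


def find_empty_handlers(source: str, filename: str = "<string>") -> list[int]:
    """Return the line numbers of except blocks that silently swallow errors."""
    tree = ast.parse(source, filename=filename)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and _is_noop(node.body):
            offenders.append(node.lineno)
    return sorted(offenders)


# Hypothetical input: the first handler is "lazy", the second one logs.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as exc:
        print(f"bad input: {exc}")
        return None
'''

offending_lines = find_empty_handlers(SAMPLE)
```

Treating a non-empty result as a build failure is what turns this from a suggestion into something the model cannot talk its way around.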

The second one is nastier, but it too can be fixed through proper test fixtures (dynamic but non-repetitive).
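
As a rough illustration of what "dynamic but non-repetitive" can mean, the sketch below generates fresh random trees on every run and compares the function under test against a deliberately naive reference. Everything here is an assumption for the sake of the example: `Node`, `random_tree`, `inorder_fast` (standing in for the "optimized" code) and `check_against_reference` are hypothetical names, and an in-order traversal is just a convenient stand-in for the real function.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def random_tree(rng: random.Random, depth: int) -> Optional[Node]:
    """Build a random binary tree with at most `depth` levels."""
    if depth == 0 or rng.random() < 0.2:
        return None
    return Node(
        value=rng.randint(-1000, 1000),
        left=random_tree(rng, depth - 1),
        right=random_tree(rng, depth - 1),
    )


def inorder_reference(node: Optional[Node]) -> list[int]:
    """Slow but obviously correct oracle."""
    if node is None:
        return []
    return inorder_reference(node.left) + [node.value] + inorder_reference(node.right)


def inorder_fast(node: Optional[Node]) -> list[int]:
    """Stand-in for the optimized function under test (iterative traversal)."""
    out, stack = [], []
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        out.append(node.value)
        node = node.right
    return out


def check_against_reference(candidate: Callable, seed: int, cases: int = 200) -> None:
    """Compare `candidate` with the oracle on `cases` freshly generated trees."""
    rng = random.Random(seed)
    for case in range(cases):
        tree = random_tree(rng, depth=6)
        expected = inorder_reference(tree)
        actual = candidate(tree)
        assert actual == expected, f"seed={seed} case={case}: {actual} != {expected}"


seed = random.SystemRandom().randrange(2**32)  # different inputs on every run
print(f"fixture seed: {seed}")  # logged so a failure can be replayed
check_against_reference(inorder_fast, seed)
```

Because the inputs change on every run, a hard-coded answer has nothing stable to match, and the logged seed keeps failures reproducible.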

Which is why I invariably wrap all my AI-driven projects into several layers of deterministic testing and automation.

Automate Everything Away from the Model

The ground rule I follow is that even SOTA models are inherently unreliable, so when I set up a project or after the first few days of goofing around with a prototype, I try to make sure everything runs on rails.

I typically start with putting together a Makefile because it works/is preinstalled everywhere, is extremely familiar to LLMs, and means I have to do zero thinking myself when running steps manually, but you can use whatever you want.

The important thing is that it must cover the entire development and release cycle, because your agent will inevitably start drifting off and forget how it should do things.

I set it up like this:

  • Makefile targets to do everything (that way there is no “secret sauce” only the model “knows” to do tests, a build, etc.)
    • linting/static analysis (go vet is great, but you should also prepare for typical LLM “lazy” idioms like empty catch blocks, which should be considered critical errors)
    • tests (unit/fuzzing/functional)
    • builds
    • packaging
    • upstream dependency updates (packages and vendored files)
  • One or more SKILL.md file(s) that explain how to use the Makefile and cover the dev/test/debug/release workflows. You should make sure those are referenced from AGENTS.md or use the .github/copilot conventions (insert your flavor of choice here).

The key thing is to always aim for reproducible steps. The model will always go off into the weeds seeking an adventure regardless of how many admonitions you put in AGENTS.md or equivalent, especially when debugging things, but the Makefile (or equivalent) should be your ground truth.
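
One cheap way to keep the Makefile honest as ground truth is a guard that fails when an expected target goes missing. The sketch below is only that, a sketch: the target names in `REQUIRED_TARGETS` and the sample Makefile are hypothetical, and the parsing is intentionally naive (one target per rule line, no includes, no pattern rules).

```python
import re

# Hypothetical target names; use whatever your own Makefile exposes.
REQUIRED_TARGETS = {"lint", "test", "build", "package", "deps-update"}

# A rule line starts with a plain name followed by ":" (but not ":=").
_TARGET_RE = re.compile(r"^([A-Za-z0-9_][A-Za-z0-9_.-]*)\s*:(?!=)")


def makefile_targets(text: str) -> set[str]:
    """Collect rule names, skipping recipes, comments and special targets."""
    targets = set()
    for line in text.splitlines():
        if line.startswith("\t") or line.lstrip().startswith("#"):
            continue  # recipe lines and comments
        match = _TARGET_RE.match(line)
        if match:
            targets.add(match.group(1))
    return targets


def missing_targets(text: str, required: set[str] = REQUIRED_TARGETS) -> list[str]:
    """Return the required targets that the Makefile does not define."""
    return sorted(required - makefile_targets(text))


SAMPLE = "\n".join([
    ".PHONY: lint test build",
    "GO := go",
    "lint:",
    "\t$(GO) vet ./...",
    "test: lint",
    "\t$(GO) test ./...",
    "build: test",
    "\t$(GO) build ./...",
])

missing = missing_targets(SAMPLE)
```

Running something like this in CI means a model that quietly deletes or renames a target gets caught by the pipeline rather than by you.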

The SKILL.md files are… Well, of dubious value, really. I’ve found to have made them less effective since unlike gpt-5.3-codex newer models often don’t even read the files, but your mileage may vary.

Keep An Eye On Tests

In short, LLM-written tests are generally crap. Anthropic models, in particular, just plain cheat at writing them, so if you ask your LLM to write them, make sure you actually read them.

Unit tests written by LLMs very seldom do anything beyond the obvious, miss edge cases, etc. The only models that write halfway decent tests (as of mid-2026) are the Codex family of GPT models, and even vanilla 5.4/5.5 regressed on that from my standpoint, so my usual tactics are:

  • Build a set of prompts to have different models refactor tests without looking at the internals of your code (i.e., focus on contracts).
  • Treat tests as a black box that outputs a report, so that the session you are coding in does not see the tests and the session that runs and writes the tests does not see the code. You can call these different agents if you want–I call it separation of concerns.
  • Set up CI/CD flows that run all of the tests with zero agent intervention, but have CI/CD generate concise Markdown reports the agents can consume.

The last point is critical, so set it up as soon as you can–it frees up time on your machine and any decent agent can use gh (or equivalent) to fetch CI/CD artifacts, review the results and file issues for itself.
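
For the report itself, something as small as the following tends to be enough. It is a sketch that assumes your test runner can emit JUnit-style XML (`testcase` elements with optional `failure`, `error` or `skipped` children); `junit_to_markdown` and the sample suite are illustrative names, not part of any particular tool.

```python
import xml.etree.ElementTree as ET


def junit_to_markdown(xml_text: str, max_message_chars: int = 200) -> str:
    """Condense JUnit-style XML into a short Markdown summary."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = []
    skipped = 0
    for case in cases:
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            failed.append((case, problem))
        elif case.find("skipped") is not None:
            skipped += 1
    passed = len(cases) - len(failed) - skipped

    lines = [
        "## Test report",
        "",
        f"- total: {len(cases)}",
        f"- passed: {passed}",
        f"- failed: {len(failed)}",
        f"- skipped: {skipped}",
    ]
    if failed:
        lines += ["", "### Failures", ""]
        for case, problem in failed:
            name = f"{case.get('classname', '?')}.{case.get('name', '?')}"
            message = (problem.get("message") or problem.text or "").strip()
            first_line = message.splitlines()[0] if message else "(no message)"
            lines.append(f"- `{name}`: {first_line[:max_message_chars]}")
    return "\n".join(lines) + "\n"


# Hypothetical three-test suite with one failure and one skip.
SAMPLE = """<testsuite name="unit" tests="3">
  <testcase classname="cache" name="test_hit"/>
  <testcase classname="cache" name="test_evict">
    <failure message="expected 2 entries, got 3">trace...</failure>
  </testcase>
  <testcase classname="cache" name="test_slow"><skipped/></testcase>
</testsuite>"""

report = junit_to_markdown(SAMPLE)
```

Keeping only counts and the first line of each failure is deliberate: the agent gets enough to act on without the full trace eating its context window.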

Use LLMs to Fast-Track User Stories

This is where SOTA models shine. Even Sonnet, bless its little stupid heart, can take a set of requirements and distill them into user stories and feature files much faster than formal committee-style BDD processes, and the quality and coverage (so far) seem to be better than humans’.

If you work with customers, this last bit is very important–humans will want to describe the user stories that matter to them in exquisitely irrelevant detail while completely skimping on the ones they don’t care about, whereas LLMs won’t care if they are describing boring bits or not, and they won’t quibble at the details–they will just do it.

The resulting user stories need to be reviewed, of course, but piping UX requirements through an LLM and Gherkin typically generates pretty decent scripted tests, especially if the LLM can look at your Preact/Vue/etc. code and build corresponding Playwright scripts.

This will save you weeks of work, and catch dozens of inevitable regressions as LLMs subtly break your front-end code en passant while implementing new features.

Ask me how I know.

Again, Never Let The LLM Run Tests

Mind that I never rely on the LLM to run Playwright for the actual tests directly – it will cheat, be creative about how it inputs things, refresh the page to see if the DOM changes and break test state, etc. It’s fine to use it to explore an app and draft the scripts, but when you run these things in CI/CD, you want them to be extremely deterministic.

And you want evidence of all functional tests, so I have a little toolkit to gather that evidence:

  • Playwright for web testing
  • tmux for TUI testing (rmux is also a thing now, but if you work in regulated industries the paperwork to get it baked into an image will likely outweigh the benefits)
  • A custom VNC harness for my retro emulators (using tesseract for OCR, which is surprisingly capable)
  • And, sometimes, a webcam or a USB video capture adapter (plus a sub-agent that only describes what it sees)

As a bonus, besides a Markdown report, I also generate a PDF report with screenshots and logs for the failing cases–and an override switch to screenshot all the tests for occasional audits.

Again, ask me why.

Do Not Let The Models Edit Freely

LLMs will always mangle long files, regardless of how big the model or context window is. Anthropic models (as of mid-2026) are particularly prone to that for some reason (as well as “drive by shootings” where they mangle tangentially related files).

You need to decrease your exposure to this kind of risk and do some proactive damage control by decreasing the impact of any such errors. It is not a matter of if, it is a matter of when, and it will nearly always manifest as weird regressions a few days down the line.

What I do:

  • If possible in your harness, disable full-file write tooling and force the model to use edit or diff for focused edits. The added friction will typically prevent it from mangling entire files.
  • Set strict caps on file sizes and (depending on the kind of package) guidelines for breaking up functionality.
  • Review changes to see if unexpected files were touched (I have been meaning to create a SKILL.md for doing this automatically, but eyeballing it by listing uncommitted files is just easier).
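
The "were unexpected files touched" check is easy to script. Below is a minimal sketch that parses `git status --porcelain` output and flags anything outside a set of allowed globs; the helper names and the sample paths are hypothetical, and it assumes the plain v1 porcelain format (two status characters, a space, then the path).

```python
import subprocess
from fnmatch import fnmatch


def git_porcelain() -> str:
    """Return the raw `git status --porcelain` output for the current repo."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from porcelain v1 lines such as ' M src/app.py'."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        if " -> " in path:  # renames are reported as "old -> new"
            path = path.split(" -> ", 1)[1]
        paths.append(path.strip('"'))
    return paths


def unexpected_changes(porcelain: str, allowed_globs: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed glob patterns."""
    return [
        path for path in changed_paths(porcelain)
        if not any(fnmatch(path, glob) for glob in allowed_globs)
    ]


# Hypothetical session that was only supposed to touch the chat panel.
SAMPLE = (
    " M web/src/chat/panel.tsx\n"
    "?? web/src/chat/panel.test.tsx\n"
    " M server/db/schema.go\n"
    "R  docs/old.md -> docs/new.md\n"
)

strays = unexpected_changes(SAMPLE, ["web/src/chat/*"])
```

In a real session you would feed it `git_porcelain()` instead of `SAMPLE` and refuse to commit while `strays` is non-empty.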

Sometimes I wish I could just make unrelated files read-only before letting the LLM loose on React/Preact code, so I am looking into LSPs and static analysis to see if I can do the coding equivalent of raycasting–projecting out which files would be related to a specific change.

Aggressively Refactor at Every Opportunity

Every few sessions, stop and refactor the code. Most technical debt from AI use comes from letting it literally piss all over your nice module structure.

In particular, I’ve found that LLMs like to define redundant types and duplicate code pretty much at random because they can’t see across your entire code base. If they’re operating in one part of the tree, they’ll be completely oblivious to the rest.

What I do is that once I have implemented one feature (or a sequence of features) and tests pass, I aggressively go in and review every single type, helper and filename.
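
A first pass at spotting redundant definitions can be automated before the manual review. The following is a sketch in Python (so it inspects Python modules via `ast`, as an analogue of what an LSP would give you elsewhere); `duplicate_definitions` and the sample module paths are made up for illustration, and it only compares names, not bodies.

```python
import ast
from collections import defaultdict


def duplicate_definitions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map top-level class/function names to the modules that define them.

    Only names defined in two or more modules are returned.
    """
    seen = defaultdict(set)
    for module, text in sources.items():
        for node in ast.parse(text, filename=module).body:
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                seen[node.name].add(module)
    return {
        name: sorted(modules)
        for name, modules in sorted(seen.items())
        if len(modules) > 1
    }


# Hypothetical code base where the model re-declared a type it could not see.
SAMPLE_SOURCES = {
    "api/models.py": (
        "class UserRecord:\n"
        "    pass\n"
        "\n"
        "def slugify(text):\n"
        "    return text.lower()\n"
    ),
    "jobs/sync.py": (
        "class UserRecord:\n"
        "    pass\n"
        "\n"
        "def run():\n"
        "    return None\n"
    ),
}

duplicates = duplicate_definitions(SAMPLE_SOURCES)
```

Same-name definitions are only candidates, of course; deciding which copy survives (and whether they were ever meant to be the same thing) is still the manual part.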

Models can do baseline audits (the trope about OpenAI models fixing code Anthropic ones wrote is very much true in my experience), and you can trust the outlines of the audits, but with some caveats:

  • They will always cut short the depth to which they analyze code
  • They will often stop at module or dependency boundaries
  • They will only try to merge or remove duplicate code if it is blatantly obvious (and even then it is not a guarantee)

I do use models for audits, but only as a starting point. Then I go in and:

  • Point out where there was feature creep or duplication of code/responsibilities in the module structure
  • Enforce things like centralized logging
  • Manually flag duplicates and give instructions by adding TODO comments to the code

In (which I have sort of gravitated to recently due to the balance of great profiling and refactoring tools and less cognitive overhead than ), gopls can significantly help the model do most file splitting/refactoring automatically and without any chance for the model to mess things up, so every so often I fire up a dedicated session, hand it a prebaked set of guidelines and do a full-on refactoring pass.

Prune Abstractions

Models have a tendency to follow “best practices” to a point where they create untenable messes of nested abstractions, very much like the sort of people who write Python as if they were cosplaying at writing Java–classes, accessors and factories everywhere, etc. You know what I’m talking about.

This is something that initial SPECs and system prompts actually help with, until the context window is so full that those guidelines are “forgotten”.

Weed those out ruthlessly. By all means define reusable contracts and use strong typing ( is a godsend in that regard), but expect your linter and LSP to catch your LLM red-handed.

Learn To Walk Away

There are many ways to work with AI, and none of them work for everyone, but there are some basic tenets I follow:

  • Shorter Sessions = more attrition. One-shotting features will just create more pain and technical debt down the line, and they foster an illusion of progress, not stuff you can actually rely on.
  • Make sure you are willing to put in the design and spec effort. The more you think and plan yourself, the more grounding you can provide to an agent to keep it on track.
  • Leaving the agent to its own devices for an hour or so will give you time to ponder–yes, it might be risky token-wise if you haven’t specced out the work well enough, but that is part of the challenge here.

I think Ralph loops are profoundly stupid and wasteful, but I am very much a fan of writing a SPEC, chunking it into a plan.md (or your harness’ equivalent) that includes clear directions for testing, and then using things like /goal complete the plan.md file, because that provides the agent with a clear-cut set of steps.

Goal seeking of various forms (, performance optimizations, etc.) can be extremely effective and reliable, but only if you’ve stacked up most of the previous tricks written above (and even then I’ve caught LLMs cheating at benchmarks in the most egregious way: “the simplest option is to not execute the query” is a real thing that actually happened).

Aim For Reproducible Everything

Again, do not trust any of the code the agent puts out. And even if it works, keep track of how it works–in a sentence, instrument the crap out of everything:

  • Enforce structured logging as soon as possible, and have automated checks to ensure that errors/exceptions/etc. are logged.
  • Maintain a set of benchmarking/regression tests that output actual metrics (if you don’t use OpenTelemetry, try to at least have a text file with key metrics)
  • Be very thorough about regression testing. Taking the time to rebuild and run last week’s version will often show that you’ve missed either testing for something or measuring something important.
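
The "text file with key metrics" approach needs very little machinery. Here is a minimal sketch that assumes a `name = value` format with optional `#` comments and flags anything that moved by more than a relative tolerance; the format, the function names and the sample numbers are all assumptions for illustration.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name = value' lines, ignoring blank lines and '#' comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, value = line.split("=", 1)
        metrics[name.strip()] = float(value)
    return metrics


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,
) -> list[str]:
    """Flag metrics that disappeared or moved more than `tolerance` (relative)."""
    findings = []
    for name, old in sorted(baseline.items()):
        if name not in current:
            findings.append(f"{name}: missing from current run")
            continue
        new = current[name]
        if old == 0:
            change = 0.0 if new == 0 else float("inf")
        else:
            change = (new - old) / abs(old)
        if abs(change) > tolerance:
            findings.append(f"{name}: {old:g} -> {new:g} ({change:+.0%})")
    return findings


# Invented numbers, purely to show the output shape.
LAST_WEEK = "p95_latency_ms = 120\ntokens_per_s = 14.0  # higher is better\n"
TODAY = "p95_latency_ms = 150\ntokens_per_s = 14.2\n"

findings = regressions(parse_metrics(LAST_WEEK), parse_metrics(TODAY))
```

Note that this flags large moves in either direction, since it has no idea which way is "better" for a given metric; a sudden improvement is often just as suspicious as a regression.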

Again, CI/CD is your friend here, and a lot of my time, even on personal projects, has been spent on building test and smoke harnesses of various kinds:

  • Mock up external APIs and write various failure modes into the mocks so that the LLM will have to deal with “errors” from the start.
  • When doing emulation/JIT work, create a test harness for each specific operation that you can gdb through (LLMs can actually do this pretty well), then a smoke harness that you can compare with QEMU, etc.
  • When doing microcontroller work, build and test subroutines separately in the host machine before assuming they will work in the microcontroller.
  • When doing inference optimizations (like in go-pherence), cross-check similar kernels across back-ends and architectures to ensure they all provide the same results
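
As an example of the first item, a mock with failure modes baked in does not have to be elaborate. The sketch below uses a fixed, repeating schedule so that failures are deterministic; `FlakyInventoryAPI`, its response shape and `fetch_with_retries` are entirely hypothetical stand-ins for whatever external API and client you actually have.

```python
import itertools


class FlakyInventoryAPI:
    """In-memory stand-in for an external API that fails on a fixed schedule."""

    def __init__(self, schedule=("ok", "timeout", "ok", "rate_limited", "server_error")):
        self._schedule = itertools.cycle(schedule)
        self.calls = 0

    def get_item(self, item_id: str) -> dict:
        self.calls += 1
        outcome = next(self._schedule)
        if outcome == "timeout":
            raise TimeoutError(f"simulated timeout fetching {item_id}")
        if outcome == "rate_limited":
            return {"status": 429, "retry_after": 1}
        if outcome == "server_error":
            return {"status": 500, "body": "simulated upstream failure"}
        return {"status": 200, "body": {"id": item_id, "stock": 3}}


def fetch_with_retries(api: FlakyInventoryAPI, item_id: str, attempts: int = 3) -> dict:
    """Example client: retry timeouts and non-200 responses, then fail loudly."""
    last_error = None
    for _ in range(attempts):
        try:
            response = api.get_item(item_id)
        except TimeoutError as exc:
            last_error = exc
            continue
        if response["status"] == 200:
            return response["body"]
        last_error = RuntimeError(f"HTTP {response['status']}")
    raise RuntimeError(f"giving up on {item_id} after {attempts} attempts") from last_error


api = FlakyInventoryAPI()
first = fetch_with_retries(api, "sku-1")   # succeeds on the first call
second = fetch_with_retries(api, "sku-2")  # hits the timeout, then succeeds
```

A fixed schedule rather than random failures keeps CI runs reproducible while still forcing the generated client code to handle every failure path from day one.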

The list goes on, but the key thing is that everything should be automatable and outside the control of the LLM.

Is all the above hard work? Yes. But can you take most of it along with you when you start a new project? Also pretty much yes–and the icing on the cake is that once you’ve gotten the basics down, the principles are all transferrable across stacks/environments/runtimes and the thought process will keep your wits sharp.

Not to mention these things will save you a bunch of time.

Notes for May 24–31

Today I realised that I could just spend the day doing essentially nothing and that nobody would hold it against me (at least in Western nations), so… I might well do just that, with a few caveats:

Wi-Fi Fallout

Something very weird happened after I published – it made it to Hacker News (a day or so after I submitted it myself, because, as usual, most of my self-submitted links still appear to be shadow-banned despite 30K+ karma–and no, I don’t understand that either), and it was very popular among the usual band of armchair networking experts.

But then something really weird happened: I got an alert from Cloudflare that the lowercase-rewrite worker I’d deployed as a fallback for incorrect linking was exceeding the free-tier limit (100,000 runs, if I recall correctly), which made me curious enough to dig into the analytics:

Cloudflare page views control chart showing two out-of-control spikes reaching ~70,000 views/hour on 30 May
The control chart doesn't lie. Those orange dots are not normal.

I have CF’s anti-bot crawling settings active, I turned on CAPTCHAs again after the initial peak, and yet… 70,000 views in an hour, twice? Has to be crawlers. And how did CF let them through and count them?
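
For anyone unfamiliar with control charts, the idea fits in a few lines. This sketch assumes the classic Shewhart-style rule (points outside the baseline mean plus or minus three standard deviations are "out of control"), and the hourly figures are invented purely to show the mechanics, not taken from my analytics.

```python
from statistics import mean, stdev


def control_limits(baseline: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) limits computed from a quiet baseline period."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return max(0.0, centre - sigmas * spread), centre + sigmas * spread


def out_of_control(series: list[float], baseline: list[float], sigmas: float = 3.0) -> list[int]:
    """Return the indices of points that fall outside the control limits."""
    low, high = control_limits(baseline, sigmas)
    return [i for i, value in enumerate(series) if value < low or value > high]


# Invented hourly page views: a calm baseline, then a series with two spikes.
BASELINE = [900, 1100, 1000, 950, 1050, 1000]
SERIES = [980, 1150, 70000, 1020, 69000]

spikes = out_of_control(SERIES, BASELINE)
```

The limits come from the baseline only, so a handful of enormous spikes cannot inflate the threshold and hide themselves.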

So I went and plotted Clarity’s chart of “human” visitors (always an undercount, since it only captures people with JavaScript enabled and no ad-blocking, but useful as a sanity check):

Microsoft Clarity unique visitors chart showing the genuine HN-driven spike to ~8,000 unique visitors on 29 May, with traffic returning to normal shortly after
The real HN spike was Thursday. Everything after is noise.

Definitely bots after the initial HN flood. I have to wonder why, why now, and whether Cloudflare’s free tier is still even marginally effective at blocking them.

go-pherence

The most interesting work this week was grafting speaker diarization onto go-pherence. Whisper tells you what was said; knowing who said it is a separate problem, and the standard answer is SpeechBrain plus a Python subprocess plus a fairly heavy PyTorch dependency. I did not want any of that. Instead I ported ECAPA-TDNN – the speaker embedding model SpeechBrain uses – to Go, and it all now mostly works with zero Python, even if it still needs a lot of tweaking.

There’s a speakercheck validation harness that runs spot-checks against windowed audio segments, scores against expected speaker labels, and outputs JSON reports, and a diarize-vtt command that accepts an optional ECAPA model and emits speaker-tagged VTT output. I expect to drop this onto one of my current hardware test subjects soon.

In Other News

I’ve been tinkering with more new hardware, but some things just take time and I’m still putting together my notes on those.

On the other hand, I am still very much impressed with the running , and I’m enjoying building little plugins for it as I go:

Niri display layout plugin showing the Kuycon P20 external display and built-in DSI screen arranged in a stacked layout
A Niri plugin to manage display layout, because of course I wrote one.

I will eventually publish these somewhere…

Mildly Parboiled

Allergy season is finally fading (at least for me), but today was the first time I had to turn on the AC in the office, and it was great to realize that and almost four years of potential HomeKit foibles, my is still working perfectly.

Those minor joys aside, I’ve been actively trying to get out of the house to do some exercise at least one hour a day and it is clearly not going to happen at lunchtime anymore–well, not every day, at least, so I’m starting to get cabin fever.

All of this to say that I’m feeling as if I am starting down the slippery slope to both physical and mental burnout again, and this time I’m backing off as early as possible.

For starters, I am currently profoundly annoyed at my current working arrangements, since my days of wall-to-wall meetings with completely random 15 minute breaks are both utterly destroying my health and eroding my ability to focus. Sometimes, and despite being remote for many, many years, I would really prefer to be back working at an office, if only because I miss walking about and using stairs to go and talk to people.

Turns out my closest project team are now in Madrid (plus Belgium, Sweden, Canada, etc.), so that isn’t going to happen. And, truth be told, online meetings are now so stupefyingly more productive (as meetings go) that actual work is still best done remote–as long as you can cut through the tremendous amount of AI-augmented cruft that a meeting now entails.

I, as usual, have been pragmatic about it and crafted my own agent to summarize meetings the way I want them, and to craft terse, minimalist works of corporate obeisance that avoid the walls of text I get by default and focus on the stuff I need to do instead of spouting corporate cheerleading (it has become ).

Anyway, my priority is now, again, my well-being. But I feel like my entire lifestyle is in dire need of an intervention, and the obvious life hacks most people suggest like exercising in the early morning (when I am trying to do my daily reading and research) or at the end of the day (when I am just bog tired) just don’t work for me, so the upshot of all this is that I am currently trying to carve out slots throughout the week to just get out of the house for 30 minutes.

Which is completely stupid.

This has to change (somehow). In the meantime, part of that carve-out is also going to be about mental health–I’m phasing out Twitter/X again, as well as a bunch of other “social” distractions and hypefests like HN.

Indoor Wi-Fi Roaming with OpenWRT

A few months after writing up the units and moving the house over to , I ended up revisiting the one bit I had deliberately waved away as “good enough”: roaming.

Read More...

Notes for May 17-24

My sinuses are still giving me grief, but this week was much more successful at pretending to be enjoyable, at least. For starters, we watched Project Hail Mary, and it was every bit as good as I would expect it to be, which is very rare in movies these days.

Read More...

Logitech Combo Touch: Four Years Later

I think it’s time for an update on my iPad Pro M1 and, most importantly, the Logitech Combo Touch I got for it. Think of it as a long term review of sorts.

Read More...

TIL: Noctalia Shell Lock on Suspend

This is a little bit of follow-up to my – I keep using it routinely (especially when we travel for leisure) and love the little thing to bits, but I’ve been wanting to run it mostly on power saving mode to reap the most benefit out of the hardware (and battery, of course), so I started looking at desktop environment alternatives.

Yes, I could already get a full afternoon (and then some) out of it, but Apple Silicon has spoiled me as far as battery life expectations go, and has a little bit too much baggage for that kind of extended use.

Since I spend 90% of my time on it writing or coding and still have a penchant for keyboard-driven desktops, I initially switched to Fedora Sway Atomic (gotta love being able to swap environments with a single command…), but later installed Niri and Noctalia Shell because I really like both the idea of a scrolling window environment and the sheer polish of the whole thing–even if there are some rough edges here and there.

I am very happy with it, and writing plugins for it is trivial:

I hacked together a Bing Wallpaper plugin in 30m

The one thing that annoyed me to no end, though, was locking on suspend, which Noctalia Shell should do but apparently doesn’t in , so I had to resort to two hacks:

Locking on Lid Close

The first was adding a switch-events block to the Niri config to trigger the lock screen when the lid closes:

switch-events {
    lid-close {
        spawn "qs" "-c" "noctalia-shell" "ipc" "call" "lockScreen" "lock"
    }
}

Idle Lock via swayidle

The second was setting up a swayidle systemd user service to lock after 5 minutes of inactivity and suspend after 10:

[Unit]
Description=SwayIdle Service
After=graphical-session.target

[Service]
Type=simple
ExecStart=/usr/sbin/swayidle -w \
    timeout 300 'qs -c noctalia-shell ipc call lockScreen lock' \
    timeout 600 'qs -c noctalia-shell ipc call sessionMenu lockAndSuspend'
Restart=on-failure
TimeoutSec=30

[Install]
WantedBy=graphical-session.target

This last one feels extremely gauche and I hope to find a better way, but I guess this comes with the territory. I don’t really care about having a trendy Wayland desktop (I just want a dead simple one with a bit of polish), but I hope this kind of hack won’t be necessary for much longer.
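
One candidate for that better way is swayidle's documented before-sleep event, which runs a command just before the system suspends; the -w flag already in use tells swayidle to wait for that command to finish. The variant below is a minimal, untested sketch of the same unit with that hook added. It assumes the qs IPC call returns once the lock screen is up, which has not been verified against Noctalia Shell.

```ini
[Unit]
Description=SwayIdle Service
After=graphical-session.target

[Service]
Type=simple
# Same timeouts as the unit above, plus a before-sleep hook.
# Assumption: the lockScreen IPC call completes before the machine sleeps.
ExecStart=/usr/sbin/swayidle -w \
    timeout 300 'qs -c noctalia-shell ipc call lockScreen lock' \
    timeout 600 'qs -c noctalia-shell ipc call sessionMenu lockAndSuspend' \
    before-sleep 'qs -c noctalia-shell ipc call lockScreen lock'
Restart=on-failure
TimeoutSec=30

[Install]
WantedBy=graphical-session.target
```

If it behaves as documented, this would also cover suspends that come from neither the lid nor the idle timer (a manual suspend from a terminal, for instance), and it might make the lid-close block redundant.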

Oh, and of course I ran gsettings set org.gnome.desktop.wm.preferences button-layout 'close,minimize,maximize:appmenu' to match macOS decorations.

Apple Papercuts

I know this blog has strayed a fair distance from its Mac-centric origins, but I’ve been keeping a mental list of all the things that are broken, missing or inexplicably neglected in Apple’s software, and it’s gotten long enough that writing it down feels like a public service.

Read More...

Notes for May 10-17

The weather has gone a tad cloudy again, which provided me some relief from my allergies–but not enough for proper overnight rest, so yet again I arrived at Friday afternoon totally exhausted.

Read More...

Announcing ios-linuxkit: Linux on iPad, the Hard Way

I’m done waiting for Apple to fix things. And one of the things I think should exist is a decent way to run Linux binaries on my iPad.

Read More...

Unexpected Synology Woes

Last weekend my decided, for some unfathomable reason, to stop working after I took it out of the closet, dusted it and put it back, and I have feelings about it.

Read More...

The Siri For Families Apple Will Never Build

The got me thinking about the one thing I keep wishing would build and almost certainly never will: a family-scoped AI assistant that actually works across all our devices.

Read More...

I Think I Figured Out What an AI IDE Looks Like

I’ve been mulling the UX arc I’ve been going through over the past couple of years, and I think it was mostly the same for everybody:

Read More...

Notes for May 3-10

This was a weird week, both because I keep waking up at 5AM with my sinuses clogged, and because I feel like I’m losing momentum. Feeling almost permanently cotton-headed, sleepy due to sheer exhaustion or because of antihistamines certainly has something to do with it, but .

Read More...

The Local AI Moat

Regular readers will know that I’ve spent most of the past two years shoehorning LLMs into single-board computers, partly as a learning exercise and partly because there are lots of local/”edge” applications where semantic reasoning (no matter how limited) and “interpretation” of sensor data are actually useful.

Read More...

Notes on GPT 5.x Model Regressions

I’ve been getting annoyed at constant code regressions in piclaw for the past few weeks. Something was off–even after bumping the test suite to the point where it catches most mechanical errors, gpt-5.5 kept making unrelated edits to code that should have been left alone, and I was tired of babysitting it.

Read More...

Notes for April 27 – May 3

This was an absurdly productive week, at least on a personal level. I’m not sure whether to be pleased or worried about the number of projects that moved forward simultaneously, but here we are.

Read More...

Lessons on Building MCP Servers

I’ve been building servers for a while now–I wrote about last year, started out by creating umcp, and I’ve recently opened up an Office server that’s been battered by enough models against enough real documents that the patterns have settled.

Read More...

App Notes: Web App Viewer

I got annoyed enough with Safari Web Apps to write my own replacement.

Read More...
