If anyone were to tell me that I would be hacking X11 server modules in 2022, I’d say they were nuts. And yet, here we are.
I’ve been using a Raspberry Pi 3B as a thin client on an ultra-wide monitor for a few months now, and it’s been a great way to have my roaming Fedora desktop available on a completely silent machine with a big screen whenever I need it.
I use that environment mostly for CAD and 3D printing stuff via my iPad (since I can’t run OpenSCAD or SuperSlicer locally), and although it is more than speedy enough, I’ve been pushing the envelope a bit more and investigating GPU acceleration.
Parenthesis: Getting 60fps in Windows RDP
A lot of people believe RDP to be slow and confuse it with VNC, but few realize that a Windows RDP server can actually provide clients with 60fps H.264/AVC:444-encoded full-screen video, and that Linux clients like Remmina actually support that (with some caveats).
Back when I was doing streaming hacks to broadcast desktops to Raspberry Pi 1s, that would have been great, and in a Windows-to-Windows environment it is pretty easy to set up (I have notes on that, which I update every now and then).
A Raspberry Pi 3 can actually cope (I got mine to go up to 54fps), but it is most useful with server-side GPU acceleration – which, again, is pretty easy to enable in Windows.
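For reference, in a Windows-to-Windows environment that mostly boils down to two Group Policy settings under Remote Session Environment. If I recall the mappings correctly, they correspond to the registry values below (run from an elevated PowerShell prompt) – treat the value names as assumptions and double-check them against the policy editor:
# Sketch: prefer AVC:444 mode and H.264/AVC hardware encoding for RDP sessions.
# These should correspond to the "Prioritize H.264/AVC 444 graphics mode" and
# "Configure H.264/AVC hardware encoding" group policies (value names assumed).
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services" /v AVC444ModePreferred /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services" /v AVCHardwareEncodePreferred /t REG_DWORD /d 1 /f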
Server-Side GPU Acceleration in Linux
As it happens, both `xrdp` and Remmina already support GFX AVC:444 encoding out of the box (more on that later). What `xorgxrdp` (the actual back-end component) does not support by default is server-side GPU acceleration for rendering the desktop it streams, which has to be compiled in using `--enable-glamor`.
But how do you actually go about doing that for a Fedora container? And what kind of GPU is required?
Well, I didn’t have much choice on the latter part – none of my personal machines has a discrete GPU, and my KVM host is an Intel i7-6700 with HD 530 integrated graphics, so I had to make that work1.
GPU Setup
Mapping the GPU into the container was the first step, and LXD makes it very easy to do this (easier than LXC, at least): you simply tell it to add a `gpu` device, and it will remap it2.
lxc config device add gnome gpu gpu
lxc config device set gnome gpu uid 1000
lxc config device set gnome gpu gid 1000
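If you want to double-check the device actually landed in the container’s configuration, LXD can show it to you:
# list the devices attached to the container, including the new gpu entry
lxc config device show gnome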
This immediately makes `/dev/dri` available inside the container, and since I have my UIDs synchronized3, the username I use to log into the container now has access to the device4.
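Before testing anything fancier, it is worth sanity-checking that the render node is visible and that VAAPI can drive it. A quick sketch – `libva-utils` is, as far as I recall, the Fedora package that ships `vainfo`, and recent versions of it accept the display/device flags below:
# inside the container: the render node should be accessible to your user
ls -l /dev/dri
# install vainfo and ask it to enumerate the supported codec profiles
sudo dnf install -y libva-utils
vainfo --display drm --device /dev/dri/renderD128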
I then did a quick test to see if running `ffmpeg` with GPU acceleration worked inside the container:
# this uses VAAPI to do accelerated transcoding
ffmpeg -vaapi_device /dev/dri/renderD128 -i ancient.wmv -vf 'format=nv12,hwupload' -c:v h264_vaapi output.mp4
This was successful and `intel-gpu-top` showed a good level of activity on the host, so on to the next challenge: building `xorgxrdp` with `glamor` baked in.
Patching Fedora
Update: As of November 2022, this is no longer necessary: just install the new xorgxrdp-glamor-0.9.19-4 package, which I found out about when unpacking an `xorgxrdp` upgrade and finding a clause that read `Conflicts: %{name}-glamor` in its `.spec` file. My thanks to the Fedora 36 maintainers.
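Since the two packages conflict, swapping them is probably the cleanest route – this is a sketch rather than something I have battle-tested, so review what `dnf` proposes before confirming:
# replace the stock build with the glamor-enabled one in a single transaction
sudo dnf swap xorgxrdp xorgxrdp-glamor
# and verify the conflict clause for yourself
rpm -q --conflicts xorgxrdp-glamor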
Building from source on a blank system is pretty straightforward, and there are plenty of instructions out there – in fact, I decided to try things out on my Platinum-themed Ubuntu container5 using this gist, and lo and behold, I was able to run OpenSCAD and my fancy new WebGL sitemap with GPU acceleration.
But this particular Fedora container represents a significant amount of investment (as with all desktops, even if you keep notes and automate your configurations, restoring them takes time), so I decided to take a trip down memory lane to when I backported packages to the Cobalt RAQ550 and patch the `xorgxrdp` package myself.
First, I checked how it had been built:
rpm -q --queryformat="%{NAME}: %{OPTFLAGS}\n" xorgxrdp
That confirmed it did not use `--enable-glamor`, so I got to work rebuilding the RPM:
sudo dnf install rpmdevtools xorg-x11-server-devel xrdp-devel nasm mesa-libgbm-devel mesa-libOSMesa-devel libepoxy-devel libdrm-devel kernel-devel
rpmdev-setuptree
cd ~/rpmbuild/SRPMS
dnf download --source xorgxrdp.x86_64
rpm2cpio xorgxrdp-0.9.19-1.fc36.src.rpm | cpio -idm
# move the tarball into SOURCES and the .spec into SPECS
mv *.tar.gz ../SOURCES/
mv *.spec ../SPECS/
You then need to edit the `.spec` file to read:
%build
%configure --enable-glamor CPPFLAGS="-I/usr/include/libdrm" --libdir="/usr/lib64/xorg/modules"
%make_build
Proper packaging would require me to actually change the version and edit `BuildRequires` and whatnot to include the dependencies I added (especially `libepoxy` and `libdrm`), but this was an experiment, and I have some hope Fedora will add this to the upstream package since it dramatically improves 3D application performance over RDP…
You then simply do:
cd ~/rpmbuild
rpmbuild -ba SPECS/xorgxrdp.spec
sudo rpm --reinstall RPMS/x86_64/xorgxrdp-0.9.19-1.fc36.x86_64.rpm
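If you want some reassurance that the rebuilt module really has glamor support compiled in, checking its linkage is a cheap test (the module path below is an assumption based on the `--libdir` passed to configure):
# a glamor-enabled build should link against libepoxy
ldd /usr/lib64/xorg/modules/libxorgxrdp.so | grep -i epoxy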
Rebuilding worked fine, but it took me a while to get it to work in Fedora because, due to different configuration layouts, you actually need to explicitly load `glamoregl`:
# make sure this is in /etc/X11/xrdp/xorg.conf
...
Load "glamoregl"
Load "xorgxrdp"
...
Note: You still need to check the above even if you install the new xorgxrdp-glamor package.
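Either way, after touching the configuration you will want to restart the service and confirm the module actually loads – the log locations below are my best guesses for a stock Fedora install, so adjust as needed:
sudo systemctl restart xrdp
# the session's Xorg log should mention glamor being initialized
grep -i glamor ~/.xorgxrdp.*.log /var/log/Xorg.*.log 2>/dev/null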
The Pendulum Swings
The upshot of all this is that I was able to manipulate models in OpenSCAD at some pretty amazing speeds, but it quickly became obvious that the Raspberry Pi 3 was becoming a rendering bottleneck, so I decided to fish out the Raspberry Pi 4 I had tested my non-accelerated setup on.
And guess what, now it was noticeably faster, and it allowed me to take advantage of a few things:
- First off, my display hacks were no longer necessary - the Pi 4 went full-screen on my ultra-wide monitor unprompted.
- Switching to a 64-bit Raspbian image brought in the 64-bit video driver, which does make things feel snappier.
And, of course, Gigabit Ethernet and a slightly beefier CPU certainly helped as well. But with 4GB of RAM instead of the original 512MB I started off with on the Pi 3A+, this setup is much less “thin” as a client.
I keep telling myself that at least now I can run beefier softsynths on it so I can try to keep my music hobby afloat, but I was curious to see exactly how far I could push the envelope.
Lies, Damned Lies and Benchmarks
I wanted to get some hard data regarding performance, so I started by reasoning out what would be worthwhile to test.
Performance Factors
The way I see it, a typical RDP pipeline breaks down into five (mostly) independent things:
- Back-end Rendering – this is where `glamor` comes in, and where I had been having trouble. Offloading 3D rendering to the Intel integrated graphics freed up the CPU for other things, but the HD 530 can only do so much.
- Frame Encoding – this entails taking whatever passes for a rendering surface and sending it out as a stream of bits, and is the only common factor between client and server.
- Wire Speed – this and bandwidth are tied together, but for my use case I’ve already moved up from Wi-Fi to Ethernet as I upgraded the client hardware, and even Gigabit Ethernet makes little difference, so I could safely ignore it.
- Frame Decoding – this entails grabbing the data and tossing it up on some sort of frame buffer, and can be a bit of a challenge on lower-end hardware.
- Local Rendering – and, of course, this is blitting that frame buffer (or sections of it) onto the actual local display.
Now, both encoding and decoding are still (as far as I can tell) completely CPU-bound in `xrdp` and Remmina, since neither of them (unlike Windows clients and servers) seems to take advantage of GPU acceleration.
But I did notice Remmina was linked with `libwayland-egl.so`, so I thought it would be worthwhile to check if Wayland would improve things.
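That sort of thing is easy to check for yourself:
# see which display back-ends the Remmina binary pulls in
ldd $(which remmina) | grep -Ei 'wayland|egl'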
And since there are constant discussions about whether or not it is worthwhile to run 64-bit binaries on the Raspberry Pi 4 and I already had both 32-bit and 64-bit OS images fully set up, I ended up trying different combinations of encoding, local rendering and system architecture.
Testing Scenario
Since I had already patched the server side and gotten decent enough performance out of the GL stack, I thought a relatively simple display benchmark like the UFO test would probably be enough6.
So I spent a leisurely hour running it locally and via Remmina to each of my containers, with and without Wayland, and using two baseline Raspbian installs (32- and 64-bit) to compile the following table:
Benchmark Results
All of the tests were run on a Raspberry Pi with 4GB of RAM, no overclocking, and GPU memory set to 256MB (in case anything made use of it):
| Browser | Machine | Userland | XRDP | Encoding | Wayland | Architecture | FPS |
|---|---|---|---|---|---|---|---|
| Chromium | Local | Raspbian 11 | N/A | N/A | False | aarch64 | 60 |
| Chromium | Local | Raspbian 11 | N/A | N/A | False | armhf | |
| Chromium | Local | Raspbian 11 | N/A | N/A | True | aarch64 | 16 |
| Chromium | Local | Raspbian 11 | N/A | N/A | True | armhf | 21 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | False | aarch64 | 39 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | False | armhf | 38 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | True | aarch64 | 36 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | True | armhf | |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | GFX AVC:444 | False | aarch64 | 57 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | GFX AVC:444 | False | armhf | 56 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | GFX AVC:444 | True | aarch64 | 58 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | False | aarch64 | 46 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | False | armhf | 47 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | True | aarch64 | 45 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | True | armhf | |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | GFX AVC:444 | False | aarch64 | 57 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | GFX AVC:444 | False | armhf | |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | GFX AVC:444 | True | aarch64 | |
- `Browser`: I used Microsoft Edge on all the remote systems because it consistently outperformed Firefox in WebGL testing, and I also wanted a roughly comparable approach to running Chromium locally on the Pi. Since I reconnected to the same session and just re-opened a new tab with the UFO test, I could keep browser windows static and with roughly equivalent sizes.
- `Machine`: Gnome and XFCE are the containers I have set up with different userlands and desktop environments on my KVM server. Although the desktop environment shouldn’t influence a rendering test, it does influence the “feel”, and I have a few notes on that.
- `Userland`: This is the container distro (or the local OS for local tests).
- `XRDP`: This is the server version I was connecting to (except for local tests).
- `Encoding`: What was set in Remmina.
- `Wayland`: Self-explanatory (I set it on via `raspi-config` and rebooted for a second round of tests).
- `Architecture`: Whether or not I was running a 64-bit kernel and userland.
- `FPS`: What was reported in the UFO test, which is what it thinks it has rendered.
As you can see, there are a few factors at play here.
Broad Conclusions
Well, Wayland was (at least for now, and unsurprisingly given its experimental state) a bit of a wash on the Pi, especially when doing local rendering; in the RDP pipeline it is just a blitting mechanism, so it didn’t have a lot of influence on the results.
Also, architecture differences seemed to favor `aarch64` (64-bit Raspbian) for most of the combinations, although not by much. I will keep using it on my thin client largely because being able to run `aarch64` binaries on it comes in handy, and other than possible RAM usage differences, I don’t see any point in going back to a 32-bit OS.
What isn’t on the table, though (and something I was a bit surprised to notice while building it), is that frames per second isn’t the whole picture – this is sort of expected given there is a fair bit of disconnect between what the UFO test sees and what the entire RDP stack ends up doing, but the gist of things is that having Remmina set to `Auto` encoding consistently outperformed `AVC:444` when dragging windows around and scrolling text.
And, funnily enough, AVC felt blockier and slower on Gnome than when connecting to XFCE. This is likely because I rebuilt `xrdp` completely from source on Ubuntu/XFCE and the Fedora version I patched only really improved back-end rendering.
Another likely factor is that Ubuntu tends to be a trifle more progressive in things like compiler options, and XFCE (even heavily themed) does have less rendering overhead in general.
But given the asymmetry in this setup (a moderately decent i7 versus the Pi’s relatively puny Broadcom chipset), my reading of this is that using AVC has a lot of impact on client decoding speed.
I suspect that is directly related to (as far as I know7) Remmina not supporting hardware decoding (at least on the Pi), so when set to `Auto` it almost invariably negotiates RemoteFX rendering with the server – which is also CPU-bound, but easier to handle.
Next Steps
Given that this little side trip into desktop streaming was long overdue and that I don’t have a lot of hardware to experiment with (although obviously that didn’t stop me), I don’t have anything specific planned.
I have long been pondering getting a new KVM server with a proper discrete GPU, ideally something I can both game on via Moonlight or Steam Link and run some ML workloads on – although the Intel Arc A770 is the sort of off-beat thing I would love to tinker with.
On the client side, there is a lot to explore, including more modern ARM SBCs like the Khadas Edge 2, which has an interesting GPU that seems to have hardware acceleration support in Ubuntu.
Feel free to drop me a note with suggestions (or, if you’re a manufacturer, review samples would be awesome).
1. This should also work for AMD and Intel discrete video cards – NVIDIA support requires a lot more tweaking, as usual. ↩︎
2. Proxmox and regular LXC would require you to fiddle around with `mknod` in ancient times – not sure if that is still the case. ↩︎
3. You do that by setting `raw.idmap: both 1000 1000` and restarting the container. ↩︎
4. Of course I should use groups to manage access, but group IDs for groups like `video` and `render` vary across distributions, so this was just easier to do. ↩︎
5. It took me a couple of tries, but the nice thing about LXD is that you can do `lxc snapshot xfce pre-xrdp-install` and revert back to it at will. ↩︎
6. There is a caveat here in that we don’t have VSYNC in Linux and there is no real display hardware, but as a pixel-swinging test it is actually rather good. ↩︎
7. I reran a couple of the tests looking at CPU usage on the Pi, and `AVC:444` did seem to use a lot more CPU than `Auto`. However, I do think it would be fun to see if Remmina could use the Pi’s GPU somehow. I know that `remmina-plugin-rdp.so` is linked against `libx264.so` and `libx265.so`, but those are CPU-bound implementations, at least on Raspbian. ↩︎