If anyone were to tell me that I would be hacking X11 server modules in 2022, I’d say they were nuts. And yet, here we are.
I’ve been using a Raspberry Pi 3B as a thin client on an ultra-wide monitor for a few months now, and it’s been a great way to have my roaming Fedora desktop available on a completely silent machine with a big screen whenever I need it.
I use that environment mostly for CAD and 3D printing stuff via my iPad (since I can’t run OpenSCAD or SuperSlicer locally), and although it is more than speedy enough, I’ve been pushing the envelope a bit more and investigating GPU acceleration.
Parenthesis: Getting 60fps in Windows RDP
A lot of people believe RDP to be slow and confuse it with VNC, but few realize that a Windows RDP server can actually provide clients with 60fps H.264/AVC:444-encoded full-screen video, and that Linux clients like Remmina actually support that (with some caveats).
Back when I was doing streaming hacks to broadcast desktops to Raspberry Pi 1s, that would have been great, and in a Windows-to-Windows environment it is pretty easy to set up (I have notes on that, which I update every now and then).
A Raspberry Pi 3 can actually cope (I got mine to go up to 54fps), but it is most useful with server-side GPU acceleration – which, again, is pretty easy to enable in Windows.
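For reference, in a Windows-to-Windows environment that mostly boils down to two Group Policy settings under Remote Session Environment. If I recall the mappings correctly, they correspond to the registry values below (run from an elevated PowerShell prompt) – treat the value names as assumptions and double-check them against the policy editor:
# Sketch: prefer AVC:444 mode and H.264/AVC hardware encoding for RDP sessions.
# These should correspond to the "Prioritize H.264/AVC 444 graphics mode" and
# "Configure H.264/AVC hardware encoding" group policies (value names assumed).
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services" /v AVC444ModePreferred /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services" /v AVCHardwareEncodePreferred /t REG_DWORD /d 1 /f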
Server-Side GPU Acceleration in Linux
As it happens, both `xrdp` and Remmina already support GFX AVC:444 encoding out of the box (more on that later). What `xorgxrdp` (the actual back-end component) does not support by default is server-side GPU acceleration for rendering the desktop it streams, which has to be compiled in using `--enable-glamor`.
But how do you actually go about doing that for a Fedora container? And what kind of GPU is required?
Well, I didn’t have much choice on the latter part – none of my personal machines has a discrete GPU, and my KVM host is an Intel i7-6700 with HD 530 integrated graphics, so I had to make that work1.
GPU Setup
Mapping the GPU into the container was the first step, and LXD makes it very easy to do this (easier than LXC, at least): you simply tell it to add a `gpu` device, and it will remap it2.
lxc config device add gnome gpu gpu
lxc config device set gnome gpu uid 1000
lxc config device set gnome gpu gid 1000
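If you want to double-check the device actually landed in the container’s configuration, LXD can show it to you:
# list the devices attached to the container, including the new gpu entry
lxc config device show gnome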
This immediately makes `/dev/dri` available inside the container, and since I have my UIDs synchronized3, the username I use to log into the container now has access to the device4.
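Before testing anything fancier, it is worth sanity-checking that the render node is visible and that VAAPI can drive it. A quick sketch – `libva-utils` is, as far as I recall, the Fedora package that ships `vainfo`, and recent versions of it accept the display/device flags below:
# inside the container: the render node should be accessible to your user
ls -l /dev/dri
# install vainfo and ask it to enumerate the supported codec profiles
sudo dnf install -y libva-utils
vainfo --display drm --device /dev/dri/renderD128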
I then did a quick test to see if running `ffmpeg` with GPU acceleration worked inside the container:
# this uses VAAPI to do accelerated transcoding
ffmpeg -vaapi_device /dev/dri/renderD128 -i ancient.wmv -vf 'format=nv12,hwupload' -c:v h264_vaapi output.mp4
This was successful and `intel-gpu-top` showed a good level of activity on the host, so on to the next challenge: building `xorgxrdp` with `glamor` baked in.
Patching Fedora
Update: As of November 2022, this is no longer necessary: just install the new xorgxrdp-glamor-0.9.19-4 package, which I found out about when unpacking an `xorgxrdp` upgrade and finding a clause that read `Conflicts: %{name}-glamor` in its `.spec` file. My thanks to the Fedora 36 maintainers.
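Since the two packages conflict, swapping them is probably the cleanest route – this is a sketch rather than something I have battle-tested, so review what `dnf` proposes before confirming:
# replace the stock build with the glamor-enabled one in a single transaction
sudo dnf swap xorgxrdp xorgxrdp-glamor
# and verify the conflict clause for yourself
rpm -q --conflicts xorgxrdp-glamor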
Building from source on a blank system is pretty straightforward, and there are plenty of instructions out there – in fact, I decided to try things out on my Platinum-themed Ubuntu container5 using this gist, and lo and behold, I was able to run OpenSCAD and my fancy new WebGL sitemap with GPU acceleration.
But this particular Fedora container represents a significant amount of investment (as with all desktops, even if you keep notes and automate your configurations, restoring them takes time), so I decided to take a trip down memory lane to when I backported packages to the Cobalt RAQ550 and patch the `xorgxrdp` package myself.
First, I checked how it had been built:
rpm -q --queryformat="%{NAME}: %{OPTFLAGS}\n" xorgxrdp
That confirmed it did not use `--enable-glamor`, so I got to work rebuilding the RPM:
sudo dnf install rpmdevtools xorg-x11-server-devel xrdp-devel nasm mesa-libgbm-devel mesa-libOSMesa-devel libepoxy-devel libdrm-devel kernel-devel
rpmdev-setuptree
cd ~/rpmbuild/SRPMS
dnf download --source xorgxrdp.x86_64
rpm2cpio xorgxrdp-0.9.19-1.fc36.src.rpm | cpio -idm
# move the tarball into SOURCES and the .spec into SPECS
mv *.tar.gz ../SOURCES/
mv *.spec ../SPECS/
You then need to edit the `.spec` file to read:
%build
%configure --enable-glamor CPPFLAGS="-I/usr/include/libdrm" --libdir="/usr/lib64/xorg/modules"
%make_build
Proper packaging would require me to actually change the version and edit `BuildRequires` and whatnot to include the dependencies I added (especially `libepoxy` and `libdrm`), but this was an experiment, and I have some hope Fedora will add this to the upstream package since it dramatically improves 3D application performance over RDP…
You then simply do:
cd ~/rpmbuild
rpmbuild -ba SPECS/xorgxrdp.spec
sudo rpm --reinstall RPMS/x86_64/xorgxrdp-0.9.19-1.fc36.x86_64.rpm
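If you want some reassurance that the rebuilt module really has glamor support compiled in, checking its linkage is a cheap test (the module path below is an assumption based on the `--libdir` passed to configure):
# a glamor-enabled build should link against libepoxy
ldd /usr/lib64/xorg/modules/libxorgxrdp.so | grep -i epoxy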
Rebuilding worked fine, but it took me a while to get it to work in Fedora because, due to different configuration layouts, you actually need to explicitly load `glamoregl`:
# make sure this is in /etc/X11/xrdp/xorg.conf
...
Load "glamoregl"
Load "xorgxrdp"
...
Note: You still need to check the above even if you install the new xorgxrdp-glamor package.
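Either way, after touching the configuration you will want to restart the service and confirm the module actually loads – the log locations below are my best guesses for a stock Fedora install, so adjust as needed:
sudo systemctl restart xrdp
# the session's Xorg log should mention glamor being initialized
grep -i glamor ~/.xorgxrdp.*.log /var/log/Xorg.*.log 2>/dev/null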
The Pendulum Swings
The upshot of all this is that I was able to manipulate models in OpenSCAD at some pretty amazing speeds, but it quickly became obvious that the Raspberry Pi 3 was becoming a rendering bottleneck, so I decided to fish out the Raspberry Pi 4 I had tested my non-accelerated setup on.
And guess what, now it was noticeably faster, and it allowed me to take advantage of a few things:
- First off, my display hacks were no longer necessary - the Pi 4 went full-screen on my ultra-wide monitor unprompted.
- Switching to a 64-bit Raspbian image brought in the 64-bit video driver, which does make things feel snappier.
And, of course, Gigabit Ethernet and a slightly beefier CPU certainly helped as well. But with 4GB of RAM instead of the original 512MB I started off with on the Pi 3A+, this setup is much less “thin” as a client.
I keep telling myself that at least now I can run beefier softsynths on it so I can try to keep my music hobby afloat, but I was curious to see exactly how far I could push the envelope.
Lies, Damned Lies and Benchmarks
I wanted to get some hard data regarding performance, so I started by reasoning out what would be worthwhile to test.
Performance Factors
The way I see it, a typical RDP pipeline breaks down into five (mostly) independent things:
- Back-end Rendering – this is where `glamor` comes in, and where I had been having trouble. Offloading 3D rendering to the Intel integrated graphics freed up the CPU for other things, but the HD 530 can only do so much.
- Frame Encoding – this entails taking whatever passes for a rendering surface and sending it out as a stream of bits, and is the only common factor between client and server.
- Wire Speed – this and bandwidth are tied together, but for my use case I’ve already moved up from Wi-Fi to Ethernet as I upgraded the client hardware, and even Gigabit Ethernet makes little difference, so I could safely ignore it.
- Frame Decoding – this entails grabbing the data and tossing it up on some sort of frame buffer, and can be a bit of a challenge on lower-end hardware.
- Local Rendering – and, of course, this is blitting that frame buffer (or sections of it) onto the actual local display.
Now, both encoding and decoding are still (as far as I can tell) completely CPU-bound in `xrdp` and Remmina, since neither of them (unlike Windows clients and servers) seems to take advantage of GPU acceleration.
But I did notice Remmina was linked with `libwayland-egl.so`, so I thought it would be worthwhile to check if Wayland would improve things.
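That sort of thing is easy to check for yourself:
# see which display back-ends the Remmina binary pulls in
ldd $(which remmina) | grep -Ei 'wayland|egl'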
And since there are constant discussions about whether or not it is worthwhile to run 64-bit binaries on the Raspberry Pi 4 and I already had both 32-bit and 64-bit OS images fully set up, I ended up trying different combinations of encoding, local rendering and system architecture.
Testing Scenario
Since I had already patched the server side and gotten decent enough performance out of the GL stack, I thought a relatively simple display benchmark like the UFO test would probably be enough6.
So I spent a leisurely hour running it locally and via Remmina to each of my containers, with and without Wayland, and using two baseline Raspbian installs (32- and 64-bit) to compile the following table:
Benchmark Results
All of the tests were run on a Raspberry Pi with 4GB of RAM, no overclocking, and GPU memory set to 256MB (in case anything made use of it):
| Browser | Machine | Userland | XRDP | Encoding | Wayland | Architecture | FPS |
|---|---|---|---|---|---|---|---|
| Chromium | Local | Raspbian 11 | N/A | N/A | False | aarch64 | 60 |
| Chromium | Local | Raspbian 11 | N/A | N/A | False | armhf | |
| Chromium | Local | Raspbian 11 | N/A | N/A | True | aarch64 | 16 |
| Chromium | Local | Raspbian 11 | N/A | N/A | True | armhf | 21 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | False | aarch64 | 39 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | False | armhf | 38 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | True | aarch64 | 36 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | Auto | True | armhf | |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | GFX AVC:444 | False | aarch64 | 57 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | GFX AVC:444 | False | armhf | 56 |
| Edge | Gnome | Fedora 36 | 0.9.19 (patched) | GFX AVC:444 | True | aarch64 | 58 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | False | aarch64 | 46 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | False | armhf | 47 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | True | aarch64 | 45 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | Auto | True | armhf | |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | GFX AVC:444 | False | aarch64 | 57 |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | GFX AVC:444 | False | armhf | |
| Edge | XFCE | Ubuntu 22.04 | 0.9.20+ (git tip) | GFX AVC:444 | True | aarch64 | |
- `Browser`: I used Microsoft Edge on all the remote systems because it consistently outperformed Firefox in WebGL testing, and I also wanted a roughly comparable approach to running Chromium locally on the Pi. Since I reconnected to the same session and just re-opened a new tab with the UFO test, I could keep browser windows static and with roughly equivalent sizes.
- `Machine`: Gnome and XFCE are the containers I have set up with different userlands and desktop environments on my KVM server. Although the desktop environment shouldn’t influence a rendering test, it does influence the “feel”, and I have a few notes on that.
- `Userland`: This is the container distro (or the local OS for local tests).
- `XRDP`: This is the server version I was connecting to (except for local tests).
- `Encoding`: What was set in Remmina.
- `Wayland`: Self-explanatory (I set it on via `raspi-config` and rebooted for a second round of tests).
- `Architecture`: Whether or not I was running a 64-bit kernel and userland.
- `FPS`: What was reported in the UFO test, which is what it thinks it has rendered.
As you can see, there are a few factors at play here.
Broad Conclusions
Well, Wayland was (at least for now, and unsurprisingly given its experimental state) a bit of a wash on the Pi, especially when doing local rendering; in the RDP pipeline it is just a blitting mechanism, so it didn’t have a lot of influence on the results.
Also, architecture differences seemed to favor `aarch64` (64-bit Raspbian) for most of the combinations, although not by much. I will keep using it on my thin client largely because being able to run `aarch64` binaries on it comes in handy, and other than possible RAM usage differences, I don’t see any point in going back to a 32-bit OS.
What isn’t on the table, though (and something I was a bit surprised to notice while building it), is that frames per second isn’t the whole picture – this is sort of expected given there is a fair bit of disconnect between what the UFO test sees and what the entire RDP stack ends up doing, but the gist of things is that having Remmina set to `Auto` encoding consistently outperformed `AVC:444` when dragging windows around and scrolling text.
And, funnily enough, AVC felt blockier and slower on Gnome than when connecting to XFCE. This is likely because I rebuilt `xrdp` completely from source on Ubuntu/XFCE and the Fedora version I patched only really improved back-end rendering.
Another likely factor is that Ubuntu tends to be a trifle more progressive in things like compiler options, and XFCE (even heavily themed) does have less rendering overhead in general.
But given the asymmetry in this setup (a moderately decent i7 versus the Pi’s relatively puny Broadcom chipset), my reading of this is that using AVC has a lot of impact on client decoding speed.
I suspect that is directly related to (as far as I know7) Remmina not supporting hardware decoding (at least on the Pi), so when set to `Auto` it almost invariably negotiates RemoteFX rendering with the server – which is also CPU-bound, but easier to handle.
Next Steps
Given that this little side trip into desktop streaming was long overdue and that I don’t have a lot of hardware to experiment with (although obviously that didn’t stop me), I don’t have anything specific planned.
I have long been pondering getting a new KVM server with a proper discrete GPU, ideally something I can both game on via Moonlight or Steam Link and run some ML workloads on – although the Intel Arc A770 is the sort of off-beat thing I would love to tinker with.
On the client side, there is a lot to explore, including more modern ARM SBCs like the Khadas Edge 2, which has an interesting GPU that seems to have hardware acceleration support in Ubuntu.
Feel free to drop me a note with suggestions (or, if you’re a manufacturer, review samples would be awesome).
1. This should also work for AMD and Intel discrete video cards – NVIDIA support requires a lot more tweaking, as usual. ↩︎
2. Proxmox and regular LXC would require you to fiddle around with `mknod` in ancient times – not sure if that is still the case. ↩︎
3. You do that by setting `raw.idmap: both 1000 1000` and restarting the container. ↩︎
4. Of course I should use groups to manage access, but group IDs for groups like `video` and `render` vary across distributions, so this was just easier to do. ↩︎
5. It took me a couple of tries, but the nice thing about LXD is that you can do `lxc snapshot xfce pre-xrdp-install` and revert back to it at will. ↩︎
6. There is a caveat here in that we don’t have VSYNC in Linux and there is no real display hardware, but as a pixel-swinging test it is actually rather good. ↩︎
7. I reran a couple of the tests looking at CPU usage on the Pi, and `AVC:444` did seem to use a lot more CPU than `Auto`. However, I do think it would be fun to see if Remmina could use the Pi’s GPU somehow. I know that `remmina-plugin-rdp.so` is linked against `libx264.so` and `libx265.so`, but those are CPU-bound implementations, at least on Raspbian. ↩︎