Following last week’s post regarding my kernel_task woes (which began nearly two weeks ago now), I decided to strip out the updates from that post and consolidate them here with a little more detail.
Update, three weeks after the Big Sur 11.4 upgrade: Long story short, my current working theory is that 11.4 rendered my Wi-Fi card unable to associate to my 5GHz network (which is all based on Apple Airport Extremes), and that there is a macOS kernel bug that handles this situation very poorly.
Summary: Every time I try to join that network, kernel_task immediately goes into a spin lock burning CPU cycles across all cores and stalling the machine, with variations of these two lines in the logs:
ARPT: 2859.550473: wlc_phy_rx_iq_est_acphy: SPINWAIT ERROR : IQ measurement timed out
ARPT: 2859.550487: wl0: fatal error, reinitializing, total count of reinit's[65], @'wlapi_wlc_fatal_error':701
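For anyone wanting to check their own logs for the same signature, here is a minimal sketch that counts these two messages in saved log text. It assumes the log has already been exported as plain text; the function name and the two patterns are illustrative and only cover the exact lines quoted above.

```python
import re

# Patterns modelled on the two log lines quoted above (nothing else is matched).
SPINWAIT_RE = re.compile(r"ARPT: (?P<ts>\d+\.\d+): \w+: SPINWAIT ERROR")
REINIT_RE = re.compile(
    r"ARPT: (?P<ts>\d+\.\d+): wl0: fatal error, reinitializing, "
    r"total count of reinit's\[(?P<count>\d+)\]"
)


def summarize_wifi_errors(log_text):
    """Count SPINWAIT errors and reinit events, and report the last reinit counter."""
    spinwaits = [float(m.group("ts")) for m in SPINWAIT_RE.finditer(log_text)]
    reinits = [
        (float(m.group("ts")), int(m.group("count")))
        for m in REINIT_RE.finditer(log_text)
    ]
    return {
        "spinwait_errors": len(spinwaits),
        "reinit_events": len(reinits),
        "last_reinit_count": reinits[-1][1] if reinits else None,
    }


sample = (
    "ARPT: 2859.550473: wlc_phy_rx_iq_est_acphy: SPINWAIT ERROR : IQ measurement timed out\n"
    "ARPT: 2859.550487: wl0: fatal error, reinitializing, "
    "total count of reinit's[65], @'wlapi_wlc_fatal_error':701\n"
)
print(summarize_wifi_errors(sample))
```

A rising reinit counter across a stall would be consistent with the driver repeatedly resetting the card, although this sketch only counts matches and proves nothing about causality.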
Lessons Learned: I should have been even more systematic. But there are several reasons I failed to notice this over the course of two weeks of trial-and-error troubleshooting:
- I have a Gigabit Ethernet connection (so I only have Wi-Fi on for Continuity).
- The Wi-Fi icon was moved from the menu bar into Control Center (so status changes and error flashes are invisible).
- The year-on-year changes to Console.app (and the glacial slowness of the machine when stalling) made it very hard to spot patterns in logs.
And, of course, I have to get work done, so working through all the possible permutations was very slow. It annoys me to no end that I only noticed this was an issue due to my having tried to pick my network when doing Internet Recovery, which helped pin things down.
I may update this post (again) later if I have more details, but for now let’s start with the whats and hows:
Context
I was getting random stalls on my iMac, with kernel_task taking up anywhere between 90-350% CPU without apparent cause ever since I upgraded to Big Sur 11.4.
It’s probably important to point out that this is a 2017 27-inch iMac with an Intel i5 CPU and 24GB of RAM, sporting a Radeon 570 and the ungodly nuisance of a 1TB Fusion Drive.
Cleaning out the machine, resetting the SMC or using AC to keep the environment cool seemed to help at first, but I had a hard time attributing this to thermal throttling.
I am using both Thunderbolt ports for external displays (an aging, purple-edged 4K Superfine and an HDMI adapter to a non-retina LG UltraWide), but there seems to be hardly any GPU impact.
And despite it being early Summer and temperatures starting to rise above 25°C, the prime suspect was my recent upgrade to macOS Big Sur 11.4 (20F71).
Measurement and Mitigation
Halfway through this rigmarole, I went and got iStat Menus off the App Store (which I picked because I didn’t like their direct licensing terms).
Even though that does not do fan control, I’ve also had the free version of Macs Fan Control around for a couple of years since it does basic temperature monitoring, so I can force the fans on and configure a single basic response “curve” (more of a hard line, really) based on CPU package temperature.
I now have mine set to start at 40°C and max out at 75°C, which ensures the fans are always slightly above Apple’s defaults.
This is annoying because I’ve always strived to have an absolutely silent office, but when things are OK it’s quieter than what little street noise comes through the windows.
The Saga
So this is what happened since my original post:
May 30th
Almost one full day after the initial cleaning, I had another 8-10m stall. I manage to launch Activity Monitor (easily a 1m affair under these conditions) and search for culprits. I can only see OneDrive and kernel_task hogging the CPU, so I kill the former.
I spend another full minute trying to launch Macs Fan Control, force the fans on again and it apparently makes a difference, as I see CPU temperature drop from 70°C to 65°C.
I also see no mds_stores, corespotlightd, suggestd or photoanalysisd above 5% CPU, so by this point, I’m starting to think that OneDrive is at least part of the problem since I’ve had to throttle its CPU usage more than once and it is pretty much crap on macOS.
SyncThing, by the way, is also running but shows no significant CPU or I/O usage (unlike OneDrive, it is not prone to just randomly fire up and start scanning my hard disk).
I start culling login items and hunting down launchd entries.
May 31st
Another couple of stalls, again several minutes long.
Around lunchtime, I power down the machine in another attempt at resetting the SMC and remember to run the Apple hardware test afterward (which finds nothing).
So I power down again and unplug all my peripherals (2 Thunderbolt displays, webcam, audio interface, MIDI keyboard, etc.). I replace my Logitech Brio’s USB 3 cable (I wasn’t using the bulky, long original that came with it but a shorter mongrel one I had lying around) and plug everything back in.
Other than the single cable swap (which was a bit of voodoo), the big change is that everything USB 3 now goes via a powered hub except my Time Machine drive, the reasoning being that it should help cut down on internal heating.
It is by this time that I decide to get iStat Menus from the App Store since I know it stores long-term CPU and temperature history data.
I disable OneDrive Files On Demand and massively reduce the number of files synced to this machine.
Since OneDrive then tries to sync everything locally, everything takes a few hours to stabilize, but there are no stalls for the remainder of the afternoon, even with a fair amount of disk activity as OneDrive downloads a few hundred gigabytes.
I keep Activity Monitor open to peek at it while I work and kernel_task never seems to go above 10% in this period.
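Peeking at Activity Monitor by hand is easy to forget, so here is a rough sketch of how the same figure could be pulled from `ps` output instead. It assumes that `ps -A -o %cpu,comm` lists the kernel as a process named kernel_task (as it does on my machine), and the helper name is made up for illustration. Note that the percentage `ps` reports is averaged differently from Activity Monitor, so the two will not match exactly.

```python
def kernel_task_cpu(ps_output):
    """Return the %CPU value for kernel_task from `ps -A -o %cpu,comm` output.

    Returns None when no kernel_task line is present.
    """
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[1].strip() != "kernel_task":
            continue
        try:
            # Tolerate locales that print a decimal comma.
            return float(parts[0].replace(",", "."))
        except ValueError:
            continue
    return None


# Illustrative input; real text would come from something like:
#   subprocess.run(["ps", "-A", "-o", "%cpu,comm"],
#                  capture_output=True, text=True).stdout
sample_ps = (
    " %CPU COMM\n"
    "  0.0 /sbin/launchd\n"
    "152.3 kernel_task\n"
    "  3.1 /Applications/OneDrive.app/Contents/MacOS/OneDrive\n"
)
print(kernel_task_cpu(sample_ps))
```

Calling this once a minute and appending the result to a file with a timestamp would give a crude history to line up against stalls.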
June 1st
Early morning, I spot a pattern that hints at this being a pure software bug, which is that kernel_task fires up (and forces an 8-10 minute stall) just after I’ve woken my iMac from standby (I usually fire it up, go have a shower and come back).
iStat Menus struggles to capture data during the stall, but it eventually pulls up CPU, GPU and temperature charts showing no significant activity while the machine was idle before the stall.
Again, there are no mds processes visible, and by this time I have pretty much nothing running in the background. A couple of cans of compressed air arrive, but I have to get a lot of work done and spend my lunchtime actually having lunch for a change.
Later in the day, I get one more stall that, rather suspiciously, starts on the hour. I begin to suspect hourly Time Machine snapshots may also be to blame.
I manually trigger a Time Machine backup to my NAS, and it does… nothing, in the sense that backupd starts looking at the remote snapshots, determines what needs to be copied and goes about its business. I have no stalls until 11pm, at which time I decide that if this is about I/O, then it’s highly unlikely to be the Fusion Drive.
I add it to iStat Menus nevertheless (both SSD and HDD data) to collect more datapoints.
June 2nd
By this time, I am really annoyed at Apple in general. I rediscover the incantation to check for CPU throttling, which I haven’t used since I was shoehorning Leopard into a Dell Mini 9 many years ago:
$ pmset -g thermlog
Note: No thermal warning level has been recorded
Note: No performance warning level has been recorded
2021-06-03 10:57:46 +0100 CPU Power notify
CPU_Scheduler_Limit = 100
CPU_Available_CPUs = 4
CPU_Speed_Limit = 100
Even during another short stall, pmset produces the same output. I leave it running in a terminal, alongside Activity Monitor, Console and whatnot. Good thing I have three monitors.
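Since the interesting part of that output is just three key/value pairs, here is a minimal sketch of how it could be parsed and checked automatically. It assumes the output format shown above, and that a value of 100 for the two limit keys means no throttling is being applied; both function names are illustrative.

```python
def parse_thermlog(text):
    """Collect the `KEY = value` integer pairs from `pmset -g thermlog` output."""
    limits = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and value.strip().isdigit():
            limits[key.strip()] = int(value.strip())
    return limits


def is_throttled(limits):
    """Assumption: any scheduler or speed limit below 100 indicates throttling."""
    keys = ("CPU_Scheduler_Limit", "CPU_Speed_Limit")
    return any(limits.get(key, 100) < 100 for key in keys)


# The output quoted above, used as sample input.
sample_output = """Note: No thermal warning level has been recorded
Note: No performance warning level has been recorded
2021-06-03 10:57:46 +0100 CPU Power notify
CPU_Scheduler_Limit = 100
CPU_Available_CPUs = 4
CPU_Speed_Limit = 100
"""
limits = parse_thermlog(sample_output)
print(limits, is_throttled(limits))
```

With the output I was getting, this reports no throttling, which matches what I saw by eye even during stalls.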
By this time I have disabled Time Machine, hunted down and eradicated pretty much every single non-Apple startup item and launchd service (user and system level) and still haven’t pinned down what else might be causing this.
June 3rd
It’s a bank holiday, so I have some time and clarity of mind to start documenting this end to end and experimenting a bit.
I find that another thing that seems to start a (mercifully short, less than one minute) kernel_task stall is to switch resolutions on an external Thunderbolt display. I never do this, but happened to do it to check what text scaling would look like if I swapped monitors, and guess what, kernel_task stalled the machine for a few seconds.
It does, however, not happen consistently when I repeat that (only once in five tries), so I start pondering more options.
I had no stalls during the afternoon, but it also bears mentioning that temperatures dropped (it was around 19°C outside, instead of the balmier 22-25°C we experienced earlier in the week).
June 4th
I’m working a half-day on account of yesterday’s holiday and a Monday deadline, so I have a little time to tackle this.
I clear out my desk again, lay down the iMac on my cork mat and take out a can of compressed air to have another go at cleaning the vents. A few rewarding wisps of dust come out, but I’m pretty sure I’m doing it for the sheer feeling of doing something about the problem, and not as a definitive fix.
I spend a fair amount of the day at my desktop without any noticeable stalls.
June 5th
I boot up my Mac mid-morning to experience a 30s stall just after the desktop appears. OneDrive is doing its startup thing, but finishes before the stall. pmset provides zero insights, indoor temperature is 23°C, CPU core temperature actually rises because of the stall.
There is no way I can figure this one out. I decide to clean up my notes and publish a first version of this post.
Frustratingly, at this point I don’t have any definitive causes for what is going on–only somewhat informed opinions:
- I’m pretty sure thermals are only part of the problem (indoor temperature has seldom gone above 25°C, and I’ve actually had to close the windows because the external temperature has gone below 19°C and there was a draft through the house). I am also thoroughly annoyed at pmset providing zero information, even during stalls.
- I’m somewhat sure the Fusion Drive compounds it (I/O latency certainly doesn’t help), but not sure at all it contributes to triggering the situation.
- I’m also pretty sure there are no hardware issues (there are no SMART alerts, nothing in hardware test or console logs).
- The GPU also does not seem to be at fault (no blips in its temperature charts, no definitive association with stalls).
It also doesn’t seem to be a third-party problem (I’ve reinstated OneDrive–which I still blame for general I/O slowdowns, but not for the stalls–and a couple of other things without any issues).
I have no firm idea of what is going on since this is 100% reproducible over time but not on demand, and as such I cannot determine causality.
Before that little shenanigan when changing resolutions, my suspicions were gravitating back towards the Fusion Drive (since at least some episodes of kernel_task stalls seem to be associated with I/O), but the only hard data point I have is that it all became a daily affair after upgrading to Big Sur 11.4.
June 6th
Another thing I’ve started to notice is that I sometimes can’t play videos on YouTube, with the only real workaround being a full reboot. This is definitely new and an OS issue, as it happens across all browsers. I also can’t do screen recordings when this occurs.
And by now I’m sure it is not part of any re-indexing process. As much as I kept tabs on system processes, I can’t really blame any of the stalls on post-upgrade indexing or cache rebuilds, because some of those would surely have been evident from Activity Monitor.
I’ve since had two sets of feedback (via e-mail and Twitter) from a few people, many stating they have similar issues with OneDrive triggering kernel_task stalls, and a few with similar stories of 11.4 upgrades.
One of them has upgraded to 11.5 Beta 2, which (so far) appears to have solved the issue. I also got some more tips on how to go about collecting data, which I will try to leverage this week. So far nobody has complained about video playback issues as well, but I suspect that is only a matter of time.
June 8th
A week later (and two weeks after upgrading to 11.4), the stalls have become almost an hourly occurrence of varying duration. After a 2-hour stall where I cooled down the office to under 22°C and could confirm every single sensor was under 65°C with kernel_task still taking 150%+ CPU, I decided to install Big Sur 11.5 Beta 2, but the machine rebooted straight into another stall, and disabling applications/removing peripherals is still to no avail.
June 10th
It’s a bank holiday again, so I spend a long time trying to find a suitable monitor to replace this machine (given that I’m pretty sure I will end up buying an M1 mini).
While I’m doing that, I systematically go through every hardware-based cause I can think of. I left the iMac completely unplugged overnight, removed all the peripherals (both monitors and the USB hub), disabled OneDrive, and started reintroducing a new factor every two hours:
- 2 hours without anything plugged in
- 2 hours with the USB hub plugged in
- 2 hours with a single external monitor (the lowest resolution one)
- 2 hours with OneDrive on (which was a major pain, since the onboarding experience is lousy in the sense that OneDrive insists on syncing everything before selecting just the folders you want to and only lets you disable Files on Demand after the initial slow, mammoth sync)
- 2 hours with the external 4K display on
No stalls occur, which is encouraging. The machine was smooth as silk until I turned on OneDrive again, but even though it caused major slowness (and CPU load specific to the OneDrive process), kernel_task did not rear its ugly head.
June 11th – FB9157025
I get a new stall when unlocking my machine. By this time I’ve spent roughly 50% of my vacation time debugging it, but since I’m on the beta, I file FB9157025 via Feedback Assistant, which packs all the logs and diagnostics and sends them to Apple.
June 12th – FB9167624
The next evening, I left the computer unattended for 15m and kernel_task popped back up when I unlocked it (no OneDrive, two monitors, 60°C CPU temperature). Boosting the fans to maximum brought the CPU temperature to 55°C but did nothing to break the stall.
I found a forum post someplace where someone claimed that disabling the Thunderbolt Bridge in the Network preferences pane fixed it for them, so I tried opening it and it was hung.
Not just slow, but completely hung (no network info, even though I obviously had networking and ifconfig worked in the terminal).
In retrospect, this should have been a great hint, but, again, I had zero reason to actually suspect networking at this point – I was more worried about the Thunderbolt displays.
I started unplugging monitors (since it was the only thing I could realistically do, hoping the Thunderbolt monitors were the cause of the stall). Nothing changed for over 15 minutes, so I filed FB9167624.
Eventually the stall stopped (after roughly 45 minutes), and the Network preference pane became responsive again. I then deleted the Thunderbolt bridge (which had nothing special in its config) and (since it was very late and this has been extremely frustrating) called it a night.
June 13th – FB9169758
The next day, I was greeted by another stall immediately upon boot. After 30m I powered down the machine, dug out some painter’s tape, covered all the bottom openings but one and proceeded to systematically apply the vacuum to each of those in turn, moving the tape as I went along and verifying that there was inbound airflow through the back vent.
Then I reversed the process: I removed the tape from the bottom vents, taped the vacuum hose to the back vent and left it running for 15 minutes, only to be greeted by another stall soon after powering it back up again, during which I painstakingly filed FB9169758.
By this time, I am well and truly fed up with it all. I need the machine to work, and have an M1 Mini in my Apple Store basket, ready to order. But I can’t find a suitable monitor and really don’t want to buy an M1 now, so I unplug literally everything but the power cord and decide to do a full nuke & pave.
I spend most of the afternoon backing up stuff and eventually boot into Recovery, which prompts me to reinstall the Big Sur Beta (with no way to pick a stable OS version).
I decide to try Internet Recovery, which prompts me to select a network interface. I notice it doesn’t want to join my 5GHz Wi-Fi, so I re-plug the Ethernet cable back in.
This was a great second hint, and by this time I decided to investigate.
Internet Recovery, in this day and age, asks me to install High Sierra on the machine, and the installer can’t deal with APFS.
So I format my Fusion Drive, manually rebuild it (good thing I knew this was a thing from listening to John Siracusa ranting over the years, because a normal person, even an Apple geek, would have a hard time finding the relevant support articles), and finally boot into High Sierra.
Which, incidentally, has a completely broken App Store, so upgrading to Big Sur was a challenge. I have no idea why Apple doesn’t upgrade Internet Recovery and spare people all this nonsense, but I suppose they have a lot on their plate these days.
In the meantime, I notice the Wi-Fi icon on the menu bar and remember to test the network, immediately getting another stall when connecting to my 5GHz SSID.
I pop open Console.app (a nice, usable, ancient version) and immediately notice error messages related to Wi-Fi, which go away when I finally power down Wi-Fi. I test and re-test and confirm that 2.4GHz works, but 5GHz throws kernel_task into a loop.
This is the smoking gun. I distinctly remember working via Wi-Fi a couple of months ago when I had to dust and re-wire my office gear (which took most of a Sunday), so I am absolutely certain 5GHz Wi-Fi was working before the Big Sur 11.4 update.
Around midnight I finally have a barebones working system I can at least try to use to work the next day, with just the base OS and Microsoft Remote Desktop set up. I decide to leave the Wi-Fi on but with only my 2.4GHz SSID configured.
June 14th
Nothing happens during the day other than some work getting done and OneDrive, SyncThing, Mail.app and the App Store downloading massive amounts of stuff back in.
All my hardware is plugged in, plenty of sleep/wake cycles as I switch machines to take calls on my standing desk. Zero stalls.
Around the end of the day, I go back to the machine and set up creature comforts like wallpapers, a minimum amount of launch items, and… Apple Watch unlock.
June 15th – FB9178636
I confirm that even with my 5GHz network removed from my preferred SSID list, I still get stalls.
Reading through the logs I see, again, the same wlc_ messages, neatly sandwiched in between p2p log entries.
Given that I spent all day yesterday without a single stall, I assume the Wi-Fi card is scanning for the presence of my Apple Watch and triggering off the stalls due to that.
I file FB9178636 with my findings, and disable Wi-Fi.
I think this is it.
Next Steps
Since most of my job involves planning and risk management, I’ve already given a great deal of thought to how I can fix this, and it is actually pretty straightforward even if there are really no good short term options:
Wi-Fi Card and Fusion Drive Replacement
Opening the machine and swapping the Wi-Fi card was never part of my plans, but swapping the Fusion Drive for a 1TB SSD has been on my mind for a while, and there are three catches:
- Thanks to iFixit, opening it is not the problem (although fully replacing the Fusion drive is a lot more work than a standard HDD), it’s closing it back up and having the time (and space) to do it properly.
- It will take the machine out of commission for at least an entire working day to get it done right (open, replace, test, close, reinstall everything). Maybe more.
- I want to replace this machine anyway, so why invest more in it?
Fine, it’s around €200 or so for an SSD replacement kit, but the first two items are something I don’t really want to do right now. I might do it if I decide to gift the machine to my kids, though.
Outright Replacement
Besides all the hoopla about possible upcoming M1X machines (including the WWDC MacBook rumors, that didn’t pan out), I need a desktop. And there is no modern desktop Mac right now that can provide what this one has:
- Support for three displays (1x5K retina, 1x4K retina, 1x1080p UltraWide)
- 24GB of RAM
In all honesty I probably don’t need the full 24GB right now (I added it initially to run local VMs, and I have since moved most of my VMs to a KVM host or the cloud), but I would certainly buy more than 16GB for future-proofing any new machine, as I’ve already seen Logic slurp a massive amount of RAM with audio samples.
A 16GB, 1TB 24” iMac would set me back €2.411,40. In comparison, a 16GB, 1TB M1 Mac mini would cost €1.511,40 and I would likely pay at least €1.200 for a decent monitor.
Monitor Replacements
What I do need are the displays, and I quite like the iMac’s 27” 5K panel. The new iMac would be a downgrade in terms of both internal and external display support, so I spent a fair amount of the past two weeks trying to figure out how to replace this iMac while preserving roughly the same amount of usable screen real estate.
As it happens, right now the monitor market is mostly catering to gaming PCs, a situation that is compounded by various shortages. This means, for instance, that good resolution panels (5120x2160 or above) are essentially made of unobtanium, not shipping in volumes, prohibitively expensive, or… all three, really.
But if I had to buy a new monitor right now, my shortlist (based on a combination of all three factors above) would be:
- LG 34WK95U-W (review), a 2018 model with 5120x2160, Thunderbolt 3 and a very good dot pitch
- LG 49WL95C-W (review), a 2019 model with 5120x1440 and an OK dot pitch
- Philips 498P9/00, a sensibly-priced 5120x1440 ultrawide with a tolerable dot pitch but no modern connectors
…most of which are not actually available right now, especially here. And neither are the (pre-)announced 40” 5120x2160 equivalents from Dell and LG.
Conclusion
And that’s that, at long last (I hope).
As I type this, with the machine currently being stress tested over lunch and CPU temperatures hovering near 70°C with the fans going (with the standard curve, not Macs Fan Control), I’m fully convinced the root cause was the Big Sur upgrade.
I hope a future update (if any) can fix my Wi-Fi card, or at least fix the kernel_task behavior to avoid overloading the machine if anything goes wrong with the Wi-Fi connection.
I may update this post later if I stumble upon any other ancillary causes, fixes, more feedback or just decide to take the plunge and switch machines.