The Big Sur kernel_task Troubleshooting Saga

Following up on my earlier post regarding my kernel_task woes (which began nearly two weeks ago now), I decided to strip out the updates from that post and consolidate them here with a little more detail.

Update, three weeks after the Big Sur 11.4 upgrade: Long story short, my current working theory is that 11.4 rendered my Wi-Fi card unable to associate with my 5GHz network (which is all based on Apple AirPort Extremes), and that there is a kernel bug that handles this situation very poorly.

Summary: Every time I try to join that network, kernel_task immediately goes into a spin lock burning CPU cycles across all cores and stalling the machine, with variations of these two lines in the logs:

ARPT: 2859.550473:  wlc_phy_rx_iq_est_acphy: SPINWAIT ERROR : IQ measurement timed out
ARPT: 2859.550487: wl0: fatal error, reinitializing, total count of reinit's[65], @'wlapi_wlc_fatal_error':701
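
For anyone wanting to check their own logs for the same pattern, here is a minimal sketch that counts these two kinds of lines in exported log text. The regular expressions are written against the two lines quoted above only, so other variations of the messages may need looser patterns, and the function name is made up for illustration.

```python
import re

# Matches the "fatal error, reinitializing" line and captures the reinit counter.
REINIT_RE = re.compile(
    r"ARPT:\s*(?P<uptime>\d+\.\d+):\s*wl0: fatal error, reinitializing, "
    r"total count of reinit's\[(?P<count>\d+)\]"
)
# Matches any driver function reporting a SPINWAIT ERROR.
SPINWAIT_RE = re.compile(r"ARPT:\s*(?P<uptime>\d+\.\d+):\s+\S+: SPINWAIT ERROR")


def summarize_wifi_errors(log_text):
    """Count SPINWAIT errors and report the highest reinit counter seen."""
    spinwaits = 0
    max_reinit = 0
    for line in log_text.splitlines():
        if SPINWAIT_RE.search(line):
            spinwaits += 1
        match = REINIT_RE.search(line)
        if match:
            max_reinit = max(max_reinit, int(match.group("count")))
    return {"spinwaits": spinwaits, "max_reinit": max_reinit}


# The two log lines quoted above, used as sample input.
SAMPLE = """\
ARPT: 2859.550473:  wlc_phy_rx_iq_est_acphy: SPINWAIT ERROR : IQ measurement timed out
ARPT: 2859.550487: wl0: fatal error, reinitializing, total count of reinit's[65], @'wlapi_wlc_fatal_error':701
"""

print(summarize_wifi_errors(SAMPLE))
```

A steadily climbing reinit counter while kernel_task is busy would be consistent with the theory above, although it would not prove causality on its own.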

Lessons Learned: I should have been even more systematic. But the reasons I failed to notice this over the course of two weeks of trial-and-error troubleshooting are manifold:

  • I have a Gigabit Ethernet connection (so I only have Wi-Fi on for Continuity).
  • The Wi-Fi icon was moved from the menu bar into Control Center (so status changes and error flashes are invisible).
  • The year-on-year changes to Console.app (and the glacial slowness of the machine when stalling) made it very hard to spot patterns in logs.

And, of course, I have to get work done, so working through all the possible permutations was very slow. It annoys me to no end that I only noticed this was an issue due to my having tried to pick my network when doing Internet Recovery, which helped pin things down.

I may update this post (again) later if I have more details, but for now let’s start with the whats and hows:

Context

I was getting random stalls on my iMac, with kernel_task taking up anywhere between 90% and 350% CPU without apparent cause, ever since I upgraded to Big Sur 11.4.

It’s probably important to point out that this is with an Intel i5 CPU and 24GB of RAM, sporting a Radeon 570 and the ungodly nuisance of a 1TB Fusion Drive.

Resetting the SMC or using AC to keep the environment cool seemed to help at first, but I had a hard time attributing this to thermal throttling.

I am using both Thunderbolt ports for external displays (an aging, purple-edged 4K Superfine and an HDMI adapter to a non-retina LG UltraWide), but there seems to be hardly any GPU impact.

And despite it being early summer and temperatures starting to rise above 25°C, the prime suspect was my recent upgrade to macOS Big Sur 11.4 (20F71).

Measurement and Mitigation

Halfway through this rigmarole, I went and got iStat Menus off the App Store (which I picked because I didn’t like their direct licensing terms).

Even though that does not do fan control, I’ve also had the free version of Macs Fan Control around for a couple of years since it does basic temperature monitoring, so I can force the fans on and configure a single basic response “curve” (more of a hard line, really) based on CPU package temperature.

I now have mine set to start at 40°C and max out at 75°C, which ensures the fans are always slightly above Apple’s defaults.
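
To make the "hard line" concrete, here is a minimal sketch of that kind of response, assuming the fan ramps linearly between the two temperatures. That is an assumption about how the tool behaves, not something checked against Macs Fan Control itself, and the function name and the 0-to-1 duty scale are made up for illustration.

```python
def fan_duty(temp_c, start_c=40.0, max_c=75.0):
    """Return fan duty in [0.0, 1.0]: 0 = minimum speed, 1 = maximum speed.

    Assumes a straight-line ramp between start_c and max_c.
    """
    if temp_c <= start_c:
        return 0.0
    if temp_c >= max_c:
        return 1.0
    return (temp_c - start_c) / (max_c - start_c)


# Print the duty for a few CPU package temperatures.
for temp in (35, 40, 57.5, 70, 80):
    print(f"{temp}°C -> {fan_duty(temp):.0%}")
```

With these settings the fans are already well into the ramp at the temperatures seen during stalls, which is why they end up audible.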

This is annoying because I’ve always strived to have an absolutely silent office, but when things are OK it’s quieter than what little street noise comes through the windows.

The Saga

So this is what happened since:

May 30th

Almost one full day after , I had another 8-10m stall. I manage to launch Activity Monitor (easily a 1m affair under these conditions) and search for culprits. I can only see and kernel_task hogging the CPU, so I kill the former.

I spend another full minute trying to launch Macs Fan Control, force the fans on again and it apparently makes a difference, as I see CPU temperature drop from 70°C to 65°C.

I also see no mds_stores, corespotlightd, suggestd or photoanalysisd above 5% CPU, so by this point, I’m starting to think that is at least part of the problem since I’ve had to throttle its CPU usage more than once and .

, by the way, is also running but shows no significant CPU or I/O usage (unlike , it is not prone to just randomly fire up and start scanning my hard disk).

I start culling login items and hunting down launchd entries.

May 31st

Another couple of stalls, again several minutes long.

Around lunchtime, I power down the machine in another attempt at resetting the SMC and remember to run the Apple hardware test afterward (which finds nothing).

So I power down again, unplug all my peripherals (2 Thunderbolt displays, webcam, audio interface, MIDI keyboard, etc.), replace my Logitech Brio’s USB 3 cable (I wasn’t using the bulky, long original that came with it but a shorter mongrel one I had lying around) and plug everything back in.

Other than the single cable swap (which was a bit of voodoo), the big change is that everything USB 3 now goes via a powered hub except my Time Machine drive, the reasoning being that it should help cut down on internal heating.

It is by this time that I decide to get iStat Menus from the App Store since I know it stores long-term CPU and temperature history data.

I disable Files On Demand and massively reduce the number of files synced to this machine.

Since OneDrive then tries to sync everything locally, everything takes a few hours to stabilize, but there are no stalls for the remainder of the afternoon, even with a fair amount of disk activity as it downloads a few hundred gigabytes.

I keep Activity Monitor open to peek at it while I work and kernel_task never seems to go above 10% in this period.

June 1st

Early morning, I spot a pattern that hints at this being a pure software bug, which is that kernel_task fires up (and forces an 8-10 minute stall) just after I’ve woken my iMac from standby (I usually fire it up, go have a shower and come back).

iStat Menus struggles to capture data during the stall, but it eventually pulls up CPU, GPU and temperature charts showing no significant activity while the machine was idle before the stall.

Again, there are no mds processes visible, and by this time I have pretty much nothing running in the background. A couple of cans of compressed air arrive, but I have to get a lot of work done and spend my lunchtime actually having lunch for a change.

Later in the day, I get one more stall that, rather suspiciously, starts on the hour. I begin to suspect hourly snapshots may also be to blame.

I manually trigger a backup to my NAS, and it does… nothing, in the sense that backupd starts looking at the remote snapshots, determines what needs to be copied and goes about its business. I have no stalls until 11pm, at which time I decide that if this is about I/O, then it’s highly unlikely to be the Fusion Drive.

I add it to iStat Menus nevertheless (both SSD and HDD data) to collect more datapoints.

June 2nd

By this time, I am really annoyed at in general. I rediscover the incantation to check for CPU throttling, which I haven’t used since I was shoehorning Leopard into a Dell Mini 9 many years ago:

$ pmset -g thermlog
Note: No thermal warning level has been recorded
Note: No performance warning level has been recorded
2021-06-03 10:57:46 +0100 CPU Power notify
    CPU_Scheduler_Limit     = 100
    CPU_Available_CPUs      = 4
    CPU_Speed_Limit         = 100

Even during another short stall, pmset produces the same output. I leave it running in a terminal, alongside Activity Monitor, Console and whatnot. Good thing I have three monitors.
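
Rather than eyeballing the terminal, the check can be scripted. This is a minimal sketch that parses output in the format shown above and flags throttling, assuming that a CPU_Speed_Limit or CPU_Scheduler_Limit below 100 is what throttling looks like. The function names are hypothetical, and the sample text is the output captured above.

```python
import re

# Matches lines such as "    CPU_Speed_Limit         = 100".
LIMIT_RE = re.compile(r"\s*(CPU_\w+)\s*=\s*(\d+)\s*$")


def parse_thermlog(output):
    """Extract the CPU_* key/value pairs from `pmset -g thermlog` style text."""
    limits = {}
    for line in output.splitlines():
        match = LIMIT_RE.match(line)
        if match:
            limits[match.group(1)] = int(match.group(2))
    return limits


def is_throttled(limits):
    """Assumption: any speed or scheduler limit under 100 means throttling."""
    return (
        limits.get("CPU_Speed_Limit", 100) < 100
        or limits.get("CPU_Scheduler_Limit", 100) < 100
    )


SAMPLE = """\
Note: No thermal warning level has been recorded
Note: No performance warning level has been recorded
2021-06-03 10:57:46 +0100 CPU Power notify
    CPU_Scheduler_Limit     = 100
    CPU_Available_CPUs      = 4
    CPU_Speed_Limit         = 100
"""

limits = parse_thermlog(SAMPLE)
print(limits, "throttled:", is_throttled(limits))
```

For the sample above nothing is flagged, which matches what I saw: no thermal throttling reported, even mid-stall.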

By this time I have disabled , hunted down and eradicated pretty much every single non-Apple startup item and launchd service (user and system level) and still haven’t pinned down what else might be causing this.

June 3rd

It’s a bank holiday, so I have some time and clarity of mind to start documenting this end to end and experimenting a bit.

I find that another thing that seems to start a (mercifully short, less than one minute) kernel_task stall is to switch resolutions on an external Thunderbolt display. I never do this, but happened to do it to check what text scaling would look like if I swapped monitors, and guess what, kernel_task stalled the machine for a few seconds.

It does, however, not happen consistently when I repeat that (only once in five tries), so I start pondering more options.

I had no stalls during the afternoon, but it also bears mentioning that temperatures dropped (it was around 19°C outside, instead of the balmier 22-25°C we experienced earlier in the week).

June 4th

I’m working a half-day on account of yesterday’s holiday and a Monday deadline, so I have a little time to tackle this.

I clear out my desk again, lay down the iMac on my cork mat and take out a can of compressed air to have another go at cleaning the vents. A few rewarding wisps of dust come out, but I’m pretty sure I’m doing it for the sheer feeling of doing something about the problem, and not as a definitive fix.

I spend a fair amount of the day at my desktop without any noticeable stalls.

June 5th

I boot up my Mac mid-morning to experience a 30s stall just after the desktop appears. is doing its startup thing, but finishes before the stall. pmset provides zero insights, indoor temperature is 23°C, CPU core temperature actually rises because of the stall.

There is no way I can figure this one out. I decide to clean up my notes and publish a first version of this post.

Frustratingly, at this point I don’t have any definitive causes for what is going on–only somewhat informed opinions:

  • I’m pretty sure thermals are only part of the problem (indoor temperature has seldom gone above 25°C, and I’ve actually had to close the windows because the external temperature has gone below 19°C and there was a draft through the house). I am also thoroughly annoyed at pmset providing zero information, even during stalls.
  • I’m somewhat sure the Fusion Drive compounds it (I/O latency certainly doesn’t help), but not sure at all it contributes to triggering the situation.
  • I’m also pretty sure there are no hardware issues (there are no SMART alerts, nothing in hardware test or console logs).
  • The GPU also does not seem to be at fault (no blips in its temperature charts, no definitive association with stalls).

It also doesn’t seem to be a third-party problem (I’ve reinstated –which I still blame for general I/O slowdowns, but not for the stalls–and a couple of other things without any issues).

I have no firm idea of what is going on since this is 100% reproducible over time but not on demand, and as such I cannot determine causality.

Before that little shenanigan when changing resolutions, my suspicions were gravitating back towards the Fusion Drive (since at least some episodes of kernel_task stalls seem to be associated with I/O), but the only hard data point I have is that it all became a daily affair after upgrading to Big Sur 11.4.

June 6th

Another thing I’ve started to notice is that I sometimes can’t play videos on YouTube, with the only real workaround being a full reboot. This is definitely new and an OS issue, as it happens across all browsers. I also can’t do screen recordings when this occurs.

And by now I’m sure it is not part of any re-indexing process. As much as I kept tabs on system processes, I can’t really blame any of the stalls on post-upgrade indexing or cache rebuilds, because some of those would surely have been evident from Activity Monitor.

I’ve since had two sets of feedback (via e-mail and Twitter) from a few people, many stating they have similar issues with triggering kernel_task stalls, and a few with similar stories of 11.4 upgrades.

One of them has upgraded to 11.5 Beta 2, which (so far) appears to have solved the issue. I also got some more tips on how to go about collecting data, which I will try to leverage this week. So far nobody has complained about video playback issues as well, but I suspect that is only a matter of time.

June 8th

A week later (and two weeks after upgrading to 11.4), the stalls have become almost an hourly occurrence of varying duration. After a 2-hour stall where I cooled down the office to under 22°C and could confirm every single sensor was under 65°C with kernel_task still taking 150%+ CPU, I decided to install Big Sur 11.5 Beta 2. The machine rebooted straight into another stall, and disabling applications or removing peripherals is still to no avail.

June 10th

It’s a bank holiday again, so I spend a long time trying to find a suitable monitor to replace this machine (given that I’m pretty sure I will end up buying an M1 mini).

While I’m doing that, I systematically go through every hardware-based cause I can think of. I left the iMac completely unplugged overnight, removed all the peripherals (both monitors and the USB hub), disabled , and started reintroducing a new factor every two hours:

  • 2 hours without anything plugged in
  • 2 hours with the USB hub plugged in
  • 2 hours with a single external monitor (the lowest resolution one)
  • 2 hours with OneDrive on (which was a major pain, since the onboarding experience is lousy: OneDrive insists on syncing everything before letting you select just the folders you want, and only lets you disable Files on Demand after the initial slow, mammoth sync)
  • 2 hours with the external 4K display on

No stalls occur, which is encouraging. The machine was smooth as silk until I turned OneDrive on again, but even though it caused major slowness (and CPU load specific to the process), kernel_task did not rear its ugly head.

June 11th

I get a new stall when unlocking my machine. By this time I’ve spent roughly 50% of my vacation time debugging it, but since I’m on the beta, I file FB9157025 via Feedback Assistant, which packs all the logs and diagnostics and sends them to Apple.

June 12th

The next evening, I left the computer unattended for 15m and kernel_task popped back up when I unlocked it (no , two monitors, 60°C CPU temperature). Boosting the fans to maximum brought the CPU temperature to 55°C but did nothing to break the stall.

I found a forum post someplace where someone claimed that disabling the Thunderbolt Bridge in the Network preferences pane fixed it for them, so I tried opening it and it was hung.

Not just slow, but completely hung (no network info, even though I obviously had networking and ifconfig worked in the terminal).

In retrospect, this should have been a great hint, but, again, I had zero reason to actually suspect networking at this point – I was more worried about the Thunderbolt displays.

I started unplugging monitors (since it was the only thing I could realistically do, hoping the Thunderbolt monitors were the cause of the stall). Nothing changed for over 15 minutes, so I filed FB9167624.

(Screenshot: this is as much detail as I can get from iStat Menus during a stall.)

Eventually the stall stopped (after roughly 45 minutes), and the Network preference pane became responsive again. I then deleted the Thunderbolt bridge (which had nothing special in its config) and (since it was very late and this has been extremely frustrating) called it a night.

June 13th

The next day, I was greeted by another stall immediately upon boot. After 30m I powered down the machine, dug out some painter’s tape, covered all the bottom openings but one and proceeded to systematically apply the vacuum to each of those in turn, moving the tape as I went along and verifying that there was inbound airflow through the back vent.

Then I reversed the process–removed the tape from the bottom vents, taped the vacuum hose to the back vent and left it running for 15 minutes, only to be greeted by another stall soon after powering it back up again, during which I painstakingly filed FB9169758.

By this time, I am well and truly fed up with it all. I need the machine to work, and have an M1 Mini in my Apple Store basket, ready to order. But I can’t find a suitable monitor and really don’t want to buy an M1 now, so I unplug literally everything but the power cord and decide to do a full nuke & pave.

I spend most of the afternoon backing up stuff and eventually boot into Recovery, which prompts me to reinstall the Big Sur Beta (with no way to pick a stable OS version).

I decide to try Internet Recovery, which prompts me to select a network interface. I notice it doesn’t want to join my 5GHz Wi-Fi, so I plug the Ethernet cable back in.

This was a great second hint, and by this time I decided to investigate.

Internet Recovery, in this day and age, asks me to install High Sierra on the machine, and the installer can’t deal with APFS.

So I format my Fusion Drive, manually rebuild it (good thing I knew this was a thing from listening to John Siracusa ranting over the years, because a normal person–even an Apple geek–would have a hard time finding the relevant support articles), and finally boot into High Sierra.

Which, incidentally, has a completely broken App Store, so upgrading to Big Sur was a challenge. I have no idea why Apple doesn’t upgrade Internet Recovery and spare people all this nonsense, but I suppose they have a lot on their plate these days.

In the meantime, I notice the Wi-Fi icon on the menu bar and remember to test the network, immediately getting another stall when connecting to my 5GHz SSID.

I pop open Console.app (a nice, usable, ancient version) and immediately notice error messages related to Wi-Fi, which go away when I finally power down Wi-Fi. I test and re-test and confirm that 2.4GHz works, but 5GHz throws kernel_task into a loop.

This is the smoking gun. I distinctly remember working via Wi-Fi a couple of months ago when I had to dust and re-wire my office gear (which took most of a Sunday), so I am absolutely certain 5GHz Wi-Fi was working before the Big Sur 11.4 update.

Around midnight I finally have a barebones working system I can at least try to use to work the next day, with just the base OS and Microsoft Remote Desktop set up. I decide to leave the Wi-Fi on but with only my 2.4GHz SSID configured.

June 14th

Nothing happens during the day other than some work getting done and , , Mail.app and the App Store downloading massive amounts of stuff back in.

All my hardware is plugged in, plenty of sleep/wake cycles as I switch machines to take calls on my standing desk. Zero stalls.

Around the end of the day, I go back to the machine and set up creature comforts like wallpapers, a minimum amount of launch items, and… Apple Watch unlock.

June 15th

I confirm that even with my 5GHz network removed from my preferred SSID list, I still get stalls.

Reading through the logs I see, again, the same wlc_ messages, neatly sandwiched in between p2p log entries.

Given that I spent all day yesterday without a single stall, I assume the Wi-Fi card is scanning for the presence of my Apple Watch and triggering off the stalls due to that.

I file FB9178636 with my findings, and disable Wi-Fi.

I think this is it.

Next Steps

Since most of my job involves planning and risk management, I’ve already given a great deal of thought to how I can fix this, and it is actually pretty straightforward even if there are really no good short term options:

Wi-Fi Card and Fusion Drive Replacement

Opening the machine and swapping the Wi-Fi card was never part of my plans, but swapping the Fusion Drive for a 1TB SSD has been on my mind for a while, and there are three catches:

  • Thanks to iFixit, opening it is not the problem (although fully replacing the Fusion drive is a lot more work than a standard HDD), it’s closing it back up and having the time (and space) to do it properly.
  • It will take the machine out of commission for at least an entire working day to get it done right (open, replace, test, close, reinstall everything). Maybe more.
  • I want to replace this machine anyway, so why invest more in it?

Fine, it’s around €200 or so for an SSD replacement kit, but the first two items are something I don’t really want to do right now. I might do it if I decide to gift the machine to my kids, though.

Outright Replacement

Besides all the hoopla about possible upcoming M1X machines (including the WWDC MacBook rumors that didn’t pan out), I need a desktop. And there is no modern desktop Mac right now that can provide what this one has:

  • Support for three displays (1x5K retina, 1x4K retina, 1x1080p UltraWide)
  • 24GB of RAM

In all honesty I probably don’t need the full 24GB right now (I added it initially to run local VMs, and I have since moved most of my VMs to a KVM host or the cloud), but I would certainly buy more than 16GB for future-proofing any new machine, as I’ve already seen Logic slurp a massive amount of RAM with audio samples.

A 16GB, 1TB 24” iMac would set me back €2.411,40. In comparison, a 16GB, 1TB M1 Mac mini would cost €1.511,40 and I would likely pay at least €1.200 for a decent monitor.
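
Spelled out, the comparison looks like this. It is a small sketch using only the prices quoted above, with the "at least €1.200" monitor figure treated as a lower bound, so the real gap could be larger. The constant and function names are made up for illustration.

```python
from decimal import Decimal

# Prices quoted above, in euros.
IMAC_24 = Decimal("2411.40")        # 16GB, 1TB 24" iMac
MAC_MINI = Decimal("1511.40")       # 16GB, 1TB M1 Mac mini
MONITOR_ESTIMATE = Decimal("1200")  # lower bound for a decent monitor


def mini_plus_monitor():
    """Total for the mini route: computer plus a separate monitor."""
    return MAC_MINI + MONITOR_ESTIMATE


def premium_over_imac():
    """How much more the mini route costs than the 24-inch iMac."""
    return mini_plus_monitor() - IMAC_24


print(f"mini + monitor: €{mini_plus_monitor()}")
print(f'premium over 24" iMac: €{premium_over_imac()}')
```

So the mini route costs more than the iMac before accounting for anything else, which is part of why neither option is appealing.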

Monitor Replacements

What I do need are the displays, and I quite like the iMac’s 27” 5K panel. And the would be a downgrade in terms of both internal and external display support, so I spent a fair amount of the past two weeks trying to figure out what to do where it regards replacing this while preserving roughly the same amount of usable screen real estate.

As it happens, right now the monitor market is mostly catering to gaming PCs, a situation that is compounded by various shortages. This means, for instance, that good resolution panels (5120x2160 or above) are essentially made of unobtanium, not shipping in volumes, prohibitively expensive, or… all three, really.

But if I had to buy a new monitor right now, my shortlist (based on a combination of all three factors above) would be:

  • LG 34WK95U-W (review), a 2018 model with 5120x2160, Thunderbolt 3 and a very good dot pitch
  • LG 49WL95C-W (review), a 2019 model with 5120x1440 and an OK dot pitch
  • Philips 498P9/00, a sensibly-priced 5120x1440 ultrawide with a tolerable dot pitch but no modern connectors

…most of which are not actually available right now, especially here. And neither are the (pre-)announced 40” 5120x2160 equivalents from Dell and LG.
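
Since dot pitch is what separates these panels, here is a rough way to compare them: pixel density from resolution and diagonal size. The 27" and 40" sizes come from the text above. The 5120x2880 resolution for the iMac's 5K panel and the 34" and 49" diagonals (inferred from the LG model names) are assumptions, so treat the numbers as ballpark figures.

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Pixel density from resolution and diagonal size."""
    return math.hypot(width_px, height_px) / diagonal_in


# (label, width, height, diagonal in inches)
PANELS = [
    ('iMac 27" 5K (assuming 5120x2880)', 5120, 2880, 27),
    ('34" 5120x2160 (size assumed from model name)', 5120, 2160, 34),
    ('40" 5120x2160', 5120, 2160, 40),
    ('49" 5120x1440 (size assumed from model name)', 5120, 1440, 49),
]

for label, w, h, d in PANELS:
    print(f"{label}: {pixels_per_inch(w, h, d):.0f} ppi")
```

Dot pitch in millimetres is simply 25.4 divided by the ppi figure, so a higher ppi means a finer pitch. None of the ultrawides get close to the iMac's panel, and the 5120x1440 models are roughly half its density.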

Conclusion

And that’s that, at long last (I hope).

As I type this, with the machine currently being stress tested over lunch and CPU temperatures hovering near 70°C with the fans going (with the standard curve, not Macs Fan Control), I’m fully convinced the root cause was the Big Sur upgrade.

I hope a future update (if any) can fix my Wi-Fi card, or at least fix the kernel_task behavior to avoid overloading the machine if anything goes wrong with the Wi-Fi connection.

I may update this post later if I stumble upon any other ancillary causes, fixes, more feedback or just decide to take the plunge and switch machines.
