Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15146372/

Time: 2020-08-06 19:09:07  Source: igfitidea

Linux stuck in CPU soft lockup?

Tags: linux, linux-kernel, centos, driver, kernel-module

Asked by Ahmed A

My system runs CentOS 6.3 (kernel version 2.6.32-279.el6.x86_64).

I have a loadable kernel module, a driver that manages a PCIe card. If I manually insert the driver using insmod while the OS is up and running, the driver loads successfully and is operational.

However, if I install the driver using rpm and then reboot the system, the OS gets stuck during startup, printing the following "soft lockup" message for every CPU core. On all cores but one, the stuck task is a migration thread; on the remaining core, it is one of the threads created by my driver.

BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).
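
When the console output is captured (for example over a serial console), the reports above can be grouped by task name to see at a glance which CPU is stuck in something other than a migration thread. The following is a minimal sketch, assuming the message format shown above; the function name `parse_soft_lockups` and the sample CPU numbers are hypothetical, not part of the original post.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
# The placeholder "CPU#X" from the post will not match; real logs carry a number.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_text):
    """Return {task name: sorted list of CPU numbers} for every report found."""
    cpus_by_task = defaultdict(set)
    for match in LOCKUP_RE.finditer(log_text):
        cpus_by_task[match.group("comm")].add(int(match.group("cpu")))
    return {task: sorted(cpus) for task, cpus in cpus_by_task.items()}


# Hypothetical sample in the same format as the messages above.
sample = (
    "BUG: soft lockup - CPU#8 stuck for 67s! [migration/8:36]\n"
    "BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]\n"
)
lockups = parse_soft_lockups(sample)
```

Any key in the result that does not start with `migration/` points at the task worth investigating first.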

I searched the net quite a bit for info on this kernel message / bug. There are quite a few posts about it, but none on what causes it or how to debug it. Any help with the following questions would really be appreciated:

  1. I am not able to log into the system, I think because all the cores are in a "soft lockup" state, so I cannot trigger a kernel dump from a shell prompt. I enabled SysRq and tried to trigger a kernel dump with the SysRq key combo, but no luck. The system does not seem to respond to the keyboard at all (not even to the CapsLock key). Any suggestions on how I can trigger a kernel dump in this circumstance?

  2. I can imagine the possibility of my driver thread causing a "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?

  3. From browsing the net, I gather that the "migration" thread is used to move tasks from one CPU to another. Can someone please help me understand what this thread does exactly, and how it can be affected by other threads, if at all?

Answered by Giorgian Borca-Tasciuc

I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.

It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.

Check whether CentOS offers an intel-microcode package, and install it. Make sure you configure GRUB to load the microcode image as an initial ramdisk before it loads the regular initramfs.

Personally, I upgraded my microcode by booting into Windows and running a BIOS update. You can check whether the microcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo before and after the update.
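
The before/after comparison can also be scripted. The sketch below assumes the kernel exposes a `microcode` field in /proc/cpuinfo (the same field the grep command looks for); on kernels that do not, the result is simply an empty set. The function name `microcode_revisions` is hypothetical.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of distinct 'microcode' values found in cpuinfo text."""
    return set(
        re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE)
    )


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the raw cpuinfo text; only meaningful on Linux."""
    with open(path) as f:
        return f.read()
```

Calling `microcode_revisions(read_cpuinfo())` once before and once after the update and comparing the two sets shows whether the revision changed. More than one value in a single set would mean the cores do not all report the same revision.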
