Context switches much slower in new linux kernels
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA terms, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/12111954/
Asked by Michael Goldshteyn
We are looking to upgrade the OS on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, it seems that the latency to run a thread that has become runnable has significantly increased from the 2.6 kernel to the 3.2 kernel. In fact the latency numbers we are getting are hard to believe.
Let me be more specific about the test. We have a program that runs two threads. The first thread gets the current time (in ticks using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this the second thread waits on the condition variable once more. It will be signaled again by the first thread after about a second passes.
So, in a nutshell, we get one thread-to-thread communication-via-condition-variable latency measurement per second as a result.
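For illustration only, a minimal sketch of this kind of test is shown below. This is not the OP's pastebin code (linked later in the question); it omits the CPU affinity pinning and the tick-to-microsecond conversion that a real measurement needs, and assumes an x86 TSC:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static uint64_t signal_tsc = 0;   // TSC sampled just before the signal (protected by mtx)
static int ready = 0;

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static void *waiter(void *)
{
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (!ready)
            pthread_cond_wait(&cv, &mtx);   // wake-up latency accumulates here
        ready = 0;
        uint64_t delta = rdtsc() - signal_tsc;
        pthread_mutex_unlock(&mtx);
        printf("wake-up latency: %llu ticks\n", (unsigned long long)delta);
    }
    return NULL;
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    for (;;) {
        sleep(1);                            // signal roughly once a second
        pthread_mutex_lock(&mtx);
        ready = 1;
        signal_tsc = rdtsc();                // timestamp taken just before waking the waiter
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
    }
}

It compiles the same way as the OP's test, e.g. g++ -O3 -o latency_sketch latency_sketch.cpp -lpthread.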
In kernel 2.6.32, this latency is somewhere on the order of 2.8-3.5 us, which is reasonable. In kernel 3.2.0, this latency has increased to somewhere on the order of 40-100 us. I have excluded any differences in hardware between the two hosts. They run on identical hardware (dual socket X5687 {Westmere-EP} processors running at 3.6 GHz with hyperthreading, speedstep and all C states turned off). The test app changes the affinity of the threads to run them on independent physical cores of the same socket (i.e., the first thread is run on Core 0 and the second thread is run on Core 1), so there is no bouncing of threads on cores or bouncing/communication between sockets.
The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context switch box). All BIOS settings and hardware are identical.
Have there been any changes in the kernel that could account for this ridiculous slow down in how long it takes for a thread to be scheduled to run?
Update: If you would like to run the test on your host and linux build, I have posted the code to pastebin for your perusal. Compile with:
g++ -O3 -o test_latency test_latency.cpp -lpthread
Run with (assuming you have at least a dual-core box):
./test_latency 0 1 # Thread 1 on Core 0 and Thread 2 on Core 1
Update 2: After much searching through kernel parameters, posts on kernel changes and personal research, I have figured out what the problem is and have posted the solution as an answer to this question.
Accepted answer by Michael Goldshteyn
The solution to the bad thread wake up performance problem in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C states in your PC's (or server's) BIOS, this driver will still force them on during periods of brief inactivity, which are almost always happening unless an all-core-consuming synthetic benchmark (e.g., stress) is running. You can monitor C state transitions, along with other useful information related to processor frequencies, using the wonderful Google i7z tool on most compatible hardware.
To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:
cat /sys/devices/system/cpu/cpuidle/current_driver
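On an affected 3.x box this typically prints intel_idle, while on an older 2.6.32 setup it would typically show acpi_idle (the exact output depends on your kernel and hardware):

intel_idle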
If you want your modern Linux OS to have the lowest context switch latency possible, add the following kernel boot parameters to disable all of these power saving features:
On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and then running update-grub. The boot parameters to add are:
intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
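For example (my sketch, not from the original answer), the edited line in /etc/default/grub might look like the following; keep whatever options your distribution already has there (Ubuntu's stock default is "quiet splash"):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll"

Then apply the change and reboot:

sudo update-grub
sudo reboot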
Here are the gory details about what the three boot options do:
Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (at least per the documentation of the option), or disable it completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0, is unnecessary. However, the documentation states that setting max_cstate to zero for the intel_idle driver should revert the OS to the acpi_idle driver. Therefore, I put in the second boot option just in case.
The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert you from intel_idle to acpi_idle with just the first boot option, please let me know in the comments whether the second option, processor.max_cstate, did what it was documented to do, so that I can update this answer.
Finally, the last of the three parameters, idle=poll, is a real power hog. It will disable C1/C1E, which will remove the final remaining bit of latency at the expense of a lot more power consumption, so use that one only when it's really necessary. For most this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware I described in the original question, the latency went from 9 us to 3 us. This is certainly a significant reduction for highly latency sensitive applications (e.g., financial trading, high precision telemetry/tracking, high freq. data acquisition, etc...), but may not be worth the incurred electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance vs. the actual increase in power consumption/heat of your hardware and weigh the tradeoffs.
Update:
After additional testing with various idle=* parameters, I have discovered that setting idle to mwait, if supported by your hardware, is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without any noticeable latency being added to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll), less power use and still retain the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:
intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait
The use of idle=mwait instead of idle=poll may also help with the initiation of Turbo Boost (by helping the CPU stay below its TDP [Thermal Design Power]) and hyperthreading (for which MWAIT is the ideal mechanism for not consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, which I will continue to do.
Update 2:
The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:
idle=halt - Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The problem lies in the fact that an interrupt is required to get out of a HLT state, while a memory write (or interrupt) can be used to get out of the MWAIT state. Depending on what the Linux Kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said, test/profile and see if it meets your latency needs...
and
idle=poll - The highest performance option, at the expense of power and heat.
Answered by amdn
Perhaps what got slower is futex, the building block for condition variables. This will shed some light:
strace -r ./test_latency 0 1 &> test_latency_strace & sleep 8 && killall test_latency
then
for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
which will show the microseconds taken for the interesting system calls, sorted by time.
On kernel 2.6.32
$ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
futex
1.000140 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000129 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000124 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000119 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000106 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000103 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000102 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
0.000125 futex(0x7f98ce4c0b88, FUTEX_WAKE_PRIVATE, 2147483647) = 0
0.000042 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000038 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000037 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000030 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000029 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 0
0.000028 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000027 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000018 futex(0x7fff82f0ec3c, FUTEX_WAKE_PRIVATE, 1) = 0
nanosleep
0.000027 nanosleep({1, 0}, {1, 0}) = 0
0.000019 nanosleep({1, 0}, {1, 0}) = 0
0.000019 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, 0x7fff82f0eb40) = ? ERESTART_RESTARTBLOCK (To be restarted)
0.000017 nanosleep({1, 0}, {1, 0}) = 0
rt_sig
0.000045 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000040 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000038 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000034 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000033 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000032 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000032 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000028 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000028 rt_sigaction(SIGRT_1, {0x37f8c052b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
0.000027 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000027 rt_sigaction(SIGRTMIN, {0x37f8c05370, [], SA_RESTORER|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
0.000027 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000023 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000022 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000019 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
On kernel 3.1.9
$ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
futex
1.000129 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000126 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000122 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000115 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000114 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000112 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000109 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
0.000139 futex(0x3f8b8f2fb0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
0.000043 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000041 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000037 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000036 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
nanosleep
0.000025 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000022 nanosleep({1, 0}, {0, 3925413}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
0.000021 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
rt_sig
0.000045 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000044 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000043 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000040 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000038 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000037 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000035 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000034 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigaction(SIGRT_1, {0x3f892067b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0
0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000024 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000023 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000019 rt_sigaction(SIGRTMIN, {0x3f89206720, [], SA_RESTORER|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0
I found this 5-year-old bug report that contains a "ping pong" performance test that compares
- single-threaded libpthread mutex
- libpthread condition variable
- plain old Unix signals
I had to add
#include <stdint.h>
in order to compile, which I did with this command
g++ -O3 -o condvar-perf condvar-perf.cpp -lpthread -lrt
On kernel 2.6.32
$ ./condvar-perf 1000000
NPTL
mutex elapsed: 29085 us; per iteration: 29 ns / 9.4e-05 context switches.
c.v. ping-pong test elapsed: 4771993 us; per iteration: 4771 ns / 4.03 context switches.
signal ping-pong test elapsed: 8685423 us; per iteration: 8685 ns / 4.05 context switches.
On kernel 3.1.9
$ ./condvar-perf 1000000
NPTL
mutex elapsed: 26811 us; per iteration: 26 ns / 8e-06 context switches.
c.v. ping-pong test elapsed: 10930794 us; per iteration: 10930 ns / 4.01 context switches.
signal ping-pong test elapsed: 10949670 us; per iteration: 10949 ns / 4.01 context switches.
I conclude that between kernel 2.6.32 and 3.1.9 the context switch has indeed slowed down, though not as much as you observe in kernel 3.2. I realize this doesn't yet answer your question; I'll keep digging.
Edit: I've found that changing the real time priority of the process (both threads) improves the performance on 3.1.9 to match 2.6.32. However, setting the same priority on 2.6.32 makes it slow down... go figure - I'll look into it more.
Here are my results now:
On kernel 2.6.32
$ ./condvar-perf 1000000
NPTL
mutex elapsed: 29629 us; per iteration: 29 ns / 0.000418 context switches.
c.v. ping-pong test elapsed: 6225637 us; per iteration: 6225 ns / 4.1 context switches.
signal ping-pong test elapsed: 5602248 us; per iteration: 5602 ns / 4.09 context switches.
$ chrt -f 1 ./condvar-perf 1000000
NPTL
mutex elapsed: 29049 us; per iteration: 29 ns / 0.000407 context switches.
c.v. ping-pong test elapsed: 16131360 us; per iteration: 16131 ns / 4.29 context switches.
signal ping-pong test elapsed: 11817819 us; per iteration: 11817 ns / 4.16 context switches.
$
On kernel 3.1.9
$ ./condvar-perf 1000000
NPTL
mutex elapsed: 26830 us; per iteration: 26 ns / 5.7e-05 context switches.
c.v. ping-pong test elapsed: 12812788 us; per iteration: 12812 ns / 4.01 context switches.
signal ping-pong test elapsed: 13126865 us; per iteration: 13126 ns / 4.01 context switches.
$ chrt -f 1 ./condvar-perf 1000000
NPTL
mutex elapsed: 27025 us; per iteration: 27 ns / 3.7e-05 context switches.
c.v. ping-pong test elapsed: 5099885 us; per iteration: 5099 ns / 4 context switches.
signal ping-pong test elapsed: 5508227 us; per iteration: 5508 ns / 4 context switches.
$
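The chrt -f 1 prefix above runs the benchmark under SCHED_FIFO at priority 1. A roughly equivalent in-process sketch (my illustration, not part of the condvar-perf code) is shown below; on Linux/NPTL, threads created afterwards inherit the policy by default, and the call needs root or CAP_SYS_NICE to succeed:

#include <sched.h>
#include <stdio.h>
#include <string.h>

// Put the calling process under SCHED_FIFO priority 1, like `chrt -f 1 <cmd>`.
static bool make_realtime(void)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 1;                        // lowest real-time priority
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");             // typically EPERM without root/CAP_SYS_NICE
        return false;
    }
    return true;
}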
Answered by Kyle Brandt
You might also see processors clocking down on more recent processors and Linux kernels due to the pstate driver, which is separate from C-states. So, in addition, to disable this, use the following kernel parameter:
intel_pstate=disable
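To confirm which frequency scaling driver is active (my addition, not part of this answer), read the cpufreq sysfs entry on kernels that provide it:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver   # intel_pstate when the pstate driver is in use; typically acpi-cpufreq otherwise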