为什么在 Linux 上导入 numpy 后 multiprocessing 只使用一个核心?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/15639779/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

时间:2020-08-06 22:33:59  来源:igfitidea

Why does multiprocessing use only a single core after I import numpy?

python linux numpy multiprocessing blas

提问by ali_m

I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.

我不确定这是否更像是一个操作系统问题,但我想我会在这里问,以防有人对 Python 的事情有一些了解。

I've been trying to parallelise a CPU-heavy forloop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.

我一直在尝试使用 joblib 并行化一个 CPU 密集型的 for 循环,但我发现并不是每个工作进程被分配到不同的内核,而是所有进程都被分配到了同一个内核,没有任何性能提升。

Here's a very trivial example...

这是一个非常简单的例子......

from joblib import Parallel, delayed
import numpy as np

def testfunc(data):
    # some very boneheaded CPU work
    for nn in range(1000):
        for ii in data[0, :]:
            for jj in data[1, :]:
                ii * jj

def run(niter=10):
    data = (np.random.randn(2, 100) for ii in range(niter))
    pool = Parallel(n_jobs=-1, verbose=1, pre_dispatch='all')
    results = pool(delayed(testfunc)(dd) for dd in data)

if __name__ == '__main__':
    run()

...and here's what I see in htop while this script is running:

……这是此脚本运行时我在 htop 中看到的内容:

[screenshot: htop]

I'm running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?

我在一台 4 核的笔记本电脑上运行 Ubuntu 12.10 (3.5.0-26)。显然 joblib.Parallel 为不同的 worker 生成了单独的进程,但有什么办法可以让这些进程在不同的内核上执行?

采纳答案by ali_m

After some more googling I found the answer here.

经过更多的谷歌搜索,我在这里找到了答案。

It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage...) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.

事实证明,某些 Python 模块(numpy、scipy、tables、pandas、skimage……)会在导入时改变 CPU 核心亲和性(core affinity)。据我所知,这个问题似乎是由它们链接到多线程 OpenBLAS 库引起的。
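On Linux this is easy to observe directly. A minimal check (assuming a numpy build linked against an affinity-resetting OpenBLAS; on unaffected builds the two sets will simply be identical):

```python
import os

# CPU set this process may run on, before any BLAS-linked import (Linux-only API)
before = os.sched_getaffinity(0)

import numpy as np  # an affected OpenBLAS-linked build may shrink the set here

after = os.sched_getaffinity(0)
print(before, after)
```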

A workaround is to reset the task affinity using

一种解决方法是使用以下命令重置任务的 CPU 亲和性:

import os
os.system("taskset -p 0xff %d" % os.getpid())

With this line pasted in after the module imports, my example now runs on all cores:

将这一行粘贴到模块导入之后,我的示例现在可以在所有内核上运行:
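For context, here is a fuller sketch of the workaround. The 0xff mask allows CPUs 0-7; the kernel accepts mask bits for CPUs that don't exist as long as at least one bit matches a real CPU, so it is safe on a 4-core machine too. taskset is Linux-specific:

```python
import os

# (BLAS-linked imports such as numpy/scipy would go here)

# Reset this process's affinity to CPUs 0-7 via the taskset utility
os.system("taskset -p 0xff %d" % os.getpid())

# Verify: the process should again be allowed on every available core
print(os.sched_getaffinity(0))
```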

htop_workaround

htop_workaround

My experience so far has been that this doesn't seem to have any negative effect on numpy's performance, although this is probably machine- and task-specific.

到目前为止,我的经验是这似乎对 numpy 的性能没有任何负面影响,尽管这可能因机器和任务而异。

Update:

更新:

There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example

还有两种方法可以禁用 OpenBLAS 本身的 CPU 亲和性重置行为。在运行时,您可以使用环境变量 OPENBLAS_MAIN_FREE(或 GOTOBLAS_MAIN_FREE),例如

OPENBLAS_MAIN_FREE=1 python myscript.py
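The same can be arranged from inside the script itself, provided the variable is set before the first import of numpy. A sketch (the variable only has an effect with an OpenBLAS-linked build):

```python
import os

# Must be in the environment before OpenBLAS is loaded,
# i.e. before the first `import numpy`
os.environ["OPENBLAS_MAIN_FREE"] = "1"

import numpy as np  # OpenBLAS now leaves the CPU affinity alone
```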

Or alternatively, if you're compiling OpenBLAS from source you can permanently disable it at build-time by editing Makefile.rule to contain the line

或者,如果您从源代码编译 OpenBLAS,您可以在构建时编辑 Makefile.rule,使其包含下面这一行,从而永久禁用该行为:

NO_AFFINITY=1

回答by WoJ

Python 3 now exposes the methods to directly set the affinity:

Python 3 现在提供了直接设置 CPU 亲和性的方法:

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}