multithreading - How to compile OpenCL programs on multiple cores?

Problem description

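OpenCL programs/kernels are built at runtime with clBuildProgram(). My program generates the kernels to build dynamically, so it spends a lot of time compiling them. Since there are many kernels and they are completely independent of each other, I wanted to split this work across multiple cores.

This person seems to have had a very similar problem, but that question is 6 years old and the solution is not satisfying, in my opinion.

My attempt is shown in the snippet below: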

ThreadPool tempPool;
auto start = std::chrono::steady_clock::now();

for (int reps = 0; reps < 50; reps++) {
    tempPool.addJob([this] () {
        auto start = std::chrono::steady_clock::now();

        // The program source and its length in bytes
        std::vector<const char*> sources = {sourceCode.toRawUTF8()};
        // Element type must be non-const: std::vector<const size_t> does not compile
        std::vector<size_t> sourceLengths = {sourceCode.getNumBytesAsUTF8()};

        cl_int ret;
        cl_program program = clCreateProgramWithSource(getCLContext()(), 1, sources.data(), sourceLengths.data(), &ret);

        // Build the program for the first device
        ret = clBuildProgram(program, 1, &getCLDevices()[0](), NULL, NULL, NULL);
        if (ret != CL_SUCCESS) {
            // Generic error checking
        }

        auto singleDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
    });
}

//Simple way to wait for all jobs to be finished
while (tempPool.getNumJobs() > 0) {
    Thread::sleep(1);
}

auto totalDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();

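Everything else I run through this ThreadPool setup gets a 5-6x speedup (I have 8 threads), which is what I would expect. Building OpenCL kernels, however, does not. It seems that only one kernel can be built at a time.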
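One way to put a number on that observation is to time the same batch of jobs once on a single thread and once on several, then compare the wall-clock times. The sketch below is a minimal, hedged illustration: it uses plain std::thread instead of the ThreadPool class from the question, and the names timeJobsMs and speedup are hypothetical helpers, not part of any library.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Runs `numJobs` invocations of `job` on `numThreads` worker threads
// and returns the elapsed wall-clock time in milliseconds.
double timeJobsMs(int numThreads, int numJobs, const std::function<void()>& job) {
    std::atomic<int> next{0};
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            // Each worker keeps claiming job indices until none are left.
            while (next.fetch_add(1) < numJobs) {
                job();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

// Ratio of single-threaded to multi-threaded wall time for the same batch.
// A value near 1 suggests the jobs are being serialized somewhere.
double speedup(int numThreads, int numJobs, const std::function<void()>& job) {
    double serialMs = timeJobsMs(1, numJobs, job);
    double parallelMs = timeJobsMs(numThreads, numJobs, job);
    return serialMs / parallelMs;
}
```

In the real program, the job passed in would be the clCreateProgramWithSource plus clBuildProgram pair from the snippet above. Comparing wall-clock times is deliberate: per-job durations measured inside each thread also include any time spent waiting on a lock inside the driver, so they can look large even when nothing runs in parallel.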

Is there a workaround for this? I am on macOS at the moment, but I am also interested in Linux/Windows.

If not, is there a way to build OpenCL kernels that does not involve clBuildProgram(), for example with gcc or a similar tool?

Tags: multithreading, macos, opencl

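Solution

(I am surprised that your platform's driver is not multi-threaded already. Are you sure your calls are really being made in parallel?)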
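A quick way to answer that question is to count how many jobs are inside the build call at the same moment. The following is a minimal sketch under stated assumptions: ConcurrencyProbe and measurePeakConcurrency are hypothetical names, and plain std::thread stands in for the ThreadPool from the question.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Tracks how many jobs are inside the measured region at once.
struct ConcurrencyProbe {
    std::atomic<int> active{0};
    std::atomic<int> peak{0};

    void enter() {
        int now = ++active;
        int seen = peak.load();
        // Raise the recorded peak if this thread observed a higher count.
        while (now > seen && !peak.compare_exchange_weak(seen, now)) {
        }
    }

    void leave() { --active; }
};

// Runs `job` once on each of `numThreads` threads and returns the highest
// number of jobs that were in flight at the same time.
int measurePeakConcurrency(int numThreads, const std::function<void()>& job) {
    ConcurrencyProbe probe;
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back([&] {
            probe.enter();
            job();  // in the real program: the clBuildProgram call
            probe.leave();
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return probe.peak.load();
}
```

A peak of 1 would point at the thread pool or the setup code, because the calls never overlap. A peak close to the thread count combined with no wall-clock speedup would instead suggest that the calls are issued in parallel and are being serialized inside the driver.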

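If you are still stuck, a poor man's hack that extends the solution in the question you referenced might apply. For some drivers, clCreateProgramWithBinary is much faster than building from source. So: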

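1. Fork new processes (or launch a helper executable that uses the same set of devices).
2. Each child process calls clCreateProgramWithSource and then clBuildProgram.
3. The children call clGetProgramInfo(...CL_PROGRAM_BINARIES...) to get the binaries, then pass them back through files, pipes, or some other form of inter-process communication.
4. The parent loads each returned binary with clCreateProgramWithBinary and calls clBuildProgram on it, which is the step that is much faster on such drivers.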

Again, I would double-check your setup code before gluing this hack together.
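The process plumbing for this hack can be sketched as follows. This is only an illustration for POSIX systems, and it makes loud assumptions: buildBinaryInChild is a hypothetical placeholder that merely copies the source bytes, standing in for the real child-side sequence (create a context, clCreateProgramWithSource, clBuildProgram, clGetProgramInfo with CL_PROGRAM_BINARIES), and buildInChildProcesses is likewise an invented name.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// HYPOTHETICAL placeholder: a real child would compile `source` with OpenCL
// and return the device binary. Here it just echoes the source bytes.
std::vector<unsigned char> buildBinaryInChild(const std::string& source) {
    return std::vector<unsigned char>(source.begin(), source.end());
}

struct Child {
    pid_t pid;
    int readFd;
};

// Forks one child per source, lets each child "build" its program, and
// collects the resulting bytes in the parent through one pipe per child.
std::vector<std::vector<unsigned char>> buildInChildProcesses(
        const std::vector<std::string>& sources) {
    std::vector<Child> children;
    for (const auto& src : sources) {
        int fds[2];
        if (pipe(fds) != 0) break;
        pid_t pid = fork();
        if (pid < 0) {
            close(fds[0]);
            close(fds[1]);
            break;
        }
        if (pid == 0) {
            // Child: build, write the binary to the pipe, then exit.
            close(fds[0]);
            std::vector<unsigned char> bin = buildBinaryInChild(src);
            std::size_t off = 0;
            while (off < bin.size()) {
                ssize_t n = write(fds[1], bin.data() + off, bin.size() - off);
                if (n <= 0) _exit(1);
                off += static_cast<std::size_t>(n);
            }
            _exit(0);
        }
        // Parent: keep only the read end so EOF arrives when the child exits.
        close(fds[1]);
        children.push_back({pid, fds[0]});
    }

    std::vector<std::vector<unsigned char>> binaries;
    for (const Child& c : children) {
        std::vector<unsigned char> bin;
        unsigned char buf[4096];
        ssize_t n;
        while ((n = read(c.readFd, buf, sizeof buf)) > 0) {
            bin.insert(bin.end(), buf, buf + n);
        }
        close(c.readFd);
        waitpid(c.pid, nullptr, 0);
        binaries.push_back(std::move(bin));
    }
    return binaries;
}
```

All children are started before the parent reads anything, so the builds themselves can overlap; only the collection of results is sequential. In a real program the parent would hand each collected buffer to clCreateProgramWithBinary. Launching a separate helper executable, as the answer also suggests, may be the safer variant if the parent has already initialized OpenCL, since driver state does not necessarily survive fork().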
