Improving loop performance through parallelization

Question

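I am trying to wrap my head around Julia's parallelization options. I am modelling a stochastic process as Markov chains. Since the chains are independent replicates, the outer loop is independent, which makes the problem embarrassingly parallel. I tried to implement both a @distributed and a @threads solution; both seem to run fine, but neither is any faster than the sequential version.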

Here is a simplified version of my code (sequential):

function dummy(steps = 10000, width = 100, chains = 4)
    out_N = zeros(steps, width, chains)
    initial = zeros(width)
    for c = 1:chains
        # print("c=$c\n")
        N = zeros(steps, width)
        state = copy(initial)
        N[1,:] = state
        for i = 1:steps
            state = state + rand(width)
            N[i,:] = state
        end
        out_N[:,:,c] = N
    end
    return out_N
end

What is the proper way to parallelize this problem to improve performance?

Tags: multithreading, performance, loops, parallel-processing, julia

Answer


This is the correct way to do it (at the time of writing this answer, the other answer does not work; see my comments).

I will use an example that is slightly simpler than the one in the question (but very similar).

1. Non-parallelized version (baseline scenario)

using Random
const m = MersenneTwister(0);

function dothestuff!(out_N, N, ic, m)
    out_N[:, ic] .= rand(m, N)
end

function dummy_base(m=m, N=100_000,c=256)
    out_N = Array{Float64}(undef,N,c)
    for ic in 1:c
        dothestuff!(out_N, N, ic, m)
    end
    out_N 
end

Testing:

julia> using BenchmarkTools; @btime dummy_base();
  106.512 ms (514 allocations: 390.64 MiB)
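
A side note on the allocation figure: the output array alone is 100_000 × 256 × 8 bytes, about 195 MiB, so roughly half of the reported 390.64 MiB appears to come from the temporary vectors created by rand(m, N). A minimal sketch of filling each column in place with rand! on a view is shown below. The names dothestuff_inplace! and dummy_base_inplace are made up for illustration and are not part of the original answer. The random stream may differ from the one rand(m, N) produces.

```julia
using Random

# Hypothetical variant of dothestuff!: writes random numbers straight into
# column `ic` of `out_N`, so no temporary vector of length N is allocated.
function dothestuff_inplace!(out_N, ic, m)
    rand!(m, view(out_N, :, ic))
    return nothing
end

# Hypothetical variant of dummy_base that uses the in-place helper.
function dummy_base_inplace(m = MersenneTwister(0), N = 100_000, c = 256)
    out_N = Array{Float64}(undef, N, c)
    for ic in 1:c
        dothestuff_inplace!(out_N, ic, m)
    end
    return out_N
end
```

Running @btime dummy_base_inplace() next to @btime dummy_base() should show whether the saved allocations matter on a given machine.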

2. Parallelizing with threads

#remember to run before starting Julia:
# set JULIA_NUM_THREADS=4
# OR (Linux)
# export JULIA_NUM_THREADS=4

using Random

const mt = MersenneTwister.(1:Threads.nthreads());
# required for older Julia versions, still looks good in later versions :-)

function dothestuff!(out_N, N, ic, m)
    out_N[:, ic] .= rand(m, N)
end
function dummy_threads(mt=mt, N=100_000,c=256)
    out_N = Array{Float64}(undef,N,c)
    Threads.@threads for ic in 1:c
        dothestuff!(out_N, N, ic, mt[Threads.threadid()])
    end
    out_N 
end

Let's test the performance:

julia> using BenchmarkTools; @btime dummy_threads();
  46.775 ms (535 allocations: 390.65 MiB)
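
The same pattern can be carried over to the Markov-chain loop from the question. The sketch below is one possible adaptation, not the answerer's code: run_chain! and dummy_threaded are hypothetical names, and the seeding scheme (seed + c) is an assumption chosen only to make runs reproducible. It creates one RNG per chain rather than one per thread. As far as I know, newer Julia documentation advises against indexing state by Threads.threadid() inside @threads loops because tasks may migrate between threads, and a per-chain RNG avoids that question entirely.

```julia
using Random

# Hypothetical helper: simulate one chain, writing every step into N
# (a steps × width matrix or a view of one).
function run_chain!(N, rng)
    steps, width = size(N)
    state = zeros(width)
    increment = similar(state)
    for i in 1:steps
        rand!(rng, increment)   # reuse the buffer instead of allocating
        state .+= increment
        N[i, :] .= state
    end
    return N
end

# Hypothetical threaded version of `dummy` from the question.
function dummy_threaded(steps = 10_000, width = 100, chains = 4; seed = 0)
    out_N = zeros(steps, width, chains)
    Threads.@threads for c in 1:chains
        rng = MersenneTwister(seed + c)        # one RNG per chain (assumed seeding)
        run_chain!(view(out_N, :, :, c), rng)  # write directly into the output
    end
    return out_N
end
```

With only 4 chains, at most 4 threads can be kept busy, so the achievable speed-up is bounded by the number of chains as well as by JULIA_NUM_THREADS.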

3. Parallelizing with processes (on a single machine)

using Distributed

addprocs(4) 

using Random, SharedArrays
@everywhere using Random, SharedArrays, Distributed
@everywhere Random.seed!(myid())

@everywhere function dothestuff!(out_N, N, ic)
    out_N[:, ic] .= rand(N)
end
function dummy_distr(N=100_000,c=256)
    out_N = SharedArray{Float64}(N,c)
    @sync @distributed for ic in 1:c
        dothestuff!(out_N, N, ic)
    end
    out_N 
end

Performance (note that interprocess communication takes some time, so for small computations threads will usually be better):

julia> using BenchmarkTools; @btime dummy_distr();
  62.584 ms (1073 allocations: 45.48 KiB)
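
For completeness, here is a sketch of how the process-based approach might be applied to the three-dimensional array from the question. It is an illustration under assumptions rather than a tested recommendation: run_chain_shared! and dummy_distr_chains are hypothetical names, the number of workers (4) is arbitrary, and seeding each chain with its index is just one way to keep the chains on separate random streams.

```julia
using Distributed

addprocs(4)  # assumption: 4 local worker processes

@everywhere using Random, SharedArrays

# Hypothetical helper, defined on all processes: simulate chain `c`
# and write it into the shared output array.
@everywhere function run_chain_shared!(out_N, c, rng)
    steps, width, _ = size(out_N)
    state = zeros(width)
    for i in 1:steps
        state .+= rand(rng, width)
        out_N[i, :, c] .= state
    end
    return nothing
end

# Hypothetical distributed version of `dummy` from the question.
function dummy_distr_chains(steps = 10_000, width = 100, chains = 4)
    out_N = SharedArray{Float64}(steps, width, chains)
    @sync @distributed for c in 1:chains
        run_chain_shared!(out_N, c, MersenneTwister(c))  # assumed seeding: one RNG per chain
    end
    return out_N
end
```

Because a SharedArray is backed by shared memory, the workers write their results without sending the chain data back to the main process. This only works when all processes run on the same machine.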
