optimization - Optimize highly vectorized code in Julia?
Problem description
I want to optimize the following Julia code. It is written in a highly vectorized form, at which languages like MATLAB excel. The exact same code in MATLAB reports "Elapsed time is 0.277608 seconds.", which is 2.8x faster, so I thought something could be done in Julia. To be fair, I noticed that MATLAB uses multithreading by default, so it is fine if the Julia version uses multithreading too. Thank you for your help.
function fit_xlin(x, y, w)
    n = length(x)
    regularization = 1.0e-5
    xx_0_0 = fill(sum(w .* 1), n)
    xx_1_0 = fill(sum(w .* x), n)
    xx_0_1 = fill(sum(w .* x), n)
    xx_1_1 = fill(sum(w .* x .* x), n)
    xy_0 = fill(sum(w .* y), n)
    xy_1 = fill(sum(w .* x .* y), n)
    xx_1_0 .+= regularization
    xx_0_1 .+= regularization
    xxk_0_0 = xx_0_0 .- w .* 1
    xxk_1_0 = xx_1_0 .- w .* x
    xxk_0_1 = xx_0_1 .- w .* x
    xxk_1_1 = xx_1_1 .- w .* x .* x
    xyk_0 = xy_0 .- w .* y
    xyk_1 = xy_1 .- w .* x .* y
    det = xxk_0_0 .* xxk_1_1 .- xxk_0_1 .* xxk_1_0
    c0 = (xxk_1_1 .* xyk_0 .- xxk_0_1 .* xyk_1) ./ det
    c1 = (-xxk_1_0 .* xyk_0 .+ xxk_0_0 .* xyk_1) ./ det
    y_est = c0 .+ c1 .* x
end
using BenchmarkTools
function test_xlin()
    x = rand(0.0:4.0, 5_000_000)
    y = rand(0.0:4.0, 5_000_000)
    w = rand(0.0:4.0, 5_000_000)
    @btime fit_xlin($x, $y, $w)
end
This times as:
julia> test_xlin();
775.292 ms (46 allocations: 877.38 MiB)
Solution
Here's a version that is still vectorized and does not use multithreading. On my computer it's a bit more than twice as fast as the original, but there are still things that can be done here.
BTW, I seriously doubt that the Matlab code is particularly optimized, since there are some very wasteful operations going on: unnecessary allocations and unnecessary operations. sum(w.*1) is really bad; multiplying an array by 1 and allocating an extra array in the process is horribly wasteful :) Furthermore, you do not need to allocate any of the vectors xx_0_0, xx_1_0 in Matlab either; you can just use broadcasting like in Julia.
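To make the point about sum(w.*1) concrete, here is a minimal sketch. The small arrays are made up for illustration, and `dot` from the LinearAlgebra standard library is used as an equivalent of `w' * x`. It only shows that the allocating and non-allocating forms give the same sums; it does not measure anything.

```julia
using LinearAlgebra  # provides dot

# Illustrative data (not from the question).
w = [1.0, 2.0, 3.0]
x = [0.5, 1.5, 2.5]

# Allocating forms: each `.*` builds a temporary array before it is summed.
s_w_alloc  = sum(w .* 1)
s_wx_alloc = sum(w .* x)

# Equivalent forms that do not build a temporary array.
s_w   = sum(w)                                     # multiplying by 1 is unnecessary
s_wx  = dot(w, x)                                  # same value as w' * x
s_wxx = sum(wi * xi^2 for (wi, xi) in zip(w, x))   # generator, summed lazily
```

The generator form is what the code in this answer uses for the sums that are not plain dot products.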
Anyway, here's my first attempt:
function fit_xlin2(x, y, w)
    regularization = 1.0e-5
    sumwx = (w' * x) + regularization
    sumwy = (w' * y)
    sumwxx = sum(a[1]*a[2]^2 for a in zip(w, x))
    sumwxy = sum(prod, zip(w, x, y))
    wx = w .* x
    xxk_0_0 = sum(w) .- w
    xxk_1_0 = sumwx .- wx
    xxk_1_1 = sumwxx .- wx .* x
    xyk_0 = sumwy .- w .* y
    xyk_1 = sumwxy .- wx .* y
    det = xxk_0_0 .* xxk_1_1 .- xxk_1_0 .* xxk_1_0
    c0 = (xxk_1_1 .* xyk_0 .- xxk_1_0 .* xyk_1) ./ det
    c1 = (-xxk_1_0 .* xyk_0 .+ xxk_0_0 .* xyk_1) ./ det
    return c0 .+ c1 .* x
end
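One thing that might still be done while keeping the vectorized style is to broadcast a single scalar function over the inputs, so that only the output array is allocated instead of one array per intermediate. The following is a sketch under that assumption, not a benchmarked result; the names `fit_xlin_fused` and `kernel` are hypothetical and do not appear in the original code.

```julia
using LinearAlgebra  # provides dot

# Hypothetical variant: same arithmetic as fit_xlin2, expressed as one
# broadcast of a scalar kernel.
function fit_xlin_fused(x, y, w)
    regularization = 1.0e-5
    sumw   = sum(w)
    sumwx  = dot(w, x) + regularization
    sumwy  = dot(w, y)
    sumwxx = sum(wi * xi^2 for (wi, xi) in zip(w, x))
    sumwxy = sum(prod, zip(w, x, y))

    # Estimate for one point, with that point's own terms removed from the sums.
    function kernel(xi, yi, wi)
        wx = wi * xi
        xxk_0_0 = sumw - wi
        xxk_1_0 = sumwx - wx
        xxk_1_1 = sumwxx - wx * xi
        xyk_0 = sumwy - wi * yi
        xyk_1 = sumwxy - wx * yi
        det = xxk_0_0 * xxk_1_1 - xxk_1_0 * xxk_1_0
        c0 = (xxk_1_1 * xyk_0 - xxk_1_0 * xyk_1) / det
        c1 = (-xxk_1_0 * xyk_0 + xxk_0_0 * xyk_1) / det
        return c0 + c1 * xi
    end

    return kernel.(x, y, w)  # single broadcast pass, one output allocation
end
```

It can be called exactly like fit_xlin2, for example `fit_xlin_fused(x, y, w)` with three equally sized vectors.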
Edit: You can get a fair bit of speed-up by devectorizing the main loop. This code is ~17 times faster than the original Julia code, still single-threaded, and quite readable:
function fit_xlin_loop(x, y, w)
    if !(size(x) == size(y) == size(w))
        error("Input vectors must have the same size.")
    end
    regularization = 1.0e-5
    sumw = sum(w)
    sumwx = (w' * x) + regularization
    sumwy = (w' * y)
    sumwxx = sum(a[1]*a[2]^2 for a in zip(w, x))
    sumwxy = sum(prod, zip(w, x, y))
    y_est = similar(x)
    @inbounds for i in eachindex(y_est)
        wx = w[i] * x[i]
        xxk_0_0 = sumw - w[i]
        xxk_1_0 = sumwx - wx
        xxk_1_1 = sumwxx - wx * x[i]
        xyk_0 = sumwy - w[i] * y[i]
        xyk_1 = sumwxy - wx * y[i]
        det = xxk_0_0 * xxk_1_1 - xxk_1_0 * xxk_1_0
        c0 = (xxk_1_1 * xyk_0 - xxk_1_0 * xyk_1) / det
        c1 = (-xxk_1_0 * xyk_0 + xxk_0_0 * xyk_1) / det
        y_est[i] = c0 + c1 * x[i]
    end
    return y_est
end
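Since the question says multithreading would be acceptable, here is a sketch of how the loop above could be threaded. Each iteration writes only its own y_est[i], so the iterations are independent. The function name `fit_xlin_threaded` is hypothetical, `dot` stands in for `w' * x`, and I have not timed this; the loop does little arithmetic per element, so any gain may be limited by memory bandwidth.

```julia
using LinearAlgebra  # provides dot

# Hypothetical threaded variant of fit_xlin_loop (not part of the original answer).
function fit_xlin_threaded(x, y, w)
    size(x) == size(y) == size(w) ||
        error("Input vectors must have the same size.")
    regularization = 1.0e-5
    sumw   = sum(w)
    sumwx  = dot(w, x) + regularization
    sumwy  = dot(w, y)
    sumwxx = sum(wi * xi^2 for (wi, xi) in zip(w, x))
    sumwxy = sum(prod, zip(w, x, y))
    y_est = similar(x)
    # Iterations are split across the threads Julia was started with.
    Threads.@threads for i in eachindex(y_est)
        @inbounds begin
            wx = w[i] * x[i]
            xxk_0_0 = sumw - w[i]
            xxk_1_0 = sumwx - wx
            xxk_1_1 = sumwxx - wx * x[i]
            xyk_0 = sumwy - w[i] * y[i]
            xyk_1 = sumwxy - wx * y[i]
            det = xxk_0_0 * xxk_1_1 - xxk_1_0 * xxk_1_0
            c0 = (xxk_1_1 * xyk_0 - xxk_1_0 * xyk_1) / det
            c1 = (-xxk_1_0 * xyk_0 + xxk_0_0 * xyk_1) / det
            y_est[i] = c0 + c1 * x[i]
        end
    end
    return y_est
end
```

The loop only runs in parallel if Julia is started with more than one thread, for example with `julia --threads=4`; `Threads.nthreads()` reports how many are available. With a single thread the function still runs and returns the same values.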