StatsBase.sample() cannot draw without replacement when FrequencyWeights() is provided

Problem description

I am trying to sample without replacement using StatsBase.sample() in Julia. Because my data comes in the following form, I can use the counts as FrequencyWeights():

using StatsBase

data   = ["red", "blue", "green"]
counts = [2000, 2000, 1]

balls  = StatsBase.sample(data, FrequencyWeights(counts), 1000)

One problem with this is that StatsBase.sample() implicitly sets replace=true, so the following result is possible:

countmap(balls)
Dict("blue"  => 478,
     "green" => 2,  # <= two green balls?
     "red"   => 520)

Explicitly setting replace=false raises an error:

balls  = StatsBase.sample(data, FrequencyWeights(counts), 1000, replace=false)

Cannot draw 3 samples from 1000 samples without replacement.

error(::String)@error.jl:33
var"#sample!#174"(::Bool, ::Bool, ::typeof(StatsBase.sample!), ::Random._GLOBAL_RNG, ::Vector{String}, ::StatsBase.FrequencyWeights{Int64, Int64, Vector{Int64}}, ::Vector{String})@sampling.jl:858
#sample#175@sampling.jl:871[inlined]
#sample#176@sampling.jl:874[inlined]
top-level scope@Local: 2[inlined]

Is reformatting my data into a wide form like the following my only option? That seems very inefficient, because my actual dataset has many counts:

wide_data = [fill("red", 2000)..., fill("blue", 2000)..., "green"]
sample(wide_data, 1000, replace=false)

Tags: random, julia

Solution

You can use something like this:

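using StatsBase

# Draw n items without replacement, treating counts[i] as the number of copies of data[i].
function mysample(data::AbstractVector, counts::AbstractVector, n::Integer)
    @assert n <= sum(counts)              # cannot draw more items than exist
    @assert length(data) == length(counts)
    @assert firstindex(data) == 1
    @assert firstindex(counts) == 1
    res = similar(data, n)
    fw = FrequencyWeights(copy(counts))   # copy so the caller's counts stay untouched
    for i in 1:n
        j = sample(axes(data, 1), fw)     # pick one index, weighted by the remaining counts
        res[i] = data[j]
        # Remove the drawn item from the pool. This relies on FrequencyWeights
        # being mutable with fields `values` and `sum`.
        fw.sum -= 1
        fw.values[j] -= 1
    end
    return res
end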
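
As far as I can tell, the error in the question arises because, with replace=false, sample() treats each element of data as a single item and uses the weights only as selection probabilities, so it cannot return more than length(data) values. If the per-draw loop above turns out to be slow for large n, one alternative is to draw distinct positions from the virtual wide vector 1:sum(counts) and map each position back to its category through the cumulative counts. The following is only a minimal sketch under that assumption; sample_by_index is a hypothetical name and not part of StatsBase.

```julia
using StatsBase

# Hypothetical helper (not part of StatsBase): draw `n` items without replacement,
# treating counts[i] as the number of copies of data[i].
function sample_by_index(data::AbstractVector, counts::AbstractVector{<:Integer}, n::Integer)
    @assert length(data) == length(counts)
    @assert firstindex(data) == 1 && firstindex(counts) == 1
    @assert all(>=(0), counts)
    cum = cumsum(counts)                  # cum[i] = number of items in categories 1..i
    total = isempty(cum) ? 0 : last(cum)
    @assert 0 <= n <= total
    # Draw n distinct positions in the virtual wide vector 1:total.
    positions = sample(1:total, n; replace=false)
    # Map each position to its category: the first i with cum[i] >= position.
    return [data[searchsortedfirst(cum, p)] for p in positions]
end

data   = ["red", "blue", "green"]
counts = [2000, 2000, 1]

balls = sample_by_index(data, counts, 1000)
tally = countmap(balls)   # "green" can appear at most once here
```

The wide vector is never materialized: only the cumulative sums and the n sampled positions are stored, and each position is resolved with a binary search. Categories with a count of zero are never selected, because no position maps to them.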
