In CUDA, is it possible to write a dense array from a sparse array while keeping the expected order?

Problem description

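array1 holds 0 or 1 values (one array per thread block):

bool array1[]: [1, 1, 0, 0, 1, 1]

Each thread in the block reads its own element of array1 using threadIdx.x.

From that, I need to build a dense shared array, array2, in which each value is the ID of a thread whose array1 entry is 1:

__shared__ bool array2[] (thread ID) : [0, 1, 4, 5]

It seems that, at the very least, I need atomicAdd() on the index into array2.

Even with atomicAdd(), I think it is hard to get array2 in the order shown above (0, 1, 4, 5).

Is it possible to build array2 from array1 in CUDA (for each thread block)?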

Tags: cuda

Solution


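You can use coalesced groups from the Cooperative Groups API.

Assume the boolean value each thread reads from array1 is threadIsIn.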

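#include <cooperative_groups.h>
namespace cg = cooperative_groups;

uint32_t tid = threadIdx.x;
const uint32_t warpLength = 32;
uint32_t warpIdx = tid / warpLength;
if (threadIsIn){
   // Group of the threads of this warp that are active in this branch
   auto active = cg::coalesced_threads();
   // thread_rank() numbers the active threads in lane order; each warp fills
   // its own 32-entry segment of array2, so the result is dense only within a warp
   uint32_t idx = active.thread_rank() + warpIdx * warpLength;
   array2[idx] = tid;
}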
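To make the single-warp case concrete, here is a minimal, self-contained sketch of the same idea. It assumes one block of exactly 32 threads. The names compactOneWarpKernel, compactOneWarp, out and outCount are hypothetical and not part of the original answer, array2 is declared as uint32_t so it can hold thread IDs, and error checking is omitted for brevity.

```cuda
#include <cooperative_groups.h>
#include <cstdint>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

// Assumption: launched with a single block of exactly 32 threads (one warp).
__global__ void compactOneWarpKernel(const bool* array1, uint32_t* out, uint32_t* outCount)
{
    __shared__ uint32_t array2[32];

    uint32_t tid = threadIdx.x;
    bool threadIsIn = array1[tid];

    if (threadIsIn) {
        // Group of the threads that are active inside this branch
        cg::coalesced_group active = cg::coalesced_threads();
        // thread_rank() numbers the active threads in lane order
        array2[active.thread_rank()] = tid;
    }

    // Block-wide barrier that also returns how many threads had threadIsIn set
    uint32_t count = static_cast<uint32_t>(__syncthreads_count(threadIsIn));

    // Copy the dense result to global memory so the host can inspect it
    if (tid < count) out[tid] = array2[tid];
    if (tid == 0) *outCount = count;
}

// Hypothetical host helper: hostArray1 must point to 32 flags.
std::vector<uint32_t> compactOneWarp(const bool* hostArray1)
{
    bool* dArray1 = nullptr;
    uint32_t* dOut = nullptr;
    uint32_t* dCount = nullptr;
    cudaMalloc(&dArray1, 32 * sizeof(bool));
    cudaMalloc(&dOut, 32 * sizeof(uint32_t));
    cudaMalloc(&dCount, sizeof(uint32_t));
    cudaMemcpy(dArray1, hostArray1, 32 * sizeof(bool), cudaMemcpyHostToDevice);

    compactOneWarpKernel<<<1, 32>>>(dArray1, dOut, dCount);

    uint32_t count = 0;
    cudaMemcpy(&count, dCount, sizeof(uint32_t), cudaMemcpyDeviceToHost);
    std::vector<uint32_t> result(count);
    if (count > 0) {
        cudaMemcpy(result.data(), dOut, count * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    }

    cudaFree(dArray1);
    cudaFree(dOut);
    cudaFree(dCount);
    return result;
}

int main()
{
    // Flags from the question; the remaining 26 entries are false
    bool array1[32] = {true, true, false, false, true, true};
    for (uint32_t id : compactOneWarp(array1)) printf("%u ", id);
    printf("\n"); // prints: 0 1 4 5
    return 0;
}
```

No atomicAdd() is needed here: the rank inside the coalesced group already gives each flagged thread its slot, and the slots follow lane order.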

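Edit

A solution for blocks with more than one warp: the first warp of the block prepares the shared array for the remaining warps, which means the other warps wait until the first warp has finished.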

cg::thread_block block = cg::this_thread_block();
uint32_t tid = threadIdx.x;
const uint32_t warpLength = 32;
uint32_t warpIdx = tid / warpLength;
uint32_t startIdx = 0;     // next free slot in array2
uint32_t tidToWrite = tid; // thread ID this lane is currently handling
uint32_t arr1Idx = tid;    // index into arr1 (array1 in the question) for tidToWrite
uint32_t maxItr = blockSize / warpLength; // blockSize = threads per block, assumed to be a multiple of 32
uint32_t itr = 0;
// threadIsIn is assumed to hold arr1[tid] on entry
while (warpIdx == 0 && itr < maxItr){
    auto warp = cg::coalesced_threads();
    auto warpMask = warp.ballot(threadIsIn); // bit i is set if threadIsIn is true for lane i
    uint32_t trueThreadsSize = __popc(warpMask); // counts the number of bits that are set to 1
    if(threadIsIn){
        auto active = cg::coalesced_threads();
        // active.size() has the same value as trueThreadsSize
        array2[startIdx + active.thread_rank()] = tidToWrite;
    }
    startIdx += trueThreadsSize;
    tidToWrite += warpLength;
    ++itr;
    arr1Idx += warpLength;
    if (itr < maxItr) threadIsIn = arr1[arr1Idx]; // skip the read past the end of arr1 after the last chunk
}
block.sync();
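As a point of comparison, the sketch below shows a different technique from the one above, in which every warp takes part instead of waiting for the first one: each warp takes a ballot, the per-warp counts are stored in shared memory, and each warp then offsets its writes by the counts of the warps before it. It assumes a single block whose size is a multiple of 32 and at most 1024. The names compactBlockKernel, compactBlock, warpCounts, out and outCount are hypothetical, and error checking is omitted.

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint32_t kWarpSize = 32;
constexpr uint32_t kMaxWarps = 32; // 1024 threads per block / 32

// Assumption: one block, blockDim.x a multiple of 32, and
// blockDim.x * sizeof(uint32_t) bytes of dynamic shared memory.
__global__ void compactBlockKernel(const bool* array1, uint32_t* out, uint32_t* outCount)
{
    extern __shared__ uint32_t array2[];
    __shared__ uint32_t warpCounts[kMaxWarps];

    uint32_t tid = threadIdx.x;
    uint32_t lane = tid % kWarpSize;
    uint32_t warpIdx = tid / kWarpSize;
    uint32_t numWarps = blockDim.x / kWarpSize;
    bool threadIsIn = array1[tid];

    // Bit i of mask is set if lane i of this warp has threadIsIn set
    uint32_t mask = __ballot_sync(0xffffffffu, threadIsIn);
    if (lane == 0) warpCounts[warpIdx] = __popc(mask);
    __syncthreads();

    // Number of flagged threads in all warps before this one
    uint32_t warpOffset = 0;
    for (uint32_t w = 0; w < warpIdx; ++w) warpOffset += warpCounts[w];

    if (threadIsIn) {
        // Number of flagged lanes below this lane in the same warp
        uint32_t rankInWarp = __popc(mask & ((1u << lane) - 1u));
        array2[warpOffset + rankInWarp] = tid;
    }
    __syncthreads();

    // Copy the dense result to global memory so the host can inspect it
    uint32_t total = 0;
    for (uint32_t w = 0; w < numWarps; ++w) total += warpCounts[w];
    if (tid < total) out[tid] = array2[tid];
    if (tid == 0) *outCount = total;
}

// Hypothetical host helper: n flags, n a multiple of 32 and at most 1024.
std::vector<uint32_t> compactBlock(const bool* hostArray1, uint32_t n)
{
    bool* dArray1 = nullptr;
    uint32_t* dOut = nullptr;
    uint32_t* dCount = nullptr;
    cudaMalloc(&dArray1, n * sizeof(bool));
    cudaMalloc(&dOut, n * sizeof(uint32_t));
    cudaMalloc(&dCount, sizeof(uint32_t));
    cudaMemcpy(dArray1, hostArray1, n * sizeof(bool), cudaMemcpyHostToDevice);

    compactBlockKernel<<<1, n, n * sizeof(uint32_t)>>>(dArray1, dOut, dCount);

    uint32_t count = 0;
    cudaMemcpy(&count, dCount, sizeof(uint32_t), cudaMemcpyDeviceToHost);
    std::vector<uint32_t> result(count);
    if (count > 0) {
        cudaMemcpy(result.data(), dOut, count * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    }

    cudaFree(dArray1);
    cudaFree(dOut);
    cudaFree(dCount);
    return result;
}

int main()
{
    bool array1[64] = {}; // two warps, all flags initially false
    array1[0] = array1[1] = array1[4] = array1[5] = true; // first warp
    array1[33] = array1[40] = array1[63] = true;          // second warp
    for (uint32_t id : compactBlock(array1, 64)) printf("%u ", id);
    printf("\n"); // prints: 0 1 4 5 33 40 63
    return 0;
}
```

The loop over warpCounts is a plain serial sum, which is reasonable for at most 32 warps; a warp-level scan could replace it if that cost ever matters.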
