Should I dynamically allocate memory for NumPy arrays that are created at runtime in Cython?

Question

This runs smoothly and quickly:

import gzip
import numpy as np
cimport numpy as np

solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32) 

cdef bytes line
cdef str decoded_line
cdef int counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
with gzip.open(file_in, "rb") as f:
    for line in f:
    
        if counter%4==0: # first line of the sequence (obtain tile info)
            counter=0
    
        elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)): #     enumerate(line.decode('utf-8')):
                sums[n, ord(decoded_line[n])] +=1
                
        counter+=1

Here the NumPy ndarray sums holds the results.
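For readers unfamiliar with the layout of sums, here is a small pure-Python sketch of the same counting scheme. It assumes, as the snippet above suggests, that the column index is simply the ASCII code of each quality character; the read length and the sample quality string are made up for illustration, and a nested list stands in for the NumPy array.

```python
# Pure-Python stand-in for the NumPy `sums` array (illustration only).
solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

length = 5  # hypothetical read length
width = len(solexa_scores) + 33  # 41 symbols + offset of 33 = 74 columns

# sums[n][c] counts how often the character with ASCII code c occurs at position n.
sums = [[0] * width for _ in range(length)]

quality_line = "II5!#"  # made-up quality string, newline already stripped
for n, ch in enumerate(quality_line):
    sums[n][ord(ch)] += 1

# The highest code in solexa_scores is ord('I') == 73, so it still fits in 74 columns.
highest_code = max(ord(ch) for ch in solexa_scores)
```

Because the characters are indexed by their raw ASCII code, the first 33 columns are never touched for these quality symbols; the extra width only keeps the indexing free of an offset subtraction.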

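However, instead of a single NumPy array, I need an unknown number of arrays stored in a dictionary (named tiles). This is the code that should achieve my goal: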

solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef dict tiles = {} # each tile will have its own 'sums' numpy array

cdef bytes line
cdef str decoded_line
cdef str tile

cdef int counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
with gzip.open(file_in, "rb") as f:
    for line in f:

        if counter%4==0: # first line of the sequence (obtain tile info)
            decoded_line = line.decode('utf-8')
            tile = decoded_line.split(':')[4]
            if tile != tile_specific and tile not in tiles: # tile_specific is mentioned elsewhere.
                tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)

            counter=0

        elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)): #     enumerate(line.decode('utf-8')):
                tiles[tile][n, ord(decoded_line[n])] +=1
                
        counter+=1

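In the second example I do not know in advance how many keys the dictionary tiles will have, so the NumPy arrays are declared and initialized at runtime (please correct me if I am wrong or using the wrong terminology). When I used a Cython declaration for the NumPy array, Cython failed to translate/compile the code, so I left it as tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32). Since all the other Cython optimizations in the code shared between the two snippets work fine, I believe this NumPy array declaration is the problem.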

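How should I solve this? The manual here describes ways to allocate memory dynamically, but I do not know how that would work with NumPy arrays, or whether I should do it at all.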

Thanks!

Tags: python, cython

Answer


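I would ignore the documentation about dynamically allocating memory. That is not what you want to do: it is very much at the C level, and you are dealing with Python objects.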

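You can easily reassign a variable that is typed as a NumPy array (or, equally, as one of the newer typed memoryviews) as many times as you like, so that it refers to a different NumPy array each time. I suspect what you want is something like: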

# start of function
cdef np.ndarray[np.uint32_t, ndim=2] tile_array

# in "if counter%4==0":
if tile != tile_specific and tile not in tiles: # tile_specific is mentioned elsewhere.
    tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
tile_array = tiles[tile]  # not a copy! Just two references to exactly the same object

# in "if counter%3==0"
tile_array[n, ord(decoded_line[n])] +=1

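The assignment tile_array = tiles[tile] has a small cost (some type checking), so it is probably only worthwhile if you use tile_array several times between each assignment. It is hard to guess exactly where the threshold lies, so time it against your current version.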
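To illustrate the "not a copy" point and the two access patterns outside Cython, here is a minimal pure-Python sketch. Nested lists stand in for the NumPy arrays, and the helper names (make_tiles, count_with_lookup, count_with_local), the tile names, and the quality string are all hypothetical. The timings it prints say nothing about the compiled Cython version, which has to be measured separately.

```python
import timeit

def make_tiles(names, length, width):
    """Build one zeroed length x width table per tile name."""
    return {name: [[0] * width for _ in range(length)] for name in names}

def count_with_lookup(tiles, tile, quality):
    # Looks the table up in the dict for every character,
    # like tiles[tile][n, c] += 1 in the question.
    for n, ch in enumerate(quality):
        tiles[tile][n][ord(ch)] += 1

def count_with_local(tiles, tile, quality):
    # One dict lookup per record; tile_array is a second reference, not a copy.
    tile_array = tiles[tile]
    for n, ch in enumerate(quality):
        tile_array[n][ord(ch)] += 1

quality = "IIIIHHGG5!"  # made-up quality string
names = ["1101", "1102"]  # made-up tile names
a = make_tiles(names, len(quality), 74)
b = make_tiles(names, len(quality), 74)

count_with_lookup(a, "1101", quality)
count_with_local(b, "1101", quality)

tile_array = b["1101"]
is_same_object = tile_array is b["1101"]  # both names refer to one table
results_match = a == b                    # both variants produce the same counts

# Rough timing on a scratch dict so the results above stay untouched.
scratch = make_tiles(names, len(quality), 74)
for fn in (count_with_lookup, count_with_local):
    seconds = timeit.timeit(lambda: fn(scratch, "1101", quality), number=2000)
    print(f"{fn.__name__}: {seconds:.4f} s")
```

Updates made through tile_array show up in the dictionary because both names point at the same object, which is the same reason the typed tile_array variable in the Cython version can be rebound for each new tile without losing any counts.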

