Multiple errors during HDF5 to CSV conversion

Problem description

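I have a huge h5 file, and I need to extract each dataset into a separate CSV file. The schema is like /Genotypes/GroupN/SubGroupN/calls, with N groups and N subgroups. I created a sample h5 file with the same structure as the main file and tested the code, which worked correctly, but when I apply the code to my main h5 file it runs into various errors. The schema of the HDF5 file: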

/Genotypes
    /genotype a
        /genotype a_1 #one subgroup for each genotype group
            /calls #data that I need to extract to csv file
            depth #data
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
            /calls #data
            depth #data
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1
            /calls 
            depth

/Positions
    /allel #data 
    chromo #data#
/Taxa 
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1

/_Data-Types_
    Enum_Boolean
    String_VariableLength

This is the code that creates the sample h5 file:

import h5py
import numpy as np

ngrps = 2    # number of top-level groups
nsgrps = 3   # subgroups per group
nds = 4      # datasets per subgroup
nrows = 10
ncols = 2

i_arr_dtype = [('col1', int), ('col2', int)]  # defined but not used below
with h5py.File('d:/Path/sample_file.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

I used NumPy (with h5py) as follows:

import h5py
import numpy as np

def dump_calls2csv(name, node):    

    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
       print ('visiting object:', node.name, ', exporting data to CSV')
       csvfname = node.name[1:].replace('/','_') +'.csv'
       arr = node[:]
       np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################    

with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :        
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!

I also used PyTables as follows:

import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r :     
    for node in h5r.walk_nodes('/',classname='Leaf') :         
       print ('visiting object:', node._v_pathname, 'export data to CSV')
       csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
       np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')

But I get the errors shown below with each approach:

 C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.
some of the datasets
.
.
.
Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
    return func(name, self[name])
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
    np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'

Process finished with exit code 1

The error from the second code is:

C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
    v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
    np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
    % (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')

Process finished with exit code 1

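Can someone help me solve this problem? Please mention the exact changes I need to apply to the code and provide the complete code, because my background is not in coding. Any further explanation would be great.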

Tags: numpy, hdf5, h5py, pytables

Solution


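This is not a complete answer (yet). I am using it to format my questions about your comments above.
Do you have spaces in your group/dataset names?
If so, I think that is a problem with my simple example. I create each CSV file name from the group/dataset path name, replacing each '/' with '_'. You need to do the same for spaces: replace each ' ' with '-' by adding .replace(' ','-'). Print the csvfname variable to confirm it works as expected (and creates a valid file name).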
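The file name cleanup described above can be sketched as a small helper. The function name dataset_path_to_csv_name is hypothetical. Replacing the characters that Windows rejects in file names is an extra assumption on my part, since the OSError in the question does not say which character was the problem.

```python
import re

def dataset_path_to_csv_name(path):
    """Build a CSV file name from an HDF5 object path (hypothetical helper)."""
    name = path.lstrip('/')                          # drop the leading '/'
    name = name.replace('/', '_').replace(' ', '-')  # the two replacements from the answer
    # Assumption beyond the answer: also replace characters Windows does not allow.
    name = re.sub(r'[<>:"\\|?*]', '-', name)
    return name + '.csv'

# Print the result to confirm it is a valid file name before writing anything.
print(dataset_path_to_csv_name('/Genotypes/genotype a/genotype a_1/calls'))
```

In dump_calls2csv() this would replace the line that builds csvfname, for example csvfname = dataset_path_to_csv_name(node.name).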

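If that is not enough to solve your problem, read on.
As I understand it, /Genotypes/genotype a/genotype a-1/calls is the dataset you want to write to CSV (one for each genotype x/genotype x-i/calls dataset). If so, you probably have a mismatch between the data in the dataset and the format used to write it. Start by printing the dtype in dump_calls2csv(), like this: print(arr.dtype). Comment out the np.savetxt() line until this works. From the error message, I expect you will get "|S1" rather than an integer type, which is a problem because my example writes with an integer format: fmt='%d'. Ideally, you get the dataset/array dtype and then create an fmt= string to match.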
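A minimal sketch of matching fmt to the dtype, assuming NumPy arrays like the ones h5py returns from node[:]. The helper name array_and_fmt and the mapping from dtype kind to format string are my own choices, not something from the question. The demo array only imitates the '|S1' dtype reported in the second traceback, and io.StringIO stands in for the CSV file.

```python
import io
import numpy as np

def array_and_fmt(arr):
    """Return (array, fmt) suitable for np.savetxt, chosen from arr.dtype.kind.

    Hypothetical helper; the dtype-to-format mapping is an assumption.
    """
    kind = arr.dtype.kind
    if kind in 'iu':            # signed or unsigned integers
        return arr, '%d'
    if kind == 'f':             # floating point
        return arr, '%g'
    if kind == 'S':             # fixed-width bytes such as '|S1'
        # Decode to str so the CSV holds A rather than b'A' (assumes ASCII data).
        return arr.astype('U'), '%s'
    if kind == 'U':             # already unicode strings
        return arr, '%s'
    raise TypeError(f'no CSV format chosen for dtype {arr.dtype}')

calls = np.array([[b'A', b'C'], [b'G', b'T']], dtype='S1')
print(calls.dtype)              # inspect the dtype first, as suggested above
data, fmt = array_and_fmt(calls)
buf = io.StringIO()             # stands in for the CSV file in this demo
np.savetxt(buf, data, fmt=fmt, delimiter=',')
print(buf.getvalue())
```

In dump_calls2csv() the same idea would be arr, fmt = array_and_fmt(node[:]) followed by np.savetxt(csvfname, arr, fmt=fmt, delimiter=','). Datasets with other dtypes (compound, enum, variable-length strings) raise TypeError here, so they can be spotted and handled case by case.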

Hope that helps. If not, please update your question with the new information.

