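numpy - Multiple errors during HDF5 to CSV conversion
Problem description
I have a huge h5 file and need to extract each dataset into a separate CSV file. The schema looks like /Genotypes/GroupN/SubGroupN/calls, with N groups and N subgroups. I created a sample h5 file with the same structure as the main file and tested my code on it, which worked correctly, but when I apply the code to my main h5 file it runs into various errors. The schema of the HDF5 file: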
/Genotypes
    /genotype a
        /genotype a_1    #one subgroup for each genotype group
            /calls       #data that I need to extract to csv file
            depth        #data
    /genotype b
        /genotype b_1    #one subgroup for each genotype group
            /calls       #data
            depth        #data
    .
    .
    .
    /genotype n          #1500 genotypes are listed as groups
        /genotype n_1
            /calls
            depth
/Positions
    /allel     #data
    chromo     #data
/Taxa
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1    #one subgroup for each genotype group
    .
    .
    .
    /genotype n          #1500 genotypes are listed as groups
        /genotype n_1
/_Data-Types_
    Enum_Boolean
    String_VariableLength
Here is the code that creates the sample h5 file:
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

i_arr_dtype = ( [ ('col1', int), ('col2', int) ] )
with h5py.File('d:/Path/sample_file.h5', 'w') as h5w :
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_'+str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_'+str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1,100, (nrows,ncols) )
                ds = grp2.create_dataset('calls_'+str(dcnt), data=i_arr)
I then export the datasets with h5py and numpy as follows:
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
        print ('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/','_') +'.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################
with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
I also tried PyTables as follows:
import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r :
    for node in h5r.walk_nodes('/',classname='Leaf') :
        print ('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
        np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')
But I get the errors shown below with each approach. The error from the first script is:
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.
some of the datasets
.
.
.
Traceback (most recent call last):
File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
return h5o.visit(self.id, proxy)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
return func(name, self[name])
File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
File "<__array_function__ internals>", line 6, in savetxt
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'
Process finished with exit code 1
The error from the second script is:
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
File "<__array_function__ internals>", line 6, in savetxt
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
% (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')
Process finished with exit code 1
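Can someone help me solve this? Please mention the exact changes I need to make to the code and provide the complete code, since I don't have a coding background. Any further explanation would be great.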
Solution
This is not a complete answer (yet). I am using it to format my questions about your comments above.
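Do your group/dataset names contain spaces?
If so, I think that is a problem with my simple example. I create each CSV file name from the group/dataset path, replacing each '/' with '_'. You need to do the same for spaces: replace each ' ' with '-' by adding .replace(' ','-'). Print the csvfname variable to confirm that it works as expected (and produces a valid file name).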
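The renaming step described above can be sketched as follows. The helper name csv_name_from_path is hypothetical, and the final regular-expression pass is an added assumption rather than part of the original suggestion: it replaces any remaining character outside a conservative set, because Windows also rejects characters such as ':' or '?' in file names.

```python
import re

def csv_name_from_path(h5_path):
    """Build a CSV file name from an HDF5 object path (hypothetical helper)."""
    # Drop the leading '/', then turn path separators into underscores.
    name = h5_path.lstrip('/').replace('/', '_')
    # Replace spaces with '-', as suggested in the answer.
    name = name.replace(' ', '-')
    # Assumption: replace anything else outside a conservative character set,
    # since some characters (':', '?', '*', ...) are invalid on Windows.
    name = re.sub(r'[^A-Za-z0-9._-]', '-', name)
    return name + '.csv'

csvfname = csv_name_from_path('/Genotypes/genotype a/genotype a_1/calls')
print(csvfname)  # Genotypes_genotype-a_genotype-a_1_calls.csv
```

Inside dump_calls2csv(), the same call could take node.name as its argument in place of the chained .replace() expression.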
If that is not enough to solve your problem, read on.
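As I understand it, /Genotypes/genotype a/genotype a-1/calls is the dataset you want to write to CSV (one file per genotype x/genotype x-i/calls dataset). If so, you probably have a mismatch between the data in the dataset and the format used to write it. First, print the dtype in dump_calls2csv(), like this: print(arr.dtype). Comment out the np.savetxt() line until this works. From the error message, I expect you will get "|S1" rather than an integer type, which is a problem because my example writes with an integer format: fmt='%d'. Ideally, you get the dtype of the dataset/array and then create a fmt= string to match.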
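A minimal sketch of matching the format to the dtype, assuming the calls data is a 2-D array of single-byte strings as the "|S1" in the error message suggests. The helpers fmt_for_dtype and to_text_array are hypothetical names, and the small calls array stands in for the real dataset. Byte strings are decoded first so the CSV holds A rather than b'A'.

```python
import io
import numpy as np

def fmt_for_dtype(dtype):
    """Pick a savetxt format string from a NumPy dtype (hypothetical helper)."""
    kind = np.dtype(dtype).kind
    if kind in 'iu':       # signed or unsigned integers
        return '%d'
    if kind == 'f':        # floating point
        return '%g'
    return '%s'            # strings and anything else

def to_text_array(arr):
    """Decode byte strings (e.g. dtype '|S1') into text; leave others alone."""
    if arr.dtype.kind == 'S':
        return np.char.decode(arr, 'ascii')
    return arr

# Stand-in for arr = node[:] in dump_calls2csv(); the values are made up.
calls = np.array([[b'A', b'C'], [b'G', b'T']], dtype='S1')
print(calls.dtype)  # |S1

arr = to_text_array(calls)
out = io.StringIO()  # a file name such as csvfname works here as well
np.savetxt(out, arr, fmt=fmt_for_dtype(arr.dtype), delimiter=',')
print(out.getvalue())
```

With an integer dataset the same call falls back to '%d', so the sample file from the question would still export as before.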
Hope that helps. If not, please update your question with the new information.