Importing a .dat file in Python without knowing its structure

Problem description

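I am trying to load and view the contents of the data that can be downloaded from here. After that I need to analyze it. I have already asked a question about this, but I could not get any solution.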

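Now, I went through their label file, located here. In it, the following is mentioned:

"will write helpful Python-based letters to describe each object
// For codes, see http://docs.python.org/library/struct.html // Format will be comma separated, beginning with "RJW", then as key // {NAME}, {FORMAT}, {Number of dims}, {Size Dim 1}, {Size Dim 2}, ... // where {FORMAT} is the Python code for the type, i.e. i for uint32 // and there are as many Size Dim's as the number of dims."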

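So I thought I could try Python, of which I have a working knowledge. I started with this program, which I got from here (for simplicity, the Python file and the data file are in the same folder):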

import numpy as np
data = np.genfromtxt('JAD_L30_LRS_ELC_ANY_CNT_2018091_V03.dat')
print(data)

I got the error "UnicodeDecodeError: 'cp949' codec can't decode byte 0xff in position 65: illegal multibyte sequence".

If I change the code to the following (as described here):

data=open('JAD_L30_LRS_ELC_ANY_CNT_2018091_V03.DAT', encoding='utf-8')
print(data)

the error message disappears, but all I get is:

<_io.TextIOWrapper name='JAD_L30_LRS_ELC_ANY_CNT_2018091_V03.DAT' mode='r' encoding='utf-8'>

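I checked other answers on StackOverflow but did not find one that works. My question may be very similar to the one posted here.

I need to view the contents of this .dat file first, and then export it to another format, such as .csv.

Any help would be greatly appreciated.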

Tags: python, import

Solution


You need to open the file in binary mode.

with open('JAD_L30_LRS_ELC_ANY_CNT_2018091_V03.DAT', 'rb') as f:
    while True:
        # 160036 bytes is the record size as per the LBL file
        chunk = f.read(160036)
        if not chunk:  # an empty bytes object means end of file
            break
        print(chunk)
        # the file is huge, so pause after each record;
        # press Enter for the next chunk, Ctrl+C to interrupt
        input('Hit Enter...')
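Printing a whole 160036-byte record at once is hard to read, so it may help to look at only the first few bytes in a hex plus ASCII view. The following is a minimal sketch: the helper name `hexdump` and the sample bytes are made up for illustration, and with the real file you would pass something like `f.read(80)` instead of `sample`.

```python
def hexdump(buf, width=16):
    """Return a hex + ASCII view of a bytes object, one row per `width` bytes."""
    rows = []
    for offset in range(0, len(buf), width):
        row = buf[offset:offset + width]
        hex_part = ' '.join(f'{b:02x}' for b in row)
        # show printable ASCII as-is, everything else as a dot
        text = ''.join(chr(b) if 32 <= b < 127 else '.' for b in row)
        rows.append(f'{offset:06x}  {hex_part.ljust(width * 3)} {text}')
    return '\n'.join(rows)

# Synthetic sample (an assumption, not real file content):
# a 21-character timestamp followed by two non-text bytes.
sample = b'2018-091T23:56:08.925' + bytes([106, 0xff])
print(hexdump(sample))
```

Bytes such as `0xff` show up as dots in the ASCII column, which is the kind of non-text content that makes text-mode decoding fail.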

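Note that you can parse the LBL file, construct a format string for use with the struct module, and parse each chunk into meaningful fields. That is what the comment you quoted is saying.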
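As a minimal sketch of that idea on a toy label: the three RJW lines below are invented to match the `RJW, {NAME}, {FORMAT}, {Number of dims}, {Size Dim 1}, ...` pattern from the quoted comment and are not copied from the real LBL file. `math.prod` needs Python 3.8 or later.

```python
import struct
from math import prod

# Hypothetical RJW comment lines, shaped like the pattern described in the label.
rjw_lines = [
    'RJW, UTC, c, 1, 21',
    'RJW, PACKETID, B, 1, 1',
    'RJW, MATRIX, f, 2, 2, 3',
]

names, parts = [], []
for line in rjw_lines:
    _, name, code, ndim, *dims = line.split(', ')
    count = prod(int(d) for d in dims)     # total number of items in the object
    code = 's' if code == 'c' else code    # read a char array as one bytes field
    names.append(name)
    parts.append(f'{count}{code}')

# '=' selects native byte order with standard sizes and no padding
fmt = '= ' + ' '.join(parts)
record = struct.Struct(fmt)

# Pack a fake record and unpack it again to check the round trip.
raw = record.pack(b'2018-091T23:56:08.925', 106, *[float(i) for i in range(6)])
values = record.unpack(raw)
print(fmt, record.size)
print(values[0], values[1], values[2:])
```

A multi-dimensional object such as the 2 x 3 `MATRIX` comes back as six flat values, so reshaping it is left to the caller.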

"""Example of reading NASA JUNO JADE CALIBRATED SCIENCE DATA
https://pds-ppi.igpp.ucla.edu/search/view/?f=yes&id=pds://PPI/JNO-J_SW-JAD-3-CALIBRATED-V1.0/DATA/2018/2018091/ELECTRONS/JAD_L30_LRS_ELC_ANY_CNT_2018091_V03&o=1
https://stackoverflow.com/a/66687113/4046632
"""

import struct
from functools import reduce
from operator import mul
from collections import namedtuple

__author__ = "Boyan Kolev, https://stackoverflow.com/users/4046632/buran"

with open('JAD_L30_LRS_ELC_ANY_CNT_2018091_V03.LBL') as f:
    rjws = [line.strip('\n/* ') for line in f if line.startswith('/* RJW')]

# create the format string for struct
rjws = rjws[2:] # exclude first 2 RJW comments related to file itself
names = []
FMT = '='
print(f'Number of objects: {len(rjws)}')
for idx, rjw in enumerate(rjws):
    _, name, fmt, num_dim, *dims = rjw.split(', ')
    fstr = f'{reduce(mul, map(int, dims))}{fmt}'
    FMT = f'{FMT} {fstr}'
    names.append(name)
    print(f'{idx}:{name}, {fstr}')
FMT = FMT.replace('c', 's') # for convenience treat 21c as s char[]
print(f"Format string: {repr(FMT)}")

# parse DAT file
s = struct.Struct(FMT)
print(f'Struct size:{s.size}')
with open('JAD_L30_LRS_ELC_ANY_CNT_2018091_V03.DAT', 'rb') as f:
    n = 0
    while True: # in python3.8+ this loop can be simplified with walrus operator
        chunk = f.read(s.size)
        if not chunk:
            break
        data = s.unpack_from(chunk)
        # process data further, e.g. split data in 2D containers where appropriate
        n += 1

print(f'Number of records: {n}')

# make a named tuple to represent first 10 fields
# for nice display. This basic use of namedtuple works only
# for first 23 objects, which have single item.
num_fields = 10
Record = namedtuple('Record', names[:num_fields])
record = Record(*data[:num_fields])
print('\n----------------------\n')
print(f'First {num_fields} fields of the last record.')
print(record)

Output:

Number of objects: 49
0:DIM0_UTC, 21c
1:PACKETID, 1B
2:DIM0_UTC_UPPER, 21c

--- omitted for sake of brevity ---

46:DIM2_AZIMUTH_DESPUN_LOWER, 3072f
47:MAG_VECTOR, 3f
48:ESENSOR, 1H
Format string: '= 21s 1B 21s 1b 21s 1b 1H 1B 1B 1B 1B 1h 1h 1f 1f 1f 1f 1f 1f 1f 1f 1f 1f 3f 3f 3f 1f 9f 9f 9f 1f 1I 1I 1H 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3072f 3f 1H'
Struct size:160036
Number of records: 1101

----------------------

First 10 fields of the last record.
Record(DIM0_UTC=b'2018-091T23:56:08.925', PACKETID=106, DIM0_UTC_UPPER=b'2018-092T00:01:08.925', PACKET_MODE=1, DIM0_UTC_LOWER=b'2018-091T23:51:08.925', PACKET_SPECIES=-1, ACCUMULATION_TIME=600, DATA_UNITS=2, SOURCE_BACKGROUND=3, SOURCE_DEAD_TIME=0)
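Since the question also asks about exporting to another format such as .csv, here is one possible sketch using the standard `csv` module. It assumes a toy two-field layout and synthetic bytes so that it runs on its own. The names `toy_names`, `toy` and `blob` are placeholders. For the real file you would use the `FMT` and `names` built above, and probably only the scalar columns, since the 3072-item arrays do not fit naturally into a flat CSV row.

```python
import csv
import io
import struct

# Toy layout standing in for the leading scalar fields of a real record.
toy_names = ['DIM0_UTC', 'PACKETID']
toy = struct.Struct('= 21s 1B')

# Synthetic binary data (an assumption): two records back to back.
blob = (toy.pack(b'2018-091T00:00:00.000', 106)
        + toy.pack(b'2018-091T00:10:00.000', 107))

out = io.StringIO()   # swap for open('out.csv', 'w', newline='') to write to disk
writer = csv.writer(out)
writer.writerow(toy_names)
for utc, packet_id in toy.iter_unpack(blob):
    # decode bytes to str so the CSV holds readable text instead of b'...'
    writer.writerow([utc.decode('ascii'), packet_id])

print(out.getvalue())
```

`iter_unpack` requires the buffer length to be a multiple of the struct size, so for a file read in one go it also acts as a check that the format string matches the record length.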

Link to the GitHub gist

