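python-3.x - Reading data with a header of varying length
Question
I want to read a file with a variable-length header in Python, then extract the variables that follow the header into a DataFrame/Series.
The data looks like this: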
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as:
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do this with:
infile="/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'],)
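But I skip 53 rows only because I counted by hand how many to skip. I have a batch of these files, and some of them do not have exactly 53 header lines. What is the best way to handle this, and what criterion would make Python always read only the three data columns, wherever they start? I think what I want is for Python to start reading data right after it encounters: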
Mole fraction error flag description :
0 : Valid data
2 : Missing data
How should I do that? Or is there a better criterion to use?
Answer
You can split the file contents on the last header marker, like this:
import io
import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()
# keep only the text after the last header marker
infile = myfile.split('Mole fraction error flag description :')[-1]
# drop the remaining flag-description lines such as "0 : Valid data"
# (there is likely a better indicator of a non-data line; you know the data better)
infile = '\n'.join(line for line in infile.split('\n') if ' : ' not in line)
# create dataframe; read_fwf expects a path or file-like object, so wrap the string
flightdata = pd.read_fwf(io.StringIO(infile), header=None, names=['Time', 'C2H6', 'Flag'])
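If the marker text itself might differ between files, another possible criterion is the shape of the data rows. The sketch below is only an illustration, not part of the original answer: it assumes a data row is exactly three whitespace-separated numeric fields (integer time, decimal value, integer flag), and the names `DATA_ROW` and `find_data_start` are hypothetical. A header line that happens to consist of three such numbers would fool it.

```python
import re

# Assumption: a data row looks like "31636 0.69 0", i.e. an integer,
# a (possibly negative, possibly decimal) number, and an integer flag.
DATA_ROW = re.compile(r"^\s*\d+\s+-?\d+(?:\.\d+)?\s+\d+\s*$")


def find_data_start(lines):
    """Return the 0-based index of the first line that looks like a data row."""
    for index, line in enumerate(lines):
        if DATA_ROW.match(line):
            return index
    raise ValueError("no line matching the assumed data-row pattern was found")


if __name__ == "__main__":
    sample = [
        "Zero Std: 0 ppb",
        "Mole fraction error flag description :",
        "0 : Valid data",
        "2 : Missing data",
        "31636 0.69 0",
        "31637 0.66 0",
    ]
    # Index of the first data row, i.e. the number of header lines to skip
    print(find_data_start(sample))
```

The returned index equals the number of header lines, so it could be passed as `skiprows` to the `pd.read_fwf` call from the question instead of the hard-coded 53.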