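python - Filtering and parsing text in Solar Region Summary files
Problem description
I am trying to filter some .txt files that are named after a date in YYYYMMDD format and contain data about solar active regions. I wrote code that, given a date in YYYYMMDD format, lists the files within the time range in which I expect to find the active region I am looking for, and parses the information from the matching entry. An example of these files is shown below; more information about them (if you are curious) is available on the SWPC website.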
:Product: 0509SRS.txt
:Issued: 2012 May 09 0030 UTC
# Prepared jointly by the U.S. Dept. of Commerce, NOAA,
# Space Weather Prediction Center and the U.S. Air Force.
#
Joint USAF/NOAA Solar Region Summary
SRS Number 130 Issued at 0030Z on 09 May 2012
Report compiled from data received at SWO on 08 May
I. Regions with Sunspots. Locations Valid at 08/2400Z
Nmbr Location Lo Area Z LL NN Mag Type
1470 S19W68 284 0030 Cro 02 02 Beta
1471 S22W60 277 0120 Cso 05 03 Beta
1474 N14W13 229 0010 Axx 00 01 Alpha
1476 N11E35 181 0940 Fkc 17 33 Beta-Gamma-Delta
1477 S22E73 144 0060 Hsx 03 01 Alpha
IA. H-alpha Plages without Spots. Locations Valid at 08/2400Z May
Nmbr Location Lo
1472 S28W80 297
1475 N05W05 222
II. Regions Due to Return 09 May to 11 May
Nmbr Lat Lo
1460 N16 126
1459 S16 110
The code I use to parse these txt files is:
import glob

def seeker(noaa_number, t_start, path = None):
    '''
    This function will open an SRS file
    and look for each line if the given AR
    (specified by its NOAA number) is there.
    If so, this function should grab the
    entries and return them.
    '''
    #defaulting path if none is given
    if path is None:
        #assigning
        path = 'defaultpath'
    #listing the items within the directory
    files = sorted(glob.glob(path+'*.txt'))
    #finding the index in the list of
    #the starting time
    index = files.index(path+str(t_start)+'SRS.txt')
    #looping over each file
    for file in files[index: index+20]:
        #opening file
        f = open(file, 'r')
        #reading the lines
        text = f.readlines()
        #looping over each line in the text
        for line in text:
            #checking if the noaa number is mentioned
            #in the given line
            if noaa_number in line:
                #test print
                print('Original line: ', line)
                #slicing the text to get the column values
                nbr = line[:4]
                Location = line[5:11]
                Lo = line[14:18]
                Area = line[19:23]
                Z = line[24:28]
                LL = line[29:31]
                NN = line[34:36]
                MagType = line[37:]
                #test prints
                print('nbr: ', nbr)
                print('location: ', Location)
                print('Lo: ', Lo)
                print('Area: ', Area)
                print('Z: ', Z)
                print('LL: ', LL)
                print('NN: ', NN)
                print('MagType: ', MagType)
    return
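I tested this and it works, but I feel a bit dumb about it for two reasons:
Even though these files follow a standard format, given how I slice each line by index, a single extra space is enough to break the code. Is there a better option?
The information in tables IA and II is not relevant to me, so ideally I would like to prevent my code from scanning them. Since the number of rows in the first table varies, is it possible to tell the code when to stop reading a given file?
Thanks for your time!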
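Solution
Robustness:
Instead of slicing by absolute position, you can use the .split() method to split the line into a list. With no arguments it splits on any run of whitespace, so it is robust to extra spaces.
So instead of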
Location = line[5:11]
Lo = line[14:18]
Area = line[19:23]
Z = line[24:28]
LL = line[29:31]
NN = line[34:36]
you can use
Location = line.split()[1]
Lo = line.split()[2]
Area = line.split()[3]
Z = line.split()[4]
LL = line.split()[5]
NN = line.split()[6]
If you want it to be faster, you can split the line once and then pull the relevant data from that same list, rather than splitting every time:
data = line.split()
Location = data[1]
Lo = data[2]
Area = data[3]
Z = data[4]
LL = data[5]
NN = data[6]
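A minimal sketch of the single-split idea, extended to cover the magnetic type column as well. The function name parse_region_line and the dictionary keys are hypothetical, chosen here only for illustration, and the sketch assumes every row of table I has the eight whitespace-separated columns shown in the sample file.

```python
def parse_region_line(line):
    """Split one row of table I into named fields (hypothetical helper)."""
    # Split on any run of whitespace, at most 7 times, so the eighth piece
    # holds everything that is left (the magnetic type).
    fields = line.split(None, 7)
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    keys = ("nbr", "location", "lo", "area", "z", "ll", "nn", "mag_type")
    record = dict(zip(keys, fields))
    # The last piece keeps its trailing newline, so strip it.
    record["mag_type"] = record["mag_type"].strip()
    return record


# Extra spaces between columns do not matter.
row = "1476   N11E35   181  0940 Fkc  17   33  Beta-Gamma-Delta\n"
print(parse_region_line(row))
```

Raising ValueError on an unexpected column count is a design choice for this sketch: a row that does not match the assumed layout is reported instead of being silently mis-parsed.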
Stopping:
To stop it from reading further once it has passed the relevant data, you could exit the loop as soon as noaa_number is no longer found in the line:
# Inside the per-file loop, before looping through the lines.
started_reading = False  ## Set this to False so
                         ## that it doesn't exit
                         ## before it gets to the
                         ## relevant data
for line in text:
    if noaa_number in line:
        started_reading = True
        ## Parsing stuff
    elif started_reading:
        break  # exits the loop
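Another option, sketched here under the assumption that every file uses the section headings seen in the sample ("I. Regions with Sunspots", "IA. H-alpha Plages", "II. Regions Due to Return"), is to track which table the loop is in and inspect only the rows of table I. The name find_region is hypothetical, and comparing the first column rather than using "in" is a swapped-in detail, so that a number appearing in another column does not count as a match.

```python
def find_region(lines, noaa_number):
    """Return the columns of the table I row for noaa_number, or None."""
    in_table = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("I. "):
            in_table = True  # rows of table I follow this heading
            continue
        if stripped.startswith(("IA.", "II.")):
            break  # a later table begins: stop reading this file
        if not in_table:
            continue  # still in the file header
        fields = stripped.split()
        # Compare the first column only, not the whole line.
        if fields and fields[0] == str(noaa_number):
            return fields
    return None


sample = """\
I.  Regions with Sunspots.  Locations Valid at 08/2400Z
Nmbr Location  Lo  Area  Z   LL   NN Mag Type
1470 S19W68   284  0030 Cro  02   02 Beta
1476 N11E35   181  0940 Fkc  17   33 Beta-Gamma-Delta
IA. H-alpha Plages without Spots.  Locations Valid at 08/2400Z May
Nmbr  Location  Lo
1472  S28W80   297
""".splitlines()

print(find_region(sample, "1476"))
print(find_region(sample, "1472"))  # listed only in table IA
```

The second call returns None because region 1472 appears only in table IA, which the loop never reaches.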