Parsing data from a tricky text file with Python: how to put all related data on one line

Problem description

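I'm fairly new to Python and am in the middle of a project where I have to extract forecast wind speed data from a poorly formatted text/NAVTEX file (~150,000 lines; sample file). I've managed to parse the date and the forecast region, but I'm having trouble with the wind speed line "WND:", because in some cases it spans more than one line and in others it doesn't: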

NORTHEAST COAST: *<--- forecast region*
WNG: STORM / FREEZING SPRAY.
WND: NW25. 01/15Z S15. 02/00Z SE25. 02/12Z NE40. 02/18Z N50 LCLY G60 *<--- Wind speed line* 
ALONG THE COAST. *<--- Wind speed line* 
VIS: 02/00Z-03/03Z 0-1 SN.

I had the same problem with the forecast region, but managed to solve it with the following code:

lines = open(newfile,'r').readlines()
finalfile = open(final, 'w')
for i, line in enumerate(lines):
    if line.startswith("AND SOUTH:") or line.startswith("BANKS:"):
        lines[i-1] = lines[-1].strip() + line
        lines.pop(i)
    finalfile.write(line)

I tried doing something similar, using "VIS:" as the keyword, to get "WND:" (wind speed) onto a single line, but I did not get the desired result:

lines = open(newfile,'r').readlines()
finalfile = open(final, 'w')
for i, line in enumerate(lines):
    if line.startswith("AND SOUTH:") or line.startswith("BANKS:"):
        lines[i-1] = lines[-1].strip() + line
        lines.pop(i)
    if line.startswith("VIS:"):
        if not lines[i-1].startswith("WND:") and lines[i-2].startswith("WND:"):
            lines[i-2] = lines[i-1].strip() + lines[i-1]
            lines.pop(i-1)
    finalfile.write(line)

The output I want is:

   NORTHEAST COAST: *<--- forecast region*
    WNG: STORM / FREEZING SPRAY.
    WND: NW25. 01/15Z S15. 02/00Z SE25. 02/12Z NE40. 02/18Z N50 LCLY G60 ALONG THE COAST. *<--- Wind speed line* 
    VIS: 02/00Z-03/03Z 0-1 SN.

From here I think I can split the wind speed line as needed. Thanks in advance.

Tags: python-3.x, text

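Solution

This script finds each section that contains WNG: and then removes the excess newlines (the variable txt is the string from the link in the question) (regex101):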

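import re

def get_lines(txt):
    # Join continuation lines: a line containing ': ' starts a new field,
    # any other line is appended to the field currently being built.
    lines = iter(txt.splitlines())
    buf = next(lines, '')
    for line in lines:
        if ': ' in line:
            yield buf
            buf = line
        else:
            buf += ' ' + line
    if buf:
        yield buf

# txt holds the whole forecast text (the string from the link in the question).
# Each match is a region header line plus its WNG: block, up to the next blank line.
for wind_data in re.findall(r'([^\n]+:\nWNG:.*?)\n\n', txt, flags=re.S):
    for line in get_lines(wind_data):
        print(line)
    print('-' * 80)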

Prints:

EAST COAST-CAPE ST  FRANCIS AND SOUTH:
WNG: NIL.
WND: SW25 LCLY G35 ALONG THE COAST. 14/23Z SW25. 15/05Z SW15 XCPT SW25 OVER SOUTHERN SECTIONS. 15/11Z LGT XCPT W25 OVER SOUTHERN SECTIONS.
--------------------------------------------------------------------------------
EAST COAST-NORTH OF CAPE ST  FRANCIS:
WNG: NIL.
WND: SW25 LCLY G35 ALONG THE COAST. 15/05Z SW15-20. 15/11Z VRB10-15. 15/17Z NW15-20.
--------------------------------------------------------------------------------
NORTHEAST COAST:
WNG: NIL.
WND: SW15-20. 14/21Z VRB15. 15/02Z NW20. 15/23Z NW10-15.
--------------------------------------------------------------------------------
FUNK ISLAND BANK:
WNG: NIL.
WND: SW25. 15/05Z S15-20. 15/17Z VRB10-15.
--------------------------------------------------------------------------------
NORTHERN GRAND BANKS:
WNG: NIL.
WND: SW25. 15/02Z SW15-20.
--------------------------------------------------------------------------------
SOUTHWEST COAST:
WNG: GALE.
WND: W25-35. 14/23Z W25. 15/23Z NW15-20.
--------------------------------------------------------------------------------
SOUTH COAST:
WNG: NIL.
WND: SW25 LCLY G35 ALONG THE COAST. 15/01Z SW25. 15/08Z W25 XCPT W15 OVER NORTHERN SECTIONS.
--------------------------------------------------------------------------------
SOUTHEASTERN GRAND BANKS:
WNG: NIL.
WND: SW15-20. 15/14Z W25.
--------------------------------------------------------------------------------
SOUTHWESTERN GRAND BANKS:
WNG: NIL.
WND: W20. 15/05Z W25.
--------------------------------------------------------------------------------
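
As a follow-up to the step mentioned in the question (splitting the wind speed line once it is on one line), here is a minimal sketch rather than a definitive implementation. The names merge_wnd, split_wind, TIME_RE and SAMPLE are hypothetical and not part of the original answer. It assumes that a WND: continuation line never contains a colon, and that each forecast change starts with a time of the form DD/HHZ (for example 01/15Z). The sample is the excerpt from the question with the annotation arrows removed.

```python
import re

# Excerpt from the question, without the "*<--- ...*" annotations.
SAMPLE = """\
NORTHEAST COAST:
WNG: STORM / FREEZING SPRAY.
WND: NW25. 01/15Z S15. 02/00Z SE25. 02/12Z NE40. 02/18Z N50 LCLY G60
ALONG THE COAST.
VIS: 02/00Z-03/03Z 0-1 SN.
"""

# Assumption: a validity time looks like DD/HHZ followed by whitespace.
TIME_RE = re.compile(r'(\d{2}/\d{2}Z)\s+')


def merge_wnd(lines):
    """Yield stripped lines, folding WND: continuation lines into the WND: line."""
    buf = None  # the WND: line being built, or None when outside a WND: entry
    for line in lines:
        line = line.strip()
        if buf is not None:
            # Assumption: a non-empty line without a colon continues the entry.
            if line and ':' not in line:
                buf += ' ' + line
                continue
            yield buf
            buf = None
        if line.startswith('WND:'):
            buf = line
        else:
            yield line
    if buf is not None:
        yield buf


def split_wind(wnd_line):
    """Split a merged WND: line into (time, forecast) pairs.

    The first pair has time None: it is the forecast valid from issue time.
    """
    body = wnd_line[len('WND:'):].strip()
    # With a capturing group, re.split keeps the matched times:
    # [initial, time1, text1, time2, text2, ...]
    parts = TIME_RE.split(body)
    pairs = [(None, parts[0].strip())]
    for time, text in zip(parts[1::2], parts[2::2]):
        pairs.append((time, text.strip()))
    return pairs


merged = list(merge_wnd(SAMPLE.splitlines()))
for line in merged:
    print(line)

for time, forecast in split_wind(merged[2]):
    print(time, forecast)
```

Because merge_wnd only iterates over its argument, an open file object can be passed in place of SAMPLE.splitlines(), which avoids loading all ~150,000 lines at once. Building a new sequence instead of calling lines.pop() inside the enumerate loop also avoids skipping lines while the list is being modified.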
