Parsing a text file with random spacing and repeated text?

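Question

I am trying to parse a large text file with inconsistent spacing and repeated lines. I do not need most of the text in the file, but from a single line I might need, for example, 6 items, some separated by commas and some by spaces.

Example line: 1 23456 John,Doe 366:F.7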

What I want (in CSV format): 1 23456 John Doe 366 F.7 (each value in its own cell)

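Ultimately I want to convert the output to CSV. So far I have tried going through the file line by line and pulling out the components I need by their fixed character positions, but I feel there must be a better way.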

import csv

def is_page_header(line):
    # A page header has '1' in the first column and is not the "RUN DATE:" line.
    return line[0] == '1' and "RUN DATE:" not in line

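def read_header(inFile):
    # Skip header lines up to and including the row of asterisks.
    while True:
        line = inFile.readline()
        # readline() returns "" at end of file; stop there to avoid looping forever.
        if line == "" or '************************' in line:
            break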

def is_rec_start(line):
    # A record starts with an integer in the first six columns.
    try:
        int(line[0:6])
        return True
    except ValueError:
        return False

filename = r"TEXT_TEST.txt"

# The with block closes the file even if an error occurs.
with open(filename) as inFile:
    while True:
        line = inFile.readline()

        if line == "\n":
            continue
        elif line == "":
            break
        elif is_page_header(line):
            read_header(inFile)
        elif is_rec_start(line):
            docketno = int(line[0:6])
            fileno = line[8:20]
        elif 'FINGERPRINTED' in line:
            fingerprinted = True
        else:
            print(line)

Tags: python

Answer


You can use a regular expression:

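import re
import csv

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"(\d+)\s+(\d+)\s*(\w+)\s*\,\s*(\w+)\s*(\d+)\s*\:\s*([\w\.]+)")

# newline="" lets the csv module control the line endings itself.
with open("TEXT_TEST.txt") as txt_file, open("CSV_TEST.csv", "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in txt_file:
        # findall returns one tuple of six captured fields per match.
        g = pattern.findall(line)
        if g:
            csv_writer.writerows(g)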
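As a quick check, the same pattern can be run against the example line from the question. This is only a sketch: the extra spaces in `sample` are an assumption standing in for the file's irregular spacing, and `parse_line` is a hypothetical helper name that does not appear in the code above.

```python
import re

# Same pattern as in the answer, written as a raw string.
pattern = re.compile(r"(\d+)\s+(\d+)\s*(\w+)\s*\,\s*(\w+)\s*(\d+)\s*\:\s*([\w\.]+)")

def parse_line(line):
    """Return the six captured fields as a tuple, or None if the line does not match."""
    match = pattern.search(line)
    return match.groups() if match else None

# Assumed spacing; the question only shows single spaces.
sample = "1    23456   John,Doe   366:F.7"
print(parse_line(sample))
# -> ('1', '23456', 'John', 'Doe', '366', 'F.7')

# A line without the expected fields is simply skipped.
print(parse_line("************************"))
# -> None
```

Lines that do not contain all six fields return None, which is why the loop above can write only the matching lines and ignore headers and repeated text.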

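(\d+): \d matches any decimal digit from 0 to 9, the + after it means one or more, and the parentheses () capture the matched text so it can be extracted for further processing.

\s+: \s matches a whitespace character (space, tab, newline), and + means one or more.

\s*: the * after \s means zero or more whitespace characters, so the spacing is optional.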

\w: matches a word character, that is A-Z, a-z, 0-9 and the underscore _ (for str patterns in Python 3 it also matches non-ASCII letters and digits).

[] matches any single character from a set, e.g. [abc] matches a single a, b, or c and nothing else, so [\w\.] matches a word character or a literal dot. A backslash escapes a character that has special meaning in a regular expression; inside [] the dot is already literal, so the backslash before it is optional.
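The behaviour of \w and of the character class can be checked directly with re.findall. The test strings below are made up for illustration (only 366:F.7 comes from the question), and the variable names are hypothetical.

```python
import re

# \w matches letters, digits and the underscore, but not ':' or '.'.
word_runs = re.findall(r"\w+", "John_Doe 366:F.7")
print(word_runs)
# -> ['John_Doe', '366', 'F', '7']

# Adding the dot to a character class keeps 'F.7' together.
class_runs = re.findall(r"[\w.]+", "366:F.7")
print(class_runs)
# -> ['366', 'F.7']

# Escaping the dot inside [] gives the same result.
escaped_runs = re.findall(r"[\w\.]+", "366:F.7")
print(escaped_runs == class_runs)
# -> True
```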


