Extracting data from a file.txt text file to CSV

Problem description

I have a text file whose data is formatted as follows:

1   Disease1
     01000 01001 01002 01003 01004 01005 01006 01010 01011 01012 01013 01014 01015 01016 01080 01081 01082 01083 01084 01085
     01086 01090 01091 01092 01093 01094 01095 01096 01100 01101 01102 01103 01104 01105 01106 01110 01111 01112 01113 01114
     1370 1371 1372 1373 1374 V1201 

                           

2    Disease2 (except in illness)
     0031 0202 0223 0362 0380 0381 03810 03811 03812 03819 0382 0383 03840 03841 03842 03843 03844 03849 0388 0389
     0545 449 77181 7907 99591 99592                            

3    Disease3; unspecified site
     0200 0208 0209 0218 0219 0228 0229 0230 0231 0232 0233 0238 0239 024 025 0260 0269 0270 0271 0272
     0278 0279 0300 0301 0302 0303 0308 0309 0312 0318 0319 03289 0329 0330 0331 0338 0339 0341 0363 03681
     04181 04182 04183 04184 04185 04186 04189 0419 390 3929 7953 79531 79539 V090 V091 V092 V093 V094 V0950 V0951
     V096 V0970 V0971 V0980 V0981 V0990 V0991 V1204                        
... and so on

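How can I extract this text file into a CSV file with one column for the disease category (Disease1, Disease2, and so on) and another column for the corresponding codes?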

Tags: python

Solution


A manual solution:

import re
import csv
from pathlib import Path # pathlib is friend, not food

# Create Path() object
# Join subpath to csv output file.
folder = Path(r"C:\path\to\output\folder")
csv_output_path = folder.joinpath("my-csv-file.csv")

# Sample data from OP.
sample = """1 Disease1 01000 01001 01002 01003 01004 01005 01006 01010 01011 01012 01013 01014 01015 01016 01080 01081 01082 01083 01084 01085 01086 01090 01091 01092 01093 01094 01095 01096 01100 01101 01102 01103 01104 01105 01106 01110 01111 01112 01113 01114 1370 1371 1372 1373 1374 V1201

2 Disease2 (except in illness) 0031 0202 0223 0362 0380 0381 03810 03811 03812 03819 0382 0383 03840 03841 03842 03843 03844 03849 0388 0389 0545 449 77181 7907 99591 99592

3 Disease3; unspecified site 0200 0208 0209 0218 0219 0228 0229 0230 0231 0232 0233 0238 0239 024 025 0260 0269 0270 0271 0272 0278 0279 0300 0301 0302 0303 0308 0309 0312 0318 0319 03289 0329 0330 0331 0338 0339 0341 0363 03681 04181 04182 04183 04184 04185 04186 04189 0419 390 3929 7953 79531 79539 V090 V091 V092 V093 V094 V0950 V0951 V096 V0970 V0971 V0980 V0981 V0990 V0991 V1204"""


# Regular expression that splits a record into header text and codes.
# Note: the (?=0) lookahead assumes the first code of every record starts with 0.
ptrn = r"(.+?)\s+?(?=0)(.+)"
p = re.compile(ptrn)

# temp dictionary to store initial values
ddict = {}

# Split on a double newline because that's how the sample data is divided. Adjust accordingly.

for row in sample.split("\n" * 2):

    # If we find a match
    res = p.search(row)
    if res:

        # Assign raw header and value variables.
        disease, numbers = res.groups()

        # Capture 'Disease' plus its full number (e.g. 'Disease12').
        disease_key = re.search(r"(Disease\d+)", disease.strip()).group()

        # Substitute whitespace with commas and store in the dictionary.
        ddict[disease_key] = re.sub(r"\s+", ",", numbers.strip())


# The fun part - Dictionary -> Records

# 'Header' values (dictionary keys), materialized as a list for csv.DictWriter
csv_columns = list(ddict)


# Probably a better way to evaluate this...
matrix = []

# Need equal length "columns" to write.
max_count = -1

# Iterate over the dictionary values and append each code list to the matrix,
# tracking the longest list in max_count.
for i, v in enumerate(ddict.values()):
    matrix.append(v.split(","))
    m_len = len(matrix[i])
    if m_len > max_count:
        max_count = m_len

# Every column must have the same length,
# so iterate again and pad the shorter columns
# with empty strings up to max_count.
for i, row in enumerate(matrix):
    m_len = len(matrix[i])
    if m_len < max_count:
        matrix[i].extend([""] * (max_count - m_len))

# Create a list variable to hold subdictionaries (records)
output = []

# Iterate for each `row` of our max count
for r in range(max_count):
    # Create generator for the column values in the current row
    row = (matrix[c][r] for c, val in enumerate(matrix))

    # Zip the columns and row values into a dictionary
    # Append to output list variable.
    output.append(dict(zip(csv_columns, row)))


# Write output using csv.DictWriter
# Open the file with newline="" as the csv module recommends
# (on Windows this avoids blank lines between rows).
with csv_output_path.open(mode="w", newline="") as outf:
    csv_writer = csv.DictWriter(outf, fieldnames=csv_columns)
    csv_writer.writeheader()
    csv_writer.writerows(output)
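
The padding and transposition steps above could probably be shortened with `itertools.zip_longest`, which pads the shorter columns while turning columns into rows. This is only a sketch of that alternative: `codes_by_disease` is a hypothetical stand-in for `ddict` with shortened values, and the CSV is written to an in-memory buffer rather than to `csv_output_path`.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical stand-in for `ddict`: header -> comma-joined codes (shortened).
codes_by_disease = {
    "Disease1": "01000,01001,01002",
    "Disease2": "0031,0202",
    "Disease3": "0200",
}

# One list of codes per column.
columns = [value.split(",") for value in codes_by_disease.values()]

# zip_longest transposes the columns into rows and fills the
# missing cells of shorter columns with an empty string.
rows = list(zip_longest(*columns, fillvalue=""))

# Write to an in-memory buffer so the sketch runs without touching disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(list(codes_by_disease))
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With a plain `csv.writer` the rows are tuples, so no intermediate list of dictionaries is needed; to write a real file, the buffer can be swapped for a file opened with `newline=""`.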

Note: the `output` list built by the manual solution will look like this:

[{'Disease1': '01000', 'Disease2': '0031', 'Disease3': '0200'},
 {'Disease1': '01001', 'Disease2': '0202', 'Disease3': '0208'},
 {'Disease1': '01002', 'Disease2': '0223', 'Disease3': '0209'},
 {'Disease1': '01003', 'Disease2': '0362', 'Disease3': '0218'},
 ...
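
The solution above works on a flattened sample (one record per line) and yields one column per disease. If the file keeps the layout shown in the question, with codes on indented continuation lines, and a two-column result (category, code) is wanted, a line-by-line parser may be simpler. The sketch below rests on assumptions not confirmed by the question: header lines start in column 0 with an index number, code lines are indented, and `parse_categories`, `raw_text`, and the column names `category` and `code` are illustrative names only.

```python
import csv
import io
import re

# Shortened excerpt in the question's original layout (illustrative only).
raw_text = """1   Disease1
     01000 01001 01002
     1370 V1201 

2    Disease2 (except in illness)
     0031 0202
     0545 449
"""

# Assumption: a header line starts in column 0 with an index number,
# followed by whitespace and the category name.
header_re = re.compile(r"^(\d+)\s+(.+?)\s*$")


def parse_categories(text):
    """Return a list of (category, code) pairs from the indented layout."""
    pairs = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank or whitespace-only lines
        match = header_re.match(line)
        if match:
            current = match.group(2)  # new category starts here
        elif current is not None and line[0].isspace():
            # Indented line: every whitespace-separated token is a code.
            pairs.extend((current, code) for code in line.split())
    return pairs


pairs = parse_categories(raw_text)

# Write the long-format CSV to an in-memory buffer for demonstration.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["category", "code"])
writer.writerows(pairs)
print(buffer.getvalue())
```

Because codes are taken from indented lines rather than located with a lookahead, this version does not depend on the first code starting with `0`, and it keeps the full category text such as `Disease2 (except in illness)`.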
