How to map values from a CSV to columns in pandas, using the file name

Problem description

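I have multiple .bz2 files in a folder. I have to extract them, which gives me CSV files, and load those into a database. The files have no header row. I have to create the headers in pandas and map the columns to them, together with some custom columns. I wrote the code below to iterate over the .bz2 files and read the extracted CSV data into a DataFrame.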

    File names:
    1. error_file_20210625_Events_00001.error.csv.bz2
    2. error_file_20210625_Events_00002.error.csv.bz2

After extraction, the CSV data looks like this:

error_file_20210625_Events_00001.error.csv:
20210625,111|112|246, apple, 20210625_error,new,,,,,D000000,true


error_file_20210625_Events_00002.error.csv:
20210625,111|222|246, pear, 20210625_error,new,,,,,C000000,false
    

Code to extract the CSV data from these bz2 files:

    import glob
    import pandas as pd

    date = "20210625"  # example value, taken from the sample file names
    error_folder = "c:\\"
    Error_filenames = glob.glob(error_folder + "error_file_" + date + "_Events_*.error.csv.bz2")
    if Error_filenames:
        # The files have no header row, so header=None keeps the first record as data
        df_error_generator = (pd.read_csv(f, header=None, compression="bz2") for f in Error_filenames)
        df_error = pd.concat(df_error_generator, ignore_index=True)
        filename = "error_file_" + date + "_Events_error.csv"
        df_error.to_csv("c:\\chits\\" + filename)
        df_error_csv = pd.read_csv("c:\\chits\\" + filename)

headers=["filedate","errorcode","errorROEID","ROEID","type","rawrecord","filename"]

Now I have to map the data onto the headers:

filedate  errorcode     errorROEID  ROEID            type  rawrecord                                                           filename
20210625  111|112|246   apple       20210625_error   new   20210625,111|112|246, apple, 20210625_error,new,,,,,D000000,true    error_file_20210625_Events_00001.error.csv
20210625  111|222|246   pear        20210625_error   new   20210625,111|222|246, pear, 20210625_error,new,,,,,C000000,false    error_file_20210625_Events_00002.error.csv

Here:

filedate is the date from the file
errorcode is the second column
errorROEID is the third column
ROEID is the fourth column
type is the fifth column
rawrecord is the whole row
filename is the file name

I can't figure out how to extract the file name and the file date, or how to insert the whole row into the rawrecord column. Can anyone help?

Tags: python, pandas, csv

Solution


Here is a possible solution using only Python's built-in libraries:

from operator import itemgetter
import glob
import bz2
import csv

header = ["filedate", "errorcode", "errorROEID", "ROEID", "type", "rawrecord", "filename"]
# Pick columns by position; only the first five are written out below
req_cols = itemgetter(0, 1, 2, 3, 4, 9, 10)

with open("output.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

    for error_filename in glob.glob("*.bz2"):
        # Open the bz2 file in text mode, so no separate extraction step is needed
        with bz2.open(error_filename, "rt", encoding="utf-8") as f_error_file:
            # skipinitialspace=True drops the space that follows some commas
            csv_input = csv.reader(f_error_file, skipinitialspace=True)

            for orig_row in csv_input:
                row = req_cols(orig_row)
                # rawrecord is the whole row rejoined with commas; filename is the source file
                csv_output.writerow([row[0], row[1], row[2], row[3], row[4], ",".join(orig_row), error_filename])

A DataFrame can be built as follows:

import pandas as pd
from operator import itemgetter
import glob
import bz2
import csv

header = ["filedate", "errorcode", "errorROEID", "ROEID", "type", "rawrecord", "filename"]
data = []
# Pick columns by position; only the first five are used below
req_cols = itemgetter(0, 1, 2, 3, 4, 9, 10)

for error_filename in glob.glob("*.bz2"):
    # Read the compressed file directly in text mode
    with bz2.open(error_filename, "rt", encoding="utf-8") as f_error_file:
        csv_input = csv.reader(f_error_file, skipinitialspace=True)

        for orig_row in csv_input:
            row = req_cols(orig_row)
            # One output row: five mapped columns, the rejoined raw record, the source file name
            data.append([row[0], row[1], row[2], row[3], row[4], ",".join(orig_row), error_filename])

# Attach the custom header when building the DataFrame
df = pd.DataFrame(data, columns=header)
print(df)

This gives you:

   filedate    errorcode errorROEID           ROEID type                                                       rawrecord                                        filename
0  20210625  111|112|246      apple  20210625_error  new  20210625,111|112|246,apple,20210625_error,new,,,,,D000000,true  error_file_20210625_Events_00001.error.csv.bz2
1  20210625  111|222|246       pear  20210625_error  new  20210625,111|222|246,pear,20210625_error,new,,,,,C000000,false  error_file_20210625_Events_00002.error.csv.bz2
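
In the output above, rawrecord no longer contains the spaces that follow some commas in the source files, because skipinitialspace=True strips them before the fields are rejoined. If the raw line should be kept exactly as stored, as in the table in the question, one option is to read each line as text first and parse it separately. The following is a minimal sketch under that assumption; the name rows_from_file is hypothetical, and it assumes that no quoted field spans more than one line.

```python
import bz2
import csv


def rows_from_file(path):
    """Yield [filedate, errorcode, errorROEID, ROEID, type, rawrecord, filename] per line.

    Hypothetical helper: assumes one record per physical line.
    """
    # newline="" leaves line endings untouched so they can be stripped explicitly
    with bz2.open(path, "rt", encoding="utf-8", newline="") as handle:
        for line in handle:
            raw = line.rstrip("\r\n")  # the record as stored, minus the line ending
            if not raw:
                continue  # skip blank lines
            # Parse just this one line into fields
            fields = next(csv.reader([raw], skipinitialspace=True))
            yield fields[:5] + [raw, path]
```

The yielded lists can be collected with list(rows_from_file(name)) for each file and passed to pd.DataFrame(data, columns=header) in the same way as data above.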

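The filename column above still carries the .bz2 suffix, whereas the table in the question lists the extracted CSV name, and the question also asks how to get the date out of the file name. A rough sketch of one way to do both follows. The function describe_error_file and the regular expression are assumptions inferred from the two sample file names, not part of the original answer.

```python
import re
from pathlib import Path

# Hypothetical pattern, inferred from names like
# error_file_20210625_Events_00001.error.csv.bz2
FILENAME_RE = re.compile(r"error_file_(\d{8})_Events_\d+\.error\.csv(?:\.bz2)?$")


def describe_error_file(path):
    """Return (filedate, csv_name) derived from the file name alone."""
    name = Path(path).name
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    # Drop the compression suffix to get the name of the extracted CSV
    csv_name = name[:-len(".bz2")] if name.endswith(".bz2") else name
    return match.group(1), csv_name


print(describe_error_file("error_file_20210625_Events_00001.error.csv.bz2"))
# → ('20210625', 'error_file_20210625_Events_00001.error.csv')
```

The second element of the result could replace error_filename in the loops above if the uncompressed name is wanted in the filename column.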