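python - How to map values from a CSV to columns in pandas using the file name
Problem description
I have multiple .bz2 files in a folder. I have to extract them, which gives me CSV files, and load those into a database. The files have no header row. I have to create the headers in pandas and map the CSV columns to them, along with some custom columns. I wrote the code below to loop over the .bz2 files and load the extracted CSV data into a DataFrame.
The filenames are: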
1.error_file_20210625_Events_00001.error.csv.bz2
2.error_file_20210625_Events_00002.error.csv.bz2
After extraction, the CSV data looks like this:
error_file_20210625_Events_00001.error.csv:
20210625,111|112|246, apple, 20210625_error,new,,,,,D000000,true
error_file_20210625_Events_00002.error.csv:
20210625,111|222|246, pear, 20210625_error,new,,,,,C000000,false
Code to extract the CSV data from these .bz2 files:
import glob
import pandas as pd

date = "20210625"  # example value, taken from the file names above
error_folder = "c:\\"
Error_filenames = glob.glob(error_folder + "error_file_" + date + "_Events_*.error.csv.bz2")
if Error_filenames:
    # The files have no header row, so header=None keeps the first line as data
    df_error_generator = (pd.read_csv(f, header=None, compression={'method': 'bz2'}) for f in Error_filenames)
    df_error = pd.concat(df_error_generator, ignore_index=True)
    filename = "error_file_" + date + "_Events_error.csv"
    df_error.to_csv("c:\\chits\\" + filename)
    df_error_csv = pd.DataFrame(pd.read_csv("c:\\chits\\" + filename))
headers = ["filedate", "errorcode", "errorROEID", "ROEID", "type", "rawrecord", "filename"]
Now I have to map the data to the headers:
filedate errorcode errorROEID ROEID type rawrecord filename
20210625 111|112|246 apple 20210625_error new 20210625,111|112|246, apple, 20210625_error,new,,,,,D000000,true error_file_20210625_Events_00001.error.csv
20210625 111|222|246 pear 20210625_error new 20210625,111|222|246, pear, 20210625_error,new,,,,,C000000,false error_file_20210625_Events_00002.error.csv
Here,
filedate is a date from the file
errorcode is the second column
errorROEID is the third column
ROEID is the fourth column
type is the fifth column
rawrecord is the whole row
filename is the name of the file
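I don't understand how to extract the file name and the file date, or how to insert the whole row into the rawrecord column. Can anyone help me?
Solution
Here is a possible solution that uses only modules from the Python standard library: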
from operator import itemgetter
import glob
import bz2
import csv

header = ["filedate", "errorcode", "errorROEID", "ROEID", "type", "rawrecord", "filename"]
req_cols = itemgetter(0, 1, 2, 3, 4, 9, 10)

with open("output.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

    for error_filename in glob.glob("*.bz2"):
        # Open each compressed file in text mode and parse it as CSV
        with bz2.open(error_filename, "rt", encoding="utf-8") as f_error_file:
            csv_input = csv.reader(f_error_file, skipinitialspace=True)

            for orig_row in csv_input:
                row = req_cols(orig_row)
                # First five fields, then the re-joined row as rawrecord, then the file name
                csv_output.writerow([row[0], row[1], row[2], row[3], row[4], ",".join(orig_row), error_filename])
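The expected output in the question lists the file name without the .bz2 suffix, and the question also asks how to get the file date. If the date should come from the file name rather than from the first column, something like the following might work. This is only a minimal sketch: it assumes every file name follows the error_file_<YYYYMMDD>_Events_<n>.error.csv.bz2 pattern shown in the question, and `split_error_filename` and `_NAME_RE` are hypothetical names introduced here for illustration.

```python
import os
import re

# Assumed naming pattern: error_file_<YYYYMMDD>_Events_<n>.error.csv, optionally with .bz2
_NAME_RE = re.compile(r"error_file_(\d{8})_Events_\d+\.error\.csv(?:\.bz2)?$")


def split_error_filename(path):
    """Return (csv_name, filedate) for a path; raise ValueError if the name does not match."""
    name = os.path.basename(path)
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    # Drop the compression suffix so the name matches the extracted CSV file
    csv_name = name[:-len(".bz2")] if name.endswith(".bz2") else name
    return csv_name, match.group(1)
```

In the loop above, the two returned values could then replace `error_filename` and `row[0]` in the written row.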
A DataFrame can be built as follows:
import pandas as pd
from operator import itemgetter
import glob
import bz2
import csv

header = ["filedate", "errorcode", "errorROEID", "ROEID", "type", "rawrecord", "filename"]
data = []
req_cols = itemgetter(0, 1, 2, 3, 4, 9, 10)

for error_filename in glob.glob("*.bz2"):
    with bz2.open(error_filename, "rt", encoding="utf-8") as f_error_file:
        csv_input = csv.reader(f_error_file, skipinitialspace=True)

        for orig_row in csv_input:
            row = req_cols(orig_row)
            # Collect one output row per input row; rawrecord is the re-joined row
            data.append([row[0], row[1], row[2], row[3], row[4], ",".join(orig_row), error_filename])

df = pd.DataFrame(data, columns=header)
print(df)
This gives:
filedate errorcode errorROEID ROEID type rawrecord filename
0 20210625 111|112|246 apple 20210625_error new 20210625,111|112|246,apple,20210625_error,new,,,,,D000000,true error_file_20210625_Events_00001.error.csv.bz2
1 20210625 111|222|246 pear 20210625_error new 20210625,111|222|246,pear,20210625_error,new,,,,,C000000,false error_file_20210625_Events_00002.error.csv.bz2
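Note that in this output rawrecord is rebuilt with `",".join(orig_row)`, so the spaces after the commas in the original line are lost. If rawrecord has to be the line exactly as it appears in the file, one option is to read each line as text and parse it separately. The sketch below assumes one record per line (no quoted fields spanning several lines) and at least five fields per row; `load_error_file` and `HEADER` are hypothetical names, not part of the answer above.

```python
import bz2
import csv
import os

import pandas as pd

HEADER = ["filedate", "errorcode", "errorROEID", "ROEID", "type", "rawrecord", "filename"]


def load_error_file(path):
    """Read one headerless .csv.bz2 file into a DataFrame with the HEADER columns."""
    records = []
    with bz2.open(path, "rt", encoding="utf-8", newline="") as handle:
        for line in handle:
            raw = line.rstrip("\r\n")
            if not raw:
                continue  # skip blank lines
            # Parse the single line as CSV, ignoring spaces after the delimiter
            fields = next(csv.reader([raw], skipinitialspace=True))
            # First five parsed fields, then the untouched line, then the file name
            records.append(fields[:5] + [raw, os.path.basename(path)])
    return pd.DataFrame(records, columns=HEADER)
```

The per-file frames can then be combined with `pd.concat(..., ignore_index=True)` over the paths returned by `glob.glob("*.bz2")`, provided at least one file was found.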