python - How can I change my Python code to convert txt files to CSV more efficiently?
Question
I recently learned Python and started writing code that reads and cleans data. In the code below, I am trying to read roughly 200 txt files of about 200 MB each, all using a | delimiter, and merge them into a single CSV file with one specific change applied. The source files contain negative numbers where the minus sign sits at the end of the number, e.g. 221.36- and 111-. I need to convert these to -221.36 and -111.
Currently it takes about 100 minutes to process 80 million records. Since this is only the second or third piece of code I have written in Python, I am looking for your input on how to optimize it. Any best practices you can suggest before it is ready for production would be a big help.
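For reference, the trailing-minus conversion described above can be isolated as a small helper (a sketch; the name fix_trailing_minus is hypothetical, not part of the code below):

```python
def fix_trailing_minus(value: str) -> str:
    """Convert trailing-minus numbers like '221.36-' to '-221.36'."""
    if value.endswith('-'):
        # Move the minus sign from the end to the front
        return '-' + value[:-1]
    return value

print(fix_trailing_minus('221.36-'))  # -221.36
print(fix_trailing_minus('111-'))     # -111
print(fix_trailing_minus('42'))       # 42
```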
from tempfile import NamedTemporaryFile
import shutil
import csv
import glob

# List out all files that need to be used as input
list_of_input_files = glob.glob("C:/Users/datafolder/pattern*")
with open('C:/Users/datafolder/tempfile.txt', 'wb') as wfd:
    for f in list_of_input_files:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd, 1024*1024*10)
print('File Merge Complete')
# Create temporary files for processing
txt_file = "C:/Users/datafolder/tempfile.txt"
csv_file = "C:/Users/datafolder/mergedcsv.csv"

# Write CSV file after reading data from a txt file. Converts delimiter from '|' to ','
with open(txt_file, 'r', encoding='utf-8') as file_pipe:
    with open(csv_file, 'w', encoding='utf-8', newline='') as file_comma:  # newline parameter to avoid blank lines in the final file
        csv.writer(file_comma, delimiter=',').writerows(csv.reader(file_pipe, delimiter='|'))
print('CSV File Created.')
tempfile = NamedTemporaryFile(mode='w', encoding='utf-8', delete=False)
# Data Definition
fields = ['Field 1','Field 2','Field 3','Field 4','Field 5','Field 6','Field 7','Field 8','Field 9','Field 10','Field 11','Field 12','Field 13','Field 14','Field 15','Field 16','Field 17','Field 18','Field 19','Field 20']
count = 0
# Open files in read and write modes for data processing
with open(csv_file, 'r', encoding='utf-8') as csvfile, tempfile:
    reader = csv.DictReader(csvfile, fieldnames=fields)  # Using a Python dictionary to read and write data into a CSV file
    writer = csv.DictWriter(tempfile, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        if count < 1000000:
            if row['Field 10'].endswith('-'):
                row['Field 10'] = float(row['Field 10'].replace('-', '')) * (-1)  # Trims the '-' sign from the end, converts the target field to float and makes it negative
            count = count + 1
        else:
            print('1 Million records Processed')
            count = 0
        # Creating a row for final write
        row = {'Field 1': row['Field 1'], 'Field 2': row['Field 2'], 'Field 3': row['Field 3'], 'Field 4': row['Field 4'], 'Field 5': row['Field 5'], 'Field 6': row['Field 6'], 'Field 7': row['Field 7'], 'Field 8': row['Field 8'], 'Field 9': row['Field 9'], 'Field 10': row['Field 10'], 'Field 11': row['Field 11'], 'Field 12': row['Field 12'], 'Field 13': row['Field 13'], 'Field 14': row['Field 14'], 'Field 15': row['Field 15'], 'Field 16': row['Field 16'], 'Field 17': row['Field 17'], 'Field 18': row['Field 18'], 'Field 19': row['Field 19'], 'Field 20': row['Field 20']}
        writer.writerow(row)  # Write the processed row to the CSV file
print('Data write to CSV file complete')
# Renaming the newly created temp file as the final file. New file now has fully processed data.
shutil.move(tempfile.name, csv_file)
print('Renaming Complete')
Answer
What is the reason for using csv.DictReader instead of just csv.reader? Using csv.reader allows you to access row data by index instead of by keys like 'Field 10':

if row[9].endswith('-'):
    row[9] = float(row[9].replace('-', '')) * (-1)

This eliminates the need for the row-rebuilding dictionary near the end of your loop, which will speed the program up slightly: you can simply call writer.writerow(row) with the row you already have, because it is already a list.
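Concretely, the index-based loop might look like the sketch below, using hypothetical in-memory data in place of the real files (the field with the trailing minus is assumed to be the third column here, i.e. row[2]):

```python
import csv
import io

# Hypothetical pipe-delimited input standing in for the merged txt file
src = io.StringIO("a|b|221.36-\nc|d|42\n")
dst = io.StringIO()

reader = csv.reader(src, delimiter='|')
writer = csv.writer(dst, lineterminator='\n')
for row in reader:
    # Index-based access: no dict rebuild needed
    if row[2].endswith('-'):
        row[2] = '-' + row[2][:-1]
    writer.writerow(row)  # row is already a list, write it directly

print(dst.getvalue())
# a,b,-221.36
# c,d,42
```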
Using csv.reader also enables another small optimization. Currently you check count < 1000000 on every pass through the loop and increment the count variable each time. Instead, you could do something like:

row_count = sum(1 for row in reader)
if row_count >= 1000000:
    row_count = 1000000
for row in itertools.islice(reader, row_count):
    # trim logic
if row_count == 1000000:
    print('1 Million records Processed')

(Note that counting rows with sum() exhausts the reader, so the file must be reopened, or the reader recreated, before the islice pass.) This eliminates the conditional check and the counter increment, which adds up to real savings over 80 million iterations.
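One way to keep the progress message without a per-row counter is to consume the reader in fixed-size batches with itertools.islice. A minimal sketch, using hypothetical in-memory data and a batch size of 2 standing in for the 1,000,000-row batches above:

```python
import csv
import io
import itertools

reader = csv.reader(io.StringIO("1|a\n2|b\n3|c\n4|d\n5|e\n"), delimiter='|')
chunk_size = 2  # stands in for 1_000_000
processed = 0
messages = []
while True:
    # Pull the next batch of rows without counting inside the row loop
    chunk = list(itertools.islice(reader, chunk_size))
    if not chunk:
        break
    for row in chunk:
        pass  # per-row trim logic would go here
    processed += len(chunk)
    messages.append(f'{processed} records processed')

print(messages)  # ['2 records processed', '4 records processed', '5 records processed']
```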