python - how do I remove commas within columns from data retrieved from a CSV file
Problem description
I have several CSV files that I need to process. The fields in each column may contain commas, and strings may also be enclosed in double quotes. I managed to come up with something that works, but the CSV files I work with are sometimes 200 to 400 MB, and my current code takes 4 minutes to process an 11 MB file.
What can I do to make it run faster, or to process all of the data at once instead of going through it field by field?
import csv

def rem_lrspaces(data):
    data = data.lstrip()
    data = data.rstrip()
    data = data.strip()
    return data

def strip_bs(data):
    data = data.replace(",", " ")
    return data

def rem_comma(tmp1,tmp2):
    with open(tmp2, "w") as f:
        f.write("")
        f.close()
    file=open(tmp1, "r")
    reader = csv.reader(file,quotechar='"', delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True)
    for line in reader:
        for field in line:
            if "," in field :
                field=rem_lrspaces(strip_bs(field))
            with open(tmp2, "a") as myfile:
                myfile.write(field+",")
        with open(tmp2, "a") as myfile:
            myfile.write("\n")

pdfsource=r"C:\automation\cutoff\test2"
csvsource=pdfsource
ofn = "T3296N17"
file_in = r"C:\automation\cutoff\test2"+chr(92)+ofn+".CSV"
file_out = r"C:\automation\cutoff\test2"+chr(92)+ofn+".TSV"
rem_comma(file_in,file_out)
Solution
Some low-hanging fruit:
- strip_bs is too simple to justify the overhead of calling the function.
- rem_lrspaces is redundantly stripping whitespace; one call to data.strip() is all you need, in which case it too is too simple to justify a separate function.
- You are also spending a lot of time repeatedly opening the output file.
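As a rough illustration of these three points, here is a minimal sketch that inlines the replace and strip calls and opens the output file only once. The write_rows helper, the sample rows, and the temporary path are all made up for the example, and unlike the question's code it joins fields with commas rather than writing a trailing comma after each one.

```python
import os
import tempfile

def write_rows(path, rows):
    # Hypothetical helper: open the output once and keep it open for the
    # whole loop, instead of reopening it in append mode for every field.
    with open(path, "w") as out:
        for row in rows:
            # One replace() and one strip() per field, inlined.
            out.write(",".join(f.replace(",", " ").strip() for f in row) + "\n")

# Made-up rows purely for illustration.
rows = [["a,b", "  c  "], ["d", "e,f "]]
path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_rows(path, rows)

with open(path) as f:
    written = f.read()
```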
Also, it's better to pass already-open file handles to rem_comma, as it makes testing easier by allowing in-memory file-like objects to be passed as arguments.
The code below simply builds a new list of fields from each line, then uses csv.writer to write the new fields back to the output file.
import csv

def rem_comma(f_in, f_out):
    reader = csv.reader(f_in, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    writer = csv.writer(f_out)
    for line in reader:
        # Replace embedded commas with spaces, then trim surrounding whitespace.
        new_line = [field.replace(",", " ").strip() for field in line]
        writer.writerow(new_line)

ofn = "T3296N17"
file_in = r"C:\automation\cutoff\test2"+chr(92)+ofn+".CSV"
file_out = r"C:\automation\cutoff\test2"+chr(92)+ofn+".TSV"
# Open the output in write mode; newline="" lets the csv module control line endings.
with open(file_in, newline="") as f1, open(file_out, "w", newline="") as f2:
    rem_comma(f1, f2)
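Because rem_comma takes file-like objects, it can be exercised without touching the disk. The following is a minimal sketch using io.StringIO; the sample rows are invented, and the writer is given lineterminator="\n" (an assumption added here, not part of the code above) so the result is easy to compare.

```python
import csv
import io

def rem_comma(f_in, f_out):
    # Same shape as the function above: it only needs file-like objects.
    reader = csv.reader(f_in, quotechar='"', delimiter=',', skipinitialspace=True)
    # lineterminator="\n" is an assumption for this sketch; the csv default is "\r\n".
    writer = csv.writer(f_out, lineterminator="\n")
    for line in reader:
        writer.writerow([field.replace(",", " ").strip() for field in line])

# Invented sample data: the second field of each row contains a comma.
source = io.StringIO('1, "Smith,John", NY\n2, "Doe,Jane ", CA\n')
result = io.StringIO()
rem_comma(source, result)
output = result.getvalue()
print(output)
```

Here the commas inside the quoted fields should come out as spaces, with the surrounding whitespace trimmed.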