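python-3.x - Problem removing duplicate rows from CSV files in a folder using a Python script
Problem description
I am a Python beginner. I am writing a script that:
- reads all the CSV files in a folder
- removes the duplicate rows from each .csv file, reading one file at a time
- writes the result to a *_new.csv file
Code: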
import csv
import os
import pandas as pd

path = "/Users/<mylocaldir>/Documents/Data/"
file_list = os.listdir(path)
for file in file_list:
    fullpath = os.path.join(path, file)
    data = pd.read_csv(fullpath)
    newdata = data.drop_duplicates()
    newfile = fullpath.replace(".csv", "_new.csv")
    newdata.to_csv("newfile", index=True, header=True)
When I run the script, no error is shown, but the *_new.csv files are not created.
Any help resolving this would be appreciated.
Solution
I don't know pandas, but you don't need it here. You could try something like this:
import os

file_list = os.listdir()
# loop through the list
for filename in file_list:
    # don't process any non csv file
    if not filename.endswith('.csv'):
        continue
    # lines will be a temporary holding spot to check
    # for duplicates
    lines = []
    new_file = filename.replace('.csv', '_new.csv')
    # open 2 files - csv file and new csv file to write
    with open(filename, 'r') as fr, open(new_file, 'w') as fw:
        # read line from csv
        for line in fr:
            # if that line is not in temporary list called lines,
            # add it there and write to file
            # if that line is found in temporary list called lines,
            # don't do anything
            if line not in lines:
                lines.append(line)
                fw.write(line)
print('Done')
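The loop above checks each line against a list, which gets slower as the number of unique lines grows, and a second run would also pick up the *_new.csv files it wrote earlier. A possible variant is sketched below. It assumes a set is acceptable for the lookup, that files ending in _new.csv should be skipped, and that a missing newline on the last line should not make it count as a different row. The function names dedupe_lines and dedupe_folder are hypothetical, not part of the answer.

```python
import os


def dedupe_lines(src, dst):
    """Copy src to dst, keeping only the first occurrence of each line."""
    seen = set()  # set membership tests are O(1) on average
    with open(src, "r") as fr, open(dst, "w") as fw:
        for line in fr:
            # Assumption: compare without the trailing newline so a final
            # line lacking "\n" still matches earlier copies of itself.
            key = line.rstrip("\n")
            if key not in seen:
                seen.add(key)
                fw.write(key + "\n")


def dedupe_folder(folder="."):
    """Write a *_new.csv next to every CSV file in folder; return the new paths."""
    written = []
    for filename in sorted(os.listdir(folder)):
        # Skip non-CSV files and outputs left over from an earlier run.
        if not filename.endswith(".csv") or filename.endswith("_new.csv"):
            continue
        src = os.path.join(folder, filename)
        dst = os.path.join(folder, filename[:-len(".csv")] + "_new.csv")
        dedupe_lines(src, dst)
        written.append(dst)
    return written
```

Calling dedupe_folder() with no argument processes the current working directory, like the os.listdir() call in the answer. Like the answer, this compares raw text lines, so two rows that differ only in quoting or whitespace are treated as different.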
Results
Original file
cat name.csv
id,name
1,john
1,john
2,matt
1,john
New file
cat name_new.csv
id,name
1,john
2,matt
Another original file
cat pay.csv
id,pay
1,100
2,300
1,100
4,400
4,400
2,300
4,400
New file
id,pay
1,100
2,300
4,400
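As for why the pandas script in the question runs without error yet produces no *_new.csv: its last line passes the quoted string "newfile" to to_csv instead of the variable newfile, so every result is written to a single file literally named newfile in the current working directory. A minimal sketch of a corrected version follows. It assumes pandas is installed, that the index column is not wanted in the output (index=False, whereas the question used index=True), and that only .csv files should be read. The function name dedupe_csv_folder is hypothetical.

```python
import os

import pandas as pd


def dedupe_csv_folder(path):
    """Write a de-duplicated *_new.csv next to every CSV file in path."""
    written = []
    for name in sorted(os.listdir(path)):
        # Skip non-CSV files and outputs from an earlier run.
        if not name.endswith(".csv") or name.endswith("_new.csv"):
            continue
        fullpath = os.path.join(path, name)
        data = pd.read_csv(fullpath)
        newdata = data.drop_duplicates()
        # Build the output name from the stem so only the extension changes.
        stem, ext = os.path.splitext(fullpath)
        newfile = stem + "_new" + ext
        # Pass the variable newfile, not the string "newfile".
        # Assumption: index=False, so no extra index column is written.
        newdata.to_csv(newfile, index=False)
        written.append(newfile)
    return written
```

With the folder from the question, the call would be dedupe_csv_folder("/Users/<mylocaldir>/Documents/Data/"). Unlike the line-based answer, pandas compares parsed values, so rows that differ only in formatting may be treated as duplicates.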