Problem removing duplicate rows from CSV files in a folder using a Python script

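Problem description

I am a beginner in Python. I am writing a script that:

  1. Reads all CSV files in a folder
  2. Removes duplicate rows from each .csv file, reading one file at a time
  3. Writes the result to a *_new.csv file

Code: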

import csv
import os
import pandas as pd

path = "/Users/<mylocaldir>/Documents/Data/"
file_list = os.listdir(path)
for file in file_list:
    fullpath = os.path.join(path, file)
    data = pd.read_csv(fullpath)
    newdata = data.drop_duplicates()
    newfile = fullpath.replace(".csv","_new.csv")
    newdata.to_csv ("newfile", index=True, header=True)

When I run the script, no error is shown. However, the *_new.csv files are not created.

Can anyone help me solve this?

Tags: python-3.x

Solution


I don't know pandas, but you don't need it for this. You could try something like this:

import os

file_list = os.listdir()

# loop through the list
for filename in file_list:

    # don't process any non csv file
    if not filename.endswith('.csv'):
        continue

    # lines will be a temporary holding spot to check 
    # for duplicates
    lines = []
    new_file = filename.replace('.csv', '_new.csv')

    # open 2 files - csv file and new csv file to write
    with open(filename, 'r') as fr, open(new_file, 'w') as fw:

        # read line from csv
        for line in fr:

            # if that line is not in temporary list called lines,
            #   add it there and write to file
            # if that line is found in temporary list called lines,
            #   don't do anything
            if line not in lines:
                lines.append(line)
                fw.write(line)

print('Done')
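
The loop above checks each line against a list, which gets slower as a file grows. It also picks up *_new.csv files left by an earlier run. Below is a minimal sketch of the same idea using a set, assuming line-by-line comparison is still what is wanted. The helper names dedupe_csv_lines and dedupe_folder are hypothetical, not part of the original answer.

```python
import os


def dedupe_csv_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each line."""
    seen = set()  # lines already written; set membership checks stay fast
    with open(src_path, "r", newline="") as fr, \
            open(dst_path, "w", newline="") as fw:
        for line in fr:
            # Compare without the line ending so that a final line lacking
            # a trailing newline is still recognised as a duplicate.
            key = line.rstrip("\r\n")
            if key not in seen:
                seen.add(key)
                fw.write(key + "\n")


def dedupe_folder(folder="."):
    """Write a *_new.csv next to every .csv file found in folder."""
    for filename in os.listdir(folder):
        # Skip non-CSV files and output files from an earlier run.
        if not filename.endswith(".csv") or filename.endswith("_new.csv"):
            continue
        src = os.path.join(folder, filename)
        dst = os.path.join(folder, filename[:-len(".csv")] + "_new.csv")
        dedupe_csv_lines(src, dst)
```

Calling dedupe_folder() with no argument processes the current working directory, as the original loop does. Like the original, this treats the file as plain text, so a quoted field that spans several lines would not be handled as one record.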

Results

Original file

cat name.csv
id,name
1,john
1,john
2,matt
1,john

New file

cat name_new.csv 
id,name
1,john
2,matt

Another original file

cat pay.csv
id,pay
1,100
2,300
1,100
4,400
4,400
2,300
4,400

Its new file

cat pay_new.csv
id,pay
1,100
2,300
4,400
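
As for the script in the question, it produced no *_new.csv files because to_csv was given the string literal "newfile" instead of the variable newfile. Every iteration therefore wrote to a single file named newfile in the current working directory. Below is a minimal sketch of the pandas version with that fixed. The function name dedupe_folder_with_pandas is hypothetical, and index=False is an assumption that the DataFrame row index is not wanted as an extra column in the output.

```python
import os

import pandas as pd


def dedupe_folder_with_pandas(path):
    """Write a de-duplicated *_new.csv for each .csv file in path."""
    for file in os.listdir(path):
        # Skip non-CSV files and output files from an earlier run.
        if not file.endswith(".csv") or file.endswith("_new.csv"):
            continue
        fullpath = os.path.join(path, file)
        data = pd.read_csv(fullpath)
        newdata = data.drop_duplicates()
        newfile = fullpath[:-len(".csv")] + "_new.csv"
        # Pass the variable, not the string "newfile".
        # index=False (an assumption) leaves the row index out of the file.
        newdata.to_csv(newfile, index=False, header=True)
```

Passing the folder from the question as path should then create one *_new.csv next to each input file.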
