How to find the total number of duplicate rows in a file with Python

Problem description

How can I find the total number of duplicate rows in a file, and how do I write the Python code for it?

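import csv

already_seen = set()

# open() replaces the Python 2-only file(); the raw string keeps the
# backslashes in the Windows path literal.
with open(r'T:\DataDump\Book1.csv', newline='') as f:
    csv_data = csv.reader(f)
    next(csv_data)  # skip the header row

    for row in csv_data:
        address = row[6]
        if address in already_seen:
            print('{} is a duplicate Address'.format(address))
        else:
            print('{} is a unique Address'.format(address))
            already_seen.add(address)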

Tags: python

Solution


Try using pandas instead of the csv module:

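import pandas as pd

csv_data = pd.read_csv('T:/DataDump/Book1.csv')

shape_original = csv_data.shape
print(f"Number of rows: {shape_original[0]}")

# Drop rows that exactly repeat an earlier row, keeping the first occurrence
csv_data_no_duplicates = csv_data.drop_duplicates(keep="first")

shape_new = csv_data_no_duplicates.shape
print(f"Number of rows after dropping duplicates: {shape_new[0]}")

# The difference between the two row counts is the number of duplicate rows
number_duplicates = shape_original[0] - shape_new[0]
print(f"Number of duplicate rows: {number_duplicates}")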
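Note that drop_duplicates() as used here compares entire rows, whereas the code in the question only compares the value at index 6 (the address). If only that column should decide what counts as a duplicate, something along the following lines may be closer to the original intent. This is a minimal sketch on made-up data: the column name "Address" and the sample values are assumptions, since the layout of Book1.csv is not shown.

```python
import pandas as pd

# Hypothetical sample data; "Address" stands in for the column at index 6.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Address": ["1 Main St", "2 Oak Ave", "1 Main St", "1 Main St"],
})

# duplicated() marks every row whose Address already appeared in an earlier row;
# with keep="first" the first occurrence itself is not marked.
is_duplicate = df.duplicated(subset=["Address"], keep="first")

# Summing a boolean Series counts its True values.
number_duplicates = int(is_duplicate.sum())
print(f"Number of duplicate addresses: {number_duplicates}")
```

For a real file, the same call should work on the DataFrame returned by pd.read_csv, with subset set to whatever the address column is actually named in the header.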

I put together this example to check that it works:

thisdict = {
  "brand": ["Ford","Renault","Ford"],
  "model": ["Mustang","Laguna","Mustang"],
  "year": ["1964","1978","1964"]
}

data = pd.DataFrame.from_dict(thisdict)

data_no_duplicates = data.drop_duplicates(keep="first")

print(data_no_duplicates.head())
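The example only prints the de-duplicated frame, so the count itself is not shown. A short sketch of how the total could be read off the same sample data follows; it rebuilds the frame so that it runs on its own, and the variable number_duplicates is simply reused from the approach described in this answer.

```python
import pandas as pd

# Same sample data as in the example: the third row repeats the first.
data = pd.DataFrame({
    "brand": ["Ford", "Renault", "Ford"],
    "model": ["Mustang", "Laguna", "Mustang"],
    "year": ["1964", "1978", "1964"],
})

data_no_duplicates = data.drop_duplicates(keep="first")

# Rows removed by drop_duplicates are exactly the duplicate rows.
number_duplicates = len(data) - len(data_no_duplicates)
print(f"Number of duplicate rows: {number_duplicates}")
```

Here only the second Ford Mustang row is a repeat, so the count should come out as 1.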
