Why does pandas read duplicates in a .csv file and then rename them?

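Question

I'm using Python. I read in a file and want to remove the duplicates of identical questions, but pandas keeps reading each duplicate in with ".1" appended to its name.

For example: there are two question1 columns, and they are read as question1 and question1.1.

So when I use .drop_duplicates(), it does nothing. What is wrong here?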

import pandas as pd

file = 'survey.csv'
responses = pd.read_csv(file, header=1)  # the second line of the file holds the column names
responses.head()
responses.drop_duplicates()

Here is a sample of the .csv file:

,,,X,,,,,,,,,,,,,,,,
Timestamp,Email Address,,"Know about basic linear algebra and matrices operations (multiplication, add, transpose)?",Know how to apply differentiation and the chain rule?,Know how to apply differentiation and the chain rule?,"Know what is a probability distribution and density function, and how to sample it?","Know what is a probability distribution and density function, and how to sample it?",Know the difference between classification and regression?,Know the difference between training and testing data?,Know the difference between training and testing data?,Know what is a training loop and what is an epoch?,Know what is a batch?,Know what is regularization?,Know what is overfitting and underfitting?,Know what is a feature vector?,,,,
,,,,,,,,,,,,,,,,,,,
10/14/2021 17:15:05,y.sedki@gmail.com,,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,
10/14/2021 17:15:39,k.abdulaal@hotmail.com,,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,

But the output after running the code above is:

Know how to apply differentiation and the chain rule?   Know how to apply differentiation and the chain rule?.1

Tags: python, pandas, csv, jupyter-notebook, duplicates

Answer


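I think you should consider explicitly naming the columns you know are duplicates and handling those specifically. I don't know Pandas, but I imagine you can select columns within a row, perhaps something like the following to drop the fourth column (if that is the duplicate):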

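# Positional selection with .iloc; this skips the column at position 3 (the fourth column)
row1 = responses.iloc[1]
values_I_care_about = pd.concat([row1.iloc[0:3], row1.iloc[4:]])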
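For what it's worth, here is a minimal pandas sketch of the same idea applied to whole columns rather than a single row. It relies on two things: `read_csv` renames a repeated header by appending `.1`, `.2`, and so on, and `drop_duplicates()` compares rows, not columns, which is why it appears to do nothing here. The tiny inline CSV and the names `csv_text`, `base_names`, and `deduped` are made up for illustration, and the regex assumes no genuine column name ends in a dot followed by digits.

```python
import io

import pandas as pd

# Hypothetical miniature survey with one repeated header ("q2")
csv_text = "q1,q2,q2,q3\nYes,No,No,Yes\nNo,Yes,Yes,No\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas has renamed the second "q2" so that column labels stay unique
print(list(df.columns))

# Strip the numeric suffix pandas added, then keep the first column per base name
base_names = df.columns.str.replace(r"\.\d+$", "", regex=True)
deduped = df.loc[:, ~base_names.duplicated()]
print(list(deduped.columns))
```

Keeping the first occurrence is a choice, not a requirement; passing `keep="last"` to `duplicated()` would keep the last one instead.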

You can also use Python's DictReader class from the csv module to quickly drop data by column:

main.py

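import csv
import sys

with open('sample.csv', 'r', newline='') as f:
    # DictReader keys each row by header name, so repeated names collapse into one key
    reader = csv.DictReader(f)
    row = next(reader)  # first data row; its keys are the de-duplicated header names

# Only the header line is written here, which is enough to inspect the column names
writer = csv.DictWriter(sys.stdout, fieldnames=row.keys())
writer.writeheader()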

I ran it on this sample data (your header copied and unwrapped, plus a dummy row with the single value 1):

Timestamp,Email Address,,"Know about basic linear algebra and matrices operations (multiplication, add, transpose)?",Know how to apply differentiation and the chain rule?,Know how to apply differentiation and the chain rule?,"Know what is a probability distribution and density function, and how to sample it?","Know what is a probability distribution and density function, and how to sample it?",Know the difference between classification and regression?,Know the difference between training and testing data?,Know the difference between training and testing data?,Know what is a training loop and what is an epoch?,Know what is a batch?,Know what is regularization?,Know what is overfitting and underfitting?,Know what is a feature vector?
1

I also used my favorite CSV command-line tool, GoCSV, to check the headers:

% python3 main.py | gocsv headers 
1: Timestamp
2: Email Address
3: 
4: Know about basic linear algebra and matrices operations (multiplication, add, transpose)?
5: Know how to apply differentiation and the chain rule?
6: Know what is a probability distribution and density function, and how to sample it?
7: Know the difference between classification and regression?
8: Know the difference between training and testing data?
9: Know what is a training loop and what is an epoch?
10: Know what is a batch?
11: Know what is regularization?
12: Know what is overfitting and underfitting?
13: Know what is a feature vector?

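Because DictReader reads the column/header names as dict keys, and a dict cannot have duplicate keys, there are no duplicate columns/headers. But you have no control over which of the duplicates gets dropped.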
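If which duplicate survives matters, one option is to drop down to plain `csv.reader` and pick the columns by position. The sketch below is only an illustration under that assumption: the helper `drop_duplicate_columns` and the four-column sample are hypothetical, and it keeps the first column for each header name. It also shows that `DictReader` ends up with the value from the last duplicate, since later keys overwrite earlier ones.

```python
import csv
import io


def drop_duplicate_columns(rows):
    """Yield rows keeping only the first column for each distinct header name."""
    rows = iter(rows)
    header = next(rows)
    seen = set()
    keep = []  # positions of the columns to keep
    for i, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    yield [header[i] for i in keep]
    for row in rows:
        # Pad short rows with empty strings so the columns stay aligned
        yield [row[i] if i < len(row) else "" for i in keep]


# Hypothetical sample: column "b" appears twice with different values
sample = "a,b,b,c\n1,2,3,4\n"

first_kept = list(drop_duplicate_columns(csv.reader(io.StringIO(sample))))
print(first_kept)

# For comparison, DictReader keeps the value from the last duplicate column
dict_row = next(csv.DictReader(io.StringIO(sample)))
print(dict_row["b"])
```

To keep the last occurrence instead, the same loop could run over the header in reverse and then sort `keep`.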

