Finding the number of duplicated sentences in a DataFrame with pandas

Problem description

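I am trying to find out how many duplicated sentences my DataFrame contains, that is, any sentence that appears more than once as an exact match. I am using DataFrame.duplicated, but it ignores the first occurrence of each sentence. Instead of printing every repeated row, I want to print each duplicated sentence once, together with the number of times it occurs.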

The code I am trying is:

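import pandas as pd

# fileinput (the CSV path) is defined elsewhere; read only the header line.
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
# A first line without spaces is treated as a header row and skipped.
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
data = wdata[wdata.duplicated()]
print(data)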



#dataframe example
#hi how are you
#hello sam how are you doing
#hello sam how are you doing
#helll Alex how are you doing
#hello sam how are you doing
#let us go eat
#where is the dog
#let us go eat 


I would like my output to look like this:

#hello sam how are you doing   3
#let us go eat                 2

Using the duplicated function, I get this output:

#hello sam how are you doing
#hello sam how are you doing
#let us go eat
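For context, the missing first occurrence is consistent with the default `keep='first'` argument of `DataFrame.duplicated`, which flags only the second and later copies of a row. The sketch below uses made-up toy data (not the asker's file) to contrast the default with `keep=False`, which flags every row that has a copy elsewhere.

```python
import pandas as pd

# Toy data, assumed purely for illustration.
wdata = pd.DataFrame({"sentences": ["a b", "c d", "c d", "c d"]})

# Default keep='first': the first occurrence is not flagged as a duplicate.
later_only = wdata[wdata.duplicated()]

# keep=False: every row belonging to a repeated group is flagged.
all_repeats = wdata[wdata.duplicated(keep=False)]

print(len(later_only), len(all_repeats))
```

Neither variant returns counts by itself, which is why a grouping or counting step is still needed.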

This is the output I get with the second answer:

wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)

data=wdata.groupby(['sentences']).size().reset_index(name='counts')


#                      sentences  counts
#0  hello Alex how are you doing       1
#1   hello sam how are you doing       3
#2                hi how are you       1
#3                 let us go eat       1
#4                let us go eat        1
#5              where is the dog       1

I would like my output to look like this:

#hello sam how are you doing   3
#let us go eat                 2

Tags: python, pandas, dataframe

Solution


Because some sentences carry trailing whitespace, the solution is to remove it with Series.str.strip and then count the groups with GroupBy.size:

data=wdata.groupby(wdata['sentences'].str.strip()).size().reset_index(name='counts')

Then filter with boolean indexing:

data = data[data['counts'].gt(1)]
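As a self-contained illustration, the two steps can be run end to end on the example sentences from the question. The DataFrame is built in memory here instead of being read from the asker's file, and the trailing space on the last "let us go eat " is kept on purpose, since that is what splits the group when the values are not stripped.

```python
import pandas as pd

# Example sentences from the question, built in memory (an assumption;
# the original reads them from a file). Note the trailing space in the last one.
wdata = pd.DataFrame({"sentences": [
    "hi how are you",
    "hello sam how are you doing",
    "hello sam how are you doing",
    "helll Alex how are you doing",
    "hello sam how are you doing",
    "let us go eat",
    "where is the dog",
    "let us go eat ",
]})

# Group by the stripped text so "let us go eat" and "let us go eat " merge.
data = (
    wdata.groupby(wdata["sentences"].str.strip())
    .size()
    .reset_index(name="counts")
)

# Keep only sentences that occur more than once.
data = data[data["counts"].gt(1)]
print(data)
```

The result keeps the group keys in sorted order, so "hello sam how are you doing" (3) comes before "let us go eat" (2).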

Another idea is to call Series.value_counts on the stripped Series, filter the result, and finally convert it to a two-column DataFrame:

s = wdata['sentences'].str.strip().value_counts()
data = s[s.gt(1)].rename_axis('sentences').reset_index(name='counts')
