Storing duplicate rows when comparing two dataframes in pandas

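Question

Hello everyone (I'm new to Python). Problem: I have two dataframes, df1 and df2. I want to check whether any rows are duplicated across them on the same (url, price, pourcent) and store those rows in a new dataframe. I also want to find rows where the url is the same but the price has changed, and store those in another new dataframe.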

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'], ['www.sercos.com.tn/after/', '11.000', '5'], ['www.sercos.com.tn/new/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'], ['www.sercos.com.tn/new/', '34.000', '10'], ['www.sercos.com.tn/before/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])

Tags: python, pandas, dataframe, drop-duplicates

Answer


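Here is some code that may help you get started. It creates two sample dataframes, builds a new dataframe from the rows whose url matches, and finally checks whether the rows match exactly.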

import pandas as pd

# Sample df 1
df1 = pd.DataFrame({'url': ["urlone", "urltwo", "urlthree", "urlfour"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 1, 3, 8]
                    })

# Sample df 2
df2 = pd.DataFrame({'url': ["urlone", "urlthree", "urlfive", "urlsix"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 1, 3, 8]
                    })


# For each url in df2, find the position of the same url in df1.
# pd.match is no longer available in current pandas; Index.get_indexer
# returns the same positions (it requires the urls in df1 to be unique).
match = pd.Index(df1['url']).get_indexer(df2['url'])

print(match)
# [ 0  2 -1 -1]
# The position in the array is the row position in df2
# The value is the position of the matching row in df1
# A value of -1 means no match

# df3 stores the candidate duplicates: each matched df1 row followed by its df2 row.
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once.
pairs = []
for n, i in enumerate(match):
    if i >= 0:  # negative numbers are not matches
        pairs.append(df1.iloc[[i]])
        pairs.append(df2.iloc[[n]])

df3 = pd.concat(pairs) if pairs else pd.DataFrame(columns=df1.columns)

# df3.duplicated() then tells you whether a row exactly repeats an earlier one.
df3.duplicated()
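
Note that `duplicated()` by default flags only the second and later copies of a row. A small sketch of how the exact duplicates could be pulled out into their own dataframe follows. The `df3` here is a hand-written stand-in for the one built above, and the names `later_copies`, `exact_duplicates` and `unique_rows` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df3: matched rows from df1 and df2 stacked in pairs
df3 = pd.DataFrame({'url': ['urlone', 'urlone', 'urlthree', 'urlthree'],
                    'price': [1, 1, 3, 2],
                    'percent': [0.5, 0.5, 3.0, 1.0]})

# Default behaviour: only the second and later copies of a row are marked True
later_copies = df3.duplicated()

# keep=False marks every copy, so both members of an identical pair are selected
exact_duplicates = df3[df3.duplicated(keep=False)]

# drop_duplicates() keeps one representative of each identical row
unique_rows = df3.drop_duplicates()

print(exact_duplicates)
```

Passing `subset=['url', 'price']` to `duplicated()` would restrict the comparison to those columns instead of the whole row.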

PS: It helps if you include the code in your question so that others can run it easily :)

Updated variant that uses your dataframes and a set instead of pd.match:


df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'], ['www.sercos.com.tn/after/', '11.000', '5'], ['www.sercos.com.tn/new/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'], ['www.sercos.com.tn/new/', '34.000', '10'], ['www.sercos.com.tn/before/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])


# Urls present in both dataframes, stored in a variable called match_set
match_set = set(df2['url']).intersection(df1['url'])

print(match_set)
# Set of urls that match

# df3 stores the candidate duplicates: for each shared url, the df1 row followed by the df2 row.
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once.
pairs = []
for item in sorted(match_set):  # sorted only to make the row order reproducible
    pairs.append(df1.loc[df1['url'] == item])
    pairs.append(df2.loc[df2['url'] == item])

df3 = pd.concat(pairs) if pairs else pd.DataFrame(columns=df1.columns)

# df3.duplicated() then tells you whether a row exactly repeats an earlier one.
print(df3)
print(df3.duplicated())
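
As an alternative to looping, the two things the question asks for could also be sketched with `DataFrame.merge`. This is a minimal sketch and not the method used above. The names `same_rows`, `by_url` and `price_changed` are made up for illustration, and prices are compared as the strings they are stored as in the question's data.

```python
import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'],
                    ['www.sercos.com.tn/after/', '11.000', '5'],
                    ['www.sercos.com.tn/new/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'],
                    ['www.sercos.com.tn/new/', '34.000', '10'],
                    ['www.sercos.com.tn/before/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])

# Rows that agree on url, price and pourcent in both dataframes
same_rows = df1.merge(df2, on=['url', 'price', 'pourcent'], how='inner')

# Rows that share a url; the suffixes tell the two sources apart
by_url = df1.merge(df2, on='url', how='inner', suffixes=('_df1', '_df2'))

# Of those, keep the rows whose price differs between df1 and df2
price_changed = by_url[by_url['price_df1'] != by_url['price_df2']]

print(same_rows)
print(price_changed)
```

With the question's data, `same_rows` holds the `/now/` row and `price_changed` holds the `/corps-bains/` row. The `/new/` url has the same price in both dataframes but a different pourcent, so it appears in neither result.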

