python - Storing duplicate rows when comparing two DataFrames in pandas
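Question
Hi everyone (I'm new to Python). I have two DataFrames, df1 and df2. I want to check whether they contain duplicates based on identical (url, price, pourcent) values and store those rows in a new DataFrame. I also want to check for rows where the url is duplicated but the price has changed, and store those in another new DataFrame.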
import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'],
                    ['www.sercos.com.tn/after/', '11.000', '5'],
                    ['www.sercos.com.tn/new/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])
df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'],
                    ['www.sercos.com.tn/new/', '34.000', '10'],
                    ['www.sercos.com.tn/before/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])
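Answer
Here is some code that may help you get started. It creates two sample DataFrames, finds which urls match between them, collects the matching rows in a new DataFrame, and finally checks whether those rows are exactly identical.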
import pandas as pd

#Sample df 1
df1 = pd.DataFrame({'url': ["urlone", "urltwo", "urlthree", "urlfour"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 1, 3, 8]
                    })
#Sample df 2
df2 = pd.DataFrame({'url': ["urlone", "urlthree", "urlfive", "urlsix"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 1, 3, 8]
                    })
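##This tells you all of the matches between the two url columns and stores them in a variable called match
##(pd.match was removed in pandas 1.0; Index.get_indexer gives the same positions and requires df1['url'] to be unique)
match = pd.Index(df1['url']).get_indexer(df2['url'])
print(match)
# [ 0  2 -1 -1]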
##The index tells you where the matches are in df2
##The number tells you where the corresponding match is in df1
##A value of -1 means no match
##You can copy both over to df3
##df3 for storing duplicated
#Iterate through match and collect the matching rows
#(DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first)
rows = []
for n, i in enumerate(match):
    if i >= 0:  # negative numbers are not matches
        rows.append(df1.iloc[i])
        rows.append(df2.iloc[n])
df3 = pd.DataFrame(rows, columns=df1.columns)
#df3.duplicated() then tells you, for each row, whether it exactly repeats an earlier row.
print(df3.duplicated())
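To go from the boolean result of duplicated() to the two DataFrames the question asks for, one option is the keep=False argument, which flags every copy of a repeated row instead of only the later ones. The following is a minimal sketch on the sample data above; the names shared, mask, exact_duplicates, and changed are illustrative and not part of the original answer, and it assumes each url appears at most once per frame.

```python
import pandas as pd

df1 = pd.DataFrame({'url': ["urlone", "urltwo", "urlthree", "urlfour"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 1, 3, 8]})
df2 = pd.DataFrame({'url': ["urlone", "urlthree", "urlfive", "urlsix"],
                    'price': [1, 2, 3, 4],
                    'percent': [0.5, 1, 3, 8]})

# Stack only the rows whose url appears in both frames.
shared = set(df1['url']) & set(df2['url'])
df3 = pd.concat([df1[df1['url'].isin(shared)],
                 df2[df2['url'].isin(shared)]], ignore_index=True)

# keep=False marks every copy of a repeated row, not just the later ones.
mask = df3.duplicated(keep=False)

# Rows identical in both frames, reduced to one copy each.
exact_duplicates = df3[mask].drop_duplicates()

# Rows whose url is shared but whose other values differ.
changed = df3[~mask]

print(exact_duplicates)
print(changed)
```

On this sample, urlone ends up in exact_duplicates and both urlthree rows end up in changed.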
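PS: It helps if you include the code in the text of your question so that others can run it easily :)
Updated variant using your DataFrames and a set instead of pd.match: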
df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'], ['www.sercos.com.tn/after/', '11.000', '5'], ['www.sercos.com.tn/new/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
columns=['url', 'price', 'pourcent'])
df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'], ['www.sercos.com.tn/new/', '34.000', '10'], ['www.sercos.com.tn/before/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
columns=['url', 'price', 'pourcent'])
##This tells you all of the matches between the two columns and stores it in a variable called match_set
match_set = set(df2['url']).intersection(df1['url'])
print(match_set)
#List of urls that match
##df3 for storing duplicated
#Collect the rows for each matching url from both frames
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
for item in match_set:
    frames.append(df1.loc[df1['url'] == item])
    frames.append(df2.loc[df2['url'] == item])
df3 = pd.concat(frames) if frames else pd.DataFrame(columns=df1.columns)
#df3.duplicated() then tells you, for each row, whether it exactly repeats an earlier row.
print(df3)
print(df3.duplicated())
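The question also asks for urls that appear in both frames with a different price, which duplicated() alone does not separate out. A different technique from the one in the answer, DataFrame.merge, can produce both results directly. This is a hedged sketch, not the original answerer's method: the names exact_duplicates, same_url, and price_changed are illustrative, and because the prices are stored as strings the comparison is a plain string comparison (convert with pd.to_numeric first if '34.000' and '34.0' should count as equal).

```python
import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'],
                    ['www.sercos.com.tn/after/', '11.000', '5'],
                    ['www.sercos.com.tn/new/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])
df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'],
                    ['www.sercos.com.tn/new/', '34.000', '10'],
                    ['www.sercos.com.tn/before/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])

# Rows identical in all three columns: inner merge on every column.
exact_duplicates = df1.merge(df2, on=['url', 'price', 'pourcent'], how='inner')

# Same url in both frames, with the values from each frame side by side.
same_url = df1.merge(df2, on='url', how='inner', suffixes=('_df1', '_df2'))

# Keep only the urls whose price differs between df1 and df2.
price_changed = same_url[same_url['price_df1'] != same_url['price_df2']]

print(exact_duplicates)
print(price_changed)
```

With the question's data, exact_duplicates holds the www.sercos.com.tn/now/ row and price_changed holds the www.sercos.com.tn/corps-bains/ row. The www.sercos.com.tn/new/ url is shared but falls into neither, since its price is unchanged and only pourcent differs.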