Drop duplicate values in a pandas column, but ignore one value

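Problem description

I'm sure there is an elegant solution for this, but I can't find it. In a pandas DataFrame, how do I drop all duplicate values in a column while ignoring one particular value?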

repost_of_post_id                                              title
0        7139471603    Man with an RV needs a place to park for a week   
1        6688293563                                     Land for lease   
2              None                  2B/1.5B, Dishwasher, In Lancaster   
3              None  Looking For Convenience? Check Out Cordova Par...   
4              None  2/bd 2/ba, Three Sparkling Swimming Pools, Sit...   
5              None  1 bedroom w/Closet is bathrooms in Select Unit...   
6              None  Controlled Access/Gated, Availability 24 Hours...   
7              None         Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent   
8        7143099582                        Need Help Getting Approved?   
9              None            *MOVE IN READY APT* REQUEST TOUR TODAY!   

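What I want is to keep every None value in repost_of_post_id but drop any duplicates of the numeric values, for example if 7139471603 appears more than once in the DataFrame.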


[Update] I got the desired result with the script below, but I would like to do this in a one-liner if possible.

# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned

ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")

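# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement (assumes `import pandas as pd`)
ca_housing_unique = pd.concat([ca_housing_repost_none, ca_housing_repost_not_none_unique])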

Tags: python, python-3.x, pandas, numpy, dataframe

Solution


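You can drop the None values, detect the duplicates among what remains, and then filter those rows out of the original DataFrame by index.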

In [1]: import pandas as pd 
   ...: from string import ascii_lowercase 
   ...:  
   ...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5] 
   ...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])}) 
   ...: print(df) 
   ...:  
   ...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])                                 
     id title
0   1.0     a
1   2.0     b
2   3.0     c
3   NaN     d
4   NaN     e
5   NaN     f
6   2.0     g
7   3.0     h
8   NaN     i
9   NaN     j
10  4.0     k
11  5.0     l

     id title
0   1.0     a
1   2.0     b
2   3.0     c
3   NaN     d
4   NaN     e
5   NaN     f
8   NaN     i
9   NaN     j
10  4.0     k
11  5.0     l
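
Since the question asks for a one-liner, the same result can probably also be had with a single boolean mask: keep a row if it holds the ignored value, or if its value has not appeared earlier in the column. The sketch below wraps that mask in a helper; the name drop_duplicates_except and its ignore parameter are made up for illustration and are not part of pandas. It assumes the ignored value is either a real missing value (None/NaN) or a literal string such as "None", as in the question's script.

```python
import pandas as pd


def drop_duplicates_except(df, column, ignore=None):
    """Drop rows whose `column` value repeats an earlier one,
    but keep every row that holds the ignored value.

    Hypothetical helper, not a pandas API. `ignore=None` means
    "keep all missing values (None/NaN)"; pass e.g. "None" when the
    placeholder is stored as a string.
    """
    values = df[column]
    # Rows that must always survive, duplicated or not.
    keep_always = values.isna() if ignore is None else values.eq(ignore)
    # duplicated() flags every occurrence after the first one.
    return df[keep_always | ~values.duplicated()]


# Same toy data as in the answer above (missing ids are real NaN).
ids = [1, 2, 3, None, None, None, 2, 3, None, None, 4, 5]
df = pd.DataFrame({"id": ids, "title": list("abcdefghijkl")})
result = drop_duplicates_except(df, "id")
print(result)

# Made-up sample where "None" is stored as a string, as the question's
# script implies. The equivalent one-liner for the question would be:
# ca_housing[ca_housing["repost_of_post_id"].eq("None")
#            | ~ca_housing["repost_of_post_id"].duplicated()]
posts = pd.DataFrame({
    "repost_of_post_id": ["7139471603", "None", "None", "7139471603", "6688293563"],
    "title": ["a", "b", "c", "d", "e"],
})
posts_unique = drop_duplicates_except(posts, "repost_of_post_id", ignore="None")
print(posts_unique)
```

One difference in design: the mask is aligned by position, so it does not depend on the index labels being unique, whereas the index.isin approach above filters by label and would drop every row sharing a label with a duplicate.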
