How to change the value of a cell in one column when two columns contain duplicate cells

Problem description

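I have a pandas DataFrame made up of address-field columns. My problem is that, in some rows, two of the columns hold the same cell value. Does anyone know how I can conditionally change the value in one column when a duplicate is found across the two columns? Ideally, I would like to keep one value and set the other to np.nan.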

Here is a test case:

import pandas as pd

test = pd.read_json('{"housename":{"16":null,"17":null,"18":null},"name":{"16":"Shoecare","17":"33","18":"33A"},"house_number":{"16":"32","17":"33","18":"33A"},"street":{"16":"Carfax","17":"Carfax","18":"Carfax"},"city":{"16":"Horsham","17":"Horsham","18":"Horsham"},"postcode":{"16":"RH12 1EE","17":"RH12 1EE","18":"RH12 1EE"}}')

    city        house_number    housename   name        postcode    street
16  Horsham     32              NaN         Shoecare    RH12 1EE    Carfax
17  Horsham     33              NaN         33          RH12 1EE    Carfax
18  Horsham     33A             NaN         33A         RH12 1EE    Carfax

In the test case, I experimented with test.duplicated(subset=['house_number', 'name']), but it does not identify the duplicate values in the house_number and name columns.
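For context, a minimal sketch of why that call comes back empty, assuming a cut-down frame that holds only the two relevant columns from the test case: duplicated() checks whether a row's combination of subset values repeats an earlier row, whereas the comparison needed here is between two cells of the same row.

```python
import pandas as pd

# Two columns copied from the test case above (reduced frame, for illustration only).
test = pd.DataFrame(
    {"name": ["Shoecare", "33", "33A"], "house_number": ["32", "33", "33A"]},
    index=[16, 17, 18],
)

# duplicated() asks: does this (house_number, name) pair repeat an earlier row?
row_dupes = test.duplicated(subset=["house_number", "name"])

# Element-wise equality asks: do the two cells in this row match?
same_in_row = test["house_number"] == test["name"]

print(row_dupes.tolist())
print(same_in_row.tolist())
```

No pair of values repeats across rows, so the first check flags nothing, while the second marks rows 17 and 18.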

Does anyone have suggestions for how to first identify the duplicate cells across the two columns and then set one of the values to np.nan?

Desired output:

    housename   name      house_number  street  city     postcode
16  NaN         Shoecare  32            Carfax  Horsham  RH12 1EE
17  NaN         NaN       33            Carfax  Horsham  RH12 1EE
18  NaN         NaN       33A           Carfax  Horsham  RH12 1EE

Tags: python, pandas

Solution


If the two columns are house_number and name, you can do this:

import numpy as np  # needed for np.where and np.nan
test['name'] = np.where(test['house_number'] == test['name'], np.nan, test['name'])

Output:

       city house_number  housename      name  postcode  street
16  Horsham           32        NaN  Shoecare  RH12 1EE  Carfax
17  Horsham           33        NaN       NaN  RH12 1EE  Carfax
18  Horsham          33A        NaN       NaN  RH12 1EE  Carfax
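
As an alternative sketch (not part of the original answer), the same result can be reached with Series.mask, which replaces values wherever a condition is True and stays within pandas. The frame below is a reduced copy of the test case, built by hand for illustration.

```python
import pandas as pd

# Reduced copy of the test case: only the two columns being compared.
test = pd.DataFrame(
    {"name": ["Shoecare", "33", "33A"], "house_number": ["32", "33", "33A"]},
    index=[16, 17, 18],
)

# True where both cells in a row hold the same value.
same = test["house_number"] == test["name"]

# mask() swaps in NaN (its default replacement) wherever the condition is True.
test["name"] = test["name"].mask(same)

print(test)
```

Both approaches leave house_number untouched and blank out name only in the rows where the two cells matched.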
