首页 > 解决方案 > 使用 loc 标准化 pandas 中的列

问题描述

我有一列似乎有 4 种不同格式的数据,我创建了一个带有数组的小片段来说明我正在使用的内容

ex_array = np.array(['100X172',
 '78X120',
 '1 ac',
 '76,666',
 '85X175',
 '19,928',
 '14810',
 '3 ac',
 '90X181',
 '38X150',
 '19040',
 '8265',
 '100X125',
 '6000',
 '8,750',
 '.448 ac'])

ex_df = pd.DataFrame(data=ex_array, columns=['ex_col'])

这将按预期输出以下内容:

    ex_col
0   100X172
1   78X120
2   1 ac
3   76,666
4   85X175
5   19,928
6   14810
7   3 ac
8   90X181
9   38X150
10  19040
11  8265
12  100X125
13  6000
14  8,750
15  .448 ac

目标是标准化列,其中所有内容都以英亩为单位,所需的输出如下

ex_df['acreage'] =
    acreage
0   .394858
1   .214876
2   1
3   1.76
4   .341483
5   .457484
6   .339991
7   3
8   .373967
9   .130854
10  .437098
11  .189738
12  .284665
13  .137741
14  .200872
15  .448

我在 pandas 中的想法是创建 3 个布尔列来处理不同类型的数据

ex_df['hasX'] = ex_df['ex_col'].str.contains('X')
ex_df['has_ac'] = ex_df['ex_col'].str.contains('ac')
ex_df['has_comma'] = ex_df['ex_col'].str.contains(',')

这按预期输出

ex_df
ex_col  hasX    has_ac  has_comma
0   100X172     True    False   False
1   78X120  True    False   False
2   1 ac    False   True    False
3   76,666  False   False   True
4   85X175  True    False   False
5   19,928  False   False   True
6   14810   False   False   False
7   3 ac    False   True    False
8   90X181  True    False   False
9   38X150  True    False   False
10  19040   False   False   False
11  8265    False   False   False
12  100X125     True    False   False
13  6000    False   False   False
14  8,750   False   False   True
15  .448 ac     False   True    False

接下来我尝试了多个 loc 操作,如下所示

ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.split()[0]))
ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))

这会输出以下错误:

<ipython-input-40-e9eb1eacbacb> in <module>
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
      2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
      3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
      4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201 
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-40-e9eb1eacbacb> in <lambda>(x)
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
      2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
      3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
      4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))

ValueError: invalid literal for int() with base 10: '1 ac'

标签: pythonpandas

解决方案


让我们尝试extract拆分您的数据列并用于np.select映射:

data = (ex_df['ex_col'].str.replace(',','')
            .str.extract('([\.\d]+)\s?(ac|X)?([\.\d,]+)?')
       )

data[[0,2]] = data[[0,2]].astype(float)

ex_df['area'] = np.select((data[1].eq('X'), data[1].eq('ac')),
                          (data[0]* data[2]/43560,data[0]), 
                          data[0]/43560 )

输出:

     ex_col      area
0   100X172  0.394858
1    78X120  0.214876
2      1 ac  1.000000
3    76,666  1.760009
4    85X175  0.341483
5    19,928  0.457484
6     14810  0.339991
7      3 ac  3.000000
8    90X181  0.373967
9    38X150  0.130854
10    19040  0.437098
11     8265  0.189738
12  100X125  0.286961
13     6000  0.137741
14    8,750  0.200872
15  .448 ac  0.448000

推荐阅读