Splitting a column that contains both str and int

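Problem description

I have a column that should contain only integers, but because of data errors it currently holds both strings and integers. I need to apply an np.where statement like this: np.where(df['IO8'] >= 2002, "NEW", "OLD")

The statement fails with the error "cannot use >= on strings". How can I work around this? Any help would be great. Let me know if more details are needed. I also tried using a regular expression, as follows: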

df['split'] = pd.np.where(df['IO8'].str.contains("^\d{4}$", regex=True), "Number", "Error")
df['IO8'] = pd.np.where(df['split'].str.contains("Number"), df['IO8'].astype(int), df['IO8'].astype(str))
df['split1'] = pd.np.where(df['split'].str.contains("Number") & (df['IO8'] >= 2002),"NEW","OLD")

But I still get the same error.

Tags: python, regex, python-3.x, pandas, numpy

Solution


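Use Series.str.extract to pull the four-digit values into a new column and convert it to float, so that anything that does not match becomes NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'IO8':['2000','2009','20','dwd21']})

df['num'] = df['IO8'].str.extract(r"(^\d{4}$)", expand=False).astype(float)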
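
The sample frame above holds only strings, whereas the column in the question mixes real integers with strings. On such an object column the .str accessor returns NaN for elements that are not strings, so a genuine integer such as 2000 would be treated as a non-match. A minimal sketch, assuming it is acceptable to cast every value to its string form first (the sample values and the name raw are made up for illustration):

```python
import pandas as pd

# Hypothetical mixed column: real ints alongside strings.
df = pd.DataFrame({'IO8': [2000, '2009', 20, 'dwd21']})

# Without a cast, .str methods yield NaN for the non-string elements,
# so the integer 2000 is lost even though it is a valid four-digit value.
raw = df['IO8'].str.extract(r"(^\d{4}$)", expand=False).astype(float)

# Casting to str first lets the regex see every value.
df['num'] = (
    df['IO8']
    .astype(str)
    .str.extract(r"(^\d{4}$)", expand=False)
    .astype(float)
)

print(df)
```

After the cast, the rest of the answer (the m1 and m2 masks) applies unchanged.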

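Then numpy.select can be used to handle the 3 states (matched and new, matched and old, no match):

# np is numpy imported as np (pd.np was removed in pandas 2.0)
m1 = df['num'].notna()
m2 = df['num'] >= 2002
df['split1'] = np.select([m1 & m2, m1 & ~m2], ["NEW", "OLD"], default='no match')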

Or use a nested np.where (with numpy imported as np). Since NaN >= 2002 is False, m2 is only True for rows that matched:

df['split1'] = np.where(m2, "NEW", np.where(m1, "OLD", 'no match'))

print (df)
     IO8     num    split1
0   2000  2000.0       OLD
1   2009  2009.0       NEW
2     20     NaN  no match
3  dwd21     NaN  no match

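The third state is needed because a single np.where can only produce two values, so the rows that failed to match are labelled OLD as well:

# assumes: import numpy as np, import pandas as pd
df = pd.DataFrame({'IO8':['2000','2009','20','dwd21']})

df['num'] = df['IO8'].str.extract(r"(^\d{4}$)", expand=False).astype(float)

m1 = df['num'].notna()
m2 = df['num'] >= 2002
df['split1'] = np.where(m1 & m2, "NEW", "OLD")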

print (df)
     IO8     num split1
0   2000  2000.0    OLD
1   2009  2009.0    NEW
2     20     NaN    OLD
3  dwd21     NaN    OLD
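
As an alternative to the regex, pandas.to_numeric with errors='coerce' turns anything it cannot parse into NaN, and it also copes with columns that mix real integers with strings. This is a different technique from the one in the answer above, and it behaves differently: it accepts any numeric value rather than only four-digit ones, so '20' becomes 20.0 and is classified as OLD instead of 'no match'. A minimal sketch with made-up sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: one real int, the rest strings.
df = pd.DataFrame({'IO8': [2000, '2009', '20', 'dwd21']})

# Unparseable values such as 'dwd21' are coerced to NaN.
df['num'] = pd.to_numeric(df['IO8'], errors='coerce')

m1 = df['num'].notna()   # value could be parsed as a number
m2 = df['num'] >= 2002   # NaN compares as False

df['split1'] = np.select([m1 & m2, m1 & ~m2], ["NEW", "OLD"], default='no match')

print(df)
```

If short values like '20' must still be rejected, the regex approach is the better fit, or an extra range mask can be combined with m1.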
