Regex on a pandas column to create new columns

Problem description

I have a pandas column:

df['Reviewer']

0    more, 25-34 Male on Treatment for 10 years or more
1    Idapida, 25-34 Female on Treatment for 2 to less than 5 years
2    Anna, 13-18 Female on Treatment for 5 to less than 10 years 
3    Kepons, 55-64 on Treatment for 1 to 6 months 
4    sammymaguire, 45-54 Female on Treatment for 1 to less than 2 years 

I want to use the following regex patterns

import re

ageRegex = re.compile('13-18|19-24|25-34|35-44|45-54|55-64|65-74|75 or over')
timeRegex = re.compile('less than 1 month|1 to 6 months|6 months to less than 1 year|1 to less than 2 years|2 to less than 5 years|5 to less than 10 years|10 years or more')
genderRegex = re.compile('Male|Female')

to extract age, time, and gender into new columns that look like this:

0    25-34    10 years or more    Male 
1    25-34    Treatment for 2 to less than 5 years    Female 
2    13-18    Treatment for 5 to less than 10 years    Female 
3    55-64    Treatment for 1 to 6 months    na
4    45-54    Treatment for 1 to less than 2 years    Female 

I tried something like this:

df['age'] = ageRegex.findall(df['Reviewer'])

but I get the error:

expected string or bytes-like object

Tags: regex, pandas

Solution


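Use .str.extract, which applies the pattern to each element of the column. The findall method of a compiled pattern expects a single string, which is why passing the whole Series raises the error.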

df["age"] = df["Reviewer"].str.extract('(13-18|19-24|25-34|35-44|45-54|55-64|65-74|75 or over)')

df["Time"] = df["Reviewer"].str.extract('(less than 1 month|1 to 6 months|6 months to less than 1 year|1 to less than 2 years|2 to less than 5 years|5 to less than 10 years|10 years or more)')

df["Gender"] = df["Reviewer"].str.extract('(Male|Female)')
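
To check the approach end to end, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies the three patterns. The variable names age_pattern, time_pattern and gender_pattern are illustrative, and expand=False is an optional choice made here so that each call returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Reviewer": [
        "more, 25-34 Male on Treatment for 10 years or more",
        "Idapida, 25-34 Female on Treatment for 2 to less than 5 years",
        "Anna, 13-18 Female on Treatment for 5 to less than 10 years",
        "Kepons, 55-64 on Treatment for 1 to 6 months",
        "sammymaguire, 45-54 Female on Treatment for 1 to less than 2 years",
    ]
})

# Each pattern is wrapped in one capture group, which .str.extract requires.
age_pattern = r"(13-18|19-24|25-34|35-44|45-54|55-64|65-74|75 or over)"
time_pattern = (
    r"(less than 1 month|1 to 6 months|6 months to less than 1 year|"
    r"1 to less than 2 years|2 to less than 5 years|"
    r"5 to less than 10 years|10 years or more)"
)
gender_pattern = r"(Male|Female)"

# expand=False returns a Series; rows without a match become NaN.
df["age"] = df["Reviewer"].str.extract(age_pattern, expand=False)
df["Time"] = df["Reviewer"].str.extract(time_pattern, expand=False)
df["Gender"] = df["Reviewer"].str.extract(gender_pattern, expand=False)

print(df[["age", "Time", "Gender"]])
```

The row for Kepons has no gender in the text, so its Gender value is NaN. Note that .str.extract keeps only the first match in each string.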
