python - How to extract apartment from address in Pandas
Problem description
I have a rather messy data set that has a lot of inconsistencies and errors due to manually input data.
I'm working on the address column of this dataset in pandas.
What I would like to do is break the address column into 3 separate entities:
1) a column for the address
2) a column for the street number
3) a column for the apartment or unit number
The data looks like the following:
address
----------------------
123 smith street #5234
5000 john ct
34 wood st apt# 23
523 fire road apt #87
charles way apt. 434
0987 misty lane unit B
I have already moved the street numbers into their own column. For this I used "np.where" with a simple condition: if the string starts with digits, extract them into the new street column.
I am now stuck on how to do this with apartment numbers.
I am assuming that because of the inconsistencies, I have to do something like:
df['apt/unit'] = np.where(str contains "apt", extract string starting at "apt" until end, else np.NaN)
df['apt/unit'] = np.where(str contains "unit", extract string starting at "unit" until end, else np.NaN)
Will I have to use regex to do this? If so, what is the way to go about that?
Are there any alternatives to this line of thinking?
Solution
Since your apt/unit column depends on multiple conditions, you can use np.select here, as follows:
import numpy as np

# Define our conditions
conditions = [
df.address.str.contains('apt'),
df.address.str.contains('unit'),
df.address.str.contains('#')
]
# Define our choices based on our conditions
choices = [
df.address.apply(lambda x: x[x.find('apt'):]),
df.address.apply(lambda x: x[x.find('unit'):]),
df.address.apply(lambda x: x[x.find('#'):])
]
# Apply this logic by creating the new column
df['apt/unit'] = np.select(conditions, choices, default = '')
# Clean up our address column
choices2 = [
df.address.apply(lambda x: x[:x.find('apt')]),
df.address.apply(lambda x: x[:x.find('unit')]),
df.address.apply(lambda x: x[:x.find('#')])
]
df['address'] = np.select(conditions, choices2, default = df.address)
Output
print(df)
            address  apt/unit
0  123 smith street     #5234
1      5000 john ct
2        34 wood st   apt# 23
3     523 fire road   apt #87
4       charles way  apt. 434
5   0987 misty lane    unit B
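On the regex part of the question: a regular expression is not required, but it is one possible alternative. The following is a minimal sketch, not part of the original answer, that does the split in a single pass with str.extract. The pattern and the group names street and unit are assumptions chosen for this sample data: the unit part is taken to start at a standalone "apt" or "unit", or at "#", and to run to the end of the string.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "address": [
        "123 smith street #5234",
        "5000 john ct",
        "34 wood st apt# 23",
        "523 fire road apt #87",
        "charles way apt. 434",
        "0987 misty lane unit B",
    ]
})

# Assumed pattern: everything before the unit marker is the street part;
# the optional unit part starts at the whole word "apt" or "unit", or at "#".
pattern = r"^(?P<street>.*?)\s*(?P<unit>(?:\bapt\b|\bunit\b|#).*)?$"

# str.extract returns one column per named group
parts = df["address"].str.extract(pattern)

df["address"] = parts["street"].str.strip()
df["apt/unit"] = parts["unit"].fillna("")  # rows without a unit get ''

print(df)
```

Because of the word boundaries (\b), a street name that merely contains the letters "apt", such as "chapter lane", is not split, whereas a plain substring check with str.contains('apt') would split it. The sample data is all lowercase; for mixed-case input, passing flags=re.IGNORECASE to str.extract should handle values like "Apt" or "UNIT".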