How to extract apartment from address in Pandas

Problem description

I have a rather messy data set with many inconsistencies and errors because the data was entered manually.

I'm working on the address column of this dataset in pandas.

What I would like to do is break the address column into 3 separate entities:

1) a column for the address

2) a column for the street number

3) a column for the apartment or unit number

The data looks like the following:

address
----------------------
123 smith street #5234
5000 john ct
34 wood st apt# 23
523 fire road apt #87
charles way apt. 434
0987 misty lane unit B 

I have already moved the street numbers into their own column. For this I used "np.where" with a simple logical condition: if the string starts with digits, extract them into the new street column.
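A minimal sketch of what that step might look like, assuming the sample data shown above. The column name `street_number` and the exact regular expressions are illustrative assumptions, not the code actually used:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"address": [
    "123 smith street #5234",
    "5000 john ct",
    "34 wood st apt# 23",
    "523 fire road apt #87",
    "charles way apt. 434",
    "0987 misty lane unit B",
]})

# True where the address begins with one or more digits
starts_with_digits = df["address"].str.match(r"\d+")

# Leading digits where present, NaN otherwise.
# Values stay strings so that "0987" keeps its leading zero.
# "street_number" is a hypothetical column name.
df["street_number"] = np.where(
    starts_with_digits,
    df["address"].str.extract(r"^(\d+)", expand=False),
    np.nan,
)
```

Since `str.extract` already returns NaN for rows that do not match, the `np.where` wrapper is optional here; it is kept only to mirror the approach described.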

I am now stuck on how to do this with apartment numbers.

I am assuming that because of the inconsistencies, I have to do something like:

df['apt/unit'] = np.where(str contains "apt", extract string starting at "apt" until end, else np.NaN)
df['apt/unit'] = np.where(str contains "unit", extract string starting at "unit" until end, else np.NaN)

Will I have to use regex to do this? If so, what is the way to go about that?

Are there any alternatives to this line of thinking?

Tags: python, regex, pandas

Solution


Since your apt/unit column depends on several conditions, you can use np.select here, as follows:

import numpy as np

# Define our conditions (checked in order; the first match wins)
conditions = [
    df.address.str.contains('apt'),
    df.address.str.contains('unit'),
    df.address.str.contains('#')
]

# Define our choices based on our conditions
choices = [
    df.address.apply(lambda x: x[x.find('apt'):]),
    df.address.apply(lambda x: x[x.find('unit'):]),
    df.address.apply(lambda x: x[x.find('#'):])
]

# Apply this logic to create the new column
df['apt/unit'] = np.select(conditions, choices, default = '')

# Clean up our address column
choices2 = [
    df.address.apply(lambda x: x[:x.find('apt')]),
    df.address.apply(lambda x: x[:x.find('unit')]),
    df.address.apply(lambda x: x[:x.find('#')])
]
df['address'] = np.select(conditions, choices2, default = df.address)

Output

print(df)

             address  apt/unit
0  123 smith street      #5234
1       5000 john ct          
2        34 wood st    apt# 23
3     523 fire road    apt #87
4       charles way   apt. 434
5   0987 misty lane     unit B
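As for whether regex is required: it is not, as the np.select approach shows, but a single pattern can do the split in one pass. The following is a sketch under stated assumptions rather than a definitive solution: the pattern is a guess that every apartment part starts with the word "apt", the word "unit", or a "#", and runs to the end of the string. Real data may need more keywords.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"address": [
    "123 smith street #5234",
    "5000 john ct",
    "34 wood st apt# 23",
    "523 fire road apt #87",
    "charles way apt. 434",
    "0987 misty lane unit B",
]})

# Assumed pattern: optional whitespace, then "apt" or "unit" as a whole word
# (so "captain" or "community" do not match) or a "#", then the rest of the
# string. (?i) makes the match case-insensitive.
pattern = r"(?i)\s*((?:\b(?:apt|unit)\b|#).*)$"

# Captured group becomes the apartment/unit; rows without a match get NaN
df["apt/unit"] = df["address"].str.extract(pattern, expand=False).str.strip()

# Remove the matched part from the address and trim leftover whitespace
df["address"] = df["address"].str.replace(pattern, "", regex=True).str.strip()
```

Unlike the np.select version, rows without an apartment part end up as NaN rather than an empty string; chaining `.fillna("")` after the extract would give the same result as above.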
