How do I use a regular expression to identify only runs of 4-5 consecutive digits in a pandas DataFrame?

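Question

The following code sets up hardcoded DataFrames. It can be copied and pasted into an Anaconda3 Jupyter notebook and viewed as tables. Copy and paste the code through the line that starts with "df2 =" and run it in a Jupyter notebook; if you then type df1.head(14) and run it, you will see the output. Then type df2.head(14) to see the second DataFrame. After that you can run the rest of the code until you have the merged DataFrame.

I need help adjusting this code so that it only recognizes dept_nbrs that are 4 to 5 digits long. In other words, how can I use a regex that matches no number unless it is 4 or 5 consecutive digits long? I don't want to pick up any of the department numbers associated with cities, which are all one, two, or three digits long (but I can't simply delete those numbers from the original DataFrame). Any help is much appreciated. The desired result looks like this: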

    Department                       TrueDeparment  num_col  Dept_Nbr  Dept_Desc_HR
0   Merch - 1854                     Empty          1854     1854      Community Relations
1   1925 - WH                        empty          1925     1925      Human Resources
2   Montreal 10                      empty
3   CMI-General Liability | 05-9362  empty          9362     9362      Legal
4   Market 466                       empty
5   7763                             empty          7763     7763      Merchandising
6   Realty Supply Chain 44-9635      empty          9635     9635      Contractor
7   Merchandising (18-9087)          empty          9087     9087      Security
8   07-7882 | Supply Chain           empty          7882     7882      Produce
9   6101-09-01 00:00:00              empty          6101     6101      Recruiting
10  Supply Chain Engineering 9826    empty          9826     9826      R&D
11  Dept. 93 - Fresh Meat 01-9245    empty          9245     9245      Real Estate
12  Health & Wellness WBU Reg 54     empty
13  US09027                          empty          9027     9027      Retail Services

Code:

import pandas as pd
import re

df1 = pd.DataFrame({'Department' : ['Merch - 1854', '1925 - WH','Montreal 10','CMI-General Liability | 05-9362', 'Market 466','7763','Realty Supply Chain  44-9635','Merchandising (18-9087)','07-7882 | Supply Chain','6101-09-01 00:00:00','Supply Chain Engineering 9826','Dept. 93 - Fresh Meat 01-9245','Health & Wellness WBU Reg 54','US09027'],'TrueDeparment' : ['Empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty']})   

df2 = pd.DataFrame({'Dept_Nbr' : [1,5,7,9,10,18,44,54,93,466,1854,1925,6101,7763,7882,9027,9087,9245,9362,9635,9826], 'Dept_Desc_HR' : ['springfield','new orleans','san diego','new york','cleveland','orlando','san francisco','st louis','kansas city','detroit','Community Relations','Human Resources','Recruiting','Merchandising','Produce','Retail Services','Security','RealEstate','Legal','Contractor','R&D']})

line = 'Merch - 1854 '
match = re.search(r'[0-9]+', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

def extract_number(field):
    match = re.search(r'[0-9]+', field)
    if match is None:
        return 0
    else:
        return int(match[0])

df1['num_col'] = df1[['Department']].apply(lambda row:extract_number(row['Department']),axis=1)

df1.merge(df2, left_on = ['num_col'], right_on = ['Dept_Nbr'])

Tags: regex, python-3.x, pandas

Answer


import pandas as pd
import re

df1 = pd.DataFrame({'Department' : ['Merch - 1854', '1925 - WH','Montreal 10','CMI-General Liability | 05-9362', 'Market 466','7763','Realty Supply Chain  44-9635','Merchandising (18-9087)','07-7882 | Supply Chain','6101-09-01 00:00:00','Supply Chain Engineering 9826','Dept. 93 - Fresh Meat 01-9245','Health & Wellness WBU Reg 54','US09027'],'TrueDeparment' : ['Empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty','empty']})   

df2 = pd.DataFrame({'Dept_Nbr' : [1,5,7,9,10,18,44,54,93,466,1854,1925,6101,7763,7882,9027,9087,9245,9362,9635,9826], 'Dept_Desc_HR' : ['springfield','new orleans','san diego','new york','cleveland','orlando','san francisco','st louis','kansas city','detroit','Community Relations','Human Resources','Recruiting','Merchandising','Produce','Retail Services','Security','RealEstate','Legal','Contractor','R&D']})

As @Jan mentioned, you can use the regex \d{4,5}, which matches a minimum of 4 and a maximum of 5 consecutive digit characters.

line = 'Merch - 1854 '
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
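One caveat worth illustrating: \d{4,5} on its own also matches the first five digits of a longer run. If a run of six or more digits should not count as a department number (an assumption, since the sample data contains no such run), a sketch with lookarounds could look like the following. The names STRICT, LOOSE, and first_number, and the sample string 'Order 1234567', are hypothetical and used only for illustration.

```python
import re

# LOOSE is the pattern from the answer; STRICT adds lookarounds so the
# 4-5 digit run may not be preceded or followed by another digit.
LOOSE = re.compile(r'\d{4,5}')
STRICT = re.compile(r'(?<!\d)\d{4,5}(?!\d)')


def first_number(pattern, text):
    """Return the first match as an int, or None if nothing matches."""
    match = pattern.search(text)
    return int(match[0]) if match else None


# 'Order 1234567' is a made-up value, not part of the original data.
samples = ['Merch - 1854', 'Montreal 10', 'US09027', 'Order 1234567']

loose_results = [first_number(LOOSE, s) for s in samples]
strict_results = [first_number(STRICT, s) for s in samples]

print(loose_results)   # the loose pattern picks 12345 out of 1234567
print(strict_results)  # the strict pattern rejects the 7-digit run
```

For the data in the question both patterns give the same department numbers, so the stricter form only matters if longer digit runs can appear.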

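pandas also comes with built-in string operations, and the one we want here is pd.Series.str.extract. It only ever takes the first match in each row. We pass expand=False so that, with a single capture group, the result is a Series rather than a one-column DataFrame (with more than one capture group, pandas returns a DataFrame either way). Finally, note the capture group () around our regex: this is what extract writes to the new Series. Rows with no match get NaN.

Edit: we cast both key columns to float so that the merge works. str.extract returns strings, and a column that contains NaN cannot hold plain integers (integers are upcast to float when NaN is present), so float is the common dtype for num_col and Dept_Nbr.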
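To make the behaviour of expand concrete, here is a small sketch on three of the Department values from the question. The variable names s, as_series, and as_frame are illustrative only.

```python
import pandas as pd

# Three sample values taken from the Department column in the question.
s = pd.Series(['Merch - 1854', 'Montreal 10', 'Dept. 93 - Fresh Meat 01-9245'])

# One capture group + expand=False -> a Series of strings (NaN where no match).
as_series = s.str.extract(r'(\d{4,5})', expand=False)

# The same pattern with expand=True -> a DataFrame with a single column named 0.
as_frame = s.str.extract(r'(\d{4,5})', expand=True)

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series.tolist())
```

Note that the extracted values are still strings at this point, which is why a dtype conversion is needed before merging on a numeric key.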

df1['num_col'] = df1.Department.str.extract(r'(\d{4,5})', expand=False).astype(float)
df2.Dept_Nbr = df2.Dept_Nbr.astype(float)

df1.merge(df2, left_on = ['num_col'], right_on = ['Dept_Nbr'])
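The merge above uses the default how='inner', so rows without a 4-5 digit number (such as 'Montreal 10') are dropped from the result. The desired output in the question keeps those rows with empty values, so a left join may be closer to what was asked. The following is a minimal sketch on a subset of the question's data; the variable name merged is an assumption for illustration.

```python
import pandas as pd

# A subset of the question's data, kept short for illustration.
df1 = pd.DataFrame({'Department': ['Merch - 1854', 'Montreal 10',
                                   'Market 466', 'US09027']})
df2 = pd.DataFrame({'Dept_Nbr': [10, 466, 1854, 9027],
                    'Dept_Desc_HR': ['cleveland', 'detroit',
                                     'Community Relations', 'Retail Services']})

# Same extraction and float conversion as in the answer.
df1['num_col'] = df1['Department'].str.extract(r'(\d{4,5})', expand=False).astype(float)
df2['Dept_Nbr'] = df2['Dept_Nbr'].astype(float)

# how='left' keeps every row of df1; unmatched rows get NaN in df2's columns.
merged = df1.merge(df2, how='left', left_on='num_col', right_on='Dept_Nbr')
print(merged)
```

With how='left', 'Montreal 10' and 'Market 466' stay in the result with NaN in num_col, Dept_Nbr, and Dept_Desc_HR, because the city codes 10 and 466 are never extracted.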
