Python Pandas: extracting a specific string with a regular expression

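Problem description

I want to iterate over a column of records (directory-path strings) and extract the record ID inside the parentheses. In some cases, however, the text in parentheses is not a record ID and must be ignored.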

Code:

df1['Doc ID'] = df['Folder Path'].str.extract(r'.*\((.*)\).*', expand=True)  # this does not ignore instances with (2018-03) or (yyyy-mm)

I also tried:

df1['Doc ID'] = df['Folder Path'].str.extract(r'\((?!date_format)([^()]+)\)', expand=True)  # this does not ignore (Data Only)

  Folder Path                                          Doc ID
1  /report/support + admin. (256)/ Global (2018-03)    (256) # ignores: (2018-03)
2  /reports/limit/sector(139)/2017                     (139)
3  /reports/sector/region(147,189 and 132)/2018        (147,189 and 132)
4  /reports/support.(Data Only)/Region (2558)          (2558)  #ignores(Data Only)

Tags: python, regex, pandas

Solution


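This pattern uses a negative lookahead to skip "(Data Only)", while the character class [^\-]+ rejects any bracketed text that contains a hyphen, which excludes date formats such as (2018-03):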

(\((?!Data Only)[^\-]+\))
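
To make the pieces of this pattern easier to see, here is a minimal sketch that writes the same expression with `re.VERBOSE` and tries it on two strings. The helper name `first_id` is hypothetical and only used for illustration; it is not part of the original answer.

```python
import re

# The answer's pattern, split into commented pieces with re.VERBOSE.
# In verbose mode a literal space must be escaped, hence "Data\ Only".
PATTERN = re.compile(r"""
    (                 # capture group: the whole bracketed token
      \(              # literal opening parenthesis
      (?!Data\ Only)  # negative lookahead: reject text starting with Data Only
      [^\-]+          # one or more characters, none of them a hyphen
      \)              # literal closing parenthesis
    )
""", re.VERBOSE)


def first_id(text):
    """Return the first matching bracketed token, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(first_id("(Data Only) text (1, 2 and 3)"))  # (1, 2 and 3)
print(first_id("(2013-08) foo (123)"))            # (123)
print(first_id("(2018-03)"))                      # None
```

The lookahead is checked immediately after the opening parenthesis, so it consumes no characters; the hyphen exclusion is what makes the date-like groups fail to match.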

Setup:

import pandas as pd

# Sample data covering the three cases: "Data Only", a date, and real IDs
df = pd.DataFrame(
    {'Path': ['(Data Only) text (1, 2 and 3)',
              '(2013-08) foo (123)',
              '(Data Only) bar (1,2,3,4,5 and 6)']}
)

                                Path
0      (Data Only) text (1, 2 and 3)
1                (2013-08) foo (123)
2  (Data Only) bar (1,2,3,4,5 and 6)

Using str.extract:

df.Path.str.extract(r'(\((?!Data Only)[^\-]+\))', expand=True)

                   0
0       (1, 2 and 3)
1              (123)
2  (1,2,3,4,5 and 6)
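
As a rough sketch, the same idea can be applied to the folder paths from the question. One caveat: because [^\-]+ also accepts parentheses, the pattern above can run across two groups when a valid ID comes before a hyphen-free group, for example "(123) foo (Data Only)". The variant below, named `STRICT` here purely for illustration, additionally excludes "(" and ")" inside the brackets; this is an assumed refinement, not part of the original answer.

```python
import pandas as pd

# Folder paths copied from the question's sample table.
df = pd.DataFrame({
    "Folder Path": [
        "/report/support + admin. (256)/ Global (2018-03)",
        "/reports/limit/sector(139)/2017",
        "/reports/sector/region(147,189 and 132)/2018",
        "/reports/support.(Data Only)/Region (2558)",
    ]
})

# Assumed variant: also exclude "(" and ")" inside the brackets so that a
# single match can never span two parenthesised groups.
STRICT = r"(\((?!Data Only)[^()\-]+\))"

# expand=False returns a Series when the pattern has one capture group.
df["Doc ID"] = df["Folder Path"].str.extract(STRICT, expand=False)

print(df["Doc ID"].tolist())
# ['(256)', '(139)', '(147,189 and 132)', '(2558)']
```

To drop the surrounding parentheses from the result, the capture group could instead wrap only the inner part, as in r"\((?!Data Only)([^()\-]+)\)".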
