Pandas: splitting/extracting strings inside parentheses, with exceptions

Problem description

I have run into a specific problem that is best explained with a few examples:

I have a DataFrame that looks like this:

import pandas as pd
items=["ga - bg - cg - dg", "ag - bg - cg (u i)","ag - bg - cg ","ag - bg - cg - d g(u i)","ag - bg - cg (u i(u i))","ag - bg (ui)","ag - bg - cg (ATO) - dg","ag - bg - cg (ATO) (dg)"]
df = pd.DataFrame(columns=['R'],data=items)

#------------------------------------------------

0          ga - bg - cg - dg
1         ag - bg - cg (u i)
2               ag - bg - cg 
3    ag - bg - cg - d g(u i)
4    ag - bg - cg (u i(u i))
5               ag - bg (ui)
6    ag - bg - cg (ATO) - dg
7    ag - bg - cg (ATO) (dg)

The final result should look like this:

0          ga - bg - cg - dg  ga  bg        cg        dg
1         ag - bg - cg (u i)  ag  bg        cg       u i
2              ag - bg - cg   ag  bg        cg        cg
3    ag - bg - cg - d g(u i)  ag  bg        cg  d g(u i)
4    ag - bg - cg (u i(u i))  ag  bg        cg  u i(u i)
5               ag - bg (ui)  ag  bg      None        ui
6    ag - bg - cg (ATO) - dg  ag  bg  cg (ATO)        dg
7    ag - bg - cg (ATO) (dg)  ag  bg  cg (ATO)        dg

So far I have come up with this code:

df[['R.1','R.2','R.3','R.4']]=df['R'].str.split(' - ',n=3,expand=True).apply(lambda x: x.str.strip())
splits=df['R'].str.split(' - ',n=3)
lastelem=splits.str[-1]
NoBrackets=df['R'].str.replace(r"\(.*\)","",regex=True)  # regex=True must be explicit in pandas >= 2.0
df[['R.1','R.2','R.3','false']]=NoBrackets.str.split(' - ',n=3,expand=True).apply(lambda x: x.str.strip())
df.drop(['false'],axis=1, inplace=True)

splitNum=splits.agg([len])
for index,item in lastelem.items():  # Series.iteritems() was removed in pandas 2.0
    n=splitNum.iat[index,0]
    if n!=4:
        df.iat[index,-1]=lastelem[index].split('(',1)[-1].rsplit(')',1)[0].strip()
print(df)

#--------------------------

0        ga - bg - cg - dg  ga  bg    cg        dg
1       ag - bg - cg (u i)  ag  bg    cg       u i
2            ag - bg - cg   ag  bg    cg        cg
3  ag - bg - cg - d g(u i)  ag  bg    cg  d g(u i)
4  ag - bg - cg (u i(u i))  ag  bg    cg  u i(u i)
5             ag - bg (ui)  ag  bg  None        ui
6  ag - bg - cg (ATO) - dg  ag  bg    cg        dg
7  ag - bg - cg (ATO) (dg)  ag  bg    cg  ATO) (dg

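I am sure there must be a simpler way to achieve this, but I have not found it yet. I also do not know how to "preserve" the exception (ATO), which currently gets removed. If I have left anything out of the explanation, please tell me, and let me know how I can improve my code.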

Tags: pandas, string, dataframe, replace, split

Solution


You could try this:

# split by the last space that is not preceded by -
df[['head', 'tail']] = df['R'].str.strip().str.split(r'(?<!-)\s+(?=\S*$)', expand=True)

# split the first part by -, strip trailing -
head = df['head'].str.strip(' -').str.split(' - ', expand=True)

# copy tail
tail = df['tail'].copy()

# drop head and tail from original data
df = df.drop(['head', 'tail'], axis=1)

# join head and tail
data = pd.concat((head, tail), axis=1)

# use only the first not None value
data.iloc[:, 3] = data.iloc[:, 3].combine_first(tail)

# drop tail from data
data = data.drop('tail', axis=1)

# mask of values that are wrapped in an outer pair of parentheses
mask = data.iloc[:, 3].str.startswith('(', na=False)

# strip only that outer pair, keeping any nested parentheses
data.loc[mask, 3] = data.loc[mask, 3].str[1:-1]

# concat with original data
res = pd.concat((df, data), axis=1)

print(res)

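Output (for input written as `ui` and `dg(ui)`, i.e. without the spaces inside the parentheses that the question's data has):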

                         R   0   1         2       3
0        ga - bg - cg - dg  ga  bg        cg      dg
1        ag - bg - cg (ui)  ag  bg        cg      ui
2            ag - bg - cg   ag  bg        cg    None
3    ag - bg - cg - dg(ui)  ag  bg        cg  dg(ui)
4    ag - bg - cg (ui(ui))  ag  bg        cg  ui(ui)
5             ag - bg (ui)  ag  bg      None      ui
6  ag - bg - cg (ATO) - dg  ag  bg  cg (ATO)      dg
7  ag - bg - cg (ATO) (dg)  ag  bg  cg (ATO)      dg
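
The split pattern above cuts at the last run of whitespace, so for the question's original strings such as `ag - bg - cg (u i)` the cut would fall inside the parentheses. A possible alternative for that case, offered only as a sketch: split on ` - ` first, and when fewer than four fields result, peel off the final parenthesised group by scanning backwards for the bracket that matches the closing one. The helper names `matching_open` and `split_row` and the column names `R.1` to `R.4` are illustrative choices, not part of the answer above. For a row with no parenthesised tail, the sketch reuses the last field, which is what the question's expected table shows for row 2.

```python
import pandas as pd

items = ["ga - bg - cg - dg", "ag - bg - cg (u i)", "ag - bg - cg ",
         "ag - bg - cg - d g(u i)", "ag - bg - cg (u i(u i))", "ag - bg (ui)",
         "ag - bg - cg (ATO) - dg", "ag - bg - cg (ATO) (dg)"]
df = pd.DataFrame(columns=['R'], data=items)


def matching_open(text):
    """Return the index of the '(' matching the final ')' of text, or -1."""
    depth = 0
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ')':
            depth += 1
        elif text[i] == '(':
            depth -= 1
            if depth == 0:
                return i
    return -1


def split_row(s):
    """Split one string into [R.1, R.2, R.3, R.4] (hypothetical helper)."""
    parts = [p.strip() for p in s.strip().split(' - ', 3)]
    if len(parts) == 4:
        # Four dash-separated fields: keep them untouched, brackets included.
        return parts
    last = parts[-1]
    start = matching_open(last) if last.endswith(')') else -1
    if start >= 0:
        # Content of the trailing bracket group becomes the fourth field.
        tail = last[start + 1:-1].strip()
        parts[-1] = last[:start].strip()
    else:
        # No trailing bracket group: reuse the last field (assumption taken
        # from the question's expected table).
        tail = last
    # Pad the head to three fields so the tail always lands in R.4.
    return parts + [None] * (3 - len(parts)) + [tail]


cols = ['R.1', 'R.2', 'R.3', 'R.4']
rows = [split_row(s) for s in df['R']]
res = df.join(pd.DataFrame(rows, columns=cols, index=df.index))
print(res)
```

Because the backward scan counts nesting depth, `cg (u i(u i))` yields `u i(u i)`, and in `cg (ATO) (dg)` only the last group is removed, so `(ATO)` stays attached to the third field.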
