How to split a Pandas DataFrame column by string based on multiple conditions



    ID       Col.A

28654      This is a dark chocolate which is sweet 
39876      Sky is blue 1234 Sky is cloudy 3423
88776      Stars can be seen in the dark sky
35491      Schools are closed 4568 but shops are open

I am trying to split Col.A at the word dark or at any digits. My desired result is shown below.

     ID             Col.A                             Col.B
    28654      This is a                  dark chocolate which is sweet 
    39876      Sky is blue                1234 Sky is cloudy 3423
    88776      Stars can be seen in the   dark sky
    35491      Schools are closed         4568 but shops are open


import pandas as pd

df = pd.DataFrame({'ID':[28654,39876,88776,35491], 'Col.A':['This is a dark chocolate which is sweet', 
                                                            'Sky is blue 1234 Sky is cloudy 3423', 
                                                            'Stars can be seen in the dark sky',
                                                            'Schools are closed 4568 but shops are open']})

# Rows whose text contains the word ' dark '
df1 = df[df['Col.A'].str.contains(' dark ')==True]
# Remaining rows (those not in df1), found via a left merge and the _merge indicator
df2 = df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
# Split each group on its own delimiter; str.split drops the delimiter itself
df1 = df1["Col.A"].str.split(' dark ', expand = True)
df2 = df2["Col.A"].str.split(r'\d+', expand = True)
pd.concat([df1, df2], axis =0)


      0                              1
0   This is a                   chocolate which is sweet
2   Stars can be seen in the     sky    
1   Sky is blue                  Sky is cloudy  
3   Schools are closed           but shops are open




s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], axis=1))

      ID                     Col.A                          Col.B
0  28654                 This is a  dark chocolate which is sweet
1  39876               Sky is blue        1234 Sky is cloudy 3423
2  88776  Stars can be seen in the                       dark sky
3  35491        Schools are closed        4568 but shops are open


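  • \s+: matches any whitespace character one or more times
  • (?=\b(?:dark|\d+)\b): positive lookahead
    • \b: word boundary to prevent partial matches
    • (?:dark|\d+): non-capturing group
      • dark: first alternative, matches the characters dark literally
      • \d+: second alternative, matches any digit one or more times
    • \b: word boundary to prevent partial matches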
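
To see the effect of the lookahead outside of pandas, here is a minimal sketch that uses only the standard-library re module with the same pattern. The helper name split_once is hypothetical and only for illustration; it is not part of pandas or of the answer above. Because the lookahead does not consume anything, only the whitespace is removed and the word dark or the digits stay at the start of the second part.

```python
import re

# Same pattern as in the answer: whitespace that is followed (lookahead)
# by the whole word "dark" or by a run of digits.
PATTERN = re.compile(r'\s+(?=\b(?:dark|\d+)\b)')

def split_once(text):
    """Hypothetical helper: split text at the first match of PATTERN."""
    parts = PATTERN.split(text, maxsplit=1)
    # Pad with None when the pattern does not match, so the caller
    # always gets exactly two items back.
    if len(parts) == 1:
        parts.append(None)
    return parts

print(split_once('This is a dark chocolate which is sweet'))
print(split_once('Sky is blue 1234 Sky is cloudy 3423'))
# "darkness" is not split: the trailing \b rejects the partial match.
print(split_once('The darkness came early'))
```

The maxsplit=1 argument plays the same role as n=1 in str.split: only the first delimiter is used, so later digits such as 3423 remain in the second part.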

See the online regex demo.
