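python - How to split a Pandas DataFrame column by a string based on multiple conditions
Problem description
I have a pandas DataFrame that looks like this: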
ID Col.A
28654 This is a dark chocolate which is sweet
39876 Sky is blue 1234 Sky is cloudy 3423
88776 Stars can be seen in the dark sky
35491 Schools are closed 4568 but shops are open
I am trying to split Col.A at the word dark or at digits. The result I want is shown below.
ID     Col.A                     Col.B
28654  This is a                 dark chocolate which is sweet
39876  Sky is blue               1234 Sky is cloudy 3423
88776  Stars can be seen in the  dark sky
35491  Schools are closed        4568 but shops are open
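I tried grouping the rows that contain the word dark into one DataFrame and the rows with digits into another, then splitting each accordingly. After that, I could concatenate the resulting DataFrames to get the expected result. The code is as follows: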
import pandas as pd

df = pd.DataFrame({'ID': [28654, 39876, 88776, 35491],
                   'Col.A': ['This is a dark chocolate which is sweet',
                             'Sky is blue 1234 Sky is cloudy 3423',
                             'Stars can be seen in the dark sky',
                             'Schools are closed 4568 but shops are open']})
# Rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ')]
# Remaining rows (anti-join against df1)
df2 = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Split each group on its own delimiter; the delimiter itself is consumed
df1 = df1['Col.A'].str.split(' dark ', expand=True)
df2 = df2['Col.A'].str.split(r'\d+', expand=True)
pd.concat([df1, df2], axis=0)
The result differs from what I expected. This is what I get:
   0                         1
0  This is a                 chocolate which is sweet
2  Stars can be seen in the  sky
1  Sky is blue               Sky is cloudy
3  Schools are closed        but shops are open
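The digits and the word dark are missing from the result.
So how can I fix this and get the result without losing the words and digits I split on?
Is there a way to "slice before the expected word or digits" without removing them?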
Solution
We can use Series.str.split with a regex pattern:
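# Split once (n=1) at the whitespace that precedes "dark" or a number
s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
# Pass axis by keyword; recent pandas versions no longer accept it positionally
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], axis=1))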
   ID     Col.A                     Col.B
0  28654  This is a                 dark chocolate which is sweet
1  39876  Sky is blue               1234 Sky is cloudy 3423
2  88776  Stars can be seen in the  dark sky
3  35491  Schools are closed        4568 but shops are open
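For a self-contained check, the snippet above can be wrapped as in the sketch below. The helper name split_before_keyword and the fifth row (ID 1, with no keyword) are illustrative assumptions, not part of the original question. The extra row is only there to show that a string with no match seems to stay whole in Col.A, with a missing value in Col.B.

```python
import pandas as pd

# Pattern from the answer: whitespace followed (lookahead) by "dark" or a whole number.
PATTERN = r"\s+(?=\b(?:dark|\d+)\b)"


def split_before_keyword(frame, column="Col.A", new_column="Col.B"):
    """Hypothetical helper: split `column` once, just before the first keyword."""
    parts = frame[column].str.split(PATTERN, n=1, expand=True)
    parts = parts.set_axis([column, new_column], axis=1)
    # Replace the original column with the two split columns.
    return frame.drop(columns=[column]).join(parts)


df = pd.DataFrame({
    "ID": [28654, 39876, 88776, 35491, 1],  # ID 1 is an illustrative extra row
    "Col.A": [
        "This is a dark chocolate which is sweet",
        "Sky is blue 1234 Sky is cloudy 3423",
        "Stars can be seen in the dark sky",
        "Schools are closed 4568 but shops are open",
        "Nothing to split here",  # no "dark" and no digits
    ],
})

result = split_before_keyword(df)
print(result)
```

Note that if no row at all contains a match, the split yields a single column and the two-label set_axis call would fail, so this sketch assumes at least one row matches.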
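Regex details:
\s+ : matches any whitespace character one or more times
(?=\b(?:dark|\d+)\b) : positive lookahead
    \b : word boundary to prevent partial matches
    (?:dark|\d+) : non-capturing group
        dark : first alternative, matches the characters dark literally
        \d+ : second alternative, matches any digit one or more times
    \b : word boundary to prevent partial matches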
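The effect of the lookahead can be seen without pandas. The sketch below uses only the standard re module, and the sample sentence containing "darker" is an added illustration. Splitting on \d+ consumes the digits, while the lookahead pattern consumes only the whitespace in front of them, so the keyword stays in the second piece. The word boundaries keep "darker" from counting as "dark".

```python
import re

text = "Sky is blue 1234 Sky is cloudy 3423"

# Splitting on the digits themselves consumes them, so they vanish from the result.
naive = re.split(r"\d+", text, maxsplit=1)
print(naive)      # ['Sky is blue ', ' Sky is cloudy 3423']

# The lookahead only asserts that "dark" or a number follows; just the
# whitespace before it is consumed, so the keyword is kept.
pattern = r"\s+(?=\b(?:dark|\d+)\b)"
kept = re.split(pattern, text, maxsplit=1)
print(kept)       # ['Sky is blue', '1234 Sky is cloudy 3423']

# The trailing \b rejects partial matches such as "darker", so nothing is split.
unchanged = re.split(pattern, "It gets darker at night", maxsplit=1)
print(unchanged)  # ['It gets darker at night']
```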