python - How to split strings in pandas by punctuation
Question
I have a dataframe that looks like this:
        word  start   stop  speaker
0       but,   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6      okay,   9.19  10.01        2
7      sure.  10.02  11.01        2
8      what?  11.02  12.00        1
9          i  12.01  13.00        2
10    agree,  13.01  14.00        2
11       but  14.01  15.00        2
12         i  15.01  16.00        2
13  disagree  16.01  17.00        2
14    that's  17.01  18.00        1
15     fine,  18.01  19.00        1
16   however  19.01  20.00        1
17       you  20.01  21.00        1
18       are  21.01  22.00        1
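For anyone who wants to reproduce the problem, the sample frame above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, the name df matches the one used in the answer's code, and the intermediate list names are illustrative.

```python
import pandas as pd

# Values copied row by row from the table above.
words = ["but,", "that's", "alright", "we'll", "have", "to", "okay,", "sure.",
         "what?", "i", "agree,", "but", "i", "disagree", "that's", "fine,",
         "however", "you", "are"]
starts = [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01,
          13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01]
stops = [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00,
         14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00]
speakers = [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]

df = pd.DataFrame(
    {"word": words, "start": starts, "stop": stops, "speaker": speakers}
)
```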
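Whenever the speaker changes or a word ends in punctuation (apostrophes excluded), I want to combine the words in the 'word' column into one group. Besides grouping the words, I also want each group to take the 'start' of its first word and the 'stop' of its last word. What I want looks like this: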
              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
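Any suggestions on how to accomplish this would be greatly appreciated.
Solution
You can check whether each word's last character is in a list of punctuation marks, flag speaker changes, and group by a reversed cumsum: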
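import pandas as pd

punctuation = list(',.?!')
# True on each row that closes a group: the word ends in punctuation,
# or the next row belongs to a different speaker
s = (df['word'].str.strip().str[-1].isin(punctuation)    # punctuation
     | df['speaker'].ne(df['speaker'].shift(-1))         # speaker change
    )
# reversed cumsum: number of group ends at or after each row
s = s.iloc[::-1].cumsum().iloc[::-1]
# flip the labels so that group ids count up from 0
s = s.max() - s
df.groupby(s).agg({'word': ' '.join, 'start': 'min', 'stop': 'max', 'speaker': 'min'})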
Output:
              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
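The same grouping can also be sketched with a forward cumsum, which avoids reversing the series twice. This is an alternative under stated assumptions, not the method shown above: the helper name group_utterances and the small demo frame are hypothetical, and it takes 'first'/'last' rather than 'min'/'max' for the timestamps, which assumes the rows are already in chronological order.

```python
import pandas as pd

def group_utterances(df, punctuation=",.?!"):
    """Merge consecutive words until a punctuation mark or a speaker change.

    Hypothetical helper; expects columns word, start, stop, speaker.
    """
    # True on every row that closes a group.
    ends_group = (
        df["word"].str.strip().str[-1].isin(list(punctuation))
        | df["speaker"].ne(df["speaker"].shift(-1))
    )
    # A row opens a new group when the previous row closed one, so shifting
    # the flags down by one row and taking a running sum gives ascending ids.
    group_id = ends_group.shift(fill_value=False).cumsum()
    grouped = df.groupby(group_id).agg(
        {"word": " ".join, "start": "first", "stop": "last", "speaker": "first"}
    )
    return grouped.reset_index(drop=True)

# Small demo frame (a slice of the question's data, rows 3 to 7).
demo = pd.DataFrame({
    "word": ["we'll", "have", "to", "okay,", "sure."],
    "start": [8.43, 8.69, 8.97, 9.19, 10.02],
    "stop": [8.69, 8.97, 9.07, 10.01, 11.01],
    "speaker": [1, 1, 1, 2, 2],
})
result = group_utterances(demo)
print(result)
```

Taking the first 'start' and last 'stop' mirrors the wording of the question; on sorted, non-overlapping timestamps it should give the same values as 'min' and 'max'.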