How to split strings in pandas by punctuation

Problem description

I have a DataFrame that looks like this:

        word  start   stop  speaker
0       but,   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6      okay,   9.19  10.01        2
7      sure.  10.02  11.01        2
8      what?  11.02  12.00        1
9          i  12.01  13.00        2
10    agree,  13.01  14.00        2
11       but  14.01  15.00        2
12         i  15.01  16.00        2
13  disagree  16.01  17.00        2
14    that's  17.01  18.00        1
15     fine,  18.01  19.00        1
16   however  19.01  20.00        1
17       you  20.01  21.00        1
18       are  21.01  22.00        1

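I want to join consecutive entries of "word" into phrases, starting a new phrase whenever the speaker changes or a word ends with punctuation (apostrophes excluded). In addition to grouping the words, each phrase should take the "start" of its first word and the "stop" of its last word. The result I want looks like this: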

              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1

Any suggestions on how to accomplish this would be greatly appreciated.

Tags: python, pandas

Solution


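You can check whether the last character of each word is in a list of punctuation marks, flag the row where each phrase ends, and group by a reversed cumsum of those flags: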

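punctuation = list(',.?!')

# True on the last row of each phrase
s = (df['word'].str.strip().str[-1].isin(punctuation)  # word ends with punctuation
     | df['speaker'].ne(df['speaker'].shift(-1))       # speaker changes on the next row
    )

# reversed cumsum: every row up to and including a flagged row gets the same value
s = s.iloc[::-1].cumsum().iloc[::-1]

# renumber so the group ids ascend from 0 in row order
s = s.max() - s

# 'min'/'max' give the first start and last stop because times increase within a group
df.groupby(s).agg({'word': ' '.join, 'start': 'min', 'stop': 'max', 'speaker': 'min'})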

Output:

              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
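
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and uses a slightly different route to the group ids: instead of reversing the cumsum, it shifts the end-of-phrase flag down one row and takes an ordinary forward cumsum. The names ends_phrase, group_id and result are illustrative and not part of the original answer, and the aggregation uses 'first'/'last' rather than 'min'/'max' on the assumption that rows are already in time order.

```python
import pandas as pd

# Sample data copied from the question (whitespace around words removed).
words = ["but,", "that's", "alright", "we'll", "have", "to", "okay,", "sure.",
         "what?", "i", "agree,", "but", "i", "disagree", "that's", "fine,",
         "however", "you", "are"]
starts = [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01,
          13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01]
stops = [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00,
         14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00]
speakers = [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]
df = pd.DataFrame({"word": words, "start": starts,
                   "stop": stops, "speaker": speakers})

# True on the last row of each phrase: the word ends with punctuation,
# or the next row belongs to a different speaker (or there is no next row).
ends_phrase = (
    df["word"].str.strip().str[-1].isin(list(",.?!"))
    | df["speaker"].ne(df["speaker"].shift(-1))
)

# A row starts a new phrase when the previous row ended one, so shifting the
# flag down by one row and taking a forward cumsum gives ascending group ids.
group_id = ends_phrase.shift(fill_value=False).cumsum()

result = (
    df.groupby(group_id)
      .agg(word=("word", " ".join),
           start=("start", "first"),   # start of the first word in the phrase
           stop=("stop", "last"),      # stop of the last word in the phrase
           speaker=("speaker", "first"))
      .reset_index(drop=True)
)
print(result)
```

Both routes should produce the same ten phrases on this data; the forward version simply avoids the two reversals and the renumbering step.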
