How to merge grouped text cumulatively, one word at a time

Problem description

I have a DataFrame like the one below:

        text  group
0      hello      1
1      world      1
2       it's      2
3       time      2
4         to      2
5    explore      2
6        one      3
7       more      3
8       line      3

Within each group, I want to merge the words in text one at a time into a new column, as shown below:

        text  group                     result
0      hello      1                      hello
1      world      1                hello world
2       it's      2                       it's
3       time      2                  it's time
4         to      2               it's time to
5    explore      2       it's time to explore
6        one      3                        one
7       more      3                   one more
8       line      3              one more line

So far I have tried:

df['res']=df.groupby('group')['text'].transform(lambda x: ' '.join(x))
df['result']=df[['text','res']].apply(lambda x: ' '.join( x['res'].split()[:x['res'].split().index(x['text'])+1]),axis=1)

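The code above works for the data shown, but it has some problems.

If a group contains a repeated word, index returns the position of its first occurrence, so the code fails on this data: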

        text  group                     result
0      hello      1                      hello
1      world      1                hello world
2       it's      2                       it's
3       time      2                  it's time
4         to      2               it's time to
5    explore      2       it's time to explore
6        one      3                        one
7       more      3                   one more
8       line      3              one more line
9      hello      4                      hello
10  repeated      4             hello repeated
11     hello      4                      hello #this must be hello repeated hello
12      came      4  hello repeated hello came

Note: it fails for group 4.
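
The failure can be reproduced without pandas. A minimal sketch, where `words` is simply the group 4 text from the table above and `prefix_via_index` is a hypothetical helper that mirrors the slicing in the attempt above:

```python
# Group 4 from the table above, as a plain list (illustrative only)
words = ["hello", "repeated", "hello", "came"]

def prefix_via_index(words, word):
    # list.index returns the position of the FIRST match,
    # so a repeated word is always cut at its earliest occurrence
    return " ".join(words[: words.index(word) + 1])

results = [prefix_via_index(words, w) for w in words]
print(results)
# ['hello', 'hello repeated', 'hello', 'hello repeated hello came']
```

The third entry is cut at position 0 instead of position 2, which matches the wrong row in the table.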

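My script is also clearly inefficient.

Can someone suggest a way to solve both my indexing problem and the performance problem?

Any help is appreciated.

Tags: python, pandas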

Solution


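Using cumsum with strings is not straightforward, but here is one possible solution: first append a space to each word, then apply cumsum within each group, and finally strip the trailing space with rstrip.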

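# Note: this overwrites 'text', so that column keeps a trailing space
df['text'] = df['text'] + ' '
# Cumulative string concatenation within each group, then strip the trailing space
df['res'] = df.groupby('group')['text'].transform(pd.Series.cumsum).str.rstrip()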

Alternatively, in one line that leaves the text column unchanged:

df['res'] = df['text'].add(' ').groupby(df['group']).transform(pd.Series.cumsum).str.rstrip()

print (df)
       text  group                   res
0    hello       1                 hello
1    world       1           hello world
2     it's       2                  it's
3     time       2             it's time
4       to       2          it's time to
5  explore       2  it's time to explore
6      one       3                   one
7     more       3              one more
8     line       3         one more line
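
To see what the append-space, cumulative-sum, rstrip sequence does inside a single group, here is a rough pure-Python sketch. It uses itertools.accumulate in place of the pandas cumsum, and the list `words` is made-up sample data, so treat it as an illustration of the idea rather than a description of what pandas does internally:

```python
from itertools import accumulate

# One group's words, including a repeated word (sample data)
words = ["hello", "repeated", "hello", "came"]

# Step 1: append a space to every word
padded = [w + " " for w in words]

# Step 2: running concatenation, the string analogue of a cumulative sum
running = list(accumulate(padded))

# Step 3: drop the trailing space left over from step 1
result = [s.rstrip() for s in running]
print(result)
```

Because each entry is built from the previous one by position, a repeated word causes no problem here.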

Another solution:

f = lambda x: [' '.join(x[:i]) for i in range(1, len(x)+1)]
df['res'] = df.groupby('group')['text'].transform(f)
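
A possible variant of the same idea, sketched under assumptions: the helper name `running_join` and the sample frame are hypothetical, and itertools.accumulate is swapped in for the list comprehension so that each prefix is built from the previous one instead of being re-joined from scratch. The sample data includes the repeated word from group 4 of the question:

```python
from itertools import accumulate

import pandas as pd

# Hypothetical sample data; group 4 contains the repeated word "hello"
df = pd.DataFrame({
    "text": ["hello", "repeated", "hello", "came", "one", "more"],
    "group": [4, 4, 4, 4, 5, 5],
})

def running_join(words):
    # words is the 'text' Series of one group; each step extends the previous prefix
    prefixes = list(accumulate(words, lambda acc, w: acc + " " + w))
    # Return a Series aligned to the group's own index
    return pd.Series(prefixes, index=words.index)

df["res"] = df.groupby("group")["text"].transform(running_join)
print(df)
```

Returning a Series aligned to the group's index keeps the row alignment explicit, and the text column itself is left untouched.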
