How do I get threads/conversations from reply IDs (Python)?

Question

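I'm relatively new to Python, and I'm trying to reconstruct conversations/threads from a dataframe that contains lists of IDs.

I currently have a pandas dataframe of tweets/reddit posts that looks roughly like this: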

id    text        parent_id   replies
id1   blah blah   _ post _    id2, id3, id4, id5, id6, id7
id2   blah blah   id1         id4, id5, id6, id7
id3   blah blah   id1
id4   blah blah   id2         id6, id7
id5   blah blah   id2
id6   blah blah   id4         id7
id7   blah blah   id6
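
For anyone who wants to reproduce the example, the table above could be written out as plain Python records. This is only a sketch: the names `records`, `known_ids`, and `top_level` are made up for illustration, and the `replies` values are copied from the table as lists of IDs.

```python
# Hypothetical reconstruction of the table above as a list of records.
records = [
    {"id": "id1", "text": "blah blah", "parent_id": "_ post _",
     "replies": ["id2", "id3", "id4", "id5", "id6", "id7"]},
    {"id": "id2", "text": "blah blah", "parent_id": "id1",
     "replies": ["id4", "id5", "id6", "id7"]},
    {"id": "id3", "text": "blah blah", "parent_id": "id1", "replies": []},
    {"id": "id4", "text": "blah blah", "parent_id": "id2",
     "replies": ["id6", "id7"]},
    {"id": "id5", "text": "blah blah", "parent_id": "id2", "replies": []},
    {"id": "id6", "text": "blah blah", "parent_id": "id4", "replies": ["id7"]},
    {"id": "id7", "text": "blah blah", "parent_id": "id6", "replies": []},
]

# Rows whose parent_id is not itself an id in the data are top-level comments.
known_ids = {r["id"] for r in records}
top_level = [r["id"] for r in records if r["parent_id"] not in known_ids]
```

Passing `records` to `pd.DataFrame(records)` should give a frame with the same four columns as the table.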

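My goal is to split the data into threads/conversations based on the IDs. For the example above, that means getting the following sequences as output: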

[id1, id2, id4, id6],

[id1, id2, id4, id7],

[id1, id2, id5], and

[id1, id3].

Having these lists would let me view each thread in full. My current code is quite convoluted and looks like this:

out_list = []
for i, row in df.iterrows():
    id_ = row["id"]
    # create our output file 
    sequence = [id_]
    replies = list(row['replies'])
    # creates a new dataframe from the replies to the topline comment in question
    reply_df= df.loc[df['id'].isin(replies)]
    reply_df = reply_df[reply_df.Parent_id2 == id_]
    #check if ends at topline
    if reply_df.empty == False:
        
        def turn_recursion(df, reply_df):
            for j, row_ in reply_df.iterrows():
                replies_2 = reply_df.loc[j, 'replies']
                id_2 = row_["id"]

                reply_df2 =  df.loc[df['id'].isin(replies_2)]
                reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]

                nonlocal sequence
                nonlocal out_list
                            
                if reply_df2.empty == False:
                    sequence.append(id_2)
                    return(turn_recursion(df, reply_df2))
                
                else:
                    sequence.append(id_2)
                    out_list.append(sequence)
        
        turn_recursion(test2, reply_df)
    else:
        out_list.append(sequence)
    

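This currently gives me semi-accurate results, but instead of getting [[id1, id2, id4, id6], [id1, id2, id4, id7]], I get [id1, id2, id4, id6, id7].

I realize I may be being a bit dim and that there is probably a simple solution, but for the life of me I can't find a way to do this that works correctly for any thread length.

Thanks in advance for any advice. :)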

Tags: python, pandas, multithreading, twitter, tree

Answer


Use networkx to get what you want:

import pandas as pd
import networkx as nx
from collections import defaultdict

data = defaultdict(list)

# Build graph from pandas
G = nx.from_pandas_edgelist(df, source='parent_id', target='id', 
                            create_using=nx.DiGraph)

# Find leaves (id3, id5, id7)
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Enumerate all possible paths
for node in df['id']:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, node, leaf):
            data[node].append(path)

Output:

>>> data
defaultdict(list,
            {'id1': [['id1', 'id3'],
              ['id1', 'id2', 'id5'],
              ['id1', 'id2', 'id4', 'id6', 'id7']],
             'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
             'id4': [['id4', 'id6', 'id7']],
             'id6': [['id6', 'id7']]})
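
The dictionary above holds paths starting from every node, whereas the question asks only for complete threads starting at the top-level comment. One possible way to narrow it down is sketched below; `data` is retyped here as a plain dict so the snippet runs on its own, and the names `non_roots` and `full_threads` are illustrative only.

```python
# `data` reproduces the dictionary printed above (node -> paths to each leaf below it).
data = {
    "id1": [["id1", "id3"], ["id1", "id2", "id5"],
            ["id1", "id2", "id4", "id6", "id7"]],
    "id2": [["id2", "id5"], ["id2", "id4", "id6", "id7"]],
    "id4": [["id4", "id6", "id7"]],
    "id6": [["id6", "id7"]],
}

# Any node that shows up after the first position of a path is a reply to
# something else in the data, so it cannot start a complete thread.
non_roots = {node for paths in data.values() for path in paths for node in path[1:]}

# Keep only the paths whose starting node is never a reply.
full_threads = [
    path
    for start, paths in data.items()
    if start not in non_roots
    for path in paths
]
```

With the sample data only `id1` qualifies as a starting node, so `full_threads` ends up equal to `data["id1"]`.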

If you want to merge the dictionary into the dataframe:

df['replies'] = df['id'].map(data)
print(df)

# Output:
    id       text parent_id                                            replies
0  id1  blah blah  _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
1  id2  blah blah       id1                 [[id2, id5], [id2, id4, id6, id7]]
2  id3  blah blah       id1                                                 []
3  id4  blah blah       id2                                  [[id4, id6, id7]]
4  id5  blah blah       id2                                                 []
5  id6  blah blah       id4                                       [[id6, id7]]
6  id7  blah blah       id6                                                 []

Now you can explode your dataframe:

df = df.explode('replies')
print(df)

# Output:
    id       text parent_id                    replies
0  id1  blah blah  _ post _                 [id1, id3]
0  id1  blah blah  _ post _            [id1, id2, id5]
0  id1  blah blah  _ post _  [id1, id2, id4, id6, id7]
1  id2  blah blah       id1                 [id2, id5]
1  id2  blah blah       id1       [id2, id4, id6, id7]
2  id3  blah blah       id1                        NaN
3  id4  blah blah       id2            [id4, id6, id7]
4  id5  blah blah       id2                        NaN
5  id6  blah blah       id4                 [id6, id7]
6  id7  blah blah       id6                        NaN
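
If installing networkx is not an option, the same root-to-leaf enumeration can be sketched with the standard library alone. This is not the method used above, just an iterative depth-first walk over the `parent_id` links. It assumes the data forms a tree (no cycles), and the function name `root_to_leaf_paths` and the `rows` list are hypothetical.

```python
from collections import defaultdict

# Hypothetical (id, parent_id) pairs copied from the question's table.
rows = [
    ("id1", "_ post _"),
    ("id2", "id1"),
    ("id3", "id1"),
    ("id4", "id2"),
    ("id5", "id2"),
    ("id6", "id4"),
    ("id7", "id6"),
]

def root_to_leaf_paths(rows):
    """Return every path from a top-level comment down to a reply with no children."""
    ids = {node for node, _ in rows}
    children = defaultdict(list)
    for node, parent in rows:
        children[parent].append(node)

    # A root is any row whose parent is not itself a row (e.g. the "_ post _" marker).
    roots = [node for node, parent in rows if parent not in ids]

    paths = []
    stack = [[root] for root in reversed(roots)]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            paths.append(path)  # reached a leaf, so the thread is complete
        # Push in reverse so children are visited in their original order.
        for kid in reversed(kids):
            stack.append(path + [kid])
    return paths

threads = root_to_leaf_paths(rows)
```

With a dataframe shaped like the one in the question, `rows` could be built with `list(zip(df["id"], df["parent_id"]))`. The explicit stack is used instead of recursion so that very long threads do not run into Python's recursion limit.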
