Efficiently creating an edge list from a pandas DataFrame

Problem description

I have some publication data on which I want to run a co-authorship analysis. The DataFrame looks like this:

Author     Title     Pub_date     City
John A.    Paper 1   2020-01-01   Boston
Joan B.    Paper 1   2020-01-01   Boston
Jeff C.    Paper 2   2020-02-01   Chicago
Joan B.    Paper 2   2020-02-01   Chicago
Jose D.    Paper 2   2020-02-01   Chicago

I want to create an unweighted, undirected edge list that keeps the publication data as edge attributes, like this:

Node1    Node2       Title     Pub_date     City
John A.  Joan B.     Paper 1   2020-01-01   Boston
Joan B.  John A.     Paper 1   2020-01-01   Boston
Jeff C.  Joan B.     Paper 2   2020-02-01   Chicago
Jeff C.  Jose D.     Paper 2   2020-02-01   Chicago
Joan B.  Jeff C.     Paper 2   2020-02-01   Chicago
Joan B.  Jose D.     Paper 2   2020-02-01   Chicago
Jose D.  Jeff C.     Paper 2   2020-02-01   Chicago
Jose D.  Joan B.     Paper 2   2020-02-01   Chicago

I can get the basic idea working with:

edgelist = pd.merge(left=df, right=df, how='outer', on='Title')

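But I then have to do a lot of cleanup: removing the duplicated columns, renaming, and dropping rows that have no co-author. That seems inefficient to me, and I don't know how well this approach scales when the dataset is very large or has many columns.

Any suggestions for improvement would be much appreciated.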

Tags: python, pandas

Solution


Here is the data you provided:

import pandas as pd

data = pd.DataFrame(
    [["John A.", "Paper 1", "2020-01-01", "Boston"],
     ["Joan B.", "Paper 1", "2020-01-01", "Boston"],
     ["Jeff C.", "Paper 2", "2020-02-01", "Chicago"],
     ["Joan B.", "Paper 2", "2020-02-01", "Chicago"],
     ["Jose D.", "Paper 2", "2020-02-01", "Chicago"]],
    columns=["Author", "Title", "Pub_date", "City"])

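One approach pairs the first and last author listed for each paper. Note that this yields a single edge per paper, so it covers every co-author pair only when a paper has exactly two authors: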

keys = ["Title", "Pub_date", "City"]
# First and last author listed for each paper
first = data.groupby(keys, as_index=False).first().rename(columns={"Author": "Node1"})
last = data.groupby(keys, as_index=False).last().rename(columns={"Author": "Node2"})
edgelist = pd.merge(first, last, how="left", on=keys)
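
For papers with three or more authors, a self-merge on the publication columns is one possible way to get every co-author pair, as in the desired output in the question. The following is only a sketch, not the original answer's method: the helper name `coauthor_edges` and its `unique_pairs` flag are made up for illustration. It assumes that every non-author column is constant within a paper and that author names are unique within a paper.

```python
import pandas as pd


def coauthor_edges(df, author_col="Author", unique_pairs=False):
    """Pair every author with each co-author of the same paper (hypothetical helper)."""
    # Assumption: all non-author columns describe the paper and can serve as merge keys.
    keys = [c for c in df.columns if c != author_col]
    left = df.rename(columns={author_col: "Node1"})
    right = df.rename(columns={author_col: "Node2"})
    # Inner self-merge: every (Node1, Node2) combination within the same paper.
    pairs = left.merge(right, on=keys)
    if unique_pairs:
        # Keep each undirected pair once, ordered alphabetically.
        mask = pairs["Node1"] < pairs["Node2"]
    else:
        # Keep both directions; drop only the self-pairs.
        mask = pairs["Node1"] != pairs["Node2"]
    return pairs.loc[mask, ["Node1", "Node2", *keys]].reset_index(drop=True)


data = pd.DataFrame(
    [["John A.", "Paper 1", "2020-01-01", "Boston"],
     ["Joan B.", "Paper 1", "2020-01-01", "Boston"],
     ["Jeff C.", "Paper 2", "2020-02-01", "Chicago"],
     ["Joan B.", "Paper 2", "2020-02-01", "Chicago"],
     ["Jose D.", "Paper 2", "2020-02-01", "Chicago"]],
    columns=["Author", "Title", "Pub_date", "City"],
)

edgelist = coauthor_edges(data)
```

Because the author column is renamed before merging, no duplicated or suffixed columns need cleaning up afterwards, and single-author papers drop out when self-pairs are filtered. The merge builds n × n rows for a paper with n authors before filtering, so memory use could become a concern if some papers have very long author lists. Passing `unique_pairs=True` keeps each undirected pair once instead of in both directions.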
