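python - Efficiently creating an edge list from a pandas DataFrame
Problem description
I have some publication data on which I would like to run a co-authorship analysis. The DataFrame looks like this: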
Author   Title    Pub_date    City
John A.  Paper 1  2020-01-01  Boston
Joan B.  Paper 1  2020-01-01  Boston
Jeff C.  Paper 2  2020-02-01  Chicago
Joan B.  Paper 2  2020-02-01  Chicago
Jose D.  Paper 2  2020-02-01  Chicago
I want to create an unweighted, undirected edge list that keeps the publication data as edge attributes, like this:
Node1    Node2    Title    Pub_date    City
John A.  Joan B.  Paper 1  2020-01-01  Boston
Joan B.  John A.  Paper 1  2020-01-01  Boston
Jeff C.  Joan B.  Paper 2  2020-02-01  Chicago
Jeff C.  Jose D.  Paper 2  2020-02-01  Chicago
Joan B.  Jeff C.  Paper 2  2020-02-01  Chicago
Joan B.  Jose D.  Paper 2  2020-02-01  Chicago
Jose D.  Jeff C.  Paper 2  2020-02-01  Chicago
Jose D.  Joan B.  Paper 2  2020-02-01  Chicago
I can get most of the way there with:
edgelist = pd.merge(left=df, right=df, how='outer', on='Title')
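But I then have to do a lot of cleanup: dropping the duplicated columns, renaming, and removing the rows that do not pair two different co-authors. That seems inefficient to me, and I don't know how well this approach scales when the dataset is very large or has many columns.
Any suggestions for improvement would be much appreciated.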
Solution
Here is the data you provided:
import pandas as pd

# Sample data from the question, including the "Jose D." row
data = pd.DataFrame([["John A.", "Paper 1", "2020-01-01", "Boston"],
                     ["Joan B.", "Paper 1", "2020-01-01", "Boston"],
                     ["Jeff C.", "Paper 2", "2020-02-01", "Chicago"],
                     ["Joan B.", "Paper 2", "2020-02-01", "Chicago"],
                     ["Jose D.", "Paper 2", "2020-02-01", "Chicago"]],
                    columns=["Author", "Title", "Pub_date", "City"])
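Here is the solution. Note that it pairs only the first and the last author listed for each paper, so it produces one edge per paper and misses the remaining pairs when a paper has three or more authors: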
# Columns that identify a paper; they are carried along as edge attributes
keys = ["Title", "Pub_date", "City"]
# First and last author listed for each paper
first = data.groupby(by=keys).first().reset_index().rename(columns={"Author": "Node1"})
last = data.groupby(by=keys).last().reset_index().rename(columns={"Author": "Node2"})
edgelist = pd.merge(first, last, how='left', on=keys)
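If every co-author pair is needed, as in the desired output of the question, one possible alternative is to merge the frame with itself on all of the paper-level columns and then drop the rows where an author is paired with themselves. This is only a sketch and not the approach used in the answer above; the names `paper_cols`, `pairs` and `undirected` are made up for illustration. Merging on all three paper columns, rather than on `Title` alone, avoids the duplicated `Pub_date` and `City` columns mentioned in the question.

```python
import pandas as pd

# Sample data from the question
data = pd.DataFrame(
    [
        ["John A.", "Paper 1", "2020-01-01", "Boston"],
        ["Joan B.", "Paper 1", "2020-01-01", "Boston"],
        ["Jeff C.", "Paper 2", "2020-02-01", "Chicago"],
        ["Joan B.", "Paper 2", "2020-02-01", "Chicago"],
        ["Jose D.", "Paper 2", "2020-02-01", "Chicago"],
    ],
    columns=["Author", "Title", "Pub_date", "City"],
)

# Hypothetical name: the columns that identify a paper (the edge attributes)
paper_cols = ["Title", "Pub_date", "City"]

# Self-merge on the paper columns: every author of a paper is paired with
# every author of the same paper. Only "Author" overlaps, so only it gets
# the suffixes.
pairs = data.merge(data, on=paper_cols, suffixes=("_1", "_2"))
pairs = pairs.rename(columns={"Author_1": "Node1", "Author_2": "Node2"})

# Drop self-pairs and put the node columns first
edgelist = pairs.loc[
    pairs["Node1"] != pairs["Node2"], ["Node1", "Node2"] + paper_cols
].reset_index(drop=True)

# Optional: keep a single row per undirected edge (A-B but not B-A)
undirected = edgelist[edgelist["Node1"] < edgelist["Node2"]].reset_index(drop=True)

print(edgelist)
```

The self-merge creates a number of rows that grows with the square of the number of authors per paper, so for papers with very long author lists it may be worth building the pairs per group instead (for example with `itertools.combinations`).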