python - Convert a pandas DataFrame column into a networkx graph with source and target
Question
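I have a DataFrame in pandas that records where people were located over time. It has more than 300 million rows.
Here is a sample in which each Name is assigned a unique index with groupby on Name, and the rows are sorted by Name and Year: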
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
df.sort_values(['Name', 'Year'], ascending=[False, True])
Output:
+-------+-------+------+---------------+----------------------+
| Index | Name | Year | Address | Author_Grouped_Index |
+-------+-------+------+---------------+----------------------+
| 5 | Steve | 2018 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 15 | Steve | 2018 | NewYork | 1 |
+-------+-------+------+---------------+----------------------+
| 16 | Steve | 2018 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 6 | Steve | 2019 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 7 | Steve | 2019 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 8 | Steve | 2020 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 9 | Steve | 2020 | Canada | 1 |
+-------+-------+------+---------------+----------------------+
| 13 | Steve | 2021 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 14 | Steve | 2022 | California | 1 |
+-------+-------+------+---------------+----------------------+
| 17 | Steve | 2022 | NewYork | 1 |
+-------+-------+------+---------------+----------------------+
| 0 | John | 2018 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 1 | John | 2018 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 2 | John | 2019 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
| 3 | John | 2019 | Orange county | 0 |
+-------+-------+------+---------------+----------------------+
| 4 | John | 2019 | New York | 0 |
+-------+-------+------+---------------+----------------------+
| 10 | John | 2020 | Canada | 0 |
+-------+-------+------+---------------+----------------------+
| 11 | John | 2021 | Canada | 0 |
+-------+-------+------+---------------+----------------------+
| 12 | John | 2021 | Beverly hills | 0 |
+-------+-------+------+---------------+----------------------+
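I want to get the network graph matrix (adjacency matrix) to see the total number of moves between addresses. In other words, for example, how many people moved from 'Canada' to 'California' in 2018.
Ideal output:
1) A directed graph of the Address column. Technically, this converts the Address column into two columns, 'Source' and 'Target', where each row's 'Target' value is the 'Source' of the next row. Ideally, repeated pairs would be counted in another column, 'Weight', rather than listed more than once.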
+------------+------------+------+--------+
| Source | Target | Year | Weight |
+------------+------------+------+--------+
| Canada | NewYork | 2018 | |
+------------+------------+------+--------+
| NewYork | California | 2018 | |
+------------+------------+------+--------+
| California | Canada | 2019 | |
+------------+------------+------+--------+
| Canada | Canada | 2019 | |
+------------+------------+------+--------+
| Canada | California | 2020 | |
+------------+------------+------+--------+
| California | Canada | 2020 | |
+------------+------------+------+--------+
| Canada | California | 2021 | |
+------------+------------+------+--------+
| California | California | 2022 | |
+------------+------------+------+--------+
| California | NewYork | 2022 | |
+------------+------------+------+--------+
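Option 1 can be sketched in pandas without a Python-level loop, which matters at hundreds of millions of rows. The snippet below is a minimal illustration on a small hypothetical sample, not the full data from the question. It assumes that 'Year' on each edge is the year of the row the person moved to, which is how the table above appears to be filled in, and that rows with the same Name and Year are already in the intended order.

```python
import pandas as pd

# Small hypothetical sample with the same columns as the question's data.
df = pd.DataFrame({
    "Name": ["John", "John", "Steve", "Steve", "Steve", "Steve"],
    "Year": [2018, 2019, 2018, 2019, 2019, 2020],
    "Address": ["Canada", "NewYork", "Canada", "NewYork", "NewYork", "Canada"],
})

# Order each person's rows chronologically.
df = df.sort_values(["Name", "Year"])

# The previous address of the same person is the source of the move.
# The first row of each person has no source and becomes NaN.
df["Source"] = df.groupby("Name")["Address"].shift()

edges = (
    df.dropna(subset=["Source"])
      .rename(columns={"Address": "Target"})
      .groupby(["Source", "Target", "Year"])
      .size()                       # count identical (Source, Target, Year) rows
      .reset_index(name="Weight")
)
print(edges)
```

Dropping 'Year' from the groupby keys would give one weighted edge per (Source, Target) pair across all years.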
OR
2) A matrix showing the total number of moves between addresses.
+---------------+--------+---------+------------+---------------+---------------+
| From \ To | Canada | NewYork | California | Beverly hills | Orange county |
+---------------+--------+---------+------------+---------------+---------------+
| Canada | 2 | 2 | 2 | 2 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| NewYork | 1 | 0 | 1 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| California | 2 | 1 | 1 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
| Beverly hills | 0 | 0 | 0 | 2 | 1 |
+---------------+--------+---------+------------+---------------+---------------+
| Orange county | 0 | 1 | 0 | 0 | 0 |
+---------------+--------+---------+------------+---------------+---------------+
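One possible way to build a From/To matrix like this directly in pandas is pd.crosstab on the (previous address, current address) pairs. This is a minimal sketch on a small hypothetical sample. The names 'moves', 'places' and 'matrix' are illustrative only, and rows with the same Name and Year are assumed to be in the intended order already.

```python
import pandas as pd

# Small hypothetical sample with the same columns as the question's data.
df = pd.DataFrame({
    "Name": ["John", "John", "Steve", "Steve", "Steve", "Steve"],
    "Year": [2018, 2019, 2018, 2019, 2019, 2020],
    "Address": ["Canada", "NewYork", "Canada", "NewYork", "NewYork", "Canada"],
})
df = df.sort_values(["Name", "Year"])

# Pair every address with the same person's previous address.
moves = pd.DataFrame({
    "Source": df.groupby("Name")["Address"].shift(),
    "Target": df["Address"],
}).dropna()

# Count how often each (Source, Target) pair occurs.
matrix = pd.crosstab(moves["Source"], moves["Target"])

# Use the same labels on both axes so the matrix is square,
# filling pairs that never occur with 0.
places = list(pd.unique(df["Address"]))
matrix = matrix.reindex(index=places, columns=places, fill_value=0)
print(matrix)
```

Rows are the address moved from and columns are the address moved to, matching the 'From \ To' layout above.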
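Answer
This isn't the prettiest code, but at least you can follow each step. I went with the second option, because you can easily build a graph from this connection matrix. Do you need help making the networkx graph? The rows and columns of the matrix follow the places list that the code builds, so the labels are 'Beverly hills', 'Orange county', 'New York', 'Canada', 'California' and 'NewYork'. New York is spelled differently for each person ('New York' for John, 'NewYork' for Steve), so it shows up twice.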
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
# sort_values returns a new DataFrame, so the result has to be assigned back
df = df.sort_values(['Name', 'Year'], ascending=[False, True])
print(df)

dictionary_ = {}  # addresses visited by each person, in row order
places = []       # every distinct address, in order of first appearance
for index, row in df.iterrows():
    dictionary_.setdefault(row['Author_Grouped_Index'], []).append(row['Address'])
    if row['Address'] not in places:
        places.append(row['Address'])
print(dictionary_)

new_dictionary = {}  # number of times each (from, to) move occurs
for key, value in dictionary_.items():
    for x in range(len(value) - 1):
        # a tuple key avoids ambiguity if an address itself contains "-"
        move = (value[x], value[x + 1])
        new_dictionary[move] = new_dictionary.get(move, 0) + 1
print(new_dictionary)
print(places)

import numpy as np
array = np.zeros((len(places), len(places)), dtype=int)
for x, place in enumerate(places):
    for y, place_2 in enumerate(places):
        # pairs that never occur stay 0
        array[x, y] = new_dictionary.get((place, place_2), 0)
print(array)
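Since the question's title asks for a networkx graph, here is a minimal sketch of loading a weighted edge list into a directed graph. The small 'edges' frame is hypothetical and uses the Source/Target/Weight column names from the question. It assumes the weights are already summed per (Source, Target) pair, because a DiGraph keeps only one edge per pair and later duplicates would overwrite earlier ones.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list, one row per (Source, Target) pair.
edges = pd.DataFrame({
    "Source": ["Canada", "NewYork", "NewYork"],
    "Target": ["NewYork", "NewYork", "Canada"],
    "Weight": [2, 1, 1],
})

# Build a directed graph and keep Weight as an edge attribute.
G = nx.from_pandas_edgelist(
    edges,
    source="Source",
    target="Target",
    edge_attr="Weight",
    create_using=nx.DiGraph,
)

# Read a single edge weight back.
print(G["Canada"]["NewYork"]["Weight"])

# Convert the graph back to a labelled adjacency matrix.
adjacency = nx.to_pandas_adjacency(G, weight="Weight", dtype=int)
print(adjacency)
```

If one edge per year is needed instead, nx.MultiDiGraph with 'Year' as an additional edge attribute would be one option to consider.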