Pandas - Sort by last page viewed

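Problem description

Can anyone help me sort rows in order of the last page viewed?

I have a dataframe that I am trying to sort by the previous page viewed, and I am struggling to come up with an efficient way to do it in Pandas.

For example, from this: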

+------------+------------------+----------+
|  Customer  | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | A                | D        |
| 1051471580 | C                | B        |
| 1051471580 | A                | exit     |
| 1051471580 | B                | A        |
| 1051471580 | D                | A        |
| 1051471580 | entrance         | C        |
+------------+------------------+----------+

To this:

+------------+------------------+----------+
|  Customer  | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | entrance         | C        |
| 1051471580 | C                | B        |
| 1051471580 | B                | A        |
| 1051471580 | A                | D        |
| 1051471580 | D                | A        |
| 1051471580 | A                | exit     |
+------------+------------------+----------+

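However, the data could run to millions of rows across thousands of different customers, so I really need to think about how to make this efficient.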

import pandas as pd

df = pd.DataFrame({
    'Customer': '1051471580',
    'previousPagePath': ['E', 'C', 'B', 'A', 'D', 'A'],
    'pagePath': ['C', 'B', 'A', 'D', 'A', 'F']
})

Thanks!

Tags: python, pandas, google-analytics

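Solution

What you are trying to do is a topological sort, which can be done with networkx. Note that I had to change some values in your dataframe to keep it from raising a cycle error, so I hope the data you are working with contains unique values: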

import networkx as nx
import pandas as pd

data = [
    [1051471580, "Z", "D"],
    [1051471580, "C", "B"],
    [1051471580, "A", "exit"],
    [1051471580, "B", "Z"],
    [1051471580, "D", "A"],
    [1051471580, "entrance", "C"],
]
df = pd.DataFrame(data, columns=['Customer', 'previousPagePath', 'pagePath'])

# Drop rows where the page did not change (self-loops), then build a directed graph
edges = df[df.pagePath != df.previousPagePath].reset_index()
dg = nx.from_pandas_edgelist(edges, source='previousPagePath', target='pagePath', create_using=nx.DiGraph())
# Topological order of the pages; this raises if the graph contains a cycle
order = list(nx.lexicographical_topological_sort(dg))
# Reorder rows by previousPagePath; the last node ("exit") is never a previous page
result = df.set_index('previousPagePath').loc[order[:-1], :].dropna().reset_index()
result = result[['Customer', 'previousPagePath', 'pagePath']]

Output:

|    |   Customer | previousPagePath   | pagePath   |
|---:|-----------:|:-------------------|:-----------|
|  0 | 1051471580 | entrance           | C          |
|  1 | 1051471580 | C                  | B          |
|  2 | 1051471580 | B                  | Z          |
|  3 | 1051471580 | Z                  | D          |
|  4 | 1051471580 | D                  | A          |
|  5 | 1051471580 | A                  | exit       |
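
The answer above sorts a single customer whose path never revisits a page. Since the question mentions many customers, here is a minimal sketch of how the same idea might be applied per customer. The helper name `sort_path`, the second customer ID, and the choice to raise `ValueError` when a path contains a cycle are illustrative assumptions, not part of the original answer, and the sketch is not tuned for the row counts mentioned in the question.

```python
import networkx as nx
import pandas as pd


def sort_path(group: pd.DataFrame) -> pd.DataFrame:
    """Order one customer's rows by a topological sort of the page transitions.

    Hypothetical helper: assumes the customer's transitions form an acyclic graph.
    """
    dg = nx.from_pandas_edgelist(
        group,
        source="previousPagePath",
        target="pagePath",
        create_using=nx.DiGraph(),
    )
    if not nx.is_directed_acyclic_graph(dg):
        # A revisited page (e.g. A -> D -> A) forms a cycle, so no topological order exists.
        raise ValueError("page path contains a cycle")
    # Rank each page by its position in the topological order.
    rank = {page: i for i, page in enumerate(nx.topological_sort(dg))}
    # Sort the rows by the rank of the page they came from.
    return (
        group.assign(_rank=group["previousPagePath"].map(rank))
        .sort_values("_rank", kind="stable")
        .drop(columns="_rank")
    )


df = pd.DataFrame(
    [
        [1051471580, "Z", "D"],
        [1051471580, "C", "B"],
        [1051471580, "A", "exit"],
        [1051471580, "B", "Z"],
        [1051471580, "D", "A"],
        [1051471580, "entrance", "C"],
        # Customer 2 is made up for illustration only.
        [2, "B", "exit"],
        [2, "entrance", "B"],
    ],
    columns=["Customer", "previousPagePath", "pagePath"],
)

# Sort each customer's rows separately, then stitch the pieces back together.
result = pd.concat(
    [sort_path(g) for _, g in df.groupby("Customer", sort=False)],
    ignore_index=True,
)
print(result)
```

With the question's original data, where page A appears twice in the same visit, `sort_path` would raise rather than return an order, because the transitions alone do not determine which visit to A came first.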
