python - Pivot on a single column, but with a diagonal view of historical customer visits

Problem

The data looks like this:

import pandas as pd
df = pd.DataFrame({'CustID': [1,2,3, 2,4,5,1, 6,5,2,7,3,8,9,5,4], 'YearVisited': [2013,2013, 2014, 2014, 2015, 2018, 2019, 2019, 2019, 2020, 2020, 2020,  2020, 2020, 2020,2020]})
sorted_df = df.sort_values(['YearVisited','CustID'], ascending=[True, True])
sorted_df

A) I have been unable to produce the following view:

Year	2013	2014	2015	2018	2019	2020
2013	2	1	0	0	1	1
2014	0	2	0	0	0	1
2015	0	0	1	0	0	1
2016	0	0	0	0	0	0
2017	0	0	0	0	0	0
2018	0	0	0	1	1	1
2019	0	0	0	0	3	0
2020	0	0	0	0	0	7

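So the diagonal reflects the total number of customers in each given year, while returning customers are shown above that total, in the row for the year of their first visit.

By way of explanation: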

In 2014 we had 2 customers, 1 of whom was returning from 2013.
In 2015 only new customers visited; none returned from previous years.
In 2019 we had 3 customers, 2 of whom were returning: 1 from 2013 and 1 from 2018.
In 2020 we had 7 customers, 4 of whom were returning: 1 each from 2013, 2014, 2015 and 2018.
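
The breakdown above can be checked directly against the sample data. This is a minimal sketch, assuming a customer's "first visit" is the earliest YearVisited recorded for that CustID; the column name FirstYear and the variables returning and breakdown are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'CustID': [1, 2, 3, 2, 4, 5, 1, 6, 5, 2, 7, 3, 8, 9, 5, 4],
    'YearVisited': [2013, 2013, 2014, 2014, 2015, 2018, 2019, 2019,
                    2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
})

# FirstYear (hypothetical column name): earliest year in which each customer appears.
df['FirstYear'] = df.groupby('CustID')['YearVisited'].transform('min')

# A visit counts as "returning" when it falls after the customer's first year.
returning = df[df['YearVisited'] > df['FirstYear']]

# Number of returning visits per year, split by the year of first visit.
breakdown = returning.groupby(['YearVisited', 'FirstYear']).size()
print(breakdown)
```

Years with no returning customers, such as 2015, simply do not appear in breakdown.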

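These are the closest things I have found to what I am after, but I have struggled to adapt them to my own goal: i) Rotate dataframe through the diagonal ii) pandas pivot table with same rows and columns

I have tried a variation on this, but it is clearly not what I actually need: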

df['YearVisitedCopy'] = df['YearVisited']
result = (df.assign(count=df.groupby("YearVisited").cumcount())
            .pivot(index='YearVisited', columns='count'))

result.columns = ["_".join(str(x) for x in i) for i in result.columns]

print (result)

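Update / additional information:

Running on Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]. (Thanks to Alex for improving the formatting of my question; I will take note for the future.)
The row count is not large, at ~1035431, so optimisation is not a priority.
Using additional libraries is fine, e.g. solutions based on numpy.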

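- If it is easier to achieve, it would also be fine for the diagonal to show only the new customers of that year, rather than the total including customers from previous years. I am happy for the totals to be omitted or to appear elsewhere, e.g. as in B.

B)

Year	2013	2014	2015	2018	2019	2020
2013	2	1	0	0	1	1
2014	0	1	0	0	0	1
2015	0	0	1	0	0	1
2016	0	0	0	0	0	0
2017	0	0	0	0	0	0
2018	0	0	0	1	1	1
2019	0	0	0	0	1	0
2020	0	0	0	0	0	3
Total	2	2	1	1	3	7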

Tags: python, pandas, dataframe, pivot

Solution

It is still a bit messy, but here is one way to solve your problem:

import numpy as np
import pandas as pd

cust_ids = [1,2,3, 2,4,5,1, 6,5,2,7,3,8,9,5,4]
years = [2013,2013, 2014, 2014, 2015, 2018, 2019, 2019, 2019, 2020, 2020, 2020,  2020, 2020, 2020,2020]

min_year, max_year = min(years), max(years)
n_years = max_year - min_year + 1
year_matrix = np.zeros((n_years, n_years), dtype=int)

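# Maps each customer to the year in which they are first seen.
first_year = {}
for c, y in zip(cust_ids, years):
    # Row: year of the customer's first visit; column: year of this visit.
    # setdefault records the first year seen, so the input must be in chronological order.
    year_matrix[min(y, first_year.setdefault(c, y)) - min_year, y - min_year] += 1

# Column sums give the total number of customers per year.
totals = year_matrix.sum(axis=0)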
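
The loop above depends on the visits being listed in chronological order, and it counts every row, so a customer recorded twice in the same year would be counted twice. If neither can be guaranteed, a two-pass variant is one option. This is only a sketch: the function name cohort_matrix is hypothetical, and counting each (customer, year) pair once is an assumption about the intended meaning of "customers", not something stated in the question.

```python
import numpy as np

def cohort_matrix(cust_ids, years):
    """Count customers by (first-visit year, visit year), independent of input order."""
    min_year, max_year = min(years), max(years)
    n_years = max_year - min_year + 1
    matrix = np.zeros((n_years, n_years), dtype=int)

    # First pass: earliest year recorded for each customer.
    first_year = {}
    for c, y in zip(cust_ids, years):
        first_year[c] = min(y, first_year.get(c, y))

    # Second pass: one count per distinct (customer, year) pair (assumed de-duplication).
    for c, y in set(zip(cust_ids, years)):
        matrix[first_year[c] - min_year, y - min_year] += 1
    return matrix
```

For the sample data, which is already ordered and has no repeated pairs, this should produce the same matrix as the single loop.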

Solution B)

df_b = pd.DataFrame(np.vstack((year_matrix, totals)), columns=range(min_year, max_year + 1), index=list(range(min_year, max_year + 1)) + ['total'])
print(df_b)

       2013  2014  2015  2016  2017  2018  2019  2020
2013      2     1     0     0     0     0     1     1
2014      0     1     0     0     0     0     0     1
2015      0     0     1     0     0     0     0     1
2016      0     0     0     0     0     0     0     0
2017      0     0     0     0     0     0     0     0
2018      0     0     0     0     0     1     1     1
2019      0     0     0     0     0     0     1     0
2020      0     0     0     0     0     0     0     3
total     2     2     1     0     0     1     3     7

Solution A)

np.fill_diagonal(year_matrix, totals)
df_a = pd.DataFrame(year_matrix, columns=range(min_year, max_year + 1), index=range(min_year, max_year + 1))
print(df_a)

      2013  2014  2015  2016  2017  2018  2019  2020
2013     2     1     0     0     0     0     1     1
2014     0     2     0     0     0     0     0     1
2015     0     0     1     0     0     0     0     1
2016     0     0     0     0     0     0     0     0
2017     0     0     0     0     0     0     0     0
2018     0     0     0     0     0     1     1     1
2019     0     0     0     0     0     0     3     0
2020     0     0     0     0     0     0     0     7
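
Since the question asked about a pivot, the same two views can probably also be built with pandas alone. The following is an alternative sketch, not the method used in the answer above: it assumes a customer's first-visit year is the minimum YearVisited per CustID, and the names FirstYear, view_a and view_b are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CustID': [1, 2, 3, 2, 4, 5, 1, 6, 5, 2, 7, 3, 8, 9, 5, 4],
    'YearVisited': [2013, 2013, 2014, 2014, 2015, 2018, 2019, 2019,
                    2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
})

# Earliest visit year per customer, aligned row by row with df.
first = df.groupby('CustID')['YearVisited'].transform('min').rename('FirstYear')

# Rows: first-visit year; columns: visit year.
# Reindexing adds the years in which nobody visited (2016 and 2017 here).
all_years = range(df['YearVisited'].min(), df['YearVisited'].max() + 1)
view_b = (pd.crosstab(first, df['YearVisited'])
            .reindex(index=all_years, columns=all_years, fill_value=0))

# View A: overwrite the diagonal with each year's total customer count.
values = view_b.to_numpy(copy=True)
np.fill_diagonal(values, view_b.sum(axis=0).to_numpy())
view_a = pd.DataFrame(values, index=view_b.index, columns=view_b.columns)

print(view_b)
print(view_a)
```

pd.crosstab counts rows, so like the loop in the answer it counts a customer twice if the same CustID appears twice in one year; calling drop_duplicates on df first would avoid that if it matters.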
