Pivoting on a single column, with a diagonal view of historical customer visits

Problem description

The data looks like this:

import pandas as pd
df = pd.DataFrame({'CustID': [1,2,3, 2,4,5,1, 6,5,2,7,3,8,9,5,4], 'YearVisited': [2013,2013, 2014, 2014, 2015, 2018, 2019, 2019, 2019, 2020, 2020, 2020,  2020, 2020, 2020,2020]})
sorted_df = df.sort_values(['YearVisited','CustID'], ascending=[True, True])
sorted_df

A) I have not been able to produce the following view:

      2013  2014  2015  2016  2017  2018  2019  2020
2013     2     1     0     0     0     0     1     1
2014     0     2     0     0     0     0     0     1
2015     0     0     1     0     0     0     0     1
2016     0     0     0     0     0     0     0     0
2017     0     0     0     0     0     0     0     0
2018     0     0     0     0     0     1     1     1
2019     0     0     0     0     0     0     3     0
2020     0     0     0     0     0     0     0     7

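So the diagonal reflects the total number of customers in each given year, while returning customers are counted above the diagonal, in the row of the year they first visited.

By way of explanation: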

in 2014, we had 2 customers, 1 of whom was a returning one from 2013
in 2015 only new customers visited, no returning from previous years
in 2019 3 customers, 2 were returning 1x2013, 1x2018
in 2020 we had 7 customers, 4 were returning, 1x2013, 1x2014, 1x2015, 1x2018

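These are the closest things I have found to what I am after, but I have been struggling to adapt them to my own goal: i) Rotating a dataframe by the diagonal ii) Pandas pivot table with the same rows and columns

I have tried a variation on this, but it is clearly not what I actually need: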

df['YearVisitedCopy'] = df['YearVisited']
result = (df.assign(count=df.groupby("YearVisited").cumcount())
            .pivot(index='YearVisited', columns='count'))

result.columns = ["_".join(str(x) for x in i) for i in result.columns]

print (result)

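Update / additional information:

- If it is easier to achieve, it would also be fine for the diagonal to show only that year's new customers rather than the total including returning customers from earlier years. I am happy for the totals to be omitted or shown elsewhere, as in B.

B)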

       2013  2014  2015  2016  2017  2018  2019  2020
2013      2     1     0     0     0     0     1     1
2014      0     1     0     0     0     0     0     1
2015      0     0     1     0     0     0     0     1
2016      0     0     0     0     0     0     0     0
2017      0     0     0     0     0     0     0     0
2018      0     0     0     0     0     1     1     1
2019      0     0     0     0     0     0     1     0
2020      0     0     0     0     0     0     0     3
Total     2     2     1     0     0     1     3     7

Tags: python, pandas, dataframe, pivot

Solution

It is still a bit messy, but here is one way to solve your problem:

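import numpy as np
import pandas as pd

cust_ids = [1, 2, 3, 2, 4, 5, 1, 6, 5, 2, 7, 3, 8, 9, 5, 4]
years = [2013, 2013, 2014, 2014, 2015, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020]

min_year, max_year = min(years), max(years)
n_years = max_year - min_year + 1
# Rows: year of a customer's first visit; columns: year of the visit.
year_matrix = np.zeros((n_years, n_years), dtype=int)

# Maps each customer to the first year seen (the visits above are listed in chronological order).
first_year = {}
for c, y in zip(cust_ids, years):
    # setdefault stores y on a customer's first visit and returns the stored first year afterwards.
    year_matrix[min(y, first_year.setdefault(c, y)) - min_year, y - min_year] += 1

# Column sums: total number of visits in each year.
totals = year_matrix.sum(axis=0)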
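
The `setdefault` call in the loop does two jobs at once, which makes that line dense. A minimal sketch of the same idea on a made-up list of three visits (the `visits` data and the `pairs` list are illustrative only, not part of the original answer) may make the row/column choice easier to follow:

```python
# Hypothetical mini-example: three visits by two customers, in chronological order.
visits = [(1, 2013), (2, 2014), (1, 2019)]

first_year = {}
pairs = []
for cust, year in visits:
    # setdefault stores `year` only if `cust` is not yet a key,
    # and in either case returns the year stored for that customer.
    first = first_year.setdefault(cust, year)
    # (row, column) = (year of first visit, year of this visit)
    pairs.append((first, year))

print(pairs)
print(first_year)
```

Customer 1's 2019 visit is paired with 2013, so in the full matrix it would be counted in the 2013 row and the 2019 column.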

Solution B)

df_b = pd.DataFrame(np.vstack((year_matrix, totals)), columns=range(min_year, max_year + 1), index=list(range(min_year, max_year + 1)) + ['total'])
print(df_b)
       2013  2014  2015  2016  2017  2018  2019  2020
2013      2     1     0     0     0     0     1     1
2014      0     1     0     0     0     0     0     1
2015      0     0     1     0     0     0     0     1
2016      0     0     0     0     0     0     0     0
2017      0     0     0     0     0     0     0     0
2018      0     0     0     0     0     1     1     1
2019      0     0     0     0     0     0     1     0
2020      0     0     0     0     0     0     0     3
total     2     2     1     0     0     1     3     7

Solution A)

# fill_diagonal modifies year_matrix in place, so run this after df_b has been built.
np.fill_diagonal(year_matrix, totals)
df_a = pd.DataFrame(year_matrix, columns=range(min_year, max_year + 1), index=range(min_year, max_year + 1))
print(df_a)
      2013  2014  2015  2016  2017  2018  2019  2020
2013     2     1     0     0     0     0     1     1
2014     0     2     0     0     0     0     0     1
2015     0     0     1     0     0     0     0     1
2016     0     0     0     0     0     0     0     0
2017     0     0     0     0     0     0     0     0
2018     0     0     0     0     0     1     1     1
2019     0     0     0     0     0     0     3     0
2020     0     0     0     0     0     0     0     7
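
Since the question is tagged pandas, the same two views can probably also be built without the explicit loop. The following is a sketch of an alternative, not the method used in the answer above: it assumes a helper column named `FirstYear` (a name introduced here for illustration) holding each customer's earliest `YearVisited`, and uses `pd.crosstab` plus `reindex` to fill in the years with no visits.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'CustID': [1, 2, 3, 2, 4, 5, 1, 6, 5, 2, 7, 3, 8, 9, 5, 4],
    'YearVisited': [2013, 2013, 2014, 2014, 2015, 2018, 2019, 2019,
                    2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
})

# Earliest visit year per customer, broadcast back onto every visit row.
# 'FirstYear' is an illustrative column name, not part of the original data.
df['FirstYear'] = df.groupby('CustID')['YearVisited'].transform('min')

# Include years with no visits at all (2016, 2017) as all-zero rows and columns.
all_years = range(df['YearVisited'].min(), df['YearVisited'].max() + 1)

# View B: rows are first-visit years, columns are visit years.
view_b = (pd.crosstab(df['FirstYear'], df['YearVisited'])
            .reindex(index=all_years, columns=all_years, fill_value=0))

# View A: same counts, but the diagonal is replaced by the yearly totals.
values = view_b.to_numpy(copy=True)
np.fill_diagonal(values, values.sum(axis=0))
view_a = pd.DataFrame(values, index=view_b.index, columns=view_b.columns)

print(view_b)
print(view_a)
```

Like the loop version, this counts visits rather than distinct customers, so a customer appearing twice in the same year would be counted twice. Unlike the loop, it does not depend on the rows being in chronological order.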
