How can I convert row data to column data in pandas while repeating and shifting dates?

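Problem description

Apologies for the vague question; it is hard to know how to phrase it.

I have a DataFrame like the one below, where every STORE/PRODUCT combination has a contiguous date range associated with it: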

DATE       | STORE | PRODUCT | UNITS_SOLD |  
-------------------------------------------
2020-01-01 |     1 | Beans   | 10         |
2020-01-02 |     1 | Beans   | 5          |
2020-01-03 |     1 | Beans   | 6          |
2020-01-01 |     2 | Carrots | 6          |
2020-01-02 |     2 | Carrots | 7          |
2020-01-03 |     2 | Carrots | 3          |
...        |   ... | ...     | ...        |

I want to convert it into a DataFrame that looks like this:

STORE | PRODUCT | DATE       | UNITS_SOLD_NEXT_1_DAYS | UNITS_SOLD_NEXT_2_DAYS | 
--------------------------------------------------------------------------------
1     | Beans   | 2020-01-01 | 5                      | 6                      |
1     | Beans   | 2020-01-02 | 6                      | NaN                    |
1     | Beans   | 2020-01-03 | NaN                    | NaN                    |
2     | Carrots | 2020-01-01 | 7                      | 3                      |
2     | Carrots | 2020-01-02 | 3                      | NaN                    |
2     | Carrots | 2020-01-03 | NaN                    | NaN                    |
...   | ...     | ...        | ...                    | ...                    |

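I can get one row per STORE/PRODUCT combination by using df.pivot_table(), but I am having trouble producing the repeated rows with staggered values.

Tags: python, pandas, dataframe

Solution


It may not be the most robust or the prettiest solution, but some fancy numpy indexing seems to get you there.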

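import numpy as np
import pandas as pd

def waterfall(series):
    units_sold = series.to_numpy()
    n_days = units_sold.shape[0]

    # Square array of NaN: row i is the i-th observation, column j is "j observations ahead".
    shape = (n_days, n_days)
    out = np.full(shape, np.nan)
    # Index pairs (r, c) of the upper triangle, i.e. every c >= r.
    r, c = np.triu_indices(n_days)
    # Shift each row left by its row number so that it starts in column 0.
    out[r, c-r] = np.take(units_sold, c)

    columns = ["next_{}_days".format(i) for i in range(1, n_days)]
    columns.insert(0, "units_sold")
    return pd.DataFrame(out, columns=columns)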

    

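# Assumes lowercase column names (date, store, product, units_sold) and that df
# is already sorted by store and product, so that df.date lines up row by row.
new_df = (df.groupby(["store", "product"])["units_sold"]
          .apply(waterfall)
          .reset_index(level=-1, drop=True)
          .set_index(df.date, append=True)
         )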

print(new_df)
                          units_sold  next_1_days  next_2_days  next_3_days
store product date                                                         
1     Beans   2020-01-01        10.0          5.0          6.0          NaN
              2020-01-02         5.0          6.0          NaN          NaN
              2020-01-03         6.0          NaN          NaN          NaN
2     Carrots 2020-01-01         6.0          7.0          3.0          9.0
              2020-01-02         7.0          3.0          9.0          NaN
              2020-01-03         3.0          9.0          NaN          NaN
              2020-01-04         9.0          NaN          NaN          NaN
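
To make the indexing step easier to follow, here is a small sketch that applies the same np.triu_indices trick to the three Beans values from the question, outside of any groupby. The array values come from the question; everything else mirrors the waterfall function above.

```python
import numpy as np

units_sold = np.array([10.0, 5.0, 6.0])  # Beans, store 1
n_days = units_sold.shape[0]

# (row, col) pairs of the upper triangle, diagonal included:
# r is [0, 0, 0, 1, 1, 2] and c is [0, 1, 2, 1, 2, 2].
r, c = np.triu_indices(n_days)

out = np.full((n_days, n_days), np.nan)
# Row r is the starting observation, column c - r is how many observations
# ahead we look, and units_sold[c] is the value seen at that later point.
out[r, c - r] = units_sold[c]

# out now holds, row by row:
#   10, 5, 6
#   5, 6, NaN
#   6, NaN, NaN
print(out)
```

Each row is the original series shifted left by its row number, which is why the first column reproduces units_sold and the remaining columns are the "next N" values.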

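This solution does not take consecutive days into account, only consecutive observations. So if the dates within a store are not contiguous, your column headers may be off. Hope this helps!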
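
If gaps in the dates are a concern, one possible alternative (not the approach in the answer above) is to look up each future value by calendar date with a self-merge. The sketch below is only an illustration under some assumptions: the column names follow the question (DATE, STORE, PRODUCT, UNITS_SOLD), DATE is a datetime column, each STORE/PRODUCT/DATE combination appears at most once, and next_calendar_days is a hypothetical helper name.

```python
import pandas as pd

def next_calendar_days(frame, horizons=(1, 2)):
    """Hypothetical helper: units sold exactly k calendar days later."""
    keys = ["STORE", "PRODUCT", "DATE"]
    out = frame[keys].copy()
    for k in horizons:
        future = frame[keys + ["UNITS_SOLD"]].copy()
        # Move every observation back k days so that it lines up with the
        # row that should report it as "k days ahead".
        future["DATE"] = future["DATE"] - pd.Timedelta(days=k)
        future = future.rename(
            columns={"UNITS_SOLD": f"UNITS_SOLD_NEXT_{k}_DAYS"}
        )
        # A left merge keeps every original row; missing days become NaN.
        out = out.merge(future, on=keys, how="left")
    return out

# Sample data based on the question, with 2020-01-03 removed for Beans
# to create a gap (this modification is an assumption for illustration).
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04",
                            "2020-01-01", "2020-01-02", "2020-01-03"]),
    "STORE": [1, 1, 1, 2, 2, 2],
    "PRODUCT": ["Beans"] * 3 + ["Carrots"] * 3,
    "UNITS_SOLD": [10, 5, 6, 6, 7, 3],
})

result = next_calendar_days(df)
print(result)
```

With this data, the Beans row for 2020-01-02 gets NaN for the next day (2020-01-03 is missing) and 6 for two days ahead, so the column headers stay accurate even when observations are skipped.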

