Python: Converting a DataFrame from a list of dates to Date From & Date To format

Problem description

I have a DataFrame that looks like this:

+------------+------+-------+
| Date       | Item | Value |
+------------+------+-------+
| 2020-01-01 | A    | 100   |
+------------+------+-------+
| 2020-01-01 | B    | 80    |
+------------+------+-------+
| 2020-01-01 | C    | 70    |
+------------+------+-------+
| 2020-01-02 | A    | 102   |
+------------+------+-------+
| 2020-01-02 | B    | 82    |
+------------+------+-------+
| 2020-01-02 | C    | 65    |
+------------+------+-------+
| 2020-01-05 | B    | 81    |
+------------+------+-------+
| 2020-01-05 | C    | 70    |
+------------+------+-------+
| 2020-01-05 | D    | 20    |
+------------+------+-------+

I want to convert it to the following format:

+------+------------+------------+------------+----------+
| Item | Date From  | Date To    | Value From | Value To |
+------+------------+------------+------------+----------+
| A    | 2020-01-01 | 2020-01-02 | 100        | 102      |
+------+------------+------------+------------+----------+
| B    | 2020-01-01 | 2020-01-02 | 80         | 82       |
+------+------------+------------+------------+----------+
| C    | 2020-01-01 | 2020-01-02 | 70         | 65       |
+------+------------+------------+------------+----------+
| A    | 2020-01-02 | 2020-01-05 | 102        | NaN      |
+------+------------+------------+------------+----------+
| B    | 2020-01-02 | 2020-01-05 | 82         | 81       |
+------+------------+------------+------------+----------+
| C    | 2020-01-02 | 2020-01-05 | 65         | 70       |
+------+------------+------------+------------+----------+
| D    | 2020-01-02 | 2020-01-05 | NaN        | 20       |
+------+------------+------------+------------+----------+

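In other words, I want to convert a series of dated values into a range format, but I cannot for the life of me figure out how to do it. I have tried using shift, but I can't get it quite right. A few things to note:

Any help with this would be greatly appreciated.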

Tags: python, pandas

Solution


import numpy as np
import pandas as pd

data = [
    ['2020-01-01', 'A', 100],
    ['2020-01-01', 'B', 80],
    ['2020-01-01', 'C', 70],
    ['2020-01-02', 'A', 102],
    ['2020-01-02', 'B', 82],
    ['2020-01-02', 'C', 65],
    ['2020-01-05', 'B', 81],
    ['2020-01-05', 'C', 70],
    ['2020-01-05', 'D', 20],
]

df = pd.DataFrame(data, columns=['date', 'Item', 'Value'],)
df['date'] = pd.to_datetime(df['date'])

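For items that appear only once, add a placeholder row on the preceding date:

dates = sorted(set(pd.to_datetime(df['date'].values)))

# Items that appear on exactly one date
value_counts = df.Item.value_counts()
single_items = value_counts[value_counts==1].index
for item in single_items:
    last_date = df[df['Item']==item]['date'].iloc[0]
    previous_date = dates[dates.index(last_date) - 1]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    placeholder = pd.DataFrame([[previous_date, item, np.nan]], columns=['date', 'Item', 'Value'])
    df = pd.concat([df, placeholder], ignore_index=True)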
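One edge case in the step above: if an item appears only on the earliest date, `dates.index(last_date) - 1` evaluates to -1, and `dates[-1]` silently picks the last date instead. The sample data does not hit this case, but a guarded variant might look like the sketch below. The helper name `add_placeholder_rows` and the small `sample` frame are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def add_placeholder_rows(df):
    """Hypothetical helper: add a NaN row on the preceding date for each item seen once."""
    dates = sorted(set(df["date"]))  # unique Timestamps in ascending order
    counts = df["Item"].value_counts()
    extra = []
    for item in counts[counts == 1].index:
        only_date = df.loc[df["Item"] == item, "date"].iloc[0]
        pos = dates.index(only_date)
        if pos == 0:
            # No earlier date exists; dates[-1] would wrap around to the last date.
            continue
        extra.append({"date": dates[pos - 1], "Item": item, "Value": np.nan})
    if not extra:
        return df.copy()
    return pd.concat([df, pd.DataFrame(extra)], ignore_index=True)


# Made-up sample: X exists only on the first date, D only on the second.
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]),
        "Item": ["X", "B", "B", "D"],
        "Value": [1, 2, 3, 4],
    }
)
filled = add_placeholder_rows(sample)
print(filled)
```

Here only D receives a placeholder row (on 2020-01-01); X is left alone because there is no earlier date to pair it with.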

Join the data to itself on the next date, then drop and rename the unneeded columns:

dates = sorted(set(pd.to_datetime(df['date'].values)))

df['next_date'] = df.apply(
    lambda row: dates[dates.index(row['date']) + 1] 
    if dates.index(row['date']) != len(dates) - 1 else None,
    axis=1
)
df2 = df.copy()

result = df.merge(df2, left_on=['next_date', 'Item'], right_on=['date', 'Item'], how='left')
result.drop(columns=['date_y', 'next_date_y'], inplace=True)
result.rename(columns={
    'date_x': 'Date From', 
    'next_date_x': 'Date To',
    'Item_x': 'Item',
    'Value_x': 'Value From',
    'Value_y': 'Value To'
}, inplace = True)
result = result[['Item', 'Date From', 'Date To', 'Value From', 'Value To']]
result.dropna(subset=['Date To'], inplace=True)
# sort_values returns a new DataFrame, so assign it back before printing
result = result.sort_values(['Date From', 'Item'])
print(result)

Result:

  Item  Date From    Date To  Value From  Value To
0    A 2020-01-01 2020-01-02       100.0     102.0
1    B 2020-01-01 2020-01-02        80.0      82.0
2    C 2020-01-01 2020-01-02        70.0      65.0
3    A 2020-01-02 2020-01-05       102.0       NaN
4    B 2020-01-02 2020-01-05        82.0      81.0
5    C 2020-01-02 2020-01-05        65.0      70.0
9    D 2020-01-02 2020-01-05         NaN      20.0
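As a possible alternative to the self-merge, the same table can probably be built by pivoting to one column per item and pairing each date with the next one. This is a minimal sketch, not the answer's method. It assumes each (date, item) pair occurs at most once (otherwise `pivot` raises an error) and uses the question's column name `Date` rather than `date`.

```python
import pandas as pd

data = [
    ["2020-01-01", "A", 100], ["2020-01-01", "B", 80], ["2020-01-01", "C", 70],
    ["2020-01-02", "A", 102], ["2020-01-02", "B", 82], ["2020-01-02", "C", 65],
    ["2020-01-05", "B", 81], ["2020-01-05", "C", 70], ["2020-01-05", "D", 20],
]
df = pd.DataFrame(data, columns=["Date", "Item", "Value"])
df["Date"] = pd.to_datetime(df["Date"])

# One row per date, one column per item; NaN where an item has no value that day.
wide = df.pivot(index="Date", columns="Item", values="Value")

frames = []
dates = list(wide.index)  # the pivot index is already sorted by date
for date_from, date_to in zip(dates[:-1], dates[1:]):
    pair = pd.DataFrame(
        {
            "Item": list(wide.columns),
            "Date From": date_from,
            "Date To": date_to,
            "Value From": wide.loc[date_from].to_numpy(),
            "Value To": wide.loc[date_to].to_numpy(),
        }
    )
    # Keep an item only if it has a value on at least one side of the range.
    frames.append(pair.dropna(subset=["Value From", "Value To"], how="all"))

result = pd.concat(frames, ignore_index=True)
print(result)
```

Because missing values are handled by the pivot itself, this version needs no separate placeholder step, at the cost of building a wide intermediate table.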
