How to join two DataFrames based on column values and row values?

Problem description

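I have a DataFrame (df1) with a column, say column1, consisting of random months and years in the format %b-%y, and another column, say column2, consisting of some numbers. (In the sample data below these are column5 and column6.) I have another DataFrame (df2) whose column headers are months and years in the same format as in df1; for simplicity, assume Jan-21 through Dec-21. The two DataFrames also have some columns in common.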

Is there a way to join df1 and df2 on the common columns and fill the month columns of df2 with the numbers from column2?

df1:

    column1 column3 column4 column5 column6
0   ABCD    CT      AA      Apr-21  123
1   ABCD    CT      EE      Jun-21  24
2   ABCD    CT      CS      Jul-21  123
3   ABCD    CT      UUU     Jan-21  123
4   ABCD    CT      MMM     Apr-20  432
5   ABCD    CT      CCC     Aug-21  312
6   ABCD    CP      AA      Jul-21  4
7   ABCD    CP      EE      Jun-21  2
8   ABCD    CP      CS      Nov-21  0
9   ABCD    CP      UUU     Mar-21  34
10  ABCD    CP      MMM     Sep-21  234
11  ABCD    CP      CCC     Oct-21  123

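df2 is shown below. Its format is fixed, and its month columns need to be filled with column6 according to column5. Columns 1, 3, and 4 are the same in both DataFrames.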

{'column1': ['ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD'], 'column2': ['V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V'], 'column3': ['CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP'], 'column4': ['OCP', 'AC', 'TC', 'AA', 'EE', 'CS', 'UUU', 'MMM', 'CCC', 'TOQ', 'OC', 'FR', 'OCP', 'AC', 'TC', 'AA', 'EE', 'CS', 'UUU', 'MMM', 'CCC', 'TOQ', 'OC', 'FR'], 'Jan-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'May-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Jul-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Aug-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Sep-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Oct-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Nov-21': [nan, 
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Dec-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]}

The final output should look like this:

[image: expected output]

I can't use pivot, because I want to keep the month column headers in df2.

Tags: python, pandas

Solution


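You don't need to introduce "placeholder" months to achieve this. You can set a DatetimeIndex and reindex your DataFrame to fill in the missing dates. After that, it is just a pivot table.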
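As a minimal illustration of the reindexing step on its own, the sketch below uses a toy Series (hypothetical data with only two months present; the names `s`, `months`, and `full` are made up for this illustration) and reindexes it against all twelve month starts of 2021. It assumes the default English locale for `%b`.

```python
import pandas as pd

# Toy data (hypothetical): values exist for only two months of 2021.
s = pd.Series(
    [123, 24],
    index=pd.to_datetime(['Apr-21', 'Jun-21'], format='%b-%y'),
)

# All twelve month starts of 2021.
months = pd.date_range('2021-01-01', '2021-12-01', freq='MS')

# Reindexing keeps the existing values and inserts NaN for missing months.
full = s.reindex(months)

# Turn the DatetimeIndex back into '%b-%y' labels.
full.index = full.index.strftime('%b-%y')
```

Months that were absent from the input show up as NaN, which is the behaviour the fixed month columns of df2 need.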

Here is a full example:

import pandas as pd
import io
from datetime import (date, datetime)

df1_text = """
index  column1 column3 column4 column5 column6
0   ABCD    CT      AA      Apr-21  123
1   ABCD    CT      EE      Jun-21  24
2   ABCD    CT      CS      Jul-21  123
3   ABCD    CT      UUU     Jan-21  123
4   ABCD    CT      MMM     Apr-20  432
5   ABCD    CT      CCC     Aug-21  312
6   ABCD    CP      AA      Jul-21  4
7   ABCD    CP      EE      Jun-21  2
8   ABCD    CP      CS      Nov-21  0
9   ABCD    CP     UUU      Mar-21  34
10  ABCD    CP     MMM      Sep-21  234
11  ABCD    CP     CCC      Oct-21  123
"""


def s2d(s):
    # parse a '%b-%y' string such as 'Apr-21' into a datetime
    return datetime.strptime(s, '%b-%y')


df1 = pd.read_csv(io.StringIO(df1_text), sep=r'\s+',
                  index_col=[0], converters={'column5': s2d})

# set datetime as index and group by columns
df1 = df1.set_index('column5').groupby(['column1', 'column3', 'column4'])

# reindex by full daterange (rows outside 2021, such as Apr-20, are dropped)
df1 = df1.apply(lambda x: x.reindex(pd.date_range(
    date(2021, 1, 1), date(2021, 12, 31), freq='MS')))

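# drop the grouping columns if apply() kept them as regular columns
# (this depends on the pandas version; errors='ignore' covers both cases)
df1 = df1.drop(columns=['column1', 'column3', 'column4'], errors='ignore')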

# reset group by index
df1.reset_index(['column1', 'column3', 'column4'], inplace=True)

# reset datetime index
df1.reset_index(inplace=True)

# pivot table
df1 = df1.pivot_table(
    index=['column1', 'column3', 'column4'],
    columns='index',
    values='column6',
    dropna=False  # make sure empty months do not get dropped
)

# reformat the columns back to '%b-%y'
df1.columns = [d.strftime('%b-%y') for d in df1.columns]

# reset_index
df1.reset_index(inplace=True)

# df2 without months
df2 = pd.DataFrame({
    'column1': ['ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD',
                'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD',
                'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD', 'ABCD'],
    'column2': ['V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V',
                'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V', 'V'],
    'column3': ['CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT', 'CT',
                'CT', 'CT', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP', 'CP',
                'CP', 'CP', 'CP', 'CP'],
    'column4': ['OCP', 'AC', 'TC', 'AA', 'EE', 'CS', 'UUU', 'MMM', 'CCC', 'TOQ',
                'OC', 'FR', 'OCP', 'AC', 'TC', 'AA', 'EE', 'CS', 'UUU', 'MMM',
                'CCC', 'TOQ', 'OC', 'FR']})

# merge df2 with df1
df2 = df2.merge(df1, on=['column1', 'column3', 'column4'], how='left')
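For comparison, here is a more compact sketch of the same idea. It is not the method used above: it pivots df1 directly on the month strings and then reindexes the columns against the fixed list of month labels, so the headers df2 expects are preserved. It assumes each (column1, column3, column4, column5) combination occurs at most once, since `DataFrame.pivot` raises on duplicates. The small frames and the names `df2_keys`, `wide`, and `result` are stand-ins for illustration only.

```python
import pandas as pd

# Small stand-ins for df1 and the key columns of df2 (values from the question).
df1 = pd.DataFrame({
    'column1': ['ABCD', 'ABCD', 'ABCD'],
    'column3': ['CT', 'CT', 'CP'],
    'column4': ['AA', 'EE', 'AA'],
    'column5': ['Apr-21', 'Jun-21', 'Jul-21'],
    'column6': [123, 24, 4],
})
df2_keys = pd.DataFrame({
    'column1': ['ABCD'] * 4,
    'column2': ['V'] * 4,
    'column3': ['CT', 'CT', 'CP', 'CP'],
    'column4': ['OCP', 'AA', 'AA', 'EE'],
})

keys = ['column1', 'column3', 'column4']

# Month labels in the fixed order that df2 expects.
months = pd.date_range('2021-01-01', periods=12, freq='MS').strftime('%b-%y')

wide = (
    df1.pivot(index=keys, columns='column5', values='column6')
       .reindex(columns=months)  # add missing months, drop out-of-range ones
       .reset_index()
)

# Left merge keeps every row of df2, including those with no match in df1.
result = df2_keys.merge(wide, on=keys, how='left')
```

Rows of df2 with no counterpart in df1 keep NaN in every month column, and any month in df1 outside the label list is discarded by the column reindex.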
