首页 > 解决方案 > Merge_asof 但只让最近的键合并

问题描述

我目前正在尝试使用 merge_asof 方法合并两个数据帧。但是,在使用此方法时,我偶然发现了一个问题,即如果我的任何数据中都有空白,那么合并数据框中的重复单元格就会出现问题。为了澄清起见,我有两个如下所示的数据框:

1.

index     Meter_Indication (km)     Fuel1 (l)
0         35493                     245
1         35975                     267
2         36000                     200
3         36303                     160
4         36567                     300
5         38653                     234
index     Meter_Indication (km)     Fuel2 (l)
0         35494                     300
1         35980                     203
2         36573                     323
3         38656                     233

这两个数据框包含有关加油车辆的数据,其中燃料列以升为单位加油,而 Meter_Indication 指示汽车总共行驶了多少公里(随着时间的推移不可能变得更少,这就是为什么它是一个很好的关键合并)。但是,如您所见, df2 中的行数少于当前 df1 中的行数(在我的情况下,它使值合并到最接近的值上。像这样:

  1. (合并df)
index     Meter_Indication (km)     Fuel1 (l)     Fuel2(l)
0         35493                     245           300
1         35975                     267           203
2         36000                     200           203
3         36303                     160           323
4         36567                     300           323
5         38653                     234           233

如您所见,有重复的值203323。我的目标是代替包含所有 5 行的数据框,而不是排除那些没有“最近”匹配的数据框。我只想要实际最接近的值与值合并。换句话说,我想要的数据框是:

index     Meter_Indication (km)     Fuel1 (l)     Fuel2(l)
0         35493                     245           300
1         35975                     267           203
3         36567                     300           323
4         38653                     234           233

您可以在此处看到,与另一个值不“最接近”匹配的值已被删除。

我试过到处寻找这个,但找不到任何符合我想要的结果的东西。

我目前的代码是:

    #READS PROVIDED DOCUMENTS.
    df1 = pd.read_excel(
        filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
    df2 = pd.read_excel(
        filepathname2,
        na_values=["NA"],
        skiprows=1,
        usecols=["Fuel2 (l)", "Unnamed: 3", "Meter_Indication"],)

    # Drop NaN rows.
    df2.dropna(inplace=True)
    df1.dropna(inplace=True)

    #Filters out rows with the keywords listed in 'blacklist'.
    df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
    df1 = df1[~df1.Bränsle.isin(blacklist)]
    df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)

    #Creates new column for the difference in fuellevel column.
    df1["Difference (%)"] = df1["Bränslenivå (%)"]
    df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()

    # Renames time-column so that they match.
    df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)

    # Drops rows where the difference is equal to 0.
    df1filt = df1[(df1["Difference (%)"] != 0)]

    # Converts time-column to only year, month and date.
    df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)

    df1filt.reset_index(level=0, inplace=True)

    #Renames the index column to "row" in order to later use the "row" column 
    df1filt.rename(columns={"index": "row"}, inplace=True)

    # Creates a new column for the difference in total driven kilometers (used for matching)
    df1filt["Match"] = df1filt["Vägmätare (km)"]
    df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()

    #Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
    ROWRANGE = len(df1filt)+1
    thevalue = 0
    for currentrow in range(ROWRANGE-1):
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
            thevalue = 0
            thevalue += df1filt.loc[currentrow,'Difference (%)']
            df1filt.loc[currentrow,'Match'] = "SUMMED"
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            thevalue += df1filt.loc[currentrow,'Difference (%)']
        if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
            df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"        
            thevalue = 0

    #Removes single "refuels" that are lower than 5
    df1filt = df1filt[(df1filt['Difference (%)'] > 5)]

    #Creates a new dataframe for the summed values 
    df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]

    #Creates a estimated refueled amount column for the automatic
    df1filt2["Fuel1 (l)"] = df1filt2["Difference (%)"]
    df1filt2["Fuel1 (l)"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity

    #Renames total kilometer column so that the two documents can match
    df1filt2.rename(columns={"Vägmätare (km)": "Meter_Indication"}, inplace=True)

    #Filters out rows where refuel and kilometer = NaN (Manual)
    df2filt = df2[(df2['Fuel2 (l)'] != NaN) & (df2['Meter_Indication'] != NaN)]

    #Drops first row
    df2filt.drop(df2filt.index[0], inplace=True)

    #Adds prefix for the time column so that they match (not used anymore because km is used to match)
    df2filt['Tid'] = '20' + df2filt['Tid'].astype(str) 

    #Rounds numeric columns
    decimals = 0
    df2filt['Meter_Indication'] = pd.to_numeric(df2filt['Meter_Indication'],errors='coerce')
    df2filt['Fuel2 (l)'] = pd.to_numeric(df2filt['Fuel2 (l)'],errors='coerce')
    df2filt['Meter_Indication'] = df2filt['Meter_Indication'].apply(lambda x: round(x, decimals))
    df2filt['Fuel2 (l)'] = df2filt['Fuel2 (l)'].apply(lambda x: round(x, decimals))

    #Removes last number (makes the two excels matchable)
    df2filt['Meter_Indication'] //= 10
    df1filt2['Meter_Indication'] //= 10

    #Creates merged dataframe with the two
    merged_df = df1filt2.merge(df2filt, on='Meter_Indication')

希望这是足够的信息!先感谢您。

标签: pythonpandas

解决方案


尝试这个:

# Assign new column to keep meter indication from df2
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']), on='Meter_Indication (km)', direction='nearest')

# Calculate absolute difference
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()

# Sort values, drop duplicates (keep the ones with the smallest diff) and do some clean up
df = df.sort_values(by=['meter_indication_2', 'meter_indication_diff']).drop_duplicates(subset=['meter_indication_2']).sort_index().drop(['meter_indication_2', 'meter_indication_diff'], axis=1)

# Output

   Meter_Indication (km)  Fuel1 (l)  Fuel2 (l)
0                  35493        245        300
1                  35975        267        203
4                  36567        300        323
5                  38653        234        233

推荐阅读