python - Merge_asof 但只让最近的键合并
问题描述
我目前正在尝试使用 merge_asof 方法合并两个数据帧。但是,在使用此方法时,我偶然发现了一个问题,即如果我的任何数据中都有空白,那么合并数据框中的重复单元格就会出现问题。为了澄清起见,我有两个如下所示的数据框:
1.
index Meter_Indication (km) Fuel1 (l)
0 35493 245
1 35975 267
2 36000 200
3 36303 160
4 36567 300
5 38653 234
index Meter_Indication (km) Fuel2 (l)
0 35494 300
1 35980 203
2 36573 323
3 38656 233
这两个数据框包含有关加油车辆的数据,其中燃料列以升为单位加油,而 Meter_Indication 指示汽车总共行驶了多少公里(随着时间的推移不可能变得更少,这就是为什么它是一个很好的关键合并)。但是,如您所见, df2 中的行数少于当前 df1 中的行数(在我的情况下,它使值合并到最接近的值上。像这样:
- (合并df)
index Meter_Indication (km) Fuel1 (l) Fuel2(l)
0 35493 245 300
1 35975 267 203
2 36000 200 203
3 36303 160 323
4 36567 300 323
5 38653 234 233
如您所见,有重复的值203
和323
。我的目标是代替包含所有 5 行的数据框,而不是排除那些没有“最近”匹配的数据框。我只想要实际最接近的值与值合并。换句话说,我想要的数据框是:
index Meter_Indication (km) Fuel1 (l) Fuel2(l)
0 35493 245 300
1 35975 267 203
3 36567 300 323
4 38653 234 233
您可以在此处看到,与另一个值不“最接近”匹配的值已被删除。
我试过到处寻找这个,但找不到任何符合我想要的结果的东西。
我目前的代码是:
#READS PROVIDED DOCUMENTS.
df1 = pd.read_excel(
filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
df2 = pd.read_excel(
filepathname2,
na_values=["NA"],
skiprows=1,
usecols=["Fuel2 (l)", "Unnamed: 3", "Meter_Indication"],)
# Drop NaN rows.
df2.dropna(inplace=True)
df1.dropna(inplace=True)
#Filters out rows with the keywords listed in 'blacklist'.
df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
df1 = df1[~df1.Bränsle.isin(blacklist)]
df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)
#Creates new column for the difference in fuellevel column.
df1["Difference (%)"] = df1["Bränslenivå (%)"]
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()
# Renames time-column so that they match.
df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)
# Drops rows where the difference is equal to 0.
df1filt = df1[(df1["Difference (%)"] != 0)]
# Converts time-column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
df1filt.reset_index(level=0, inplace=True)
#Renames the index column to "row" in order to later use the "row" column
df1filt.rename(columns={"index": "row"}, inplace=True)
# Creates a new column for the difference in total driven kilometers (used for matching)
df1filt["Match"] = df1filt["Vägmätare (km)"]
df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()
#Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
ROWRANGE = len(df1filt)+1
thevalue = 0
for currentrow in range(ROWRANGE-1):
if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
thevalue = 0
thevalue += df1filt.loc[currentrow,'Difference (%)']
df1filt.loc[currentrow,'Match'] = "SUMMED"
if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
thevalue += df1filt.loc[currentrow,'Difference (%)']
if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"
thevalue = 0
#Removes single "refuels" that are lower than 5
df1filt = df1filt[(df1filt['Difference (%)'] > 5)]
#Creates a new dataframe for the summed values
df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]
#Creates a estimated refueled amount column for the automatic
df1filt2["Fuel1 (l)"] = df1filt2["Difference (%)"]
df1filt2["Fuel1 (l)"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity
#Renames total kilometer column so that the two documents can match
df1filt2.rename(columns={"Vägmätare (km)": "Meter_Indication"}, inplace=True)
#Filters out rows where refuel and kilometer = NaN (Manual)
df2filt = df2[(df2['Fuel2 (l)'] != NaN) & (df2['Meter_Indication'] != NaN)]
#Drops first row
df2filt.drop(df2filt.index[0], inplace=True)
#Adds prefix for the time column so that they match (not used anymore because km is used to match)
df2filt['Tid'] = '20' + df2filt['Tid'].astype(str)
#Rounds numeric columns
decimals = 0
df2filt['Meter_Indication'] = pd.to_numeric(df2filt['Meter_Indication'],errors='coerce')
df2filt['Fuel2 (l)'] = pd.to_numeric(df2filt['Fuel2 (l)'],errors='coerce')
df2filt['Meter_Indication'] = df2filt['Meter_Indication'].apply(lambda x: round(x, decimals))
df2filt['Fuel2 (l)'] = df2filt['Fuel2 (l)'].apply(lambda x: round(x, decimals))
#Removes last number (makes the two excels matchable)
df2filt['Meter_Indication'] //= 10
df1filt2['Meter_Indication'] //= 10
#Creates merged dataframe with the two
merged_df = df1filt2.merge(df2filt, on='Meter_Indication')
希望这是足够的信息!先感谢您。
解决方案
尝试这个:
# Assign new column to keep meter indication from df2
df = pd.merge_asof(df1, df2.assign(meter_indication_2=df2['Meter_Indication (km)']), on='Meter_Indication (km)', direction='nearest')
# Calculate absolute difference
df['meter_indication_diff'] = df['Meter_Indication (km)'].sub(df['meter_indication_2']).abs()
# Sort values, drop duplicates (keep the ones with the smallest diff) and do some clean up
df = df.sort_values(by=['meter_indication_2', 'meter_indication_diff']).drop_duplicates(subset=['meter_indication_2']).sort_index().drop(['meter_indication_2', 'meter_indication_diff'], axis=1)
# Output
Meter_Indication (km) Fuel1 (l) Fuel2 (l)
0 35493 245 300
1 35975 267 203
4 36567 300 323
5 38653 234 233
推荐阅读
- python-3.x - 在熊猫中将多类别列转换为两个类别
- python - Discord Bot-AttributeError:“str”对象没有属性“loop”
- elixir - 在 Elixir 中构建枚举有哪些好的做法?
- sql - 在 impala 中找到连续值
- google-cloud-platform - 我可以使用什么地方 appcfg.py 下载/更新谷歌云?
- git-bash - Git Bash 中的 .cmd 有什么用?
- laravel-5 - 如何解决 Freescout 中的附件/存储问题
- bash - 检查睡眠后删除的文件大小
- rundeck - 在使用 rundeck-cli 之前是否需要安装 rundeck 软件
- php - 为什么我的 Docker 应用程序不会显示在我的本地服务器上?