首页 > 解决方案 > 在熊猫合并中使用“外部”时如何避免丢失数据?

问题描述

我正在使用看起来像下面这样的 pandas 中的两个数据框(不过,我正在使用的数据框有更多列):

>>> df_download
    date        |  app_name           | downloads | app_id
__________________________________________________________
0   2021-01-01  |  music app          | 500       | 100
1   2021-01-01  |  food delivery app  | 900       | 110
2   2021-01-01  |  fitness app        | 1200      | 120
3   2021-01-02  |  music app          | 700       | 100
4   2021-01-02  |  food delivery app  | 750       | 110
5   2021-01-02  |  fitness app        | 3000      | 120
6   2021-01-03  |  music app          | 800       | 100
7   2021-01-03  |  food delivery app  | 100       | 110
8   2021-01-03  |  fitness app        | 400       | 120
>>> df_active_users
    date        |  app_name           | active_users  | app_id
_______________________________________________________________
0   2021-01-01  |  music app          | 0             | 100
1   2021-01-01  |  food delivery app  | 30000         | 110
2   2021-01-01  |  fitness app        | 90000         | 120
3   2021-01-02  |  music app          | 15000         | 100
4   2021-01-02  |  food delivery app  | 0             | 110
5   2021-01-02  |  fitness app        | 80000         | 120
6   2021-01-03  |  music app          | 20000         | 100
7   2021-01-03  |  food delivery app  | 50000         | 110 
8   2021-01-03  |  fitness app        | 40000         | 120

我的目标是合并它们。我设法做到了,但不知何故,我丢失了一个应用程序。送餐应用突然消失。我使用了这段代码:

df_merged = pd.merge(df_active_users[['date', 'active_users', 'app_id', 'app_name']],
 df_downloads[['date', 'downloads', 'app_id']], on=['date', 'app_id', 'app_name'], how='outer')

结果:

df_merged['app_name'].unique()

 array(['music app','fitness app'], dtype=object)

有谁知道如何在不丢失任何应用程序数据的情况下从两个数据框中获取所有数据?非常感谢你的帮助!

标签: pythonpandasdataframemerge

解决方案


你失去了任何东西,你的merge功能(几乎)有效:

out = pd.merge(df_active_users, df_downloads,
               on=['date', 'app_id', 'app_name'],
               how='outer')
>>> out
        date           app_name  active_users  app_id  downloads
0 2021-01-01          music app             0     100        500
1 2021-01-01  food delivery app         30000     110        900
2 2021-01-01        fitness app         90000     120       1200
3 2021-01-02          music app         15000     100        700
4 2021-01-02  food delivery app             0     110        750
5 2021-01-02        fitness app         80000     120       3000
6 2021-01-03          music app         20000     100        800
7 2021-01-03  food delivery app         50000     110        100
8 2021-01-03        fitness app         40000     120        400

>>> out['app_name'].unique()
array(['music app', 'food delivery app', 'fitness app'], dtype=object)

>>> out['app_name'].value_counts()
music app            3
food delivery app    3
fitness app          3
Name: app_name, dtype: int64

推荐阅读