Compare two Pandas DataFrames of unequal size for matching values, combine them, then replace NaN values with the nearest preceding data

Question

I'm new to Python and Stack Overflow, and I'm having trouble processing some data. I have two datasets of different sizes: df1 has size 1000 and df2 has size 100000. Here are samples of df1 and df2.

df1=
        Date                 x     y   
    0   2020-01-01 01:01    1.1   2.4 
    1   2020-01-01 01:05    4.2   5.5  
    2   2020-01-01 01:08    7.3   8.6  

 

df2=
        Date                 x     y
    0   2020-01-01 01:00    NaN   NaN
    1   2020-01-01 01:01    NaN   NaN
    2   2020-01-01 01:02    NaN   NaN
    3   2020-01-01 01:03    NaN   NaN
    4   2020-01-01 01:04    NaN   NaN
    5   2020-01-01 01:05    NaN   NaN
    6   2020-01-01 01:06    NaN   NaN
    7   2020-01-01 01:07    NaN   NaN 
    8   2020-01-01 01:08    NaN   NaN
    9   2020-01-01 01:09    NaN   NaN
   10   2020-01-01 01:10    NaN   NaN
 

What I want to do is combine them into a new DataFrame, df3, that takes the values from df1 wherever df1['Date'] == df2['Date'], as shown below.

df3= 
        Date                 x     y
    0   2020-01-01 01:00    NaN   NaN     
    1   2020-01-01 01:01    1.1   2.4 
    2   2020-01-01 01:02    NaN   NaN
    3   2020-01-01 01:03    NaN   NaN
    4   2020-01-01 01:04    NaN   NaN
    5   2020-01-01 01:05    4.2   5.5  
    6   2020-01-01 01:06    NaN   NaN
    7   2020-01-01 01:07    NaN   NaN 
    8   2020-01-01 01:08    7.3   8.6  
    9   2020-01-01 01:09    NaN   NaN
   10   2020-01-01 01:10    NaN   NaN 

Then each NaN value should be set to the closest value above it:

df3=
        Date                 x     y
    0   2020-01-01 01:00    NaN   NaN     
    1   2020-01-01 01:01    1.1   2.4 
    2   2020-01-01 01:02    1.1   2.4 
    3   2020-01-01 01:03    1.1   2.4 
    4   2020-01-01 01:04    1.1   2.4 
    5   2020-01-01 01:05    4.2   5.5  
    6   2020-01-01 01:06    4.2   5.5  
    7   2020-01-01 01:07    4.2   5.5  
    8   2020-01-01 01:08    7.3   8.6  
    9   2020-01-01 01:09    7.3   8.6  
   10   2020-01-01 01:10    7.3   8.6 

Thanks a lot!

Tags: python, pandas, dataframe

Answer


One way is to call update on the complete DataFrame (assuming its index contains every index value of the smaller one), then forward-fill so that each missing value takes the previous non-missing value:

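    import numpy as np
    import pandas as pd

    # Smaller frame holding the known values, indexed by date.
    a = pd.DataFrame(
        {
            "date": pd.date_range(start="2020-01-01", periods=3),
            "x": [1, np.nan, 3],
            "y": [5, np.nan, 6],
        }
    ).set_index("date")

    # Complete frame covering every date, initially all NaN.
    b = pd.DataFrame(
        {
            "date": pd.date_range(start="2020-01-01", periods=5),
            "x": [np.nan] * 5,
            "y": [np.nan] * 5,
        }
    ).set_index("date")
    print(a, b)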

| date                |   x |   y |
|:--------------------|----:|----:|
| 2020-01-01 00:00:00 |   1 |   5 |
| 2020-01-02 00:00:00 | nan | nan |
| 2020-01-03 00:00:00 |   3 |   6 |


| date                |   x |   y |
|:--------------------|----:|----:|
| 2020-01-01 00:00:00 | nan | nan |
| 2020-01-02 00:00:00 | nan | nan |
| 2020-01-03 00:00:00 | nan | nan |
| 2020-01-04 00:00:00 | nan | nan |
| 2020-01-05 00:00:00 | nan | nan |
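
    # Copy the non-NaN values of a into b in place, aligned on the index.
    b.update(a)
    # Forward-fill; ffill() replaces the deprecated fillna(method="ffill").
    b = b.ffill()
    print(b)
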
| date                |   x |   y |
|:--------------------|----:|----:|
| 2020-01-01 00:00:00 |   1 |   5 |
| 2020-01-02 00:00:00 |   1 |   5 |
| 2020-01-03 00:00:00 |   3 |   6 |
| 2020-01-04 00:00:00 |   3 |   6 |
| 2020-01-05 00:00:00 |   3 |   6 |
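
The same idea can be applied to the frames from the question. The block below is only a sketch: it rebuilds df1 and df2 from the samples shown in the question, and it assumes that Date is stored as an ordinary column (so it is moved into the index first) and that every Date in df1 also occurs in df2.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the df1/df2 samples in the question.
df1 = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-01-01 01:01", "2020-01-01 01:05", "2020-01-01 01:08"]
        ),
        "x": [1.1, 4.2, 7.3],
        "y": [2.4, 5.5, 8.6],
    }
)
df2 = pd.DataFrame(
    {
        "Date": pd.date_range("2020-01-01 01:00", periods=11, freq="min"),
        "x": np.nan,
        "y": np.nan,
    }
)

# Index both frames by Date so that update() aligns rows on the timestamp.
df3 = df2.set_index("Date")
df3.update(df1.set_index("Date"))

# Forward-fill so each NaN takes the closest value above it,
# then turn Date back into a regular column.
df3 = df3.ffill().reset_index()
print(df3)
```

The first row stays NaN because there is nothing above it to fill from. Note that update only touches rows whose Date already exists in df2, so any df1 timestamps missing from df2 would be silently ignored. If that can happen, pd.merge_asof(df2[["Date"]], df1, on="Date") may be a better fit: with both frames sorted by Date, it picks for each df2 row the last df1 row at or before that time.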
