Python: Merge on 2 columns

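Problem description

I am working with a large dataset. Below is an example that uses a smaller dataset for the calculation.

In this example I have pollution measurements for 3 rivers over different time spans. Each year, a river's pollution is measured at a downstream measuring station ("pollution"). The year in which the river water was polluted upstream has already been calculated ("year_of_upstream_pollution"). My goal is to create a new column ["result_of_upstream_pollution"] that contains the pollution value corresponding to "year_of_upstream_pollution". To do this, the data in the "pollution" column has to be reassigned.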

import numpy as np
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
# River 3 uses 2002-2005 here so that the data matches the printed frame below
y1 = [2002,2002,2003,2005,2005,np.nan,1991,1992,1993,1994,np.nan,np.nan,2002,2002,2003,2004,2005,np.nan]
poll = [10,14,20,11,8,11,
          20,22,20,25,18,21,
          30,19,15,10,26,28]

dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)

    river_id  year  pollution  year_of_upstream_pollution
0          1  2000         10                      2002.0
1          1  2001         14                      2002.0
2          1  2002         20                      2003.0
3          1  2003         11                      2005.0
4          1  2004          8                      2005.0
5          1  2005         11                         NaN
6          2  1990         20                      1991.0
7          2  1991         22                      1992.0
8          2  1992         20                      1993.0
9          2  1993         25                      1994.0
10         2  1994         18                         NaN
11         2  1995         21                         NaN
12         3  2000         30                      2002.0
13         3  2001         19                      2002.0
14         3  2002         15                      2003.0
15         3  2003         10                      2004.0
16         3  2004         26                      2005.0
17         3  2005         28                         NaN

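Example: for river_id = 1 and year = 2000, year_of_upstream_pollution = 2002, so the result is the pollution of river 1 in 2002, which is 20.

The result column should look like this: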

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                            NaN  
6                           22.0  
7                           20.0  
8                           25.0  
9                           18.0  
10                           NaN  
11                           NaN  
12                          15.0  
13                          15.0  
14                          10.0  
15                          26.0  
16                          28.0  
17                           NaN  

My own approach:

### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])

# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)

len(listr) # = 28

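This raises a ValueError (the list has 28 entries, but dfr1 has only 18 rows).

I have not found a way to resolve this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be greatly appreciated!

Tags: python, pandas, dataframe, merge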

Solution


As you say in the title, this is a merge on two columns:

dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
                                                  right_on=['river_id','year_of_upstream_pollution'], 
                                                  how='right')['pollution_x']
print(dfr1)

Output (new column only):

    result_of_upstream_pollution  
0                           20.0  
1                           20.0  
2                           11.0  
3                           11.0  
4                           11.0  
5                            NaN  
6                           22.0  
7                           20.0  
8                           25.0  
9                           18.0  
10                           NaN  
11                           NaN  
12                          15.0  
13                          15.0  
14                          10.0  
15                          26.0  
16                          28.0  
17                           NaN  
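
For comparison, here is a minimal, self-contained sketch of the same idea written as a left merge against a small lookup table. It also reproduces why merging on the year alone returns too many rows. The names `year_only`, `lookup` and `result` are illustrative and not part of the original code, and the sample data assumes river 3 uses the years shown in the printed frame (2002 to 2005).

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
dfr1 = pd.DataFrame({
    "river_id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": list(range(2000, 2006)) + list(range(1990, 1996)) + list(range(2000, 2006)),
    "pollution": [10, 14, 20, 11, 8, 11,
                  20, 22, 20, 25, 18, 21,
                  30, 19, 15, 10, 26, 28],
    "year_of_upstream_pollution": [2002, 2002, 2003, 2005, 2005, np.nan,
                                   1991, 1992, 1993, 1994, np.nan, np.nan,
                                   2002, 2002, 2003, 2004, 2005, np.nan],
})

# Merging on the year alone also matches rows of other rivers that share
# the same year (rivers 1 and 3 here), so rows get duplicated.
year_only = dfr1[["year_of_upstream_pollution"]].merge(
    dfr1[["year", "pollution"]],
    left_on="year_of_upstream_pollution", right_on="year", how="left",
)
rows_year_only = len(year_only)

# Lookup table (illustrative helper): one pollution value per (river_id, year),
# renamed so the key columns line up with the columns we want to match on.
lookup = (
    dfr1[["river_id", "year", "pollution"]]
    .rename(columns={"year": "year_of_upstream_pollution",
                     "pollution": "result_of_upstream_pollution"})
    .astype({"year_of_upstream_pollution": "float64"})  # same dtype as the left key
)

# A left merge keeps every row of dfr1 in its original order;
# validate="many_to_one" raises if the lookup keys are not unique.
result = dfr1.merge(
    lookup,
    on=["river_id", "year_of_upstream_pollution"],
    how="left",
    validate="many_to_one",
)

print(rows_year_only, len(result))
print(result["result_of_upstream_pollution"].tolist())
```

With the sample data, the year-only merge has 28 rows, which is the length reported in the question, while the two-key merge keeps the original 18 rows. The `validate` argument is optional, but on a much larger dataset it is a cheap way to catch duplicate (river_id, year) pairs before they silently multiply rows.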
