Check whether items of one DataFrame fall within a range defined in another DataFrame with the same index

Problem description

I have two DataFrames created from files (both are reproduced in the code below).

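For each value in the 'Bias start' column, I want to find which range, bounded by the 'DOY installed' and 'DOY removed' columns, contains it. This has to be done within each 'Station ID' group, which is the index shared by both DataFrames. After that, I want to create a third DataFrame made up of all the columns of the second df plus the 'Receiver type' selected by the range condition.

The code, including the desired output (df3):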

import pandas as pd

# input: df1, df2
df1 = pd.DataFrame([['ABMF', 'ASTECH', 'GPS', '2008-07-15', '2009-10-15', 2008.20, 2009.29],
                    ['ABMF', 'LEICA', 'GPS+GLO', '2009-10-15', '2011-11-15', 2009.29, 2011.32],
                    ['ABMF', 'SEPT', 'GPS+GLO', '2011-11-15', '2015-04-28', 2011.32, 2015.12],
                    ['ABMF', 'TRIMBLE', 'GPS', '2015-04-28', '2019-04-15', 2015.12, 2019.11],
                    ['ZIMM', 'ASTECH', 'GPS', '1993-05-01', '1997-08-06', 1993.12, 1997.22],
                    ['ZIMM', 'SEPT', 'GPS', '1997-08-06', '2003-08-12', 1997.22, 2003.22],
                    ['ZIMM', 'TRIMBLE', 'GPS', '2003-08-12', '2015-04-27', 2003.22, 2015.12]],
                    columns=['Station ID','Receiver type','Satellite system','Date installed', 
                    'Date removed','DOY installed','DOY removed'])
df1.set_index(['Station ID','Receiver type'], inplace=True)

df2 = pd.DataFrame([['ABMF', 'C1P', 'C2P', 2013.09, 2013.09, -1.25, 0.15],
                    ['ABMF', 'C2W', 'C2X', 2013.10, 2013.10, -1.1, 0.1],
                    ['ABMF', 'C2C', 'C2P', 2013.14, 2013.14, -1.115, 0.123],
                    ['ABMF', 'C2W', 'C2X', 2013.22, 2013.22, -1.23, 0.12],
                    ['ABMF', 'C2W', 'C2X', 2013.42, 2013.42, -1.7, 0.124],
                    ['ZIMM', 'C2W', 'C2X', 2013.10, 2013.10, -1.21, 0.11],
                    ['ZIMM', 'C2W', 'C2X', 2013.12, 2013.12, -1.14, 0.11],
                    ['ZIMM', 'C2W', 'C2X', 2013.14, 2013.14, -1.41, 0.31]],
                    columns=['Station ID','OBS1','OBS2','Bias start','Bias end','Value','Std'])
df2.set_index('Station ID', inplace=True)

# desired output: df3
df3 = pd.DataFrame([['ABMF', 'C1P', 'C2P', 2013.09, 2013.09, -1.25, 0.15, 'SEPT'],
                    ['ABMF', 'C2W', 'C2X', 2013.10, 2013.10, -1.1, 0.1, 'SEPT'],
                    ['ABMF', 'C2C', 'C2P', 2013.14, 2013.14, -1.115, 0.123, 'SEPT'],
                    ['ABMF', 'C2W', 'C2X', 2013.22, 2013.22, -1.23, 0.12, 'SEPT'],
                    ['ABMF', 'C2W', 'C2X', 2013.42, 2013.42, -1.7, 0.124, 'SEPT'],
                    ['ZIMM', 'C2W', 'C2X', 2013.10, 2013.10, -1.21, 0.11, 'TRIMBLE'],
                    ['ZIMM', 'C2W', 'C2X', 2013.12, 2013.12, -1.14, 0.11, 'TRIMBLE'],
                    ['ZIMM', 'C2W', 'C2X', 2013.14, 2013.14, -1.41, 0.31, 'TRIMBLE']],
                    columns=['Station ID','OBS1','OBS2','Bias start','Bias end','Value','Std', 'Receiver type'])
df3.set_index('Station ID', inplace=True)

Tags: python, pandas

Solution


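This is a job for merge_asof, using the by parameter to match rows within each 'Station ID'. Note that both frames must be sorted by their key column ('Bias start' on the left, 'DOY installed' on the right) for this to work. With direction='backward', each bias row is paired with the latest receiver whose 'DOY installed' is less than or equal to 'Bias start'. The rest is cosmetic, to match the expected output.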

df_ = (pd.merge_asof(
              df2.reset_index().sort_values(by='Bias start'),
              df1.reset_index().sort_values(by='DOY installed'), 
              by='Station ID',
              left_on='Bias start', right_on='DOY installed', 
              direction='backward'
              )
          # keep the df2 columns plus the matched receiver
          [['Station ID'] + df2.columns.tolist() + ['Receiver type']]
          .sort_values(by=['Station ID', 'Bias start'])
          .set_index('Station ID')
      )
print(df_)
           OBS1 OBS2  Bias start  Bias end  Value    Std Receiver type
Station ID                                                            
ABMF        C1P  C2P     2013.09   2013.09 -1.250  0.150          SEPT
ABMF        C2W  C2X     2013.10   2013.10 -1.100  0.100          SEPT
ABMF        C2C  C2P     2013.14   2013.14 -1.115  0.123          SEPT
ABMF        C2W  C2X     2013.22   2013.22 -1.230  0.120          SEPT
ABMF        C2W  C2X     2013.42   2013.42 -1.700  0.124          SEPT
ZIMM        C2W  C2X     2013.10   2013.10 -1.210  0.110       TRIMBLE
ZIMM        C2W  C2X     2013.12   2013.12 -1.140  0.110       TRIMBLE
ZIMM        C2W  C2X     2013.14   2013.14 -1.410  0.310       TRIMBLE
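
One caveat: merge_asof only guarantees 'DOY installed' <= 'Bias start'. It never looks at 'DOY removed', so a bias that starts after the last receiver was removed would still be matched to that receiver. The sample data has no such row, but if yours might, a follow-up check along these lines could help. This is a minimal sketch on small stand-in frames (receivers and biases are illustrative names, and the 2016.50 row is made up to trigger the case); treating the upper bound as inclusive is an assumption.

```python
import pandas as pd

# Small stand-ins for df1 and df2, using the column names from the question.
receivers = pd.DataFrame({
    'Station ID': ['ABMF', 'ABMF', 'ZIMM'],
    'Receiver type': ['LEICA', 'SEPT', 'TRIMBLE'],
    'DOY installed': [2009.29, 2011.32, 2003.22],
    'DOY removed': [2011.32, 2015.12, 2015.12],
})
biases = pd.DataFrame({
    'Station ID': ['ABMF', 'ZIMM', 'ZIMM'],
    # The last value is hypothetical: it falls after ZIMM's receiver was removed.
    'Bias start': [2013.09, 2013.10, 2016.50],
})

# Same as-of merge as above: latest 'DOY installed' <= 'Bias start' per station.
merged = pd.merge_asof(
    biases.sort_values('Bias start'),
    receivers.sort_values('DOY installed'),
    by='Station ID',
    left_on='Bias start', right_on='DOY installed',
    direction='backward',
)

# Enforce the upper bound: blank out the receiver when the bias
# starts after 'DOY removed' (inclusive bound is an assumption).
in_range = merged['Bias start'] <= merged['DOY removed']
merged['Receiver type'] = merged['Receiver type'].where(in_range)

result = (merged.sort_values(['Station ID', 'Bias start'])
                [['Station ID', 'Bias start', 'Receiver type']])
print(result)
```

Rows that fall outside every range end up with a missing 'Receiver type', which can then be dropped with dropna or inspected separately.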
