How to merge and group data into conditional blocks in pandas

Problem description

I have two data frames that I am trying to merge in a particular way. They are as follows:

df_A

    time expression_w expression_h
     0      neutral      neutral
     1      neutral      neutral
     2      neutral      neutral
     3      neutral      neutral
     4      neutral      neutral
     5      neutral      neutral
     6      neutral      neutral
     7      neutral      neutral
     8      neutral      neutral
     9      neutral      neutral
    10      neutral      neutral
    11      neutral      neutral
    12      neutral      neutral
    13      neutral      neutral
    14      neutral      neutral
    15      neutral      neutral
    16      neutral      neutral
    17      neutral      neutral
    18      neutral      neutral
    19      neutral      neutral

and df_B:

   start   stop
  12.12  12.47
  13.44  20.82

I want to merge df_A.expression_w and df_A.expression_h into df_B, keeping only the rows where df_A.time falls between df_B.start and df_B.stop. The aggregation statistic should be the mode.

It should look like this:

   start   stop expression_w  expression_h
  12.12  12.47      neutral     neutral
  13.44  20.82      neutral     neutral

If I extend the data frames to

df_A

    time expression_w expression_h
     0      neutral      neutral
     1      neutral      neutral
     2      neutral      neutral
     3      neutral      neutral
     4      neutral      neutral
     5      neutral      neutral
     6      neutral      neutral
     7      neutral      neutral
     8      neutral      neutral
     9      neutral      neutral
    10      neutral      neutral
    11      neutral      neutral
    12      neutral      neutral
    13      neutral      neutral
    14      neutral      neutral
    15      neutral      neutral
    16      neutral      neutral
    17      neutral      neutral
    18      neutral      neutral
    19      neutral      neutral
    20      neutral      neutral
    21      neutral      neutral
    22      neutral      neutral
    23      neutral      neutral
    24      neutral      neutral
    25      neutral      neutral
    26      neutral      neutral
    27      neutral      neutral
    28      neutral      neutral
    29      neutral      neutral
    30      neutral      neutral
    31      neutral      neutral
    32      neutral      neutral
    33      neutral      neutral
    34      neutral      neutral
    35      neutral      neutral
    36      neutral      neutral
    37      neutral      neutral
    38      neutral      neutral
    39      neutral      neutral
    40      neutral      neutral
    41      neutral      neutral
    42      neutral      neutral
    43      neutral      neutral
    44      neutral      neutral
    45      neutral      neutral
    46      neutral      neutral
    47      neutral      neutral
    48      neutral      neutral
    49      neutral      neutral
    50      neutral      neutral

and df_B:

    start   stop
0  12.12  12.47
1  13.44  20.82
2  20.88  29.63
3  31.61  33.33
4  33.44  42.21

the code recommended below breaks and gives this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-117-47594fba3999> in <module>
      1 # df_B= d_s[['start', 'stop']].head(3)
      2 # df_A = df.head(500)
----> 3 y = (df_B.assign(dummy=1, idx=np.arange(df_B.shape[1]))
      4     .merge(df_A.assign(dummy=1), on='dummy')
      5     .query('start<=time<=stop')

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/frame.py in assign(self, **kwargs)
   3687 
   3688         for k, v in kwargs.items():
-> 3689             data[k] = com.apply_if_callable(v, data)
   3690         return data
   3691 

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3035         else:
   3036             # set column
-> 3037             self._set_item(key, value)
   3038 
   3039     def _setitem_slice(self, key: slice, value):

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3111         """
   3112         self._ensure_valid_index(value)
-> 3113         value = self._sanitize_column(key, value)
   3114         NDFrame._set_item(self, key, value)
   3115 

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3756 
   3757             # turn me into an ndarray
-> 3758             value = sanitize_index(value, self.index)
   3759             if not isinstance(value, (np.ndarray, Index)):
   3760                 if isinstance(value, list) and len(value) > 0:

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
    746     if len(data) != len(index):
    747         raise ValueError(
--> 748             "Length of values "
    749             f"({len(data)}) "
    750             "does not match length of index "

ValueError: Length of values (2) does not match length of index (5)

Any ideas on how to fix this?

Tags: python, pandas

Solution
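
The ValueError in the traceback appears to come from np.arange(df_B.shape[1]): shape[1] is the number of columns (2), while assign needs one value per row (5). A minimal sketch of the difference, rebuilding the extended df_B from the question (the variable names bad_idx, good_idx and df_B_indexed are only for illustration):

```python
import numpy as np
import pandas as pd

# Extended df_B from the question: 5 rows, 2 columns
df_B = pd.DataFrame({
    "start": [12.12, 13.44, 20.88, 31.61, 33.44],
    "stop": [12.47, 20.82, 29.63, 33.33, 42.21],
})

n_rows, n_cols = df_B.shape

# shape[1] counts columns, so this array is too short for a 5-row frame
bad_idx = np.arange(df_B.shape[1])

# len(df_B) (equivalently df_B.shape[0]) counts rows, which is what assign needs
good_idx = np.arange(len(df_B))

try:
    df_B.assign(idx=bad_idx)
    error_message = None
except ValueError as exc:
    # Same kind of length-mismatch error as in the traceback
    error_message = str(exc)

df_B_indexed = df_B.assign(idx=good_idx)
```

With the original two-row df_B, shape[1] happened to equal the row count, which is likely why the problem only showed up after the frame was extended.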


Let's try a cross merge and a query:

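import numpy as np
import pandas as pd

# Tag each df_B row with an id, cross join with df_A through a constant key,
# keep the rows whose time lies in [start, stop], then take the mode per interval
(df_B.assign(dummy=1, idx=np.arange(len(df_B)))
    .merge(df_A.assign(dummy=1), on='dummy')
    .query('start<=time<=stop')
    .drop(['dummy', 'time'], axis=1)  # helper key and time are not aggregated
    .groupby('idx', as_index=False)
    .agg(pd.Series.mode)
)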

Output (note that for the first row no time value falls between start and stop, so that row is dropped):

   idx  start   stop expression_w expression_h
0    1  13.44  20.82      neutral      neutral
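
For the extended data, the same idea can be sketched end to end as below. This is an illustration under a few assumptions rather than a definitive implementation: df_A.time is assumed to run from 0 to 50, ties in the mode are resolved by taking the first value (the helper first_mode is hypothetical), and the final left merge back onto df_B is an optional extra step that keeps intervals with no matching time as NaN rows.

```python
import numpy as np
import pandas as pd

# Extended inputs modelled on the question (time assumed to be 0..50)
df_A = pd.DataFrame({
    "time": np.arange(51),
    "expression_w": "neutral",
    "expression_h": "neutral",
})
df_B = pd.DataFrame({
    "start": [12.12, 13.44, 20.88, 31.61, 33.44],
    "stop": [12.47, 20.82, 29.63, 33.33, 42.21],
})


def first_mode(s):
    """Return the most frequent value; on ties, the first one after sorting."""
    return s.mode().iloc[0]


# Cross join through a constant key, filter to each interval, aggregate by mode
matched = (
    df_B.assign(dummy=1, idx=np.arange(len(df_B)))
    .merge(df_A.assign(dummy=1), on="dummy")
    .query("start <= time <= stop")
    .groupby(["idx", "start", "stop"], as_index=False)[
        ["expression_w", "expression_h"]
    ]
    .agg(first_mode)
)

# Optional: keep every interval, with NaN where no time value matched
result = df_B.assign(idx=np.arange(len(df_B))).merge(
    matched[["idx", "expression_w", "expression_h"]], on="idx", how="left"
)
```

The cross join builds len(df_A) * len(df_B) rows before filtering, so this approach is best suited to frames of moderate size.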
