Pandas DataFrame slicing and manipulation

Problem description

My DataFrame df1 looks like this:

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | a        |   1 |
|      | a        |   2 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | c        |   4 |
|      | c        |   4 |
|      | b        |   5 |
|      | b        |   6 |
|      | d        |   7 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
+------+----------+-----+

The DataFrame df2 below is sliced from it:

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   5 |
|      | b        |   6 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
|      | b        |   9 |
+------+----------+-----+

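The goal is to find the time differences between changes of Key in df2 (for example, from the last 3 to 5, from 5 to 6, from 6 to the first 8, from the last 8 to the first 9, and so on), add them up, repeat this for each Location, and average the results.

Can this process be vectorized, or do we need to split the DataFrame for each machine and compute the average manually?

[Edit]: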

Traceback (most recent call last):

  File "<ipython-input-1142-b85a122735aa>", line 1, in <module>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 930, in apply
    return self._python_apply_general(f)

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 936, in _python_apply_general
    self.axis)

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2273, in apply
    res = f(group)

  File "<ipython-input-1142-b85a122735aa>", line 1, in <lambda>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 1995, in diff
    result = algorithms.diff(com._values_from_object(self), periods)

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 1823, in diff
    out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Tags: python, pandas, vectorization

Solution


You can try the following:

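import pandas as pd
g = df.groupby(['Location', 'Key'])
# first Date of each Key minus the last Date of the previous Key, averaged per Location
(g.first() - g.last().groupby('Location').shift()).groupby(level=0).mean()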
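
The approach above can be sketched end to end as follows. This is only an illustration under some assumptions. The dates are invented sample values, because the Date column is empty in the question. Rows are assumed to be in chronological order, and each Key is assumed to form one contiguous run per Location. The dates are parsed with pd.to_datetime first, since the traceback in the question suggests that one of the columns being differenced holds strings.

```python
import pandas as pd

# Hypothetical sample data: the dates are invented for illustration only.
df2 = pd.DataFrame({
    "Date": ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-05",
             "2019-01-08", "2019-01-09", "2019-01-10", "2019-01-14",
             "2019-01-15", "2019-01-01", "2019-01-04"],
    "Location": ["b"] * 9 + ["a"] * 2,
    "Key": [3, 3, 3, 5, 6, 8, 8, 9, 9, 1, 2],
})

# Strings cannot be subtracted, so convert the column to datetimes first.
df2["Date"] = pd.to_datetime(df2["Date"])

# First and last date of every (Location, Key) group.
g = df2.groupby(["Location", "Key"])["Date"]
first, last = g.first(), g.last()

# Gap between the last row of the previous Key and the first row of this Key,
# shifting only within each Location so gaps never cross locations.
gaps = first - last.groupby(level="Location").shift()

# Average gap per Location in days; the missing gap of each first Key is skipped.
mean_gap_days = gaps.dt.total_seconds().div(86400).groupby(level="Location").mean()
print(mean_gap_days)
```

With this sample data the gaps for location b are 2, 3, 1 and 4 days, so its average is 2.5 days; location a has a single gap of 3 days. Converting to seconds before averaging is a design choice that sidesteps differences between pandas versions in averaging timedelta values inside a groupby.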
