Rolling status on a Pandas DataFrame

Problem description

+------+--------+------------+------------+---+---+---+
| area | locale |    date    |  end date  | i | t | o |
+------+--------+------------+------------+---+---+---+
| abc  | abc25  | 2001-03-01 | 2001-04-01 | 1 |   |   |
| abc  | abc25  | 2001-04-01 | 2001-05-01 | 1 |   |   |
| abc  | abc25  | 2001-05-01 | 2001-06-01 | 1 |   |   |
| abc  | abc25  | 2001-06-01 | 2001-07-01 |   | 1 |   |
| abc  | abc25  | 2001-07-01 | 2001-08-01 |   |   | 1 |
| abc  | abc25  | 2001-08-01 | 2001-09-01 |   | 1 |   |
| abc  | abc25  | 2001-09-01 | 2001-05-01 |   | 1 |   |
| abc  | abc25  | 2001-10-01 | 2001-11-01 |   | 1 |   |
| abc  | abc25  | 2001-11-01 | 2001-12-01 |   |   | 1 |
| abc  | abc25  | 2001-12-01 |            |   |   | 1 |
| def  | def25  | 2001-03-01 | 2001-04-01 |   |   | 1 |
| def  | def25  | 2001-04-01 | 2001-05-01 |   |   | 1 |
| def  | def25  | 2001-05-01 | 2001-06-01 |   |   | 1 |
| def  | def25  | 2001-06-01 | 2001-07-01 |   | 1 |   |
| def  | def25  | 2001-07-01 | 2001-08-01 |   | 1 |   |
| def  | def25  | 2001-08-01 | 2001-09-01 | 1 |   |   |
| def  | def25  | 2001-09-01 | 2001-05-01 | 1 |   |   |
| def  | def25  | 2001-10-01 | 2001-11-01 |   | 1 |   |
| def  | def25  | 2001-11-01 | 2001-12-01 |   |   | 1 |
| def  | def25  | 2001-12-01 |            |   |   | 1 |
+------+--------+------------+------------+---+---+---+

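This is a sample of the data table I am working with. What I am trying to do is add a status column to it. The status column is a bit tricky, but the criteria are as follows:

  1. If any 2 consecutive time periods have the same i/t/o flag, they get the associated status (say R/Y/G).
  2. If the two periods have different statuses, pick the "best" one.

Example output: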

+------+--------+------------+------------+---+---+---+--------+
| area | locale |    date    |  end date  | i | t | o | Status |
+------+--------+------------+------------+---+---+---+--------+
| abc  | abc25  | 2001-03-01 | 2001-04-01 | 1 |   |   |  NONE  |
| abc  | abc25  | 2001-04-01 | 2001-05-01 | 1 |   |   |  R     |
| abc  | abc25  | 2001-05-01 | 2001-06-01 | 1 |   |   |  R     |
| abc  | abc25  | 2001-06-01 | 2001-07-01 |   | 1 |   |  Y     |
| abc  | abc25  | 2001-07-01 | 2001-08-01 |   |   | 1 |  G     |
| abc  | abc25  | 2001-08-01 | 2001-09-01 |   | 1 |   |  G     |
| abc  | abc25  | 2001-09-01 | 2001-05-01 |   | 1 |   |  Y     |
| abc  | abc25  | 2001-10-01 | 2001-11-01 |   | 1 |   |  Y     |
| abc  | abc25  | 2001-11-01 | 2001-12-01 |   |   | 1 |  G     |
| abc  | abc25  | 2001-12-01 |            |   |   | 1 |  G     |
| def  | def25  | 2001-03-01 | 2001-04-01 |   |   | 1 |  NONE  |
| def  | def25  | 2001-04-01 | 2001-05-01 |   |   | 1 |  G     |
| def  | def25  | 2001-05-01 | 2001-06-01 |   |   | 1 |  G     |
| def  | def25  | 2001-06-01 | 2001-07-01 |   | 1 |   |  G     |
| def  | def25  | 2001-07-01 | 2001-08-01 |   | 1 |   |  Y     |
| def  | def25  | 2001-08-01 | 2001-09-01 | 1 |   |   |  Y     |
| def  | def25  | 2001-09-01 | 2001-05-01 | 1 |   |   |  R     |
| def  | def25  | 2001-10-01 | 2001-11-01 |   | 1 |   |  Y     |
| def  | def25  | 2001-11-01 | 2001-12-01 |   |   | 1 |  G     |
| def  | def25  | 2001-12-01 |            |   |   | 1 |  G     |
+------+--------+------------+------------+---+---+---+--------+

I have looked at pandas rolling, but that may not be the best approach; I tried the following:

df.groupby('locale')['o'].rolling(2).sum()

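This works on its own, but I can't seem to create a column from it, so that I could say "if it == 2, then it gets whatever the status is". I also tried using it in an if statement: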

if df.groupby('locale')['o'].rolling(2).sum() == 2.0 : 
    df['locale_status'] = 'Green'

This gives an error about the truth value of a Series.
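The error arises because `if` needs a single boolean, while the comparison yields a boolean Series with one value per row. The following is a minimal sketch of one way around it, not a definitive fix. It assumes a small made-up frame in which `o` is already numeric (blanks replaced by 0), and the data is illustrative only. The rolling sum is computed per group, the group level is dropped from the index so the result aligns with the original frame, and the comparison is then used as a row mask.

```python
import pandas as pd

# Hypothetical toy data; 'o' is already numeric (blanks replaced by 0).
df = pd.DataFrame({
    'locale': ['abc25'] * 4 + ['def25'] * 3,
    'o':      [0, 1, 1, 0, 1, 1, 1],
})

# groupby().rolling() returns a Series indexed by (locale, original index);
# dropping the group level lets it align with df's own index.
rolled = (
    df.groupby('locale')['o']
      .rolling(2)
      .sum()
      .reset_index(level=0, drop=True)
)

# The comparison is a boolean Series, so use it as a row mask, not in `if`.
df['locale_status'] = None
df.loc[rolled == 2, 'locale_status'] = 'Green'
```

Rows where the two-period sum within a locale is 2 receive 'Green'; all other rows, including the first row of each locale, keep the placeholder value.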

I also tried:

if df.groupby('locale')['o'] == df.groupby('locale')['o'].shift() : df['test'] = 'Green' 

This results in an invalid type comparison.
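A grouped object is not itself a column of values, which is presumably why comparing two of them fails. One way to express "same as the previous row within the same locale" is to compare the column with its grouped shift, which is a plain Series aligned to the frame. The sketch below uses made-up data, and `prev_o` and `same_as_prev` are illustrative names, not part of the original attempt.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a numeric 'o' flag.
df = pd.DataFrame({
    'locale': ['abc25', 'abc25', 'abc25', 'def25', 'def25'],
    'o':      [1, 1, 0, 1, 1],
})

# shift() within each group: the first row of every locale gets NaN,
# so values never leak across locale boundaries.
prev_o = df.groupby('locale')['o'].shift()

# True where both this row and the previous row of the same locale are flagged.
same_as_prev = (df['o'] == 1) & (prev_o == 1)

# np.where builds the whole column at once from the boolean mask.
df['test'] = np.where(same_as_prev, 'Green', '')
```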

Tags: python, pandas

Solution


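I don't think this problem lends itself to vectorization or to pandas' efficient built-ins, but I would love to be proven wrong by one of the ninjas here. My solution involves some preparation work with pd.read_clipboard() that you may not need.

Basically, I replaced the blanks with 0, used idxmax to get the "current" letter, and checked whether there is a streak. Then I iterate over the rows within each group of the groupby to pick the winner.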

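# Data cleaning from the clipboard; probably irrelevant to the OP
import numpy as np
import pandas as pd

df = pd.read_clipboard(sep='|', engine='python', header=1)
df = df.reset_index().iloc[1:-1, 1:-1]
df = df.rename(columns={' i ': 'i', ' t ': 't', ' o ': 'o'})
df = df.drop(columns='Unnamed: 0')  # keyword form; pandas 2.0 removed the positional axis argument
df = df.replace('   ', 0)

# Name of the flagged column in each row, and whether it repeats the previous row's flag
df['current'] = df[['i', 't', 'o']].astype(int).idxmax(axis=1)
df['streak'] = df['current'] == df['current'].shift(1)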

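# Rank the flags so the "best" of two differing statuses can be picked
weights = {'i':0, 't':1, 'o':2}
results = []
for val in df[' area '].unique():
    # Rows of one area, re-indexed from 0 so that idx-1 is the previous period
    temp = df.loc[df.groupby(' area ').groups[val]].reset_index(drop=True)
    winner = []
    for idx, row in temp.iterrows():
        if idx == 0:
            # No previous period to compare with
            winner.append(np.nan)
        else:
            current = row['current']
            if row['streak']:
                # Same flag as the previous period
                winner.append(current)
            else:
                # Flags differ: keep whichever ranks higher
                last = temp.loc[idx-1, 'current']
                if weights[last] > weights[current]:
                    winner.append(last)
                else:
                    winner.append(current)
    temp['winner'] = winner
    results.append(temp)

res = pd.concat(results)
# Translate the winning flag into its status letter
res['winner'] = res['winner'].map({'i':'R','t':'Y','o':'G'})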
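For comparison, the same rule can probably be expressed without an explicit row loop. This is a sketch of an alternative, not the approach above. It assumes cleaned column names (`area`, `i`, `t`, `o`), 0/1 integer flags with exactly one flag set per row, and the ranking i < t < o taken from the `weights` dict. The toy data is made up.

```python
import pandas as pd

# Hypothetical toy data: one flag set per row, two areas.
df = pd.DataFrame({
    'area': ['abc'] * 5 + ['def'] * 3,
    'i':    [1, 1, 0, 0, 0, 0, 0, 1],
    't':    [0, 0, 1, 0, 1, 0, 1, 0],
    'o':    [0, 0, 0, 1, 0, 1, 0, 0],
})

weights = {'i': 0, 't': 1, 'o': 2}

# Flag of the current row and of the previous row within the same area
# (the first row of each area has no previous row, so it gets NaN).
current = df[['i', 't', 'o']].idxmax(axis=1)
previous = current.groupby(df['area']).shift()

# Keep the current flag where it ranks at least as high as the previous one;
# otherwise fall back to the previous flag (NaN on each area's first row).
keep_current = current.map(weights) >= previous.map(weights)
best = current.where(keep_current, previous)

df['Status'] = best.map({'i': 'R', 't': 'Y', 'o': 'G'})
```

When the two flags are equal the comparison keeps the current one, so a streak and a "pick the best" case are handled by the same expression. The first row of each area stays empty rather than showing NONE.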
