首页 > 解决方案 > 如何删除除数据框中最后一行之外的特定单词之前的所有行以计算“增量”

问题描述

我有一个包含 [3599 行 x 5 列] 的数据框。我需要删除特定单词之前的所有行,除了数据框中的最后一行来计算“增量”。这是我拥有的数据框的示例: 

df = 
             daytime       ProductCase       timestamp           delta 
 
0     2020-01-01 00:00:01    Product  2020-01-01 00:00:01             NaT   

1     2020-01-01 00:00:02    Product  2020-01-01 00:00:02 0 days 00:00:01   

2     2020-01-01 00:00:03    Product  2020-01-01 00:00:03 0 days 00:00:01   

3     2020-01-01 00:00:04    NoProduct  2020-01-01 00:00:04 0 days 00:00:01   

4     2020-01-01 00:00:05    NoProduct  2020-01-01 00:00:05 0 days 00:00:01  

5     2020-01-01 00:00:06    NoProduct  2020-01-01 00:00:06 0 days 00:00:01   

6     2020-01-01 00:00:07    NoProduct 2020-01-01 00:00:07 0 days 00:00:01

7     2020-01-01 00:00:08    Product  2020-01-01 00:00:08 0 days 00:00:01   

8     2020-01-01 00:00:09    Product  2020-01-01 00:00:09 0 days 00:00:01   

9     2020-01-01 00:00:10    NoProduct  2020-01-01 00:00:10 0 days 00:00:01  

10     2020-01-01 00:00:11    NoProduct  2020-01-01 00:00:11 0 days 00:00:01   

11     2020-01-01 00:00:12    NoProduct 2020-01-01 00:00:12 0 days 00:00:01 

12     2020-01-01 00:00:13    NoProduct 2020-01-01 00:00:13 0 days 00:00:01 

13     2020-01-01 00:00:14    NoProduct 2020-01-01 00:00:14 0 days 00:00:01 

      time  

0      NaN  

1      1.0  

2      1.0  

3      1.0  

4      1.0  

5      1.0  

6      1.0  

7      1.0

8      1.0  

9      1.0  

10      1.0    

11      1.0  

12     1.0 

13     1.0 

我想知道如何删除包含“NoProduct”的所有行,除了最后一行,计算增量以获得下面的数据框。

             daytime       ProductCase       timestamp           delta  
0     2020-01-01 00:00:01    Product  2020-01-01 00:00:01             NaT   

1     2020-01-01 00:00:02    Product  2020-01-01 00:00:02 0 days 00:00:01   

2     2020-01-01 00:00:03    Product  2020-01-01 00:00:03 0 days 00:00:01   

3     2020-01-01 00:00:07    NoProduct 2020-01-01 00:00:07 0 days 00:00:04

4     2020-01-01 00:00:08    Product  2020-01-01 00:00:08 0 days 00:00:01   

5     2020-01-01 00:00:09    Product  2020-01-01 00:00:09 0 days 00:00:01   

6     2020-01-01 00:00:14    NoProduct 2020-01-01 00:00:14 0 days 00:00:05 


      time  

0      NaN  

1      1.0  

2      1.0  

3      4.0  

4      1.0  

5      1.0  

6      5.0  

标签: python-3.xpandasdataframedictionary

解决方案


要删除除最后一行后跟“Product”行之外的所有“NoProduct”行,请使用二进制掩码:

# Ensure dtypes are corrects
df['daytime'] = pd.to_datetime(df['daytime'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['delta'] = pd.to_timedelta(df['delta'])

# Create the boolean mask
m1 = df['ProductCase'].eq('Product')
m2 = df['ProductCase'].eq('NoProduct') \
         .eq(df['ProductCase'].shift(-1, fill_value='Product').eq('Product'))
mask = m1 | m2

# Do a copy of selected rows
out = df[mask].copy()

# Updates column values
out['delta'] = out['timestamp'].diff()
out['time'] = out['delta'].dt.total_seconds()

输出:

               daytime ProductCase           timestamp           delta  time
0  2020-01-01 00:00:01     Product 2020-01-01 00:00:01             NaT   NaN
1  2020-01-01 00:00:02     Product 2020-01-01 00:00:02 0 days 00:00:01   1.0
2  2020-01-01 00:00:03     Product 2020-01-01 00:00:03 0 days 00:00:01   1.0
6  2020-01-01 00:00:07   NoProduct 2020-01-01 00:00:07 0 days 00:00:04   4.0
7  2020-01-01 00:00:08     Product 2020-01-01 00:00:08 0 days 00:00:01   1.0
8  2020-01-01 00:00:09     Product 2020-01-01 00:00:09 0 days 00:00:01   1.0
13 2020-01-01 00:00:14   NoProduct 2020-01-01 00:00:14 0 days 00:00:05   5.0

细节:

>>> pd.concat([df['ProductCase'], mask.rename('m1|m2'), 
               m1.rename('m1'), m2.rename('m2')], axis=1)

   ProductCase  m1|m2     m1     m2
0      Product   True   True  False  # m1, keep ('Product')
1      Product   True   True  False  # m1, keep ('Product')
2      Product   True   True   True  # m1, keep ('Product')
3    NoProduct  False  False  False
4    NoProduct  False  False  False
5    NoProduct  False  False  False
6    NoProduct   True  False   True  # m2, keep (followed by 'Product')
7      Product   True   True  False  # m1, keep ('Product')
8      Product   True   True   True  # m1, keep ('Product')
9    NoProduct  False  False  False
10   NoProduct  False  False  False
11   NoProduct  False  False  False
12   NoProduct  False  False  False
13   NoProduct   True  False   True  # m2, Special case of 'shift'

对于特殊情况,fill_value='Product'在移动时使用创建一个虚拟行 (idx 14) 以确保它始终被保留。


推荐阅读