Splitting a DataFrame into segments by difference from the segment's first row

Problem description

.      col_0          col_1          col_3
0  57342  122877889  25.524446
1  57343  122878889  25.527077
2  57344  122879889  26.582283
3  57345  122880889  27.594110
4  57346  122881889  28.612511
5  57347  122882889  28.517876
6  57348  122883889  29.521818
7  57349  122884889  28.517876
8  57350  122885889  28.473185
9  57351  122886889  28.483698

I have a DataFrame like the one above (with more rows).

I want to split the DataFrame like this:

Each group's col_3 values are within a distance of 2 of the col_3 value in the group's first row. (So if row 0 has a col_3 of 25.0, all the members of that group have col_3 values in the range 23.0 to 27.0.)

The first row that does not meet that criterion starts a new group.

So the DataFrame above would be grouped into rows [0 to 2] and rows [3 to 9].

So the output could be two DataFrames:

.      col_0          col_1          col_3
0  57342  122877889  25.524446
1  57343  122878889  25.527077
2  57344  122879889  26.582283

.      col_0          col_1          col_3    
3  57345  122880889  27.594110
4  57346  122881889  28.612511
5  57347  122882889  28.517876
6  57348  122883889  29.521818
7  57349  122884889  28.517876
8  57350  122885889  28.473185
9  57351  122886889  28.483698

Or just the values [0, 3] (the start of each bin).

Other than iterating over the DataFrame row by row, how can I do this? Is this something cut can do?

Tags: python, pandas

Solution


Here is an approach that uses numpy broadcasting and a loop; the comments in the code describe each step.

import numpy as np
import pandas as pd

## dummy data
df = pd.DataFrame([['57342', '122877889', 25.524446], ['57343', '122878889', 25.527077], ['57344', '122879889', 26.582283], ['57345', '122880889', 27.59411], ['57346', '122881889', 28.612511], ['57347', '122882889', 28.517876], ['57348', '122883889', 29.521818], ['57349', '122884889', 29.517876], ['57350', '122885889', 32.473185], ['57351', '122886889', 32.483698]], columns=('col_0', 'col_1', 'col_3'))

## use numpy broadcasting to find the absolute difference between each pair of values
## the result is a matrix in which cell [i, j] holds |col_3[j] - col_3[i]|
diff = np.abs(df["col_3"].values - df["col_3"].values[:, np.newaxis])
distance_gt2 = (diff > 2).astype(int)
print(distance_gt2)

## loop through the matrix and find contiguous blocks where every pairwise difference is <= 2
## (this is stricter than comparing each row only with the first row of its segment)
start = 0
segments = []
for end in range(1, len(df)):
    ## a nonzero entry in the block for rows start..end means row `end` does not fit
    if distance_gt2[start:end + 1, start:end + 1].any():
        segments.append(df.iloc[start:end])
        start = end

## whatever remains is the last segment
segments.append(df.iloc[start:])

for segment in segments:
    print(segment)

Note the contiguous blocks of zeros, which mark values that are within 2 of each other.

## print(distance_gt2)
[[0 0 0 1 1 1 1 1 1 1]
 [0 0 0 1 1 1 1 1 1 1]
 [0 0 0 0 1 0 1 1 1 1]
 [1 1 0 0 0 0 0 0 1 1]
 [1 1 1 0 0 0 0 0 1 1]
 [1 1 0 0 0 0 0 0 1 1]
 [1 1 1 0 0 0 0 0 1 1]
 [1 1 1 0 0 0 0 0 1 1]
 [1 1 1 1 1 1 1 1 0 0]
 [1 1 1 1 1 1 1 1 0 0]]
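
To make the broadcasting step easier to follow, here is a minimal sketch on three made-up values (they are an assumption for illustration, not part of the data above). It builds the same kind of 0/1 matrix, where a 1 marks a pair of values more than 2 apart.

```python
import numpy as np

# Hypothetical three-value example, not taken from the question's data.
values = np.array([25.0, 26.5, 28.0])

# values[:, np.newaxis] has shape (3, 1); combining it with the shape-(3,)
# array broadcasts both to (3, 3), so diff[i, j] == |values[j] - values[i]|.
diff = np.abs(values - values[:, np.newaxis])

# 1 where the pair is more than 2 apart, 0 otherwise.
distance_gt2 = (diff > 2).astype(int)
print(distance_gt2)
# [[0 0 1]
#  [0 0 0]
#  [1 0 0]]
```

Only the pair (25.0, 28.0) is more than 2 apart, so the matrix has a 1 in the two corners that correspond to that pair and zeros elsewhere.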

Result

   col_0      col_1      col_3
0  57342  122877889  25.524446
1  57343  122878889  25.527077
2  57344  122879889  26.582283

   col_0      col_1      col_3
3  57345  122880889  27.594110
4  57346  122881889  28.612511
5  57347  122882889  28.517876
6  57348  122883889  29.521818
7  57349  122884889  29.517876

   col_0      col_1      col_3
8  57350  122885889  32.473185
9  57351  122886889  32.483698
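
The question's rule compares each row only with the first row of its segment, which is not quite the same as requiring every pair in a block to be within 2. The following is a rough sketch of that first-row rule, not a definitive implementation. The helper name `segment_starts` and its `max_distance` parameter are made up for illustration. Because each segment's reference value depends on where the previous segment ended, the sketch makes a single pass over the column values rather than relying on a vectorized call; `pd.cut` works with bin edges chosen up front, so it does not seem to map directly onto this rule.

```python
import pandas as pd


def segment_starts(values, max_distance=2.0):
    """Return the positions at which a new segment starts (hypothetical helper)."""
    starts = []
    anchor = None  # col_3 value of the current segment's first row
    for pos, value in enumerate(values):
        # Start a new segment at the very first row, or when the value is
        # more than max_distance away from the current segment's first row.
        if anchor is None or abs(value - anchor) > max_distance:
            starts.append(pos)
            anchor = value
    return starts


# col_3 values copied from the question's example frame.
col_3 = [25.524446, 25.527077, 26.582283, 27.594110, 28.612511,
         28.517876, 29.521818, 28.517876, 28.473185, 28.483698]
df = pd.DataFrame({"col_3": col_3})

starts = segment_starts(df["col_3"].to_numpy())
print(starts)  # [0, 3]

# Turn the start positions into one DataFrame per segment.
bounds = starts + [len(df)]
segments = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```

On the question's data this gives the start positions [0, 3], that is, rows 0 to 2 and rows 3 to 9, matching the grouping the question asks for.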
