python - Cutting a DataFrame into segments by difference from the first row in each segment
Problem description
. col_0 col_1 col_3
0 57342 122877889 25.524446
1 57343 122878889 25.527077
2 57344 122879889 26.582283
3 57345 122880889 27.594110
4 57346 122881889 28.612511
5 57347 122882889 28.517876
6 57348 122883889 29.521818
7 57349 122884889 28.517876
8 57350 122885889 28.473185
9 57351 122886889 28.483698
I have a DataFrame like the one above (with more rows).
I want to split the DataFrame as follows:
Each group's col_3 values are within a distance of 2 of the col_3 value in the group's first row. (So if row 0 has a col_3 of 25.0, all members of that group have col_3 values in the range 23.0 to 27.0.)
The first row that does not meet that criterion starts a new group.
The DataFrame above would therefore be grouped into rows [0 to 2] and rows [3 to 9].
So the output could be two DataFrames:
. col_0 col_1 col_3
0 57342 122877889 25.524446
1 57343 122878889 25.527077
2 57344 122879889 26.582283
and
. col_0 col_1 col_3
3 57345 122880889 27.594110
4 57346 122881889 28.612511
5 57347 122882889 28.517876
6 57348 122883889 29.521818
7 57349 122884889 28.517876
8 57350 122885889 28.473185
9 57351 122886889 28.483698
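Alternatively, just the values [0, 3] (the start of each bin).
How can I do this without iterating over the DataFrame row by row? Is this something cut can do?
Solution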
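Here is an approach that uses numpy broadcasting and a loop; the comments in the code describe each step. Note that it compares every pair of rows within a block, not only each row against the block's first row.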
import numpy as np
import pandas as pd

## dummy data
df = pd.DataFrame([['57342', '122877889', 25.524446], ['57343', '122878889', 25.527077], ['57344', '122879889', 26.582283], ['57345', '122880889', 27.59411], ['57346', '122881889', 28.612511], ['57347', '122882889', 28.517876], ['57348', '122883889', 29.521818], ['57349', '122884889', 29.517876], ['57350', '122885889', 32.473185], ['57351', '122886889', 32.483698]], columns=('col_0', 'col_1', 'col_3'))
## use numpy broadcasting to find the difference between each pair of numbers
## the result is a matrix in which each cell holds the difference for one pair
diff = np.abs(df["col_3"].values - df["col_3"].values[:, np.newaxis])
distance_gt2 = (diff > 2).astype(int)
print(distance_gt2)
## loop through the matrix and find contiguous blocks where the difference is <= 2
j = 0
segments = []
for i in range(len(df)):
    ## count the pairs among rows j..i that are more than 2 apart
    s = np.sum(distance_gt2[j:i+1, j:i+1])
    ## when the sum is greater than 0, row i starts the next segment
    if s > 0:
        segments.append(df[j:i])
        j = i
segments.append(df[j:len(df)])
for segment in segments:
    print(segment)
Note the contiguous blocks of zeros along the diagonal, which mark rows whose pairwise distance is at most 2
## print(distance_gt2)
[[0 0 0 1 1 1 1 1 1 1]
[0 0 0 1 1 1 1 1 1 1]
[0 0 0 0 1 0 1 1 1 1]
[1 1 0 0 0 0 0 0 1 1]
[1 1 1 0 0 0 0 0 1 1]
[1 1 0 0 0 0 0 0 1 1]
[1 1 1 0 0 0 0 0 1 1]
[1 1 1 0 0 0 0 0 1 1]
[1 1 1 1 1 1 1 1 0 0]
[1 1 1 1 1 1 1 1 0 0]]
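To make it easier to see where a matrix like this comes from, here is a small sketch of the same broadcasting step on three made-up values (they are illustrative only and do not come from the question's data):

```python
import numpy as np

# Three hypothetical values, chosen only for illustration
values = np.array([25.0, 26.5, 28.0])

# values has shape (3,) and values[:, np.newaxis] has shape (3, 1).
# Broadcasting the subtraction yields a (3, 3) matrix in which
# cell [r, c] holds |values[c] - values[r]|.
diff = np.abs(values - values[:, np.newaxis])

# 1 where the pair is more than 2 apart, 0 otherwise
distance_gt2 = (diff > 2).astype(int)
print(distance_gt2)
```

Only the pair (25.0, 28.0) is more than 2 apart, so the two corner cells are 1 and everything else is 0. The diagonal is always 0 because each value is compared with itself.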
Result
col_0 col_1 col_3
0 57342 122877889 25.524446
1 57343 122878889 25.527077
2 57344 122879889 26.582283
col_0 col_1 col_3
3 57345 122880889 27.594110
4 57346 122881889 28.612511
5 57347 122882889 28.517876
6 57348 122883889 29.521818
7 57349 122884889 29.517876
col_0 col_1 col_3
8 57350 122885889 32.473185
9 57351 122886889 32.483698
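The question's rule compares each row only with the first row of its group, which is slightly different from the pairwise check above. A minimal sketch of that first-row rule follows. It assumes a single pass over the values is acceptable, and the helper name `segment_starts` is hypothetical, not part of pandas or numpy:

```python
def segment_starts(values, max_dist=2.0):
    """Return the positions at which a new segment begins.

    A new segment starts at the first value whose absolute difference
    from the current segment's first value exceeds max_dist.
    """
    starts = []
    anchor = None  # first value of the current segment
    for pos, value in enumerate(values):
        if anchor is None or abs(value - anchor) > max_dist:
            starts.append(pos)
            anchor = value
    return starts


# col_3 values from the question's example DataFrame
col_3 = [25.524446, 25.527077, 26.582283, 27.594110, 28.612511,
         28.517876, 29.521818, 28.517876, 28.473185, 28.483698]

starts = segment_starts(col_3)
print(starts)

# Turn the start positions into (start, stop) slices
bounds = starts + [len(col_3)]
slices = list(zip(bounds[:-1], bounds[1:]))
print(slices)
```

On the question's data this gives the start positions [0, 3]. With a DataFrame, one could pass `df["col_3"].to_numpy()` to the helper and take `df.iloc[a:b]` for each `(a, b)` pair to obtain the separate frames. The loop runs over a plain sequence of values rather than over DataFrame rows, because each segment's anchor depends on where the previous segment ended.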