python - How to find the 8 nearest points for each point in a set of over 1M points with python pandas
Problem Description
I have several hundred gz files, each containing the coordinates of roughly 0.5M~1M rectangular boxes. Each box has a unique index called localIdx, and the coordinates of each box are llx, lly, urx, ury.
I can get the center x/y of each box with x=(llx+urx)/2, y=(lly+ury)/2, which converts each box into a point. Now, for each point (box), I want to find its 8 nearest points (boxes) and return their localIdx.
Here is what I did:
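The box-to-point conversion above can be sketched as follows. This is a minimal example assuming the column names from the question; the two inline rows stand in for a real gz file, which pandas can read directly via `pd.read_csv('boxes.gz', compression='gzip')`.

```python
import io
import pandas as pd

# Two toy rows standing in for one gz file (assumed column layout).
csv = "localIdx,llx,lly,urx,ury\na,0,0,2,2\nb,10,0,12,4\n"
df = pd.read_csv(io.StringIO(csv)).set_index('localIdx')

# Box center, as in the question: x=(llx+urx)/2, y=(lly+ury)/2
df['x'] = (df.llx + df.urx) / 2
df['y'] = (df.lly + df.ury) / 2
# Height and width, as in step 3 below: h=ury-lly, w=urx-llx
df['h'] = df.ury - df.lly
df['w'] = df.urx - df.llx
```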
1. read in the gz files with python pandas
2. set the column 'localIdx' of each point as the index
3. get the height and width of each box by h=ury-lly, w=urx-llx
4. for each point, filter in the points whose x is in the range current_point_x +/- 20*w and whose y is in the range current_point_y +/- 20*h
5. convert the filtered-in points' x/y and the current point's x/y into two 2D numpy arrays
6. get the Euclidean distance with scipy.spatial.distance.cdist
7. merge the result of step 6 into the filtered-in DataFrame to map the localIdx
8. select the 8 nearest localIdx and combine them into a string
9. assign that localIdx string to the point
Here is the core function of my code:
import numpy as np
import scipy.spatial

def seek_norm_list(line, target_df=None, rmax=None, nmax=None, keycol=None):
    if line.padType == 'DUT':
        res_id = []
        key_value = line[keycol]
        current_pad = np.array([[line.xbbox, line.ybbox]])
        h, w = line['h'], line['w']
        h1, h2 = line.ybbox - h*20, line.ybbox + h*20
        w1, w2 = line.xbbox - w*20, line.xbbox + w*20
        # compare x against the x-derived bounds (w1, w2) and
        # y against the y-derived bounds (h1, h2)
        target_mask = (target_df['xbbox'] > w1) & (target_df['xbbox'] < w2) & \
                      (target_df['ybbox'] > h1) & (target_df['ybbox'] < h2)
        target_df = target_df[target_mask]
        nbh_blks = line.nbh_blk.split(":")
        a = np.array(list(zip(target_df.xbbox, target_df.ybbox)))
        if len(a) > 0:
            d = scipy.spatial.distance.cdist(a, current_pad)
            target_df['dist'] = d
            key_target = target_df[target_df[keycol] == key_value]
            # sort a copy; inplace=True on a slice triggers SettingWithCopyWarning
            key_target = key_target.sort_values(by='dist')
            res_target = key_target[key_target.dist < rmax]
            keep_id = list(res_target['localIdx'])
            if line['localIdx'] in keep_id:
                keep_id.remove(line['localIdx'])
            if len(keep_id) > int(nmax):
                keep_id = keep_id[:int(nmax)]
            for bk in nbh_blks:
                for idx in keep_id:
                    if bk in idx:
                        res_id.append(idx)
            line['normList'] = ":".join(res_id)
            line['refCount'] = len(res_id)
            if len(res_id) > 0:
                min_id, max_id = keep_id[0], keep_id[-1]
                line['minDist'] = res_target.loc[min_id, 'dist']
                line['maxDist'] = res_target.loc[max_id, 'dist']
            else:
                line['minDist'] = ''
                line['maxDist'] = ''
        else:
            line['normList'], line['refCount'] = '', ''
            line['minDist'], line['maxDist'] = '', ''
    return line
This is very, very slow for each gz file, and in my case there are about 600 files with more than 120M rows in total. I already use multiprocessing on my 16-core machine.
I would like to get the result within 3 hours. Is that possible with Python?
Solution
This is a k-nearest-neighbors problem; you can use sklearn (scikit-learn).
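A minimal sketch of the scikit-learn approach, using `sklearn.neighbors.NearestNeighbors`. The column names, the `p0`-style localIdx labels, and the random data are assumptions for illustration; the output columns mirror the question's `normList`, `minDist`, and `maxDist`.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Synthetic boxes standing in for one gz file (assumed schema).
rng = np.random.default_rng(0)
n = 10000
llx = rng.uniform(0, 1000, n)
lly = rng.uniform(0, 1000, n)
df = pd.DataFrame({
    'llx': llx, 'lly': lly,
    'urx': llx + rng.uniform(1, 5, n),
    'ury': lly + rng.uniform(1, 5, n),
}, index=[f'p{i}' for i in range(n)])
df.index.name = 'localIdx'

# Box centers, as in the question: x=(llx+urx)/2, y=(lly+ury)/2
pts = np.column_stack([(df.llx + df.urx) / 2, (df.lly + df.ury) / 2])

# Query 9 neighbors per point: the nearest is the point itself, so drop column 0.
nn = NearestNeighbors(n_neighbors=9, algorithm='kd_tree').fit(pts)
dist, idx = nn.kneighbors(pts)
dist, idx = dist[:, 1:], idx[:, 1:]

# Map integer positions back to localIdx and join them into one string per point.
labels = df.index.to_numpy()
df['normList'] = [":".join(labels[row]) for row in idx]
df['minDist'] = dist[:, 0]
df['maxDist'] = dist[:, -1]
```

The KD-tree is built once per file and all queries are answered in a single vectorized call, which replaces the per-row filtering and cdist loop entirely; `scipy.spatial.cKDTree.query` offers the same capability if you prefer to stay within SciPy.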