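python - How to optimize Shapely and Sklearn code?
Question
I'm working with a dataset of 4.2 million points. My code already takes a while to run, but the snippet below takes hours. It was provided in another public question; basically, for each point it finds the nearest LineString, finds the closest point on that LineString, and computes the distance.
The code does its job well, but it takes far too long for my purposes. How can I optimize it, or do the same thing in less time?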
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points
from sklearn.neighbors import DistanceMetric
EARTH_RADIUS_IN_MILES = 3440.1 #NAUTICAL MILES
panama = gpd.read_file("/Users/Danilo/Documents/Python/panama_coastline/panama_coastline.shp")
for c in range(b):
    #p = Point(-77.65325423107359,9.222038196656131)
    p=Point(data['longitude'][c],data['latitude'][c])
    def closest_line(point, linestrings):
        return np.argmin( [p.distance(linestring) for linestring in panama.geometry] )
    closest_linestring = panama.geometry[ closest_line(p, panama.geometry) ]
    closest_linestring
    closest_point = nearest_points(p, closest_linestring)
    dist = DistanceMetric.get_metric('haversine')
    points_as_floats = [ np.array([p.x, p.y]) for p in closest_point ]
    haversine_distances = dist.pairwise(np.radians(points_as_floats), np.radians(points_as_floats) )
    haversine_distances *= EARTH_RADIUS_IN_MILES
    dtc1=haversine_distances[0][1]
    dtc.append(dtc1)
Answer
Edit: simplified to a single computation using a BallTree
Imports
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points
Read the Panama coastline
panama = gpd.read_file("panama_coastline/panama_coastline.shp")
Get all coastline points, in (long, lat) order:
def get_points_as_numpy(geom):
    work_list = []
    for g in geom:
        work_list.append( np.array(g.coords) )
    return np.concatenate(work_list)
all_coastline_points = get_points_as_numpy(panama.geometry)
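One caveat: the tree built in the next step only sees the coastline's vertices, so the reported distance is to the nearest vertex rather than to the nearest point along a segment, which is what nearest_points returned in the question. If the shapefile contains long straight segments, that can overestimate the distance. A possible workaround is to add intermediate vertices before building the tree. The sketch below is only an illustration: densify and max_step_deg are hypothetical names, the coordinates are assumed to be 2D (long, lat) pairs, and it interpolates linearly in degrees, which is an approximation.

```python
import math

def densify(coords, max_step_deg=0.01):
    """Return coords with extra points inserted so that no segment
    is longer than max_step_deg (measured in plain degrees)."""
    if not coords:
        return []
    out = [tuple(coords[0])]
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        # number of sub-segments needed for this segment (at least 1)
        n = max(1, math.ceil(seg_len / max_step_deg))
        for i in range(1, n + 1):
            t = i / n
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

# toy segment one degree long, split into steps of at most 0.25 degrees
line = [(-80.0, 8.0), (-79.0, 8.0)]
dense = densify(line, max_step_deg=0.25)
```

Applied to each geometry's list(g.coords) before concatenating, this would trade a larger tree for a result closer to the true point-to-line distance.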
Create the BallTree
from sklearn.neighbors import BallTree
import numpy as np
panama_radians = np.radians(np.flip(all_coastline_points,axis=1))
tree = BallTree(panama_radians, leaf_size=12, metric='haversine')
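The haversine metric in scikit-learn expects each row as [latitude, longitude] in radians and returns an angular distance in radians. That is why the (long, lat) coordinates are flipped above and why the query result is multiplied by an Earth radius afterwards. For reference, here is a minimal pure-Python sketch of the same formula. The helper name haversine_nm is made up for this illustration, and the radius is the nautical-mile value used in this thread.

```python
import math

EARTH_RADIUS_NM = 3440.1  # nautical miles, same value as in the thread

def haversine_nm(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_NM):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # clamp against rounding error, then convert the central angle to a length
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))

# a quarter of the equator: from (0, 0) to (0, 90 degrees east)
quarter = haversine_nm(0.0, 0.0, 0.0, 90.0)
```

A function like this is handy for spot-checking a few rows of the tree's output by hand.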
Create 1M random points:
mean = [8.5,-80]
cov = [[1,0],[0,5]] # diagonal covariance, points lie on x or y-axis
random_gps = np.random.multivariate_normal(mean,cov,(10**6))
random_points = pd.DataFrame( {'lat' : random_gps[:,0], 'long' : random_gps[:,1]})
random_points.head()
Compute the nearest coastline point for each query point (under 30 seconds on my machine)
distances, index = tree.query( np.radians(random_gps), k=1)
Put the results into the DataFrame
EARTH_RADIUS_IN_MILES = 3440.1 # nautical miles, as in the question
random_points['distance_to_coast'] = distances * EARTH_RADIUS_IN_MILES
random_points['closest_lat'] = all_coastline_points[index][:,0,1]
random_points['closest_long'] = all_coastline_points[index][:,0,0]