Pandas - Compare 2 dataframes and find the closest value

Problem description

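I have 2 dataframes in Pandas that both contain longitude and latitude columns. I am trying to loop through each row of the first dataframe and find the closest matching longitude and latitude in the second dataframe.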

So far I have this in Python, which I found in another SO post...

from math import cos, asin, sqrt

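def distance(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in kilometres; inputs are in degrees.
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))  # 12742 km = 2 * 6371 km (Earth's mean radius)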

def closest(data, v):
    return min(data, key=lambda p: distance(v['lat'],v['lon'],p['lat'],p['lon']))

tempDataList = [{'lat': 39.7612992, 'lon': -86.1519681}, 
                {'lat': 39.762241,  'lon': -86.158436 }, 
                {'lat': 39.7622292, 'lon': -86.1578917}]

v = {'lat': 39.7622290, 'lon': -86.1519750}
print(closest(tempDataList, v))

I was going to try to modify this to work with my pandas dataframes, but is there a more efficient way to do this, perhaps with PyProj?

Does anyone have an example or similar code?

Tags: python, pandas, gis, pyproj

Solution


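I think you will be able to do this more easily with a GIS library. geopandas and shapely make it more comfortable (geopandas also uses pyproj to handle coordinate systems). Start from the code below.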

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import nearest_points

tempDataList = [{'lat': 39.7612992, 'lon': -86.1519681}, 
                {'lat': 39.762241,  'lon': -86.158436 }, 
                {'lat': 39.7622292, 'lon': -86.1578917}]

df = pd.DataFrame(tempDataList)

#make point geometry for geopandas
geometry = [Point(xy) for xy in zip(df['lon'], df['lat'])]

#use a coordinate system that matches your coordinates. EPSG 4326 is WGS84
gdf = gpd.GeoDataFrame(df, crs = "EPSG:4326", geometry = geometry) 

#build the query point as a shapely Point (x = lon, y = lat)
v = {'lat': 39.7622290, 'lon': -86.1519750}
tp = Point(v['lon'], v['lat'])

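#now you can calculate the distance between v and the others.
#note: with EPSG:4326 the result is in degrees, not metres; reproject to a projected CRS for metric distances.
print(gdf.distance(tp))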

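#If you want to get the nearest point, merge all points into one MultiPoint first
#(recent geopandas releases offer union_all() as the replacement for unary_union)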
multipoints = gdf['geometry'].unary_union
queried_point, nearest_point = nearest_points(tp, multipoints)
print(nearest_point)
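
The code above matches a single point against one dataframe, while the question asks for the closest match for every row of a first dataframe. One possible way to do that without a GIS library is a minimal sketch using NumPy broadcasting and the same haversine formula as in the question. The names `haversine_matrix`, `nearest_rows`, `df1`, `df2` and the output column names are made up for illustration, and both frames are assumed to have `lat` and `lon` columns in degrees.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, i.e. 12742 / 2 from the question's code


def haversine_matrix(lat1, lon1, lat2, lon2):
    """Great-circle distances in km between every point of set 1 and every point of set 2."""
    # Convert to radians and reshape so that broadcasting yields an (n1, n2) matrix.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def nearest_rows(df1, df2):
    """For each row of df1, attach the closest point of df2 and its distance in km."""
    dist = haversine_matrix(df1["lat"], df1["lon"], df2["lat"], df2["lon"])
    idx = dist.argmin(axis=1)  # position in df2 of the nearest point for each df1 row
    out = df1.reset_index(drop=True).copy()
    out["nearest_lat"] = df2["lat"].to_numpy()[idx]
    out["nearest_lon"] = df2["lon"].to_numpy()[idx]
    out["distance_km"] = dist[np.arange(len(df1)), idx]
    return out


# df1 is a hypothetical first dataframe; df2 reuses the points from the question.
df1 = pd.DataFrame([{"lat": 39.7622290, "lon": -86.1519750}])
df2 = pd.DataFrame([{"lat": 39.7612992, "lon": -86.1519681},
                    {"lat": 39.762241, "lon": -86.158436},
                    {"lat": 39.7622292, "lon": -86.1578917}])
result = nearest_rows(df1, df2)
print(result)
```

This builds the full n1 x n2 distance matrix, so it is only practical when both dataframes are reasonably small. For large inputs a spatial index is likely a better fit, for example scikit-learn's `BallTree` with the `haversine` metric.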
