How to optimize pyspark code to compute distance per user?

Problem description

I want to compute the average distance for each ID within each zone. I am working in pyspark and using geospark.

My table looks like:

+--------------------+--------+----------+--------------------+--------------------+
|                  ID|    zone|      date|               point|              point1|
+--------------------+--------+----------+--------------------+--------------------+
|04607f5b-746e-455...|00295753|2020-03-18|POINT (-80.161590...|POINT (-80.161590...|
|05df916c-6269-485...|01383864|2020-03-17|POINT (-95.581115...|POINT (-95.581115...|
|1973aa17-863f-4de...|01383847|2020-03-17|POINT (-96.864837...|POINT (-96.864837...|
|1bba1026-dcb3-42f...|00465266|2020-03-17|POINT (-95.823860...|POINT (-95.823860...|
|2a16bc8c-a529-42e...|01266994|2020-03-18|POINT (-101.24329...|POINT (-101.24329...|
|352b142f-616e-46b...|01605066|2020-03-17|POINT (-105.73150...|POINT (-105.73150...|
|66952620-0cc2-4ba...|01383943|2020-03-17|POINT (-96.226104...|POINT (-96.226104...|
|7e901a60-9f16-4a9...|01383886|2020-03-19|POINT (-95.496803...|POINT (-95.496803...|
|80fdf1e3-92ca-4b1...|01383813|2020-03-16|POINT (-97.661605...|POINT (-97.661605...|
|81f3eb49-ef3f-48f...|00066975|2020-03-18|POINT (-93.562011...|POINT (-93.562011...|
+--------------------+--------+----------+--------------------+--------------------+

I want to compute the average distance for users in each zone, as well as the total number of distinct users per zone per day. I am using geospark and can run a simple query like this:

queryDistances = """
        SELECT ID, date,
        ST_Distance(point, point1) as distance
        FROM myTable
    """
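For unprojected POINT geometries like these, ST_Distance returns the planar (Euclidean) distance in the units of the coordinates, i.e. degrees here. A minimal pure-Python sketch of that computation, using hypothetical coordinates:

```python
import math

def st_distance(p1, p2):
    # Planar Euclidean distance between two (lon, lat) pairs,
    # mirroring what ST_Distance returns for unprojected points.
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

point = (-80.161590, 25.950250)   # hypothetical coordinates
point1 = (-80.161590, 25.960250)
print(st_distance(point, point1))  # ~0.01 degrees
```

If real-world distances (meters) are needed, the points would have to be reprojected first; that step is outside the scope of this question.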

I want to measure the distance between point and point1, then compute the average distance and the total number of distinct IDs per zone per date.

I would like a table like:

    zone        date        avg(distance)   tot(users)
    00295753    2020-03-18  5.5             74
    01383864    2020-03-17  7.3             117

Tags: python, sql, pyspark, pyspark-sql

Solution


You need to play with "group by" a bit. Since you want one row per zone per date, group by those two columns, average the distance, and count distinct IDs. Write the query like this:

select zone, date,
       AVG(ST_Distance(point, point1)) as avg_distance,
       COUNT(DISTINCT ID) as total_users
from myTable
group by zone, date
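The aggregation this query performs can be sketched in plain Python. The rows below are hypothetical, standing in for the table after ST_Distance has been applied; per (zone, date) group we average the distances and count distinct IDs:

```python
from collections import defaultdict

# Hypothetical rows: (ID, zone, date, distance)
rows = [
    ("u1", "00295753", "2020-03-18", 5.0),
    ("u2", "00295753", "2020-03-18", 6.0),
    ("u2", "00295753", "2020-03-18", 5.5),  # same user seen twice that day
    ("u3", "01383864", "2020-03-17", 7.3),
]

# Group by (zone, date), collecting distances and distinct user IDs
groups = defaultdict(lambda: {"dists": [], "ids": set()})
for user_id, zone, date, dist in rows:
    g = groups[(zone, date)]
    g["dists"].append(dist)
    g["ids"].add(user_id)

for (zone, date), g in sorted(groups.items()):
    avg = sum(g["dists"]) / len(g["dists"])
    print(zone, date, round(avg, 2), len(g["ids"]))
```

Note that AVG averages every row in the group, while COUNT(DISTINCT ID) deduplicates users, so a user appearing twice in a zone on the same day contributes two distances but only one to the user count.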
