How do I invalidate Impala's metadata using Spark?

Problem description

I first insert data into an empty table using PySpark, but then I need to automate the process. From PySpark, how can I invalidate the metadata or refresh the data so that it reads correctly in Impala?

Here is my code example:

spark.sql("""
SELECT
    gps_data_adj.trip_duration
    , gps_data_adj.geometry
    , trip_summary.TRIP_HAVERSINE_DISTANCE
    , trip_summary.TRIP_GPS_DURATION
    , gps_data_adj.HAVERSINE_DISTANCE
    , gps_data_adj.GPS_INTERVAL
    , gps_data_adj.HAVERSINE_DISTANCE / trip_summary.TRIP_HAVERSINE_DISTANCE AS HAVERSINE_DISTANCE_FRACTION
    , gps_data_adj.GPS_INTERVAL / trip_summary.TRIP_GPS_DURATION AS GPS_INTERVAL_FRACTION
    , (gps_data_adj.HAVERSINE_DISTANCE / trip_summary.TRIP_HAVERSINE_DISTANCE) * gps_data_adj.trip_distance_travelled AS HAVERSINE_DISTANCE_ADJ
    , (gps_data_adj.GPS_INTERVAL / trip_summary.TRIP_GPS_DURATION) * gps_data_adj.trip_duration AS GPS_INTERVAL_ADJ
FROM
    gps_data_adj
INNER JOIN
    (
        SELECT
            trip_id
            , sum(COSINES_DISTANCE) AS TRIP_COSINES_DISTANCE
            , sum(HAVERSINE_DISTANCE) AS TRIP_HAVERSINE_DISTANCE
            , sum(GPS_INTERVAL) AS TRIP_GPS_DURATION
        FROM
            gps_data_adj
        GROUP BY
            trip_id
    ) trip_summary
    ON gps_data_adj.trip_id = trip_summary.trip_id
""").write.format('parquet').mode('append').insertInto('driving_data_TEST')

Tags: apache-spark, pyspark, metadata, impala

Solution
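Spark writes the Parquet files directly to the table's storage, so Impala's catalog does not see the new files until you tell it to look. The usual fix is to issue a `REFRESH table` (or, for a brand-new table or a schema change, the heavier `INVALIDATE METADATA table`) against an Impala daemon after the Spark write finishes. A minimal sketch, assuming Impala's bundled `impala-shell` CLI is on the PATH and using a hypothetical daemon address `impala-host:21050`:

```python
import subprocess


def refresh_statement(table, full_invalidate=False):
    """Build the Impala statement to run after a Spark write.

    REFRESH picks up new data files in an existing table; INVALIDATE
    METADATA reloads the table's metadata entirely and is only needed
    when the table is new or its schema changed.
    """
    verb = "INVALIDATE METADATA" if full_invalidate else "REFRESH"
    return f"{verb} {table}"


def refresh_impala(table, impalad="impala-host:21050"):
    # -i selects the impalad to connect to, -q runs a single statement.
    subprocess.run(
        ["impala-shell", "-i", impalad, "-q", refresh_statement(table)],
        check=True,
    )
```

To automate the pipeline in the question, call `refresh_impala("driving_data_TEST")` right after the `insertInto(...)` line; alternatives with the same effect are running the statement through a Python Impala client such as impyla, or through an Impala JDBC/ODBC connection.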
