首页 > 解决方案 > 谓词下推在 spark/hive 视图表中不起作用

问题描述

根据我的阅读,火花预测下推将用于源端(减少数据扫描)。

我有另一个职能团队创建的 hive/spark 表。他们还在这张桌子的顶部创建了视图。源表和视图之间的唯一区别是create_date列数据类型

问题:谓词下推仅适用于基表查询而不是视图查询。

基表查询:

spark.sql("""select company_id, create_date 
from `db`.`table` where product_name = 'sm' and
create_date = cast('2018-07-01' as date) and country ='selva'""").explain(True)

物理计划

== 物理计划 == *(1) 项目 [company_id#1993L, create_date#1997] +- *(1) 过滤器 (((isnotnull(create_date#1997) && isnotnull(country#2003)) && (create_date#1997 = 17713)) && (country#2003 = selva)) +- *(1) FileScan parquet db.table[company_id#1993L,create_date#1997,country#2003,product_name#2080] 批处理:true,格式:Parquet,位置: PrunedInMemoryFileIndex[s3://location..., PartitionFilters: [isnotnull(product_name#2080), (product_name#2080 = sm)], PushedFilters: [IsNotNull(create_date), IsNotNull(country), EqualTo(create_date,2018-07 -01) , Equ..., ReadSchema: struct<company_id:bigint,create_date:date,country:string>

查看查询:

spark.sql(
        """
    select 
    create_date
    from db.view where 
    product_name = 'sm' and country ='United States'
    """
    ).filter("cast(create_date as date) = cast('2018-07-01' as date)").explain(True)

== 物理计划 == *(1) 项目 [cast(create_date#2176 as timestamp) AS create_date#2087] +- *(1) 过滤器 (((isnotnull(country#2182) && isnotnull(create_date#2176)) && (country#2182 = United States)) && (cast(cast(create_date#2176 as timestamp) as date) = 17713)) +- *(1) FileScan parquet db.table[create_date#2176,country#2182,product_name# 2259] 批处理:true,格式:Parquet,位置:PrunedInMemoryFileIndex[s3://location...,PartitionFilters:[isnotnull(product_name#2259),(product_name#2259 = sm)],PushedFilters:[IsNotNull(country), IsNotNull(create_date), EqualTo(country,United States)], ReadSchema: struct<create_date:date,country:string>

或者

spark.sql(
        """
    select 
    create_date
    from db.table where 
    product_name = 'sm' and country ='United States'
    """
    ).filter("create_date = cast('2018-07-01' as timestamp)").explain(True)

物理计划:

== 物理计划 == *(1) 项目 [cast(create_date#2354 as timestamp) AS create_date#2265] +- *(1) 过滤器 (((isnotnull(create_date#2354) && isnotnull(country#2360)) && (country#2360 = United States)) && (cast(create_date#2354 as timestamp) = 1530403200000000))
+- *(1) FileScan parquet db.table[create_date#2354,country#2360,product_name#2437] 批处理:true , 格式: Parquet, 位置: PrunedInMemoryFileIndex[s3:location.., PartitionFilters: [isnotnull(product_name#2437), (product_name#2437 = sm)], PushedFilters: [IsNotNull(create_date), IsNotNull(country), EqualTo(country ,United States)], ReadSchema: struct<create_date:date,country:string>

标签: apache-sparkpysparkhiveparquetamazon-emr

解决方案


推荐阅读