Spark DataFrame with a for loop: optimization techniques

Problem description

I am trying to implement the logic below.

    1. Take some records from one table.
    2. Loop over the resulting data.
    3. Inside the loop, read data from other tables into two different DataFrames.
    4. Join these two DataFrames and load the result into a third table.

    val id_chck1 = "select distinct id, id1, id2 from table WHERE type = 'N'"
    val id_chck = hive.executeQuery(id_chck1)
    for (data <- id_chck) {

      val id = data(0)
      val id1 = data(1)
      val id2 = data(2)

      val values_1 = "select distinct bill, bil_num, id_num, bill_date, process_date from table l WHERE id2 = '222'"
      val values_1_data = hive.executeQuery(values_1)
      for (row <- values_1_data.collect) {
        val bill = row.mkString(",").split(",")(0)
        val bil_num = row.mkString(",").split(",")(1)
        val id_num = row.mkString(",").split(",")(2)
        val bill_date = row.mkString(",").split(",")(3)

        val df1 = "select column name from tablename where id=222"
        val df1_data = hive.executeQuery(df1)
        val df2 = "select column name from tablename2 where id=222"
        val df2_data = hive.executeQuery(df2)

        // join df1_data and df2_data (the join condition is not shown in the question)
        val df3 = df1_data.join(df2_data)
        df3.write.format("orc").mode("Append").save("hdfslocation")
      }
      val load1 = "load data inpath 'hdfslocation' into table tablename"
      val load1_data = hive.executeUpdate(load1)
    }

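However, this process takes more than 6 hours. Is there another way to do the same thing so that it finishes in less time, for example by using RDDs or by setting some Spark or Hive properties to improve performance? I have 500,000 records in the test1 table.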

Tags: scala, apache-spark, apache-spark-sql

Solution


Could you add sample input and expected output? It is hard to see what exactly you are trying to achieve.
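
Without those details, only the general shape of a loop-free version can be suggested. The following is a minimal sketch, not the asker's actual logic: it assumes that the driving query and both per-iteration queries can each be read once as a DataFrame, and that they share a key column named `id`. The helper `joinForAllIds`, the object name, the column names, and the sample data are all hypothetical. The sketch swaps the nested loops for one semi join (which restricts the data to the driving ids) and one regular join, so Spark can run the work in parallel instead of issuing one query per row from the driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoopFreeJoin {
  // Hypothetical helper: `ids` plays the role of the driving query,
  // `left` and `right` stand in for the two tables read inside the loop.
  // The shared key column "id" is an assumption, not taken from the question.
  def joinForAllIds(ids: DataFrame, left: DataFrame, right: DataFrame): DataFrame = {
    // Keep only rows of `left` whose id occurs in the driving set
    // (this replaces iterating over the ids one by one).
    val leftForIds = left.join(ids.select("id").distinct(), Seq("id"), "left_semi")
    // Join both sides once for all ids instead of once per iteration.
    leftForIds.join(right, Seq("id"), "inner")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("loop-free-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up inputs, only to show the call shape.
    val ids = Seq(1, 2).toDF("id")
    val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "left_col")
    val right = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "right_col")

    val joined = joinForAllIds(ids, left, right)
    joined.show()

    spark.stop()
  }
}
```

With this shape, the result would be written once (a single `write` in append mode followed by a single load into the target table) rather than once per loop iteration. Whether it applies depends on the real join condition, which the question does not show.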

