scala - Run a function in parallel in Scala
Problem description
I have a Spark SQL statement that generates temporary files in an HDFS directory. I want to print all the directories and files while it is running.
Here is the function:
spark.sql(s"INSERT INTO ${table} VALUES ....")
While the query is running, I want to see the files generated under the HDFS directory. Because the files are temporary, I want to list the directory several times during the run.
FileSystem.get( sc.hadoopConfiguration ).listStatus( new Path("hdfs:///mypath")).foreach( x => println(x.getPath ))
I am new to Scala programming and can't find a way to run these two things in parallel.
Solution
Sure. You would wrap that spark.sql(query) call in a scala.concurrent.Future.
import scala.concurrent.ExecutionContext.Implicits.global
val insert = scala.concurrent.Future {
  spark.sql(s"INSERT INTO ${table} VALUES ....")
} // begins to work immediately
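Two properties of Future matter here: the body starts running on the execution context as soon as the Future is created, and an exception thrown inside it is stored in the Future rather than thrown on the calling thread. The sketch below illustrates both with a stand-in task. `FutureDemo`, `slowTask` and `describe` are hypothetical names made up for this illustration, and the sleep only simulates a long-running INSERT.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object FutureDemo {
  // Hypothetical stand-in for the long-running INSERT: it sleeps, then fails.
  def slowTask(): Int = {
    Thread.sleep(500)
    throw new RuntimeException("simulated failure")
  }

  // Waits for the Future to finish and reports how it ended.
  def describe(work: Future[Int]): String = {
    // Await.ready blocks until completion but does not rethrow the failure.
    Await.ready(work, 10.seconds)
    work.value match {
      case Some(Success(v)) => s"finished with $v"
      case Some(Failure(e)) => s"failed: ${e.getMessage}"
      case None             => "still running" // not expected after Await.ready
    }
  }

  def main(args: Array[String]): Unit = {
    val work = Future { slowTask() } // scheduled on another thread right away
    println(s"completed right after creation? ${work.isCompleted}")
    println(describe(work))
  }
}
```

Since isCompleted is also true for a failed Future, it is worth inspecting the result (or calling Await.result, which rethrows the exception) once any polling has finished, so a failed INSERT does not go unnoticed.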
Then, while it executes, you can list the files it creates.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val fs = FileSystem.get(sc.hadoopConfiguration)
val path = new Path("hdfs:///mypath")
while (!insert.isCompleted) {
  Thread.sleep(1000) // Sleep to limit how often your message prints
  fs.listStatus(path).foreach(x => println(x.getPath))
}
Keep in mind that each pass prints the whole directory listing, not just the files that appeared since the previous pass.
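If only newly appeared entries are of interest, one option is to remember the paths already printed and report just the difference on each pass. This is a minimal sketch of that idea, not part of the original answer. `NewEntriesPoller`, `newEntries` and `pollUntilDone` are hypothetical names, and the demo in `main` uses a made-up task and a made-up listing so that it runs without Spark or HDFS.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object NewEntriesPoller {
  /** Returns the entries not seen before, plus the updated "seen" set. */
  def newEntries(seen: Set[String], current: Seq[String]): (Seq[String], Set[String]) = {
    val fresh = current.filterNot(seen.contains).distinct
    (fresh, seen ++ fresh)
  }

  /** Polls `list` until `task` completes, printing each entry only the first time it appears. */
  def pollUntilDone(task: Future[Any], intervalMs: Long)(list: () => Seq[String]): Set[String] = {
    var seen = Set.empty[String]
    def report(): Unit = {
      val (fresh, updated) = newEntries(seen, list())
      fresh.foreach(println)
      seen = updated
    }
    while (!task.isCompleted) {
      report()
      Thread.sleep(intervalMs)
    }
    report() // one last pass for entries created just before completion
    seen
  }

  def main(args: Array[String]): Unit = {
    // Stand-ins for the INSERT and for the directory listing (assumptions for the demo).
    val task = Future { Thread.sleep(300) }
    var tick = 0
    val seen = pollUntilDone(task, 100) { () =>
      tick += 1
      val paths = (1 to tick).map(i => s"hdfs:///mypath/tmp-$i")
      paths
    }
    println(s"saw ${seen.size} distinct paths")
  }
}
```

With the names from the answer, the listing function could be something like `() => fs.listStatus(path).map(_.getPath.toString).toSeq`. Temporary files that are created and deleted between two passes will still be missed, so a shorter interval may be needed to catch them.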