Several ways to read files in Spark

han-guang-xue 2018-11-28 19:38

The difference between spark.read.textFile and sc.textFile

val ds1 = spark.read.textFile("hdfs://han02:9000/words.txt")   // returns a Dataset[String]

val rdd2 = sc.textFile("hdfs://han02:9000/words.txt")          // returns an RDD[String]

Word count with each API:

ds1.flatMap(_.split(" ")).groupByKey(x => x).count()                           // Dataset API
rdd2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false)  // RDD API

After collect(), the Dataset version returns Array[(String, Long)] (count() produces Long), while the RDD version returns Array[(String, Int)].
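A minimal end-to-end sketch of both word counts, reusing the words.txt path from above (in a standalone application, unlike in spark-shell, the implicits import is needed for the Dataset encoders):

import spark.implicits._   // pre-imported in spark-shell; required in applications

val ds1  = spark.read.textFile("hdfs://han02:9000/words.txt")   // Dataset[String]
val rdd2 = sc.textFile("hdfs://han02:9000/words.txt")           // RDD[String]

// Dataset API: groupByKey + count -> Dataset[(String, Long)]
val dsCounts: Array[(String, Long)] =
  ds1.flatMap(_.split(" ")).groupByKey(x => x).count().collect()

// RDD API: reduceByKey -> RDD[(String, Int)], sorted by count descending
val rddCounts: Array[(String, Int)] =
  rdd2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).collect()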

sc.textFile(path, minPartitions): the second argument sets the minimum number of partitions. By default Spark creates one partition per input split (an HDFS block, typically 128 MB), so a file larger than one block already yields several partitions.
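minPartitions is only a lower bound, not an exact count; a quick sketch to check what Spark actually created, reusing the words.txt path from above:

val rdd = sc.textFile("hdfs://han02:9000/words.txt", 4)
println(rdd.getNumPartitions)   // at least 4; more if the file spans more splits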

1. Read a single file from the current directory:

val path = "Current.txt"  //Current fold file
val rdd1 = sc.textFile(path,2)

2. Read multiple files from the current directory:

val path = "Current1.txt,Current2.txt,"  //Current fold file
val rdd1 = sc.textFile(path,2)


3. Read a single file from the local filesystem:

val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/README.md"  //local file
val rdd1 = sc.textFile(path,2)

4. Read the contents of a local directory:

val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/"  //local file
val rdd1 = sc.textFile(path,2)

5. Read multiple local files (a comma-separated list of paths):

val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-scala.txt,file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-spire.txt"  //local file
val rdd1 = sc.textFile(path,2)

6. Read the contents of multiple local directories using wildcards:

val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*"  //local file
val rdd1 = sc.textFile(path,2)

val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt" //local file,指定后缀名文件
val rdd1 = sc.textFile(path,2)
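The same glob syntax also works for HDFS paths, since the pattern is expanded by Hadoop's filesystem globbing; a sketch against the namenode used earlier (the data/ layout on HDFS is hypothetical):

val rdd1 = sc.textFile("hdfs://han02:9000/data/*/*.txt", 2)   // hypothetical HDFS layout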

7. Use a wildcard to read files with similar names:

// Keep each per-pattern RDD instead of discarding it inside the loop
val rdds = for (i <- 1 to 2)
  yield sc.textFile(s"/root/application/temp/people$i*", 2)

Note: files in google cannot be read this way.
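When reading many small files, it is often useful to know which file each record came from; sc.wholeTextFiles returns (path, content) pairs instead of bare lines. A sketch reusing the wildcard path from step 7:

// Each element is (fullFilePath, fileContent); best suited to many small files
val pairs = sc.wholeTextFiles("/root/application/temp/people*", 2)
pairs.keys.collect().foreach(println)   // list the matched file paths

The per-pattern RDDs from step 7 could likewise be merged into one with sc.union(rdds).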

