首页 > 解决方案 > 使用 spark databricks 平台从 URL 读取数据

问题描述

尝试在 databricks 社区版平台上使用 spark 从 url 读取数据我尝试使用 spark.read.csv 并使用 SparkFiles 但仍然缺少一些简单的点

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# sc.addFile(url)
# sqlContext = SQLContext(sc)
# df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True) 

df = spark.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)

得到路径相关的错误:

Path does not exist: dbfs:/local_disk0/spark-9f23ed57-133e-41d5-91b2-12555d641961/userFiles-d252b3ba-499c-42c9-be48-96358357fb75/adult.csv;'

我也尝试了其他方法

val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString

 # val list = content.split("\n").filter(_ != "")
   val rdd = sc.parallelize(content)
   val df = rdd.toDF

SyntaxError: invalid syntax
  File "<command-332010883169993>", line 16
    val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv").mkString
              ^
SyntaxError: invalid syntax

数据应该直接加载到databricks文件夹,或者我应该能够使用spark.read直接从url加载,任何建议

标签: scalaapache-sparkpysparkapache-spark-sqldatabricks

解决方案


试试这个。

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)

**df = spark.read.csv("file://"+SparkFiles.get("adult.csv"), header=True, inferSchema= True)**

只需获取您的 csv 网址的几列。

df.select("age","workclass","fnlwgt","education").show(10);
>>> df.select("age","workclass","fnlwgt","education").show(10);
+---+----------------+------+---------+
|age|       workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39|       State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38|         Private|215646|  HS-grad|
| 53|         Private|234721|     11th|
| 28|         Private|338409|Bachelors|
| 37|         Private|284582|  Masters|
| 49|         Private|160187|      9th|
| 52|Self-emp-not-inc|209642|  HS-grad|
| 31|         Private| 45781|  Masters|
| 42|         Private|159449|Bachelors|
+---+----------------+------+---------+

SparkFiles 获取驱动程序或工作人员本地文件的绝对路径。这就是它无法找到它的原因。


推荐阅读