How to read a csv file in pyspark?

Problem description

I am trying to read a csv file with pyspark, but it shows some errors. Can you tell me the correct procedure for reading a csv file?

Python code:

from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)

I also tried the following:

sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")

Error:

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined

and

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'

Tags: pyspark, pyspark-sql, pyspark-dataframes

Solution


First, you need to create a SparkSession, like this:

from pyspark.sql import SparkSession


spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()

Also, your csv needs to be on hdfs; then you can read it with spark.read.csv:

df = spark.read.csv('/tmp/data.csv', header=True)

where /tmp/data.csv is on hdfs.
