PySpark reads one Parquet file but not another in the same folder?

Problem description

I'm completely stumped... PySpark will not read every file in the same folder.

ls

returns:

 Verzeichnis von C:\Users\####\Data_Projects\NPL

21.04.2020  15:41    <DIR>          .
21.04.2020  15:41    <DIR>          ..
21.04.2020  13:18    <DIR>          .ipynb_checkpoints
21.04.2020  14:50    <DIR>          IMBD_Reviews
21.04.2020  15:40    <DIR>          imdb_reviews_preprocessed
21.04.2020  14:48        13.717.398 imdb_reviews_preprocessed.parquet.zip
21.04.2020  15:38            21.738 NPL with pyspark.ipynb
23.10.2016  19:47    <DIR>          sentiments.parquet
21.04.2020  14:51            38.387 sentiments.parquet.zip
21.04.2020  14:52    <DIR>          tweets.parquet
21.04.2020  14:51           136.483 tweets.parquet.zip
               4 Datei(en),     13.914.006 Bytes
               7 Verzeichnis(se),  1.552.965.632 Bytes frei

tweets_df = sqlContext.read.parquet('tweets.parquet')

works fine, but

rewievs = sqlContext.read.parquet("imdb_reviews_preprocessed.parquet")

returns an error:

 An error occurred while calling o541.parquet.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:/Users/####/Data_Projects/NPL/imdb_reviews_preprocessed/imdb_reviews_preprocessed.parquet;
...

Any ideas?

Tags: python, arrays, apache-spark-sql, parquet

Solution


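  import spark.implicits._ // needed for toDF on an RDD of tuples

  // List the Parquet paths matched by the glob. wholeTextFiles pairs each path
  // with the file's entire content as a string, so keep this to small inputs.
  val f1 = spark.sparkContext.wholeTextFiles("/tmp/*.parquet")
    .toDF("fileName", "dataInFile")
    .select("fileName")

  // Read every path matching the glob into a single DataFrame
  val f10 = spark.read.parquet("/tmp/*.parquet")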
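
Two things in the question may be worth checking before the read. The directory listing shows a folder named imdb_reviews_preprocessed with no .parquet suffix, and the path in the error message contains an extra imdb_reviews_preprocessed/ segment, which suggests the relative path was resolved against a different working directory than the one listed. The following is only a rough diagnostic sketch in plain Python, not part of Spark: diagnose_path is a hypothetical helper introduced here for illustration. It resolves a relative name the way a relative path is normally resolved (against the current working directory) and suggests similarly named entries that do exist.

```python
import difflib
from pathlib import Path


def diagnose_path(name, base=None):
    """Resolve `name` against `base` (default: current working directory).

    Returns (resolved_path, exists, similar_names). `similar_names` lists
    entries in the same folder whose names are close to the requested one.
    Hypothetical helper for illustration only; it does not call Spark.
    """
    base = Path(base) if base is not None else Path.cwd()
    target = (base / name).resolve()
    parent = target.parent
    # Collect the names that really exist next to the requested path.
    siblings = sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []
    similar = difflib.get_close_matches(target.name, siblings, n=3, cutoff=0.6)
    return target, target.exists(), similar


if __name__ == "__main__":
    path, exists, similar = diagnose_path("imdb_reviews_preprocessed.parquet")
    print("resolved to:", path)
    print("exists:", exists)
    print("similar names here:", similar)
```

If the resolved path is not the one you expected, passing an absolute path to read.parquet, or the folder name exactly as it appears on disk, is a reasonable next thing to try.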
