Reading part files from HDFS into a DataFrame using PySpark

Problem description

I have multiple files stored in an HDFS location, as shown below:

/user/project/202005/part-01798

/user/project/202005/part-01799

There are 2000 such part files. Each file has the format:

{'Name':'abc','Age':28,'Marks':[20,25,30]} 
{'Name':...} 

and so on. I have 2 questions:

1) How to check whether these are multiple files or multiple partitions of the same file?
2) How to read these into a data frame using PySpark?

Tags: pyspark, apache-spark-sql, hdfs, partitioning

Solution


  1. As these files sit in one directory and are named part-xxxxx, you can safely assume they are multiple part files of the same dataset. If they were partitions, they would be stored under key=value subdirectories such as /user/project/date=202005/*.
  2. You can pass the directory "/user/project/202005" as the input path to Spark. Note that the sample records are JSON lines rather than CSV, so the JSON reader is the right fit (see the sketches after this list):
df = spark.read.json('/user/project/202005/*')
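
For the first question, one way to inspect the directory layout without leaving PySpark is to go through the JVM Hadoop FileSystem API. A minimal sketch, assuming the directory path from the question; note that spark._jvm and spark._jsc are internal PySpark handles rather than stable public API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-hdfs").getOrCreate()

# spark._jsc / spark._jvm are internal handles into the JVM gateway
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# A flat run of part-xxxxx files (no key=value subdirectories) points to
# part files of a single dataset rather than Hive-style partitions.
for status in fs.listStatus(Path("/user/project/202005")):
    print(status.getPath().getName(), status.getLen())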
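
For the second question, a short usage sketch: Spark's JSON reader accepts single-quoted fields by default (allowSingleQuotes defaults to true), so the sample records parse as-is, and input_file_name() lets you confirm which part file each row came from:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("read-parts").getOrCreate()

df = spark.read.json('/user/project/202005/*')
df.printSchema()  # expect Name: string, Age: long, Marks: array<long>

# Tag each record with its source file, handy for checking that all
# 2000 part files are actually being read.
df_with_src = df.withColumn("source_file", input_file_name())
df_with_src.select("Name", "source_file").show(5, truncate=False)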
