Spark read CSV behaviour when one file contains a header and the rest do not

Question

Airlines is one of the public datasets provided by Databricks.

In it, part-00000 has a header row and the other files do not.

This:

// One file with a header (part-00000) plus seven files without one.
val paths = Seq(
  "/databricks-datasets/airlines/part-00000",
  "/databricks-datasets/airlines/part-00001",
  "/databricks-datasets/airlines/part-00011",
  "/databricks-datasets/airlines/part-00071",
  "/databricks-datasets/airlines/part-00084",
  "/databricks-datasets/airlines/part-00101",
  "/databricks-datasets/airlines/part-00105",
  "/databricks-datasets/airlines/part-00178"
)

val df = spark.read
  .format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(paths: _*)

df.show(10, false)

returns the schema:

1996:string
8:string
24:string
6:string
1739:string
...

Whereas:

// Same list as before, without part-00178.
val paths = Seq(
  "/databricks-datasets/airlines/part-00000",
  "/databricks-datasets/airlines/part-00001",
  "/databricks-datasets/airlines/part-00011",
  "/databricks-datasets/airlines/part-00071",
  "/databricks-datasets/airlines/part-00084",
  "/databricks-datasets/airlines/part-00101",
  "/databricks-datasets/airlines/part-00105"
)
...

returns:

Year:integer
Month:integer
DayofMonth:integer
DayOfWeek:integer
...

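What explains this difference?

I also ran it with a samplingRatio of 1.0.

The choice seems random, although I noticed that files with more columns appear to be preferred. If the files have the same number of columns, which first line gets picked as the header seems random.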

Tags: dataframe, csv, apache-spark, apache-spark-sql

Solution
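A likely explanation is that, with header set to true and several input paths, Spark takes the column names from the first line of whichever file it happens to read first, and that file need not be part-00000. One possible workaround, offered as a minimal sketch rather than a definitive fix, is to stop relying on that choice: read the file that has the header on its own, then reuse its schema for the files that have none. The split into headerPath and otherPaths is an assumption based on the question's statement that only part-00000 has a header, and the shortened path list is only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In a Databricks notebook `spark` already exists; getOrCreate() returns that session.
val spark = SparkSession.builder().getOrCreate()

// Assumption (from the question): only part-00000 begins with a header row.
val headerPath = "/databricks-datasets/airlines/part-00000"
val otherPaths = Seq(
  "/databricks-datasets/airlines/part-00001",
  "/databricks-datasets/airlines/part-00011"
)

// Read the header file alone so column names and inferred types come from it.
val withHeader = spark.read
  .format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(headerPath)

// Read the headerless files with that explicit schema, so no data row is
// consumed as a header and no inference is run on them.
val withoutHeader = spark.read
  .format("csv")
  .option("sep", ",")
  .option("header", "false")
  .schema(withHeader.schema)
  .load(otherPaths: _*)

// Combine both parts; columns are matched by name.
val df = withHeader.unionByName(withoutHeader)
df.printSchema()
```

One caveat: types inferred from a single file may be too narrow for the others, so values that do not fit the inferred type may not parse cleanly. Supplying a hand-written schema, or leaving inferSchema off and casting afterwards, avoids that.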

