首页 > 解决方案 > 使用 aws 胶水创建存在于 aws s3 中的 .dat.gz 文件的 spark 数据框

问题描述

我编写了一个 pyspark 代码,它在 aws 胶水中运行并试图读取一个 dat.gz 文件。数据框已成功创建,但Trim(BOTH FROM)已添加到列名中。下面是我的代码片段。

df = spark.read.format("csv").option("header", 'false').option("delimiter", '|').load("s3://xxxxxx/xxxx/xxxxx/xxx/xxxxxxxxxx.dat.gz")

输出


+----------------------+------------------------+-------------------------+-------------------------+--------------------------+----------------------+----------------------------+----------------------------+----------------------------------+--------------------------------+--------------------------+---------------------------+-----------------------+-----------------------+--------------------------+-------------------------+---------------------------+------------------------+-------------------------+-----------------------+-----------------------+--------------------------+---------------------------+
|Trim(BOTH FROM EFF_DT)|Trim(BOTH FROM SITE_NUM)|Trim(BOTH FROM ARTCL_NUM)|Trim(BOTH FROM SL_UOM_CD)|Trim(BOTH FROM COND_TY_CD)|Trim(BOTH FROM EXP_DT)|Trim(BOTH FROM COND_REC_NUM)|Trim(BOTH FROM MAIN_SCAN_CD)|Trim(BOTH FROM PRC_COND_PRRTY_NUM)|Trim(BOTH FROM PRC_COND_WIN_IND)|Trim(BOTH FROM PRC_RSN_CD)|Trim(BOTH FROM PRC_METH_CD)|Trim(BOTH FROM PRC_AMT)|Trim(BOTH FROM PRC_QTY)|Trim(BOTH FROM UT_PRC_AMT)|Trim(BOTH FROM PROMO_NUM)|Trim(BOTH FROM BNS_BUY_NUM)|Trim(BOTH FROM CURRN_CD)|Trim(BOTH FROM BBY_TY_CD)|Trim(BOTH FROM BBY_AMT)|Trim(BOTH FROM BBY_PCT)|Trim(BOTH FROM BBY_LEV_CD)|Trim(BOTH FROM BBY_PRC_QTY)|
+----------------------+------------------------+-------------------------+-------------------------+--------------------------+----------------------+----------------------------+----------------------------+----------------------------------+--------------------------------+--------------------------+---------------------------+-----------------------+-----------------------+--------------------------+-------------------------+---------------------------+------------------------+--

但是在读取任何其他文件时,我得到了正确的输出。谁可以帮我这个事?这不是文件问题,因为我在本地机器上尝试了相同的代码并且运行良好。

标签: pythonamazon-web-servicesapache-sparkamazon-s3aws-glue

解决方案


推荐阅读