How to handle a table/CSV file column that contains newline characters and manipulate it in a PySpark DataFrame on HDFS

Problem description

How do I handle a table/CSV file column that contains newline characters in a PySpark DataFrame, with the file stored on HDFS?

I need to manipulate column data that contains newline characters, which the steps below fail to achieve.

The challenge is that field data containing newlines is being split into new records, so it cannot be parsed and manipulated in the DataFrame.
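The splitting described above is worth pinning down: RFC 4180 CSV permits literal newlines inside double-quoted fields, so a parser that naively splits on line breaks turns one logical record into several rows of mostly-null columns, exactly as in the output below. A minimal stdlib sketch (no Spark involved, sample data invented for illustration) showing how a quote-aware CSV parser keeps such a field in one record:

```python
import csv
import io

# A CSV payload whose second field contains a literal newline,
# kept legal by double-quoting the field (RFC 4180).
raw = 'item,item_group_desc\nI1229422,"<Instructions>\n</Instructions>"\n'

# csv.reader is quote-aware: the newline inside the quoted field
# does NOT start a new record.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)
# -> [['item', 'item_group_desc'], ['I1229422', '<Instructions>\n</Instructions>']]
```

A newline-splitting reader would have produced three records from this payload; a quote-aware one produces two, with the embedded newline preserved inside the field value.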

>>> df=spark.read.csv("hdfs://cluster-04d4-m/user/veerayyakumar_g/Cleansdata_Input_Test.csv", header = True, inferSchema = True).show()
+--------------------+----------+--------------------+---------------+-------------+---------+
|                item|item_group|     item_group_desc|item_group_qlty|product_group|   run_dt|
+--------------------+----------+--------------------+---------------+-------------+---------+
|            I1229422|        G1|"<?xml version=""...|           null|         null|     null|
|      <Instructions>|      null|                null|           null|         null|     null|
|    <Instruction ...|      null|                null|           null|         null|     null|
|        Instructi...|      null|                null|           null|         null|     null|
|     </Instructions>|      null|                null|           null|         null|     null|
| ,,P130872,4/22/2019|      null|                null|           null|         null|     null|

I tried to manipulate the column using a lambda function, but it does not work:

>>> newdf["item_group_desc"]=df["item_group_desc"].apply(lambda x: x.replace("\n",""))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object has no attribute '__getitem__'
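Both tracebacks in this question share one root cause: show() returns None, so the assignment df = spark.read.csv(...).show() leaves df bound to None, and indexing None raises exactly this TypeError (the '__getitem__' wording is Python 2's version of the message). Separately, .apply(lambda ...) on a column is the pandas API, not PySpark's. A tiny stdlib illustration of the None-binding mistake, with hypothetical stand-in names:

```python
# Chaining a display method onto the assignment binds the variable to the
# method's return value -- and show()/print()-style methods return None.
df = print("pretend this is spark.read.csv(...).show()")

print(df is None)  # -> True

try:
    df["item_group_desc"]  # indexing None fails
except TypeError as err:
    # Python 2 phrased this as: 'NoneType' object has no attribute '__getitem__'
    print(type(err).__name__)  # -> TypeError
```

Dropping .show() from the assignment (df = spark.read.csv(...), then df.show() on its own line) keeps df as a real DataFrame, after which withColumn and regexp_replace can be applied.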


I also tried withColumn, but still could not achieve my goal of bringing the multi-line data onto a single line in the column "item_group_desc":
>>> newdf=df["item_group_desc"].withColumn('item_group_desc',regexp_replace('item_group_desc','[\\n]',''))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object has no attribute '__getitem__'

After removing the newline characters, I need the column data of item_group_desc on a single line.

Tags: python, database, dataframe, pyspark, hdfs

Solution
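The usual fix has two parts: read the file with multiLine=True and an escape character matching the doubled quotes, so each quoted multi-line field lands in one row; then flatten any remaining newlines inside item_group_desc with regexp_replace. This is a hedged sketch, assuming Spark 2.2+ (where the multiLine option exists) and RFC 4180-style quoting in the file; it has not been run against the asker's cluster. The Spark calls are shown in comments, and the runnable portion demonstrates the same substitution with the stdlib:

```python
import re

# PySpark equivalent (assumes Spark >= 2.2, where multiLine is supported):
#   df = spark.read.csv(path, header=True, inferSchema=True,
#                       multiLine=True, escape='"')
#   from pyspark.sql.functions import regexp_replace
#   clean = df.withColumn("item_group_desc",
#                         regexp_replace("item_group_desc", r"[\r\n]+", " "))

# The same substitution regexp_replace would perform, applied to one value:
desc = '<?xml version="1.0"?>\n<Instructions>\n</Instructions>'
flat = re.sub(r"[\r\n]+", " ", desc)
print(flat)  # -> '<?xml version="1.0"?> <Instructions> </Instructions>'
```

Note that escape='"' matters because Spark's CSV reader defaults to backslash escaping, while the file shown here doubles its quotes ("" inside a quoted field), which is the RFC 4180 convention.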

