How to drop empty columns from a PySpark dataframe

Problem description

Names data

We have a dataframe:

names = spark.read.csv("name.csv", header="true", inferSchema="true").rdd

I want to do this:

res = names.filter(lambda f: f['Name'] == "Diwakar").map(lambda name: (name['Name'], name['Age']))
res.toDF(['Name', 'Age']).write.csv("final", mode="overwrite", header="true")

But the empty columns are causing problems.

Tags: pyspark, pyspark-dataframes

Solution


Just use a simple select; I am assuming here that the empty columns are the ones with blank names (" ").

For the input:

df = spark.createDataFrame(
    [(1, "", "x", " "), (2, "", "b", " "), (5, "", "c", " "), (8, "", "d", " ")],
    ("st", " ", "ani", " "))  # note: two of the columns share the blank name " "

+---+---+---+---+
| st|   |ani|   |
+---+---+---+---+
|  1|   |  x|   |
|  2|   |  b|   |
|  5|   |  c|   |
|  8|   |  d|   |
+---+---+---+---+

a = list(set(df.columns))  # set() also collapses the duplicate blank names
a.remove(" ")              # drop the blank column name
df = df.select(a)
df.show()

+---+---+
|ani| st|
+---+---+
|  x|  1|
|  b|  2|
|  c|  5|
|  d|  8|
+---+---+
""" 
Do your Operations
"""

Once the steps above are done, carry on with your task; this removes the blank-named columns.
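If "empty columns" instead means columns whose values are all null or blank, there is no built-in for that either. Below is a minimal sketch, assuming df is any Spark DataFrame with unique column names; the helper name drop_all_blank_columns is mine, not part of the original answer:

from pyspark.sql import functions as F

def drop_all_blank_columns(df):
    # Count the non-null, non-blank values of each column in a single pass.
    counts = df.select([
        F.count(F.when(F.trim(df[c].cast("string")) != "", df[c])).alias(str(i))
        for i, c in enumerate(df.columns)
    ]).first()
    # Keep only the columns that hold at least one real value.
    keep = [c for i, c in enumerate(df.columns) if counts[i] > 0]
    return df.select(keep)

This does one extra pass over the data, so for large inputs it is worth caching df before calling it.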

Edit:

There is no option to drop empty columns while reading; you have to do it yourself.

You can do it like this:

a = list(set(df.columns))
# Keep only the columns that do not start with the unwanted prefix
# (replace "col" with whatever your empty columns actually start with).
new_col = [x for x in a if not x.startswith("col")]

df = df.select(new_col)
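Tying this back to the original flow, here is a sketch that assumes the blank header cells come out of spark.read.csv with Spark's auto-assigned "_c" prefix; check names.columns to see what your file actually produces and adjust the prefix accordingly:

names = spark.read.csv("name.csv", header=True, inferSchema=True)
# Drop the auto-named blank columns before converting to an RDD (assumed "_c" prefix).
names = names.select([c for c in names.columns if not c.startswith("_c")])

res = (names.rdd
       .filter(lambda f: f['Name'] == "Diwakar")
       .map(lambda name: (name['Name'], name['Age'])))
res.toDF(['Name', 'Age']).write.csv("final", mode="overwrite", header="true")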
