python - How to iterate over multiple DataFrame columns in PySpark?
Problem description
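Say I have a DataFrame df with a single column, where df.show() gives
|a,b,c,d,....|
|a,b,c,d,....|
and I want a df1 where df1.show() gives
|a|b|c.....|
In short, I want to break up a DataFrame with one column into a DataFrame with multiple columns. So far I have
import pyspark.sql.functions
split_col = pyspark.sql.functions.split(df['x'], ' ')
df = df.withColumn('0', split_col.getItem(0))
df = df.withColumn('1', split_col.getItem(1))
and so on, but what if I have many columns? Is there a way to do this in PySpark other than writing one such line per column? Thanks.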
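Solution
You can iterate over the indices and set the column names inside a select clause, as shown below. Note that in this version split appears once for every generated column, so it is less efficient.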
from pyspark.sql import functions as F
df.select(*[(F.split("x",' ')[i]).alias(str(i)) for i in range(100)]).explain()
#== Physical Plan ==
#*(1) Project [split(x#200, )[0] AS 0#1708, split(x#200, )[1] AS 1#1709,
#  split(x#200, )[2] AS 2#1710, split(x#200, )[3] AS 3#1711,
#  split(x#200, )[4] AS 4#1712, split(x#200, )[5] AS 5#1713,
#  split(x#200, )[6] AS 6#1714, split(x#200, )[7] AS 7#1715,
#  split(x#200, )[8] AS 8#1716, split(x#200, )[9] AS 9#1717,
#  split(x#200, )[10] AS 10#1718, split(x#200, )[11] AS 11#1719,
#  split(x#200, )[12] AS 12#1720, split(x#200, )[13] AS 13#1721,
#  split(x#200, )[14] AS 14#1722, split(x#200, )[15] AS 15#1723,
#  split(x#200, )[16] AS 16#1724, split(x#200, )[17] AS 17#1725,
#  split(x#200, )[18] AS 18#1726, split(x#200, )[19] AS 19#1727,
#  split(x#200, )[20] AS 20#1728, split(x#200, )[21] AS 21#1729,
#  split(x#200, )[22] AS 22#1730, split(x#200, )[23] AS 23#1731,
#  ... 76 more fields]
#+- *(1) Scan ExistingRDD[x#200]
Instead, you can split the column only once, which lets Spark project a single split operation as opposed to many.
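from pyspark.sql import functions as F
# Split once into an array column, then project one output column per index.
# No drop("x") is needed afterwards: the select keeps only the aliased columns.
(df
 .withColumn("x", F.split("x", " "))
 .select(*[F.col("x")[i].alias(str(i)) for i in range(100)])
 .explain())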
#== Physical Plan ==
#*(1) Project [x#1908[0] AS 0#1910, x#1908[1] AS 1#1911,
#  x#1908[2] AS 2#1912, x#1908[3] AS 3#1913, x#1908[4] AS 4#1914,
#  x#1908[5] AS 5#1915, x#1908[6] AS 6#1916, x#1908[7] AS 7#1917,
#  x#1908[8] AS 8#1918, x#1908[9] AS 9#1919, x#1908[10] AS 10#1920,
#  x#1908[11] AS 11#1921, x#1908[12] AS 12#1922, x#1908[13] AS 13#1923,
#  x#1908[14] AS 14#1924, x#1908[15] AS 15#1925, x#1908[16] AS 16#1926,
#  x#1908[17] AS 17#1927, x#1908[18] AS 18#1928, x#1908[19] AS 19#1929,
#  x#1908[20] AS 20#1930, x#1908[21] AS 21#1931, x#1908[22] AS 22#1932,
#  x#1908[23] AS 23#1933, ... 76 more fields]
#+- *(1) Project [split(x#200, ) AS x#1908]
#   +- *(1) Scan ExistingRDD[x#200]