Create a DataFrame from lists

Problem description

I am trying to create a Spark DataFrame in which each of my lists becomes a column.

Code:

import random
import string

def create_id(n):
    # Build a random ID of n lowercase letters and digits.
    return ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(n))

list_a = [create_id(25) for x in range(100)]
list_b = [create_id(25) for x in range(100)]

df = sc.parallelize([["a", list_a], ["b", list_b]]).toDF()

This results in:

    _1                                                _2
0   a   [dv2vtdl3sobadlw1svs39emp2n9ogwzzek8b6gvug7xkp...
1   b   [kdv6b9ehqx1t8kbxd77ha8435bhduyxp0ilv6e09wpejx..

The following creates 100 columns rather than 100 rows:

df = sc.parallelize([list_a, list_b]).toDF()

Does anyone know how I can create a DataFrame with two columns and 100 rows?

Tags: python, apache-spark, pyspark

Solution


Using the approach from the post "Manually create a pyspark dataframe":

import random
import string

def create_id(n):
    # Build a random ID of n lowercase letters and digits.
    return ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(n))

list_a = [create_id(25) for _ in range(100)]
list_b = [create_id(25) for _ in range(100)]

df = spark.createDataFrame(zip(list_a,list_b), ['a', 'b'])

df.show()
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|68blfnltq9fh81c4y...|3fl1wb5h2euy3sgd7...|
|ac37fb7qif71zzjpr...|xbqzzgiq9s6t5jiqm...|
|72rk28znzr6jjsi69...|5wvl528eg5y3p1lsk...|
|fioqnla3ijvl5769s...|1xvs2592uaxadv1o4...|
|7der8ld8fd6vl6g9d...|lrup85xitjz1uhsfl...|
|gycdap4hodaxxggw8...|h2oz370tzo6fnpke3...|
|ccvqcyzeynuks63pq...|iut82y2k1irfdvep1...|
|ngq29fnq2usghspgh...|z6j4mibrrjznoc9s8...|
|3qb6xyk5c1kbg0xq1...|l10ouv4w24d66e0ak...|
|u6dcvzede90xa7zz2...|hnh571t9szy0pwjrp...|
|3122g38k47jm24t7f...|tzbxlua574l88qtw1...|
|6pnva6ow83yxexqp1...|0nfj3v59b8jh0qv1g...|
|kl7xyftax3z32ot8o...|0sf6iyiyxpyvyd5kj...|
|36qwiiifgbzba4n8c...|xt4lpkjle8qynnlpo...|
|owsgb02rnov8qrhvw...|1zu4oisit25y2g14i...|
|bcmg0flh4d9tnvnjc...|7lfwx9kf7qens70p8...|
|6sdy1e8i3y1w0rtpr...|gw79bsrx8jlse6ixu...|
|83h5iq10clte1gcpr...|kblufuhlwabu7sv3u...|
|7g20ga0m756f0qsj7...|1fzo40vwtrp0kud8j...|
|07tw66i7dpcphczz1...|9a8c9ditp9dzomxh4...|
+--------------------+--------------------+
only showing top 20 rows
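
The reason this works is that createDataFrame treats each element of its input as one row, so the two lists have to be paired up element by element, which is what zip does. Below is a minimal sketch of that step in plain Python. The helper names columns_to_rows and build_dataframe are hypothetical and not part of PySpark, and the length check is an added assumption, there because zip silently truncates to the shortest list.

```python
def columns_to_rows(*columns):
    """Transpose equal-length column lists into a list of row tuples."""
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        # zip() would silently truncate to the shortest list, so fail loudly.
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return list(zip(*columns))


def build_dataframe(spark, names, *columns):
    """Hypothetical wrapper: one DataFrame column per input list.

    Assumes `spark` is an existing SparkSession; it is not created here.
    """
    return spark.createDataFrame(columns_to_rows(*columns), list(names))


# Tiny stand-ins for list_a and list_b so the shape is easy to see.
rows = columns_to_rows(["a1", "a2", "a3"], ["b1", "b2", "b3"])
print(rows)  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
```

With a running session, build_dataframe(spark, ['a', 'b'], list_a, list_b) should give the same two-column, 100-row result as the call shown earlier.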

