python - Slicing all values of a column in a PySpark DataFrame
Problem
I have a DataFrame and I want to slice all the values in one of its columns, but I don't know how to do it.
My DataFrame:
+----------+------+
| studentID|gender|
+----------+------+
|1901000200|     M|
|1901000500|     M|
|1901000500|     M|
|1901000500|     M|
|1901000500|     M|
+----------+------+
I have already cast studentID to a string, but I can't strip the leading 190 from it. I want the output below:
+---------+------+
|studentID|gender|
+---------+------+
|  1000200|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
+---------+------+
I tried the following, but it gives me an error:
students_data = students_data.withColumn('studentID',F.lit(students_data["studentID"][2:]))
TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively.
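The error happens because slicing a PySpark Column is not a Python list slice: col[a:b] is translated to Column.substr(startPos, length), so an open-ended slice like [2:] passes None as the length and the two arguments no longer have the same type. As a minimal sketch of the direct equivalent (assuming every studentID is exactly 10 digits, so the part after the 3-digit prefix is 7 characters long):
from pyspark.sql import functions as F

# Column slicing maps to substr(startPos, length): startPos is 1-based and
# the length is mandatory, which is why [2:] fails with the TypeError above.
# Hypothetical fixed-width variant, assuming 10-digit IDs:
students_data = students_data.withColumn(
    "studentID",
    F.col("studentID").cast("string").substr(4, 7))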
Solution
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Replicating the sample data from the question.
students_data = spark.createDataFrame(
    [[1901000200, 'M'],
     [1901000500, 'M'],
     [1901000500, 'M'],
     [1901000500, 'M'],
     [1901000500, 'M']],
    ["studentID", "gender"])

# Unlike a plain Python slice, a Column slice means substr(startPos, length),
# so the end position must be given explicitly. If you are unsure how long
# the values can get, a large upper bound such as 10000 is safe: substr
# simply stops at the end of the string.
students_data = students_data.withColumn(
    'studentID',
    F.col("studentID").cast("string")[4:10000])

students_data.show()
Output:
+---------+------+
|studentID|gender|
+---------+------+
|  1000200|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
+---------+------+
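If the goal is specifically to drop a known prefix rather than to slice by position, a regex replacement avoids having to pick an upper bound at all. A sketch under the assumption that every ID starts with the literal prefix 190:
from pyspark.sql import functions as F

# Alternative sketch: strip a known literal prefix instead of slicing by
# position. Assumes every studentID begins with "190"; values that do not
# match the pattern are left unchanged.
students_data = students_data.withColumn(
    "studentID",
    F.regexp_replace(F.col("studentID").cast("string"), r"^190", ""))
If you prefer SQL syntax, Spark SQL's substring also has a two-argument form that omits the length, e.g. F.expr("substring(cast(studentID as string), 4)").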