python-3.x - pyspark 数据框:为什么我可以选择一些嵌套字段而不选择其他字段?
问题描述
我正在尝试使用 Python 3.9.1 中的 pyspark (3.0.1) 编写一些代码以将 JSON 嵌套到 Dataframes 中。
我有一些虚拟数据,其架构如下:
data.printSchema()
root
|-- recordID: string (nullable = true)
|-- customerDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- dob: string (nullable = true)
|-- familyMembers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- relationship: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- contactNumbers: struct (nullable = true)
| | | |-- work: string (nullable = true)
| | | |-- home: string (nullable = true)
| | |-- addressDetails: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- addressType: string (nullable = true)
| | | | |-- address: string (nullable = true)
当我输入select
字段时,familyMembers
我得到了预期的以下结果:
data.select('familyMembers.contactNumbers.work').show(truncate=False)
+------------------------------------------------+
|work |
+------------------------------------------------+
|[(07) 4612 3880, (03) 5855 2377, (07) 4979 1871]|
|[(07) 4612 3880, (03) 5855 2377] |
+------------------------------------------------+
data.select('familyMembers.name').show(truncate=False)
+------------------------------------+
|name |
+------------------------------------+
|[Jane Smith, Bob Smith, Simon Smith]|
|[Jackie Sacamano, Simon Sacamano] |
+------------------------------------+
然而,当我尝试从ArrayType (下方)输入select
字段时,我收到一个错误:addressDetails
familyMembers
>>> data.select('familyMembers.addressDetails.address').show(truncate=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1421, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '`familyMembers`.`addressDetails`['address']' due to data type mismatch: argument 2 requires integral type, however, ''address'' is of string type.;;
'Project [familyMembers#71.addressDetails[address] AS address#277]
+- LogicalRDD [recordID#69, customerDetails#70, familyMembers#71], false
我很困惑。两者familyMembers
和addressDetails
are ArrayTypes
,但从一个作品中选择,而不是从另一个作品中选择。有对此的解释,还是我错过了什么?是因为一个嵌套在另一个中吗?
重现代码(只有 1 条记录):
from pyspark.sql.types import StructType
from pyspark.sql import SparkSession, DataFrame
import json
rawdata = [{"recordID":"abc-123","customerDetails":{"name":"John Smith","dob":"1980-04-23"},"familyMembers":[{"relationship":"mother","name":"Jane Smith","contactNumbers":{"work":"(07) 4612 3880","home":"(08) 8271 1577"},"addressDetails":[{"addressType":"residential","address":"29 Commonwealth St, Clifton, QLD 4361 "},{"addressType":"work","address":"20 A Yeo Ave, Highgate, SA 5063 "}]},{"relationship":"father","name":"Bob Smith","contactNumbers":{"work":"(03) 5855 2377","home":"(03) 9773 2483"},"addressDetails":[{"addressType":"residential","address":"1735 Fenaughty Rd, Kyabram South, VIC 3620"},{"addressType":"work","address":"12 Haldane St, Bonbeach, VIC 3196 "}]},{"relationship":"brother","name":"Simon Smith","contactNumbers":{"work":"(07) 4979 1871","home":"(08) 9862 6017"},"addressDetails":[{"addressType":"residential","address":"6 Darren St, Sun Valley, QLD 4680"},{"addressType":"work","address":"Arthur River, WA 6315"}]}]},]
strschema = '{"fields":[{"metadata":{},"name":"recordID","nullable":true,"type":"string"},{"metadata":{},"name":"customerDetails","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"dob","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"familyMembers","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"relationship","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"contactNumbers","nullable":true,"type":{"fields":[{"metadata":{},"name":"work","nullable":true,"type":"string"},{"metadata":{},"name":"home","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"addressDetails","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"addressType","nullable":true,"type":"string"},{"metadata":{},"name":"address","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}}],"type":"struct"},"type":"array"}}],"type":"struct"}'
spark = SparkSession.builder.appName("json-un-nester").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
schema = StructType.fromJson(json.loads(strschema))
datardd = sc.parallelize(rawdata)
data = spark.createDataFrame(datardd, schema=schema)
data.show()
data.select('familyMembers.name').show(truncate=False)
data.select('familyMembers.addressDetails.address').show(truncate=False)
解决方案
要理解这一点,您可以打印以下架构:
data.select('familyMembers.addressDetails').printSchema()
#root
# |-- familyMembers.addressDetails: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- addressType: string (nullable = true)
# | | | |-- address: string (nullable = true)
请参阅此处,您有一个结构数组,这与您拥有的初始模式不同。所以你不能直接address
从根访问,但你可以选择嵌套数组的第一个元素然后访问结构字段address
:
data.selectExpr("familyMembers.addressDetails[0].address").show(truncate=False)
#+--------------------------------------------------------------------------+
#|familyMembers.addressDetails AS addressDetails#29[0].address |
#+--------------------------------------------------------------------------+
#|[29 Commonwealth St, Clifton, QLD 4361 , 20 A Yeo Ave, Highgate, SA 5063 ]|
#+--------------------------------------------------------------------------+
或者:
data.select(F.col('familyMembers.addressDetails').getItem(0).getItem("address"))
推荐阅读
- mysql - 数据库自动被删除很多次
- php - 如何在登录 5 分钟和 8 分钟后发送自动电子邮件
- reactjs - 在 React-bootstrap 点击 Navbar.link 时参考目标语言
- php - Laravel - 急切加载的数据不匹配
- javascript - 即使返回结果,AnuglarJS 中的结果也未定义
- ios - 错误:找不到与“iPhone 6s”匹配的模拟器
- c# - 根据数据库中的单元格值(减号/加号)更改表格行中的颜色
- ios - -[UINavigationController selectedViewController]:无法识别的选择器发送到实例 0x105ca6000
- clickhouse - Clickhouse 中是否有任何方法可以跳过主表的某些数据但将其留给物化视图?
- mysql - mysql导致来自同一个表的两列没有重复的行