PySpark: iterating over multi-line nested JSON to build a DataFrame

Question

Folks, I need some help iterating over the following JSON in PySpark and building a DataFrame from it:

{
    "success": true,
    "result": {
        "0x00e01a648ff41346cdeb873182383333d2184dd1": {
            "id": 130,
            "name": "xn--mytherwallet-fvb.com",
            "url": "http://xn--mytherwallet-fvb.com",
            "coin": "ETH",
            "category": "Phishing",
            "subcategory": "MyEtherWallet",
            "description": "Homoglyph",
            "addresses": [
                "0x00e01a648ff41346cdeb873182383333d2184dd1",
                "0x11e01a648ff41346cdeb873182383333d2184dd1"
            ],
            "reporter": "MyCrypto",
            "status": "Offline"
        },
        "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {
            "id": 1200,
            "name": "myetherwallet.in",
            "url": "http://myetherwallet.in",
            "coin": "ETH",
            "category": "Phishing",
            "subcategory": "MyEtherWallet",
            "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"],
            "reporter": "MyCrypto",
            "ip": "159.8.210.35",
            "nameservers": [
                "ns2.eftydns.com",
                "ns1.eftydns.com"
            ],
            "status": "Active"
        }
    }
}

I need to build a DataFrame that represents the list of addresses.

Tags: apache-spark, pyspark

Answer


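I reformatted your JSON into a form Spark can read directly: the whole document on a single line. By default, spark.read.json treats each line of the input file as one complete JSON record, so a pretty-printed document spread over many lines does not parse as intended.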

{"success": true, "result": {"0x00e01a648ff41346cdeb873182383333d2184dd1": {"id": 130, "name": "xn--mytherwallet-fvb.com", "url": "http://xn--mytherwallet-fvb.com", "coin": "ETH", "category": "Phishing", "subcategory": "MyEtherWallet", "description": "Homoglyph", "addresses": ["0x00e01a648ff41346cdeb873182383333d2184dd1", "0x11e01a648ff41346cdeb873182383333d2184dd1"], "reporter": "MyCrypto", "status": "Offline"}, "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {"id": 1200, "name": "myetherwallet.in", "url": "http://myetherwallet.in", "coin": "ETH", "category": "Phishing", "subcategory": "MyEtherWallet", "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"], "reporter": "MyCrypto", "ip": "159.8.210.35", "nameservers": ["ns2.eftydns.com", "ns1.eftydns.com"], "status": "Active"}}}
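If you would rather not collapse the document by hand, a small script can do it. This is only a sketch using Python's standard `json` module; the function name `to_single_line` and the tiny sample document are made up for illustration.

```python
import json


def to_single_line(pretty_json_text):
    """Re-serialize a pretty-printed JSON document as a single line."""
    # Parsing and dumping again drops all newlines and indentation
    # while keeping keys in their original order.
    return json.dumps(json.loads(pretty_json_text))


# Hypothetical miniature of the question's document.
pretty = """{
    "success": true,
    "result": {
        "a": {"id": 130}
    }
}"""

line = to_single_line(pretty)
print(line)
```

Writing `line` to a file gives input that the default reader accepts. Alternatively, Spark 2.2 and later can read the pretty-printed file as-is if you set the reader's `multiLine` option to true.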

Reading the JSON

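// Scala (spark-shell): every line of the file is parsed as one JSON record
val df = spark.read.json("/my_data.json")

df.printSchema()
df.show(false)  // false = do not truncate long column values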

Output

root
 |-- result: struct (nullable = true)
 |    |-- 0x00e01a648ff41346cdeb873182383333d2184dd1: struct (nullable = true)
 |    |    |-- addresses: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- coin: string (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- reporter: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subcategory: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |-- 0x858457daa7e087ad74cdeeceab8419079bc2ca03: struct (nullable = true)
 |    |    |-- addresses: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- coin: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- ip: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- nameservers: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- reporter: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subcategory: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- success: boolean (nullable = true)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                     |success|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|[[WrappedArray(0x00e01a648ff41346cdeb873182383333d2184dd1, 0x11e01a648ff41346cdeb873182383333d2184dd1),Phishing,ETH,Homoglyph,130,xn--mytherwallet-fvb.com,MyCrypto,Offline,MyEtherWallet,http://xn--mytherwallet-fvb.com],[WrappedArray(0x858457daa7e087ad74cdeeceab8419079bc2ca03),Phishing,ETH,1200,159.8.210.35,myetherwallet.in,WrappedArray(ns2.eftydns.com, ns1.eftydns.com),MyCrypto,Active,MyEtherWallet,http://myetherwallet.in]]|true   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
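The schema above shows why this is awkward: every wallet address used as a key becomes its own struct field, so the list of addresses is still buried inside `result`. One possible way to get a row per address is to flatten the document in plain Python first and hand the rows to Spark. The sketch below assumes the file is small enough to parse on the driver; `address_rows`, `to_dataframe`, the column names, and the trimmed `SAMPLE` are my own choices, not part of the original answer.

```python
import json

# Trimmed copy of the question's JSON: only the fields used below are kept.
SAMPLE = """
{"success": true, "result": {
  "0x00e01a648ff41346cdeb873182383333d2184dd1": {
    "id": 130, "name": "xn--mytherwallet-fvb.com",
    "addresses": ["0x00e01a648ff41346cdeb873182383333d2184dd1",
                  "0x11e01a648ff41346cdeb873182383333d2184dd1"]},
  "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {
    "id": 1200, "name": "myetherwallet.in",
    "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"]}}}
"""


def address_rows(doc):
    """Return one (key, id, name, address) tuple per item of every "addresses" list."""
    rows = []
    for key, entry in doc.get("result", {}).items():
        for address in entry.get("addresses", []):
            rows.append((key, entry.get("id"), entry.get("name"), address))
    return rows


def to_dataframe(spark, rows):
    """Build a DataFrame from the tuples; `spark` is an existing SparkSession."""
    return spark.createDataFrame(
        rows, schema="key string, id bigint, name string, address string"
    )


rows = address_rows(json.loads(SAMPLE))
for row in rows:
    print(row)
```

In a PySpark session you would then call `to_dataframe(spark, rows).show(truncate=False)`. For data too large for the driver, a Spark-native route would be to read `result` with a map-typed schema and apply `explode` to the map and then to `addresses`; that variant is not shown here.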
