首页 > 解决方案 > 使用 JSON Schema 将 CSV 转换为 JSON

问题描述

如何将平面表转换为 JSON?

我之前使用自定义代码和库将 JSON 转换为平面表。但是,我在这里的目标是相反的。在继续创建自定义库之前,我想知道是否有人以前遇到过这个问题,以及是否有现有的解决方案。

当您将 JSON 扁平化为 CSV 时,您会丢失结构上的信息,因此要反转它,您需要一个描述 JSON 应该如何构建的文档,理想情况下应该是标准化的JSON Schema

以下示例显示了源 CSV、JSON 架构和预期输出。

用户 CSV

user_id, adress.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter

JSON 模式

这遵循定义的标准,并添加了“源”属性。我建议将此自定义属性添加到此特定问题,以便在 csv 列和 JSON 值(叶子)之间进行映射。

{
 "$schema": "https://json-schema.org/draft/2020-12/schema",
 "title": "User",
 "type": "object",
 "properties":{
  "user_id" : {"type":"integer", "source":"user_id"},
  "address":{
   "type":"object",
   "properties":{
    "city" : {"type":"string", "source":"adress.city"},
    "street" : {"type":"string", "source":"adress.street"},
    "number": { "type":"integer", "source":"adress.number"}
   }
  },
  "name" : {"type":"string", "source":"name"}},
  "aka":{
   "type": "array",
   "items" : {"type":"string", "source":"aka"}
  },
  "contacts":{
   "type":"array",
   "items":{
    "type":"object",
    "properties":{
     "name" : {"type":"string", "source":"contacts.name"},
     "relationship":{"type":"string", "source":"contacts.relationship"}
    },
   }
  }
 }
}

预期的 JSON

{
 "user_id":1,
 "address":{
  "city":"Seattle",
  "street":"Atomic Street",
  "number":6910
 },
 "name":"Rick Sanchez",
 "aka":[
  "Rick",
  "Grandpa",
  "Albert Ein-douche",
  "Richard"
 ],
 "contacts":[
  {
   "name":"Morty",
   "relationship":"Grandson"
  },
  {
   "name":"Beth",
   "relationship":"Daughter"
  }
 ]
}

从上面我们看到,虽然 CSV 中有 8 行,但我们生成了一个 JSON 对象(而不是 8 行),因为只有一个唯一用户(user_id = 1)。这可以从根元素是对象而不是列表的 JSON 模式中推断出来。

如果我们没有指定 JSON Schema 或其他类型的映射,您可以简单地假设没有结构,只需创建 8 个平面 json,如下所示

[
 {"user_id":1,"address.city":"Seattle", ... "aka":"Rick" ... "contacts.relationship":"Grandson"}
 ...
 {"user_id":1,"address.city":"Seattle", ... "aka":"Richard" ... "contacts.relationship":"Daughter"}
]

我正在添加 Python 标记,因为这是我最常使用的语言,但在这种情况下,解决方案不需要使用 Python。

标签: pythonjsoncsvjsonschema

解决方案


我不完全清楚为什么需要 JSON 模式,但如果你愿意,你可以轻松地创建一个便利函数,它基本上可以将你的 CSV 数据映射到的平面 JSON “展开”为嵌套如上所述的字典格式。

以下示例应演示如何工作的简化示例。注意以下两点:

  • 在 CSV 标题中,我更正了一个错字并将其中一列重命名为address.city; 以前,它是adress.city,这将导致它被映射到单独的adress键下的另一个 JSON 路径,这可能是不可取的。

  • 我不确定处理这个问题的最佳方法,但看起来csv模块只允许单字符分隔符;在 CSV 文件中,看起来你有一个逗号和一个空格, 作为分隔符,所以我刚刚用一个逗号替换了所有出现的这个,,以便分隔符上的拆分按预期工作。

from csv import DictReader
from io import StringIO
from typing import Any


csv_data = StringIO("""\
user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
""".replace(', ', ',')
)


def unflatten_json(json_dict: dict):
    """Unflatten a JSON dictionary object, with keys like 'a.b.c'"""
    result_dict = {}

    for k, v in json_dict.items():
        *nested_parts, field_name = k.split('.')

        obj = result_dict
        for p in nested_parts:
            obj = obj.setdefault(p, {})

        obj[field_name] = v

    return result_dict


def main():
    reader = DictReader(csv_data)
    flat_json: list[dict[str, Any]] = list(reader)

    first_obj = flat_json[0]
    nested_dict = unflatten_json(first_obj)

    print('Flat JSON:   ', first_obj)
    print('Nested JSON: ', nested_dict)


if __name__ == '__main__':
    main()

输出如下:

Flat JSON:    {'user_id': '1', 'address.city': 'Seattle', 'address.street': 'Atomic Street', 'address.number': '6910', 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts.name': 'Morty', 'contacts.relationship': 'Grandson'}
Nested JSON:  {'user_id': '1', 'address': {'city': 'Seattle', 'street': 'Atomic Street', 'number': '6910'}, 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts': {'name': 'Morty', 'relationship': 'Grandson'}}

请注意,如果您想展开列表中的所有 JSON 字典对象,则可以改用list如下推导式:

result_list = [unflatten_json(d) for d in flat_json]

我还要指出,上述解决方案并不完美,因为它会将所有内容作为字符串值传递,例如在'user_id': '1'. 要解决这个问题,您可以修改unflatten_json函数,如下所示:

...
for k, v in json_dict.items():
    ...

    try:
        v = int(v)
    except ValueError:
        pass

    obj[field_name] = v

现在未展平的 JSON 对象应如下所示。请注意,我很好地打印它,json.dumps(nested_dict, indent=2)所以它更容易看到。

{
  "user_id": 1,
  "address": {
    "city": "Seattle",
    "street": "Atomic Street",
    "number": 6910
  },
  "name": "Rick Sanchez",
  "aka": "Rick",
  "contacts": {
    "name": "Morty",
    "relationship": "Grandson"
  }
}

完整的解决方案

下面提供了实现所需输出的完整解决方案(附加到aka和的所有行的数据contacts):

from csv import DictReader
from io import StringIO
from pprint import pprint


csv_data = StringIO("""\
user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
""".replace(', ', ',')
)


def unflatten_json(json_dict: dict[str, str]):
    """Unflatten a JSON dictionary object, with keys like 'a.b.c'"""
    result_dict = {}

    for k, v in json_dict.items():
        *nested_parts, field_name = k.split('.')

        obj = result_dict
        for p in nested_parts:
            obj = obj.setdefault(p, {})

        obj[field_name] = int(v) if v.isnumeric() else v

    return result_dict


def main():
    reader = DictReader(csv_data)

    rows = list(map(unflatten_json, reader))

    # retrieve the first element in the (unflattened) sequence
    result_obj = rows[0]
    # define list fields that we want to merge data for
    list_fields = ('aka', 'contacts')
    # now loop through, and for all rows merge the data for these fields
    for field in list_fields:
        result_obj[field] = [row[field] for row in rows]

    print('Result object:')
    pprint(result_obj)


if __name__ == '__main__':
    main()

如问题中所述,这应该具有所需的结果:

Result object:
{'address': {'city': 'Seattle', 'number': 6910, 'street': 'Atomic Street'},
 'aka': ['Rick',
         'Grandpa',
         'Albert Ein-douche',
         'Richard',
         'Rick',
         'Grandpa',
         'Albert Ein-douche',
         'Richard'],
 'contacts': [{'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'}],
 'name': 'Rick Sanchez',
 'user_id': 1}

推荐阅读