Restructuring JSON into a different JSON structure with Python

Question

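I have a pile of JSON data, mostly written by hand. It runs to several thousand lines. I need to use Python to restructure it into a completely different format.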

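An overview of my "things":

Column: The basic "unit" of my data. Each one has attributes. Don't worry about what the attributes mean, but any attribute that exists must be preserved for each Column.

Folder: Folders group Columns and other Folders together. Folders currently have no attributes; they (currently) only contain other Folder and Column objects (object here doesn't necessarily mean a JSON object... more of an "entity").

Universe: Universes group everything into big chunks which, in the larger scope of my project, cannot interact with each other. That doesn't matter here, but it's what they do.

Some constraints:

Currently, I have Columns in this form: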

"Column0Name": {
  "type": "a type",
  "dtype": "data type",
  "description": "abcdefg"
}

I need it to become:

{
  "name": "Column0Name",
  "type": "a type",
  "dtype": "data type",
  "description": "abcdefg"
}

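Essentially, I need to convert the Column key-value things into an array of things (I'm new to JSON and don't know the terminology). I also need each Folder to end up with two new JSON arrays (in addition to the "name": "FolderName" key-value pair): a "folders": [] and a "columns": [] need to be added. So I have this for a Folder: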

"Folder0Name": {
  "Column0Name": {
    "type": "a",
    "dtype": "b",
    "description": "c"
  },
  "Column1Name": {
    "type": "d",
    "dtype": "e",
    "description": "f"
  }
}

and need it to become this:

{
  "name": "Folder0Name",
  "folders": [],
  "columns": [
    {"name": "Column0Name", "type": "a", "dtype": "b", "description": "c"},
    {"name": "Column1Name", "type": "d", "dtype": "e", "description": "f"}
  ]
}

The Folders will also end up in an array inside their parent Universe. Likewise, each Universe will end up with "name", "folders", and "columns" entries. Like this:

{
  "name": "Universe0",
  "folders": [a bunch of folders in a JSON array],
  "columns": [occasionally some columns in a JSON array]
}

Bottom line

This is what I have so far. I'm stuck on getting the generators to return actual dictionaries instead of generator objects.

import json


class AllUniverses:
    """Container to hold all the Universes found in the json file"""
    def __init__(self, filename):
        self._fn = filename
        self.data = {}
        self.read_data()

    def read_data(self):
        with open(self._fn, 'r') as fin:
            self.data = json.load(fin)
        return self

    def universe_key(self):
        """Get the next universe key from the dict of all universes

            The key will be used as the name for the universe.
        """
        yield from self.data


class Universe:
    def __init__(self, json_filename):
        self._au = AllUniverses(filename=json_filename)
        self.uni_key = self._au.universe_key()
        self._universe_data = self._au.data.copy()
        self._col_attrs = ['type', 'dtype', 'description', 'aggregation']
        self._folders_list = []
        self._columns_list = []
        self._type = "Universe"
        self._name = ""
        self.uni = dict()
        self.is_folder = False
        self.is_column = False

    def output(self):
        # TODO: Pass this to json.dump?
        # TODO: Still need to get the actual folder and column dictionaries
        #  from the generators
        out = {
            "name": self._name,
            "type": "Universe",
            "folder": [f.me for f in self._folders_list],
            "columns": [c.me for c in self._columns_list]}
        return out

    def update_universe(self):
        """Get the next universe"""
        universe_k = next(self.uni_key)
        self._name = str(universe_k)
        self.uni = self._universe_data.pop(universe_k)
        return self

    def parse_nodes(self):
        """Process all child nodes"""
        nodes = [_ for _ in self.uni.keys()]
        for k in nodes:
            v = self.uni.pop(k)
            self._is_column(val=v)
            if self.is_column:
                fc = Column(data=v, key_name=k)
                self._columns_list.append(fc)
            else:
                fc = Folder(data=v, key_name=k)
                self._folders_list.append(fc)
        return self

    def _is_column(self, val):
        """Determine if val is a Column or Folder object"""
        self.is_folder = False
        self._column = False
        if isinstance(val, dict) and not val:
            self.is_folder = True
        elif not isinstance(val, dict):
            raise TypeError('Cannot handle inputs not of type dict')
        elif any([i in val.keys() for i in self._col_attrs]):
            self._column = True
        else:
            self.is_folder = True
        return self

    def parse_children(self):
        for folder in self._folders_list:
            assert(isinstance(folder, Folder)), f'bletch idk what happened'
            folder.parse_nodes()


class Folder:
    def __init__(self, data, key_name):
        self._data = data.copy()
        self._name = str(key_name)
        self._node_keys = [_ for _ in self._data.keys()]
        self._folders = []
        self._columns = []
        self._col_attrs = ['type', 'dtype', 'description', 'aggregation']

    @property
    def me(self):
        # maybe this should force the code to parse all children of this
        # Folder? Need to convert the generator into actual dictionaries
        return {"name": self._name, "type": "Folder",
                "columns": [(c.me for c in self._columns)],
                "folders": [(f.me for f in self._folders)]}

    def parse_nodes(self):
        """Parse all the children of this Folder

            Parse through all the node names. If it is detected to be a Folder
            then create a Folder obj. from it and add to the list of Folder
            objects. Else create a Column obj. from it and append to the list
            of Column obj.

            This should be appending dictionaries
        """
        for key in self._node_keys:
            _folder = False
            _column = False
            values = self._data.copy()[key]

            if isinstance(values, dict) and not values:
                _folder = True
            elif not isinstance(values, dict):
                raise TypeError('Cannot handle inputs not of type dict')
            elif any([i in values.keys() for i in self._col_attrs]):
                _column = True
            else:
                _folder = True
            if _folder:
                f = Folder(data=values, key_name=key)
                self._folders.append(f.me)
            else:
                c = Column(data=values, key_name=key)
                self._columns.append(c.me)
        return self


class Column:
    def __init__(self, data, key_name):
        self._data = data.copy()
        self._stupid_check()
        self._me = {
            'name': str(key_name),
            'type': 'Column',
            'ctype': self._data.pop('type'),
            'dtype': self._data.pop('dtype'),
            'description': self._data.pop('description'),
            'aggregation': self._data.pop('aggregation')}

    def __str__(self):
        # TODO: pretty sure this isn't correct
        return str(self.me)

    @property
    def me(self):
        return self._me

    def to_json(self):
        # This seems to be working? I think?
        return json.dumps(self, default=lambda o: str(self.me))  # o.__dict__)

    def _stupid_check(self):
        """If the key isn't in the dictionary, add it"""
        keys = [_ for _ in self._data.keys()]
        keys_defining_a_column = ['type', 'dtype', 'description', 'aggregation']
        for json_key in keys_defining_a_column:
            if json_key not in keys:
                self._data[json_key] = ""
        return self


if __name__ == "__main__":
    file = r"dummy_json_data.json"
    u = Universe(json_filename=file)
    u.update_universe()
    u.parse_nodes()
    u.parse_children()
    print('check me')

It gives me this:

{
    "name":"UniverseName",
    "type":"Universe",
    "folder":[
        {"name":"Folder0Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB0B0>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB190>]
        },
        {"name":"Folder2Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB040>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB120>]
        },
        {"name":"Folder4Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB270>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB200>]
        },
        {"name":"Folder6Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB2E0>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB350>]
        },
        {"name":"Folder8Name",
            "type":"Folder",
            "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB3C0>],
            "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB430>]
        }
    ],
    "columns":[]
}

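If there is an existing tool for this kind of conversion, so that I don't have to write Python code, that would be an attractive alternative too.

Tags: python, arrays, json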

Answer
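
On the generator objects in the question's output: they appear to come from lines such as `"columns": [(c.me for c in self._columns)]` in `Folder.me`, where a parenthesised generator expression sits inside list brackets. That builds a one-element list holding an unevaluated generator. The sketch below uses made-up sample data to show the difference from a list comprehension.

```python
import json

# Hypothetical sample data standing in for parsed Column dictionaries.
columns = [{"name": "Column0Name"}, {"name": "Column1Name"}]

# Parentheses inside the brackets: a list holding ONE generator object.
wrapped = [(c["name"] for c in columns)]

# Plain list comprehension: every item is evaluated immediately.
names = [c["name"] for c in columns]

print(len(wrapped), type(wrapped[0]).__name__)
print(json.dumps(names))
```

Dropping the inner parentheses is enough to get real values that `json.dumps` can serialise.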


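Let's create the 3 classes needed to represent Columns, Folders, and Universes. Before starting, there are a few topics I want to mention. I'll describe them briefly here, and I can go deeper if any of them is new to you:

  • I will use type annotations to make clear what type each variable is.
  • I will use __slots__. Telling the Column class that its instances will have name, ctype, dtype, description, and aggregation attributes means each instance needs less memory. The downside is that it won't accept any attribute not listed there. That is, it saves memory but loses flexibility. Since we will have several (possibly hundreds or thousands of) Column instances, the reduced memory footprint seems more important than the flexibility of being able to add any attribute.
  • Each class will have a standard constructor where every argument has a default value, except name, which is mandatory.
  • Each class will have another constructor called from_old_syntax. It will be a class method that receives a string corresponding to the name and a dict corresponding to the data as its arguments, and outputs the corresponding instance (Column, Folder, or Universe).
  • Universes are basically the same as Folders with a different name (for now), so Universe will simply inherit from Folder (class Universe(Folder): pass).
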
from typing import List


class Column:
    __slots__ = 'name', 'ctype', 'dtype', 'description', 'aggregation'

    def __init__(
        self,
        name: str,
        ctype: str = '',
        dtype: str = '',
        description: str = '',
        aggregation: str = '',
    ) -> None:
        self.name = name
        self.ctype = ctype
        self.dtype = dtype
        self.description = description
        self.aggregation = aggregation

    @classmethod
    def from_old_syntax(cls, name: str, data: dict) -> "Column":
        column = cls(name)
        for key, value in data.items():
            # The old syntax used type for column type but in the new syntax it
            # will have another meaning so we use ctype instead
            if key == 'type':
                key = 'ctype'
            try:
                setattr(column, key, value)
            except AttributeError as e:
                raise AttributeError(f"Unexpected key {key} for Column") from e
        return column


class Folder:
    __slots__ = 'name', 'folders', 'columns'

    def __init__(
        self,
        name: str,
        columns: List[Column] = None,
        folders: List["Folder"] = None,
    ) -> None:
        self.name = name
        if columns is None:
            self.columns = []
        else:
            self.columns = [column for column in columns]
        if folders is None:
            self.folders = []
        else:
            self.folders = [folder for folder in folders]

    @classmethod
    def from_old_syntax(cls, name: str, data: dict) -> "Folder":
        columns = []  # type: List[Column]
        folders = []  # type: List["Folder"]
        for key, value in data.items():
            # Determine if it is a Column or a Folder
            if 'type' in value and 'dtype' in value:
                columns.append(Column.from_old_syntax(key, value))
            else:
                folders.append(Folder.from_old_syntax(key, value))
        return cls(name, columns, folders)


class Universe(Folder):
    pass

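As you can see, the constructors are very simple: assign the arguments to the attributes and done. In the case of Folders (and therefore Universes too), two of the arguments are lists of columns and folders. Their default value is None (in which case we initialize an empty list) because using mutable objects as default values causes problems, so it is good practice to use None as the default for mutable values such as lists.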
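
A minimal sketch of the mutable-default problem mentioned above. The two helper functions are hypothetical and exist only for this illustration; they are not part of the answer's classes.

```python
from typing import List, Optional


def append_shared(item: str, bucket: List[str] = []) -> List[str]:
    # The default list is created once, when the function is defined,
    # and is then reused by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_fresh(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    # None as the default lets each call build its own new list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_shared = append_shared("a")
second_shared = append_shared("b")
first_fresh = append_fresh("a")
second_fresh = append_fresh("b")

print(first_shared is second_shared)  # both names point at the same list
print(first_fresh, second_fresh)
```

With a shared default, two Folders created without arguments would end up sharing one columns list, which is the situation the None default avoids.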

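Column's from_old_syntax class method creates an empty Column with the provided name. Afterwards we loop over the data dict that was also provided and assign each of its key-value pairs to the corresponding attribute. There is one special case: the "type" key is converted to "ctype", because "type" will be used for a different purpose in the new syntax. The assignment itself is done by setattr(column, key, value). We wrap it in a try ... except ... clause because, as said above, only the names in __slots__ can be used as attributes, so if you forgot an attribute you will get an exception saying "AttributeError: Unexpected key NAME for Column" and you just have to add that NAME to __slots__.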
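
To illustrate how __slots__ and setattr interact, here is a small sketch. `SlottedColumn` is a cut-down, hypothetical stand-in for the Column class, and the `units` key is an invented example of an attribute that was not declared.

```python
class SlottedColumn:
    # Only these two names may be set on instances.
    __slots__ = ("name", "dtype")

    def __init__(self, name: str, dtype: str = "") -> None:
        self.name = name
        self.dtype = dtype


column = SlottedColumn("Column0Name")

# A declared slot can be assigned dynamically by name.
setattr(column, "dtype", "data type")

# An undeclared name is refused with AttributeError.
try:
    setattr(column, "units", "metres")
    rejected = False
except AttributeError:
    rejected = True

print(column.dtype, rejected)
```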

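Folder's (and therefore Universe's) from_old_syntax class method is even simpler. It creates a list of columns and a list of folders, loops over the data checking whether each item is a folder or a column, and uses the appropriate from_old_syntax class method. It then returns an instance built from the two lists and the provided name. Note that Folder.from_old_syntax is used to create the nested folders rather than cls.from_old_syntax, because cls may be Universe. To create the instance itself, however, we do use cls(...), because there we want either a Universe or a Folder.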
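
The effect of naming the base class for nested items while using cls for the returned instance can be sketched as follows. `Node` and `Root` are hypothetical stand-ins for Folder and Universe, reduced to nested folders only.

```python
from typing import List, Optional


class Node:
    def __init__(self, name: str, children: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.children = [] if children is None else list(children)

    @classmethod
    def from_dict(cls, name: str, data: dict) -> "Node":
        # Nested items are always built as plain Node objects...
        children = [Node.from_dict(key, value) for key, value in data.items()]
        # ...while the outermost object takes the class the method was called on.
        return cls(name, children)


class Root(Node):
    pass


root = Root.from_dict("Universe0", {"Folder0Name": {"Folder1Name": {}}})
print(type(root).__name__, type(root.children[0]).__name__)
```

Had the recursive call used cls.from_dict, every nested folder under a Root would itself have been a Root.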

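Now you can do universes = [Universe.from_old_syntax(name, data) for name, data in json.load(f).items()], where f is the file, and you will have all the Universes, Folders, and Columns in memory. So now we need to encode them back to JSON. For this we will extend json.JSONEncoder so that it knows how to turn our classes into dictionaries it can encode normally. To do that you only need to override the default method, check whether the object passed in is one of our classes, and return a dict that will be encoded. If it is not one of our classes, we let the parent default method handle it.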

import json


# JSON fields with these values will be omitted
EMPTY_VALUES = "", [], {}


class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (Column, Folder, Universe)):
            # Make a dict with every item in their respective __slots__
            data = {
                attr: getattr(obj, attr) for attr in obj.__slots__
                if getattr(obj, attr) not in EMPTY_VALUES
            }
            # Add the type field with the class name
            data['type'] = obj.__class__.__name__
            return data

        # Use the parent class function for any object not handled explicitly
        return super().default(obj)

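Converting a class to a dict basically takes the names in __slots__ as the keys and the attribute values as the values. We filter out values that are an empty string, an empty list, or an empty dict, since we don't need to write them to the JSON. Finally, we add the "type" key to the dict by reading the object's class name (Column, Folder, or Universe).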

To use it, you have to pass CustomEncoder as the cls argument to json.dump.
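
As a self-contained illustration of passing an encoder through the cls argument, the sketch below uses a reduced, hypothetical `MiniColumn` class and `MiniEncoder`, and calls json.dumps (the string-returning sibling of json.dump) so the result can be inspected directly.

```python
import json


class MiniColumn:
    __slots__ = ("name", "dtype")

    def __init__(self, name: str, dtype: str = "") -> None:
        self.name = name
        self.dtype = dtype


class MiniEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, MiniColumn):
            # Keep only the slots that hold a non-empty value.
            data = {
                attr: getattr(obj, attr) for attr in obj.__slots__
                if getattr(obj, attr) != ""
            }
            data["type"] = type(obj).__name__
            return data
        # Anything else falls through to the parent, which raises TypeError.
        return super().default(obj)


items = [MiniColumn("Column0Name", dtype="b"), MiniColumn("Column1Name")]
decoded = json.loads(json.dumps(items, cls=MiniEncoder))
print(decoded)
```

The encoder class itself is passed, not an instance; the json module instantiates it internally.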

So the code looks like this (class definitions omitted to keep it short):

import json
from typing import List


# JSON fields with these values will be omitted
EMPTY_VALUES = "", [], {}


class Column:
    # ...


class Folder:
    # ...


class Universe(Folder):
    pass


class CustomEncoder(json.JSONEncoder):
    # ...


if __name__ == '__main__':
    with open('dummy_json_data.json', 'r') as f_in, open('output.json', 'w') as f_out:
        universes = [Universe.from_old_syntax(name, data)
                     for name, data in json.load(f_in).items()]
        json.dump(universes, f_out, cls=CustomEncoder, indent=4)
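
For comparison, the same restructuring can be sketched without classes as a single recursive function. This is an alternative under stated assumptions, not the approach above: it follows the target layout given in the question (every folder gets "name", "folders", and "columns", and column attributes are copied unchanged), so it does not rename "type" to "ctype", does not add a "type" field, and does not omit empty values. The `restructure` and `is_column` names are hypothetical, and the column test mirrors the one in Folder.from_old_syntax.

```python
import json


def is_column(value: dict) -> bool:
    # Assumption: anything with both "type" and "dtype" is a Column.
    return "type" in value and "dtype" in value


def restructure(name: str, data: dict) -> dict:
    folders, columns = [], []
    for key, value in data.items():
        if is_column(value):
            # Move the old key into a "name" field and keep all attributes.
            columns.append({"name": key, **value})
        else:
            folders.append(restructure(key, value))
    return {"name": name, "folders": folders, "columns": columns}


old = {
    "Universe0": {
        "Folder0Name": {
            "Column0Name": {"type": "a", "dtype": "b", "description": "c"},
            "Column1Name": {"type": "d", "dtype": "e", "description": "f"},
        }
    }
}
new = [restructure(name, data) for name, data in old.items()]
print(json.dumps(new, indent=2))
```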
