Missing the first document when loading a multi-document YAML file into a Pandas DataFrame

Problem description

I am trying to load a multi-document YAML file (i.e., a YAML file made up of several YAML documents separated by "---") into a Pandas DataFrame. For some reason the first document never shows up in the DataFrame. If the output of yaml.safe_load_all is first materialized into a list (instead of feeding the iterator to pd.io.json.json_normalize), all documents end up in the DataFrame. I can reproduce this with the example code below (on a completely unrelated, public YAML file).

import os
import yaml
import pandas as pd
import urllib.request

# public example of multi-document yaml
inputfilepath = os.path.expanduser("~/my_example.yaml")
url = "https://raw.githubusercontent.com/kubernetes/examples/master/guestbook/all-in-one/guestbook-all-in-one.yaml"
urllib.request.urlretrieve(url, inputfilepath)

with open(inputfilepath, 'r') as stream:
    df1 = pd.io.json.json_normalize(yaml.safe_load_all(stream))

with open(inputfilepath, 'r') as stream:
    df2 = pd.io.json.json_normalize([x for x in yaml.safe_load_all(stream)])

print(f'Output table shape with iterator: {df1.shape}')
print(f'Output table shape with iterator materialized as list: {df2.shape}')

I expected both results to be identical, but instead I get:

Output table shape with iterator: (5, 18)
Output table shape with iterator materialized as list: (6, 18)

Any idea why these results differ?

Tags: python, pandas, pyyaml

Solution


See this site for background on list comprehensions vs. generator expressions.

df1 is missing the first row of data because you are passing an iterator (the generator returned by yaml.safe_load_all) rather than a materialized list:

print(yaml.safe_load_all(stream))
#Output: <generator object load_all at 0x00000293E1697750>
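A generator can only be walked once; a quick sketch (reusing inputfilepath from the question) to illustrate:

with open(inputfilepath, 'r') as stream:
    gen = yaml.safe_load_all(stream)
    first_pass = list(gen)    # consumes every document in the stream
    second_pass = list(gen)   # the generator is now exhausted

print(len(first_pass), len(second_pass))
#Output: 6 0  (six documents on the first pass, none on the second)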

Per the pandas docs, it expects a dict or a list of dicts:

data : dict or list of dicts
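So the fix is to hand json_normalize a materialized list of dicts, which is exactly what your df2 does. A minimal sketch of that workaround (newer pandas versions expose the same function as pd.json_normalize):

with open(inputfilepath, 'r') as stream:
    docs = list(yaml.safe_load_all(stream))    # materialize all documents as dicts
    df = pd.io.json.json_normalize(docs)

print(df.shape)
#Output: (6, 18)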

Update with more details:

Looking at the normalize.py source file, json_normalize has this conditional check, so your generator is treated as if you had passed in a nested structure:

if any([isinstance(x, dict)
        for x in compat.itervalues(y)] for y in data):
    # naive normalization, this is idempotent for flat records
    # and potentially will inflate the data considerably for
    # deeply nested structures:
    #  {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:2}
    #
    # TODO: handle record value which are lists, at least error
    #       reasonably
    data = nested_to_record(data, sep=sep)
return DataFrame(data)
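You can reproduce the effect of that check on a generator with plain Python (the names records and has_nested below are only illustrative, not pandas internals):

records = ({"a": {"b": i}} for i in range(3))   # a generator of nested dicts
has_nested = any([isinstance(x, dict) for x in y.values()] for y in records)

print(has_nested)
#Output: True  (but deciding this consumed the first record)
print(list(records))
#Output: [{'a': {'b': 1}}, {'a': {'b': 2}}]  (record 0 is gone)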

Inside the nested_to_record function:

new_d = copy.deepcopy(d)
for k, v in d.items():
    # each key gets renamed with prefix
    if not isinstance(k, compat.string_types):
        k = str(k)
    if level == 0:
        newkey = k
    else:
        newkey = prefix + sep + k

    # only dicts gets recurse-flattend
    # only at level>1 do we rename the rest of the keys
    if not isinstance(v, dict):
        if level != 0:  # so we skip copying for top level, common case
            v = new_d.pop(k)
            new_d[newkey] = v
        continue
    else:
        v = new_d.pop(k)
        new_d.update(nested_to_record(v, newkey, sep, level + 1))
new_ds.append(new_d)

The any(...) check shown above is where your generator starts being consumed: evaluating that condition pulls the first document off the iterator. By the time nested_to_record loops over the remaining records and flattens each one via d.items(), that first record has already been exhausted, which is why it never reaches the resulting DataFrame.
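If you want to confirm that the dropped row is indeed the first YAML document, a hedged check against the frames built in the question (assuming, as with the guestbook file, that each document is a mapping):

with open(inputfilepath, 'r') as stream:
    first_doc = next(yaml.safe_load_all(stream))   # the document df1 is missing

print(first_doc.get('kind'), first_doc.get('metadata', {}).get('name'))
print(df2.shape[0] - df1.shape[0])
#Output: 1  (exactly one record, the first, was dropped)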

