首页 > 解决方案 > 如何将 csv 的嵌套 dict 格式的“_source”列展平为数据框

问题描述

我有一个包含 500 多行的 csv,其中一列“_source”存储为 JSON。我想将其提取到熊猫数据框中。我需要每个键都是它自己的列。#我有一个 1 mb 的在线社交媒体数据 Json 文件,我需要将字典和键值转换为它们自己的单独列。社交媒体数据来自 Facebook、Twitter/网络爬网...等。大约有 528 行单独的帖子/推文/文本,每行在字典中都有许多字典。我在下面附上了我的 Jupyter 笔记本中的几个步骤,以提供更完整的理解。需要将字典中字典的所有键值对转换为数据框内的列非常感谢,这将是一个巨大的帮助!!!我已经尝试通过这样做将其更改为数据框

source = pd.DataFrame.from_dict(source, orient='columns')

它返回类似这样的东西......我认为它可能会解压缩字典,但它没有。

#source.head()

#_source
#0   {'sub_organization_id': 'default', 'uid': 'aba...
#1   {'sub_organization_id': 'default', 'uid': 'ab0...
#2   {'sub_organization_id': 'default', 'uid': 'ac0...

下面是形状

#source.shape (528, 1)

下面是实际的“_source”行的样子。有许多字典和键:值对,其中每个键都需要是自己的列。谢谢!出于隐私原因,实际链接已被更改/加扰。

{'sub_organization_id': 'default',
 'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
 'project_veid': 'default',
 'campaign_id': 'default',
 'organization_id': 'default',
 'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
     'source': 'Explicit',
     'version': '1.1',
     'type': 'crawl'},
    'results': [{'rule_type': 'hashtag',
      'rule_tag': 'Far',
      'description': None,
      'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
      'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
      'value': '#Far',
      'organization_id': None,
      'sub_organization_id': None,
      'appid': 'ray',
      'project_id': 'CDE2F42-5B87-C594-C900E578C',
      'rule_id': '1838',
      'node_id': None,
      'metadata': {'campaign_title': 'AF',
       'project_title': 'AF '}}]}],
  'render': [{'attribs': {'website': 'github.com/res',
     'version': '1.0',
     'type': 'Page Render'},
    'results': [{'render_status': 'success',
      'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
      'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
      'url': 'https://discooprdapp.com/',
      'load_time': 32}]}]},
 'norm_attribs': {'website': 'github.com/res',
  'version': '1.1',
  'type': 'crawl'},
 'project_id': 'default',
 'system_timestamp': '2019-02-22T19:04:53.569623',
 'doc': {'appid': 'subtter',
  'links': [],
  'response_url': 'https://discooprdapp.com',
  'url': 'https://discooprdapp.com/',
  'status_code': 200,
  'status_msg': 'OK',
  'encoding': 'utf-8',
  'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
  'timestamp': '2019-02-22T19:04:53.963',
  'crawlid': '7fd95-785-4dd259-fcc-8752f'},
 'type': 'crawl',
 'norm': {'body': '\n',
  'domain': 'discordapp.com',
  'author': 'crawl',
  'url': 'https://discooprdapp.com',
  'timestamp': '2019-02-22T19:04:53.961283+00:00',
  'id': '7fc5-685-4dd9-cc-8762f'}}

在您发布之前,请确保实际代码适用于附加的数据。谢谢!

我尝试了以下代码,但它没有工作,有一个我无法弄清楚的语法错误。

pd.io.json.json_normalize(source_data.[_source].apply(json.loads))

pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
                                      ^
SyntaxError: invalid syntax

谁能帮助我,谁就是圣人!

标签: pythonjsonpandasdataframe

解决方案


不久前我不得不做类似的事情。基本上,我使用了一个完全展平 json 的函数来识别将转换为列的键,然后遍历 json 以重建一行并将每一行附加到“结果”数据帧中。因此,使用您提供的数据,它创建了 52 列行并查看它,看起来它将所有键都包含在它自己的列中。任何嵌套的东西,例如:'meta': {'rule_matcher':[{'atribs': {'website': ...]}然后应该有一个列名meta.rule_matcher.atribs.website,其中'.'表示那些嵌套的键

data_source = {'sub_organization_id': 'default',
 'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
 'project_veid': 'default',
 'campaign_id': 'default',
 'organization_id': 'default',
 'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
     'source': 'Explicit',
     'version': '1.1',
     'type': 'crawl'},
    'results': [{'rule_type': 'hashtag',
      'rule_tag': 'Far',
      'description': None,
      'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
      'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
      'value': '#Far',
      'organization_id': None,
      'sub_organization_id': None,
      'appid': 'ray',
      'project_id': 'CDE2F42-5B87-C594-C900E578C',
      'rule_id': '1838',
      'node_id': None,
      'metadata': {'campaign_title': 'AF',
       'project_title': 'AF '}}]}],
  'render': [{'attribs': {'website': 'github.com/res',
     'version': '1.0',
     'type': 'Page Render'},
    'results': [{'render_status': 'success',
      'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
      'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
      'url': 'https://discooprdapp.com/',
      'load_time': 32}]}]},
 'norm_attribs': {'website': 'github.com/res',
  'version': '1.1',
  'type': 'crawl'},
 'project_id': 'default',
 'system_timestamp': '2019-02-22T19:04:53.569623',
 'doc': {'appid': 'subtter',
  'links': [],
  'response_url': 'https://discooprdapp.com',
  'url': 'https://discooprdapp.com/',
  'status_code': 200,
  'status_msg': 'OK',
  'encoding': 'utf-8',
  'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
  'timestamp': '2019-02-22T19:04:53.963',
  'crawlid': '7fd95-785-4dd259-fcc-8752f'},
 'type': 'crawl',
 'norm': {'body': '\n',
  'domain': 'discordapp.com',
  'author': 'crawl',
  'url': 'https://discooprdapp.com',
  'timestamp': '2019-02-22T19:04:53.961283+00:00',
  'id': '7fc5-685-4dd9-cc-8762f'}}

代码:

def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out


flat = flatten_json(data_source)


import pandas as pd
import re

results = pd.DataFrame()
special_cols = []

columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\_(\d+)\_', item )[0]
    except:
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item )[0]
    column = re.sub(r'\_\d+\_', '.', column)

    row_idx = int(row_idx)
    value = flat[item]

    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]

输出:

print (results.to_string())
   atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag  results.description          results.project_veid           results.campaign_id results.value  results.organization_id  results.sub_organization_id results.appid           results.project_id results.rule_id  results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status                                       results.path               results.image_hash                results.url  results.load_time sub_organization_id                                             uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id            system_timestamp doc_appid          doc_response_url                    doc_url  doc_status_code doc_status_msg doc_encoding                doc_attrs_uid            doc_timestamp                 doc_crawlid   type norm_body     norm_domain norm_author                  norm_url                    norm_timestamp                 norm_id
0  github.com/res      Explicit            1.1       crawl           hashtag              Far                  NaN  A7180EA-7078-0C7F-ED5D-86AD7  2A6DA0C-365BB-67DD-B05830920          #Far                      NaN                          NaN           ray  CDE2F42-5B87-C594-C900E578C            1838              NaN                              AF                            AF   github.com/res             1.0  Page Render               success  https://east.amanaws.com/rays-ime-store/render...  bb7674b8ea3fc05bfd027a19815f82c  https://discooprdapp.com/               32.0             default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f

推荐阅读