How to insert specific keys from a JSON file into a data frame in Python

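Question

Sorry if this is very simple or has already been asked; I'm new to Python and to working with JSON files, so I'm quite confused.

I have a 9 GB JSON file scraped from a website. The data contains information on about 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the JSON file, like so: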

{
  "_id": "in-00000001",
  "name": {
    "family_name": "Trump",
    "given_name": "Donald"
  },
  "locality": "United States",
  "skills": [
    "Twitter",
    "Real Estate",
    "Golf"
     ],
  "industry": "Government",
  "experience": [
  {
    "org": "Republican",
    "end": "Present",
    "start": "January 2017",
    "title": "President of the United States"
  },
  {
    "org": "The Apprentice",
    "end": "2015",
    "start": "2003",
    "title": "The guy that fires people"
  }]
}

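So here, _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, such as education, awards or interests, or may lack an attribute found in another profile, such as skills.

What I'd like to do is scan each profile in the JSON file, and if a profile contains the attributes skills, industry and experience, extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I specifically want the name of their current employer, i.e. the most recent entry's org. The data frame would look like this: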

    Industry   | Current employer | Skills
    ___________________________________________________________________
    Government | Republican       | Twitter, Real Estate, Golf
    Marketing  | Marketers R Us   | Branding, Social Media, Advertising

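...and so on for all profiles that have these three attributes.

I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.

I suppose rough pseudocode would be: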

for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of 
            data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame

I just need to know how to write this in Python.

Tags: python, json, pandas, dataframe

Answer

I'm assuming the file contains all the profiles in a single object, such as

{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        #Full object as in the example above
    }
}

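Before going on, let me show a correct way of using Pandas DataFrames.

Example of better use of Pandas DataFrames:

Values in a Pandas DataFrame should not be lists (pandas will store them, but most operations handle list-valued cells poorly). So we will have to duplicate rows, as shown in the example below. Check this question and JD Long's answer for more details: how to use lists as values in pandas dataframe?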

ID      |    Industry   | Current employer | Skill
___________________________________________________________________
in-01   |    Government | Republican       | Twitter
in-01   |    Government | Republican       | Real Estate
in-01   |    Government | Republican       | Golf
in-02   |    Marketing  | Marketers R Us   | Branding
in-02   |    Marketing  | Marketers R Us   | Social Media
in-02   |    Marketing  | Marketers R Us   | Advertising
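As a rough illustration of this layout (not part of the original answer), the sketch below starts from the two example profiles with a list in the Skill column and uses DataFrame.explode, available in pandas 0.25 and later, to get one row per skill. The names wide and long_df are made up for the example.

```python
import pandas as pd

# Illustrative data taken from the example tables, with a list per profile
wide = pd.DataFrame({
    "ID": ["in-01", "in-02"],
    "Industry": ["Government", "Marketing"],
    "Current employer": ["Republican", "Marketers R Us"],
    "Skill": [["Twitter", "Real Estate", "Golf"],
              ["Branding", "Social Media", "Advertising"]],
})

# explode() repeats the other columns once per element of each list
long_df = wide.explode("Skill").reset_index(drop=True)
print(long_df)
```

Each profile now occupies as many rows as it has skills, which is the shape shown in the table above.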

Look for the explanations in the comments of the following code:

import json
import pandas as pd

json_path = "path/to/file.json"  # placeholder: path to the .json file

# Load the whole file as JSON (this reads the entire file into memory)
with open(json_path) as file:
    obj = json.load(file)

rows = []
# Iterate over the items as key/value pairs.
# This loop depends on my first assumption about the file layout;
# depending on the file format, it might have to differ.
for prof_key, profile in obj.items():
    # Verify that the profile contains all the required keys
    if all(key in profile for key in ("_id", "experience", "industry", "skills")):
        # Current employer: the first experience entry whose "end" is "Present"
        current = [x for x in profile["experience"] if x.get("end") == "Present"]
        if not current:
            continue  # no current position listed, skip this profile
        for skill in profile["skills"]:
            # One row per skill
            rows.append([profile["_id"], profile["industry"], current[0]["org"], skill])

# Create the DataFrame df with the columns as in the example
df = pd.DataFrame(rows, columns=["ID", "Industry", "Employer", "Skill"])

The code above collects one row per skill in a plain list and builds the data frame once at the end; adding rows to a DataFrame one at a time is much slower on a file of this size.
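If the file turns out to hold one JSON object per line instead (an assumption on my part, suggested by the "for each profile in open(...)" pseudocode in the question), the profiles can be read line by line without loading all 9 GB at once. The sketch below is one way to do that; rows_from_jsonl, REQUIRED, df_lines and the small demo file are hypothetical names and data used only for illustration.

```python
import json
import os
import tempfile
import pandas as pd

REQUIRED = ("_id", "experience", "industry", "skills")

def rows_from_jsonl(path):
    """Yield one [ID, Industry, Employer, Skill] row per skill."""
    with open(path, encoding="utf-8") as file:
        for line in file:  # only one line is parsed at a time
            line = line.strip()
            if not line:
                continue
            profile = json.loads(line)
            if not all(key in profile for key in REQUIRED):
                continue
            current = [x for x in profile["experience"] if x.get("end") == "Present"]
            if not current:
                continue  # no current position listed
            for skill in profile["skills"]:
                yield [profile["_id"], profile["industry"], current[0]["org"], skill]

# Tiny demo file standing in for the real data (assumed one profile per line)
sample = [
    {"_id": "in-01", "industry": "Government", "skills": ["Twitter", "Golf"],
     "experience": [{"org": "Republican", "end": "Present"}]},
    {"_id": "in-02", "industry": "Marketing"},  # missing keys, so it is skipped
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for profile in sample:
        tmp.write(json.dumps(profile) + "\n")

df_lines = pd.DataFrame(list(rows_from_jsonl(tmp.name)),
                        columns=["ID", "Industry", "Employer", "Skill"])
os.remove(tmp.name)
print(df_lines)
```

The filtering logic is the same as in the code above; only the way profiles are read from disk differs.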

When you want to use this information later, you will have to use df.groupby('ID').
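For instance, a minimal sketch of getting back to one row per profile, with the skills joined into a single string as in the table from the question, might look like this. The data is the illustrative example from above, and df_long and per_profile are names invented for the sketch.

```python
import pandas as pd

# Illustrative long-format data, one row per skill
df_long = pd.DataFrame(
    [["in-01", "Government", "Republican", "Twitter"],
     ["in-01", "Government", "Republican", "Real Estate"],
     ["in-01", "Government", "Republican", "Golf"],
     ["in-02", "Marketing", "Marketers R Us", "Branding"],
     ["in-02", "Marketing", "Marketers R Us", "Social Media"],
     ["in-02", "Marketing", "Marketers R Us", "Advertising"]],
    columns=["ID", "Industry", "Employer", "Skill"],
)

# Collapse to one row per profile: keep the first industry and employer,
# and join that profile's skills into a comma-separated string
per_profile = (
    df_long.groupby("ID", sort=False)
    .agg({"Industry": "first", "Employer": "first", "Skill": ", ".join})
    .reset_index()
)
print(per_profile)
```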

Let me know if your file has a different format, and whether this explanation is enough to get you started or you need more.

