首页 > 解决方案 > 如何处理在 Pandas 中返回类似字典的对象列表的 JSON?

问题描述

我正在使用来自collegefootballdata.com 的API 来获取有关分数和投注线的数据。我想使用投注线来推断预期的获胜百分比,然后将其与实际结果进行比较(我觉得我的球队输掉了太多比赛,而我们是大热门并想对其进行测试。)此代码检索一场比赛以用于示例目的。

parameters = {
"gameId": 401112435,
"year": 2019
}
response = requests.get("https://api.collegefootballdata.com/lines", params=parameters)

JSON 输出是这样的:

[
{
    "awayConference": "ACC",
    "awayScore": 28,
    "awayTeam": "Virginia Tech",
    "homeConference": "ACC",
    "homeScore": 35,
    "homeTeam": "Boston College",
    "id": 401112435,
    "lines": [
        {
            "formattedSpread": "Virginia Tech -4.5",
            "overUnder": "57.5",
            "provider": "consensus",
            "spread": "4.5"
        },
        {
            "formattedSpread": "Virginia Tech -4.5",
            "overUnder": "57",
            "provider": "Caesars",
            "spread": "4.5"
        },
        {
            "formattedSpread": "Virginia Tech -4.5",
            "overUnder": "58",
            "provider": "numberfire",
            "spread": "4.5"
        },
        {
            "formattedSpread": "Virginia Tech -4.5",
            "overUnder": "56.5",
            "provider": "teamrankings",
            "spread": "4.5"
        }
    ],
    "season": 2019,
    "seasonType": "regular",
    "week": 1
}
]

然后我将使用以下内容加载到熊猫数据框中:

def jstring(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    return text

json_str = jstring(response.json())
df = pd.read_json(json_str)

这将创建一个带有“lines”列的数据框,其中包含 JSON 的整个行部分作为字符串。最终,我想在“provider”=“consensus”的块中使用“spread”值。对于我的目的而言,其他一切都是无关紧要的。我试过用

df = df.explode('lines')

这给了我 4 行,每场比赛都是这样的(如预期的那样):

{'formattedSpread': 'Virginia Tech -4.5', 'overUnder': '57.5', 'provider': 'consensus', 'spread': '4.5'}

这是我卡住的地方。我只想保留 'provider' = 'consensus' 的行,而且我需要在我的分析中将 'spread' 用作单独的变量/列。我尝试过第二次爆炸,df.split,df.replace 将 { 更改为 [ 并作为列表爆炸,都无济于事。任何帮助表示赞赏!

标签: pythonjsonpandas

解决方案


这可能是您正在寻找的 -

编辑:处理特殊情况。

import pandas as pd
import requests

params = {
    "gameId": 401112435,
    "year": 2019,
}

r = requests.get("https://api.collegefootballdata.com/lines", params=params)

df = pd.DataFrame(r.json()) # Create a DataFrame with a lines column that contains JSON
df = df.explode('lines') # Explode the DataFrame so that each line gets its own row
df = df.reset_index(drop=True) # After explosion, the indices are all the same - this resets them so that you can align the DataFrame below cleanly

def fill_na_lines(lines):
    if pd.isna(lines):
        return {k: None for k in ['provider', 'spread', 'formattedSpread', 'overUnder']}
    return lines

df.lines = df.lines.apply(fill_na_lines)

lines_df = pd.DataFrame(df.lines.tolist()) # A separate lines DataFrame created from the lines JSON column
df = pd.concat([df, lines_df], axis=1) # Concatenating the two DataFrames along the vertical axis.

# Now you can filter down to whichever rows you need.
df = df[df.provider == 'consensus']

关于以不同方式加入 DataFrames 的文档可能很有用。


推荐阅读