首页 > 解决方案 > 将数据框中的字符串转换为单独的列,然后将它们添加回主数据框

问题描述

我正在编写一个从 API 获取数据的 ETL 脚本。我的目标是将最终数据集加载到 MySQL 数据库中。下面是从邮递员的 API 端点 JSON 响应图像获取的 JSON 的快照

用于测试的 JSON 数据

{
    'count': 9840,
    'rows': [{
                'id': 1,
                'closed': 1,
                'closedAt': '2019-09-18T18:39:33Z',
                'disposition': 'unmanaged',
                'formData': '{"body":"name: kamaro\\nemail: testing@testing.com\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xxx.xx.xx","mapKey":"email","from":"testing@testing.com","firstName":"testing@testing.com","ListId":1,"email":"testing@testing.com"}',
                'createdAt': '2019-09-18T18:39:33Z'
            }, {
                'id': 2,
                'closed': 1,
                'closedAt': '2019-09-23T13:10:44Z',
                'disposition': 'unmanaged',
                'formData': '{"body":"name: KAMAZA\\nemail: 333xxxxxl@sdfa.com\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xxx.xx.xx","mapKey":"email","from":"Lmbaxxxxx.com","firstName":"yyyyyyl@sdfa.com","ListId":1,"email":"yyyyyyl@sdfa.com"}',
                'createdAt': '2019-09-23T13:05:44Z'
            }, {
                'id': 3,
                'closed': 1,
                'closedAt': '2019-10-11T11:23:56Z',
                'disposition': None,
                'formData': '{"body":"name: customer name\\nemail: yyyyyy@gmail\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xx.13.79","mapKey":"email","from":"yyyyyy@gmail"}',
                'createdAt': '2019-10-11T10:55:19Z'
            }, {
                'id': 4,
                'closed': 1,
                'closedAt': '2019-10-11T11:59:12Z',
                'disposition': 'timeout',
                'formData': '{"body":"name: Pamphtest\\nemail: pamphilexxxx.co\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xx.143.94","mapKey":"email","from":"pamphilexxxx.co","firstName":"pamphilexxxx.co","ListId":1,"email":"pamphilexxxx.co"}',
                'createdAt': '2019-10-11T11:27:32Z'
            }, {
                'id': 5,
                'closed': 1,
                'closedAt': '2019-10-11T14:06:12Z',
                'disposition': 'timeout',
                'formData': '{"body":"name: Pamphile\\nemail: Pamphile@gama.com\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xx.13.79","mapKey":"email","from":"Pamphile@gama.com","firstName":"Pamphile@gama.com","ListId":1,"email":"Pamphile@gama.com"}',
                'createdAt': '2019-10-11T13:35:11Z'
            },
            {
            "id": 9806,
            "closed": 1,
            "closedAt": "2021-03-18T13:08:10Z",
            "formData": "{\"body\":\"New chat request \\nEmail Address: daudiXXXXXC@gmail.com\\nWhat is your gender?: Female\\nWhat is your age range?: 20-24 years\\nDo you use contraceptives?: No\\nWhat is your location?: Nairobi\\nTo use this service: true\\n\",\"referer\":\"https://xxxxxxxxxx.com\",\"customerIp\":\"xxxxxxxxxx\",\"mapKey\":\"email\",\"from\":\"daudiXXXXXC@gmail.com\",\"email\":\"daudiXXXXXC@gmail.com\",\"cf_1\":\"Female\",\"cf_2\":\"20-24 years\",\"cf_13\":\"No\",\"cf_14\":\"Nairobi\",\"cf_15\":true,\"firstName\":\"daudiXXXXXC@gmail.com\",\"ListId\":1}",
            "createdAt": "2021-03-18T13:03:09Z"
        }

我已经设法使用下面的 python 脚本将 JSON 转换为数据框

import requests
from requests.auth import HTTPBasicAuth
import pandas as pd

my_response = requests.get('https://mydomain.io/api/xxx/interactions?fields=formData,closedAt,id,closed,'
                                  'disposition,createdAt,updateAT',auth = HTTPBasicAuth('myname', 'mypassword'))
my_data = dict(my_response.json())
my_dataframe = pd.DataFrame(my_data['rows'])

我面临的挑战是将formData中的键转换为列,并使这些新列成为主数据框的一部分。请看两张照片,显示我当前的结果和我想要的结果 我当前的数据框

我想要的数据框

标签: pythonjsondataframeapietl

解决方案


您有一个 json,其中相关部​​分是字典列表。在那之前没有什么难的。特殊性是这些 dicts 包含一个必须解码的 json 字符串。然后update足以将内部 dict 键和值带到包含键的级别。完成此操作后,您可以从字典列表中构建数据框。

代码可以是:

my_response = requests.get(
    'https://mydomain.io/api/xxx/interactions?fields=formData,closedAt,id,closed,'
    'disposition,createdAt,updateAT',auth = HTTPBasicAuth('myname', 'mypassword'))
my_data = my_response.json()

my_data = my_data['rows']           # only process the rows part
for d in my_data:                   # but extract the inner json strings for all dicts
    d.update(json.loads(d['formData']))

# build the dataframe
df = pd.DataFrame(my_data, columns=['id', 'closed', 'closedAt', 'email', 'cf_1'
                                   'cf_2', 'cf_14', 'cf_13', 'createdAt'])

使用显示的数据,它给出:

     id  closed              closedAt                  email  cf_1cf_2    cf_14 cf_13             createdAt
0     1       1  2019-09-18T18:39:33Z    testing@testing.com       NaN      NaN   NaN  2019-09-18T18:39:33Z
1     2       1  2019-09-23T13:10:44Z       yyyyyyl@sdfa.com       NaN      NaN   NaN  2019-09-23T13:05:44Z
2     3       1  2019-10-11T11:23:56Z                    NaN       NaN      NaN   NaN  2019-10-11T10:55:19Z
3     4       1  2019-10-11T11:59:12Z        pamphilexxxx.co       NaN      NaN   NaN  2019-10-11T11:27:32Z
4     5       1  2019-10-11T14:06:12Z      Pamphile@gama.com       NaN      NaN   NaN  2019-10-11T13:35:11Z
5  9806       1  2021-03-18T13:08:10Z  daudiXXXXXC@gmail.com       NaN  Nairobi    No  2021-03-18T13:03:09Z

如您所见,通过给出明确的列列表,很容易只从数据框中选择一些键。如果您的真实数据集中还有其他有趣的列,请调整该部分。


推荐阅读