python - 将数据框中的字符串转换为单独的列,然后将它们添加回主数据框
问题描述
我正在编写一个从 API 获取数据的 ETL 脚本。我的目标是将最终数据集加载到 MySQL 数据库中。下面是从邮递员的 API 端点 JSON 响应图像获取的 JSON 的快照
用于测试的 JSON 数据
{
'count': 9840,
'rows': [{
'id': 1,
'closed': 1,
'closedAt': '2019-09-18T18:39:33Z',
'disposition': 'unmanaged',
'formData': '{"body":"name: kamaro\\nemail: testing@testing.com\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xxx.xx.xx","mapKey":"email","from":"testing@testing.com","firstName":"testing@testing.com","ListId":1,"email":"testing@testing.com"}',
'createdAt': '2019-09-18T18:39:33Z'
}, {
'id': 2,
'closed': 1,
'closedAt': '2019-09-23T13:10:44Z',
'disposition': 'unmanaged',
'formData': '{"body":"name: KAMAZA\\nemail: 333xxxxxl@sdfa.com\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xxx.xx.xx","mapKey":"email","from":"Lmbaxxxxx.com","firstName":"yyyyyyl@sdfa.com","ListId":1,"email":"yyyyyyl@sdfa.com"}',
'createdAt': '2019-09-23T13:05:44Z'
}, {
'id': 3,
'closed': 1,
'closedAt': '2019-10-11T11:23:56Z',
'disposition': None,
'formData': '{"body":"name: customer name\\nemail: yyyyyy@gmail\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xx.13.79","mapKey":"email","from":"yyyyyy@gmail"}',
'createdAt': '2019-10-11T10:55:19Z'
}, {
'id': 4,
'closed': 1,
'closedAt': '2019-10-11T11:59:12Z',
'disposition': 'timeout',
'formData': '{"body":"name: Pamphtest\\nemail: pamphilexxxx.co\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xx.143.94","mapKey":"email","from":"pamphilexxxx.co","firstName":"pamphilexxxx.co","ListId":1,"email":"pamphilexxxx.co"}',
'createdAt': '2019-10-11T11:27:32Z'
}, {
'id': 5,
'closed': 1,
'closedAt': '2019-10-11T14:06:12Z',
'disposition': 'timeout',
'formData': '{"body":"name: Pamphile\\nemail: Pamphile@gama.com\\n","referer":"https://staging.https://xxxxxxxxxx.com/","customerIp":"xxx.xx.13.79","mapKey":"email","from":"Pamphile@gama.com","firstName":"Pamphile@gama.com","ListId":1,"email":"Pamphile@gama.com"}',
'createdAt': '2019-10-11T13:35:11Z'
},
{
"id": 9806,
"closed": 1,
"closedAt": "2021-03-18T13:08:10Z",
"formData": "{\"body\":\"New chat request \\nEmail Address: daudiXXXXXC@gmail.com\\nWhat is your gender?: Female\\nWhat is your age range?: 20-24 years\\nDo you use contraceptives?: No\\nWhat is your location?: Nairobi\\nTo use this service: true\\n\",\"referer\":\"https://xxxxxxxxxx.com\",\"customerIp\":\"xxxxxxxxxx\",\"mapKey\":\"email\",\"from\":\"daudiXXXXXC@gmail.com\",\"email\":\"daudiXXXXXC@gmail.com\",\"cf_1\":\"Female\",\"cf_2\":\"20-24 years\",\"cf_13\":\"No\",\"cf_14\":\"Nairobi\",\"cf_15\":true,\"firstName\":\"daudiXXXXXC@gmail.com\",\"ListId\":1}",
"createdAt": "2021-03-18T13:03:09Z"
}
我已经设法使用下面的 python 脚本将 JSON 转换为数据框
import requests
from requests.auth import HTTPBasicAuth
import pandas as pd
my_response = requests.get('https://mydomain.io/api/xxx/interactions?fields=formData,closedAt,id,closed,'
'disposition,createdAt,updateAT',auth = HTTPBasicAuth('myname', 'mypassword'))
my_data = dict(my_response.json())
my_dataframe = pd.DataFrame(my_data['rows'])
我面临的挑战是将formData中的键转换为列,并使这些新列成为主数据框的一部分。请看两张照片,显示我当前的结果和我想要的结果 我当前的数据框
解决方案
您有一个 json,其中相关部分是字典列表。在那之前没有什么难的。特殊性是这些 dicts 包含一个必须解码的 json 字符串。然后update
足以将内部 dict 键和值带到包含键的级别。完成此操作后,您可以从字典列表中构建数据框。
代码可以是:
my_response = requests.get(
'https://mydomain.io/api/xxx/interactions?fields=formData,closedAt,id,closed,'
'disposition,createdAt,updateAT',auth = HTTPBasicAuth('myname', 'mypassword'))
my_data = my_response.json()
my_data = my_data['rows'] # only process the rows part
for d in my_data: # but extract the inner json strings for all dicts
d.update(json.loads(d['formData']))
# build the dataframe
df = pd.DataFrame(my_data, columns=['id', 'closed', 'closedAt', 'email', 'cf_1'
'cf_2', 'cf_14', 'cf_13', 'createdAt'])
使用显示的数据,它给出:
id closed closedAt email cf_1cf_2 cf_14 cf_13 createdAt
0 1 1 2019-09-18T18:39:33Z testing@testing.com NaN NaN NaN 2019-09-18T18:39:33Z
1 2 1 2019-09-23T13:10:44Z yyyyyyl@sdfa.com NaN NaN NaN 2019-09-23T13:05:44Z
2 3 1 2019-10-11T11:23:56Z NaN NaN NaN NaN 2019-10-11T10:55:19Z
3 4 1 2019-10-11T11:59:12Z pamphilexxxx.co NaN NaN NaN 2019-10-11T11:27:32Z
4 5 1 2019-10-11T14:06:12Z Pamphile@gama.com NaN NaN NaN 2019-10-11T13:35:11Z
5 9806 1 2021-03-18T13:08:10Z daudiXXXXXC@gmail.com NaN Nairobi No 2021-03-18T13:03:09Z
如您所见,通过给出明确的列列表,很容易只从数据框中选择一些键。如果您的真实数据集中还有其他有趣的列,请调整该部分。
推荐阅读
- java - 尝试更新 jdk 后,Java 主页似乎无效,现在仅指向已卸载的 java 可执行文件并且不会更改
- javascript - 使用 AJAX 请求的用户输入查询不会更新 history.pushState,如果不重新加载页面就无法解析
- python - Urlopen() 代理关键字不存在?
- batch-file - 使用 forfiles 时如何排除具有特定条件的文件名
- javascript - 用户控件中包含的 jqte textarea 的触发更改事件,iframe 的一部分
- html - 调整平移和旋转元素的大小时如何计算顶部和左侧
- list - Ansible 列表以 dict 和递增每个元素的值
- wpf - WPF:仅当列表框中的一项被选中时才启用按钮
- ionic-framework - 为什么加载 Ion-Spinner 会在页面底部呈现
- ios - 文件“ColorAttributeTransformer.swift”无法解锁