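python - How do I extract and count values from nested JSON?
Problem description
I am trying to loop through a list of JSON URLs and extract some information from the dictionary of dictionaries that each one returns. About 99% of the time, the third level of each JSON dictionary contains 5 'name' values, 2 of which are XML file names. However, the files do not always appear in the same order, and in a few cases there is only one XML file.
Before the code enters the second loop, I built a loop that counts the XML files using a search string. This ensures that the xml_dict I create on each iteration has the correct number of values (2).
The "pre-counter" works, but it noticeably slows execution. Is there a better way to incorporate the XML counter to improve performance? Also, I don't know whether I need the 'else: continue'.
Sample JSON link: https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json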
import requests

# all_forms is a pandas DataFrame built earlier (not shown in the question)
json_list = [all_forms['Link'][x] for x in all_forms.index if all_forms['Form Type'][x] == '13F-HR']
link_list = []
lcounter = 0
for json in json_list:
    decode = requests.get(json).json()
    xml_dict = {}
    xml_count = 0
    # first loop: count the xml files in this filing
    for dic in decode['directory']['item'][0:]:
        for v in dic.values():
            if ".xml" in v.lower():
                xml_count += 1
            else:
                continue
    # second loop: build the links to the xml files
    for dic in decode['directory']['item'][0:]:
        if "primary_doc.xml" in dic['name'] and xml_count > 1:
            xml_dict['doc_xml'] = json.replace('index.json', '') + dic['name']
        elif ".xml" in dic['name'].lower() and "primary_doc" not in dic['name']:
            xml_dict['hold_xml'] = json.replace('index.json', '') + dic['name']
        else:
            continue
    if xml_dict:
        link_list.append(xml_dict)
    lcounter += 1
    if lcounter % 100 == 0:
        print("Processed {} forms".format(lcounter))
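Solution
- I think pandas vectorized functions make this easier and faster. The code below gets all the counts in 5 lines, and it is fast.
- Once the XML file counts and the paths to all the .xml files are available, consider looking at "How to convert an XML file to nice pandas dataframe?" to process the files automatically.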
import pandas as pd
# list of index.json URLs for the Archives
paths = ['https://www.sec.gov/Archives/edgar/data/1736260/000119312515118890/index.json',
         'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json',
         'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/index.json']
# download each json and join them into a single dataframe
# reset the index, so each row has a unique index number
df = pd.concat([pd.read_json(path, orient='index') for path in paths]).reset_index()
# item is a list of dictionaries that can be exploded to separate columns
dfe = df.explode('item').reset_index(drop=True)
# each dictionary now has a separate row
# normalize the dicts, so each key is a column name and each value is in the row
# rename 'name' to 'item_name', this is the column containing file names like .xml
# join this back to the main dataframe and drop the item column
dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])
# find the rows with .xml in item_name
# groupby name, which is the archive path with CIK and Accession Number
# count the number of xml files
# regex=False matches '.xml' literally instead of treating '.' as a regex wildcard
dfg = dfj.item_name[dfj.item_name.str.contains('.xml', case=False, regex=False)].groupby(dfj.name).count().reset_index().rename(columns={'item_name': 'xml_count'})
# display(dfg)
name xml_count
0 /Archives/edgar/data/1736260/000173626020000004 2
1 /Archives/edgar/data/51143/000104746917001061 6
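For comparison, the pre-counter from the question can also be folded into the main loop without pandas. The following is only a sketch, not the code from the question: `collect_xml_links` and `sample_items` are hypothetical names, `example.txt` is a made-up entry, and the rule "keep primary_doc.xml only when there is more than one XML file" is copied from the original loop. It uses `endswith('.xml')` rather than a substring test, which is an assumption about what the filter is meant to catch. It also shows that the `else: continue` branches are not needed, since a loop moves on to the next item anyway.

```python
def collect_xml_links(index_url, items):
    """Return {'doc_xml': ..., 'hold_xml': ...} for one filing ({} if nothing qualifies)."""
    base = index_url.replace('index.json', '')
    # Single pass over the directory listing: keep only names ending in .xml.
    xml_names = [d['name'] for d in items if d['name'].lower().endswith('.xml')]
    links = {}
    for name in xml_names:
        if 'primary_doc' in name:
            # Same rule as the question: primary_doc.xml is kept only
            # when the filing contains more than one XML file.
            if len(xml_names) > 1:
                links['doc_xml'] = base + name
        else:
            links['hold_xml'] = base + name
    return links


# Hypothetical stand-in for decode['directory']['item'] of one filing.
sample_items = [
    {'name': 'cpia2ndqtr202013fhr.xml', 'type': 'text.gif'},
    {'name': 'primary_doc.xml', 'type': 'text.gif'},
    {'name': 'example.txt', 'type': 'text.gif'},  # made-up non-XML entry
]
url = 'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json'
links = collect_xml_links(url, sample_items)
print(links)
```

In the original loop this would be called as `collect_xml_links(json, decode['directory']['item'])`, so the number of XML files comes from `len(xml_names)` and no separate counting loop is required.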
- Print a dataframe with all the xml file names, along with their corresponding index in the dataframe
print(dfj[['name', 'item_name']][dfj.item_name.str.contains('.xml', case=False, regex=False)].reset_index())
[out]:
index name item_name
0 43 /Archives/edgar/data/1736260/000173626020000004 cpia2ndqtr202013fhr.xml
1 44 /Archives/edgar/data/1736260/000173626020000004 primary_doc.xml
2 66 /Archives/edgar/data/51143/000104746917001061 FilingSummary.xml
3 74 /Archives/edgar/data/51143/000104746917001061 ibm-20161231.xml
4 76 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_cal.xml
5 77 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_def.xml
6 78 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_lab.xml
7 79 /Archives/edgar/data/51143/000104746917001061 ibm-20161231_pre.xml
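One detail about the `.xml` filter: `Series.str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so an unescaped `.` matches any character, and the match is case-sensitive unless `case=False` is given. The sketch below reproduces the same matching rules with the standard `re` module. The names in `names` are made up for illustration (only `primary_doc.xml` appears in the output above).

```python
import re

# Made-up file names: a normal one, an upper-case extension,
# and one where 'xml' is preceded by a character other than a dot.
names = ['primary_doc.xml', 'IBM-20161231.XML', 'notes_xml_summary.txt']

# Unescaped dot: '.' matches any character, so '_xml' also counts as a hit.
loose = [n for n in names if re.search('.xml', n, flags=re.IGNORECASE)]

# Escaped dot anchored at the end: only true .xml extensions match.
strict = [n for n in names if re.search(r'\.xml$', n, flags=re.IGNORECASE)]

print(loose)
print(strict)
```

With pandas, the strict variant corresponds to passing the pattern `r'\.xml$'` with `case=False`, or to using `Series.str.endswith` on a lower-cased column.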
- Create a dataframe with only the xml files, and add a column containing the full path to each file
xml_files = dfj[dfj.item_name.str.contains('.xml', case=False, regex=False)].copy()
# add a column with the full path to each xml file (vectorized string concatenation)
xml_files['file_path'] = 'https://www.sec.gov' + xml_files['name'] + '/' + xml_files['item_name']
# display(xml_files)
index name parent-dir last-modified item_name type size file_path
43 directory /Archives/edgar/data/1736260/000173626020000004 /Archives/edgar/data/1736260 2020-07-24 09:38:30 cpia2ndqtr202013fhr.xml text.gif 72804 https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml
44 directory /Archives/edgar/data/1736260/000173626020000004 /Archives/edgar/data/1736260 2020-07-24 09:38:30 primary_doc.xml text.gif 1931 https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml
66 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 FilingSummary.xml text.gif 91940 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml
74 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231.xml text.gif 11684003 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml
76 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_cal.xml text.gif 185502 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml
77 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_def.xml text.gif 801568 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml
78 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_lab.xml text.gif 1356108 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml
79 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_pre.xml text.gif 1314064 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml
# create a list of just the file paths
path_to_xml_files = xml_files.file_path.tolist()
print(path_to_xml_files)
[out]:
['https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml',
'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml']
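Once the file paths are available, each XML document still has to be turned into rows before it can become a dataframe. A minimal sketch of that step with the standard library is shown here. `SAMPLE_XML`, its tag names, and the helper `xml_records` are all hypothetical and do not reflect the real schema of the SEC files. Real filings may also use XML namespaces, in which case ElementTree reports tags with a `{namespace}` prefix that has to be included in `record_tag`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are invented for illustration.
SAMPLE_XML = """<holdings>
  <holding><issuer>ACME CORP</issuer><value>1200</value></holding>
  <holding><issuer>EXAMPLE INC</issuer><value>340</value></holding>
</holdings>"""


def xml_records(xml_text, record_tag):
    """Return one dict per <record_tag> element, mapping child tag -> child text."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(record_tag):
        # Only direct children are read; nested elements would need extra handling.
        records.append({child.tag: child.text for child in rec})
    return records


rows = xml_records(SAMPLE_XML, 'holding')
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`. All values arrive as strings, so numeric columns would still need a conversion such as `pd.to_numeric`.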