python - Scraping data from specific fields on ClinicalTrials.Gov
Problem description
I wrote a function that, given an NCTID (i.e. a ClinicalTrials.Gov ID), scrapes data from ClinicalTrials.Gov:
def clinicalTrialsGov(nctid):
    data = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms']
    tag_matches = data.find_all(subset)
I then do the following:
    tag_dict = dict((str('ct' + tag_matches[i].name.capitalize()), tag_matches[i].text) for i in range(0, len(tag_matches)))
    for key in tag_dict:
        print(key + ': ' + tag_dict[key])
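This converts the data into a dictionary. However, when a trial has multiple intervention types (e.g. NCT02170532), it keeps only one of them. How can I adjust this code so that, when a field has multiple values, they are listed as a comma-separated list?
Current output: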
ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome:
Change in Maximum Forced Expiratory Volume at One Second (FEV1)
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctSecondary_outcome:
Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea Scale
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No
Desired output:
ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome:
Change in Maximum Forced Expiratory Volume at One Second (FEV1)
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctSecondary_outcome:
Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea Scale
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug, Drug, Other, Device, Device, Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No
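How can I adjust the code so that it scrapes all of the intervention types?
Solution
Your code fails because it overwrites the previous value for a given dictionary key. Instead, you need to append to the existing entry.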
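The difference between overwriting and appending can be seen without any network access. The following is a minimal sketch, assuming a hypothetical list of (tag name, text) pairs in place of the parsed XML tags; the names `pairs`, `overwritten` and `collected` are illustrative only:

```python
from collections import defaultdict

# Hypothetical (tag name, text) pairs standing in for parsed XML tags.
pairs = [
    ("intervention_type", "Drug"),
    ("intervention_type", "Device"),
    ("condition", "Asthma"),
]

# Building a plain dict keeps only the last value seen for each key.
overwritten = dict(pairs)

# Appending to a defaultdict(list) keeps every value for each key.
collected = defaultdict(list)
for name, text in pairs:
    collected[name].append(text)

print(overwritten["intervention_type"])           # Device
print(", ".join(collected["intervention_type"]))  # Drug, Device
```
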
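You can use Python's defaultdict() to automatically create a list for each key. If a field has multiple entries, each one is appended to that key's list. When printing, the list can then be joined back together with a separator if needed: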
from collections import defaultdict

import requests
from bs4 import BeautifulSoup


def clinicalTrialsGov(nctid):
    # Each key starts out as an empty list, so repeated tags accumulate.
    data = defaultdict(list)
    soup = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms']

    for tag in soup.find_all(subset):
        data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))

    # Join the values collected for each field into one comma-separated string.
    for key in data:
        print('{}: {}'.format(key, ', '.join(data[key])))


clinicalTrialsGov('NCT02170532')
This displays the following:
ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: Change in Maximum Forced Expiratory Volume at One Second (FEV1)Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctSecondary_outcome: Change in 8 Hour Area-under-the-curve FEV10 to 8 hours post dose, Change in Heart RateBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment, Change in Tremor Assessment Measured by a ScaleBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatmentTremor assessment will be made on outstretched hands (0 = none, 1+ = fine tremor, barely perceptible, 2+ = obvious tremor)., Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea ScaleBaseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment
ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug, Drug, Other, Device, Device, Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No
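Note that in the output above the nested parts of each outcome run together (for example "(FEV1)Baseline"), because get_text(strip=True) joins the pieces with an empty string. With BeautifulSoup, passing a separator such as tag.get_text('; ', strip=True) should keep them apart. The sketch below shows the same grouping idea offline. It is not the code from the answer: it swaps in the standard library's xml.etree.ElementTree so that it runs without a network request, and the XML fragment, the `collect_fields` helper and the "; " separator are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML fragment; a real study record has many more fields.
SAMPLE_XML = """<clinical_study>
  <primary_outcome>
    <measure>Change in FEV1</measure>
    <time_frame>Baseline, 30 minutes</time_frame>
  </primary_outcome>
  <intervention><intervention_type>Drug</intervention_type></intervention>
  <intervention><intervention_type>Device</intervention_type></intervention>
</clinical_study>"""


def collect_fields(xml_text, wanted):
    """Group the text of every wanted tag under its tag name."""
    grouped = defaultdict(list)
    for element in ET.fromstring(xml_text).iter():
        if element.tag in wanted:
            # Join nested text pieces with "; " instead of running them together.
            parts = [t.strip() for t in element.itertext() if t.strip()]
            grouped[element.tag].append("; ".join(parts))
    return grouped


fields = collect_fields(SAMPLE_XML, {"primary_outcome", "intervention_type"})
for name, values in fields.items():
    print("{}: {}".format(name, ", ".join(values)))
```

Using a separator other than ", " inside a single entry keeps it distinguishable from the comma that joins multiple entries for the same field.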