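python - Parsing through HTML in a dictionary
Problem description
I am trying to extract table data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/
There are no table tags, but I found that each section of the table sits under a common tag, div class=accord-con.
I built a dictionary whose keys are the graduation years (2019, 2018, etc.) and whose values are the HTML inside each div class=accord-con.
I am stuck on how to parse the HTML held in the dictionary. My goal is to list the specialty, hospital, and location separately for each year, and I don't know how to move forward.
Here is my working code: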
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

# Download the page and parse it
sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = BeautifulSoup(sauce, 'lxml')

# Each accord-head div holds a heading whose last four characters are the graduation year
headers = soup.find_all('div', class_='accord-head')
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

# Each accord-con div holds that year's specialties and hospitals
rez_classes = soup.find_all('div', class_='accord-con')
data_dict = dict(zip(grad_yr_list, rez_classes))
Here is a sample of my dictionary:
{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
'2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,
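My end goal is to get this data into a pandas DataFrame with the following columns: Grad Year, Specialty, Hospital, Location
Solution
I don't know pandas. The code below gets the data out of the table; I'm not sure whether it fully meets your needs.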
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)
divs = doc.getElementsByClass('accord-head')
datas = {}
for div in divs:
    grad_year = div.h2.text[-4:]
    rez_classe = div.getElementByClass('accord-con')
    h4s = rez_classe.h4s  # all <h4> specialty headings
    for h4 in h4s:
        if not h4.next:
            continue
        lis = h4.next.lis  # <li> items in the list that follows the heading
        specialty = h4.text
        hospital = [li.text for li in lis]
        # Append rather than assign, so earlier specialties for the same year are kept
        datas.setdefault(grad_year, []).append({'specialty': specialty, 'hospital': hospital})
for data in datas:
    print(data, datas[data])
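If you would rather stay with BeautifulSoup and end up with the DataFrame described in the question, something along these lines may work. It is only a sketch: `placements_to_rows` is a hypothetical helper name, the inline `SAMPLE_HTML` is an abridged stand-in for the real page (the "Class of ..." heading text is a guess), and it assumes every list item has the form "Hospital, City, State" with no comma inside the hospital name.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Abridged stand-in for the page; the heading wording is an assumption.
SAMPLE_HTML = """
<div class="accord-head"><h2>Class of 2019</h2></div>
<div class="accord-con">
<h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul>
<h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul>
<p></p></div>
<div class="accord-head"><h2>Class of 2018</h2></div>
<div class="accord-con">
<h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li></ul>
</div>
"""


def placements_to_rows(data_dict):
    """Flatten {year: <div class="accord-con"> Tag} into one dict per placement."""
    rows = []
    for grad_year, block in data_dict.items():
        for h4 in block.find_all("h4"):
            # The hospitals for a specialty are in the <ul> that follows its <h4>.
            ul = h4.find_next_sibling("ul")
            if ul is None:
                continue
            for li in ul.find_all("li"):
                text = li.get_text(strip=True)
                # Assumption: everything before the first ", " is the hospital name.
                hospital, _, location = text.partition(", ")
                rows.append({
                    "Grad Year": grad_year,
                    "Specialty": h4.get_text(strip=True),
                    "Hospital": hospital,
                    "Location": location,
                })
    return rows


# Build the same year -> div dictionary as in the question, then flatten it.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
years = [h.h2.get_text(strip=True)[-4:]
         for h in soup.find_all("div", class_="accord-head")]
blocks = soup.find_all("div", class_="accord-con")
data_dict = dict(zip(years, blocks))

df = pd.DataFrame(placements_to_rows(data_dict),
                  columns=["Grad Year", "Specialty", "Hospital", "Location"])
print(df)
```

Because the dictionary values are still BeautifulSoup Tag objects, they can be searched directly without re-parsing. Splitting on the first comma is the fragile part; if a hospital name ever contains a comma, the split rule would need adjusting.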