Parsing through HTML in a dictionary

Problem description

I am trying to extract table data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/

The page has no table tags, but I found that each section of the table sits inside a common container, div class=accord-con.

I built a dictionary whose keys are the graduation years (2019, 2018, etc.) and whose values are the HTML of each div class=accord-con.

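I am stuck on how to parse the HTML stored in the dictionary. My goal is to list the specialty, hospital, and location separately for each year, and I don't know how to move forward.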

Here is my working code:

import numpy as np
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

headers = soup.find_all('div', class_={'accord-head'})
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_={'accord-con'})

data_dict = dict(zip(grad_yr_list, rez_classes))

Here is a sample of my dictionary:

{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
 '2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,

My end goal is to get this data into a pandas DataFrame with the following columns: graduation year, specialty, hospital, location.

Tags: python, beautifulsoup

Solution


I don't know pandas. The code below retrieves the data from the table, though I'm not sure whether it fully meets your needs.

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)
divs = doc.getElementsByClass('accord-head')
datas = {}
for div in divs:
  grad_year = div.h2.text[-4:]  # the last four characters of the heading are the year
  rez_classe = div.getElementByClass('accord-con')
  h4s = rez_classe.h4s  # each <h4> holds one specialty name
  for h4 in h4s:
    if not h4.next:
      continue
    lis = h4.next.lis  # the <li> items of the element that follows the <h4>
    specialty = h4.text
    hospital = [li.text for li in lis]
    # Append rather than assign, so earlier specialties for the same year are not overwritten
    datas.setdefault(grad_year, []).append({'specialty': specialty, 'hospital': hospital})
for data in datas:
  print(data, datas[data])
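
For the pandas DataFrame the question asks for, here is a minimal sketch that stays with BeautifulSoup and works directly on a dictionary shaped like data_dict. It is not part of the original answer. The names sample_html, placements_to_rows and the column names are made up for illustration. It assumes every <h4> is immediately followed by the <ul> listing its hospitals, and that each <li> reads "Hospital, City, State", so the last two comma-separated fields are treated as the location.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed sample in the same shape as the question's data_dict
# (the values may be bs4 Tags or plain HTML strings).
sample_html = {
    "2019": '<div class="accord-con"><h4>Anesthesiology</h4><ul>'
            "<li>University at Buffalo School of Medicine, Buffalo, NY</li></ul>"
            "<h4>Internal Medicine</h4><ul>"
            "<li>Oregon Health and Science University, Portland, OR</li>"
            "<li>St Joseph Hospital, Denver, CO </li></ul><p></p></div>",
}


def placements_to_rows(data_dict):
    """Flatten {year: accord-con HTML} into one dict per hospital entry."""
    rows = []
    for grad_year, block in data_dict.items():
        # str() lets the same code accept a Tag or a raw string.
        soup = BeautifulSoup(str(block), "html.parser")
        for h4 in soup.find_all("h4"):
            nxt = h4.find_next_sibling(True)  # next tag after the heading
            if nxt is None or nxt.name != "ul":
                continue  # heading without a hospital list
            for li in nxt.find_all("li"):
                text = li.get_text(strip=True)
                # Assumption: "Hospital, City, State" -> split off the last two fields.
                parts = [p.strip() for p in text.rsplit(",", 2)]
                rows.append({
                    "grad_year": grad_year,
                    "specialty": h4.get_text(strip=True),
                    "hospital": parts[0],
                    "location": ", ".join(parts[1:]),
                })
    return rows


df = pd.DataFrame(
    placements_to_rows(sample_html),
    columns=["grad_year", "specialty", "hospital", "location"],
)
print(df)
```

With the question's own variables, calling placements_to_rows(data_dict) should work the same way. Entries that do not follow the "Hospital, City, State" pattern would need extra handling.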
