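python - How to get data from specific text in a div class using BeautifulSoup
Problem description
I am just getting started writing a scraper in Python. I want to scrape some text from a page, and I wrote the code below to fetch some specific test data, but it returns nothing.
This is the part of the HTML I want to scrape: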
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active" id="ui-id-94" aria-labelledby="ui-id-93" role="tabpanel" aria-hidden="false" style="display: block; height: 210px;">
<p>
<a href="/programs-courses/catalogue/programs/PBDCIS">Computer and Information Systems (Post-Baccalaureate Diploma)</a>
<a href="/programs-courses/catalogue/programs/DPCSTI">Computing Studies and Information Systems (Diploma)</a>
<a href="/programs-courses/catalogue/programs/PDDATA">Data Analytics (Post-Degree Diploma)</a>
<a href="/programs-courses/catalogue/programs/ACTCSI_DA">Data and Analytics</a>
<a href="/programs-courses/catalogue/programs/PDEMTC">Emerging Technology (Post-Degree Diploma)</a>
<a href="/programs-courses/catalogue/programs/PDICT">Information and Communication Technology (Post-Degree Diploma) </a>
<a href="/programs-courses/catalogue/programs/ACTCSI_WEB">Web and Mobile Computing</a>
</p>
I want to get the program names. I wrote the following code, but it returns an empty list.
from bs4 import BeautifulSoup
import requests
import os
import re
import sys
URL = "https://www.douglascollege.ca/programs-courses/catalogue/programs"
r = requests.get(URL, headers = self.requestHeaders())
soup = BeautifulSoup(r.text, "html.parser")
test = soup.find_all("a", class_='ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active')
print(test)
What is the problem?
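Solution
First problem: this page uses JavaScript, and requests and BeautifulSoup cannot run JavaScript. You may need Selenium to control a web browser that can run JavaScript. The browser gives you the complete rendered HTML, which you can then search with Selenium itself or with BeautifulSoup.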
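A quick way to check for the first problem is to count how many matching panels appear in the HTML you actually received. The sketch below is only an illustration: `count_panels` is a hypothetical helper, and the two HTML strings are made-up stand-ins for what `requests` and a browser might return, not the real page.

```python
from bs4 import BeautifulSoup


def count_panels(html, css_class="ui-accordion-content"):
    """Return how many <div> elements in `html` carry `css_class`."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=css_class))


# Made-up stand-in for the raw server HTML: the class is not there yet
# (assumption: a script adds it later, inside the browser).
raw_html = '<div id="programs"><p><a href="/x">Data Analytics</a></p></div>'

# Made-up stand-in for the same markup after the script has run.
rendered_html = (
    '<div id="programs" class="ui-accordion-content ui-widget-content">'
    '<p><a href="/x">Data Analytics</a></p></div>'
)

raw_count = count_panels(raw_html)
rendered_count = count_panels(rendered_html)
print(raw_count, rendered_count)
```

If `count_panels(r.text)` is 0 for the response from `requests` while the classes are visible in the browser's inspector, the markup is most likely being built by JavaScript.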
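Second problem: the classes in your search belong to the div, not to the a elements, so searching for a with these classes finds nothing. You have to search for the div using these classes, and then search inside that div for a without these classes.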
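The second problem can be reproduced without a browser. The following is a minimal sketch on a shortened copy of the HTML from the question (with a closing `</div>` added so that it parses cleanly). The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (closing </div> added).
html = """
<div class="ui-accordion-content ui-helper-reset ui-widget-content" id="ui-id-94">
<p>
<a href="/programs-courses/catalogue/programs/PDDATA">Data Analytics (Post-Degree Diploma)</a>
<a href="/programs-courses/catalogue/programs/ACTCSI_WEB">Web and Mobile Computing</a>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching <a> by the div's class matches nothing: the links have no class.
wrong = soup.find_all("a", class_="ui-accordion-content")

# Search for the div by one of its classes, then for the links inside it.
names = []
for div in soup.find_all("div", class_="ui-accordion-content"):
    for a in div.find_all("a"):
        names.append(a.get_text(strip=True))

# The same search as a CSS selector: every <a> inside such a div.
names_css = [a.get_text(strip=True)
             for a in soup.select("div.ui-accordion-content a")]

print(wrong)
print(names)
```

Passing a single class name to `class_` matches any element that has that class among others, so there is no need to repeat the whole class list.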
BTW: to control the browser you will also need the driver for Firefox or Chrome.
Code:
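import selenium.webdriver
from bs4 import BeautifulSoup

url = "https://www.douglascollege.ca/programs-courses/catalogue/programs"

# Selenium drives a real browser, so the page's JavaScript runs
driver = selenium.webdriver.Firefox()
driver.get(url)

# parse the rendered HTML instead of the raw server response
soup = BeautifulSoup(driver.page_source, "html.parser")

# the classes belong to the div, so search for the div first...
all_div = soup.find_all("div", class_='ui-accordion-content')
for div in all_div:
    # ...then search inside it for <a> without any class filter
    all_items = div.find_all("a")
    for item in all_items:
        print(item.text)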
Partial result:
Basic Occupational Education - Electronics and General Assembly
Basic Occupational Education - Food Services
Basic Occupational Education - Retail and Business Services
Child and Youth Care (Bachelor of Arts)
Child and Youth Care (Diploma)
Classroom and Community Support (Certificate)
Classroom and Community Support (Diploma)
Education Assistance and Inclusion (Certificate)
Early Childhood Education (Certificate)
Early Childhood Education (Diploma)
Early Childhood Education: Infant/Toddler (Post-Basic Certificate)
Early Childhood Education: Special Needs - Inclusive Practices (Post-Basic Certificate)
Employment Supports Specialty
Therapeutic Recreation (Bachelor)
Therapeutic Recreation (Diploma)
Accounting (Bachelor of Business Administration)
Accounting (Certificate)
EDIT: The same without BeautifulSoup, using only Selenium:
import selenium.webdriver
from selenium.webdriver.common.by import By

url = "https://www.douglascollege.ca/programs-courses/catalogue/programs"
driver = selenium.webdriver.Firefox()
driver.get(url)

# Selenium 4 removed the find_elements_by_* helpers; use find_elements(By..., ...)
all_div = driver.find_elements(By.XPATH, '//div[contains(@class, "ui-accordion-content")]')
for div in all_div:
    all_items = div.find_elements(By.TAG_NAME, "a")
    for item in all_items:
        # textContent also returns the text of hidden elements
        print(item.get_attribute('textContent'))
        #print(item.text) # doesn't work for hidden element