How to get data from specific text inside a div class using BeautifulSoup

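Problem description

I have just started developing a scraper in Python. I want to scrape some text from a page, and I wrote the code below to get specific test data, but it returns nothing.

This is the part of the HTML I want to scrape: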

<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active" id="ui-id-94" aria-labelledby="ui-id-93" role="tabpanel" aria-hidden="false" style="display: block; height: 210px;">
<p>
    <a href="/programs-courses/catalogue/programs/PBDCIS">Computer and Information Systems (Post-Baccalaureate Diploma)</a>
    <a href="/programs-courses/catalogue/programs/DPCSTI">Computing Studies and Information Systems (Diploma)</a>
    <a href="/programs-courses/catalogue/programs/PDDATA">Data Analytics (Post-Degree Diploma)</a>
    <a href="/programs-courses/catalogue/programs/ACTCSI_DA">Data and Analytics</a>
    <a href="/programs-courses/catalogue/programs/PDEMTC">Emerging Technology (Post-Degree Diploma)</a>
    <a href="/programs-courses/catalogue/programs/PDICT">Information and Communication Technology (Post-Degree Diploma) </a>
    <a href="/programs-courses/catalogue/programs/ACTCSI_WEB">Web and Mobile Computing</a>
</p>

I want to get the program names. I wrote the following code, but it returns an empty list.

from bs4 import BeautifulSoup
import requests

URL = "https://www.douglascollege.ca/programs-courses/catalogue/programs"

# Placeholder for the asker's own self.requestHeaders() helper, which is not shown.
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(URL, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")

# Looks for <a> tags that carry the accordion classes themselves.
test = soup.find_all("a", class_='ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active')

print(test)

What is the problem?

Tags: python, web-scraping, beautifulsoup

Solution


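First problem: this page uses JavaScript, and neither requests nor BeautifulSoup can run JavaScript. You may need Selenium to control a real web browser, which can run JavaScript. It gives you the full rendered HTML, which you can then search with Selenium itself or with BeautifulSoup.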
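One way to confirm this before reaching for a browser is to check whether the HTML that requests receives already contains the elements you are after. The sketch below is only an illustration: the helper name has_elements and both sample HTML strings are made up for this example, and it assumes the accordion classes are added by a script after the page loads.

```python
from bs4 import BeautifulSoup


def has_elements(html, css_selector):
    """Return True if the given HTML contains at least one match for the selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(css_selector)) > 0


# Hypothetical sample: what a server might send before any JavaScript has run.
raw_html = '<div id="programs"><p><a href="/x">Data Analytics</a></p></div>'

# Hypothetical sample: the same markup after a script has added the accordion classes.
rendered_html = (
    '<div id="programs" class="ui-accordion-content ui-widget-content">'
    '<p><a href="/x">Data Analytics</a></p></div>'
)

print(has_elements(raw_html, "div.ui-accordion-content"))
print(has_elements(rendered_html, "div.ui-accordion-content"))
```

Passing the text of a requests response for the real page to has_elements would show whether the class exists in the unrendered HTML. If it does not, the elements are most likely created in the browser.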

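Second problem: the classes belong to the div, not to the links. You have to search for the div with these classes first, and then search inside that div for the a tags without any class filter.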
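The two-step search can be tried on a static snippet without any browser. This is a minimal sketch, assuming markup shaped like the fragment in the question; the HTML string is shortened from it, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment from the question, with the closing </div> added.
HTML = """
<div class="ui-accordion-content ui-helper-reset ui-widget-content" id="ui-id-94">
<p>
    <a href="/programs-courses/catalogue/programs/PDDATA">Data Analytics (Post-Degree Diploma)</a>
    <a href="/programs-courses/catalogue/programs/ACTCSI_WEB">Web and Mobile Computing</a>
</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The classes sit on the <div>, so filtering <a> tags by them matches nothing.
wrong = soup.find_all("a", class_="ui-accordion-content")

# Step 1: find the <div> by one of its classes.
# Step 2: collect the <a> tags inside it, with no class filter.
names = []
for div in soup.find_all("div", class_="ui-accordion-content"):
    for link in div.find_all("a"):
        names.append(link.get_text(strip=True))

# The same two steps written as a single CSS selector.
names_css = [a.get_text(strip=True) for a in soup.select("div.ui-accordion-content a")]

print(len(wrong))
print(names)
```

Searching by a single class name is usually more robust than passing the whole class string, because a multi-class string given to class_ only matches when the attribute value is exactly that string.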


By the way: to control the browser you also need the matching driver, geckodriver for Firefox or chromedriver for Chrome.


Code:

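import selenium.webdriver
from bs4 import BeautifulSoup

url = "https://www.douglascollege.ca/programs-courses/catalogue/programs"

# Let a real browser load the page so that its JavaScript runs.
driver = selenium.webdriver.Firefox()
driver.get(url)

# Parse the rendered HTML, then close the browser.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Step 1: find every accordion panel <div> by one of its classes.
all_div = soup.find_all("div", class_='ui-accordion-content')

for div in all_div:
    # Step 2: collect the links inside the panel, with no class filter.
    all_items = div.find_all("a")

    for item in all_items:
        print(item.text)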

Partial output:

Basic Occupational Education - Electronics and General Assembly
Basic Occupational Education - Food Services
Basic Occupational Education - Retail and Business Services
Child and Youth Care (Bachelor of Arts)
Child and Youth Care (Diploma)

Classroom and Community Support (Certificate)
Classroom and Community Support (Diploma)
Education Assistance and Inclusion (Certificate)
Early Childhood Education (Certificate)
Early Childhood Education (Diploma) 
Early Childhood Education: Infant/Toddler (Post-Basic Certificate)
Early Childhood Education: Special Needs - Inclusive Practices (Post-Basic Certificate)
Employment Supports Specialty
Therapeutic Recreation (Bachelor)
Therapeutic Recreation (Diploma)
Accounting (Bachelor of Business Administration)
Accounting (Certificate)
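The output above contains an empty entry and at least one name with a trailing space, and the page may need a moment before its scripts have built the panels. The following is a hedged sketch of how both could be handled, not a tested recipe for this site: the function names clean_names and fetch_program_names and the 10-second timeout are assumptions made for illustration, and it presumes Firefox with its driver is available.

```python
def clean_names(texts):
    """Strip surrounding whitespace and drop entries that are empty afterwards."""
    stripped = ((text or "").strip() for text in texts)
    return [text for text in stripped if text]


def fetch_program_names(url, timeout=10):
    """Load the page in Firefox, wait for the accordion panels, return link texts."""
    # Imported here so that clean_names() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Assumption: the panels may appear only after scripts finish, so wait for one.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.ui-accordion-content"))
        )
        links = driver.find_elements(By.CSS_SELECTOR, "div.ui-accordion-content a")
        # textContent also returns the text of links inside collapsed panels.
        return clean_names(link.get_attribute("textContent") for link in links)
    finally:
        # Close the browser even if the wait times out.
        driver.quit()
```

Calling fetch_program_names with the catalogue URL from the question should return the names as a list of strings, provided the selector still matches the live page.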

Edit: the same without BeautifulSoup, using only Selenium:

import selenium.webdriver
from selenium.webdriver.common.by import By

url = "https://www.douglascollege.ca/programs-courses/catalogue/programs"

driver = selenium.webdriver.Firefox()
driver.get(url)

# find_elements_by_xpath() and find_elements_by_tag_name() were removed in Selenium 4.3;
# find_elements() with a By locator replaces them.
all_div = driver.find_elements(By.XPATH, '//div[contains(@class, "ui-accordion-content")]')

for div in all_div:
    all_items = div.find_elements(By.TAG_NAME, "a")

    for item in all_items:
        print(item.get_attribute('textContent'))
        #print(item.text) # doesn't work for hidden element

driver.quit()
