BeautifulSoup find_all() returns nothing []

Problem Description

I am trying to scrape this page for all of its offers, and I want to iterate over <p class="white-strip">, but page_soup.find_all("p", "white-strip") returns an empty list [].

My code so far:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.sbicard.com/en/personal/offers.page#all-offers'

# Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "lxml")
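
A quick way to confirm what is going wrong, continuing from the snippet above: check whether the class name appears anywhere in the raw bytes the server returned. If it does not, the element is created by JavaScript after the page loads, so no amount of static parsing will find it:

# page_html holds the raw response bytes, before any JavaScript runs.
# If this prints False, the offers are injected client-side.
print(b"white-strip" in page_html)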

Edit: I got it working using Selenium, and the code I used is below. However, I have not been able to figure out any other method that accomplishes the same thing.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome("C:\chromedriver_win32\chromedriver.exe")
driver.get('https://www.sbicard.com/en/personal/offers.page#all-offers')

# html parsing
page_soup = BeautifulSoup(driver.page_source, 'lxml')

# grabs each offer
containers = page_soup.find_all("p", {'class':"white-strip"})

filename = "offers.csv"
f = open(filename, "w")

header = "offer-list\n"

f.write(header)

for container in containers:
    offer = container.span.text
    f.write(offer + "\n")

f.close()
driver.close()
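
An aside on the file handling above: the csv module plus a with block is somewhat more robust than manual f.write() calls, since offer text containing commas gets quoted and the file is closed even if an error occurs mid-loop. A minimal sketch, reusing the containers list from the code above:

import csv

# Same output as the manual f.write() loop, but with proper CSV quoting
# and guaranteed file closure via the with block.
with open("offers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["offer-list"])          # header row
    for container in containers:
        writer.writerow([container.span.text])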

Tags: python, web-scraping, beautifulsoup

Solution

The website renders the requested data dynamically. You should try the selenium browser-automation library: it lets you scrape data from pages whose content is rendered dynamically (via JS or AJAX) after the initial request.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.sbicard.com/en/personal/offers.page#all-offers')

page_soup = BeautifulSoup(driver.page_source, 'lxml')
p_list = page_soup.find_all("p", {'class':"white-strip"})

print(p_list)
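
One caveat: driver.page_source is read as soon as get() returns, so on a slow connection AJAX-loaded offers may not be in the DOM yet. A sketch using Selenium's explicit waits, continuing from the code above (the 10-second timeout and the class-name locator are assumptions, not part of the original answer):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until at least one element with class
# "white-strip" is present in the DOM; raises TimeoutException otherwise.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "white-strip"))
)
page_soup = BeautifulSoup(driver.page_source, 'lxml')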

where '/usr/bin/chromedriver' is the path to the Selenium web driver executable.
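
Note that Selenium 4 removed the positional driver-path argument used above; if webdriver.Chrome('/usr/bin/chromedriver') raises a TypeError on a newer install, wrap the path in a Service object instead (a sketch, assuming the same driver location):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+ style: the executable path goes through a Service object.
driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"))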

Download the Selenium web driver for the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

