Web Scraping with Python - Blank Returns

Problem description

I'm trying to scrape reviews from TrustPilot, but the code always returns a blank table containing nothing except the headers/columns I specified. Can someone help me figure out what's going wrong?

from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()
names = []    # List to store reviewer names
headers = []  # List to store review titles
bodies = []   # List to store review text
ratings = []  # List to store star ratings
dates = []    # List to store review dates
#driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.trustpilot.com/review/birchbox.com?page=2")

content = driver.page_source
soup = BeautifulSoup(content, "html.parser", parse_only=SoupStrainer('a'))
for a in soup.findAll('a', href=True, attrs={'class':'reviews-container'}):
    name = a.find('div', attrs={'class':'consumer-information_name'})
    header = a.find('div', attrs={'class':'review-content_title'})
    body = a.find('div', attrs={'class':'review-content_text'})
    rating = a.find('div', attrs={'class':'star-rating star-rating--medium'})
    date = a.find('div', attrs={'class':'review-date--tooltip-target'})
    names.append(name.text)
    headers.append(header.text)
    bodies.append(body.text)
    ratings.append(rating.text)
    dates.append(date.text)

print('webpage, no errors')

df = pd.DataFrame({'User Name':names,'Header':headers,'Body':bodies,'Rating':ratings,'Date':dates})
df.to_csv('reviews02.csv', index=False, encoding='utf-8')

print('csv made')

Tags: python, pandas, selenium, web-scraping, beautifulsoup

Solution


The problem is that soup.findAll('a', href=True, attrs={'class':'reviews-container'}) matches nothing, so the loop body runs zero times. Make sure you use the correct tags and class names: on this page the reviewer name sits in a div with class consumer-information__name (double underscore, not the single underscore in your code), the review title is an h2, and the review text is a p. You also don't need the loop at all, since BeautifulSoup's find_all method returns every match in one call. I used the requests module to fetch the page, although that shouldn't make a difference.

from bs4 import BeautifulSoup
import requests

# Fetch the page and parse the whole document (no SoupStrainer needed)
req = requests.get("https://www.trustpilot.com/review/birchbox.com?page=2")
content = req.content
soup = BeautifulSoup(content, "lxml")

# Note the double underscores and the h2/p tags -- these match the page's actual markup
names = soup.find_all('div', attrs={'class': 'consumer-information__name'})
headers = soup.find_all('h2', attrs={'class': 'review-content__title'})
bodies = soup.find_all('p', attrs={'class': 'review-content__text'})
ratings = soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
dates = soup.find_all('div', attrs={'class': 'review-content-header__dates'})

Each list now has 20 entries.
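To finish with the CSV from the original question, here is a minimal sketch of the remaining step, continuing from the snippet above. It assumes each element's visible text is what you want to store; depending on TrustPilot's markup, some fields (the star rating in particular) may keep their value in an attribute such as an image's alt rather than in text, in which case you'd read the attribute instead:

import pandas as pd

# find_all returns Tag objects, not strings -- extract and strip the
# visible text of each match before building the table
df = pd.DataFrame({
    'User Name': [n.get_text(strip=True) for n in names],
    'Header':    [h.get_text(strip=True) for h in headers],
    'Body':      [b.get_text(strip=True) for b in bodies],
    'Rating':    [r.get_text(strip=True) for r in ratings],  # may be empty if the rating lives in an img alt attribute
    'Date':      [d.get_text(strip=True) for d in dates],
})
df.to_csv('reviews02.csv', index=False, encoding='utf-8')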

