python - Selenium/BeautifulSoup - Python - Looping through multiple pages
Question
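I've spent the better part of the day researching and testing the best way to loop through a set of products on a retailer's website.
While I have successfully collected the set of products (and their attributes) on the first page, I've been stuck on the best way to loop through the site's pages to continue my scrape.
As the code below shows, I tried to use a "while" loop and Selenium to click the site's "next page" button and then continue collecting products.
The problem is that my code still never gets past page 1.
Am I making a silly mistake here? I've read 4 or 5 similar examples on this site, but none of them were specific enough to provide a solution here.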
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')
products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()
products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)
            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)
            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)
            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)
    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)
Answer
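You need to parse the page again every time you "click" to the next page. Put the parsing inside your while loop; otherwise you keep iterating over the first page's products even after the browser has moved on, because the prod_containers object never changes.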
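To make the stale-parse problem concrete, here is a minimal sketch that needs no browser. `FakeDriver`, `parse_names`, and `PAGES` are hypothetical stand-ins invented for this illustration (they are not part of Selenium or BeautifulSoup); the regex only imitates what `html_soup.find_all(...)` does on the real page.

```python
import re


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver: one HTML string per page."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        # Like driver.page_source: the HTML of the page currently loaded.
        return self._pages[self._index]

    def click_next(self):
        # Move to the next page, staying on the last one if there is none.
        self._index = min(self._index + 1, len(self._pages) - 1)


def parse_names(html):
    # Simplified stand-in for html_soup.find_all(...): text of each <li>.
    return re.findall(r"<li>(.*?)</li>", html)


PAGES = ["<li>shirt A</li>", "<li>shirt B</li>", "<li>shirt C</li>"]

# Parsed once, before the loop: the same list is reused on every iteration.
driver = FakeDriver(PAGES)
stale = []
containers = parse_names(driver.page_source)
for _ in range(len(PAGES)):
    stale.extend(containers)
    driver.click_next()

# Parsed inside the loop: each iteration sees the page currently loaded.
driver = FakeDriver(PAGES)
fresh = []
for _ in range(len(PAGES)):
    fresh.extend(parse_names(driver.page_source))
    driver.click_next()

print(stale)  # ['shirt A', 'shirt A', 'shirt A']
print(fresh)  # ['shirt A', 'shirt B', 'shirt C']
```

The first loop collects page 1 three times even though the fake browser advances, which is the behavior described in the question.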
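Second, as written, your while loop will never stop: you set pageCounter = 0 but never increment it (the loop increments counterProduct instead), so it will always be less than maxPageCount.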
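One way to rule out this kind of runaway loop is to let `range` own the counter. The sketch below is an assumption-laden illustration rather than the code from this answer: `scrape_pages`, `parse_page`, and `go_next` are hypothetical names, and the two callbacks stand in for "parse `driver.page_source`" and "click the next-page link".

```python
def scrape_pages(total_pages, parse_page, go_next):
    """Parse exactly total_pages pages, advancing only between pages."""
    results = []
    for page_number in range(1, total_pages + 1):
        results.extend(parse_page())
        # Skip the click on the last page so we never look for a
        # "next" link that may not exist there.
        if page_number < total_pages:
            go_next()
    return results


# Hypothetical stand-ins for the browser, used only to exercise the loop.
state = {"page": 1, "clicks": 0}


def parse_page():
    return [f"product from page {state['page']}"]


def go_next():
    state["page"] += 1
    state["clicks"] += 1


items = scrape_pages(5, parse_page, go_next)
print(len(items), state["clicks"])  # 5 4
```

Because the loop bound is fixed up front, forgetting to increment a counter is no longer possible, and five pages need only four clicks.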
I fixed those two things in the code and ran it; it seems to work and parsed pages 1 through 5.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0

# Parse the first page once to read the total page count
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')

while (pageCounter < maxPageCount):
    # Re-parse whatever page the browser currently shows
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)
            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)
            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)
            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)
    # Click "next page". find_element(By.XPATH, ...) replaces
    # find_element_by_xpath, which was removed in Selenium 4.
    driver.find_element(By.XPATH, '//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)
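One caveat the code above does not handle: right after `click()`, the browser may still be loading, so `driver.page_source` can briefly return the previous page. Selenium provides `WebDriverWait` for this situation. The sketch below is only a plain-Python polling stand-in that shows the idea without a browser; `wait_until`, `SlowPage`, and `changed` are hypothetical names made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


class SlowPage:
    """Hypothetical page that keeps showing old content for the first two reads."""

    def __init__(self):
        self.reads = 0

    def first_product(self):
        self.reads += 1
        return "old shirt" if self.reads < 3 else "new shirt"


page = SlowPage()
previous = "old shirt"  # first product seen before the click


def changed():
    # Truthy only once the first product differs from the previous page's.
    name = page.first_product()
    return name if name != previous else None


current = wait_until(changed, timeout=2.0, interval=0.01)
print(current, page.reads)  # new shirt 3
```

In the real scraper, the condition would compare something read from the freshly parsed page (for example the first product name) against the value from the previous page before collecting products again.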