python-3.x - I have a list of URLs for the different pages I want to scrape. Can anyone tell me whether there is a way to automate this process?
Problem description
from bs4 import BeautifulSoup
import requests

urls = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
for url in urls:
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')
    restaurants = soup.find_all('div', class_='categoryBusinessListWrapper___14CgD')
    for index, restaurant in enumerate(restaurants):
        tags = restaurant.find_all('a', class_='internal___1jK0Z wrapper___26yB4')
        for tag in tags:
            restaurant_name = tag.find('div', class_='businessTitle___152-c').text.split(',')[0]
            ratings = tag.find('div', class_='textRating___3F1NO')
            location = tag.find('span', class_='locationZipcodeAndCity___33EfU')
            more_info = tag['href']
As you can see, I created a list of URLs to store the addresses of the different pages on this site. Is there any way to automate this? I am scraping with BeautifulSoup and the requests module, and I would like to know whether there is a way to visit the URLs of the different pages automatically.
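Since the eight URLs above differ only in the page query parameter, the list can also be generated instead of typed out. A minimal sketch (the page_url helper is hypothetical; it just reproduces the query parameters from the question via urllib.parse.urlencode):

```python
from urllib.parse import urlencode

BASE = 'https://www.trustpilot.com/categories/restaurants_bars'

def page_url(page):
    """Build the category URL for a 1-based page number (hypothetical helper)."""
    params = {'numberofreviews': 0, 'page': page, 'status': 'all', 'timeperiod': 0}
    return BASE + '?' + urlencode(params)

# Reproduces the hand-written list for pages 1 through 8.
urls = [page_url(page) for page in range(1, 9)]
```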
Solution
You can read the pagination at the bottom of the page and build those links with a list comprehension:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0'
regex = re.compile('pagination')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
pages = len(soup.find_all("a", {"class": regex}))
links = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page={page}&status=all&timeperiod=0'.format(page=page + 1) for page in range(pages)]
Output:
print(links)
['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=1&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
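If the total number of pages is not known in advance (or changes over time), another option is to keep requesting successive pages until one comes back without listings. A hedged sketch, reusing the CSS class from the question (these class names are generated and may change); the fetch function is injected so the loop can be exercised without hitting the network:

```python
from bs4 import BeautifulSoup

def scrape_all_pages(fetch, max_pages=50):
    """Collect listing hrefs page by page until a page has none.

    `fetch(page)` must return the HTML of the given page number;
    `max_pages` is a safety cap so the loop always terminates.
    """
    links = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch(page), 'html.parser')
        # Same (assumed) class as in the question; adjust if the markup changes.
        tags = soup.find_all('a', class_='internal___1jK0Z wrapper___26yB4')
        if not tags:
            break  # an empty page means we ran past the last one
        links.extend(tag['href'] for tag in tags)
    return links
```

In production, fetch would be a small wrapper around requests.get that builds the page URL, ideally with a polite delay between requests.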