I have listed the URLs of the different pages I want to scrape data from. Can anyone tell me if there is a way to automate this process?

Problem Description

from bs4 import BeautifulSoup
import requests

# URLs of the eight listing pages, collected by hand.
urls = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']

# Fetch each listing page and pull the details out of every restaurant card.
for url in urls:
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')
    restaurants = soup.find_all('div', class_='categoryBusinessListWrapper___14CgD')
    for index, restaurant in enumerate(restaurants):
        tags = restaurant.find_all('a', class_='internal___1jK0Z wrapper___26yB4')
        for tag in tags:
            restaurant_name = tag.find('div', class_='businessTitle___152-c').text.split(',')[0]
            ratings = tag.find('div', class_='textRating___3F1NO')
            location = tag.find('span', class_='locationZipcodeAndCity___33EfU')
            more_info = tag['href']

As you can see, I created a list to store the URLs of the site's different pages. Is there a way to automate this? I am scraping with BeautifulSoup and the requests module, and I would like to know whether the URLs of the different pages can be visited automatically.

Tags: python-3.x, web-scraping, beautifulsoup, python-requests

Solution


You can look at the pagination at the bottom of the page and use a list comprehension to create those links:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0'
regex = re.compile('pagination')

# Count the pagination links at the bottom of the first page to find
# how many pages there are.
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
pages = len(soup.find_all('a', {'class': regex}))

# Build one URL per page with a list comprehension.
links = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page={page}&status=all&timeperiod=0'.format(page=page)
         for page in range(1, pages + 1)]

Output:

print(links)

['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=1&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
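Note that those hashed class names (the ones ending in suffixes like ___14CgD) are build artifacts and tend to change, so counting pagination links can silently break. As a more defensive variant, here is a minimal sketch that lets requests assemble the query string via its params argument and simply keeps fetching pages until one returns no listing cards; the card and title class names are taken from the question's snapshot of Trustpilot's markup and are assumptions that may need updating.

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.trustpilot.com/categories/restaurants_bars'

page = 1
while page <= 50:  # hard cap in case the site serves results for any page number
    # Let requests build the query string instead of hand-writing each URL.
    response = requests.get(BASE_URL, params={
        'numberofreviews': 0,
        'page': page,
        'status': 'all',
        'timeperiod': 0,
    })
    soup = BeautifulSoup(response.text, 'html.parser')

    # Class names below come from the question and may be stale (assumption).
    cards = soup.find_all('a', class_='internal___1jK0Z wrapper___26yB4')
    if not cards:
        break  # no listings on this page, so the previous page was the last

    for card in cards:
        title = card.find('div', class_='businessTitle___152-c')
        if title:
            print(title.text.split(',')[0])

    page += 1

Either way the hand-maintained URL list disappears; the trade-off is that this version spends one extra request past the final page, while the answer above relies on the pagination links keeping a class that matches 'pagination'.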
