Scraping "view more" results with BS4

Problem description

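How can I use bs4 to scrape hidden products that only appear after clicking a "view more" button or scrolling down? In my case, I am trying to scrape all the search results from the link below, but I only get 20 books even though there are more than 20. How do I get all the search results here, and how can I do the same on other websites that behave the same way?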

from bs4 import BeautifulSoup
from Book import Book
import requests

class BertrandScrapper:

    def get_prices(self, title,author):
        page = requests.get('https://www.bertrand.pt/pesquisa/'+(str(title)+" "+str(author)).replace(" ","+"))
        soup = BeautifulSoup(page.text, 'html.parser')
        titles=soup.findAll(class_='title')
        for a in titles:
            print(a.text.strip())
        print(len(titles))

https://www.bertrand.pt/pesquisa/os+maias+e%C3%A7a+de+queiroz

Tags: python, web-scraping, beautifulsoup

Solution


These pages load their results through a POST request to https://www.bertrand.pt/pesquisando. You can retrieve all the titles like this:

import requests
from bs4 import BeautifulSoup

def get_results(page_nr):
  # Form fields sent with the search POST request; 'pagina' selects the page.
  data = {
    'requestArea': '',
    'pagina': str(page_nr),
    'palavra': 'os+maias+e%C3%A7a+de+queiroz',
    'filterKey': '',
    'filterValue': '',
    'filterName': '',
    'filterMap': '',
    'filterOperation': '',
    'filterField': '',
    'filterOrder': '',
    'tab': 'livros'
  }

  response = requests.post('https://www.bertrand.pt/pesquisando', data=data)
  soup = BeautifulSoup(response.content, 'html.parser')
  titles = soup.find_all(class_='title')
  return [a.text.strip() for a in titles]

page_nr = 1
titles = []

# Request successive pages until one comes back with no titles.
while True:
  print("checking page nr", page_nr)
  title_results = get_results(page_nr)
  if not title_results:
    print("No more results")
    break
  else:
    titles.extend(title_results)
    page_nr = page_nr + 1

Resulting titles:

['Os Maias', 'Reler Eça de Queiroz', 'Os Maias', 'Os Maias', 'Maias\n\n\n(eBook)', 'Maias', 'Os Maias', 'MAIAS (OS) QUEIROZ, ECA DE', 'Os Maias', 'Os Maias', 'Os Maias\n\n\n(eBook)', 'Os Maias de Eça de Queiróz', 'Os Maias', 'The Maias', 'Os Maias\n\n\n(eBook)', 'Os Maias', 'Os Maias', 'Os Maias', 'Os Maias - Antologia Ilustrada', 'The Maias, The', 'Os Maias  - Volume Ii', 'Os Maias - Volume I', 'Os Maias - Vol. 1 e 2', 'Os Maias - O Realismo']
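
As for other websites that behave the same way, the loop above can be separated from the site-specific request. The following is only a minimal sketch under stated assumptions: `collect_all_pages`, `normalize_title`, and the `max_pages` safety limit are hypothetical names introduced here, and the sketch assumes the target site signals the end of the results with an empty page, as this one appears to. Other sites may use different request parameters or a different stop condition.

```python
import re
from typing import Callable, List


def collect_all_pages(fetch_page: Callable[[int], List[str]],
                      first_page: int = 1,
                      max_pages: int = 100) -> List[str]:
    """Call fetch_page for successive page numbers until a page is empty.

    fetch_page is any site-specific function (hypothetical here) that takes
    a page number and returns the items found on that page.
    """
    items: List[str] = []
    for page_nr in range(first_page, first_page + max_pages):
        page_items = fetch_page(page_nr)
        if not page_items:
            break  # assumption: an empty page means there are no more results
        items.extend(page_items)
    return items


def normalize_title(title: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", title).strip()
```

With the `get_results` function from the answer, this could be used as `titles = [normalize_title(t) for t in collect_all_pages(get_results)]`, which would also tidy entries that contain embedded newlines before `(eBook)`. The `max_pages` cap is there only so that a site which never returns an empty page cannot cause an endless loop.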
