Beautiful Soup does not scrape data from the "next page"

Problem description

I am trying to scrape Airbnb data with BeautifulSoup and Pandas. I went through many tutorials and followed one of them. The step where the soup is supposed to scrape data from the next page does not work: out of 15 pages it only scrapes the first 2 or 3, and sometimes none at all (even though the page URLs are correct).

I can't figure out why this happens or how to fix it. Can anyone help?

import requests
import bs4
import pandas as pd
import numpy as np
import csv
import time

url = 'https://www.airbnb.it/s/Italy/homes?checkin=2021-08-01&checkout=2021-08-02'

# Download a results page and parse it with BeautifulSoup
def get_page(url):
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    return soup

# Collect every listing card on the page
def get_listings(soup):
    result = []
    result.extend(soup.find_all("div", {"class": "_8ssblpx"}))
    return result

# Extract the title text from one listing card
def get_listing_title(listing):
    for l in listing:
        try:
            return str(l.find('div', {'class': '_1tanv1h'}).text)
        except:
            return None

# Extract the subtitle text from one listing card
def get_listing_subtitle(listing):
    for l in listing:
        try:
            return str(l.find('span', {'class': '_1whrsux9'}).text)
        except:
            return None

# Extract the info text from one listing card
def get_listing_info(listing):
    for l in listing:
        try:
            return str(l.find_all('div', {'class': '_3c0zz1'})[0].text.lower())
        except:
            return None

# Return the absolute URL of the 'next page' link, or None
def find_next_page(page):
    base_url = "https://www.airbnb.it"
    try:
        nextpage = base_url + get_page(url).find_all("div", attrs={"class": "_jro6t0"})[0].find("a", attrs={'class':'_za9j7e'})['href']
    except:
        nextpage = None
    return nextpage

title = []
subtitle = []
info = []

# Walk through the result pages until no 'next page' link is found
while url is not None:
    soup = get_page(url)
    listings = get_listings(soup)
    for l in listings:
        title.append(get_listing_title(l))
        subtitle.append(get_listing_subtitle(l))
        info.append(get_listing_info(l))
    time.sleep(5)
    url = find_next_page(soup)
    print(url)

# Collect the scraped fields into a DataFrame
airbnb_data = pd.DataFrame(data = {'title': title,
                          'subtitle': subtitle,
                          'info': info})
airbnb_data

Tags: python, pandas, web-scraping, beautifulsoup

Solution
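
The most likely culprit is find_next_page(page): it never uses its page argument. It calls get_page(url) again instead, so every iteration downloads the current page a second time; whenever Airbnb answers that extra request with a bot check or a JavaScript-only shell, find_all comes back empty, the bare except: swallows the IndexError, and the loop ends after two or three pages. A minimal fix, as a sketch: search the soup that was already parsed and catch only the errors that mean "no next link" (the class names are the ones from the question; Airbnb rotates its generated class names regularly).

def find_next_page(page):
    base_url = "https://www.airbnb.it"
    try:
        # Search the soup we already have instead of re-downloading the page
        pagination = page.find("div", attrs={"class": "_jro6t0"})
        next_link = pagination.find("a", attrs={"class": "_za9j7e"})
        return base_url + next_link['href']
    except (AttributeError, TypeError, KeyError):
        # No pagination block, no 'next' anchor, or no href: last page
        return None

The listing helpers have a related quirk: get_listing_title(listing) receives a single card but loops over its children and returns on the first one; calling listing.find(...) directly is what was intended.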

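Even with that fix, plain requests calls with the default User-Agent are often served a page without any listing markup, which would also explain the runs where not even the first pages are scraped. Below is a sketch of the crawl loop with a browser-like header (the header value is an example, not a requirement) and a guard that stops when a page comes back empty; it reuses the helpers from the question, with find_next_page patched as above.

import time

import bs4
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # example value

def get_page(url):
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return bs4.BeautifulSoup(response.text, "html.parser")

title, subtitle, info = [], [], []
url = 'https://www.airbnb.it/s/Italy/homes?checkin=2021-08-01&checkout=2021-08-02'

while url is not None:
    soup = get_page(url)
    listings = get_listings(soup)
    if not listings:  # likely a JavaScript-only shell or a bot check
        print("No listings found on", url)
        break
    for l in listings:
        title.append(get_listing_title(l))
        subtitle.append(get_listing_subtitle(l))
        info.append(get_listing_info(l))
    time.sleep(5)  # be polite between requests
    url = find_next_page(soup)  # reuse the soup parsed above
    print(url)

If the class names (_8ssblpx, _jro6t0, _za9j7e) stop matching anything, Airbnb has rotated its generated CSS classes and the selectors have to be re-read from the live page; for pages this heavily scripted, a browser-driven tool such as Selenium is usually the more reliable route.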
