Why does my BeautifulSoup code only scrape some Airbnb URLs?

Problem description

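I have been trying to scrape Airbnb data from airbnb.com with Beautiful Soup. However, with the code below, not all of the URLs are scraped, even though inspecting the HTML shows the correct class names.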

   ab_lists[:4]
   output of ab_list: ['www.airbnb.com/rooms/34594075?adults=2&previous_page_section_name=1000',
  'www.airbnb.com/rooms/34056273?adults=2&previous_page_section_name=1000',
  'www.airbnb.com/rooms/48028470?adults=2&previous_page_section_name=1000',
  'www.airbnb.com/rooms/46915499?adults=2&previous_page_section_name=1000']
 

In the code above I have four URLs for four Airbnb listings, and I am trying to get each listing's title. I run a for loop over the list above to get more data:

 import requests
 from bs4 import BeautifulSoup

 def get_pages(url):
    # Download the page and parse the raw (static) HTML returned by the server
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    return soup

 for ab_url in ab_lists[:4]:
    ab_soup = get_pages("https://" + ab_url)
    a = ab_soup.select('div._mbmcsn')
    b = ab_soup.select('span._142pbzop')
    print(b)

 Output: [<span class="_142pbzop">(36 reviews)</span>]
 []
 []
 []

However, when I run the for loop, only some of the URLs are scraped, not all of them.

Can anyone help me solve this?

Tags: python, html, web-scraping, beautifulsoup

Solution


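The content you want to fetch from these pages is loaded dynamically. The requests module only downloads the initial HTML and does not run JavaScript, so it cannot see dynamic content unless you use an API or some alternative link that serves the same content. However, I used pyppeteer to scrape the fields you are interested in from these pages.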

import asyncio
import pyppeteer
from pyppeteer import launch

links = [
    'www.airbnb.com/rooms/34594075?adults=2&previous_page_section_name=1000',
    'www.airbnb.com/rooms/34056273?adults=2&previous_page_section_name=1000',
    'www.airbnb.com/rooms/48028470?adults=2&previous_page_section_name=1000',
    'www.airbnb.com/rooms/46915499?adults=2&previous_page_section_name=1000'
]


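async def fetch(page,url):
    # Wait until there are no open network connections, so dynamically loaded content has had a chance to render
    await page.goto(url,{"waitUntil": "networkidle0"})
    # The listing title is taken from the page's <h1>
    name = await page.querySelectorEval('h1','(e => e.innerText)')
    # The review count is the span inside the link whose aria-label contains "reviews"
    review = await page.querySelectorEval('a[aria-label*="reviews"] > span','(e => e.innerText)')
    print(name,review)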
        
async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    for link in links:
        qualified_link = f"https://{link}"
        await fetch(page,qualified_link)
    await browser.close()

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())

Output:

OGS - Studio 4 (36 reviews)
Bohemian London Living - Large Double Room! (20 reviews)
Bijoux yet luxurious -Belgravia, London 2 reviews
MODERN Double room NOX HOTELS Paddington (78 reviews)
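
The review text in the output above is not uniform: some entries are wrapped in parentheses and one is not. If you need the count as a number, something like the following may help. This is only a sketch; parse_review_count is a hypothetical helper name, and the pattern assumes the text always contains a number followed by "review" or "reviews".

```python
import re


def parse_review_count(text):
    """Return the review count as an int, or None if no count is found.

    Hypothetical helper: assumes the count appears as '<number> review(s)'.
    """
    match = re.search(r"(\d[\d,]*)\s+reviews?", text)
    if match is None:
        return None
    # Drop thousands separators such as "1,234" before converting
    return int(match.group(1).replace(",", ""))


# Sample strings modelled on the output above, plus two made-up edge cases
samples = ["(36 reviews)", "2 reviews", "(1 review)", "New"]
print([parse_review_count(s) for s in samples])
```

Returning None rather than 0 when nothing matches keeps "no reviews found in the text" distinguishable from a real count.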

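Before reaching for a browser, it can be worth checking whether the element you want is present in the static HTML at all. The sketch below counts matching tags in a raw HTML string using only the standard library's html.parser, so it runs without extra packages. The names ClassCounter and count_in_static_html are hypothetical, and the two HTML strings are made-up stand-ins for a server-rendered response and an empty JavaScript shell, not real Airbnb responses.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name whose class attribute contains a given class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A class attribute may be missing or valueless, so fall back to an empty string
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.count += 1


def count_in_static_html(html, tag, class_name):
    """Return how many <tag class="... class_name ..."> elements the raw HTML contains."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    return parser.count


# Made-up responses: one rendered on the server, one filled in later by JavaScript
rendered = '<html><body><span class="_142pbzop">(36 reviews)</span></body></html>'
shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(count_in_static_html(rendered, "span", "_142pbzop"))
print(count_in_static_html(shell, "span", "_142pbzop"))
```

In practice you would pass requests.get(url).text instead of the sample strings. A count of zero for a class you can see in the browser's inspector suggests the element is added after the page loads.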