Python: scraping "Things to Do" from TripAdvisor

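Problem description

On this page, I want to scrape the list "Types of Things to Do in Miami" (you can find it near the end of the page). Here is what I have so far: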

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header so the request is not rejected
user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"

headers = {'User-Agent': user_agent}

new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
# Get response from url
response = requests.get(new_url, headers = headers)
# Use the decoded text directly; re-encoding it to bytes is unnecessary
html = response.text
# Soupify response
soup = BeautifulSoup(html, "lxml")

tag_elements = soup.findAll("a", {"class":"attractions-attraction-overview-main-Pill__pill--23S2Q"})

# Iterate over tag_elements and extract strings
tags_list = []
for i in tag_elements:
    tags_list.append(i.string)

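The problem is that the values I get, 'Good for Couples (201)', 'Good for Big Groups (130)', 'Good for Kids (100)', come from the "Commonly Searched For in Miami" area, which sits below the "Types of Things..." section of the page. I am also missing some values I need, such as "Traveler Resources (7)", "Day Trips (7)", etc. Both lists, "Things to do..." and "Commonly searched...", use the same class name, and I suspect the class I pass to soup.findAll() is the cause of the problem. What is the correct way to do this? Should I take a different approach?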

Tags: python, web-scraping, beautifulsoup, tripadvisor

Solution


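This is very simple to do in the browser (the snippet assumes driver is a Selenium WebDriver that has already loaded the page):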

filters = driver.execute_script("return [...document.querySelectorAll('.filterName a')].map(a => a.innerText)")
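
The same idea, scoping the lookup to a container class instead of the shared pill class, can be sketched with BeautifulSoup's CSS selector support. This is only an illustration under stated assumptions: the sample markup and the commonSearches class name are invented for the example, and it is not confirmed that the HTML returned by requests contains the .filterName elements (they may only exist after JavaScript runs in a browser).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
# Both lists share the same anchor class, but only one sits inside .filterName.
SAMPLE_HTML = """
<div class="filterName"><a class="pill" href="#">Traveler Resources (7)</a></div>
<div class="filterName"><a class="pill" href="#">Day Trips (7)</a></div>
<div class="commonSearches"><a class="pill" href="#">Good for Couples (201)</a></div>
"""


def extract_filter_names(html):
    """Return the text of every <a> nested inside an element with class filterName."""
    soup = BeautifulSoup(html, "html.parser")
    # Same selector as in the browser snippet: '.filterName a'
    return [a.get_text(strip=True) for a in soup.select(".filterName a")]


filters = extract_filter_names(SAMPLE_HTML)
print(filters)
```

Selecting by the parent container rather than the anchor's own class is what keeps the "Commonly searched" entries out of the result, since they share the anchor class but not the container.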
