How to crawl TripAdvisor

Problem description

I'm trying to get the URLs of all restaurants in Singapore, but my code doesn't work:

import requests
from bs4 import BeautifulSoup

data = requests.get("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html").text

soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', {'property_title'}):
    print('https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href'))
    print(link.string)

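It keeps loading indefinitely and never seems to get past the line soup = BeautifulSoup(data, "html.parser").

I don't know why this happens, since the same approach works for other websites.

Is this because TripAdvisor blocks scraping, or is there an error in my code?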

Tags: python, beautifulsoup, web-crawler, tripadvisor

Solution


Regarding "it keeps loading and loading":

To get a response, add a user-agent header:

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
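
Since the symptom in the question was a request that never seems to finish, it may also help to set a timeout and check the status code, so that a blocked or stalled request fails quickly rather than hanging. The following is only a sketch: the fetch_html helper name and the 10-second timeout are assumptions made for illustration and are not part of the original answer.

```python
import requests

HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}


def fetch_html(url, timeout=10):
    """Hypothetical helper: GET `url` with a browser-like user-agent."""
    # If the server does not respond within `timeout` seconds, requests raises
    # requests.exceptions.Timeout instead of waiting indefinitely.
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise requests.exceptions.HTTPError on a 4xx/5xx status,
    # for example if the site rejects the request.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html") would then return the page text or raise an exception that explains what went wrong.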

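However, the data is loaded dynamically, and requests does not execute JavaScript, so it cannot render dynamically loaded pages. The data is still present in the page source in a JSON-like format (it isn't clear exactly what you want to scrape). To get all of the data, you can use the json and re modules: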

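import json
import re

import requests

# `headers` is the user-agent dict defined above

data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text

# Capture the object assigned to window.__WEB_CONTEXT__ in the page source
json_data = re.search(r"window\.__WEB_CONTEXT__=({.*});", data, flags=re.MULTILINE).group(1)

# json_data is still a string here, so json.dumps only quotes and escapes it.
# To access individual fields, try `json.loads(json_data)` instead;
# that works only if the captured text is strict JSON.
print(json.dumps(json_data, indent=4))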
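
Because re.search returns None when the pattern is absent (for instance, if the site serves a different page to the scraper), the extraction step can be wrapped in a small helper that fails with a clear message. This is only a sketch: extract_web_context is a hypothetical name, and whether the captured text is strict JSON is an assumption, so the helper falls back to returning the raw string when json.loads rejects it.

```python
import json
import re

# Same pattern as in the answer above.
WEB_CONTEXT_RE = re.compile(r"window\.__WEB_CONTEXT__=({.*});", flags=re.MULTILINE)


def extract_web_context(html):
    """Hypothetical helper: return the __WEB_CONTEXT__ payload found in `html`.

    Returns a parsed object if the payload is strict JSON, otherwise the raw string.
    """
    match = WEB_CONTEXT_RE.search(html)
    if match is None:
        raise ValueError("window.__WEB_CONTEXT__ not found in the page")
    raw = match.group(1)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: the payload may be a JavaScript object literal
        # (e.g. with unquoted keys) rather than strict JSON.
        return raw
```

Checking the return type with isinstance(result, str) tells the caller whether parsing succeeded or only the raw text is available.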

To get all the links:

import re
import requests


headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text

# Each detailPageUrl value already starts with "/", so the base URL has no trailing slash
for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
    print("https://www.tripadvisor.com.sg" + link)

Output (truncated):

https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1193730-Reviews-Entre_Nous_creperie-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d1173583-Reviews-The_Courtyard-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d4611806-Reviews-NOX_Dine_in_the_Dark-Singapore.html
https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d13152787-Reviews-Positano_Risto-Singapore.html
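
The link extraction can also be packaged as a function that joins paths with urllib.parse.urljoin, which avoids doubled slashes, and drops repeated entries. A minimal sketch, assuming the detailPageUrl values are site-relative paths as in the output above; extract_restaurant_links, BASE_URL, and DETAIL_URL_RE are names introduced here for illustration.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com.sg/"
# Same pattern as in the loop above.
DETAIL_URL_RE = re.compile(r'"detailPageUrl":"(.*?)"')


def extract_restaurant_links(html, base_url=BASE_URL):
    """Return absolute restaurant URLs found in `html`, in page order, without repeats."""
    links = []
    seen = set()
    for path in DETAIL_URL_RE.findall(html):
        # urljoin resolves "/Restaurant_Review-..." against the site root.
        url = urljoin(base_url, path)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

With the page text from the request above stored in data, extract_restaurant_links(data) would return the list of links to print or save.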
