Changing pages using Post

Problem description

I've been using Selenium to scrape a website, but for some reason it stopped working. I was using Selenium because you need to interact with the site to move through the pages (i.e., click the Next button).

As a workaround, I'm considering using the Post method from Requests. I'm not sure whether that's feasible, since I've never used the Post method and I'm not familiar with what it does (although I roughly understand the general idea).
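
From reading the Requests docs, a basic Post call looks something like the sketch below (httpbin.org is a public echo service used purely for illustration; it is not the site I'm scraping):

import requests

# A basic POST: send a JSON body and read back the JSON response.
# httpbin.org simply echoes the request, so it is safe for experimenting.
r = requests.post("https://httpbin.org/post", json={"startPosition": 0})
print(r.status_code)     # 200 if the request succeeded
print(r.json()["json"])  # the body we sent, echoed back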

My code looks like this:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":
           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/50.0.2661.102 Safari/537.36"}

url = "https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail"

def infinity():
    while True:
        yield

c = 0
urls = []
pageTot = None
for i in infinity():
    c += 1
    if c <= 1:
        req = requests.get(url, headers=headers)  # Fetch the first page
    else:
        pass
        # This is where I'm stuck, but ideally I'd be using the Post method in some way
    soup = BeautifulSoup(req.content, "lxml")
    if pageTot is None:
        # Read the total number of pages from the pager text
        # (join the last two tokens of the thousands-separated total, e.g. "2 858" -> 2858)
        page = str(soup.find("li", {"class": "pager-current"}).text).split()
        pageTot = int("".join(page[-2:]))
    if c <= pageTot:
        for link in soup.find_all("a", {"class": "a-more-detail"}):
            try:  # On each page, collect the ad URLs
                urls.append("https://www.centris.ca" + link["href"])
            except KeyError:  # Skip anchors without an href
                pass
    else:  # Exit the loop once every page has been scraped
        break

for url in list(dict.fromkeys(urls)):  # De-duplicate the collected links, preserving order
    pass # do stuff

Here is what happens when you click Next on the web page:

This is the request (startPosition starts at 0 on page 1 and increases in jumps of 12): [screenshot of the POST request]

And here is part of the response:

{"d":{"Message":"","Result":{"html": [...], "count":34302,"inscNumberPerPage":12,"title":""},"Succeeded":true}}

With this information, is it possible to use the Post method to scrape every page? How would I go about it?

Tags: python, html, post, web-scraping, python-requests

Solution


The following should do the trick. I've added de-duplication logic to avoid printing the same link twice. The script breaks out of the loop once there are no more results to scrape.

import requests
from bs4 import BeautifulSoup

base = 'https://www.centris.ca{}'
post_link = 'https://www.centris.ca/Property/GetInscriptions'
url = 'https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail'
unique_links = set()

payload = {"startPosition":0}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    
    s.get(url)  # Visit the listings page first so the session picks up the cookies the POST endpoint expects

    while True:
        r = s.post(post_link, json=payload)
        result = r.json()['d']['Result']
        if not result['html']:  # No more results to scrape
            break
        soup = BeautifulSoup(result['html'], "html.parser")
        for item in soup.select(".thumbnailItem a.a-more-detail"):
            unique_link = base.format(item.get("href"))
            if unique_link not in unique_links:  # Print each ad link only once
                print(unique_link)
            unique_links.add(unique_link)

        payload['startPosition'] += 12  # Advance to the next page of 12 listings
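
As a design note, the initial s.get(url) matters because the POST endpoint appears to rely on cookies set by the regular page. The empty html field is used as the stop condition; if you'd rather bound the loop up front, the count field from the first response could be used instead (a sketch that reuses s, post_link, and BeautifulSoup from the script above):

import math

# Read the total listing count from the first batch of results
first = s.post(post_link, json={"startPosition": 0}).json()['d']['Result']
per_page = first['inscNumberPerPage']
pages = math.ceil(first['count'] / per_page)

# Visit every page by its startPosition instead of probing for an empty one
for start in (i * per_page for i in range(pages)):
    batch = s.post(post_link, json={"startPosition": start}).json()['d']['Result']
    soup = BeautifulSoup(batch['html'], "html.parser")
    # ... extract the ad links exactly as in the loop above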
