requests_html render returns Access Denied

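Problem description

When I try to render a page with requests_html, the server denies access. When I send the same request with plain requests, I get the HTML back.

Why is my access being denied?

Code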

from requests_html import HTMLSession
s = HTMLSession()

base_url = 'https://secure.louisvuitton.com/eng-gb/checkout/review'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'TE': 'Trailers',
}

r = s.get(base_url, headers=headers)
print(r)


r.html.render()
print(r.html.text)

Terminal output

<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://secure.louisvuitton.com/eng-gb/checkout/review" on this server.
Reference #18.6fce7a5c.1597604631.1e8bfd7

Tags: python, python-3.x

Solution


It looks like this site rejects headless browsers, which it detects from the User-Agent header. In my case, the header sent during rendering was:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36
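
To make the detection concrete, here is a minimal sketch of the kind of check a server might apply, together with a way to derive a regular-looking UA from a headless one. The function names looks_headless and strip_headless_marker are hypothetical, and the site's real detection logic is not known; the sketch only assumes that it keys on the HeadlessChrome token.

```python
# The UA string reported above for the headless browser.
HEADLESS_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36"
)


def looks_headless(user_agent: str) -> bool:
    """Assumed server-side heuristic: flag any UA with the headless token."""
    return "headlesschrome" in user_agent.lower()


def strip_headless_marker(user_agent: str) -> str:
    """Return the UA with the 'HeadlessChrome' product token renamed to 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


if __name__ == "__main__":
    cleaned = strip_headless_marker(HEADLESS_UA)
    print(looks_headless(HEADLESS_UA), looks_headless(cleaned))
    print(cleaned)
```

A string cleaned this way keeps the real platform and version details, so it could be passed to a call such as page.setUserAgent instead of a hand-written UA.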

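Under the hood, the requests_html module uses Pyppeteer to render JavaScript. Pyppeteer has an option to set the UA for a page, but I don't see a convenient way to override a class in requests_html to make that change: page is defined inside the _async_render function (a coroutine, to be precise).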

You can try using Pyppeteer directly and then use requests_html only to parse the HTML:

import asyncio
import traceback

from pyppeteer import launch
from requests_html import HTML

URL = 'https://secure.louisvuitton.com/eng-gb/checkout/review'
UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'


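async def fetch(url, browser):
    page = await browser.newPage()
    # Override the default UA, which contains "HeadlessChrome".
    await page.setUserAgent(UA)

    try:
        await page.goto(url, {'waitUntil': 'load'})
    except Exception:
        # Log the failure; the function then returns None.
        traceback.print_exc()
    else:
        return await page.content()
    finally:
        await page.close()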


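async def main():
    browser = await launch(headless=True, args=['--no-sandbox'])

    try:
        doc = await fetch(URL, browser)
    finally:
        # Close the browser even if fetch() raises.
        await browser.close()

    # fetch() returns None when navigation fails.
    if doc is None:
        return

    html = HTML(html=doc)
    print(html.links)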


if __name__ == '__main__':
    asyncio.run(main())
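
Because the blocked page in the question came back alongside a 200 response, it may help to check the fetched text before parsing it. The following is a rough sketch of such a check; the helper name is_access_denied and the two phrases it looks for are assumptions taken from the terminal output in the question, not a documented format of the site.

```python
# Sample text copied from the terminal output in the question.
DENIED_SAMPLE = (
    "Access Denied\n"
    "Access Denied\n"
    'You don\'t have permission to access '
    '"http://secure.louisvuitton.com/eng-gb/checkout/review" on this server.\n'
    "Reference #18.6fce7a5c.1597604631.1e8bfd7\n"
)


def is_access_denied(page_text: str) -> bool:
    """Heuristic: True if the text resembles the 'Access Denied' page above."""
    lowered = page_text.lower()
    return "access denied" in lowered and "permission to access" in lowered


if __name__ == "__main__":
    print(is_access_denied(DENIED_SAMPLE))
    print(is_access_denied("<html><body>Checkout review</body></html>"))
```

In the script above, this check could run on the string returned by fetch() to confirm that changing the UA actually got past the block.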
