Access denied when using Python requests with headers

Problem description

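I am trying to get information from www.aldi.co.uk, but Python requests keeps getting access denied, even when I send the headers copied from the website. An example URL I am trying to scrape is https://www.aldi.co.uk/gardenline-2000w-patio-heater/p/709938458399000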


import random

# Load the proxy list; each line has the form host:port:user:password
with open('C:/Users/Administrator/Desktop/proxies2.txt') as f:
    PROXIES = f.readlines()

headers = {
    'authority': 'www.aldi.co.uk',
    'method': 'GET',
    'path': '/gardenline-2000w-patio-heater/p/709938458399000',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': ('JSESSIONID=3CD7A9CD1F847495BC0E57269B0D2343.node5; mercia_pers=15dbd719637e71e60e064d3064d298c0da86611e; b1pi=!dVpaE+0OdClIJlDBIJDu7YLsLa10Fj5VXK60G1v10AEXoRwO7BVS0C6sZ++KgnjRJbFDk6WX6K7CyFk=; bm_sz=566521EFC53AE332524BF8A3685DB4D9~YAAQVeQWAqyBfjJ5AQAAL2fQgwsVeG8vZbdoiSjRseh3EQtmAw+s7t2wc1idrKIP54BG7+4BbyOwnSGfVjYswRgGvbJR1Ab/rd1KK6d2089qTkRwbdo7ualo72Tb1MxS9eBFUsAQZLADc8LQwh9Lf0mOEDuTWLHG3WPL/tWpRWs7zIGzgBTJSjAGklW2xqs0; _gcl_au=1.1.1711508455.1621414143; px_random=1; ak_bmsc=12D85371C2578A6C86F8010A7775C3140216E455B2060000FFD0A460101E7146~plUnE+Y3O3qH92wWP4RKrOIZ75DoHXQ782drrSUjzA4AFxFqAGGda2Q71kZW9MKLQMSIPGSy0b2mER3KKFFFJCHHvEPi85myXeUMExqzJz9yhP3zGkG+yk1RwjkT0Bxi5Gu0uU7gK5zjOLtCByAVki9oDM0z5gYpK0StNP19EzZvhIUtqShvNxagYxKA3ZR+FSH1ZfVkSTQPDIaoDS1NfLa3Xw0iG+IaoFcg0SapH40Mr1Gze7DNJseRJjZk6uHrLH; AMCVS_95446750574EBBDF7F000101%40AdobeOrg=1; _hjTLDTest=1; _hjid=dd61e5d7-6420-4751-a6ff-531345b775d0; _hjFirstSeen=1; bm_sv=5FDE1A04B76A59BDA13506B03EE1A388~qH6bHmkgTSXHJmnMknd6rz+apNw7QYxO2a9vKPYbdPuFIt01M//OXvb515o0Rn6O/7cXh077yUtYAfjKTWxZZ28IxChyTlWif08M+gm/qTtK+tfcJ1ElbKOvqPDL7V1Z5F6PmS6HyXCJ1lmmddlMLkdJkFk62zjwJ06vKlscb4M=; _uetsid=10915620b87f11eba4782bddfe4be808; _uetvid=1091af00b87f11eba4af25f4003c61bd; _ga=GA1.3.514717246.1621414144; _gid=GA1.3.862108792.1621414144; AMCV_95446750574EBBDF7F000101%40AdobeOrg=-1124106680%7CMCIDTS%7C18767%7CMCMID%7C16739144063247909584513939200244025078%7CMCAAMLH-1622018944%7C6%7CMCAAMB-1622018944%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1621421344s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C5.2.0; BVBRANDID=90e17435-7edf-450f-81ae-9acc5e07a709; BVBRANDSID=879c4208-412b-44a8-92e9-bbf6a9466c97; _hjIncludedInSessionSample=0; _hjAbsoluteSessionInProgress=0; rr_rcs=eF5j4cotK8lMETA0NzTRNdQ1ZClN9kgztQTyElN1jVMNLXRNjJMtdFNTDIHcJNMU81RLIzMjM0sAlEoOSA; _dc_gtm_UA-62398555-4=1; _fbp=fb.2.1621414145210.1275760171; '
               '_derived_epik=dj0yJnU9OENZUUdzMk1kamt3c3M4WE5RRi0zSUtTaGVUdTNZX2Ymbj11ek1sUVhLSjJWVzA1NUs5UmdPYUF3Jm09NyZ0PUFBQUFBR0NrMFFNJnJtPWUmcnQ9QUFBQUFHQVVnWlE; _pin_unauth=dWlkPU5UZGpPV1pqTVRZdE1UTTRZaTAwT0RCaExUbGhZekl0WkdVMFl6SmxOR1ZsT1RNeg; BVImplmercia=11002_5_0; _abck=25BD5B1352FF94A494F37B1D0F75572D~0~YAAQVeQWArOBfjJ5AQAAEHjQgwVlotJa2ibNkcfJoxy8PuZj3f2IXG1tblii4YK+jK7Ns5M5J+EYKse7lrnG5bgsD5p+5RPVpWDfpgKfBLJPBiFKNgxWnwXRFgUVsAYWkntTM/TxPR3NQWZuUOG0M3hjwHxN/x3E3HZoEqDmgoDO5R/l6boWjeh2s0Vp5hFX8pVaUGIHy+2vK/TN9HmYt+zn1+5JHZ1qb9+6/LGGRfh3CIiPO1PJ23lwdNCJTRZwc39jtzqiL412q4KNuOIowO0cgyayckmCOilyqRHjK6+OGgEblY/c1PakRF/5bmkTLvxAR4NzELDyscOLb8SdBeCM2jK78IiOVnpSeQScpVbOYUENP9pc5+hYFm9HBhAy3cQQDTJ50NKH6d69uDEFcA+m7f9XUbi0~-1~||-1||~-1; QueueITAccepted-SDFrts345E-V3_easter2021=EventId%3Deaster2021%26QueueId%3D00000000-0000-0000-0000-000000000000%26RedirectType%3Ddisabled%26IssueTime%3D1621414148%26Hash%3D9a60f4e686dd39c4e97a68193d55b3bf6dca90c5d4d417d20ddd3ed6992296c1; _taggstar_ses=11aa20f2-b87f-11eb-896f-9958026d8250; _taggstar_vid=11aa20f2-b87f-11eb-896f-9958026d8250; _taggstar_exp=v:3|id:|group:; s_vnum=1652950147659%26vn%3D1; s_invisit=true; s_nr=1621414149284-New; gpv_pn=%2Fketer-manor-garden-shed%2Fp%2F710498460844800; s_cc=true; __atuvc=1%7C20; __atuvs=60a4d0ffec2acf52000; aam_uuid=17038672222160981164484197334904414492'),
    'origin': 'www.aldi.co.uk',
    'referer': 'www.aldi.co.uk/',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
    
# Pick a random proxy line and split it into host, port, user and password
proxy_parts = random.choice(PROXIES).strip().split(':')
host, port, user, password = proxy_parts[:4]

# The same authenticated HTTP proxy handles both http and https traffic
proxy_url = f'http://{user}:{password}@{host}:{port}'
proxy = {'http': proxy_url, 'https': proxy_url}

# `scraper` and `info` are defined elsewhere in the script and are not shown
# here; info[0] holds the URL to fetch
html = scraper.get(info[0], proxies=proxy, headers=headers).text

print(html)

I thought this might be a proxy problem, but running it without a proxy still gives the same error. This is what html contains:

<TITLE>Access Denied</TITLE>
    <!-- endbuild -->
    <script defer src="https://www.google.com/recaptcha/api.js?onload=renderRecaptchas&render=explicit"></script>
<script type="text/javascript" nonce="ab27d842a4c159f0791de7fc7d0e9e7a">var _cf = _cf || [];  _cf.push(['_setFsp', true]);  _cf.push(['_setBm', true]);  _cf.push(['_setAu', '/staticweb/4a5ae10f5cfti2055070e1c94cd9e821b']); </script><script type="text/javascript" nonce="ab27d842a4c159f0791de7fc7d0e9e7a" src="/staticweb/4a5ae10f5cfti2055070e1c94cd9e821b"></script></body>
<!--Variable required in redirect.tag in order to ASM to work -->
<script type="text/javascript">
    var ACC = { config: {} };
    ACC.config.encodedContextPath = "";
</script>
</html>
<div class="hidden">
        vdecKVlCbUukcVtbDymMUYPJVVOTSBpMLSEKMcLEeBAJDpdpodOhdpAywCtFBYmbtPMYYwiLlhFSXjpVUjccPKKFupuVFArBvjLmVZlJbIttTIPLutuodUqhEtjYmUwVRkqMiHlSkkBjnjQhiyBcJYkbuybVNRUTrqfQYjGNdoS
    <!-- endbuild -->
    <script defer src="https://www.google.com/recaptcha/api.js?onload=renderRecaptchas&render=explicit"></script>
<script type="text/javascript" nonce="cdaad728d6c7629abcaf0b22503c1f0f">var _cf = _cf || [];  _cf.push(['_setFsp', true]);  _cf.push(['_setBm', true]);  _cf.push(['_setAu', '/staticweb/4a5ae10f5cfti2055070e1c94cd9e821b']); </script><script type="text/javascript" nonce="cdaad728d6c7629abcaf0b22503c1f0f" src="/staticweb/4a5ae10f5cfti2055070e1c94cd9e821b"></script></body>
<!--Variable required in redirect.tag in order to ASM to work -->
<script type="text/javascript">
    var ACC = { config: {} };
    ACC.config.encodedContextPath = "";
</script>
</html>
<div class="hidden">
        QNBuxaDGYAfWXfkAiKIJFmGHySjFArtIZPjyUmiqetXcgYSrMvtvOYUcazdZFnGWEDDYNsJplgFXjofUdqXnnAAtClJlpcikRbNQlldVEKUWNSFIdwgNRZKcWkNkLUAYCyWUpqsIfotcXDnfOdHJnWTZTTaKXdGoWYswgnCNNRzwnHbLBuiUQMkEbrhmrxx
</div>
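
The cookie names in the request (`_abck`, `bm_sz`, `ak_bmsc`, `bm_sv`) and the `_cf.push(['_setBm', true])` snippet in the output are commonly associated with Akamai Bot Manager, although nothing in the question confirms which product the site uses. Before changing headers, it can help to check programmatically whether a response is this block page rather than the product page. The sketch below is one way to do that. `looks_blocked` and `BLOCK_MARKERS` are hypothetical names, the marker strings are taken from the output above, and treating a 403 status as a block is an assumption.

```python
# Hypothetical helper: decide whether a response body is the block page
# shown above rather than the real product page.
BLOCK_MARKERS = (
    "<title>access denied</title>",  # title of the returned page
    "_cf.push(['_setbm', true])",    # script snippet present in the output
)


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if the response looks like the 'Access Denied' page."""
    lowered = body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    # Assumption: a 403 status also indicates a block; the question does not
    # say which status code accompanied the page.
    return status_code == 403
```

Calling `looks_blocked(resp.status_code, resp.text)` on a response fetched once with `proxies=proxy` and once without it separates a bad proxy from a block on the request itself.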

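Is there anything else I can add to the request to stop the site from rejecting it? If not, and Selenium is the only option, is there a way to run it at a speed similar to requests?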
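
One detail in the headers above is worth noting: `authority`, `method`, `path` and `scheme` match the HTTP/2 pseudo-headers (`:authority`, `:method`, `:path`, `:scheme`) that browser developer tools list alongside the real request headers. They are not ordinary header fields, and the hard-coded `cookie` value belongs to an earlier browser session. Below is a minimal sketch of stripping those entries before sending. `strip_copied_headers` is a hypothetical helper, and this is not claimed to be enough to get past the block.

```python
# Names of HTTP/2 pseudo-headers as they appear (without the leading colon)
# in the headers dict from the question.
PSEUDO_HEADERS = {"authority", "method", "path", "scheme"}


def strip_copied_headers(headers: dict, drop_cookie: bool = True) -> dict:
    """Return a copy of `headers` without pseudo-headers.

    If `drop_cookie` is true, the copied `cookie` entry is removed as well,
    so that a session can collect fresh cookies on its own.
    """
    cleaned = {}
    for name, value in headers.items():
        key = name.lower().lstrip(":")
        if key in PSEUDO_HEADERS:
            continue
        if drop_cookie and key == "cookie":
            continue
        cleaned[name] = value
    return cleaned
```

The result can be passed as `headers=strip_copied_headers(headers)` in the existing `scraper.get(...)` call. The original dict is left unchanged.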

Tags: python, web-scraping, python-requests

Solution

