Unable to load the web page https://www.riachuelo.com.br/feminino/colecao-feminino using Selenium and Python

Problem description

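I have been trying to scrape this page ( https://www.riachuelo.com.br/feminino/colecao-feminino ) with Selenium, but I cannot access the HTML because the page never loads. I tried a random user agent and other browsers, but the problem persists. Any idea why this happens?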

Here is the code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from fake_useragent import UserAgent

URL = "https://www.riachuelo.com.br/feminino/colecao-feminino"

options = Options()
# Send a randomly chosen User-Agent string
ua = UserAgent()
userAgent = ua.random
options.add_argument(f'user-agent={userAgent}')
# Selenium 4 form; Selenium 3 used the chrome_options= and executable_path= keyword arguments
driver = webdriver.Chrome(service=Service(r"C:\Program Files (x86)\chromedriver.exe"), options=options)
driver.get(URL)

Tags: python, selenium, selenium-webdriver, web-scraping, webdriver

Solution


I ran your use case with Selenium to load the web page at https://www.riachuelo.com.br/feminino/colecao-feminino as follows:

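from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Drop the switch and the extension that advertise an automated session
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Selenium 4 takes the driver path through a Service object (executable_path= is the Selenium 3 form)
driver = webdriver.Chrome(service=Service(r"C:\WebDrivers\chromedriver.exe"), options=options)
driver.get("https://www.riachuelo.com.br/feminino/colecao-feminino")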

Like you, I hit the same roadblock: the web page never loads.



Analysis

While inspecting the DOM tree of the web page, you will find that some of the <iframe> and <script> tags refer to the keyword dist. For example:

  • src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/../index.html#!/?token=c243ce95-db6c-4ab6-9f2b-bf60d69c2d3d&widget=true&top=40&text=Alguma%20d%C3%BAvida%3F&textcolor=ffffff&bgcolor=4E1D3A&from=bottomRigth"
  • <script id="dtbot-script" src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/dtbot.js?token=c243ce95-db6c-4ab6-9f2b-bf60d69c2d3d&amp;widget=true&amp;top=40&amp;text=Alguma%20d%C3%BAvida%3F&amp;textcolor=ffffff&amp;bgcolor=4E1D3A&amp;from=bottomRigth"></script>

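This clearly indicates that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver is detected and subsequently blocked.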
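The DOM inspection in this section can be reproduced with a short script. This is a minimal sketch, assuming the page's HTML is already available as a string (for example from driver.page_source). The names SrcCollector and find_tagged_sources and the sample markup are hypothetical and not part of the original answer. It lists the src of every <iframe> and <script> tag whose URL contains a given keyword.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect src attributes of <iframe> and <script> tags that contain a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Only <iframe> and <script> tags are of interest here.
        if tag not in ("iframe", "script"):
            return
        src = dict(attrs).get("src")
        if src and self.keyword in src:
            self.matches.append((tag, src))


def find_tagged_sources(html, keyword="dist"):
    """Return (tag, src) pairs whose src contains the keyword (plain substring match)."""
    parser = SrcCollector(keyword)
    parser.feed(html)
    return parser.matches


# Hypothetical sample markup, shortened from the tags quoted in this answer.
sample = (
    '<html><body>'
    '<script id="dtbot-script" '
    'src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/dtbot.js"></script>'
    '<script src="https://example.com/app.js"></script>'
    '<iframe src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/../index.html"></iframe>'
    '</body></html>'
)

for tag, src in find_tagged_sources(sample):
    print(tag, src)
```

Because this is a plain substring match, it flags any URL that contains "dist", so each hit still needs to be checked by eye.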


Distil

According to the article There Really Is Something About Distil.it...:

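Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.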
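The article does not say which patterns Distil looks for. As one illustration of how a page can recognize an automated browser, the W3C WebDriver specification defines a navigator.webdriver property that is true while the browser is under automation. The sketch below is an assumption about what is worth checking, not a description of Distil's detection. automation_flag is a hypothetical helper, and driver stands for any already started Selenium WebDriver.

```python
def automation_flag(driver):
    """Return navigator.webdriver as seen by scripts running on the current page."""
    # execute_script runs JavaScript in the page and hands the result back to Python.
    return driver.execute_script("return navigator.webdriver")
```

Calling print(automation_flag(driver)) after driver.get(...) shows what the page's own scripts would read for that session.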

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium [as] the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".

