python - Script suddenly stops crawling with no error or exception
Question
I have no idea why, but my script always stops crawling once it reaches page 9. There are no errors, exceptions, or warnings, so I'm a bit at a loss.
Can somebody help me out?
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 is len(items):
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()
Printing the length of items also triggers some strange behaviour. Instead of always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, and so on. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason it ran into trouble on page 9. I'm running tests right now. Update: it is now scraping page 10 and beyond, so the issue is resolved.
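For reference, a minimal sketch of that change in context, with the narrowed XPath dropped into the existing wait (ff, WebDriverWait, EC, and By are the names from the script above):

    items = WebDriverWait(ff, 15).until(
        EC.visibility_of_all_elements_located((
            By.XPATH,
            '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]'
        ))
    )
    # Each match is now a single deal container, so deals from previously
    # visited pages no longer accumulate in items (32, 64, 96, ...).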
Solution
As per the 10th revision of your question, the error message...
HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.
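As an aside, one illustrative way to surface and survive this kind of failure is to wrap get() in a retry helper that recreates the driver when the connection dies. This is a hedged sketch, not part of the original script; safe_get() is a hypothetical helper, and create_webdriver_instance() is the function defined further below:

    def safe_get(url, attempts=3):
        # Hypothetical helper: retry ff.get() with a fresh driver whenever the
        # connection to the local driver dies (e.g. urllib3's MaxRetryError).
        for attempt in range(attempts):
            ff = create_webdriver_instance()
            try:
                ff.get(url)
                return ff
            except Exception as error:
                print('get() failed on attempt {}: {}'.format(attempt + 1, error))
                ff.quit()
        raise RuntimeError('could not load {}'.format(url))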
A couple of things:
As per the discussion max-retries-exceeded exceptions are confusing, the traceback is somewhat misleading: Requests wraps the exception for the user's convenience, and the original exception is part of the message displayed.
Requests never retries (it sets retries=0 for urllib3's HTTPConnectionPool), so the error would have been far more canonical without the MaxRetryError and HTTPConnectionPool keywords. An ideal traceback would therefore have been:
NewConnectionError(<class 'socket.error'>: [Errno 10061] No connection could be made because the target machine actively refused it)
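To make the retries=0 point concrete, here is a hedged sketch (not from the original script) of how retries must be opted into explicitly when using the requests library, via urllib3's Retry helper:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    # requests performs no retries by default; mounting an adapter changes that
    retry = Retry(total=3, backoff_factor=0.5)
    session.mount('http://', HTTPAdapter(max_retries=retry))
    session.mount('https://', HTTPAdapter(max_retries=retry))
    response = session.get('https://www.amazon.ca/gp/goldbox')  # example request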
Solution
As per the Release Notes of Selenium 3.14.1:
* Fix ability to set timeout for urllib3 (#6286)
The merge in question is: repair urllib3 can't set timeout!
Conclusion
Once you upgrade to Selenium 3.14.1 you will be able to set a timeout, see canonical tracebacks, and take the required action.
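For illustration, a minimal sketch of setting that timeout, assuming Selenium 3.14.1's RemoteConnection.set_timeout() classmethod (with #6286 merged, the value should actually reach urllib3):

    from selenium import webdriver
    from selenium.webdriver.remote.remote_connection import RemoteConnection

    # Hedged sketch: raise the client-side HTTP timeout for driver traffic
    # before creating a session; in 3.14.1 the value is honoured by urllib3.
    RemoteConnection.set_timeout(120)  # seconds
    ff = webdriver.Firefox()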
References
A couple of relevant references:
This use case
I took your complete script from codepen.io - A PEN BY Anthony. I had to make a few adjustments to your existing code, as follows:
As you used:
ua_string = random.choice(ua_strings)
you must mandatorily import random as:
import random
You created the variable next_button but never used it. I condensed the following four lines:

    next_button = WebDriverWait(ff, 15).until(
        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
    )
    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()

into:

    WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
Your modified code block will be:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
import random

""" Set Global Variables """
ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
already_scraped_product_titles = []

""" Create Instances of WebDriver """
def create_webdriver_instance():
    ua_string = random.choice(ua_strings)
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', ua_string)
    options = Options()
    options.add_argument('--headless')
    return webdriver.Firefox(profile)

""" Construct List of UA Strings """
def fetch_ua_strings():
    ff = create_webdriver_instance()
    ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
    ua_strings_ff_eles = ff.find_elements_by_xpath('//td[@class="useragent"]')
    for ua_string in ua_strings_ff_eles:
        if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
            ua_strings.append(ua_string.text)
    ff.quit()

""" Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links """
def log_in(ff):
    ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
    ff.find_element(By.ID, 'ap_email').send_keys('anthony_falez@hotmail.com')
    ff.find_element(By.ID, 'continue').click()
    ff.find_element(By.ID, 'ap_password').send_keys('lo0kyLoOkYig0t4h')
    ff.find_element(By.NAME, 'rememberMe').click()
    ff.find_element(By.ID, 'signInSubmit').click()

""" Build Lists of Product Page URLs """
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            # For Groups of Items on Sale
            # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    # Scrape Details of Each Deal
                    #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                    print(product_title[:10])
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 == len(items):  # use ==, not is: identity comparison of ints is unreliable above 256
                try:
                    print('')
                    print('new page')
                    WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    time.sleep(10)
                    url = ff.current_url
                    print(url)
                    print('')
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    """
                    ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    """
                    print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next?")')
                    print('Because of... \n{}'.format(error))
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

#def extract_info(ff, url):

fetch_ua_strings()
initiate_crawl()
Console output: With Selenium v3.14.0 and Firefox Quantum v62.0.3 I could extract the following output on the console:

    J.Rosée Si
    B.Catcher
    Bluetooth4
    FRAM G4164
    Major Crim
    20% off Oh
    True Blood
    Prime-Line
    Marathon 3
    True Blood
    B.Catcher
    4 Film Fav
    True Blood
    Texture Pa
    Westinghou
    True Blood
    ThermoPro
    ...
    ...
    ...
Note: I could have optimized your code to perform the same web-scraping operations by initializing the Firefox browser client only once and iterating through the various products and their details. But to preserve your logic and innovation, I have suggested only the minimal changes required to get you through.
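For illustration, a rough sketch of that single-instance approach, not the author's code: it assumes the imports and the create_webdriver_instance() and already_scraped_product_titles names from the modified script above, and reuses one Firefox session across all pages instead of recreating it after every deal and every pagination step.

    def crawl_in_one_session(start_url):
        ff = create_webdriver_instance()
        ff.get(start_url)
        while True:
            items = WebDriverWait(ff, 15).until(
                EC.visibility_of_all_elements_located(
                    (By.XPATH, '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]')))
            for item in items:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    print(product_title[:10])
            try:
                # stop when no 'Next' pagination link remains on the page
                ff.find_element(By.PARTIAL_LINK_TEXT, 'Next').click()
            except Exception:
                break
        ff.quit()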