Script suddenly stops crawling, with no errors or exceptions

Problem Description

I have no idea why, but my script always stops crawling once it reaches page 9. There are no errors, exceptions, or warnings, so I'm a little at a loss.

Can anyone help me out?

PS: Here's the full script, in case anyone wants to test it for themselves!

# Imports needed by this snippet (create_webdriver_instance and
# already_scraped_product_titles are defined elsewhere in the full script)
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count + 1 == len(items):  # compare with '==', not 'is' (identity check on ints is unreliable)
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()

Printing the length of items also exhibits some odd behavior. Instead of always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, and so on. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason it was running into problems on page 9. I'm running tests now. Update: it is now scraping page 10 and beyond, so the issue is resolved.
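For reference, a minimal sketch of that corrected wait (assuming the same ff driver and imports as in the script above; only the XPath changed):

    # Match the per-deal "dealContainer" divs instead of the outer
    # "100_dealView_" wrappers, so len(items) stays at 32 per page
    items = WebDriverWait(ff, 15).until(
        EC.visibility_of_all_elements_located(
            (By.XPATH, '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]')
        )
    )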

Tags: python, selenium, python-requests, geckodriver, urllib3

Solution


As per your 10th revision of this question, the error message...

HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.
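For context, this exact traceback commonly appears when a command is sent to a WebDriver whose session has already been quit: the local geckodriver server (here 127.0.0.1:58992) is no longer listening, so urllib3 cannot open a new connection. A hypothetical repro sketch:

    from selenium import webdriver

    ff = webdriver.Firefox()
    ff.quit()  # geckodriver shuts down; the session endpoint goes away
    ff.get('https://www.amazon.ca')  # raises MaxRetryError / NewConnectionError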

A couple of things:

Solution

As per the Selenium 3.14.1 release notes:

* Fix ability to set timeout for urllib3 (#6286)

The corresponding merge is: repair urllib3 can't set timeout!

Conclusion

Once you upgrade to Selenium 3.14.1, you will be able to set timeouts, see canonical tracebacks, and take the required action.
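A quick way to confirm which Selenium you are actually running after the upgrade (the pip command is shown as a comment):

    # pip install --upgrade selenium==3.14.1
    import selenium
    print(selenium.__version__)  # expect '3.14.1' or later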



This use case

I took your full script from codepen.io - A PEN BY Anthony. I had to make a few adjustments to your existing code, as follows:

  • As you have used:

      ua_string = random.choice(ua_strings)
    

You have to add the import for random:

    import random
  • You have created the variable next_button but never used it. I condensed the following four lines:

      next_button = WebDriverWait(ff, 15).until(
                      EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                  )
      ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
    

    into:

      WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
      ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()              
    
  • Your modified code block will be:

      # -*- coding: utf-8 -*-
      from selenium import webdriver
      from selenium.webdriver.firefox.options import Options
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.support.ui import WebDriverWait
      import time
      import random
    
    
      """ Set Global Variables
      """
      ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
      already_scraped_product_titles = []
    
    
    
      """ Create Instances of WebDriver
      """
      def create_webdriver_instance():
          ua_string = random.choice(ua_strings)
          profile = webdriver.FirefoxProfile()
          profile.set_preference('general.useragent.override', ua_string)
          options = Options()
          options.add_argument('--headless')
          return webdriver.Firefox(firefox_profile=profile, options=options)  # pass options so '--headless' actually takes effect
    
    
    
      """ Construct List of UA Strings
      """
      def fetch_ua_strings():
          ff = create_webdriver_instance()
          ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
          ua_strings_ff_eles = ff.find_elements(By.XPATH, '//td[@class="useragent"]')
          for ua_string in ua_strings_ff_eles:
              if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
                  ua_strings.append(ua_string.text)
          ff.quit()
    
    
    
      """ Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links
      """
      def log_in(ff):
          ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
          ff.find_element(By.ID, 'ap_email').send_keys('your_email@example.com')  # placeholder: use your own credentials
          ff.find_element(By.ID, 'continue').click()
          ff.find_element(By.ID, 'ap_password').send_keys('your_password')  # placeholder: use your own credentials
          ff.find_element(By.NAME, 'rememberMe').click()
          ff.find_element(By.ID, 'signInSubmit').click()
    
    
    
      """ Build Lists of Product Page URLs
      """
      def initiate_crawl():
          def refresh_page(url):
              ff = create_webdriver_instance()
              ff.get(url)
              ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
              ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
              items = WebDriverWait(ff, 15).until(
                  EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
              )
              for count, item in enumerate(items):
                  slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
                  active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
                  # For Groups of Items on Sale
                  # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
                  if len(slashed_price) > 0 and len(active_deals) > 0:
                      product_title = item.find_element(By.ID, 'dealTitle').text
                      if product_title not in already_scraped_product_titles:
                          already_scraped_product_titles.append(product_title)
                          url = ff.current_url
                          # Scrape Details of Each Deal
                          #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                          print(product_title[:10])
                          ff.quit()
                          refresh_page(url)
                          break
                  if count + 1 == len(items):  # compare with '==', not 'is' (identity check on ints is unreliable)
                      try:
                          print('')
                          print('new page')
                          WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                          ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                          time.sleep(10)
                          url = ff.current_url
                          print(url)
                          print('')
                          ff.quit()
                          refresh_page(url)
                      except Exception as error:
                          """
                          ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                          url = ff.current_url
                          ff.quit()
                          refresh_page(url)
                          """
                          print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                          print('Because of... {}'.format(error))
                          ff.quit()

          refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')
    
      #def extract_info(ff, url):
      fetch_ua_strings()
      initiate_crawl()
    
  • Console output: with Selenium v3.14.0 and Firefox Quantum v62.0.3, I was able to extract the following output on the console:

      J.Rosée Si
      B.Catcher 
      Bluetooth4
      FRAM G4164
      Major Crim
      20% off Oh
      True Blood
      Prime-Line
      Marathon 3
      True Blood
      B.Catcher 
      4 Film Fav
      True Blood
      Texture Pa
      Westinghou
      True Blood
      ThermoPro 
      ...
      ...
      ...
    

Note: I could have optimized your code to perform the same web-scraping operations by initializing the Firefox browser client only once and iterating through the various products and their details. But to preserve your logic and innovation, I have suggested only the minimal changes needed to get you through.
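As an illustration of that single-instance approach, here is a minimal sketch (untested against the live page; the function name crawl_single_instance is mine, and it reuses the XPaths from the script above together with the narrower dealContainer selector from the question's update):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def crawl_single_instance(start_url):
        # One Firefox client for the whole crawl, instead of quitting and
        # recreating the driver after every scraped product
        ff = webdriver.Firefox()
        ff.get(start_url)
        seen_titles = set()
        while True:
            items = WebDriverWait(ff, 15).until(
                EC.visibility_of_all_elements_located(
                    (By.XPATH, '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]')
                )
            )
            for item in items:
                slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
                active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
                if slashed_price and active_deals:
                    title = item.find_element(By.ID, 'dealTitle').text
                    if title not in seen_titles:
                        seen_titles.add(title)
                        print(title[:10])
            # Advance to the next page; stop when no "Next→" link remains
            next_links = ff.find_elements(By.PARTIAL_LINK_TEXT, 'Next→')
            if not next_links:
                break
            next_links[0].click()
        ff.quit()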

