Selenium scraper works, but after a while Chrome shows "This site can't be reached"

Problem description

I am scraping the US patent website. Its robots.txt places no restrictions on crawling, but after a few hundred requests I run into this error: [screenshot: Chrome "This site can't be reached" page]

I clear cookies after every search request, and I have also tried using different proxies. Any ideas why this happens? My code works fine, but after 10–20 minutes of scraping I get this error.

Here is my code, though I doubt it will help much, since it has worked fine so far:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
import time
import pandas as pd
from fake_useragent import UserAgent
from webdriver_manager.chrome import ChromeDriverManager


PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(executable_path=PATH)
num_rows = 50000
df = pd.read_csv('company_names.csv').head(500)
df_new = pd.DataFrame(index=range(num_rows),columns=['company_name','link','patent title','abstract','company_id'])
row_number = 0
for company in df['company_name']:
    company_id = df.loc[df.company_name == company, 'company_id'].values[0]
    print(company_id)
    df_new.iloc[row_number,4]=str(company_id)
    print(company)
    df_new.iloc[row_number,0]=str(company)
    # loading the landing page first is redundant; go straight to advanced search
    driver.get("http://patft.uspto.gov/netahtml/PTO/search-adv.htm")
    search_box = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/center/form/table/tbody/tr[1]/td[1]/textarea")))
    print('found search box')
    search_box.send_keys("AN/"+'"'+str(company)+'"')
    # click() returns None, so there is no point assigning its result
    driver.find_element_by_xpath("/html/body/center/form/table/tbody/tr[2]/td[2]/input[1]").click()
    #multiple results
        
    check_table = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/table/tbody/tr[1]/th[1]")))
    if check_table.text == 'PAT. NO.':
        #multiple links
        rows = driver.find_elements_by_xpath("/html/body/table/tbody/tr")
        num_patents = len(rows)-1
        min_patents = min(10,num_patents)
        for row in range(min_patents):
            df_new.iloc[row_number,4]=str(company_id)
            df_new.iloc[row_number,0]=str(company)
            title_link = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH,"/html/body/table/tbody/tr["+str(row+2)+"]/td[4]/a")))
            link = title_link.get_attribute('href')
            print(str(link))
            title_text = title_link.text
            print(title_text)
            df_new.iloc[row_number,1] = str(link)
            df_new.iloc[row_number,2] = str(title_text)
            title_link.click()
            abstract = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/p[1]")))
            print(abstract.text)
            df_new.iloc[row_number,3] = str(abstract.text)
            row_number += 1
            driver.back()
            #get patent abstract data

    elif check_table.text == 'Inventors:':
        #one link
        df_new.iloc[row_number,4]=str(company_id)
        df_new.iloc[row_number,0]=str(company)
        abstract = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/p[1]")))
        link = driver.current_url
        df_new.iloc[row_number,1] = str(link)
        abstract_text = abstract.text
        title = driver.find_element_by_xpath('/html/body/font')
        title_text = title.text
        print(title_text)
        df_new.iloc[row_number,2] = str(title_text)
        print(abstract_text)
        df_new.iloc[row_number,3] = str(abstract_text)
        row_number += 1
    # clear cookies after every company search, not only in the single-result branch
    driver.delete_all_cookies()


df_new.to_csv('patent_results.csv')
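As an aside, the script imports `fake_useragent` and `webdriver_manager` but never uses them. If the intent was to rotate user agents between runs, the wiring might look like the sketch below (hypothetical: the UA pool and `chrome_ua_argument` helper are illustrative, and with `fake_useragent` installed you would use `UserAgent().random` instead of a hand-picked list):

```python
import random

# A small pool of real desktop user-agent strings; with fake_useragent
# installed you would call UserAgent().random instead of sampling this list.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
]

def chrome_ua_argument(pool=UA_POOL):
    """Build the command-line flag Chrome expects for a custom user agent."""
    return "user-agent=" + random.choice(pool)

# Usage with the script above (Selenium 3 style):
#   options = Options()
#   options.add_argument(chrome_ua_argument())
#   driver = webdriver.Chrome(executable_path=PATH, options=options)
```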

Tags: selenium, selenium-chromedriver

Solution


The USPTO website's terms of use explain this:

USPTO's online databases are not designed or intended to be a source for bulk downloads of USPTO data when accessed through the website interfaces. Individuals, companies, IP addresses, or blocks of IP addresses who, in effect, deny or decrease service by generating unusually high numbers of database accesses (searches, pages, or hits), whether generated manually or in an automated fashion, may be denied access to USPTO servers without notice.
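In other words, the block is triggered by request volume, not by cookies or proxies. The practical mitigation is to pace requests and back off when the site stops responding. A minimal sketch of an exponential-backoff delay schedule (the base delay, growth factor, and retry count here are illustrative assumptions, not limits published by the USPTO):

```python
import random

def backoff_delays(base=5.0, factor=2.0, retries=4, max_jitter=0.5):
    """Seconds to sleep before each retry: base * factor**attempt, plus a
    little random jitter so concurrent scrapers do not retry in lockstep."""
    return [base * factor ** attempt + random.uniform(0, max_jitter)
            for attempt in range(retries)]

# In the scraping loop you would sleep between driver.get() calls and,
# when a page fails to load, walk through these growing delays:
#   for delay in backoff_delays():
#       driver.get(url)
#       if page_loaded(driver):   # hypothetical success check
#           break
#       time.sleep(delay)
```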

And the paragraph after that:

Bulk data products may be obtained separately from the USPTO, either at no cost or for the cost of dissemination. For details, see the information regarding electronic bulk data products.
