Selenium only grabs the first page in a loop

Problem description

I'm having a hard time figuring out why my code won't refresh the DOM and grab new results. I want to scrape, from every page:

case title, date, PDF link, details link

The script scrapes the first page of results and clicks the "Next" button; the table counter keeps increasing, but the results that follow each click of the Next button are still from the first page.

The web page is here.

My relevant code:

    url = 'https://www.govinfo.gov/app/collection/uscourts/district/caed/2021/%7B%22pageSize%22%3A%22500%22%2C%22offset%22%3A%220%22%7D'
    driver.get(url)
    

    page = driver.page_source
    soup = bs(page, "html.parser")

    cnt = 0
    while True:
        tables = soup.find_all('table', class_='table')
        
        # WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, '//span[@class="custom-paginator"]')))
        for my_table in tables:

            cnt += 1
            print ('=============== Table {} ==============='.format(cnt))

            print('Court: ' + 'United States Court ' + value)
            rows = my_table.find_all('td')                  
            for row in rows:
                cells = row.find_all('b')
                # getting case title
                for cell in cells:
                    span_1 = cell.find('span', {'class':'results-line1'}).text
                    print('Case: ' + span_1)
                # getting case date
                next_cells = row.find_all('em')
                for next_cell in next_cells:
                    span_2 = next_cell.find('span', {'class':'results-line2'}).text
                    print('Date: ' + span_2)
                # links = []
                links = row.find_all('a', href=True)

                # grabbing the pdf link then the details link only
                for link in links:
                    
                    start_link = 'https://www.govinfo.gov'
                    pdf = (link.get('href'))
                    pdf_link = re.search("pdf$", pdf)
                    fixed_link = "".join((start_link,pdf))
                    if pdf_link:
                        print('Link: ' + fixed_link)
                    elif '/details' in pdf:
                        print('Details: ' + fixed_link)
                    else:
                        break
                    
                    

        try:
            next_page = driver.find_elements_by_class_name('fw-pagination-btn')
            if len(next_page) <1:
                print("No more pages left")
                break
            else:
                
                pages = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='custom-paginator']//li[@class='next fw-pagination-btn']/a")))
                page_count = 0
                print('clicking next button')
                page_count +=1
                print('---------page{}---------'.format(page_count))
                pages.click()
                
                time.sleep(7)
        except TimeoutException:
            break
    


driver.quit()

I can't seem to figure out how to refresh the data in the "try" part of the code. Any help is appreciated.

Tags: python, selenium

Solution
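The page source is parsed only once, before the `while` loop, so `soup` is a frozen snapshot of the first page: BeautifulSoup never talks to the browser again, no matter how often the Next button is clicked. The fix is to re-read `driver.page_source` at the top of every iteration, after each click has settled. A minimal sketch of that change, with the row extraction pulled into a helper (`parse_tables` is a hypothetical name, not from the original script):

```python
from bs4 import BeautifulSoup as bs

def parse_tables(html):
    """Extract (case title, date) pairs from one page of results HTML."""
    soup = bs(html, "html.parser")
    results = []
    for table in soup.find_all("table", class_="table"):
        for row in table.find_all("td"):
            case = row.find("span", {"class": "results-line1"})
            date = row.find("span", {"class": "results-line2"})
            if case and date:
                results.append((case.text, date.text))
    return results

# Inside the scraping loop the crucial change is re-parsing the DOM on
# every pass (sketch only; driver setup, link handling, and the Next-button
# logic stay as in the question):
#
#     while True:
#         rows = parse_tables(driver.page_source)  # fresh snapshot each pass
#         ...
#         pages.click()
#         time.sleep(7)
```

Applied to the original code verbatim, this means moving `page = driver.page_source` and `soup = bs(page, "html.parser")` to the first two lines inside `while True:`, so each iteration parses the page that is actually showing after the click.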
