Need help with pagination in web scraping

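Problem description

I am very new to Python and web scraping. Could anyone help with the pagination part of this website?

Website: https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1

I am able to scrape the data, i.e. the company name and date, from the first page. Please help me scrape the data from multiple pages.

Here is my code: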

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

driver = webdriver.Chrome(executable_path='C:\\Users\\chromedriver_win32\\chromedriver.exe')
driver.get('https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1')
driver.maximize_window()
time.sleep(1)

# Locate the company-name cells (4th column) and the date cells (1st column)
CompanyName = driver.find_elements(By.XPATH, '/html/body/div[1]/div[3]/div[2]/section/section/table/tbody/tr/td[4]')
Date = driver.find_elements(By.XPATH, '/html/body/div[1]/div[3]/div[2]/section/section/table/tbody/tr/td[1]')

# Append the text of each company-name cell to a list
Name = []
for cell in CompanyName:
    Name.append(cell.text)

data = pd.DataFrame(Name)

# Append the text of each date cell to a list
Date_ = []
for cell in Date:
    Date_.append(cell.text)

data['Date_'] = Date_  # store the extracted text, not the WebElement list
data

Tags: python, python-3.x, pandas, selenium, web-scraping

Solution

You can also click the "next" link to move to the following page and scrape the details from each page in turn.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# On Selenium 4+, pass the driver path through a Service object instead of executable_path.
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1")

details = []
for i in range(3):  # First 3 pages; increase the range to scrape more pages.
    rows = driver.find_elements(By.XPATH, "//table[@class='filings-table']/tbody/tr")  # Find the individual rows.
    print(len(rows))
    for row in rows:  # Extract details from every row.
        company = row.find_element(By.XPATH, ".//td[4]").text  # Company name from that row
        date = row.find_element(By.XPATH, ".//td[1]").text  # Date from that row
        details.append([company, date])
    driver.find_element(By.XPATH, "//a[contains(@class,'next')]").click()  # Find and click the next-page link.
    time.sleep(2)
print(len(details))
for detail in details:
    print(detail)
driver.quit()

Output:

50
50
50
150
['3rd Rock Logistics, LLC', '09/03/2021']
['3rd Rock Holdings, Inc.', '09/03/2021']
['Philippine Airlines, Inc.', '09/03/2021']
['Bennett Rosa, LLC', '09/03/2021']
['James David Theros', '09/03/2021']
...
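
The loop in this answer runs for a fixed number of pages. One possible way to stop automatically once no next page is left is sketched here. The helper `scrape_pages` and its two callback parameters are hypothetical names, not part of Selenium; they are kept independent of the driver so the control flow can be tried without a browser.

```python
from typing import Callable, List


def scrape_pages(
    read_rows: Callable[[], List[List[str]]],
    go_to_next: Callable[[], bool],
    max_pages: int = 100,
) -> List[List[str]]:
    """Collect rows page by page until go_to_next() reports no further page.

    read_rows and go_to_next are hypothetical callbacks supplied by the caller;
    max_pages is a safety cap so a broken next link cannot loop forever.
    """
    details: List[List[str]] = []
    for _ in range(max_pages):
        details.extend(read_rows())  # rows of the page currently shown
        if not go_to_next():         # False means there is no next page
            break
    return details


# Stand-in for a site with three pages, used here instead of a real browser.
pages = [[["A", "09/03/2021"]], [["B", "09/02/2021"]], [["C", "09/01/2021"]]]
position = {"index": 0}


def read_rows() -> List[List[str]]:
    return pages[position["index"]]


def go_to_next() -> bool:
    if position["index"] + 1 >= len(pages):
        return False
    position["index"] += 1
    return True


result = scrape_pages(read_rows, go_to_next)
print(result)
```

With Selenium, `read_rows` would wrap the row extraction shown in the answer, and `go_to_next` could call `driver.find_elements(By.XPATH, "//a[contains(@class,'next')]")`, click the first match and return True, or return False when the list is empty. Whether the site actually drops the next link on its last page is an assumption to verify in the browser.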

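Since the page number appears in the URL as the `cpage` query parameter, another option may be to load each page directly instead of clicking the next link. This is only a sketch: it assumes `cpage` simply counts up from 1 (the question only shows `cpage=1`), and `page_urls` is a hypothetical helper name.

```python
from typing import List

BASE_URL = "https://www.dailydac.com/chapter-11-bankruptcy-alert-system/"


def page_urls(first: int, last: int) -> List[str]:
    """Build one URL per page, assuming cpage numbers the pages from 1 upward."""
    return [f"{BASE_URL}?cpage={page}" for page in range(first, last + 1)]


urls = page_urls(1, 3)
for url in urls:
    print(url)
    # With a live driver: driver.get(url), then extract the rows as before.
```

This avoids depending on the next link being clickable, at the cost of relying on the URL scheme staying the same.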