Step-by-step table scraping and pagination with BeautifulSoup

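Problem description

I am trying to scrape this website using the BeautifulSoup package. Using pointers from this solution I have managed to scrape a single page, but I am now trying to implement pagination.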

import pandas as pd
import requests
from bs4 import BeautifulSoup
    
for num in range(0, 800,80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start='+ str(num)
    r = requests.get(url)
    html = r.text

    soup = BeautifulSoup(html)
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    final = []
    for row in rows[1:]:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    final = final.append(data)

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type', 
                                     'Ownership D or I', 'Number of Securities Transacted', 'Number of Securities Owned',
                                     'Line Number', 'Owner CIK', 'Security Name'])

print(result)

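The pages advance in increments of 80. However, I cannot get all the pages into the same DataFrame. I tried to create a list called final to which the data from each page is appended, but I did not succeed.

Tags: python, python-3.x, beautifulsoup

Solution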


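You have to create the final list outside the loop. If it is created inside the loop, it is reset to an empty list on every page.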

import pandas as pd
import requests
from bs4 import BeautifulSoup
 
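final = []  # created once, outside the loop, so rows from every page accumulate
for num in range(0, 800, 80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start=' + str(num)
    r = requests.get(url)
    html = r.text

    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    for row in rows[1:]:  # skip the header row
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    # list.append returns None, so its result must not be assigned back to final;
    # extend adds this page's rows one by one instead of nesting the whole page
    final.extend(data)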

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type', 
                                     'Ownership D or I', 'Number of Securities Transacted', 'Number of Securities Owned',
                                     'Line Number', 'Owner CIK', 'Security Name'])

print(result)
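
The accumulation pattern can be checked without touching the network. The sketch below is not part of the original answer: fake_page is a hypothetical stand-in for the request-and-parse step, and the row contents are made up. It only illustrates two points: list.append returns None, and extend keeps the result as a flat list of rows, which is the shape pd.DataFrame expects.

```python
def fake_page(start, page_size=2):
    """Hypothetical stand-in for fetching and parsing one page of rows."""
    return [[f"row{start + i}", "A"] for i in range(page_size)]

# list.append mutates the list in place and returns None,
# so `final = final.append(data)` would replace the list with None.
probe = []
returned = probe.append(["x"])

nested = []  # append adds each page as one element: a list of pages
flat = []    # extend adds each page's rows individually: a list of rows
for start in range(0, 6, 2):
    page = fake_page(start)
    nested.append(page)
    flat.extend(page)

print(returned)     # None
print(len(nested))  # 3 (pages)
print(len(flat))    # 6 (rows)
```

With append, every element of the outer list is a whole page, so a DataFrame built from it would have one row per page rather than one row per transaction. With extend, each element is a single table row.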
