python - Step-by-step table scraping and pagination with BeautifulSoup
Problem Description
I am trying to scrape this website with the BeautifulSoup package. Using pointers from this solution I have successfully scraped a single page, but now I am trying to implement pagination.
import pandas as pd
import requests
from bs4 import BeautifulSoup

for num in range(0, 800, 80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start=' + str(num)
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html)
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    final = []
    for row in rows[1:]:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    final = final.append(data)

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type',
                                      'Ownership D or I', 'Number of Securities Transacted',
                                      'Number of Securities Owned', 'Line Number', 'Owner CIK',
                                      'Security Name'])
print(result)
The pages increment in steps of 80. However, I cannot get all the pages into the same DataFrame. I tried to create a list called final and append the data from each page to it, but I was not successful.
Solution
You have to move the final list outside the loop so it accumulates rows from every page instead of being reset each iteration. Note also that list.append returns None, so the reassignment final = final.append(data) destroys the list; extend the list in place instead.
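To see why the original reassignment produced nothing, note that append (and extend) mutate the list and return None; a minimal sketch of the two patterns with toy data:

```python
# Correct pattern: accumulate in place across iterations.
final = []
for chunk in ([1, 2], [3, 4]):
    final.extend(chunk)  # extend flattens each page's rows into one list
# final is now [1, 2, 3, 4]

# Buggy pattern from the question: append returns None,
# so the assignment wipes out the list on every iteration.
bad = []
bad = bad.append([1, 2])
# bad is now None, and passing None to pd.DataFrame fails
```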
import pandas as pd
import requests
from bs4 import BeautifulSoup

final = []  # accumulate rows from every page here, outside the loop
for num in range(0, 800, 80):
    url = 'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000018349&type=&dateb=&owner=include&start=' + str(num)
    # SEC asks automated clients to identify themselves via a User-Agent header
    r = requests.get(url, headers={'User-Agent': 'your-name your-email@example.com'})
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find('table', id="transaction-report")
    rows = table.find_all('tr')
    data = []
    for row in rows[1:]:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    final.extend(data)  # extend in place; final = final.append(data) would set final to None

result = pd.DataFrame(final, columns=['A or D', 'Date', 'Reporting Owner', 'Form', 'Transaction Type',
                                      'Ownership D or I', 'Number of Securities Transacted',
                                      'Number of Securities Owned', 'Line Number', 'Owner CIK',
                                      'Security Name'])
print(result)
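An equivalent, arguably more idiomatic pattern is to build one DataFrame per page and combine them at the end with pd.concat. Sketched here with dummy page data in place of the live requests (parse_page and the two-column layout are stand-ins for illustration, not the real SEC table):

```python
import pandas as pd

def parse_page(rows):
    # Stand-in for the scraping step: each page yields a list of row lists.
    return pd.DataFrame(rows, columns=['A or D', 'Date'])

pages = [
    [['A', '2020-01-01'], ['D', '2020-01-02']],  # page 1
    [['A', '2020-01-03']],                       # page 2
]
frames = [parse_page(rows) for rows in pages]
# ignore_index=True renumbers rows 0..n-1 across all pages
result = pd.concat(frames, ignore_index=True)
print(len(result))  # 3 rows combined from both pages
```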