Need help iterating over URLs with Beautiful Soup

Problem description

I am trying to scrape the names of all the companies listed on this site. Each page (14 pages in total) shows the names of 80 companies. Every URL ends with start=241&count=80&first=2009&last=2018, where start is the first row shown on that page. I am trying to step through the companies 80 at a time, which walks through every page, and scrape the company names. However, every time I try, I get this error on the second pass through the loop:

File "beautiful_soup_2.py", line 10, in <module>
name_table = (soup.findAll('table')[4])
File "C:\Users\adamm\Downloads\Python\lib\site-packages\bs4\element.py", line 1807, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

However, if I remove the loop and enter the URLs with start=81, 161, 241, and so on by hand, it returns the list of companies on each page without any error.

My code so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

for x in range(1,1042,80):
    sauce = ('https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%20%3D%2010-12b%20OR%20form-type%3D10-12b%2Fa&start={}&count=80&first=2009&last=2018'.format(x))

    source_link = urlopen(sauce).read()
    soup = soup(source_link, 'lxml')

    name_table = (soup.findAll('table')[4])
    table_rows = name_table.findAll('tr')

    for row in table_rows:
        cols = row.findAll('td')
        cols = [x.text.strip() for x in cols]
        print(cols)

This is driving me crazy, so any help is greatly appreciated.

Tags: python-3.x, loops, web-scraping, beautifulsoup

Solution
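
The traceback points at a name-shadowing problem rather than at the site itself. The line soup = soup(source_link, 'lxml') rebinds the imported name soup to the parsed page on the first pass through the loop. On the second pass, soup(source_link, 'lxml') therefore calls that BeautifulSoup object instead of the class; calling a BeautifulSoup object is shorthand for find_all and returns a ResultSet, so the following soup.findAll('table') fails with "ResultSet object has no attribute 'findAll'". A minimal sketch of a fix, keeping the question's URL, step size, and table index, and assuming the 'lxml' parser is installed (the variable name page_soup is chosen here for illustration):

from urllib.request import urlopen
from bs4 import BeautifulSoup

base_url = ('https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%20%3D%2010-12b'
            '%20OR%20form-type%3D10-12b%2Fa&start={}&count=80&first=2009&last=2018')

for start in range(1, 1042, 80):
    # Build the URL for this page; start advances 80 rows at a time.
    source_link = urlopen(base_url.format(start)).read()

    # Use a name that does not shadow the imported class.
    page_soup = BeautifulSoup(source_link, 'lxml')

    # As in the question, the company names are assumed to sit in the fifth table.
    name_table = page_soup.find_all('table')[4]

    for row in name_table.find_all('tr'):
        cols = [td.text.strip() for td in row.find_all('td')]
        print(cols)

Any distinct variable name for the parsed page avoids the clash; importing the class under its usual name BeautifulSoup, rather than aliasing it to soup, also makes an accidental rebinding much easier to spot.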

