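python - Scraping multiple links with BeautifulSoup
Problem description
I want to scrape this website with BeautifulSoup: first extract every company link, then open the links one by one. On each page I want to scrape the company name, ticker symbol, and stock exchange, and extract the PDF links when any are available. The results should then be written to a CSV file.
To do this, I first tried the following: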
import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://www.responsibilityreports.co.uk/Companies?a=#')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []
base = 'https://www.responsibilityreports.co.uk'

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
    print(link)

try:
    for link in links:
        url = base + link
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        for j in soup.find_all('a', href=True):
            print(j)
except:
    pass
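As far as I know, this website does not forbid scraping. The code does print every link, but it never opens any of them, so my scraper cannot move on to the remaining tasks.
Thanks in advance!
Solution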
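In your code the hrefs are appended to `data`, but the second loop iterates over `links`, which stays empty, so no company page is ever requested; the bare `except: pass` would also hide any error that did occur. You can use this example to iterate over all the company links: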
import requests
from bs4 import BeautifulSoup

url = "https://www.responsibilityreports.co.uk/Companies?a=#"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Keep only the anchors whose href starts with /Company (the company pages)
links = [
    "https://www.responsibilityreports.co.uk" + a["href"]
    for a in soup.select('a[href^="/Company"]')
]

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    name = soup.select_one("h1").get_text(strip=True)
    ticker = soup.select_one(".ticker_name")
    if ticker:
        ticker = ticker.get_text(strip=True)
    else:
        ticker = "N/A"
    # extract other info...
    print(name)
    print(ticker)
    print(link)
    print("-" * 80)
Prints:
3i Group plc
III
https://www.responsibilityreports.co.uk/Company/3i-group-plc
--------------------------------------------------------------------------------
3M Corporation
MMM
https://www.responsibilityreports.co.uk/Company/3m-corporation
--------------------------------------------------------------------------------
AAON Inc.
AAON
https://www.responsibilityreports.co.uk/Company/aaon-inc
--------------------------------------------------------------------------------
ABB Ltd
ABB
https://www.responsibilityreports.co.uk/Company/abb-ltd
--------------------------------------------------------------------------------
Abbott Laboratories
ABT
https://www.responsibilityreports.co.uk/Company/abbott-laboratories
--------------------------------------------------------------------------------
Abbvie Inc
ABBV
https://www.responsibilityreports.co.uk/Company/abbvie-inc
--------------------------------------------------------------------------------
Abercrombie & Fitch
ANF
https://www.responsibilityreports.co.uk/Company/abercrombie-fitch
--------------------------------------------------------------------------------
ABM Industries, Inc.
ABM
https://www.responsibilityreports.co.uk/Company/abm-industries-inc
--------------------------------------------------------------------------------
Acadia Realty Trust
AKR
https://www.responsibilityreports.co.uk/Company/acadia-realty-trust
--------------------------------------------------------------------------------
Acciona
N/A
https://www.responsibilityreports.co.uk/Company/acciona
--------------------------------------------------------------------------------
ACCO Brands
ACCO
https://www.responsibilityreports.co.uk/Company/acco-brands
--------------------------------------------------------------------------------
...and so on.
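The `# extract other info...` step and the CSV output the question asks for could be sketched roughly as follows. This is only a sketch under assumptions: `.ticker_name` comes from the code above, but `.exchange_name` is a hypothetical placeholder selector (inspect a company page to find the real one), the helper names `text_or_na`, `parse_company`, and `write_csv` are invented for illustration, and PDF links are assumed to be any anchor whose `href` ends in `.pdf`.

```python
import csv
from urllib.parse import urljoin

from bs4 import BeautifulSoup

FIELDS = ["name", "ticker", "exchange", "url", "pdf_links"]


def text_or_na(soup, selector):
    """Return the stripped text of the first match, or "N/A" if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag else "N/A"


def parse_company(html, page_url, exchange_selector=".exchange_name"):
    """Build one CSV row from a company page.

    exchange_selector is a hypothetical placeholder, not confirmed
    against the real site markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Collect every anchor pointing at a PDF and make the URL absolute.
    pdfs = [
        urljoin(page_url, a["href"])
        for a in soup.select("a[href]")
        if a["href"].lower().endswith(".pdf")
    ]
    return {
        "name": text_or_na(soup, "h1"),
        "ticker": text_or_na(soup, ".ticker_name"),
        "exchange": text_or_na(soup, exchange_selector),
        "url": page_url,
        # Several PDFs share one cell, separated by "; ".
        "pdf_links": "; ".join(pdfs),
    }


def write_csv(rows, fh):
    """Write the row dicts to an open text file handle as CSV."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Inside the loop above, each row could be built with `parse_company(requests.get(link).text, link)` and collected in a list, then written once at the end with `write_csv(rows, open("companies.csv", "w", newline="", encoding="utf-8"))`. Keeping the parsing separate from the download makes it possible to test the selectors against a saved page without hitting the site.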