Scraping multiple links with BeautifulSoup

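Question

I want to scrape this website with BeautifulSoup: first extract every link, then open the links one by one. On each page I want to scrape the company name, ticker symbol, and stock exchange, and extract the PDF links when there are any. The results should then be written to a CSV file.

To do this, I first tried the following: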

import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://www.responsibilityreports.co.uk/Companies?a=#')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []
base = 'https://www.responsibilityreports.co.uk'

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
    print(link)
    try:
        for link in links:
            url = base + link
            req = requests.get(url)
            soup = BeautifulSoup(req.content, 'html.parser')
            for j in soup.find_all('a', href=True):
                print(j)
    except:
        pass

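As far as I know, this website does not prohibit scraping. Although the code does give me every link, I can't open them, so my scraper can't continue with the remaining tasks.

Thanks in advance!

Tags: python, web-scraping, beautifulsoup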

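Answer

Your loop never opens any page because the hrefs are appended to `data`, while the inner loop iterates over `links`, which stays empty; the bare `except: pass` would also hide any request error. You can use this example to iterate over all company links: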

import requests
from bs4 import BeautifulSoup


url = "https://www.responsibilityreports.co.uk/Companies?a=#"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = [
    "https://www.responsibilityreports.co.uk" + a["href"]
    for a in soup.select('a[href^="/Company"]')
]

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")

    name = soup.select_one("h1").get_text(strip=True)
    ticker = soup.select_one(".ticker_name")
    if ticker:
        ticker = ticker.get_text(strip=True)
    else:
        ticker = "N/A"

    # extract other info...

    print(name)
    print(ticker)
    print(link)
    print("-" * 80)

Prints:

3i Group plc
III
https://www.responsibilityreports.co.uk/Company/3i-group-plc
--------------------------------------------------------------------------------
3M Corporation
MMM
https://www.responsibilityreports.co.uk/Company/3m-corporation
--------------------------------------------------------------------------------
AAON Inc.
AAON
https://www.responsibilityreports.co.uk/Company/aaon-inc
--------------------------------------------------------------------------------
ABB Ltd
ABB
https://www.responsibilityreports.co.uk/Company/abb-ltd
--------------------------------------------------------------------------------
Abbott Laboratories
ABT
https://www.responsibilityreports.co.uk/Company/abbott-laboratories
--------------------------------------------------------------------------------
Abbvie Inc
ABBV
https://www.responsibilityreports.co.uk/Company/abbvie-inc
--------------------------------------------------------------------------------
Abercrombie & Fitch
ANF
https://www.responsibilityreports.co.uk/Company/abercrombie-fitch
--------------------------------------------------------------------------------
ABM Industries, Inc.
ABM
https://www.responsibilityreports.co.uk/Company/abm-industries-inc
--------------------------------------------------------------------------------
Acadia Realty Trust
AKR
https://www.responsibilityreports.co.uk/Company/acadia-realty-trust
--------------------------------------------------------------------------------
Acciona
N/A
https://www.responsibilityreports.co.uk/Company/acciona
--------------------------------------------------------------------------------
ACCO Brands
ACCO
https://www.responsibilityreports.co.uk/Company/acco-brands
--------------------------------------------------------------------------------

...and so on.
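
The `# extract other info...` step and the CSV output asked for in the question could be filled in roughly as follows. This is only a sketch under assumptions: `.ticker_name` and `h1` come from the example above, but `.exchange_name` is a hypothetical selector (inspect a company page to find the real one), and PDF links are assumed to be `<a>` tags whose `href` ends in `.pdf`. The helper names `text_or_na`, `parse_company`, and `write_csv` are illustrative, not part of any library.

```python
import csv
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.responsibilityreports.co.uk"


def text_or_na(soup, selector):
    """Return the stripped text of the first match, or "N/A" if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag else "N/A"


def parse_company(html, page_url):
    """Collect name, ticker, exchange and PDF links from one company page."""
    soup = BeautifulSoup(html, "html.parser")
    pdf_links = [
        urljoin(page_url, a["href"])  # turn relative hrefs into absolute URLs
        for a in soup.select('a[href$=".pdf"]')  # assumption: report links end in .pdf
    ]
    return {
        "name": text_or_na(soup, "h1"),
        "ticker": text_or_na(soup, ".ticker_name"),
        # Hypothetical selector: replace with the real one from the page source.
        "exchange": text_or_na(soup, ".exchange_name"),
        "pdf_links": pdf_links,
    }


def write_csv(rows, path):
    """Write one line per company; multiple PDF links share a cell, joined by ' | '."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "ticker", "exchange", "pdf_links"])
        for row in rows:
            writer.writerow(
                [row["name"], row["ticker"], row["exchange"], " | ".join(row["pdf_links"])]
            )
```

Inside the `for link in links:` loop you would then call something like `rows.append(parse_company(requests.get(link).content, link))` and finish with `write_csv(rows, "companies.csv")`. Adding a short `time.sleep(...)` between requests is a reasonable courtesy to the site.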
