Scraping URLs from a web page with BeautifulSoup

Problem description

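Below is the code I use to scrape this web page. Of all the URLs on the page, I only need the ones that lead to more information about job postings, for example the URLs behind company names such as "Abbot", "Abbvie", "Affymetrix", and so on.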

import requests
import pandas as pd
import re
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
list = ['#medical-device','#engineering','#recruitment','#job','#linkedin']
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
list_of_pages = [page + x for x in list]
for info in list_of_pages:
    pages= requests.get(info)
    soup = BeautifulSoup(pages.content, 'html.parser')
    tags = [div.p for div in soup.find_all('div', attrs ={'class':'fusion-text'})]
    for m in tags:
        try:
            links = [link['href'] for link in tags]
        except KeyError:
            pass
        print(links)

The output I get is a series of empty lists, as shown below:

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

What should I add or change in the code above to scrape these URLs and the additional information behind them?

Thanks!!

Tags: python-3.x, web-scraping, beautifulsoup

Solution


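What I noticed is that requesting the page with an anchor does not actually isolate the HTML you want. The part after `#` is a URL fragment, which is not sent to the server, so every request returns the same full page. As a result, you are grabbing every <div class='fusion-text'> on the page rather than only the section you meant.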
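
A minimal sketch of why the anchors make no difference to the request, using only the standard library: once the fragment is stripped, all five URLs from the question collapse into one. The variable names `anchors` and `requested` are chosen here for illustration.

```python
from urllib.parse import urldefrag

page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
anchors = ['#medical-device', '#engineering', '#recruitment', '#job', '#linkedin']

# Strip the fragment from each URL; what remains is the resource
# that is actually requested from the server.
requested = {urldefrag(page + anchor).url for anchor in anchors}
print(requested)
```

Because the set ends up with a single element, looping over the five anchored URLs downloads the same document five times.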

The following code sample retrieves all the URLs you want:

import requests
from bs4 import BeautifulSoup

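# Fetch the page once; the "#..." anchors are not needed for the request
page = "https://dpseng.com.sg/definitive-singapore-pharma-job-website-directory/"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')
# Grab all URLs under each section, identified by the id of its <div>
for section in ['medical-device', 'engineering', 'recruitment', 'job', 'linkedin']:
    subsection = soup.find('div', attrs={'id': section})
    if subsection is None:
        continue  # section not found on the page
    # href=True skips <a> tags that have no href attribute
    links = [a['href'] for a in subsection.find_all('a', href=True)]
    print("{}: {}".format(section, links))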
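
To try the same lookup without hitting the live site, the sketch below runs it against a small hand-written HTML snippet. The markup, the example.com URLs, the "Acme" entry, and the helper name `links_by_section` are all assumptions made for illustration; the real page's structure may differ. It also keeps the link text, since the question asks for the company names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout the answer relies on:
# one <div id="..."> per section, with company links inside it.
SAMPLE_HTML = """
<div id="medical-device">
  <p><a href="https://example.com/abbott">Abbot</a></p>
  <p><a href="https://example.com/abbvie">Abbvie</a></p>
  <p><a>no link here</a></p>
</div>
<div id="engineering">
  <p><a href="https://example.com/acme">Acme</a></p>
</div>
"""

def links_by_section(html, sections):
    """Map each section id to a list of (link text, href) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for section in sections:
        container = soup.find('div', attrs={'id': section})
        if container is None:
            # Section id not present in this document
            result[section] = []
            continue
        # href=True skips <a> tags that have no href attribute
        result[section] = [(a.get_text(strip=True), a['href'])
                           for a in container.find_all('a', href=True)]
    return result

found = links_by_section(SAMPLE_HTML, ['medical-device', 'engineering', 'job'])
for section, links in found.items():
    print("{}: {}".format(section, links))
```

Returning a dictionary instead of printing inside the loop makes it easier to pass the results on, for example to a pandas DataFrame as the imports in the question suggest.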
