Extracting specific HTML items with soup.find()

Question

I am trying to extract some items from HTML that I have loaded into Python with Beautiful Soup.

Here is the HTML:

[<div class="metadata container container-max-width-modifier">
 <div class="salary col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-pound"></i>
 <span itemprop="baseSalary" itemscope="" itemtype="http://schema.org/MonetaryAmount">
 <meta content="GBP" itemprop="currency"/>
 <span>£7.83 - £8.83 per hour</span>
 <span itemprop="value" itemscope="" itemtype="http://schema.org/QuantitativeValue">
 <meta content="7.8300" itemprop="value"/>
 <meta content="7.8300" itemprop="minValue"/>
 <meta content="8.8300" itemprop="maxValue"/>
 <meta content="HOUR" itemprop="unitText"/>
 </span>
 </span>
 </div>
 <div class="location col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-location-new"></i>
 <span id="jobCountry" value="Scotland"></span>
 <span>
 <a href="/jobs/jobs-in-aberdeen" itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place">
 <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
 <meta content="Aberdeenshire" itemprop="addressRegion"/>
 <span itemprop="addressLocality">Aberdeen</span>
 <meta content="GB" itemprop="addressCountry">
 </meta></span>
 </a>, <span>Aberdeenshire</span>
 </span>
 </div>
 <div class="time col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-clock"></i>
 <span content="FULL_TIME, PART_TIME" itemprop="employmentType">Permanent, full-time or part-time</span>
 <meta content="full-time or part-time" itemprop="workHours"/>
 </div>
 <div class="applications col-xs-12 col-sm-6 col-md-6 col-lg-6">
 <i class="icon icon-applicants"></i>
                     Be one of the first ten applicants
                 </div>
 <ul itemscope="" itemtype="http://schema.org/BreadcrumbList" style="display:none">
 <li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
 <meta content="1" itemprop="position"/>
 <ul itemprop="item" itemscope="" itemtype="http://schema.org/WebPage">
 <li>
 <meta content="https://www.reed.co.uk/jobs/retail-jobs" itemprop="url"/>
 <meta content="Retail" itemprop="name"/>
 </li>
 </ul>
 <li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
 <meta content="2" itemprop="position"/>
 <ul itemprop="item" itemscope="" itemtype="http://schema.org/WebPage">
 <li>
 <meta content="https://www.reed.co.uk/jobs/retail-jobs" itemprop="url"/>
 <meta content="Other Retail" itemprop="name"/>
 </li>
 </ul>
 </li></li></ul>

Here is the code I have put together:

salary_range = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="baseSalary").text.strip()
salary_min = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="value")
salary_time = soup.find('div', class_="metadata container container-max-width-modifier").find('span', itemprop="unitText")
job_location = soup.find('div', class_="location col-xs-12 col-sm-6 col-md-6 col-lg-6").find('span', itemprop="addressLocality")
job_country = soup.find('div', class_="location col-xs-12 col-sm-6 col-md-6 col-lg-6").find('span', id="jobCountry")

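The first one works fine and pulls out the salary range. I would like separate variables for: the unit (e.g. per hour, per year, per month), the minimum value, the maximum value, the job location, the job country, full-time/part-time, and the sector.

I think I can manage most of these myself, but the ones I am specifically stuck on are salary_min, salary_max and the unit (hour, year, month). For job_country and job_location it also returns the whole HTML line, when I only need the text inside the speech marks.

If anyone can offer any insight into how to do this, or how to do it better, I would be grateful!

Tags: python, html, web-scraping, beautifulsoup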

Answer

You can use Python's lxml library instead of BeautifulSoup; see the code below.

import requests
from lxml import html

# Download the job page and parse it into an lxml element tree.
req = requests.get('https://www.reed.co.uk/jobs/barista-costa-aberdeen-tesco/36178175')
tree = html.fromstring(req.content)

# The visible salary text is in the plain child <span> of the baseSalary span.
salary_range = tree.xpath('.//span[@itemprop="baseSalary"]/span/text()')[0]
# The numeric values and the unit are stored in the content attribute of <meta> tags.
salary_min = tree.xpath('.//meta[@itemprop="minValue"]/@content')[0]
salary_max = tree.xpath('.//meta[@itemprop="maxValue"]/@content')[0]
salary_time = tree.xpath('.//meta[@itemprop="unitText"]/@content')[0]
# Region and country are <meta> tags; the locality is the text of a <span>.
job_region = tree.xpath('.//meta[@itemprop="addressRegion"]/@content')[0]
job_locality = tree.xpath('.//span[@itemprop="addressLocality"]/text()')[0]
job_country = tree.xpath('.//meta[@itemprop="addressCountry"]/@content')[0]

print('Salary Range:', salary_range,'\n' 'Min Salary:', salary_min,'\n'
 'Max Salary:', salary_max,'\n' 'Salary Time:', salary_time,'\n'
 'Job Region:', job_region,'\n' 'Job Locality:', job_locality,'\n'
 'Job Country:', job_country)

Output

Salary Range: £7.83 - £8.83 per hour
Min Salary: 7.8300
Max Salary: 8.8300
Salary Time: HOUR
Job Region: Aberdeenshire
Job Locality: Aberdeen
Job Country: GB
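
Since the question asked about Beautiful Soup specifically, here is a minimal sketch of the same extraction using soup.find(). It is only an illustration under a few assumptions: SAMPLE_HTML is a trimmed copy of the markup from the question (not the live page), and the variable name employment_type is made up for this example. In the question's markup, minValue, maxValue and unitText sit on <meta> tags rather than <span> tags, which is why a lookup such as find('span', itemprop="unitText") comes back as None. The value itself is read from the tag's content attribute, and get_text(strip=True) returns only the text of an element instead of the whole tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (an assumption for this sketch),
# so the example runs without fetching the live page.
SAMPLE_HTML = """
<div class="metadata container container-max-width-modifier">
 <div class="salary col-xs-12 col-sm-6 col-md-6 col-lg-6">
  <span itemprop="baseSalary" itemscope="" itemtype="http://schema.org/MonetaryAmount">
   <meta content="GBP" itemprop="currency"/>
   <span>£7.83 - £8.83 per hour</span>
   <span itemprop="value" itemscope="" itemtype="http://schema.org/QuantitativeValue">
    <meta content="7.8300" itemprop="value"/>
    <meta content="7.8300" itemprop="minValue"/>
    <meta content="8.8300" itemprop="maxValue"/>
    <meta content="HOUR" itemprop="unitText"/>
   </span>
  </span>
 </div>
 <div class="location col-xs-12 col-sm-6 col-md-6 col-lg-6">
  <span id="jobCountry" value="Scotland"></span>
  <span itemprop="addressLocality">Aberdeen</span>
 </div>
 <div class="time col-xs-12 col-sm-6 col-md-6 col-lg-6">
  <span content="FULL_TIME, PART_TIME" itemprop="employmentType">Permanent, full-time or part-time</span>
 </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look the container up once; class_ matches any single class in the list.
metadata = soup.find("div", class_="metadata")

# The human-readable range is the first <span> nested inside baseSalary.
salary_range = metadata.find("span", itemprop="baseSalary").find("span").get_text(strip=True)

# These values are attributes of <meta> tags, so index the tag like a dict.
salary_min = metadata.find("meta", itemprop="minValue")["content"]
salary_max = metadata.find("meta", itemprop="maxValue")["content"]
salary_time = metadata.find("meta", itemprop="unitText")["content"]

# get_text() gives the text only; the country is stored in the value attribute.
job_location = metadata.find("span", itemprop="addressLocality").get_text(strip=True)
job_country = metadata.find("span", id="jobCountry")["value"]
employment_type = metadata.find("span", itemprop="employmentType").get_text(strip=True)

print(salary_range, salary_min, salary_max, salary_time)
print(job_location, job_country, employment_type)
```

On a real page any of these find() calls can return None when the element is missing, so it may be worth checking the result before indexing into it.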
