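python - Extracting data from Airtasker using Beautiful Soup

Problem description

I am trying to extract data from this page: https://www.airtasker.com/users/brad-n-11346775/.

So far I have managed to extract everything except the licence number. The problem is odd, because the licence number is displayed as plain text. I can extract everything else, such as the name and the address. For example, to extract the name I simply did this: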

name.append(pro.find('div', class_= 'name').text)

That works fine.

This is what I tried for the licence number, but the value I get is None:

license_number.append(pro.find('div', class_= 'sub-text'))

When I do:

license_number.append(pro.find('div', class_= 'sub-text').text)

it gives me the following error:

AttributeError: 'NoneType' object has no attribute 'text'

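To me this means it does not recognise the licence number as text, even though it is text.

Can someone give me a working solution and tell me what I am doing wrong? Regards,

Tags: python, html, beautifulsoup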

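The badge with the licence number is added to the HTML dynamically, from the Bootstrap JSON that sits in one of the <script> tags. It is therefore not present as a div in the HTML that requests downloads, which is why find returns None.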
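A minimal sketch of that behaviour, using a made-up HTML snippet rather than the real Airtasker page. The markup, the name "Brad N." and the helper text_or_none are all hypothetical; the point is only that find returns None for an element that is missing from the static HTML, and that checking for None avoids the AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: the badge <div> is absent
# and the page data only exists inside a <script> tag.
STATIC_HTML = """
<div class="profile">
  <div class="name">Brad N.</div>
  <script>window.__DATA__ = JSON.parse("{}");</script>
</div>
"""


def text_or_none(node, tag, css_class):
    """Return the stripped text of the first matching element, or None if absent."""
    match = node.find(tag, class_=css_class)
    return match.get_text(strip=True) if match is not None else None


pro = BeautifulSoup(STATIC_HTML, "html.parser")
print(text_or_none(pro, "div", "name"))      # the name div exists in the markup
print(text_or_none(pro, "div", "sub-text"))  # no such div, so None instead of an error
```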

You can find the tag with bs4, scoop out the data with a regex, and parse it with json.

Here's how:

import ast
import json
import re

import requests
from bs4 import BeautifulSoup

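# Download the raw HTML; the badge data is only present inside a <script> tag.
page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
# The fourth-from-last <script> tag is the one holding the Bootstrap JSON.
scripts = BeautifulSoup(page, "lxml").find_all("script")[-4]
# Capture the argument of parse(...), unescape the string literal, then decode the JSON.
bootstrap_JSON = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", scripts.string).group(1))
)
print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])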

Output:

Licence No. 28661
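
The parsing step can also be tried offline. The sketch below runs the same regex, ast.literal_eval and json.loads chain on a made-up script body, so it needs no network access. SAMPLE_SCRIPT, extract_bootstrap and licence_number are hypothetical names, and the sample only imitates the shape the answer relies on (a JSON document passed to JSON.parse as a quoted, escaped string). It also assumes the JavaScript string literal uses only escapes that Python reads the same way.

```python
import ast
import json
import re

# Made-up script body: a JSON document passed to JSON.parse(...) as a quoted,
# escaped JavaScript string literal. The real page content may differ.
SAMPLE_SCRIPT = r'''window.bootstrap = JSON.parse("{\"profile\": {\"badges\": {\"electrical_vic\": {\"reference_code\": \"Licence No. 28661\"}}}}");'''


def extract_bootstrap(script_text):
    """Return the dict embedded in a parse("...") call, or None if there is none."""
    match = re.search(r"parse\((.*)\)", script_text)
    if match is None:
        return None
    # literal_eval turns the quoted, escaped literal into a plain JSON string;
    # json.loads then turns that string into Python objects.
    return json.loads(ast.literal_eval(match.group(1)))


def licence_number(data, badge="electrical_vic"):
    """Look up the badge's reference code, returning None for any missing key."""
    badges = (data or {}).get("profile", {}).get("badges", {})
    return badges.get(badge, {}).get("reference_code")


print(licence_number(extract_bootstrap(SAMPLE_SCRIPT)))  # prints: Licence No. 28661
print(licence_number(extract_bootstrap("var x = 1;")))   # prints: None
```

Returning None for a missing script or badge keeps a scraping loop running over profiles that have no such badge. Picking the script by position, as find_all("script")[-4] does, depends on the page layout staying the same, so checking each script's text for the parse(...) call may be a sturdier choice.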
