python - Scraping an element that follows a specific element using BeautifulSoup and CSS selectors instead of lxml and XPath
Problem description
I want to scrape the "Services/Products" section from this page: https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490
The text sits inside a dd element that always comes after a dt element containing "Services/Products":
import requests
from lxml import html
url = "https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)
t = html.fromstring(r.content)
products = t.xpath('//dd[preceding-sibling::dt[contains(.,"Services/Products")]]/text()[1]')[0] if t.xpath('//dd[preceding-sibling::dt[contains(.,"Services/Products")]]') else ''
Is there any way to get the same text using BeautifulSoup (and, if possible, CSS selectors) instead of lxml and XPath?
Solution
Try using BeautifulSoup with Requests. It's much easier. Here is some code:
# BeautifulSoup is an HTML parser. You can find specific elements in a BeautifulSoup object
from bs4 import BeautifulSoup
from requests import get
url = "https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490"
obj = BeautifulSoup(get(url).content, "html.parser")
# Get the section that contains the services
business_info = obj.find("section", {"id": "business-info"})
# Get all <dd> elements (so you can pick the one you need from the list)
all_dd = business_info.find_all("dd")
# The third <dd> holds the text you need
services_and_products = all_dd[2]
# Get the text
text = services_and_products.text
# All done
print(text)
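Note that indexing `all_dd[2]` is fragile: it breaks if the page omits or reorders a field. Since the question specifically asks for a CSS-selector equivalent of the XPath, here is a minimal sketch using the `:-soup-contains()` pseudo-class (supported by bs4 4.7+, which bundles soupsieve) together with the adjacent-sibling combinator `+`. The inline HTML snippet is a hypothetical stand-in for the page's dt/dd structure, so the example runs without a network request; against the live page you would pass `r.content` instead.

```python
from bs4 import BeautifulSoup

# Stand-in for the "business-info" section of the real page
html_doc = """
<section id="business-info">
  <dl>
    <dt>Email</dt>
    <dd>info@example.com</dd>
    <dt>Services/Products</dt>
    <dd>Well Drilling, Pump Repair</dd>
  </dl>
</section>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# CSS equivalent of the XPath
# //dd[preceding-sibling::dt[contains(., "Services/Products")]]:
# select the <dd> immediately following the <dt> whose text
# contains "Services/Products"
dd = soup.select_one('dt:-soup-contains("Services/Products") + dd')
text = dd.get_text(strip=True) if dd else ''
print(text)  # Well Drilling, Pump Repair
```

Unlike the positional `all_dd[2]`, this keys off the label text itself, so it keeps working when fields shift position.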