python - How can I use Beautiful Soup to count the tags of a specific class that come before a given element?
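Question
I want to count all <a> tags that have the class md-headline and that appear before the link whose title contains "Dupont Lewis".
To locate the link ("Dupont Lewis") within the page, I use the following code: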
import requests
from bs4 import BeautifulSoup
url = 'https://www.sortlist.fr/pub'
response= requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
soup.a = soup.find_all("a", {"class": "md-headline"})
search = soup.select_one('a[title*="Dupont Lewis"]')
if search:
    position = find_all_previous('a[title*="Dupont Lewis"]')
    print(position.count)
else:
    print('None')
But for some reason, I keep getting 0.
Solution
Find all previous elements
link = soup.select_one('a[title*="Dupont Lewis"]')
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
Find all next elements
link = soup.select_one('a[title*="Dupont Lewis"]')
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})
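As a minimal sketch of how these two calls behave, the snippet below runs them on a small inline HTML fragment instead of the live page. The markup and the titles "Agency A" and "Agency B" are made up for illustration; only select_one, find_all_previous, and find_all_next come from the answer above. Both methods are called on the located tag and return a list-like result, so the count is taken with len().

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (assumption, not the site's HTML).
html = """
<div>
  <a class="md-headline" title="Agency A">Agency A</a>
  <a class="other" title="Not a headline">Not a headline</a>
  <a class="s-bold md-headline" title="Dupont Lewis">Dupont Lewis</a>
  <a class="md-headline" title="Agency B">Agency B</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the reference link by a substring match on its title attribute.
link = soup.select_one('a[title*="Dupont Lewis"]')

# Search backwards and forwards from that tag, keeping only md-headline anchors.
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})

# The results behave like lists, so len() gives the counts.
previous_count = len(previous_md_headlines)
next_count = len(next_md_headlines)
print(previous_count, next_count)
```

The anchor with class "other" is skipped in both directions, and the reference link itself is not included in either result.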
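Original question: why do I keep getting 0 links with the class md-headline before the link titled "Dupont Lewis"?
On the web page "https://www.sortlist.fr/pub", the first anchor element with the class md-headline also happens to be the anchor element titled "Dupont Lewis", which is why the count of preceding elements is always zero (unless the web page changes).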
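One way to check this explanation is to look up where the target link sits among all md-headline anchors. The sketch below does that on made-up markup (the list structure and the "Agency B" and "Agency C" entries are assumptions, not the real page), with the target deliberately placed first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in which the target link is the first md-headline anchor.
html = """
<ul>
  <li><a class="md-headline" title="Dupont Lewis">Dupont Lewis</a></li>
  <li><a class="md-headline" title="Agency B">Agency B</a></li>
  <li><a class="md-headline" title="Agency C">Agency C</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
all_md_headlines = soup.find_all("a", class_="md-headline")
link = soup.select_one('a[title*="Dupont Lewis"]')

# Zero-based position of the target among all md-headline anchors (identity comparison).
position = next(i for i, a in enumerate(all_md_headlines) if a is link)

# Nothing with the class md-headline precedes the first such anchor.
previous_count = len(link.find_all_previous("a", class_="md-headline"))
print(position, previous_count)
```

When the position is 0, a count of zero preceding md-headline links is the expected result rather than a sign of a bug.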
Complete example
import requests
from bs4 import BeautifulSoup
url = 'https://www.sortlist.fr/pub'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
link = soup.select_one('a[title*="Dupont Lewis"]')
print(f"link: {link}")
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})
print(f"\n\nFound {len(previous_md_headlines)} previous md-headlines.")
print("Previous md-headline links:\n")
print(*previous_md_headlines, sep="\n\n")
print(f"Found {len(next_md_headlines)} next md-headlines.")
print("Next md-headline links:\n")
print(*next_md_headlines, sep="\n\n")
Output
link: <a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9kdXBvbnQtbGV3aXM=" target="_blank" title="Dupont Lewis">Dupont Lewis</a>
Found 0 previous md-headlines.
Previous md-headline links:
Found 49 next md-headlines.
Next md-headline links:
<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9jb25jZXB0b3J5LTVmMjliMzFhLWExY2YtNDRlYS1iYzA4LWJiMzg2MTkyMmM1OQ==" target="_blank" title="The Collective Story">The Collective Story</a>
<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS90aGUtY3Jldw==" target="_blank" title="The Crew Communication">The Crew Communication</a>
<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9ub3ZlbWJyZQ==" target="_blank" title="Novembre - Creative Business Partner">Novembre - Creative Business Partner</a>
...