How do I use Beautiful Soup to count the tags with a specific class that come before a given element?

Problem description

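I want to count all <a> tags that have the class md-headline and appear before the link whose title contains "Dupont Lewis".

To find the position of that link ("Dupont Lewis") within the page, I use the following code: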

import requests
from bs4 import BeautifulSoup

url = 'https://www.sortlist.fr/pub'
response= requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())

soup.a = soup.find_all("a", {"class": "md-headline"})
search = soup.select_one('a[title*="Dupont Lewis"]')
if search:
    position = find_all_previous('a[title*="Dupont Lewis"]')
    print(position.count)
else:
    print('None')

But for some reason I keep getting 0.

Tags: python, beautifulsoup

Solution


Find all previous elements

link = soup.select_one('a[title*="Dupont Lewis"]')
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
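
To make the behaviour of find_all_previous easier to check without fetching the live page, here is a minimal offline sketch. The HTML snippet and the agency names in it are made up for illustration and are not taken from sortlist.fr. Note that the matches come back in reverse document order, closest match first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = """
<div>
  <a class="s-block md-headline" title="Alpha Agency">Alpha Agency</a>
  <a class="other-link" title="Not a headline">Not a headline</a>
  <a class="s-block md-headline" title="Beta Studio">Beta Studio</a>
  <a class="s-block md-headline" title="Gamma Group">Gamma Group</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the reference link by a substring of its title attribute.
target = soup.select_one('a[title*="Gamma Group"]')

# Collect every earlier <a> whose class list contains "md-headline".
previous = target.find_all_previous("a", {"class": "md-headline"})

previous_count = len(previous)
previous_titles = [a["title"] for a in previous]

print(previous_count)
print(previous_titles)
```

The count is simply the length of the returned list, so len() is enough. The anchor with class other-link is skipped because its class list does not contain md-headline.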

Find all following elements

link = soup.select_one('a[title*="Dupont Lewis"]')
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})
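
As a quick sanity check on the two calls, the sketch below runs both of them against a small invented snippet (the markup and names are assumptions, not the real page). When the reference link itself carries the md-headline class, the previous count plus the following count plus one should equal the total number of md-headline anchors.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = """
<ul>
  <li><a class="md-headline" title="Alpha Agency">Alpha Agency</a></li>
  <li><a class="md-headline" title="Beta Studio">Beta Studio</a></li>
  <li><a class="md-headline" title="Gamma Group">Gamma Group</a></li>
  <li><a class="md-headline" title="Delta Lab">Delta Lab</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
target = soup.select_one('a[title*="Beta Studio"]')

# Anchors after the reference link, returned in document order.
following = target.find_all_next("a", {"class": "md-headline"})
following_titles = [a["title"] for a in following]

# Anchors before the reference link.
preceding = target.find_all_previous("a", {"class": "md-headline"})

# All md-headline anchors in the document, including the reference link.
total = len(soup.find_all("a", {"class": "md-headline"}))

print(following_titles)
print(len(preceding), len(following), total)
```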

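On the page "https://www.sortlist.fr/pub", the first anchor element with the class md-headline happens to be the anchor whose title is "Dupont Lewis", which is why the count of previous elements is always zero (unless the page changes).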
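
The zero result can be reproduced offline. The sketch below wraps the lookup in a small helper, count_previous_headlines, which is a hypothetical name introduced here and not part of Beautiful Soup. It returns None when no link matches, so that "the link is missing" is not confused with "the link is first". It assumes the title fragment contains no double quotes, since the fragment is placed directly into a CSS selector. The sample markup is invented.

```python
from typing import Optional

from bs4 import BeautifulSoup


def count_previous_headlines(html: str, title_fragment: str) -> Optional[int]:
    """Count md-headline anchors before the link whose title contains title_fragment.

    Hypothetical helper for illustration. Returns None if no such link exists.
    Assumes title_fragment contains no double quotes.
    """
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(f'a[title*="{title_fragment}"]')
    if link is None:
        return None
    return len(link.find_all_previous("a", {"class": "md-headline"}))


# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = """
<div>
  <a class="md-headline" title="Alpha Agency">Alpha Agency</a>
  <a class="md-headline" title="Beta Studio">Beta Studio</a>
  <a class="md-headline" title="Gamma Group">Gamma Group</a>
</div>
"""

first_count = count_previous_headlines(SAMPLE_HTML, "Alpha Agency")
third_count = count_previous_headlines(SAMPLE_HTML, "Gamma Group")
missing_count = count_previous_headlines(SAMPLE_HTML, "No Such Agency")

print(first_count, third_count, missing_count)
```

A link that is the first md-headline on the page has nothing before it, so its count is 0, which matches what happens with "Dupont Lewis" on the real page.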

Full example

import requests
from bs4 import BeautifulSoup

url = 'https://www.sortlist.fr/pub'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

link = soup.select_one('a[title*="Dupont Lewis"]')
print(f"link: {link}")
previous_md_headlines = link.find_all_previous("a", {"class": "md-headline"})
next_md_headlines = link.find_all_next("a", {"class": "md-headline"})

print(f"\n\nFound {len(previous_md_headlines)} previous md-headlines.")
print("Previous md-headline links:\n")
print(*previous_md_headlines, sep="\n\n")

print(f"Found {len(next_md_headlines)} next md-headlines.")
print("Next md-headline links:\n")
print(*next_md_headlines, sep="\n\n")

Output

link: <a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9kdXBvbnQtbGV3aXM=" target="_blank" title="Dupont Lewis">Dupont Lewis</a>


Found 0 previous md-headlines.
Previous md-headline links:

Found 49 next md-headlines.
Next md-headline links:

<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9jb25jZXB0b3J5LTVmMjliMzFhLWExY2YtNDRlYS1iYzA4LWJiMzg2MTkyMmM1OQ==" target="_blank" title="The Collective Story">The Collective Story</a>

<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS90aGUtY3Jldw==" target="_blank" title="The Crew Communication">The Crew Communication</a>

<a class="s-block s-bold md-headline md-padding s-pb0 md-truncate" ng-click='setExpertiseAndLocation({"expertise":{"id":84,"name":"Publicité","title":"Agences de Publicité","slug":"pub","imageUrl":"/images/expertises/84.jpg"}})' sl-link="xx-L2FnZW5jeS9ub3ZlbWJyZQ==" target="_blank" title="Novembre - Creative Business Partner">Novembre - Creative Business Partner</a>
...
