How to use BeautifulSoup find() to extract all elements inside a specific div in HTML

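Question

This is the URL I am using.

Sample HTML

I am trying to get the Username column using soup.find(). I am not sure how to reference that div, because the last div I was able to find is the one from soup.find("div", {"id": "sort-by"}).contents, which returns: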

['\n',
 <div id="sort-by-container">
 <div id="sort-by-current"><i aria-hidden="true" class="fa fa-sort"></i> <span id="sort-by-current-title">Sorted by: Followers</span></div>
 <div class="border-box no-select" id="sort-by-dropdown">
 <div class="sort-by-select" data-sort="most-followers" data-title="Sorted by: Followers">Sort by Followers</div>
 <div class="sort-by-select" data-sort="most-following" data-title="Sorted by: Following">Sort by Following</div>
 <div class="sort-by-select" data-sort="most-uploads" data-title="Sorted by: Uploads">Sort by Uploads</div>
 <div class="sort-by-select" data-sort="most-likes" data-title="Sorted by: Likes">Sort by Likes</div>
 </div>
 </div>,
 '\n',
 <div style="clear: both;"></div>]

Ultimately, I am trying to get every row under Username, such as charli d’amelio and addison rae, that is, the contents of each `<a href="">`.

Here is the complete code I have tried so far:

from bs4 import BeautifulSoup
with open('Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html') as file:
    soup = BeautifulSoup(file)
soup.find('title').contents
soup.find("div", {"id": "sort-by"}).contents

Tags: python, html, web-scraping, beautifulsoup

Answer


To find all the names under the "Username" column, you can use the :nth-of-type(n) CSS selector: div div:nth-of-type(n+5) > div > a

To use a CSS selector, call the .select() method instead of .find_all().
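To see why that selector skips the first few blocks, here is a minimal sketch run against a small inline document. The markup is hypothetical and much simpler than the real page; it only assumes that the rows of interest are the fifth and later div children of some outer div, each wrapping its link in one more div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (NOT the real Socialblade page): an outer div whose
# first four div children are headers/filler, followed by the data rows.
html = """
<div id="wrapper">
  <div><div><a href="#">header link</a></div></div>
  <div>filler 2</div>
  <div>filler 3</div>
  <div>filler 4</div>
  <div><div><a href="#">alice</a></div></div>
  <div><div><a href="#">bob</a></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# div:nth-of-type(n+5) matches a div that is the 5th or later div among
# its siblings, so the link inside the first child div is skipped.
selector = "div div:nth-of-type(n+5) > div > a"
names = [a.get_text(strip=True) for a in soup.select(selector)]
print(names)  # ['alice', 'bob']
```

The offset 5 is tied to the page layout, so if the site adds or removes a block above the table the selector would need adjusting.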

In your example:

from bs4 import BeautifulSoup

with open("file.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
    print(tag.text)

Output:

charli d’amelio
addison rae
Bella Poarch
Zach King
TikTok
...
