python-3.x - BeautifulSoup: Is there a way to set the starting point of the find_all() method?
Problem description
Given a soup, I need to get n elements with class="foo".
This can be done with:
soup.find_all(class_='foo', limit=n)
However, this is slow, because the elements I am looking for are at the very bottom of the document.
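One way to read "setting the starting point" is to begin the search at a known element rather than at the document root. The sketch below assumes a container with id="bottom" exists; the markup and that id are hypothetical, not taken from the real site. It uses two existing BeautifulSoup calls: find_all() on a sub-tree, and find_all_next(), which walks forward in document order from a given element. Whether this is noticeably faster depends on the document, since the whole page is still parsed first.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the elements of interest sit near the bottom.
html = """
<div id="top"><p class="foo">early 1</p><p class="bar">noise</p></div>
<div id="bottom"><p class="foo">late 1</p><p class="foo">late 2</p><p class="foo">late 3</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
n = 2

# Locate a known element close to the target region (assumed to exist).
bottom = soup.find(id="bottom")

# Option 1: search only the descendants of that element.
inside = bottom.find_all(class_="foo", limit=n)

# Option 2: search forward in document order, starting at that element.
after = bottom.find_all_next(class_="foo", limit=n)

inside_text = [tag.get_text() for tag in inside]
after_text = [tag.get_text() for tag in after]
print(inside_text)
print(after_text)
```

Both searches skip the class="foo" element in the top container, because neither starts from the root of the soup.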
Here is my code:
main_num = 1
main_page = 'https://rawdevart.com/search/?page={p_num}&ctype_inc=0'
# get_soup returns bs4 soup of a link
main_soup = get_soup(main_page.format(p_num=main_num))
# get_last_page returns the number of pages which is 64
last_page_num = get_last_page(main_soup)
for sub_num in range(1, last_page_num+1):
    sub_soup = get_soup(main_page.format(p_num=sub_num))
    arr_links = sub_soup.find_all(class_='head')
    # process arr_links
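Solution
The class head is an attribute of the a tags on that page, so I assume you want to collect all the follow links and keep going through all the search pages.
Here is one way you might do that: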
import requests
from bs4 import BeautifulSoup

base_url = "https://rawdevart.com"

# Read the page count from the first search page; the code takes the
# third whitespace-separated token of the <small> element's text.
total_pages = BeautifulSoup(
    requests.get(f"{base_url}/search/?page=1&ctype_inc=0").text,
    "html.parser",
).find(
    "small",
    class_="d-block text-muted",
).getText().split()[2]

# Build the URL of every search page.
pages = [
    f"{base_url}/search/?page={n}&ctype_inc=0"
    for n in range(1, int(total_pages) + 1)
]

all_follow_links = []
# Only the first two pages are fetched here.
for page in pages[:2]:
    r = requests.get(page).text
    all_follow_links.extend(
        [
            f'{base_url}{a["href"]}' for a in
            BeautifulSoup(r, "html.parser").find_all("a", class_="head")
        ]
    )
print(all_follow_links)
Output:
https://rawdevart.com/comic/my-death-flags-show-no-sign-ending/
https://rawdevart.com/comic/tsuki-ga-michibiku-isekai-douchuu/
https://rawdevart.com/comic/im-not-a-villainess-just-because-i-can-control-darkness-doesnt-mean-im-a-bad-person/
https://rawdevart.com/comic/tensei-kusushi-wa-isekai-wo-meguru/
https://rawdevart.com/comic/iceblade-magician-rules-over-world/
https://rawdevart.com/comic/isekai-demo-bunan-ni-ikitai-shoukougun/
https://rawdevart.com/comic/every-class-has-been-mass-summoned-i-strongest-under-disguise-weakest-merchant/
https://rawdevart.com/comic/isekai-onsen-ni-tensei-shita-ore-no-kounou-ga-tondemosugiru/
https://rawdevart.com/comic/kubo-san-wa-boku-mobu-wo-yurusanai/
https://rawdevart.com/comic/gabriel-dropout/
and more ...
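The link-building step can be tried offline before any requests are sent. The following is a minimal sketch, not the answer's exact code: the sample markup and the helper name extract_links are made up for illustration, and urllib.parse.urljoin is swapped in for the f-string concatenation so that relative and absolute href values are both handled.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://rawdevart.com"

# Hypothetical markup, loosely modeled on a search results page.
sample_html = """
<div><a class="head" href="/comic/gabriel-dropout/">Gabriel DropOut</a></div>
<div><a class="other" href="/about/">About</a></div>
<div><a class="head" href="/comic/tsuki-ga-michibiku-isekai-douchuu/">Tsuki</a></div>
"""


def extract_links(html, base):
    """Return absolute URLs for every <a class="head"> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base, a["href"])
        for a in soup.find_all("a", class_="head", href=True)
    ]


links = extract_links(sample_html, base_url)
# Print one URL per line.
print("\n".join(links))
```

Passing href=True skips anchors without an href attribute, which avoids a KeyError on incomplete markup.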
Note: to get all the pages, just remove the slicing from this line:
for page in pages[:2]:
    # the rest of the loop body
So that it looks like this:
for page in pages:
    # the rest of the loop body
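On the original concern about speed: if only the a tags with class head are needed, BeautifulSoup can be told to keep just those elements while parsing, by passing a SoupStrainer as parse_only. This is a different technique from the code above, shown here as a sketch on hypothetical sample markup; how much time it saves on the real pages is not something this example measures.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup with matching links mixed in with other content.
sample_html = """
<html><body>
<p class="head">not a link</p>
<a class="head" href="/comic/a/">A</a>
<div><span>noise</span><a class="head" href="/comic/b/">B</a></div>
<a href="/other/">other</a>
</body></html>
"""

# Keep only <a class="head"> elements; everything else is discarded
# during parsing instead of being added to the tree.
only_head_links = SoupStrainer("a", class_="head")
soup = BeautifulSoup(sample_html, "html.parser", parse_only=only_head_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

The resulting soup contains only the matching links, so later searches have far fewer nodes to visit.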