python - Python bs4: getting only the URLs that contain a specific string
Question
I'm building an image scraper and want to grab some of the photos from this link, then save them in a folder named dribblephotos: https://dribbble.com/search/shots/popular/illustration?q=sneaker%20
Here are the links I retrieved:
https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png
https://static.dribbble.com/users/105681/avatars/mini/avatar-01-01.png?1377980605
https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg
https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494
https://static.dribbble.com/users/458522/screenshots/6034458/nike_air_jordan_i_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1237425/screenshots/5071294/customize_air_jordan_web_2x.png
https://static.dribbble.com/users/1237425/avatars/mini/87ae45ac7a07dd69fe59985dc51c7f0f.jpeg?1524130139
https://static.dribbble.com/users/1174720/screenshots/6187664/adidas_2x.png
https://static.dribbble.com/users/1174720/avatars/mini/9de08da40078e869f1a680d2e43cdb73.png?1588733495
https://static.dribbble.com/users/179617/screenshots/4426819/ultraboost_1x.png
https://static.dribbble.com/users/179617/avatars/mini/2d545dc6c0dffc930a2b20ca3be88802.jpg?1596735027
https://static.dribbble.com/users/458522/screenshots/6126041/nike_air_max_270_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/60266/screenshots/6698826/nike_shoe_2x.jpg
https://static.dribbble.com/users/60266/avatars/mini/64826d925db1d4178258d17d8826842b.png?1549028805
https://static.dribbble.com/users/78464/screenshots/4950025/8x600_1x.jpg
https://static.dribbble.com/users/78464/avatars/mini/a9ae6a559ab479d179e8bd22591e4028.jpg?1465908886
https://static.dribbble.com/users/458522/screenshots/6118702/adidas_nmd_r1_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/458522/screenshots/6098953/nike_lebron_10_je_icon_qs_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/7152093/img_0966_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6128979/nerd_x_adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/11064235/26fa4a2d-9033-4953-b48f-4c0e8a93fc9d_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6132938/nike_moon_racer_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1823684/screenshots/5973495/jordannn1_2x.png
https://static.dribbble.com/users/1823684/avatars/mini/f6041c082aec67302d4b78b8d203f02b.png?1509719582
https://static.dribbble.com/users/552027/screenshots/4666241/airmax270_1x.jpg
https://static.dribbble.com/users/552027/avatars/mini/35bb0dcb5a6619f68816290898bff6cc.jpg?1535884243
https://static.dribbble.com/users/458522/screenshots/6044426/adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/220914/screenshots/11295053/woman_shoe_tree_floating2_2x.png
https://static.dribbble.com/users/220914/avatars/mini/d364a9c166edb6d96cc059a836219a7d.jpg?1590773568
https://static.dribbble.com/users/4040486/screenshots/7079508/___2x.png
https://static.dribbble.com/users/4040486/avatars/mini/f31e9b50df877df815177e2015135ff7.png?1582521697
https://static.dribbble.com/users/57602/screenshots/12909636/d2_2x.png
https://static.dribbble.com/users/57602/avatars/mini/b4c27f3be2c61d82fbc821433d058b04.jpg?1575089000
https://static.dribbble.com/users/458522/screenshots/6049522/nike_x_john_elliott_lebron_10_soldier_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1025917/screenshots/9738550/vans-2020-pixelwolfie-dribbble_2x.png
https://static.dribbble.com/users/1025917/avatars/mini/87fdcb145eab0b47eda29fc873f25f8c.png?1594466719
https://static.dribbble.com/assets/icon-backtotop-1b04df73090f6b0f3192a3b71874ca3b3cc19dff16adc6cf365cd0c75897f6c0.png
https://static.dribbble.com/assets/dribbble-ball-icon-e94956d5f010d19607348176b0ae90def55d61871a43cb4bcb6d771d8d235471.svg
https://static.dribbble.com/assets/icon-shot-x-light-40c073cd65443c99d4ac129b69bf578c8cf97d69b78990c00c4f8c5873b0d601.png
https://static.dribbble.com/assets/icon-shot-prev-light-ca583c76838d54eca11832ebbcaba09ba8b2bf347de2335341d244ecb9734593.png
https://static.dribbble.com/assets/icon-shot-next-light-871a18220c4c5a0325d1353f8e4cc204c3b49beacc63500644556faf25ded617.png
https://static.dribbble.com/assets/dribbble-square-c8c7a278e96146ee5a9b60c3fa9eeba58d2e5063793e2fc5d32366e1b34559d3.png
https://static.dribbble.com/assets/dribbble-ball-192-ec064e49e6f63d9a5fa911518781bee0c90688d052a038f8876ef0824f65eaf2.png
https://static.dribbble.com/assets/icon-overlay-x-2x-b7df2526b4c26d4e8410a7c437c433908be0c7c8c3c3402c3e578af5c50cf5a5.png
However, I only want the URLs that contain the string "screenshots". So I tried to write a function that grabs only the images with "screenshots" in their URL. For example:
https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg
At first, to see whether it would work, I wrote a function to print the specific links I want. However, it did not work. Here is my function:
def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)
Here is my full code:
from bs4 import BeautifulSoup
import requests as rq
import os
r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")
links = []
x = soup2.select('img[src^="https://static.dribbble.com"]')
for img in x:
    links.append(img['src'])
def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)
os.mkdir('dribblephotos')
for index, img_link in enumerate(links):
    if "screenshots" in images:
        img_data = r.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
    else:
        break
art_links()
Answer
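I noticed a small syntax problem with the final if statement (nothing was indented under the if), so I reformatted it to try to match what you intended. I think the main problem is the else branch that breaks out of the final for loop: as soon as one entry in links does not contain "screenshots", the loop stops completely instead of moving on to the next entry. You could use the continue keyword there, but simply leaving out the else branch is enough. You were also checking for "screenshots" in images, whereas the link you want to test is the loop variable img_link. Try this for your final for loop and see what you get: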
for index, img_link in enumerate(links):
    if "screenshots" in img_link:
        img_data = rq.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
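To see why the else: break branch cut the downloads short, here is a minimal sketch that runs offline on three of the URLs listed in the question. The helper names with_break and without_else are made up for this illustration, and nothing is downloaded.

```python
# Three URLs copied from the list in the question: screenshot, avatar, screenshot.
links = [
    "https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg",
    "https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767",
    "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png",
]


def with_break(urls):
    """Mimic the original loop: stop at the first URL without 'screenshots'."""
    kept = []
    for url in urls:
        if "screenshots" in url:
            kept.append(url)
        else:
            break  # leaves the loop, so later screenshots are never reached
    return kept


def without_else(urls):
    """Mimic the fixed loop: skip non-matching URLs and keep going."""
    kept = []
    for url in urls:
        if "screenshots" in url:
            kept.append(url)
    return kept


print(len(with_break(links)), "kept with break")
print(len(without_else(links)), "kept without the else branch")
```

Because the avatar URL sits between the two screenshots, the version with break never reaches the second screenshot.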
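If you still want the links rather than the downloaded files, you should be able to collect them while iterating over the images in the for loop, storing each one in a new list if it is a screenshot link.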
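As a rough sketch of collecting the links without downloading anything, something like the following could work. The function name screenshot_links is hypothetical, and checking the URL's path segments instead of doing a plain substring test is my own variation, not something the original code does.

```python
from urllib.parse import urlsplit


def screenshot_links(urls):
    """Return the URLs whose path has a 'screenshots' segment, in input order."""
    kept = []
    for url in urls:
        # urlsplit(...).path drops the scheme, host and query string
        segments = urlsplit(url).path.split("/")
        if "screenshots" in segments:
            kept.append(url)
    return kept


# Sample input copied from the list in the question.
sample = [
    "https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg",
    "https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767",
    "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png",
]

for link in screenshot_links(sample):
    print(link)
```

In the real script you would pass in the list of img['src'] values instead of the sample list.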
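Update: this latest version works for me. Once the check was inside the loop, I removed the separate filtering function, since it is unnecessary after you have already looped over the results. The first for loop is all you need; there is no reason to iterate twice, so I do the check during that single pass and only append a link to the links list when it matches.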
from bs4 import BeautifulSoup
import requests as rq
import os
r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")
links = []
x = soup2.select('img[src^="https://static.dribbble.com"]')
# exist_ok=True avoids a FileExistsError if the folder is already there
os.makedirs('dribblephotos', exist_ok=True)
# Only one for loop required, shouldn't iterate twice if not required
for index, img in enumerate(x):
    # Store the current url from the image result
    url = img["src"]
    # Check the url for screenshot before putting in the links
    if "screenshot" in url:
        links.append(url)
        # Download the image
        img_data = rq.get(url).content
        # Put the image into the file
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
print(links)
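One thing the code above does not handle: every file is saved as .jpg, although several of the screenshot URLs in the question end in .png. A possible way to keep the original extension is sketched below. The helper target_filename and its .jpg fallback are assumptions of mine, not part of the original code.

```python
import os
from urllib.parse import urlsplit


def target_filename(index, url, folder="dribblephotos"):
    """Build a file path for a download that keeps the URL's own extension."""
    # The path part excludes any query string such as "?1580329767"
    path = urlsplit(url).path
    # splitext returns (root, ext); fall back to ".jpg" when there is no ext
    extension = os.path.splitext(path)[1] or ".jpg"
    return os.path.join(folder, str(index) + extension)


# URL copied from the list in the question.
print(target_filename(1, "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png"))
```

In the loop you would then open target_filename(index + 1, url) instead of building the name by string concatenation.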