Extracting the src attribute

Problem description

What I'm trying to do:

Given this HTML:

<img class="poster lazyload lazyloaded"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     alt="Hitman"
     src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     data-loaded="true">

I want to extract the value of the "data-src" or "src" attribute (either one will do, since each contains the image URL).

What I tried:

Posters = soup.find("img")["src"]
print(Posters)

But this obviously matches any img tag, not only the posters, so the links I get back have nothing to do with the posters. Output:

https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG

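By posters I mean the movie posters (see this URL: https://www.themoviedb.org/search?&query=Hitman).

Summary

I want to extract the attribute values from the img elements that have the ".lazyloaded" class.

I hope everything is clear. Thanks.


Edit:

To explain where the problem was:

For anyone reading this, Laurent's answer is the solution; the problem was in the HTML being parsed.

As we can see in the browser, the element whose attributes I was trying to scrape has the class "poster lazyload lazyloaded".

But if we print website.content: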

<img class="poster lazyload"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
     data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 2x"
     alt="The Hitman&#x27;s Bodyguard Collection">

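This is very different: the class is only "poster lazyload", and there is no "src" or "srcset" attribute, only "data-src" and "data-srcset".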
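The mismatch described above can be reproduced without any network access. The following is a minimal sketch, assuming the served markup looks like the snippet printed from website.content; the variable names (raw_html, lazyloaded, poster_urls) are illustrative and not from the original post. The likely cause, stated here as an assumption, is that a lazy-loading script adds the "lazyloaded" class and the "src" attribute only after the page runs in a browser.

```python
from bs4 import BeautifulSoup

# Markup as served to the scraper (copied from the snippet above, shortened
# by dropping data-srcset). No "lazyloaded" class and no "src" attribute.
raw_html = """
<img class="poster lazyload"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
     alt="The Hitman&#x27;s Bodyguard Collection">
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Filtering on the class that only exists in the browser finds nothing.
lazyloaded = soup.find_all("img", class_="lazyloaded")

# Filtering on a class that is present in the served markup works,
# and the URL is read from data-src instead of src.
posters = soup.find_all("img", class_="poster")
poster_urls = [img["data-src"] for img in posters if img.has_attr("data-src")]

print(len(lazyloaded))
print(poster_urls)
```

So when scraping the raw response, it seems safer to select on "poster" and read "data-src" than to rely on attributes that appear only in the browser's rendered DOM.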

Tags: python, web-scraping

Solution


You can try filtering by class, like this:

posters = soup.find_all("img", {"class": "lazyloaded"})

for poster in posters:
    print(poster["src"])

See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
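Because "class" is a multi-valued attribute, Beautiful Soup matches a tag if any one of its classes equals the search string. A small sketch of three equivalent ways to express the filter follows; the inline HTML and the file names logo.svg and poster.jpg are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one unrelated logo and one poster with three classes.
html = """
<img class="logo" src="logo.svg">
<img class="poster lazyload lazyloaded" src="poster.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute dictionary, as used in the answer above.
by_dict = soup.find_all("img", {"class": "lazyloaded"})

# The class_ keyword (class is a reserved word in Python, hence the underscore).
by_keyword = soup.find_all("img", class_="lazyloaded")

# A CSS selector; chaining classes requires all of them to be present.
by_selector = soup.select("img.poster.lazyloaded")

srcs = [[img["src"] for img in result]
        for result in (by_dict, by_keyword, by_selector)]
print(srcs)
```

All three calls select only the poster image here; the CSS selector form is the strictest, since it requires both "poster" and "lazyloaded".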

Edit: more explanation

Suppose you have the following file, demo.html:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Title</title>
</head>
<body>
<img class="logo" src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg">
<img class="poster lazyload lazyloaded"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     alt="Hitman"
     src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     data-loaded="true">
</body>
</html>

You can parse the "poster" images like this:

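from bs4 import BeautifulSoup

# Read the demo file and parse it with Python's built-in HTML parser.
with open("demo.html", encoding="utf8") as fd:
    soup = BeautifulSoup(fd.read(), features="html.parser")

# Match every <img> whose class attribute includes "lazyloaded".
posters = soup.find_all("img", {"class": "lazyloaded"})

# Print the image URL of each match.
for poster in posters:
    print(poster["src"])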

You get:

https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg
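If the same script has to cope with both forms of the markup (with "src" as rendered in the browser, or with only "data-src" as in the raw response), one option is to fall back from one attribute to the other. This is a sketch only: the helper name image_url and the short file names a.jpg and b.jpg are hypothetical, not part of the original answer.

```python
from bs4 import BeautifulSoup


def image_url(img):
    """Return the URL from src, falling back to data-src; None if neither is set."""
    # Tag.get returns None for a missing attribute instead of raising KeyError.
    return img.get("src") or img.get("data-src")


# Hypothetical markup covering both cases.
html = """
<img class="poster lazyload lazyloaded" data-src="a.jpg" src="a.jpg">
<img class="poster lazyload" data-src="b.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

urls = [image_url(img) for img in soup.find_all("img", class_="poster")]
print(urls)
```

Using Tag.get rather than indexing avoids a KeyError on images that have not been lazy-loaded yet.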
