Scraping an IMG SRC under a DIV tag with BeautifulSoup

Problem description

I am trying to get the src of an image located under a div tag. My code raises an error: KeyError: 'src'

HTML from the Engadget.com blog, page 10

Here is my code:

for page in range(1,4):
    # code that gets dynamic URL
    url = sys.argv[1] + "{}".format(page)
    print(url)
    html=urlopen(url)
    soup=BeautifulSoup(html,"lxml")

    for article in soup.find_all('article',class_='o-hit'):
        div=soup.find('div',{"class":"o-rating_thumb@m-"})
        img_src = div.find('img').attrs['src']
        #img_src = article.find('div',class_ ='o-rating_thumb c-white').img['src']
        headline = article.h2.text.strip()

        summary = article.find('p',class_ ='mt-15@m+ t-d5@m- t-d5@tp+ c-gray-3').text.strip()

        #img_src = "none"

        print(headline)
        print(summary)
        print(img_src)
        csv_writer.writerow([headline,summary,img_src])

The web page is here: Engadget blog, page 10

Tags: python, html, beautifulsoup

Solution


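For the topmost news item on each page, you can get the image source from the 'src' attribute itself.

First use the find() method to navigate to the div that contains the image. Then, within that div, find the img tag and read the source from its attributes.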

import requests
from bs4 import BeautifulSoup
url='https://www.engadget.com/reviews/latest/page/10/'
res=requests.get(url)
soup=BeautifulSoup(res.text,'html.parser')
div=soup.find('div',{"class":"o-rating_thumb@m-"})
print(div.find('img').attrs['src'])

Output:

https://o.aolcdn.com/images/dims?resize=810%2C455&crop=810%2C455%2C0%2C0&quality=80&image_uri=https%3A%2F%2Fo.aolcdn.com%2Fimages%2Fdims%3Fcrop%3D1400%252C933%252C0%252C0%26quality%3D85%26format%3Djpg%26resize%3D1600%252C1066%26image_uri%3Dhttp%253A%252F%252Fo.aolcdn.com%252Fhss%252Fstorage%252Fmidas%252F85a4e2b124ba329ab520e80e306f07eb%252F206517051%252FIMG_5243e.jpg%26client%3Da1acac3e1b3290917d92%26signature%3Dcea6158d0bf02768d31ee67f2694be6cafaf200c&client=amp-blogside-v2&signature=08a97a1109f1c3287c6766fa284104c6f78770fe

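Edit, to scrape all the news items on the page:

Even though the first image has a src attribute, to scrape the subsequent images we have to use the data-original attribute (you can view the page source to confirm this). I think this is why you were getting the KeyError: those img tags have no 'src' key.

I was able to scrape all the news items like this: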
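
BeautifulSoup keeps a tag's attributes in an ordinary dict (Tag.attrs), so indexing a missing attribute raises KeyError, while .get() returns None. The sketch below shows one way to prefer data-original and fall back to src. The helper name image_url and the example.com values are illustrative assumptions, not part of the original answer or the real page.

```python
def image_url(attrs, keys=("data-original", "src")):
    """Return the first non-empty value among `keys`, or None if none is set.

    `attrs` is any mapping of attribute names to values, for example the
    `attrs` dict of a BeautifulSoup tag.
    """
    for key in keys:
        value = attrs.get(key)  # .get() never raises KeyError
        if value:
            return value
    return None


# Attribute dicts shaped like Tag.attrs; the values are made up.
first_img = {"src": "https://example.com/top.jpg", "alt": "top story"}
lazy_img = {"data-original": "https://example.com/lazy.jpg", "class": ["lazy"]}

print(image_url(first_img))
print(image_url(lazy_img))
print(image_url({}))
```

With a real page this could be called as image_url(article.find('img').attrs), provided find() actually returned a tag.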

import requests
from bs4 import BeautifulSoup
url='https://www.engadget.com/reviews/latest/page/10/'
res=requests.get(url)
soup=BeautifulSoup(res.text,'html.parser')
articles=soup.find_all('article',{"class":"o-hit"})
for article in articles:
    print("Heading: ", article.find('h2').text.strip())#heading
    print("Summary: ", article.find('p').text.strip())#summary
    print("Image Source:", article.find('img').attrs['data-original'])#image src
    print()

Output:

Heading:  Netflix will remove user reviews from its website next month
Summary:  Last year five-star ratings got the ax, and now written reviews will fade away too.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=300%2C200&quality=80&image_uri=https%3A%2F%2Fs.aolcdn.com%2Fhss%2Fstorage%2Fmidas%2F884e68f9a829f3a26db5b729f00ccd03%2F206508290%2FEnglish.jpg&client=amp-blogside-v2&signature=b37eb21e95cef8cebe1f3c741b8bb29eb3471dcc

Heading:  Smart ForTwo Electric Drive quick spin review
Summary:  The saddest way to spend $25,000.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=300%2C200&quality=80&image_uri=https%3A%2F%2Fs.aolcdn.com%2Fhss%2Fstorage%2Fmidas%2Fedbdfdfeff2e77567cd0c4a73484d108%2F206502307%2Fsmartfortwo.jpg&client=amp-blogside-v2&signature=a9fc05d80d4b4d8ba6ef33453510c138632bab81

Heading:  Vivo's all-screen NEX S is a frustrating glimpse of the future
Summary:  Spoiler alert: It's really cool, but don't bother importing one.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F29%2F5b36ac0e523dc352bd46785a%2F5b36aedc884c2354eb33d663_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=725c8033196a2ae3500e2144830d14b03e7abc0e

Heading:  Sonos Beam review: Smart features trump minor audio compromises
Summary:  Bringing the soundbar into the smart home era.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F27%2F5b32f579523dc352bd3f66f3%2F5b32fbf2884c2354eb33d62f_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=4ad311aeb5cb23907fd99ec12d962b148646163d

Heading:  BlackBerry KEY2 review: The undisputed keyboard king
Summary:  This is the best Android-powered BlackBerry, if that means anything to you.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F26%2F5b3188ee523dc36212a7ff02%2F5b318be5802b94347b7e586b_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=5438cdf814480be5856d38db73695f86ade186ea

Heading:  Amazon Echo Look review: Good selfie taker, so-so stylist
Summary:  An AI is no match for my style instincts.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F25%2F5b30cbfce880db6107cb7ad0%2F5b30cde61aa5fc22c7bbf187_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=308e9f00afcb968da05823ce0d0718ccc6e43cb4

Heading:  Mitsubishi’s Outlander Plug-In Hybrid is an understated surprise
Summary:  Mitsubishi is back, even though it actually never left.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F21%2F5b2bc80f523dc36212a2be79%2F5b2bc8a6884c2319c410c008_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=a00b8466fa281051de4d64b1223fe99f97315985

Heading:  Amazon Fire TV Cube review: Alexa still needs work as a TV guide
Summary:  This device was bound to be made at some point, but is it worth it?
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F21%2F5b2bb81edbaab36faf00ed0e%2F5b2bddfb884c2319c410c00c_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=baa2db64e12d013ab712d823238fc3efeee693f8

Heading:  HTC U12+ review: Fundamentally flawed
Summary:  The phone's pressure-sensitive power and volume keys are kinda the worst.
Image Source: https://o.aolcdn.com/images/dims?thumbnail=386%2C217&quality=80&image_uri=https%3A%2F%2Fimg.vidible.tv%2Fprod%2F2018-06%2F21%2F5b28cd94f50775726418990a%2F5b2bd7d4b46ab33c496c1607_1920x1080_U_v1.jpg&client=amp-blogside-v2&signature=8518ce5c141fb85b935794fbd3bd283d32508484
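
To tie this back to the CSV-writing loop in the question, here is a minimal sketch under a few assumptions. Note that in the question's loop, soup.find(...) searches the whole document, so every iteration gets the same first div; searching from article keeps each lookup inside the current item. The SAMPLE_HTML markup and the extract_rows name are hypothetical stand-ins for illustration, not copied from the site.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the listing page.
SAMPLE_HTML = """
<article class="o-hit">
  <h2>First headline</h2>
  <p>First summary.</p>
  <div><img src="https://example.com/a.jpg"></div>
</article>
<article class="o-hit">
  <h2>Second headline</h2>
  <p>Second summary.</p>
  <div><img data-original="https://example.com/b.jpg"></div>
</article>
<article class="o-hit">
  <h2>No image here</h2>
  <p>Third summary.</p>
</article>
"""


def extract_rows(html):
    """Return [headline, summary, image URL] for every article.o-hit."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article", class_="o-hit"):
        headline = article.find("h2")
        summary = article.find("p")
        img = article.find("img")  # search inside this article only
        img_src = ""
        if img is not None:
            # Prefer the lazy-load attribute, then fall back to src.
            img_src = img.get("data-original") or img.get("src") or ""
        rows.append([
            headline.get_text(strip=True) if headline else "",
            summary.get_text(strip=True) if summary else "",
            img_src,
        ])
    return rows


rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for the question's CSV file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

An article without an image yields an empty third column instead of raising, so one odd item does not stop the whole run.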
