BeautifulSoup cannot find the image src attribute

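Problem description

Hi, I've been scraping the ASOS fashion website. I get all the elements, but I can't get the src attribute of any img after the 8th one.

The class of the img consists of three names, or perhaps the name can vary? That seems a bit fishy.

When I try to find all img tags, I get a very different class name, and the 9th img has no src attribute.

My code: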

from helium import*
import time
from bs4 import BeautifulSoup

s = start_firefox(f"https://www.asos.com/men/shoes-boots-trainers/boots/cat/?cid=5774&currentpricerange=15-400&nlid=mw|shoes|shop%20by%20product|boots&refine=attribute_1046:8222,8629,10808&sort=priceasc",headless =True)

time.sleep(5)

for x in range(1,2):
    scroll_down(num_pixels=10000)
    for x in range(1,3):
        click("LOAD MORE")
        time.sleep(5)
        scroll_down(num_pixels=10000)


soup = BeautifulSoup(s.page_source,"lxml")

All = soup.find_all("article",class_="_2qG85dG")

kill_browser()

   
def img(s):
    try:
        return s.find("img",class_= "_2r9Zh0W")["src"]
    except:
        return s.find("img",class_="_2FC97Nq _2q4fCfJ _2r9Zh0W")['src']

for a in All:
    print(img(a))
    print()

Output:

//images.asos-media.com/products/asos-design-chelsea-boots-in-tan-faux-suede/12550524-1-tan?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/asos-design-chelsea-boots-in-black-faux-suede/12550506-1-black?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/asos-design-chelsea-boots-in-brown-suede-with-black-sole/14849004-1-brown?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/asos-design-vegan-lace-up-boots-in-brown-faux-leather/12510724-1-brown?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/asos-design-chelsea-boots-in-brown-leather-with-brown-sole/10278706-1-brown?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/asos-design-cuban-heel-western-chelsea-boot-in-grey-faux-suede-with-square-toe-with-metal-cap/21031115-1-grey?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/new-look-chelsea-boot-in-black-suede/21198040-1-black?$n_480w$&wid=476&fit=constrain

//images.asos-media.com/products/asos-design-wide-fit-chelsea-boots-in-black-faux-suede/12550515-1-black?$n_480w$&wid=476&fit=constrain

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-78-d9272492986c> in img(s)
      9     try:
---> 10         return s.find("img",class_= "_2r9Zh0W")["src"]
     11     except:

TypeError: 'NoneType' object is not subscriptable

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-79-51d9d651c40b> in <module>
      3     #print(a.find("div",class_= "_3J74XsK").text.strip())
      4     #print(price(a))
----> 5     print(img(a))
      6     print()

<ipython-input-78-d9272492986c> in img(s)
     10         return s.find("img",class_= "_2r9Zh0W")["src"]
     11     except:
---> 12         return s.find("img",class_="_2FC97Nq _2q4fCfJ _2r9Zh0W")["src"]
     13 
     14 

TypeError: 'NoneType' object is not subscriptable

Tags: python, python-3.x, web-scraping, beautifulsoup, error-handling

Solution


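What happens?

The images are lazy-loaded, meaning they are only loaded once they scroll into view. That is why you only get the src for the first 8.

For images that have not loaded yet, the page contains only a placeholder tag with no src: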

<img alt="" class="_1Jj-2sd" data-auto-id="productTileEmptyImage"/>
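To see where the TypeError in the traceback comes from, here is a minimal sketch using a made-up two-tile snippet. The markup and the example.com URL are assumptions modelled on the placeholder above, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one loaded tile and one tile still showing the placeholder.
html = """
<article data-auto-id="productTile">
  <img class="_2r9Zh0W" data-auto-id="productTileImage" src="//images.example.com/a.jpg"/>
</article>
<article data-auto-id="productTile">
  <img alt="" class="_1Jj-2sd" data-auto-id="productTileEmptyImage"/>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
loaded, pending = soup.find_all("article")

# The loaded tile has an <img> with the expected class and a src.
loaded_src = loaded.find("img", class_="_2r9Zh0W")["src"]

# The placeholder carries a different class, so find() returns None;
# indexing that None with ["src"] is what raises the TypeError.
missing = pending.find("img", class_="_2r9Zh0W")

# Tag.get() returns None instead of raising when the attribute is absent.
placeholder_src = pending.img.get("src")

print(loaded_src)
print(missing)
print(placeholder_src)
```

Checking the result of find() for None, or reading attributes with get(), avoids the exception but does not make the image load; that still requires scrolling.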

How to fix?

Don't scroll the whole way down in one step. Scroll in smaller steps and wait for the images to load:

for x in range(1,6):
    scroll_down(num_pixels=1800)
    time.sleep(3)
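A fixed number of steps may be too few or too many depending on page length and connection speed. As a variation, the sketch below keeps scrolling until the placeholder marker no longer appears in the page source. The function name scroll_until_loaded and its parameters are hypothetical, and the plain substring check assumes the browser serializes the attribute exactly as shown in the placeholder above.

```python
import time

# Marker taken from the placeholder tag; assumed to appear verbatim in page_source.
PLACEHOLDER = 'data-auto-id="productTileEmptyImage"'


def scroll_until_loaded(get_html, scroll, step=1800, pause=3.0,
                        max_steps=20, sleep=time.sleep):
    """Scroll in small steps until no placeholder images remain.

    get_html: callable returning the current page source as a string
    scroll:   callable taking the number of pixels to scroll down
    Returns the number of scroll steps performed (at most max_steps).
    """
    for steps_done in range(max_steps):
        if PLACEHOLDER not in get_html():
            return steps_done
        scroll(step)
        sleep(pause)  # give the lazy loader time to fetch the images
    return max_steps
```

With helium it could be called as scroll_until_loaded(lambda: s.page_source, lambda px: scroll_down(num_pixels=px)). The max_steps cap keeps the loop from running forever if some tile never loads.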

I also think it is better and clearer to select the image by its data attribute rather than by its class or classes:

    if a.find('img', {'data-auto-id':'productTileImage'}):
        print(a.find('img', {'data-auto-id':'productTileImage'})['src'])
    else:
        print(a.img)
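The same idea can be wrapped in a small helper that returns None for tiles whose image has not loaded. This is only a sketch: the helper name image_src, the sample markup, and the example.com URL are made up for illustration, and html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a loaded tile with three classes and an unloaded placeholder.
html = """
<article data-auto-id="productTile">
  <img class="_2FC97Nq _2q4fCfJ _2r9Zh0W" data-auto-id="productTileImage" src="//images.example.com/a.jpg"/>
</article>
<article data-auto-id="productTile">
  <img alt="" class="_1Jj-2sd" data-auto-id="productTileEmptyImage"/>
</article>
"""


def image_src(article):
    """Return the product image src, or None if the image has not loaded yet."""
    img = article.find("img", {"data-auto-id": "productTileImage"})
    return img.get("src") if img is not None else None


soup = BeautifulSoup(html, "html.parser")
tiles = soup.find_all("article", {"data-auto-id": "productTile"})
sources = [image_src(tile) for tile in tiles]
print(sources)

# A single class name already matches a tag that carries several classes,
# so a fallback search on the full three-class string finds nothing new.
single = soup.find("img", class_="_2r9Zh0W")
print(single is not None)
```

This also addresses the question about the three class names: searching for one of them is enough, so the fallback branch in the original img() function fails for the same reason as the first attempt.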

Example

from helium import start_firefox, scroll_down, click, Link, kill_browser
import time
from bs4 import BeautifulSoup

s = start_firefox("https://www.asos.com/men/shoes-boots-trainers/boots/cat/?cid=5774&currentpricerange=15-400&nlid=mw|shoes|shop%20by%20product|boots&refine=attribute_1046:8222,8629,10808&sort=priceasc", headless=False)

time.sleep(2)
for t in range(1, 4):
    time.sleep(2)
    # Scroll in small steps so the lazy-loaded images have time to load
    for x in range(1, 6):
        scroll_down(num_pixels=2000)
        time.sleep(3)
    # Load the next batch of products; carry on if the link cannot be clicked
    try:
        click(Link('Load more'))
    except Exception:
        continue

# Grab the rendered HTML before closing the browser
soup = BeautifulSoup(s.page_source, 'lxml')
kill_browser()

for a in soup.find_all("article", {'data-auto-id': 'productTile'}):
    img = a.find('img', {'data-auto-id': 'productTileImage'})
    if img:
        print(img['src'])
    else:
        # Image not loaded yet: print the placeholder tag instead
        print(a.img)
