Web scraping: image URL returned as ''

Problem description

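I think my problem is that JavaScript running on the page does not load the images until I scroll down. Can anyone help me with this? The script works fine until it reaches "ZendikarRising(ZNR)", a page with more images. Then it gives me: Failed to save imageMakindi Ox (ZNR).png from url ... It should print a URL there, but it returns ''. I have added some DEBUG code to get around the missing card URLs, but I am still missing a ton of cards.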

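I tried removing the empty fields, but if you run the script you will see that I have the same number of card names and URLs (some of the URLs are blank), so removing the empty URLs throws off the totals and the names no longer line up with the URLs.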

Here is the code in question:

import requests
import os
from os.path import basename
from bs4 import BeautifulSoup
 
path = os.getcwd()
print ("The current working directory is %s" % path)
 
url = 'https://scryfall.com/sets'
r=requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')
 
####################GATHERS ALL URLS FROM SET DIRECTORY#####################
links = []
Urls = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
 
for link in links:
    if link != None:
        if 'https://scryfall.com/sets/' in link:
            if link not in Urls:
                Urls.append(link)
 
#################START OF ALL URL LOOPS################################
for Url in Urls:  ## goes through all the URLs gathered from the set links
    r=requests.get(Url).text
    soup = BeautifulSoup(r, 'html.parser')
 
    temp = soup.find('h1', {'class': 'set-header-title-h1'}).contents
    temp = ''.join(temp)
    temp = temp.strip()
    temp = temp.replace(':', '')
    temp = temp.replace(' ', '')
 
    test2 = (f"{path}\\{temp}")
#############################################MAKE DIRECTORY FOR SET FOLDERS##################
    try:
        os.mkdir(test2)
    except OSError:
        print ("Creation of the directory %s failed" % test2)
    else:
        print ("Successfully created the directory %s " % test2)
 
############################################GATHER ALL IMAGES####################
    images = soup.find_all('img')
 
    pictures = [] ##stores all the picture URLS
    names = [] ##stores all the name
 
    for image in images[:-1]:
        names.append(image.get('alt'))
        pictures.append(image.get('src'))
####################SAVES ALL IMAGES AS FILES#################
 
    x=0
    for i in pictures:
        fn = names[x] + '.png'
        try:
            with open(f'{test2}\\'+basename(fn),"wb") as f:
                ## the with-block closes the file, so no explicit close is needed
                f.write(requests.get(i).content)
                ##print(i)
                ##print(f'saved {fn} to {path}')
                x+=1
        except OSError:
            print(f"Failed to save image{fn} from url{i}")
            print(len(pictures))
            print(len(names))
            exit()
##################RESETS IMAGES AND NAMES FOR NEXT SET FOLDER#############
 
    pictures.clear()
    names.clear()
print("Completed With No Errors")

Tags: python, beautifulsoup

Solution


Indeed, the images are lazy-loaded by a JS script, which is why the <img> tags further down the page do not have a populated src attribute.

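The solution is quite simple, though. If you look at a few of the <img> tags that have not loaded, you will see that the image link is not in the src attribute but in the data-src attribute.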

For example:

<img alt="Wayward Guide-Beast (ZNR)" class="card znr border-black" data-component="lazy-image" data-src="https://c1.scryfall.com/file/scryfall-cards/normal/front/e/b/ebfe94fc-7a98-4f53-8fd0-f5fd016b1873.jpg?1599472001" src="" title="Wayward Guide-Beast (ZNR)"/>

So all you have to do is check whether src is empty and, if it is, grab the data-src attribute instead.
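
A minimal sketch of that check follows. The helper names pick_image_url and collect_cards are hypothetical and not part of the original script. The sketch assumes each image object supports dict-style .get(), which BeautifulSoup Tag objects do. It also builds the name and URL lists together, so skipping an image with no URL cannot leave the two lists out of step.

```python
def pick_image_url(img):
    """Return the image URL for an <img> tag, or None if it has none.

    `img` is assumed to support dict-style .get(), as a BeautifulSoup
    Tag does. Lazy-loaded images carry an empty src and keep the real
    link in data-src, so fall back to that attribute.
    """
    return img.get('src') or img.get('data-src') or None


def collect_cards(images):
    """Build parallel lists of card names and image URLs.

    An image is skipped entirely when it lacks a name or a URL, so the
    two lists always have the same length and the same order.
    """
    names = []
    pictures = []
    for image in images:
        name = image.get('alt')
        url = pick_image_url(image)
        if name and url:
            names.append(name)
            pictures.append(url)
    return names, pictures
```

In the script above, this would presumably replace the loop that fills names and pictures, for example names, pictures = collect_cards(soup.find_all('img')).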

