How do I download images from a website that has no "img" tags?

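Problem description

I've recently been trying to learn web scraping so that I can download all the images from my school's staff directory. However, in the page's elements the images are not stored in img tags; instead they are all stored like this: background-image: url("/common/pages/GalleryPhoto.aspx?photoId=323070&width=180&height=180");

Is there any way around this?

Here is my current code, which is meant to download the images from the target site: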

import os, requests, bs4, webbrowser, random
 
url = 'https://jhs.lsc.k12.in.us/staff_directory' 
  
res = requests.get(url)
try: 
    res.raise_for_status() 
except Exception as exc: 
    print('Sorry an error occured:', exc) 
 
soup = bs4.BeautifulSoup(res.text, 'html.parser') 
element = soup.select('background-image') 
 
for i in range(len(element)): 
    url = element[i].get('img') 
    name = random.randrange(1, 25) 
    file = open(str(name) + '.jpg', 'wb') 
    res = requests.get(url) 
    for chunk in res.iter_content(10000): 
        file.write(chunk) 
    file.close() 
 
print('done')

Tags: python, web-scraping

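Solution


You can use the internal API that this site uses to get the data, including the image URLs. The site first gets the list of staff groups from the /Settings endpoint, then calls the /Search API with all of the group IDs.

The flow is as follows:

  • Get the portletInstanceId value from the div tag that has the data-portlet-instance-id attribute

  • Call the Settings API and get the group IDs: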

    POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings
    
  • Call the Search API with pagination parameters; you can choose how many people to request and how many per page:

    POST https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search
    

The following script gets the first 20 people and puts the results into a pandas DataFrame:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://jhs.lsc.k12.in.us/staff_directory")
soup = BeautifulSoup(r.content, "lxml")

portletInstanceId = soup.select('div[data-portlet-instance-id].staffDirectoryComponent')[0]["data-portlet-instance-id"]

r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Settings",
    json = { "portletInstanceId": portletInstanceId })

groupIds = [t["groupID"] for t in r.json()["d"]["groups"]]
print(groupIds)

payload = {
    "firstRecord": 0,
    "groupIds": groupIds,
    "lastRecord": 20,
    "portletInstanceId": portletInstanceId,
    "searchByJobTitle": True,
    "searchTerm": "",
    "sortOrder": "LastName,FirstName ASC"
}

r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/StaffDirectory/ws/StaffDirectoryWS.asmx/Search",
    json = payload)

results = r.json()["d"]["results"]

# prefix the site host to each image path (empty string when there is no photo)
for t in results:
    t["imageURL"] = f'https://jhs.lsc.k12.in.us/{t["imageURL"]}' if t["imageURL"] else ''
 
df = pd.DataFrame(results)

#whole data
print(df)

#only image url
with pd.option_context('display.max_colwidth', 400):
    print(df["imageURL"])
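The script above stops at printing the URLs, while the question asks for the image files themselves. A minimal sketch of the download step follows. It rests on several assumptions: that the imageURL values have the same GalleryPhoto.aspx?photoId=... shape as the URL quoted in the question, that the photos are JPEGs, and that the endpoint needs no cookies or extra headers. The names image_filename, download_images and the staff_photos folder are made up for illustration, and the standard library's urlopen is used in place of requests to keep the sketch dependency-free.

```python
import os
from urllib.parse import parse_qs, urljoin, urlparse
from urllib.request import urlopen

BASE_URL = "https://jhs.lsc.k12.in.us/"


def image_filename(image_url, fallback):
    """Name the file after the photoId query parameter, or use the fallback."""
    query = parse_qs(urlparse(image_url).query)
    photo_id = query.get("photoId", [fallback])[0]
    # Assumption: the photos are JPEGs, as in the question's own code.
    return f"{photo_id}.jpg"


def download_images(image_urls, out_dir="staff_photos"):
    """Save every non-empty URL into out_dir and return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, image_url in enumerate(image_urls):
        if not image_url:
            continue  # this person has no photo
        # urljoin leaves absolute URLs alone and resolves relative ones.
        full_url = urljoin(BASE_URL, image_url)
        name = image_filename(full_url, fallback=f"staff_{index}")
        path = os.path.join(out_dir, name)
        with urlopen(full_url, timeout=30) as response:
            data = response.read()
        with open(path, "wb") as fh:
            fh.write(data)
        saved.append(path)
    return saved
```

With the DataFrame from the script, this could be called as download_images(df["imageURL"]). Naming files by photoId avoids the collisions that random.randrange(1, 25) causes in the question's code, where two images can get the same name and overwrite each other.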


To get more than the first 20 people, update the firstRecord and lastRecord fields in the payload accordingly.
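One way to step those two fields through the whole directory is sketched below. It assumes that lastRecord is exclusive (so 0 to 20 followed by 20 to 40 does not repeat a person) and that the Search endpoint returns an empty results list once the records run out; neither is confirmed by the site, so check the first pages for duplicates. The helpers record_windows and fetch_all, and the max_pages safety cap, are hypothetical names introduced here.

```python
from typing import Callable, Iterator, List, Tuple


def record_windows(page_size: int, start: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (firstRecord, lastRecord) pairs: (0, 20), (20, 40), ..."""
    first = start
    while True:
        yield first, first + page_size
        first += page_size


def fetch_all(search_page: Callable[[int, int], List],
              page_size: int = 20,
              max_pages: int = 100) -> List:
    """Call search_page window by window until it returns an empty page."""
    results = []
    for page_number, (first, last) in enumerate(record_windows(page_size)):
        if page_number >= max_pages:
            break  # safety cap so a misbehaving endpoint cannot loop forever
        page = search_page(first, last)
        if not page:
            break  # assumption: an empty page means there are no more records
        results.extend(page)
    return results


# Wiring this to the script above (reuses its `requests` import and `payload`):
#
# def search_page(first, last):
#     body = dict(payload, firstRecord=first, lastRecord=last)
#     r = requests.post("https://jhs.lsc.k12.in.us/Common/controls/"
#                       "StaffDirectory/ws/StaffDirectoryWS.asmx/Search", json=body)
#     return r.json()["d"]["results"]
#
# everyone = fetch_all(search_page, page_size=20)
```

Passing the request in as a function keeps the paging logic separate from the HTTP call, so it can be tried against a plain list before it is pointed at the live site.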

