Saving an image/table from a web page while scraping

Problem description

I need to scrape an image from this website: https://web.archive.org/web/, for example for stackoverflow and towardsdatascience.

URL

stackoverflow.com
towardsdatascience.com

I don't know how to capture the information about the table/image, which is rendered by this markup:

<div class="sparkline" style="width: 1225px;"><div id="wm-graph-anchor"><div id="wm-ipp-sparkline" title="Explore captures for this URL" style="height: 77px;"><canvas class="sparkline-canvas" width="1225" height="75" alt="sparklines"></canvas></div></div><div id="year-labels"><span class="sparkline-year-label">1996</span><span class="sparkline-year-label">1997</span><span class="sparkline-year-label">1998</span><span class="sparkline-year-label">1999</span><span class="sparkline-year-label">2000</span><span class="sparkline-year-label">2001</span><span class="sparkline-year-label">2002</span><span class="sparkline-year-label">2003</span><span class="sparkline-year-label">2004</span><span class="sparkline-year-label">2005</span><span class="sparkline-year-label">2006</span><span class="sparkline-year-label">2007</span><span class="sparkline-year-label">2008</span><span class="sparkline-year-label">2009</span><span class="sparkline-year-label">2010</span><span class="sparkline-year-label">2011</span><span class="sparkline-year-label">2012</span><span class="sparkline-year-label">2013</span><span class="sparkline-year-label">2014</span><span class="sparkline-year-label">2015</span><span class="sparkline-year-label">2016</span><span class="sparkline-year-label">2017</span><span class="sparkline-year-label">2018</span><span class="sparkline-year-label">2019</span><span class="sparkline-year-label selected-year">2020</span></div></div>

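That is, the timeline image that shows the years. If possible, I would like to save this image/table for each website. I tried to write some code, but it is missing this part: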

import json
import pandas as pd
import requests


def my_function(file):
    # De-duplicate the values in the DataFrame's URL column
    urls = list(set(file.URL.tolist()))

    df_url = pd.DataFrame(columns=['URL'])
    df_url['URL'] = urls

    api_url = 'https://web.archive.org/__wb/search/metadata'

    for url in df_url['URL']:
        res = requests.get(api_url, params={'q': url})
        # part to scrape the image
    return

# df is the DataFrame holding the URL column shown above
my_function(df)

Could you give me some advice on how to get these images?

Tags: python, web-scraping, python-requests

Solution


If you have each image URL inside the for loop, you can download the image with the urlretrieve function from Python's urllib.request module.

First, import it at the top of your script:

import os
from urllib.parse import urlparse
import urllib.request

Then download the images with:

for url in df_url['URL']:
    urllib.request.urlretrieve(url, os.path.basename(urlparse(url).path))

If you are not saving the files under the URL's basename, you can skip the first two imports.
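The loop above fails in a couple of common cases: a URL whose path ends in a slash has an empty basename, and a value with no scheme (such as the bare `stackoverflow.com` in the question's URL column) makes urlretrieve raise a ValueError. A minimal sketch that guards against both might look like the following. The helper names `filename_from_url` and `download_all`, the `download.bin` fallback name, and the `downloads` output directory are all assumptions made for illustration, and the sketch only applies when each URL points directly at an image file.

```python
import os
from pathlib import Path
from urllib.parse import urlparse
import urllib.request


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, or a default if it is empty."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download_all(urls, out_dir="downloads"):
    """Download each URL into out_dir and return the paths that were saved."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = out / filename_from_url(url)
        try:
            urllib.request.urlretrieve(url, str(target))
        except (OSError, ValueError) as exc:
            # OSError covers URLError/HTTPError; ValueError covers URLs
            # without a scheme, e.g. "stackoverflow.com"
            print(f"skipped {url}: {exc}")
            continue
        saved.append(target)
    return saved
```

With the question's data this could be called as `download_all(df_url['URL'])`, provided the column holds full image URLs including the `https://` prefix. Entries that cannot be fetched are reported and skipped instead of stopping the whole loop.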

