python - Saving an image/table from a web page while scraping
Question
I need to scrape an image from this website: https://web.archive.org/web/
for example for stackoverflow and towardsdatascience:
URL
stackoverflow.com
towardsdatascience.com
I don't know how to include the information about the table/image shown in this element:
<div class="sparkline" style="width: 1225px;"><div id="wm-graph-anchor"><div id="wm-ipp-sparkline" title="Explore captures for this URL" style="height: 77px;"><canvas class="sparkline-canvas" width="1225" height="75" alt="sparklines"></canvas></div></div><div id="year-labels"><span class="sparkline-year-label">1996</span><span class="sparkline-year-label">1997</span><span class="sparkline-year-label">1998</span><span class="sparkline-year-label">1999</span><span class="sparkline-year-label">2000</span><span class="sparkline-year-label">2001</span><span class="sparkline-year-label">2002</span><span class="sparkline-year-label">2003</span><span class="sparkline-year-label">2004</span><span class="sparkline-year-label">2005</span><span class="sparkline-year-label">2006</span><span class="sparkline-year-label">2007</span><span class="sparkline-year-label">2008</span><span class="sparkline-year-label">2009</span><span class="sparkline-year-label">2010</span><span class="sparkline-year-label">2011</span><span class="sparkline-year-label">2012</span><span class="sparkline-year-label">2013</span><span class="sparkline-year-label">2014</span><span class="sparkline-year-label">2015</span><span class="sparkline-year-label">2016</span><span class="sparkline-year-label">2017</span><span class="sparkline-year-label">2018</span><span class="sparkline-year-label">2019</span><span class="sparkline-year-label selected-year">2020</span></div></div>
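That is, the image that shows the timeline across the years. If possible, I would like to save this image/table for each website. I tried to write some code, but it is missing this part: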
import json
import pandas as pd
import requests

def my_function(file):
    # De-duplicate the URLs taken from the input DataFrame
    urls = list(set(file.URL.tolist()))
    df_url = pd.DataFrame(columns=['URL'])
    df_url['URL'] = urls
    api_url = 'https://web.archive.org/__wb/search/metadata'
    for url in df_url['URL']:
        res = requests.get(api_url, params={'q': url})
        # part to scrape the image
    return

my_function(df)
Could you give me some advice on how to get these images?
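Answer
If you have each image URL inside the for loop, you can download the images with the urlretrieve function from the Python library urllib.request.
First, import it at the top of the script: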
import os
from urllib.parse import urlparse
import urllib.request
Then download the images with:
for url in df_url['URL']:
    urllib.request.urlretrieve(url, os.path.basename(urlparse(url).path))
If you do not save the files under the URL's base name, you do not need the first two imports.
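One practical wrinkle with the loop above: a bare domain such as stackoverflow.com has no scheme, so urlretrieve raises a ValueError for it, and a URL whose path is empty or "/" has an empty base name. The following is a minimal sketch of one way to handle both cases, not a definitive implementation. The helper names filename_from_url and download_all, the "https://" default scheme, and the fallback to the host name are assumptions made for illustration; they are not part of the original answer. It also only helps for URLs that point directly at a downloadable file.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="download"):
    """Derive a local file name from a URL (hypothetical helper)."""
    # urlparse only recognises the host when a scheme is present,
    # so assume https for bare domains such as "stackoverflow.com".
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    name = os.path.basename(parsed.path)
    # A path of "" or "/" has no base name; fall back to the host name.
    return name or parsed.netloc or default


def download_all(urls, target_dir="."):
    """Download each URL into target_dir and return the paths written."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Same assumption as above: default to https when no scheme is given.
        full_url = url if "://" in url else "https://" + url
        path = os.path.join(target_dir, filename_from_url(url))
        try:
            urllib.request.urlretrieve(full_url, path)
            saved.append(path)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print(f"skipping {url}: {exc}")
    return saved
```

With the DataFrame from the question, this could be called as download_all(df_url['URL']). Failed downloads are reported and skipped rather than stopping the whole loop, which seems a reasonable default when iterating over many sites.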