Python requests-based program to automate bulk downloading of data from the climate website (http://www.climate.weather.gc.ca)

Problem description

I am trying to build a program that downloads a .csv and puts it into a pandas DataFrame. The instructions suggest I use wget on Linux, but it does not work properly when I insert different weather stations via 'http.ID={a}/.data'.format(a) from a dictionary I made of all the weather stations I have to monitor. Here is the readme from the Government of Canada website.

------------------------------------------------------------------------


Readme.txt

URL-based program to automate bulk downloading of data from the climate website (http://www.climate.weather.gc.ca). Version: 2016-05-10


Environment and Climate Change Canada

To read this file online, go to:

ftp://client_climate@ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/

Folder: Get_More_Data_Plus_de_donnees > Readme.txt

Instructions on how to download all the weather data for one station from Environment and Climate Change Canada's climate website:

A daily-updated list of the climate stations in the National Archive, including their Climate ID, Station ID, WMO ID, TC ID and coordinates, can be found in the following folder:
Get_More_Data_Plus_de_donnees > Station Inventory EN.csv

Use the following utilities to download the data:
wget (GNU/Linux operating systems)
Cygwin (Windows operating systems) https://www.cygwin.com
Homebrew (OS X - Apple) http://brew.sh/

Example: download all the available hourly data (in .csv format) for Yellowknife A, from 1998 to 2008.

Command line:

for year in `seq 1998 2008`;do for month in `seq 1 12`;do wget --content-disposition "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=1706&Year=${year}&Month=${month}&Day=14&timeframe=1&submit= Download+Data" ;done;done

where:

year = change values in the command line (seq 1998 2008)

month = change values in the command line (seq 1 12)

format = [csv|xml]: the output format

timeframe = 1: for hourly data

timeframe = 2: for daily data

timeframe = 3: for monthly data

Day: the value of the "day" variable is not used, it can be any value

For another station, change the value of the stationID variable

For data in XML format, change the value of the format variable in the URL to xml.

For the information in French, change Download+Data to
++T%C3%A9l%C3%A9charger+%0D%0Ades+donn%C3%A9es, and likewise change _e to _f in the URL.

For questions or concerns, please contact our National Climate Services office: ec.services.climatiques-climate.services.ec@canada.ca

------------------------------------------------------------------------
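The query parameters documented in the readme above map directly onto a requests call. Below is a minimal sketch (not from the readme): the params dict mirrors the documented URL fields for the Yellowknife A example and lets requests handle the URL encoding.

import requests

# Query parameters as documented in the readme; the values reproduce the
# Yellowknife A example (stationID 1706, hourly data).
params = {
    "format": "csv",       # csv or xml
    "stationID": 1706,
    "Year": 1998,
    "Month": 1,
    "Day": 14,             # per the readme, this value is not used
    "timeframe": 1,        # 1 = hourly, 2 = daily, 3 = monthly
    "submit": "Download Data",
}

response = requests.get(
    "http://climate.weather.gc.ca/climate_data/bulk_data_e.html",
    params=params,
)
response.raise_for_status()
print(response.text[:300])  # the first lines are station metadata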


I originally used wget to download a csv file from this link. It works without the .format(ID,year)....

This works:

"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=50308&Year=2019&Month=3&Day=14&timeframe=2&submit= Download+Data"

But this does not:

"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2&submit= Download+Data".format(ID,year)

I need to be able to insert different years and station IDs.

This does not work: no matter what the ID is, I still get the same weather. It produces a result, but it is not the weather station with ID 50308.

ID = '50308'
year = '2019'
!wget -O Weather.csv "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2&submit= Download+Data".format(ID,year)

df = pd.read_csv('Weather.csv',skiprows = 24)

I tried to replace the above with the following:

import pandas as pd
import io
import requests

ID = '49088'
year = '2019'


url="http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2&submit= Download+Data".format(ID,year)    
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))

This is the error it spits out:

ParserError: Error tokenizing data. C error: Expected 2 fields in line 26, saw 27

I would like to make a dictionary of the weather station names and IDs so I can create a function and iterate through the dictionary with a function that downloads the data and puts it into a pandas DataFrame.

Tags: python-3.x, pandas, csv, python-requests, string.format

Solution


Well, the requests call fetches the .csv just fine; the error is that pandas cannot read the csv properly. The downloaded file starts with lines that contain only two fields, coming before the blank line and the rows of proper data. Maybe you don't need that introduction in pandas:

"Station Name","DELTA BURNS BOG"
"Province","BRITISH COLUMBIA"
"Current Station Operator","Environment and Climate Change Canada - Meteorological Service of Canada"
"Latitude","49.13"
"Longitude","-123.00"
"Elevation","3.10"
 .. etc ...

for the first 24 lines, then a blank line, and the remainder is your data:

"Date/Time","Year","Month","Day","Data Quality","Max Temp (°C)","Max Temp Flag","Min Temp (°C)","Min Temp Flag","Mean Temp (°C)","Mean Temp Flag","Heat Deg Days (°C)","Heat Deg Days Flag","Cool Deg Days (°C)","Cool Deg Days Flag","Total Rain (mm)","Total Rain Flag","Total Snow (cm)","Total Snow Flag","Total Precip (mm)","Total Precip Flag","Snow on Grnd (cm)","Snow on Grnd Flag","Dir of Max Gust (10s deg)","Dir of Max Gust Flag","Spd of Max Gust (km/h)","Spd of Max Gust Flag"
"2019-01-01","2019","01","01","","5.3","","-0.6","","2.4","","15.6","","0.0","","","","","M","0.0","","","","","","",""
"2019-01-02","2019","01","02","","5.2","","0.6","","2.9","","15.1","","0.0","","","","","M","3.4","","","","","","",""
"2019-01-03","2019","01","03","","9.1","","3.4","","6.2","","11.8","","0.0","","","","","M","61.0","","","","","","",""
...

So if you tell pandas to skip the first 25(?) lines, you should avoid the parsing problem:

h=pd.read_csv(io.StringIO(s.decode('utf-8')), skiprows = 25)
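Since the length of that metadata preamble is not guaranteed to be the same for every station, here is a small sketch that locates the header row instead of hardcoding 25. It reuses s from the snippet above and assumes the data header always begins with "Date/Time", as in the excerpt:

import io
import pandas as pd

text = s.decode('utf-8')
lines = text.splitlines()

# Find the index of the real CSV header row and skip everything before it.
header_idx = next(i for i, line in enumerate(lines) if line.startswith('"Date/Time"'))
c = pd.read_csv(io.StringIO(text), skiprows=header_idx)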

But then again, maybe you do need those rows. (I don't really know pandas, so hopefully wiser words will be along soon.)
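Putting the pieces together, a minimal sketch of the dictionary-driven loop the question asks for. The station names and the second ID in the dictionary below are placeholders, not verified entries from the station inventory:

import io
import requests
import pandas as pd

def fetch_daily(station_id, year):
    """Download one station-year of daily data and return it as a DataFrame."""
    base = ("http://climate.weather.gc.ca/climate_data/bulk_data_e.html"
            "?format=csv&stationID={}&Year={}&Month=3&Day=14&timeframe=2"
            "&submit=Download+Data")
    r = requests.get(base.format(station_id, year))
    r.raise_for_status()
    text = r.content.decode("utf-8")
    # Skip the station-metadata preamble by locating the real header row.
    header_idx = next(i for i, line in enumerate(text.splitlines())
                      if line.startswith('"Date/Time"'))
    return pd.read_csv(io.StringIO(text), skiprows=header_idx)

# Placeholder mapping of station name -> station ID.
stations = {"DELTA BURNS BOG": 49088, "SOME OTHER STATION": 50308}
frames = {name: fetch_daily(sid, 2019) for name, sid in stations.items()}
print(frames["DELTA BURNS BOG"].head())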

