Problem with Python web scraping

Problem description

I am trying to collect weather data from this website: https://www.almanac.com/weather/history/zipcode/10001/

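The scraping works, but the code breaks at random and the table in the HTML seems to disappear. Because of that, find returns None and there is no data. This happens on random dates, and when it does, every set of data scraped for that day is filled in as no data.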

import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup

year = 2000

def seperate(string):
  string = str(string)
  temp = string.split(">")
  outlist = temp[1].split("<")
  if outlist[0] == "No data.":
    return "none"
  else:
    return float(outlist[0])

dict = {}
while year <= 2020:
  for i in range(12):
    url = "https://www.almanac.com/weather/history/zipcode/10001/" + str(year) +"-"+ str(i+1) + "-1"
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    temp = soup.find("tr", class_="weatherhistory_results_datavalue temp")
    prcp = soup.find("tr", class_="weatherhistory_results_datavalue prcp")
    visib = soup.find("tr", class_="weatherhistory_results_datavalue visib")
    wdsp = soup.find("tr", class_="weatherhistory_results_datavalue wdsp")
    data = [temp, prcp, visib, wdsp]
    nums = []
    for item in data:
      if (item.find("p", class_="nullvalue")) == None:
        nums.append(seperate(item.find("span", class_="value")))
      else:
        nums.append(None)
    dict[(year + float(i)/12.0)] = nums
    print(nums)
  year+=1

print(dict)

Tags: python, web-scraping

Solution


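I was able to reproduce this error. The problem is that the server responds with 429 Too Many Requests after several consecutive requests. The error page contains no weather table, which is why find returns None.

Try handling this error by checking response.status_code, and add some delay in the request loop to avoid the problem: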

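import time

while year <= 2020:
  for i in range(12):
    url = "https://www.almanac.com/weather/history/zipcode/10001/" + str(year) + "-" + str(i+1) + "-1"
    response = requests.get(url)
    if response.status_code != 200:
      # handle error (for example 429 Too Many Requests): report it and skip this date
      print("Request failed:", response.status_code, url)
      time.sleep(1)
      continue
    # ... (parse the page and build nums exactly as in the question)
    print(nums)
    # Wait some time before next request
    time.sleep(1)
  year+=1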

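Skipping a rate-limited date loses that data point, so one option is to retry instead. The following is a minimal sketch, not the answerer's code: fetch_with_retry and its parameters (max_retries, base_delay, sleep) are hypothetical names, and it assumes the server may or may not send a numeric Retry-After header, falling back to exponential backoff when it does not.

```python
import time


def fetch_with_retry(session, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Return the response for url, retrying while the server answers 429.

    session is any object with a .get(url) method, e.g. requests.Session().
    Returns None if every attempt was rate-limited.
    """
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code != 429:
            # Not rate-limited; the caller should still check for 200.
            return response
        # Prefer the server's Retry-After hint (seconds) when it is numeric;
        # otherwise back off exponentially: base_delay, 2x, 4x, ...
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return None
```

In the loop above, requests.get(url) could then be replaced with fetch_with_retry(requests.Session(), url), treating a None result or a status code other than 200 as a failed date. The fixed time.sleep(1) between successful requests is still worth keeping, since it lowers the chance of hitting the limit in the first place.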