How to find out which query a website runs to fetch its data, and how to extract that data with Python

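Problem description

I am trying to get the daily average temperature shown at https://www.wunderground.com/history/daily/pk/karachi/OPKC/date/2017-1-3. However, my script returns no values, and if I simply copy and paste the data I get "No data recorded" instead of the table displayed on the site. What am I doing wrong? I am using the following code: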

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

# Browser-like user agent, sent with the request for the history page
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
}
url = "https://www.wunderground.com/history/daily/pk/karachi/OPKC/date/2017-1-3"
res = requests.get(url, headers=headers)

# Parse the returned HTML and print every <table> it contains
soup = BeautifulSoup(res.content, 'lxml')
tables = soup.find_all('table')
for table in tables:
    df = pd.read_html(str(table))
    print(tabulate(df[0], headers='keys', tablefmt='psql'))
print(soup.get_text())

Tags: python-3.x, web-scraping

Solution


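I inspected the response returned by requests and found no table tags in r.content, so the table you see in the browser is not part of the HTML that requests receives. Instead of reading the data from the page, consider calling the site's API as described below.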
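A quick way to confirm this kind of diagnosis is to count the table tags in the HTML you actually received. The sketch below uses only the standard library; the two HTML strings are made-up stand-ins (not real responses from the site) for a server response and for what a browser shows after scripts have run, and count_tag is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of opening `tag` elements found in `html`."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts.get(tag, 0)


# Hypothetical stand-ins: what a server might send vs. what a browser renders.
static_html = "<html><body><app-root></app-root><script src='main.js'></script></body></html>"
rendered_html = "<html><body><table><tr><td>63</td></tr></table></body></html>"

print(count_tag(static_html, "table"))    # 0
print(count_tag(rendered_html, "table"))  # 1
```

If the count is zero for the text returned by requests while the browser clearly shows a table, the data is most likely loaded by a separate request, which is the one worth calling directly.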

Pass the query parameters to the get method to obtain a JSON response, then iterate over the objects in that response to pick out the values you want.
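To make "passing parameters" concrete, here is a minimal sketch of how a parameter dictionary turns into the final query URL. build_history_url is a hypothetical helper, the key is a placeholder, and the URL layout is simply copied from the endpoint used in this answer, so treat it as an illustration rather than documented API behaviour.

```python
from datetime import date
from urllib.parse import urlencode

API_ROOT = "https://api.weather.com/v1/geocode"


def build_history_url(lat, lon, day, api_key, units="e"):
    """Build the historical-observations URL for a single day (assumed layout)."""
    params = {
        "apiKey": api_key,
        "startDate": day.strftime("%Y%m%d"),  # e.g. 20170103
        "endDate": day.strftime("%Y%m%d"),
        "units": units,
    }
    return f"{API_ROOT}/{lat}/{lon}/observations/historical.json?{urlencode(params)}"


# Coordinates are passed as strings so they appear in the URL exactly as written.
url = build_history_url("24.90138817", "67.15000153", date(2017, 1, 3), "YOUR_API_KEY")
print(url)
```

requests.get(url, params=...) performs the same encoding for you, so in practice you only need the dictionary.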

import csv

import requests

# Historical-observations endpoint; the two numbers in the path are latitude and longitude
base_url = "https://api.weather.com/v1/geocode/24.90138817/67.15000153/observations/historical.json"

params = {
    "apiKey": "6532d6454b8aa370768e63d6ba5a832e",
    "startDate": "20170103",
    "endDate": "20170103",
    "units": "e",
}

r = requests.get(base_url, params=params)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
d = r.json()

headers = ['timestamp', 'temp', 'precip/hr', 'windspeed']
# 'w' avoids repeating the header row on reruns; newline='' is what the csv module expects
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for item in d['observations']:
        writer.writerow([item['expire_time_gmt'], item['temp'], item['precip_hrly'], item['wspd']])

This writes the following to results.csv:

timestamp,temp,precip/hr,windspeed
1483392300,63,,7
1483394100,61,,3
1483395900,61,,3
1483397700,59,,7
1483399500,59,,6
1483401300,59,,5
1483403100,57,,7
1483404900,57,,5
1483406700,57,,5
1483408500,57,,9
1483413900,55,,7
1483415700,55,,7
1483417500,55,,6
1483419300,55,,3
1483421100,57,,3
1483422900,61,,3
1483424700,64,,5
1483426500,66,,7
1483428300,66,,5
1483430100,72,,5
1483431900,73,,0
1483433700,75,,2
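
Since the question asks for a daily average, here is a rough sketch of how these rows could be reduced to one. It inlines the first three rows of the output so it runs without the file, treats the timestamps as Unix epoch seconds, and assumes that "units": "e" means the temperatures are in Fahrenheit. A plain mean of the observations is not necessarily the same figure the website reports as its daily average.

```python
import csv
import io
from datetime import datetime, timezone
from statistics import mean

# First rows of the output above, inlined so the sketch runs without results.csv.
SAMPLE = """timestamp,temp,precip/hr,windspeed
1483392300,63,,7
1483394100,61,,3
1483395900,61,,3
"""


def load_rows(handle):
    """Return (UTC datetime, temperature) pairs, skipping rows with no temperature."""
    rows = []
    for rec in csv.DictReader(handle):
        if not rec["temp"]:
            continue
        when = datetime.fromtimestamp(int(rec["timestamp"]), tz=timezone.utc)
        rows.append((when, float(rec["temp"])))
    return rows


def fahrenheit_to_celsius(deg_f):
    return (deg_f - 32) * 5 / 9


rows = load_rows(io.StringIO(SAMPLE))
avg_f = mean(temp for _, temp in rows)
print(rows[0][0].isoformat())
print(f"mean: {avg_f:.1f} F = {fahrenheit_to_celsius(avg_f):.1f} C")
```

To run it on the real file, replace io.StringIO(SAMPLE) with open('results.csv', newline='').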
