python - Scraping a site where reaching a particular table takes multiple requests (chained drop-down menus)

Question

I am trying to scrape data from this site: https://pigeon-ndb.com/races/. At first I thought the problem would be easy once I figured out how to select elements from the drop-down menus, but it turned out to be more complicated than expected.

Ideally, I want to iterate over all years and seasons (2010-2019), and then over every record for every organization and race. In short: scrape the data from all the tables on the site using Scrapy (no Selenium).
I know the problem involves the GET requests behind the drop-down menus (3 in total), shown below:

1. https://pigeon-ndb.com/api/?request=get_databases (somehow select the year and season from the JSON for the next request)
2. https://pigeon-ndb.com/api/?request=get_organizations&database=2010%20OB&_=1557098607652 (needs the year and season from the previous request to work)
3. https://pigeon-ndb.com/api/?request=get_races&organization=&_=1557098607653 (needs an organization name from the previous request, #2, to work)
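The chaining of these three endpoints can be sketched as plain URL construction. This is a minimal sketch based on the example URLs above; the organization name `GHC` and `orgNum` value are made-up placeholders:

```python
from urllib.parse import quote


def build_urls(year, season, organization, org_num):
    """Build the three chained API URLs described above.

    The query layout follows the example URLs in the question;
    the database parameter is "<year> <season>", URL-encoded.
    """
    base = "https://pigeon-ndb.com/api/"
    databases_url = base + "?request=get_databases"
    # quote() encodes the space between year and season as %20.
    organizations_url = (base + "?request=get_organizations&database="
                         + quote("{} {}".format(year, season)))
    races_url = (base + "?request=get_races&organization="
                 + quote(organization) + "&orgNum=" + str(org_num))
    return databases_url, organizations_url, races_url


urls = build_urls("2010", "OB", "GHC", 123)
```

Each request's output supplies the parameters for the next, which is why the spider needs one callback per step.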
The following code is the basic outline of the Scrapy spider I plan to use; it may change:

```python
from scrapy import Spider
from scrapy.http import Request


class PigeonSpider(Spider):
    name = 'pigeonspider'
    allowed_domains = ['pigeon-ndb.com']
    start_urls = ['https://pigeon-ndb.com/races/']

    def parse(self, response):
        pass

    def parse2(self, response):
        pass

    def parse3(self, response):
        pass
```
Since these are GET requests, I expect to use the following (or some variant of it) several times:

```python
yield Request(url, callback=self.parse2)
```

I think I will need the json module for the dynamic parts of the scraping process, but I am not sure whether that is the best approach.
In the scrapy shell:

```python
import json
jsonresponse = json.loads(response.body)
```

Here is the JSON output of the first request (https://pigeon-ndb.com/api/?request=get_databases):

```json
{'data': [{'year': '2010', 'season': 'OB'}, {'year': '2010', 'season': 'YB'}, {'year': '2011', 'season': 'OB'}, {'year': '2011', 'season': 'YB'}, {'year': '2012', 'season': 'OB'}, {'year': '2012', 'season': 'YB'}, {'year': '2013', 'season': 'OB'}, {'year': '2013', 'season': 'YB'}, {'year': '2014', 'season': 'OB'}, {'year': '2014', 'season': 'YB'}, {'year': '2015', 'season': 'OB'}, {'year': '2015', 'season': 'YB'}, {'year': '2016', 'season': 'OB'}, {'year': '2016', 'season': 'YB'}, {'year': '2017', 'season': 'OB'}, {'year': '2017', 'season': 'YB'}, {'year': '2018', 'season': 'OB'}, {'year': '2018', 'season': 'YB'}, {'year': '2019', 'season': 'OB'}], 'jsonapi': {'version': 2.2, 'db': 'pigeon-ndb'}, 'meta': {'copyright': 'Copyright 2019 Craig Vander Galien', 'authors': ['Craig Vander Galien']}}
```
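Each `(year, season)` pair in that response names one database, such as `2010 OB`. A minimal sketch of extracting those names, using a truncated copy of the sample response above:

```python
import json

# Truncated to three entries; the real response covers 2010-2019.
sample = '''{"data": [{"year": "2010", "season": "OB"},
                      {"year": "2010", "season": "YB"},
                      {"year": "2019", "season": "OB"}],
             "jsonapi": {"version": 2.2, "db": "pigeon-ndb"}}'''

result = json.loads(sample)
# Build the "<year> <season>" database names the next request needs.
databases = ["{} {}".format(node["year"], node["season"])
             for node in result["data"]]
```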
I am still learning Scrapy, so example code showing how to approach this would be appreciated. Thanks!

Edit:

I tried to implement the following code, but I am running into an error:
```python
from scrapy import Spider
from scrapy.http import Request
import json


class PigeonSpider(Spider):
    name = 'pigeonspider'
    allowed_domains = ['pigeon-ndb.com']
    start_urls = ['https://pigeon-ndb.com/races/']

    def parse(self, response):
        result = json.loads(response.body)
        for node in result['data']:
            yield Request(
                url='https://pigeon-ndb.com/api/?request=get_organizations&database={year}%20{season}'.format(year=node["year"], season=node["season"]),
                callback=self.parse_organizations,
                cookies={'database': '{year} {season}'.format(year=node['year'], season=node['season'])},
                meta={
                    'year': node['year'],
                    'season': node['season'],
                },
            )

    def parse_organizations(self, response):
        result = json.loads(response.body)
        for node in result['data']:
            org_num = node['orgNum']
            if node['orgNum'] is None:
                org_num = 'null'
            yield Request(
                url='https://pigeon-ndb.com/api/?request=get_races&organization={org_name}&orgNum={org_num}'.format(org_name=node["Sys"], org_num=org_num),
                callback=self.parse_races,
                headers={'x-requested-with': 'XMLHttpRequest'},
                cookies={'database': '{year} {season}'.format(year=response.meta["year"], season=response.meta["season"])},
            )

    def parse_races(self, response):
        result = json.loads(response.body)
        for node in result['clockings']['data']:
            yield {
                'race': node['racename'],
                'season': node['season'],
                'date': node['date'],
                'year': node['year'],
                'time': node['Time'],
                'complevel': node['CompLevel'],
                'class': node['class'],
                'city': node['City'],
                'zip': node['Zip'],
                'state': node['State'],
                'entry': node['entry'],
                'first_name': node['FirstName'],
                'last_name': node['LastName'],
                'line_num': node['LineNum'],
                'band_num': node['band_no'],
                'color': node['BB'],
                'sex': node['sex'],
                'arrival_time': node['arri_time'],
                'distance': node['distance'],
                'speed': node['speed'],
                'reg_points': node['reg_points'],
                'std_points': node['std_points'],
                'unirate': node['unirate'],
                'place': node['Place'],
            }
```
The error when running the spider:

```
Traceback (most recent call last):
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/glenn/Projects/pigeonscraper/pigeonscraper/spiders/pigeonspider.py", line 13, in parse
    result = json.loads(response.body)
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/glenn/anaconda3/envs/scraperenv/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
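The traceback shows `json.loads` failing on the very first character of the body. That is what happens when the response is not JSON at all: here `start_urls` points at the HTML page https://pigeon-ndb.com/races/, so the first callback receives markup rather than API output. A minimal reproduction of the failure:

```python
import json

# The spider's first request fetches the HTML page, not the API,
# so the parse() callback receives markup like this instead of JSON:
html_body = b"<!DOCTYPE html><html><body>...</body></html>"

try:
    json.loads(html_body)
    message = None
except json.JSONDecodeError as exc:
    # "<" is not a valid start of a JSON value, hence
    # "Expecting value: line 1 column 1 (char 0)".
    message = str(exc)
```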
Solution

First, you need to select the database (the year and season) using cookies. After that, you can iterate over the JSON result:
```python
from scrapy import Spider
from scrapy.http import Request
import json


class PigeonSpider(Spider):
    name = 'pigeonspider'
    allowed_domains = ['pigeon-ndb.com']
    start_urls = ['https://pigeon-ndb.com/api/?request=get_databases']

    def parse(self, response):
        result = json.loads(response.body)
        for node in result["data"]:
            yield Request(
                url="https://pigeon-ndb.com/api/?request=get_organizations&database={year}%20{season}".format(year=node["year"], season=node["season"]),
                callback=self.parse_organizations,
                # headers={'x-requested-with': "XMLHttpRequest", 'referer': "https://pigeon-ndb.com/races/"},
                cookies={'database': '{year} {season}'.format(year=node["year"], season=node["season"])},
                meta={
                    "year": node["year"],
                    "season": node["season"],
                },
            )

    def parse_organizations(self, response):
        result = json.loads(response.body)
        for node in result["data"]:
            org_num = node["orgNum"]
            if node["orgNum"] is None:
                org_num = "null"
            yield Request(
                url="https://pigeon-ndb.com/api/?request=get_races&organization={org_name}&orgNum={org_num}".format(org_name=node["Sys"], org_num=org_num),
                callback=self.parse_races,
                headers={'x-requested-with': "XMLHttpRequest"},
                cookies={'database': '{year} {season}'.format(year=response.meta["year"], season=response.meta["season"])},
            )

    def parse_races(self, response):
        result = json.loads(response.body)
        for race_key in result["data"].keys():
            race_date = result["data"][race_key]["date"]
            race_release_time = result["data"][race_key]["release_time"]
            race_bird_attend = result["data"][race_key]["bird_attend"]
            # etc.
```
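One detail worth noting in `parse_organizations` above: the API appears to expect the literal string `null` when an organization has no `orgNum`. A minimal sketch of that URL-building step, using a made-up organization node with the same `Sys`/`orgNum` keys the spider reads:

```python
# Hypothetical node from the get_organizations response; the keys
# "Sys" and "orgNum" match those used in the spider above.
node = {"Sys": "GHC", "orgNum": None}

# Substitute the literal string "null" when orgNum is missing,
# as parse_organizations does.
org_num = node["orgNum"] if node["orgNum"] is not None else "null"

url = ("https://pigeon-ndb.com/api/?request=get_races"
       "&organization={org}&orgNum={num}".format(org=node["Sys"],
                                                 num=org_num))
```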
Update: you completely ignored my comment. `parse_race_details` is not implemented in your code at all!
```python
from scrapy import Spider
from scrapy.http import Request
import json


class PigeonSpider(Spider):
    name = 'pigeonspider'
    allowed_domains = ['pigeon-ndb.com']
    start_urls = ['https://pigeon-ndb.com/api/?request=get_databases']
    debug = False

    def parse(self, response):
        result = json.loads(response.body)
        for node in result["data"]:
            yield Request(
                url="https://pigeon-ndb.com/api/?request=get_organizations&database={year}%20{season}".format(
                    year=node["year"], season=node["season"]),
                callback=self.parse_organizations,
                # headers={'x-requested-with': "XMLHttpRequest", 'referer': "https://pigeon-ndb.com/races/"},
                cookies={
                    'database': '{year} {season}'.format(
                        year=node["year"],
                        season=node["season"])},
                meta={
                    "year": node["year"],
                    "season": node["season"],
                },
                dont_filter=True,
            )
            # Debug
            if self.debug:
                break

    def parse_organizations(self, response):
        result = json.loads(response.body)
        for node in result["data"]:
            org_num = node["orgNum"]
            if node["orgNum"] is None:
                org_num = "null"
            yield Request(
                url="https://pigeon-ndb.com/api/?request=get_races&organization={org_name}&orgNum={org_num}".format(org_name=node["Sys"], org_num=org_num),
                callback=self.parse_races,
                headers={'x-requested-with': "XMLHttpRequest"},
                cookies={'database': '{year} {season}'.format(year=response.meta["year"], season=response.meta["season"])},
                dont_filter=True,
                # meta={
                #     "year": response.meta["year"],
                #     "season": response.meta["season"],
                # },
            )
            # Debug
            if self.debug:
                break

    def parse_races(self, response):
        result = json.loads(response.body)
        if result["response"] == "failed":
            print("Failed response!")
        for race_key in result["data"].keys():
            race_name = result["data"][race_key]["racename"]
            race_date = result["data"][race_key]["date"].replace("/", "%2F")
            race_time = result["data"][race_key]["Time"]
            yield Request(
                url="https://pigeon-ndb.com/api/?request=get_race_details&racename={race_name}&date={race_date}&time={race_time}".format(race_name=race_name, race_date=race_date, race_time=race_time),
                callback=self.parse_race_details,
                headers={'x-requested-with': "XMLHttpRequest"},
                # cookies={'database': '{year} {season}'.format(year=response.meta["year"], season=response.meta["season"])},
                dont_filter=True,
            )
            # Debug
            if self.debug:
                break

    def parse_race_details(self, response):
        result = json.loads(response.body)
        if result["response"] == "failed":
            print("Failed response!")
        for node in result['data']['clockings']['data']:
            yield {
                'race': node['racename'],
                'season': node['season'],
                'date': node['date'],
                'year': node['year'],
                'time': node['Time'],
                'complevel': node['CompLevel'],
                'class': node['Class'],
                'city': node['City'],
                'zip': node['Zip'],
                'state': node['State'],
                'entry': node['entry'],
                'first_name': node['FirstName'],
                'last_name': node['LastName'],
                'line_num': node['LineNum'],
                'band_num': node['band_no'],
                'color': node['BB'],
                'sex': node['sex'],
                'arrival_time': node['arri_time'],
                'distance': node['distance'],
                'speed': node['speed'],
                'reg_points': node['reg_points'],
                'std_points': node['std_points'],
                'unirate': node['unirate'],
                'place': node['Place'],
            }
```
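A small detail in `parse_races`: the race date contains `/` characters, which must be percent-encoded before being embedded in the `get_race_details` URL. The code does this with `str.replace`; `urllib.parse.quote` with `safe=""` achieves the same thing more generally. A sketch with a hypothetical date in the `MM/DD/YYYY` layout:

```python
from urllib.parse import quote

race_date = "04/27/2019"  # hypothetical date value from the API

# Manual replacement, as done in parse_races above:
encoded_manual = race_date.replace("/", "%2F")

# Standard-library equivalent; safe="" forces "/" to be encoded too
# (by default quote() leaves "/" alone).
encoded_quote = quote(race_date, safe="")
```

Both forms produce the same `%2F`-encoded string, so the spider's `replace` call is fine for this specific parameter.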