python - How to run a spider multiple times in Scrapy by changing one part of the URL in "def start_requests(self)"
Problem Description
I have a question about the logic of this spider. I want to crawl one of the categories of the Castbox website, which has infinite pagination. My idea was to split the URL of the JSON file, slice it, and finally rejoin it so that I could parse it, using a while loop as the condition for my spider to keep crawling the elements I need.
Let me explain.
When I inspected the JSON URL of the Castbox website, I found that each time the page reloads as you scroll down, only one part of the URL changes. That part is called "skip", and it varies between 0 and 200, as you can see in the URL. So I thought that if I could write a "def start_requests(self)" in which the "skip" part of the URL changes from 0 to 200, I would get what I want. Can such a function change the URL on every iteration? If so, what is wrong with the "def start_requests(self)" part of my spider?
By the way, when I run it, I get this error: ModuleNotFoundError: No module named 'urlparse'
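For reference, on Python 3 the urlparse module has been merged into urllib.parse, which is why that import fails. A minimal sketch of rewriting just the skip query parameter with the Python 3 API, using the URL from the question, would look like this:

from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

url = ('https://everest.castbox.fm/data/top_channels/v2?category_id=10021'
       '&country=us&skip=0&limit=60&web=1&m=20201112'
       '&n=609584ea96edb64605bca96212128aa5&r=1')

# Parse the query string, overwrite only the "skip" value, and rebuild the URL.
parts = urlsplit(url)
query = parse_qs(parts.query, keep_blank_values=True)
query['skip'] = ['60']  # any value from 0 to 200
updated_url = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))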
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import json


class ArtsPodcastsSpider(scrapy.Spider):
    name = 'arts_podcasts'
    allowed_domains = ['www.castbox.fm']

    def start_requests(self):
        try:
            if response.request.meta['skip']:
                skip = response.request.meta['skip']
            else:
                skip = 0
            while skip < 201:
                url = 'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1'
                split_url = urlparse.urlsplit(url)
                path = split_url.path
                path.split('&')
                path.split('&')[:-5]
                '&'.join(path.split('&')[:-5])
                parsed_query = urlparse.parse_qs(split_url.query)
                query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
                query['skip'] = skip
                updated = split_url._replace(
                    path='&'.join(base_path.split('&')[:-5] + ['limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1', '']),
                    query=urllib.urlencode(query, doseq=True))
                updated_url = urlparse.urlunsplit(updated)
                yield scrapy.Request(url=updated_url, callback=self.parse_id, meta={'skip': skip})

    def parse_id(self, response):
        skip = response.request.meta['skip']
        data = json.loads(response.body)
        category = data.get('data').get('category').get('name')
        arts_podcasts = data.get('data').get('list')
        for arts_podcast in arts_podcasts:
            yield scrapy.Request(url='https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={0}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'.format(arts_podcast.get('list')[2].get('cid')), meta={'category': category, 'skip': skip}, callback=self.parse)

    def parse(self, response):
        skip = response.request.meta['skip']
        category = response.request.meta['category']
        arts_podcast = json.loads(response.body).get('data')
        yield scrapy.Request(callback=self.start_requests, meta={'skip': skip + 1})
        yield {
            'title': arts_podcast.get('title'),
            'category': arts_podcast.get('category'),
            'sub_category': arts_podcast.get('categories'),
            'subscribers': arts_podcast.get('sub_count'),
            'plays': arts_podcast.get('play_count'),
            'comments': arts_podcast.get('comment_count'),
            'episodes': arts_podcast.get('episode_count'),
            'website': arts_podcast.get('website'),
            'author': arts_podcast.get('author'),
            'description': arts_podcast.get('description'),
            'language': arts_podcast.get('language')
        }
Thanks!
--- EDIT ---
This is part of the log I got after running the spider, @Patrick Klein:
2020-11-14 15:51:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1> (referer: None)
2020-11-14 15:51:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=0&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1> (referer: None)
Traceback (most recent call last):
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\shima\projects\castbox_arts_podcasts\castbox_arts_podcasts\spiders\arts_podcasts.py", line 27, in parse_id
    url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={arts_podcast.get("list")[2].get("cid")}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'
TypeError: 'NoneType' object is not subscriptable
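This TypeError indicates that arts_podcast.get("list") returned None: judging from the JSON sample further below, each element of data['list'] is a channel object that carries its cid at the top level rather than under a nested 'list' key. A guarded lookup, as a sketch of how that line could avoid the crash:

# inside parse_id's loop; each channel dict exposes 'cid' at the top level
cid = arts_podcast.get('cid')
if cid is not None:
    url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={cid}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'
    yield scrapy.Request(url=url, callback=self.parse)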
--- EDIT 2 ---
2020-11-15 13:14:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=2583691&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1> (referer: https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=8&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1)
2020-11-15 13:14:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=2946683&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1>
{'sub_category': None, 'title': None, 'subscribers': None, 'plays': None, 'comments': None, 'episodes': None, 'downloads': None, 'website': None, 'author': None, 'description': None, 'language': None}
2020-11-15 13:14:47 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2020-11-15 13:14:47 [scrapy.core.downloader.handlers.http11] WARNING: Got data loss in https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=12&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False -- This message won't be shown in further requests
2020-11-15 13:14:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip=12&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>, <twisted.python.failure.Failure twisted.web.http._DataLoss: Chunked decoder in 'CHUNK_LENGTH' state, still expecting more data to get to 'FINISHED' state.>]
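As that warning itself suggests, truncated responses can be processed instead of retried by setting one option in the project's settings.py; this is only advisable if partially downloaded JSON is acceptable:

# settings.py, as suggested by the data-loss warning above
DOWNLOAD_FAIL_ON_DATALOSS = False  # keep responses that lost data mid-transfer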
Part of the JSON object for one of the items to be scraped:
{
  "msg": "OK",
  "code": 0,
  "data": {
    "category": {
      "sub_categories": [
        {
          "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "id": "10022",
          "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "name": "Books"
        },
        {
          "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "id": "10023",
          "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "name": "Design"
        },
        {
          "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "id": "10024",
          "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "name": "Fashion & Beauty"
        },
        {
          "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "id": "10025",
          "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "name": "Food"
        },
        {
          "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "id": "10026",
          "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "name": "Performing Arts"
        },
        {
          "image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "id": "10027",
          "night_image_url": "https://castbox.fm/static/everest/category/v3/grey/default.png",
          "name": "Visual Arts"
        }
      ],
      "id": "10021",
      "name": "Arts"
    },
    "list": [
      {
        "provider_id": 125443881,
        "episode_count": 256,
        "x_play_base": 0,
        "stat_cover_ext_color": false,
        "keywords": [
          "Arts",
          "Literature",
          "TV & Film",
          "Society & Culture",
          "freshair",
          "npr",
          "terrygross",
          "news",
          "facts",
          "interesting",
          "worldwide",
          "international",
          "best",
          "awardwinning",
          "jay z"
        ],
        "cover_ext_color": "-8610134",
        "mongo_id": "5e74365585a4e5dcff18d769",
        "show_id": "56a0a3399eb9a8dd9758c9c2",
        "copyright": "Copyright 2015-2019 NPR - For Personal Use Only",
        "author": "NPR",
        "is_key_channel": true,
        "audiobook_categories": [],
        "comment_count": 29,
        "website": "http://www.npr.org/programs/fresh-air/",
        "rss_url": "https://feeds.npr.org/381444908/podcast.xml",
        "description": "Fresh Air from WHYY, the Peabody Award-winning weekday magazine of contemporary arts and issues, is one of public radio's most popular programs. Hosted by Terry Gross, the show features intimate conversations with today's biggest luminaries.",
        "tags": [
          "from-itunes"
        ],
        "editable": true,
        "play_count": 8890966,
        "link": "http://www.npr.org/programs/fresh-air/",
        "twitter_names": [
          "nprfreshair"
        ],
        "categories": [
          10021,
          10022,
          10125,
          10001,
          10101,
          10014,
          10015
        ],
        "x_subs_base": 25254,
        "small_cover_url": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/76/32/0c/76320cb7-7805-5ffc-6d48-18b311dd9be8/mza_18321298089187816075.jpg/200x200bb.jpg",
        "big_cover_url": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/76/32/0c/76320cb7-7805-5ffc-6d48-18b311dd9be8/mza_18321298089187816075.jpg/600x600bb.jpg",
        "language": "en",
        "cid": 2698788,
        "latest_eid": 326888897,
        "topic_tags": [
          "FreshAir",
          "NPR"
        ],
        "release_date": "2020-11-14T05:01:15Z",
        "title": "Fresh Air",
        "uri": "/ch/2698788",
        "https_cover_url": "https://is5-ssl.mzstatic.com/image/thumb/Podcasts113/v4/76/32/0c/76320cb7-7805-5ffc-6d48-18b311dd9be8/mza_18321298089187816075.jpg/400x400bb.jpg",
        "channel_type": "private",
        "channel_id": "47b5be27cc1ca68aa80f8f7bbccedb47a40992d3",
        "sub_count": 361101,
        "internal_product_id": "cb.ch.2698788",
        "social": {
          "website": "http://www.npr.org/programs/fresh-air/",
          "youtube": [
            {
              "name": "channel/UCwly5-E5e0EUY-SsnttN4Sg"
            }
          ],
          "twitter": [
            {
              "name": "nprfreshair"
            }
          ],
          "facebook": [
            {
              "name": "freshairwithterrygross"
            }
          ],
          "instagram": [
            {
              "name": "nprfreshair"
            }
          ]
        }
      }
Solution
I noticed that you are passing category and skip to your parse functions, but you never actually use them in your spider. There are also quite a few unused and probably unnecessary imports. In addition, your parse_id method uses almost the same URL as your start_requests method.
I have rewritten your spider into something that I believe is close to what you are trying to achieve, with a few differences.
import scrapy
import json


class ArtsPodcastsSpider(scrapy.Spider):
    name = 'arts_podcasts'

    def start_requests(self):
        # Issue one request per "skip" value from 0 to 200.
        for skip in range(201):
            url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={skip}&limit=60&web=1&m=20201112&n=609584ea96edb64605bca96212128aa5&r=1'
            yield scrapy.Request(
                url=url,
                callback=self.parse_id,
            )

    def parse_id(self, response):
        data = json.loads(response.body)
        arts_podcasts = data.get('data').get('list')
        for arts_podcast in arts_podcasts:
            url = f'https://everest.castbox.fm/data/top_channels/v2?category_id=10021&country=us&skip={arts_podcast["cid"]}&limit=60&web=1&m=20201111&n=609ba0097bb48d4b0778a927bdcf69f4&r=1'
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        arts_podcasts = json.loads(response.body).get('data')
        for arts_podcast in arts_podcasts['list']:
            yield {
                'title': arts_podcast.get('title'),
                'category': arts_podcast.get('category'),
                'sub_category': arts_podcast.get('categories'),
                'subscribers': arts_podcast.get('sub_count'),
                'plays': arts_podcast.get('play_count'),
                'comments': arts_podcast.get('comment_count'),
                'episodes': arts_podcast.get('episode_count'),
                'website': arts_podcast.get('website'),
                'author': arts_podcast.get('author'),
                'description': arts_podcast.get('description'),
                'language': arts_podcast.get('language')
            }
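To run it, you can use scrapy crawl arts_podcasts from the project root, or start it from a small standalone script. A sketch of the latter, using Scrapy's FEEDS export setting (available since Scrapy 2.1); the module path below is a guess taken from the traceback earlier in this post:

# run_spider.py, a hypothetical standalone runner, equivalent to:
#   scrapy crawl arts_podcasts -o arts_podcasts.json
from scrapy.crawler import CrawlerProcess

from castbox_arts_podcasts.spiders.arts_podcasts import ArtsPodcastsSpider

process = CrawlerProcess(settings={
    # export every yielded item to a JSON file
    'FEEDS': {'arts_podcasts.json': {'format': 'json'}},
})
process.crawl(ArtsPodcastsSpider)
process.start()  # blocks until crawling is finished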