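python - How do I build a spider around a list of URLs in Scrapy?
Problem description
I am trying to scrape data from Reddit with a spider. I want the spider to iterate over every URL in my list (stored in a file named reddit.txt) and collect data, but I get an error in which the entire URL list is treated as a single start URL. Here is my code: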
import scrapy
import time

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ['www.reddit.com']
    custom_settings={ 'FEED_URI': "reddit_comments.csv", 'FEED_FORMAT': 'csv'}
    with open('reddit.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        for URL in response.css('html'):
            data = {}
            data['body'] = URL.css(r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] p::text").extract()
            data['name'] = URL.css(r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] a::text").extract()
            time.sleep(5)
            yield data
Here is my output:
scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme
2020-07-26 00:51:34 [scrapy.core.scraper] ERROR: Error downloading <GET ['http://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/',%20'http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/',%20'http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/' ...
Part of my list: ['http://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/', 'http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/', 'http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/', 'http://www.reddit.com/r/electricvehicles/comments/1fap6p/any_good_subreddits_for_ev_conversions/', 'http://www.reddit.com/r/electricvehicles/comments/1h9o9t/buying_a_motor_for_an_ev/', 'http://www.reddit.com/r/electricvehicles/comments/1iwbp7/is_there_any_law_governing_whether_a_parking/', 'http://www.reddit.com/r/electricvehicles/comments/1j0bkv/electric_engine_regenerative_braking/',...]
Any help with my problem would be greatly appreciated. Thank you!
Solution
You can open the URL file inside the start_requests method and pass your parse method as the callback of each request.
import time

import scrapy


class RedditSpider(scrapy.Spider):
    name = "reddit"
    allowed_domains = ['www.reddit.com']
    custom_settings = {'FEED_URI': "reddit_comments.csv", 'FEED_FORMAT': 'csv'}

    def start_requests(self):
        # Read the file line by line and issue one request per URL.
        with open('reddit.txt') as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue  # skip blank lines
                # We need to check this has the http prefix or we get a Missing scheme error
                if not url.startswith(('http://', 'https://')):
                    url = 'https://' + url
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for URL in response.css('html'):
            data = {}
            data['body'] = URL.css(
                r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] p::text").extract()
            data['name'] = URL.css(
                r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] a::text").extract()
            time.sleep(5)  # crude pause; blocks the crawler while it sleeps
            yield data
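The time.sleep(5) call in parse pauses the whole crawler, because Scrapy runs callbacks on a single event loop. As a hedged alternative, the sketch below adds Scrapy's DOWNLOAD_DELAY setting to the same custom_settings dict. The value 5 simply mirrors the sleep above and is an assumption about the intended pacing; with Scrapy's default delay randomization the actual wait varies around that figure.

```python
# Sketch of a replacement for the spider's custom_settings attribute.
# Assumption: a delay of about 5 seconds between requests is what the
# time.sleep(5) call was meant to achieve.
custom_settings = {
    "FEED_URI": "reddit_comments.csv",
    "FEED_FORMAT": "csv",
    # Let Scrapy space out requests itself instead of sleeping in parse().
    "DOWNLOAD_DELAY": 5,
}
```

With this dict in place of the one in the spider, the time.sleep(5) line in parse can be dropped.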
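The log in the question shows the whole list, brackets and quotes included, being requested as a single URL, which suggests reddit.txt holds a Python-style list on one line. Make sure the file instead contains one URL per line, with no brackets, quotes, or commas: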
https://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/
http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/
http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/
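If reddit.txt currently holds the list in the bracketed form shown in the question, a small one-off script can convert it to the one-URL-per-line layout. The following is a minimal sketch, not part of the original answer: the helper names extract_urls and rewrite_one_per_line and the regular expression are hypothetical, and the pattern assumes the URLs themselves contain no quotes, commas, or square brackets.

```python
import re

# Matches an http/https URL up to the first whitespace, quote, comma, or bracket.
URL_PATTERN = re.compile(r"https?://[^\s'\",\[\]]+")


def extract_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


def rewrite_one_per_line(src_path, dst_path):
    """Read src_path in any layout and write its URLs to dst_path, one per line."""
    with open(src_path, encoding="utf-8") as src:
        urls = extract_urls(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(url + "\n" for url in urls)
    return urls
```

Calling rewrite_one_per_line('reddit.txt', 'reddit_clean.txt') (the output file name is an assumption) would produce a file the spider can read line by line.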