python - How do I stop scrapy from running the same spider twice?
Question
I'm following the documentation for running a spider from a script, but for some reason the spider runs again after it finishes crawling. I tried adding stop_after_crawl and the stop() function, but neither worked. On the second run attempt it also gives me the error below:

twisted.internet.error.ReactorNotRestartable

Any help is appreciated, thanks!
Code
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class DocSpider(scrapy.Spider):
    """
    This is the broad scraper, the name is doc_spider and can be invoked by making an object
    of the CrawlerProcess() then calling the class of the Spider. It scrapes websites csv file
    for the content and returns the results as a .json file.
    """
    #Name of Spider
    name = 'doc_spider'
    #File of the URL list here
    urlsList = pd.read_csv('B:\docubot\DocuBots\Model\Data\linksToScrape.csv')
    urls = []
    #Take the urls and insert them into a url list
    for url in urlsList['urls']:
        urls.append(url)
    #Scrape through all the websites in the urls list
    start_urls = urls

    #This method will parse the results and will be called automatically
    def parse(self, response):
        data = {}
        #Iterates through all <p> tags
        for content in response.xpath('/html//body//div[@class]//div[@class]//p'):
            if content:
                #Append the current url
                data['links'] = response.request.url
                #Append the texts within the <p> tags
                data['texts'] = " ".join(content.xpath('//p/text()').extract())
                yield data

    def run_crawler(self):
        settings = get_project_settings()
        settings.set('FEED_FORMAT', 'json')
        settings.set('FEED_URI', 'scrape_results.json')
        c = CrawlerProcess(settings)
        c.crawl(DocSpider)
        c.start(stop_after_crawl=True)

D = DocSpider()
D.run_crawler()
Error (terminal output)
Traceback (most recent call last):
  File "web_scraper.py", line 52, in <module>
    D.run_crawler()
  File "web_scraper.py", line 48, in run_crawler
    c.start(stop_after_crawl=True)
  File "B:\Python\lib\site-packages\scrapy\crawler.py", line 312, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Solution
You need to move run_crawler() out of the class DocSpider:
class DocSpider(scrapy.Spider):
    .....

def run_crawler():
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')
    settings.set('FEED_URI', 'scrape_results.json')
    c = CrawlerProcess(settings)
    c.crawl(DocSpider)
    c.start(stop_after_crawl=True)

run_crawler()
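The traceback also shows the underlying constraint: CrawlerProcess.start() runs Twisted's reactor, and the reactor refuses to be started a second time in the same process. The following is only a toy model in plain Python (not Scrapy or Twisted code) to illustrate that one-shot behavior:

```python
# Toy model of Twisted's one-shot reactor. This is NOT Scrapy/Twisted code;
# it just mimics why a second c.start() raises
# twisted.internet.error.ReactorNotRestartable.
class ToyReactor:
    def __init__(self):
        self._has_run = False

    def run(self):
        # Twisted's ReactorBase.startRunning() raises ReactorNotRestartable
        # once the reactor has already been started.
        if self._has_run:
            raise RuntimeError("ReactorNotRestartable")
        self._has_run = True
        return "crawl finished"

reactor = ToyReactor()
print(reactor.run())  # first start works
try:
    reactor.run()     # second start fails, like starting the crawler twice
except RuntimeError as err:
    print(err)
```

So if you do need to run several crawls, schedule each one with c.crawl(...) before the single c.start() call, rather than calling start() again.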