flask - Why does scrapy crawler only work once in flask app?
Problem description
I am currently working on a Flask app. The app takes a URL from the user, crawls that website, and returns the links found on it. This is what my code looks like:
from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse
from uuid import uuid4
import smtplib, urllib3, requests, urllib.parse, datetime, sys, os

app = Flask(__name__)
executor = Executor(app)
http = urllib3.PoolManager()
process = CrawlerProcess()

list = set([])
list_validate = set([])
list_final = set([])

@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)

        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start():
                    process.crawl(Crawler)
                    process.start()
                    for link in list_validate:
                        error = http.request("GET", link)
                        if error.status == 200:
                            list_final.add(link)
                    original_stdout = sys.stdout
                    with open('templates/file.txt', 'w') as f:
                        sys.stdout = f
                        for link in list_final:
                            print(link)
                    sys.stdout = original_stdout  # restore stdout once the file is written

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start)
                return redirect(url_for('crawling', id=unique_id))
        except requests.exceptions.RequestException:
            # URL is invalid or unreachable: show the form again
            return render_template('index.html')
    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')
In my start-crawl.html, I have this:
{% if refresh %}
<meta http-equiv="refresh" content="5">
{% endif %}
This code takes a URL from the user, validates it, and, if it is a working URL, starts crawling and takes the user to the start-crawl.html page. That page refreshes every 5 seconds until the crawling is complete, and when the crawl finishes it renders finish-crawl.html. In finish-crawl.html, the user can download a file that has the output (not included here because it isn't necessary).
Everything works as expected. My problem is that once I have crawled a website, the crawl finishes, and I am at finish-crawl.html, I can't crawl another website. If I go back to the home page and enter another URL, it validates the URL and then goes directly to finish-crawl.html. I think this happens because Scrapy can only be run once per process: the Twisted reactor isn't restartable, and restarting it is effectively what I am trying to do here. So does anyone know what I can do to fix this? Please ignore the complexity of the code and anything that isn't considered "a programming convention".
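The diagnosis in the last paragraph can be reproduced outside Flask: CrawlerProcess.start() runs the Twisted reactor, and a reactor that has finished cannot be started again in the same process. A minimal sketch of the failure (the spider here is a placeholder, not the one from the app):

import scrapy
from scrapy.crawler import CrawlerProcess

class DemoSpider(scrapy.Spider):
    # placeholder spider, just to have something to crawl
    name = "demo"
    start_urls = ["https://example.com/"]

process = CrawlerProcess()
process.crawl(DemoSpider)
process.start()  # runs the Twisted reactor and blocks until the crawl finishes

process.crawl(DemoSpider)
process.start()  # raises twisted.internet.error.ReactorNotRestartable

This is the pattern the app falls into on the second form submission: the module-level process has already started and stopped the reactor once, so the second crawl's future fails immediately and the polling page jumps straight to finish-crawl.html.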
Solution
Scrapy recommends the use of CrawlerRunner instead of CrawlerProcess:
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = "myspider"
    # Spider definition goes here

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(e):
    print("finished")

def spider_error(e):
    print("spider error :/")

d.addCallback(finished)
d.addErrback(spider_error)

reactor.run()
More information about the reactor is available here: ReactorBasic.
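Note that CrawlerRunner on its own does not make the Flask app restartable either: reactor.run() still blocks, and the reactor still cannot be restarted once it stops. A common way to combine the two (not part of the answer above) is to start the reactor once in a background thread with the crochet package and schedule every crawl on it. A minimal sketch, assuming crochet is installed and a spider like the Crawler class from the question; schedule_crawl is an illustrative name, not an existing API:

from crochet import setup, run_in_reactor
setup()  # start the Twisted reactor once, in a background thread, at import time

import scrapy
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner()

class Crawler(scrapy.Spider):
    name = "crawler"
    # spider definition as in the question

@run_in_reactor
def schedule_crawl(start_url):
    # runner.crawl() returns a Deferred; @run_in_reactor schedules it on the
    # reactor thread and hands the caller an EventualResult to poll or wait on
    return runner.crawl(Crawler, start_urls=[start_url])

Each POST request can then call schedule_crawl(url) and poll the returned EventualResult instead of a flask_executor future. Because the reactor keeps running between requests, the second and subsequent crawls no longer hit the restart problem.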