How can my web crawler (Python, Scrapy, Scrapy-splash) crawl faster?

Problem description

Development environment:

Server specs:

Hello.

I am a PHP developer, and this is my first Python project. I chose Python because I have heard it offers many advantages for web crawling.

I am crawling a dynamic website, and I need to fetch about 3,500 pages every 5-15 seconds. At the moment it is far too slow: it only crawls about 200 pages per minute.

My source looks like this:

Main file (main.py)

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.bot1 import Bot1Spider
from spiders.bot2 import Bot2Spider
from spiders.bot3 import Bot3Spider
from spiders.bot4 import Bot4Spider
from pprint import pprint


process = CrawlerProcess(get_project_settings())
process.crawl(Bot1Spider)
process.crawl(Bot2Spider)
process.crawl(Bot3Spider)
process.crawl(Bot4Spider)
process.start()

bot1.py

import scrapy
import datetime
import math

from scrapy_splash import SplashRequest
from pymongo import MongoClient
from pprint import pprint


class Bot1Spider(scrapy.Spider):
    name = 'bot1'
    client = MongoClient('localhost', 27017)
    db = client.db
    domain = 'https://example.com'  # placeholder: the real base URL is defined elsewhere in my project

    def start_requests(self):
        count = int(self.db.games.find().count())
        num = math.floor(count * 0.25)
        start_urls = self.db.games.find().limit(num - 1)

        for url in start_urls:
            full_url = self.domain + list(url.values())[5]
            yield SplashRequest(full_url, self.parse, args={'wait': 0.1},
                                meta={'oid': list(url.values())[0]})

    def parse(self, response):
        pass
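One detail worth flagging in the spider above: every bot computes `num = floor(count * 0.25)` and then calls `find().limit(num - 1)` with no `skip()`, so all four spiders fetch the same first quarter of the collection instead of four distinct quarters. A minimal sketch of a partition helper (the function name and shape are my own, not from the original project) that gives each spider a distinct `(skip, limit)` window:

```python
import math

def partition(total, num_parts, part_index):
    """Return a (skip, limit) window so that spider `part_index`
    (0-based) of `num_parts` covers a distinct slice of `total` docs."""
    size = math.ceil(total / num_parts)       # documents per spider
    skip = part_index * size                  # where this spider starts
    limit = max(0, min(size, total - skip))   # clamp the final slice
    return skip, limit

# Example: splitting 3500 documents across 4 spiders.
print(partition(3500, 4, 0))  # (0, 875)
print(partition(3500, 4, 3))  # (2625, 875)
```

Each spider would then query something like `self.db.games.find().skip(skip).limit(limit)` so the four bots together cover the whole collection exactly once.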

settings.py

BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'


# Scrapy Configuration

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'my-project-name (www.my.domain)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
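A note on the settings above (my own observation, not part of the original post): if all 3,500 pages live on one target site, the per-domain limit of 16 is the effective concurrency ceiling, not `CONCURRENT_REQUESTS = 64`, since Scrapy's default slot assignment (and scrapy-splash's default `slot_policy`) groups requests by the original domain. A sketch of adjusted values, assuming a single target site:

```python
# Sketch (an assumption, not a guaranteed fix): with one target domain,
# the per-domain limit caps total throughput, so raise it to match the
# global limit.
CONCURRENT_REQUESTS = 64
CONCURRENT_REQUESTS_PER_DOMAIN = 64

# Splash itself renders in a limited number of slots (its --slots server
# option), so Scrapy-side concurrency only helps up to that limit.
```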

When executing this code, I use the command: python main.py

Now that you have seen my code, please help me. I am happy to hear any advice.

1. How can my spiders be faster? I have tried using threading, but it does not seem to work properly.

2. What is the best-performing setup for a web crawler?

3. Is it possible to crawl 3,500 dynamic pages every 5-15 seconds?

Thank you.

Tags: mongodb, scrapy, web-crawler, python-3.7, scrapy-splash

Solution
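Before tuning anything, a back-of-the-envelope check on question 3 helps (this arithmetic is mine, not from the post): what request rate does the target actually require, versus the rate reported?

```python
# Required rate: 3,500 pages every 15 s (the generous end of the window),
# versus the reported 200 pages per minute.
pages = 3500
window_s = 15
required_rps = pages / window_s    # requests per second needed
current_rps = 200 / 60             # current rate, converted to per-second

print(round(required_rps, 1))  # 233.3
print(round(current_rps, 1))   # 3.3
```

That is roughly a 70x gap, which is why tuning a single Scrapy process plus one Splash instance is unlikely to be enough on its own; the usual directions are scaling out rendering (several Splash instances behind a load balancer) or, better, finding the site's underlying JSON/XHR endpoints so most pages need no browser rendering at all.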
