Scrapy FilesPipeline: download again if the same file_urls are loaded

Problem description

I am new to Scrapy.

I use FilesPipeline to download some .pdf files.

I found that if the file_urls value of a scrapy.Item is the same as before, the download does not start again.
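
A minimal sketch of the kind of setup this describes might look like the following; the item, spider name, and URLs are placeholders, not my actual code:

    import scrapy

    class PdfItem(scrapy.Item):
        # FilesPipeline reads download URLs from 'file_urls'
        # and writes the download results into 'files'
        file_urls = scrapy.Field()
        files = scrapy.Field()

    class PdfSpider(scrapy.Spider):
        name = 'pdf_spider'  # placeholder name

        def start_requests(self):
            yield scrapy.Request('https://example.com/docs')  # placeholder URL

        def parse(self, response):
            # yielding an item whose file_urls were already downloaded
            # in an earlier run does NOT trigger a fresh download
            yield PdfItem(file_urls=['https://example.com/a.pdf'])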

What I need is to download the files again.

How can I solve this?

Thanks.

Tags: python-3.x, scrapy

Solution


Add the _onsuccess function to your pipeline to override it; it lives inside the media_to_download method, so copy that whole method from FilesPipeline.

It looks like this:

    def media_to_download(self, request, info, *, item=None):
        def _onsuccess(result):
            if not result:
                return  # returning None forces the download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None forces the download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None forces the download

            referer = referer_str(request)
            logger.debug(
                'File (uptodate): Downloaded %(medianame)s from %(request)s '
                'referred in <%(referer)s>',
                {'medianame': self.MEDIA_NAME, 'request': request,
                 'referer': referer},
                extra={'spider': info.spider}
            )
    # ... AND MORE CODE HERE THAT I DIDN'T COPY ...

Just add a return to skip the "uptodate" part and download again. Since _onsuccess then returns None on every path, FilesPipeline no longer reuses the stored copy:

    def media_to_download(self, request, info, *, item=None):
        def _onsuccess(result):
            if not result:
                return  # returning None forces the download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None forces the download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None forces the download

            return  # always returning None forces the download

(You could also override the function inside FilesPipeline itself, but I don't recommend doing that.)

Also, remember to activate your custom pipeline (see the settings sketch just below) and add whatever other functionality you need.
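
A minimal settings.py sketch for that, assuming the project module is called myproject and downloaded files should land in a local files/ directory:

    # settings.py -- module path and store location are assumptions
    ITEM_PIPELINES = {
        'myproject.pipelines.ProcessPipeline': 300,
    }
    FILES_STORE = 'files'  # FilesPipeline is disabled unless this is set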

Your pipelines.py file should now look like this:

    from itemadapter import ItemAdapter
    from scrapy.pipelines.files import FilesPipeline
    from scrapy.http import Request
    import logging
    import time
    from twisted.internet import defer
    from scrapy.utils.log import failure_to_exc_info

    logger = logging.getLogger(__name__)


    class TempPipeline:
        # pass-through pipeline: returns every item unchanged

        def process_item(self, item, spider):
            return item


    class ProcessPipeline(FilesPipeline):

        def get_media_requests(self, item, info):
            # one download request per URL in the item's file_urls field
            urls = ItemAdapter(item).get(self.files_urls_field, [])
            return [Request(u) for u in urls]

        def media_to_download(self, request, info, *, item=None):
            def _onsuccess(result):
                if not result:
                    return  # returning None forces the download

                last_modified = result.get('last_modified', None)
                if not last_modified:
                    return  # returning None forces the download

                age_seconds = time.time() - last_modified
                age_days = age_seconds / 60 / 60 / 24
                if age_days > self.expires:
                    return  # returning None forces the download

                return  # always returning None forces the download

            # stat the stored copy (if any); _onsuccess decides whether to reuse it
            path = self.file_path(request, info=info, item=item)
            dfd = defer.maybeDeferred(self.store.stat_file, path, info)
            dfd.addCallbacks(_onsuccess, lambda _: None)
            dfd.addErrback(
                lambda f:
                logger.error(self.__class__.__name__ + '.store.stat_file',
                             exc_info=failure_to_exc_info(f),
                             extra={'spider': info.spider})
            )
            return dfd
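
Note that with this override the FILES_EXPIRES setting effectively stops mattering: _onsuccess returns None on every path, so the files are downloaded again on each run even when an up-to-date copy already exists in FILES_STORE.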
