Scrapy does not download all images from all pages

Problem description

I am using Scrapy to crawl all the images from 102 pages under this base url: https://www.lazada.vn/dien-thoai-di-dong/. I set a 60-second delay before sending the request for the next page, because the domain blocks my crawl when Scrapy sends too many requests at once. In the process log, I see many image-download notification lines for the first 2 pages:

...
...
2019-12-08 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lazada.vn/robots.txt> (referer: None)
2019-12-08 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lazada.vn/dien-thoai-di-dong/?page=1> (referer: None)
2019-12-08 12:33:21 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/6d4a70571986291280d27d655f43c33b.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ef903a6e40fac5cffde2fac25e9a695c.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/5292c25961bf9109d5896bc56f06f1eb.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/0e3f0321a5b12183d1caec077c5cddf7.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/c13858cf8aebf3a4474d07ca84100aca.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ac8dfba90e44a4db294ab1ea95d6ec6f.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/9b72137283d02a2c76ecc1a06f78ef5d.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/7d84fca1f4e4a423a6f7ecca1b462c65.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/95c8d5c76b9edc0c13168ee52ddb55d2.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/16021992d9b9ffb1d31bd4ed967cfda5.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/97d51ca35fe5953903b2c53913dc6204.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/69a75548dcb1e779a2c9b183a467c9b1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/5580ad66d41ce6eaa91be9113d8e49d1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/781c435a4e5d54ac0f0bd196cab6329b.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/90af16a3a5318aa38ac470ac0e78b4e1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/7afcecd58b2ba746ce0bc360e78304fb.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/b6a6b9d9c1dca7eb4e79071ab5e04dfb.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/fd82fee9d2ec165e1c2bd5946d745660.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ffdd4cdc5b1580c0426c341f0e54c04a.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/65a9ab76d5192a49a90222a7fdbad59f.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/1ebb07247431af734a0f956d9124a2a1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/098a6071eeddc4d526ff310c8f4edbe3.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/40756a8648be2dbb416890f4f74fda3e.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/8b4b463d6c90b902858606b5978a96ff.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/80dee15ec45cfad725976c5947bf237d.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/5326c3132c9c11559fc75fe9ae9e2b63.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/cebe568afdcbad9f3719d1751a9b1117.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ce771129fe8859a4609e796d51dc56aa.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/b51e6fcf692e5316cacde25913b86e89.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/6231fb489a949f6a7bf882ad8e85965b.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/3d80fa52c934a3999c8837402f852419.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/7016e51afc586d8725fec94f481f89a6.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/2e60c3233a708fa49c06964ed88792ba.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/2a347b03642f5e53c90fa03cfe8af63e.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/14adae58e3f65f087b6034eb165a1f20.jpg> referred in <None>
...
...

But from page 3 onward, I no longer see any of those lines:

...
...
2019-12-08 12:34:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lazada.vn/dien-thoai-di-dong/?page=3> (referer: None)
2019-12-08 12:35:22 [scrapy.extensions.logstats] INFO: Crawled 13 pages (at 6 pages/min), scraped 80 items (at 40 items/min)
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg'],
 'images': [{'checksum': 'dca1d5a23d29d3d1a854d35ff578e3f4',
             'path': 'full/e3bea7d5eb5bc56158e4e69f0312877c96e5ac6f.jpg',
             'url': 'https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg'}],
 'price': '1290000.00',
 'title': 'Điện thoại oppo a37 neo9 fullbox ram2 bộ nhớ 16gb'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg'],
 'images': [{'checksum': '97f7bf79aed3e1013510e30673a35ee8',
             'path': 'full/e0b37ed8016f08a3aeb08cbf22c4096a1bf37fca.jpg',
             'url': 'https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg'}],
 'price': '2890000.00',
 'title': 'Điện thoại oppo f9 fullbox ram4 bộ nhớ 64gb Liên quân pubg chiến '
          'mượt'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/9b72137283d02a2c76ecc1a06f78ef5d.jpg'],
 'images': [{'checksum': '9b3990378a8b5152969369fcba144271',
             'path': 'full/e2d7a2a6a18b6791a8ed1d7fe8d9d35713f07d76.jpg',
             'url': 'https://vn-test-11.slatic.net/p/9b72137283d02a2c76ecc1a06f78ef5d.jpg'}],
 'price': '3390000.00',
 'title': 'Điện thoại IPH0NE_8_PLUS Hàng fullbox 256GB, tặng Tai nghe '
          'Bluetooth, Xả khó giá cực sốc'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/0951417571ed48aafd9e0b0108a42cb4.jpg'],
 'images': [{'checksum': 'f88bd5e92730b682b5d1925bbae3be4d',
             'path': 'full/4737babbe6e3238252c00a882e8bf9ab6529658e.jpg',
             'url': 'https://vn-test-11.slatic.net/p/0951417571ed48aafd9e0b0108a42cb4.jpg'}],
 'price': '1850000.00',
 'title': 'ĐIện_Thoại_IPHONE7_PLUS_256GB'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/55a39333530c4045e10b4589be0ad36a.jpg'],
 'images': [{'checksum': '0423081571336a25c01c8aa0d99c0458',
             'path': 'full/532b38fb6a4c95a55b0892e8fe76bae8ff031d26.jpg',
             'url': 'https://vn-test-11.slatic.net/p/55a39333530c4045e10b4589be0ad36a.jpg'}],
 'price': '2749900.00',
 'title': 'ĐIện_Thoại_IPHONEXS_MAX_512GB'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/a1162694a5b65056d8b2fff54d2fd7b7.jpg'],
 'images': [{'checksum': '78d8042b5b80d3f713846fab71ac583d',
             'path': 'full/0a8bd03ec3f5cb2c96439c99c503a9e40aa8afb3.jpg',
             'url': 'https://vn-test-11.slatic.net/p/a1162694a5b65056d8b2fff54d2fd7b7.jpg'}],
 'price': '1950000.00',
 'title': 'ĐIện_Thoại_IPHONE8_PLUS_256GB'}
...
...

At the end of the run, the final log shows that Scrapy crawled 102 pages containing about 4000 images, but it only downloaded 153 of them:

2019-12-08 11:34:21 [scrapy.extensions.logstats] INFO: Crawled 259 pages (at 1 pages/min), scraped 4000 items (at 0 items/min)
2019-12-08 11:34:21 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-08 11:34:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 70175,
 'downloader/request_count': 259,
 'downloader/request_method_count/GET': 259,
 'downloader/response_bytes': 27542657,
 'downloader/response_count': 259,
 'downloader/response_status_count/200': 259,
 'elapsed_time_seconds': 6292.681676,
 'file_count': 153,
 'file_status_count/downloaded': 153,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 8, 11, 34, 21, 729083),
 'item_scraped_count': 4000,
 'log_count/DEBUG': 4412,
 'log_count/INFO': 114,
 'memusage/max': 128307200,
 'memusage/startup': 55721984,
 'response_received_count': 259,
 'robotstxt/request_count': 4,
 'robotstxt/response_count': 4,
 'robotstxt/response_status_count/200': 4,
 'scheduler/dequeued': 102,
 'scheduler/dequeued/memory': 102,
 'scheduler/enqueued': 102,
 'scheduler/enqueued/memory': 102,
 'start_time': datetime.datetime(2019, 12, 8, 9, 49, 29, 47407)}
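One thing worth checking is how many distinct image URLs those 4000 items actually reference: Scrapy's media pipelines deduplicate downloads by URL within a run, and the `File (uptodate)` lines mean the file already existed in `IMAGES_STORE` from an earlier download and was not fetched again. A minimal sketch for counting unique URLs, assuming the items were exported to a JSON Lines feed (the feed filename is hypothetical):

```python
import json

def count_unique_image_urls(feed_path):
    """Count total items and distinct image URLs in a JSON Lines feed,
    e.g. one produced with `scrapy crawl lazada -o items.jl`."""
    unique_urls = set()
    total_items = 0
    with open(feed_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total_items += 1
            unique_urls.update(item.get("image_urls", []))
    return total_items, len(unique_urls)
```

If the unique-URL count comes out near 153 rather than 4000, the pipeline behaved correctly and the listing pages simply repeat the same product images.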

Here is my code:

Spider

import scrapy
import re
import json
import time  # required by the time.sleep() call in start_requests
from scrapy_lazada_test.items import ScrapyLazadaTestItem

class LazadaSpider(scrapy.Spider):
    name = "lazada"
    allowed_domains = ['lazada.vn']

    def start_requests(self):
        max_page_number = 102
        base_url = 'https://www.lazada.vn/dien-thoai-di-dong/'
        for i in range(1, max_page_number + 1):
            url = base_url + '?page=' + str(i)
            #delay before sending request to move to next page
            time.sleep(60)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        result = response.xpath('//html/body/script[@type="application/ld+json"][2]').re(r'(?<=itemListElement":)(.*?)(\}\<\/script>)')
        products = json.loads(result[0])

        for p in products:
            item = ScrapyLazadaTestItem()
            item["image_urls"] = [p["image"]]
            item["title"] = p["name"]
            item["price"] = p["offers"]["price"]
            yield item

Items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyLazadaTestItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    images = scrapy.Field()
    image_urls = scrapy.Field()

Settings

BOT_NAME = 'scrapy_lazada_test'

SPIDER_MODULES = ['scrapy_lazada_test.spiders']
NEWSPIDER_MODULE = 'scrapy_lazada_test.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_lazada_test (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "/home/mmlab/scrapy_lazada_test/result/"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 15
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
#CONCURRENT_REQUESTS_PER_IP = 16
...
...
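For polite crawling, an alternative to a fixed manual delay is Scrapy's AutoThrottle extension, which adapts the delay to the server's observed latency. A possible settings fragment (the concrete values are illustrative, not taken from the original project):

```python
# AutoThrottle adjusts the download delay dynamically based on
# server response times instead of using one fixed value.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound on the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request in flight
```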

I tried setting CONCURRENT_REQUESTS = 1 and CONCURRENT_REQUESTS_PER_DOMAIN = 1, but it still behaves the same as before. How can I fix this?

Tags: python, scrapy, web-crawler, scrape
