Scrapy: Unable to Get Data

Problem Description

I am trying to scrape www.zillow.com with Scrapy. I import addresses from a CSV file and search for each of them, but I get an error. Here is my code.

csv_read.py

import pandas as pd

def read_csv():
    # Read the report and return the site_address column as a list of strings
    df = pd.read_csv('report2.csv')
    return df['site_address'].values.tolist()
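
For reference, read_csv() assumes report2.csv has a site_address column. A hypothetical layout (the question does not show the file; these values are chosen to match the request URLs in the log below):

site_address
1421-Beechwood-Dr
7393-Frolic-Dr
303-Old-Farm-Rd

With a file like that, read_csv() returns ['1421-Beechwood-Dr', '7393-Frolic-Dr', '303-Old-Farm-Rd'], and start_requests() below turns each entry into a URL such as https://www.zillow.com/homes/1421-Beechwood-Dr_rb.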

zillow.py

import scrapy
from csv_automation.spiders.csv_read import read_csv

base_url = "https://www.zillow.com/homes/{}_rb"


class ZillowSpider(scrapy.Spider):
    name = 'zillow'

    def start_requests(self):
        for tag in read_csv():
            yield scrapy.Request(base_url.format(tag))

    def parse(self, response):
        yield {
            'Address': response.body(".(//h1[@id='ds-chip-property-address']/span)[1]/text()").get(),
            'zestimate': response.body(".(//span[@class='Text-c11n-8-38-0__aiai24-0 jtMauM'])[1]/text()").get(),
            'rent zestimate': response.body(".(//span[@class='Text-c11n-8-38-0__aiai24-0 jtMauM'])[2]/text()").get()
        }

settings.py

# Scrapy settings for csv_automation project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'csv_automation'

SPIDER_MODULES = ['csv_automation.spiders']
NEWSPIDER_MODULE = 'csv_automation.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 15
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'csv_automation.middlewares.CsvAutomationSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'csv_automation.middlewares.CsvAutomationDownloaderMiddleware': 543,
#}
DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'csv_automation.pipelines.CsvAutomationPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
PROXY_POOL_ENABLED = True
PROXY_POOL_BAN_POLICY = 'policy.policy.BanDetectionPolicyNotText'
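
The settings point PROXY_POOL_BAN_POLICY at a custom policy/policy.py module that the question does not show. As a rough sketch only, and assuming scrapy_proxy_pool's BanDetectionMiddleware calls the response_is_ban()/exception_is_ban() hooks described in its README, a "not text" ban policy might look like this (hypothetical reconstruction, not the asker's actual file):

# policy/policy.py -- hypothetical reconstruction
class BanDetectionPolicyNotText:
    NOT_BAN_STATUSES = {200, 301, 302}

    def response_is_ban(self, request, response):
        # Judge bans by status code only; skip body inspection so that
        # non-text (binary) responses are not misclassified as bans.
        return response.status not in self.NOT_BAN_STATUSES

    def exception_is_ban(self, request, exception):
        # Treat any network-level failure (timeout, tunnel error) as a ban.
        return True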

I tried to get the data from the page's JSON, but I got nothing. Maybe that is just because I am searching for one specific address. So I tried XPath instead.

My output:

 PS G:\Python_Practice\scrapy_practice\csv_automation> scrapy crawl zillow
2021-08-15 22:31:10 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: csv_automation)
2021-08-15 22:31:10 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 
bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-08-15 22:31:10 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-15 22:31:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'csv_automation',
 'DOWNLOAD_DELAY': 15,
 'NEWSPIDER_MODULE': 'csv_automation.spiders',
 'SPIDER_MODULES': ['csv_automation.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}      
2021-08-15 22:31:10 [scrapy.extensions.telnet] INFO: Telnet Password: 5004f35501fe1348
2021-08-15 22:31:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-08-15 22:31:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware',
 'scrapy_proxy_pool.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-15 22:31:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-15 22:31:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-08-15 22:31:12 [scrapy.core.engine] INFO: Spider opened
2021-08-15 22:31:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-15 22:31:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-15 22:31:12 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.us-proxy.org:443
2021-08-15 22:31:14 [urllib3.connectionpool] DEBUG: https://www.us-proxy.org:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET /anonymous-proxy.html HTTP/1.1" 200 None
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET /uk-proxy.html HTTP/1.1" 200 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.free-proxy-list.net:80
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: http://www.free-proxy-list.net:80 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.free-proxy-list.net:443
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: https://www.free-proxy-list.net:443 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): free-proxy-list.net:80
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: http://free-proxy-list.net:80 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:17 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.sslproxies.org:443
2021-08-15 22:31:17 [urllib3.connectionpool] DEBUG: https://www.sslproxies.org:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2021-08-15 22:31:18 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:18 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.proxy-daily.com:443
2021-08-15 22:31:19 [urllib3.connectionpool] DEBUG: https://www.proxy-daily.com:443 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:19 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): proxy-daily.com:443
2021-08-15 22:31:20 [urllib3.connectionpool] DEBUG: https://proxy-daily.com:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-JP Japanese prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: GB2312 Chinese prober hit error at byte 3654
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-KR Korean prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: CP949 Korean prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: Big5 Chinese prober hit error at byte 3654
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-TW Taiwan prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: utf-8  confidence = 0.87625
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: SHIFT_JIS Japanese confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-JP not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: GB2312 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-KR not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: CP949 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: Big5 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-TW not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1251 Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: KOI8-R Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-5 Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: MacCyrillic Russian confidence = 0.0
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: IBM866 Russian confidence = 0.0
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: IBM855 Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-7 Greek confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1253 Greek confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-5 Bulgarian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1251 Bulgarian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: TIS-620 Thai confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-9 Turkish confidence = 0.5252901105163297
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: utf-8  confidence = 0.87625
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: SHIFT_JIS Japanese confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-JP not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: GB2312 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-KR not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: CP949 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: Big5 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-TW not active
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.94:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://185.77.221.113:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.55:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://109.94.172.150:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://109.94.172.150:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.149.75:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://193.56.64.200:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.122.237:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://109.94.172.211:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://54.156.145.160:8080
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://185.77.220.189:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.151.241:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://94.231.216.42:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://193.56.64.179:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://213.166.78.52:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://5.181.2.102:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1421-Beechwood-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://204.87.183.21:3128
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/7393-Frolic-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.74:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.208.211.87:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/673-Hummingbird-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.123.116:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/303-Old-Farm-Rd_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://18.205.10.48:80
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/8430-Burket-Way_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://91.188.246.162:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/778-Courtney-Cir_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/283-Meadow-Glen-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://94.231.216.146:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://5.181.2.54:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2020-Bridlewood-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/6515-Springview-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/578-Shortleaf-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/493-Founders-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/481-Autumn-Springs-Ct_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.208.211.224:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.123.101:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://5.181.2.113:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://213.166.79.79:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.149.204:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1377-E-New-Rd_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://91.188.247.38:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1452-S-Highland-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://213.166.78.61:8085
2021-08-15 22:31:43 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/303-Old-Farm-Rd_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:31:43 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.208.211.87:8085
2021-08-15 22:32:01 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2051-Milburn-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:32:01 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://3.12.95.129:80
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/7393-Frolic-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://185.77.221.177:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1421-Beechwood-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.115:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.122.202:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/673-Hummingbird-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.120.43:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/8430-Burket-Way_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.46:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/283-Meadow-Glen-Dr_rb> with another proxy (failed 2 times, max retries: 5) 
Traceback (most recent call last):
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "G:\Python_Practice\scrapy_practice\csv_automation\csv_automation\spiders\zillow.py", line 17, in parse
    'Address': response.body(".(//h1[@id='ds-chip-property-address']/span)[1]/text()").get(),
TypeError: 'bytes' object is not callable
2021-08-15 22:43:12 [scrapy.extensions.logstats] INFO: Crawled 14 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2021-08-15 22:43:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb/> (referer: None)
2021-08-15 22:43:18 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb/> (referer: None)
Traceback (most recent call last):
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "G:\Python_Practice\scrapy_practice\csv_automation\csv_automation\spiders\zillow.py", line 17, in parse
    'Address': response.body(".(//h1[@id='ds-chip-property-address']/span)[1]/text()").get(),
TypeError: 'bytes' object is not callable
2021-08-15 22:43:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/scrapy.core.downloader.handlers.http11.TunnelError': 1,
 'bans/error/twisted.internet.error.TCPTimedOutError': 106,
 'bans/status/307': 1,
 'downloader/exception_count': 107,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 106,
 'downloader/request_bytes': 66947,
 'downloader/request_count': 141,
 'downloader/request_method_count/GET': 141,
 'downloader/response_bytes': 3019748,
 'downloader/response_count': 34,
 'downloader/response_status_count/200': 15,
 'downloader/response_status_count/301': 18,
 'downloader/response_status_count/307': 1,
 'elapsed_time_seconds': 725.869701,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2021, 8, 15, 16, 43, 18, 373681),
 'log_count/DEBUG': 326,
 'log_count/ERROR': 15,
 'log_count/INFO': 24,
 'response_received_count': 15,
 'scheduler/dequeued': 141,
 'scheduler/dequeued/memory': 141,
 'scheduler/enqueued': 144,
 'scheduler/enqueued/memory': 144,
 'spider_exceptions/TypeError': 15,
 'start_time': datetime.datetime(2021, 8, 15, 16, 31, 12, 503980)}
2021-08-15 22:43:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
PS G:\Python_Practice\scrapy_practice\csv_automation>

I am using proxies to avoid getting banned by the website.

Tags: scrapy

Solution


The problem is in your parse method: you should run your selectors with response.xpath(), not response.body. response.body is the raw page content as bytes, which is exactly why calling it raises TypeError: 'bytes' object is not callable.

def parse(self, response):
    # The address spans include a non-breaking space (\xa0); filter it out
    address = [
        text
        for text in response.xpath('//h1[@id="ds-chip-property-address"]//span//text()').getall()
        if text not in ['\xa0']
    ]
    address = " ".join(address)
    # The class "Text-c11n-8-38-0__aiai24-0 jtMauM" looks auto-generated by a
    # frontend framework and may change later, so select by id instead:
    # find the element with that id, then take its following `span` siblings.
    zestimate, rent_zestimate = response.css('#dsChipZestimateTooltip ~ span::text').getall()[:2]

    yield {
        'Address': address,
        'zestimate': zestimate,
        'rent zestimate': rent_zestimate,
    }
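
Note that the tuple unpacking above raises ValueError when the page yields fewer than two estimate spans (for example, a captcha or error page served through a dead proxy). If you want the spider to keep going in that case, a slightly more defensive sketch of the same extraction:

def parse(self, response):
    address_parts = [
        text
        for text in response.xpath('//h1[@id="ds-chip-property-address"]//span//text()').getall()
        if text != '\xa0'
    ]
    # Pad with None so the item is still yielded when values are missing
    estimates = response.css('#dsChipZestimateTooltip ~ span::text').getall() + [None, None]

    yield {
        'Address': " ".join(address_parts) or None,
        'zestimate': estimates[0],
        'rent zestimate': estimates[1],
    }

Either version can then be exported with scrapy crawl zillow -O items.json (the -O overwrite flag exists since Scrapy 2.1; use -o on older versions).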
