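python - Scrapy reports a vague error when trying to crawl a website
Problem description
I am building a web spider to scrape Yahoo Finance. I want it to follow the market index link on the home page and read the last closing price from the table on the corresponding market index page. When I run it, this is all the output I get: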
2021-05-29 11:39:21 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2021-05-29 11:39:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform macOS-10.16-x86_64-i386-64bit
2021-05-29 11:39:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-05-29 11:39:21 [scrapy.crawler] INFO: Overridden settings:
{}
2021-05-29 11:39:21 [scrapy.extensions.telnet] INFO: Telnet Password: 8306af0a852a89a8
2021-05-29 11:39:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
Here is the code:
import scrapy
from scrapy.crawler import CrawlerProcess

class YahooFinanceSpider(scrapy.Spider):
    name = "Yahoo Stock Scraper"
    button_loc = '//*[@id="marketsummary-itm-0"]/h3/a[1]'
    close_loc = '//*[@id="quote-summary"]/div[1]/table/tbody/tr[1]/td[2]/span/text()'

    def __init__(self, urls):
        self.urls=urls

    def start_requests(self):
        for url in self.urls:
            scrapy.Request(url=url, callback=self.parse_front)

    def parse_front(self, response):
        button = response.xpath(YahooFinanceSpider.button_loc)
        button_link = button.css('a.Fz\(s\).Ell.Fw\(600\).C\(\$linkColor ::attr(href)')
        links_to_follow = button_link.extract()
        for url in links_to_follow:
            yield response.follow(url = url, callback = self.parse_pages)

    def parse_pages(self, response):
        closing_value = response.xpath(YahooFinanceSpider.close_loc).extract()
        for value in closing_value:
            print(value)

prices = []
urls=['https://finance.yahoo.com/']
yscraper=YahooFinanceSpider(urls)
process = CrawlerProcess()
process.crawl(YahooFinanceSpider)
process.start()
Solution
You should use process.crawl(yscraper) instead of process.crawl(YahooFinanceSpider). You are instantiating the object yscraper but never using it.