Scraping stock details from Business Insider with Scrapy

Problem description

I am trying to extract the "name", "latest price", and "percentage" fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500

However, I get no data back, even though I have confirmed in the Chrome console that my XPaths work for these fields.

For reference, I have been following this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/

items.py

from scrapy.item import Item, Field

class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()

investment_spider.py

from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem

class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
            "https://markets.businessinsider.com/index/components/s&p_500",
            ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')

        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]

            yield item

Console output:

...
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)

Tags: javascript, python, reactjs, web-scraping, scrapy

Solution


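Your row-level XPath expressions are missing a leading "./", which makes them explicitly relative to the current row. I have also simplified your XPaths. Note that the simplified row selector no longer contains a tbody step: browsers insert tbody when they build the DOM, so a path that works in the Chrome console can match nothing in the raw HTML that Scrapy downloads. The percentage path also gains a trailing text() so that it returns the text instead of the whole span element: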

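def parse(self, response):
    # Select rows directly under the table, with no tbody step.
    stocks = response.xpath('//table[@class="table table-small"]/tr')

    # Skip the first row (presumably the table header).
    for stock in stocks[1:]:
        item = InvestmentItem()
        # The leading "./" keeps each expression relative to the current row.
        item['name'] = stock.xpath('./td[1]/a/text()').get()
        item['px'] = stock.xpath('./td[2]/text()[1]').get().strip()
        item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()

        yield item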
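
The tbody effect can be reproduced without running a crawl. What follows is a minimal sketch, not the answer author's code: it uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (ElementTree supports only a subset of XPath), and the table fragment, company names, and prices are made up for illustration. It assumes the raw HTML has rows directly under the table, as the selector in the answer suggests.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw HTML sent by the server.
# There is no <tbody> here; browsers add one when they build the DOM.
RAW_HTML = """
<div id="index-list-container">
  <table class="table table-small">
    <tr><th>Name</th><th>Latest Price</th></tr>
    <tr><td><a href="/stocks/aaa">Example Corp A</a></td><td> 10.50 <br/>10.40</td></tr>
    <tr><td><a href="/stocks/bbb">Example Corp B</a></td><td> 20.25 <br/>20.00</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# A path with a tbody step (as copied from a browser) matches nothing.
rows_with_tbody = root.findall(".//table[@class='table table-small']/tbody/tr")

# Without the tbody step, every row matches, including the header row.
rows = root.findall(".//table[@class='table table-small']/tr")


def parse_rows(table_rows):
    """Extract name and price from each data row using row-relative paths."""
    results = []
    for row in table_rows[1:]:  # skip the header row
        link = row.find("./td[1]/a")
        price_cell = row.find("./td[2]")
        if link is None or price_cell is None:
            continue  # tolerate rows that do not have the expected cells
        # .text holds only the text before the first child element (<br/>).
        results.append({"name": link.text, "px": (price_cell.text or "").strip()})
    return results


stocks = parse_rows(rows)
print(len(rows_with_tbody), len(rows), stocks)
```

In Scrapy itself, the same check can be made interactively with scrapy shell followed by the page URL, then running response.xpath(...) with and without the tbody step. The None check in the sketch is a design choice: calling .strip() on a missing cell would otherwise raise an AttributeError.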
