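javascript - Scraping stock details from Business Insider with Scrapy
Problem description
I am trying to extract the "name", "latest price" and "percentage" fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500
However, I am not getting any data, even though I have confirmed that my XPaths match these fields in the Chrome console.
For reference, I have been following this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/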
items.py
from scrapy.item import Item, Field


class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()
investment_spider.py
from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem


class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
        "https://markets.businessinsider.com/index/components/s&p_500",
    ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')
        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]
            yield item
Console output:
...
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
Your XPath expressions are missing the leading "./". I have also simplified your XPath, which no longer goes through tbody:
def parse(self, response):
    # Rows are selected directly under the table, without tbody.
    stocks = response.xpath('//table[@class="table table-small"]/tr')
    # stocks[1:] skips the first row (the table header).
    for stock in stocks[1:]:
        item = InvestmentItem()
        item['name'] = stock.xpath('./td[1]/a/text()').get()
        item['px'] = stock.xpath('./td[2]/text()[1]').get().strip()
        item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()
        yield item
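
A common reason an XPath that works in the Chrome console returns nothing in Scrapy is that browsers insert a <tbody> element when they build the DOM, while the raw HTML that Scrapy downloads may not contain one. The sketch below illustrates this offline. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup, company names and prices in RAW_HTML are made up for illustration, as is the helper name extract_rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw HTML sent by a server.
# There is deliberately no <tbody>; all names and values are made up.
RAW_HTML = """
<div id="index-list-container">
  <table class="table table-small">
    <tr><th>Name</th><th>Latest Price</th></tr>
    <tr><td><a href="#">Example Corp A</a></td><td> 1.00 <br/>0.99</td></tr>
    <tr><td><a href="#">Example Corp B</a></td><td> 2.00 <br/>1.98</td></tr>
  </table>
</div>
"""

# Path as it might be copied from browser dev tools (includes tbody).
BROWSER_PATH = ".//table[@class='table table-small']/tbody/tr"
# Path that matches the raw markup (no tbody).
RAW_PATH = ".//table[@class='table table-small']/tr"


def extract_rows(markup, row_path):
    """Return (name, price) tuples for each data row matched by row_path."""
    root = ET.fromstring(markup)
    rows = root.findall(row_path)
    results = []
    for row in rows[1:]:  # skip the header row, like stocks[1:]
        link = row.find("./td[1]/a")
        price_cell = row.find("./td[2]")
        if link is None or price_cell is None:
            continue  # row does not have the expected cells
        # "or ''" guards against missing text before calling strip()
        name = (link.text or "").strip()
        price = (price_cell.text or "").strip()
        results.append((name, price))
    return results


with_tbody = extract_rows(RAW_HTML, BROWSER_PATH)
without_tbody = extract_rows(RAW_HTML, RAW_PATH)
print(with_tbody)     # no rows: the raw markup has no tbody
print(without_tbody)  # two data rows
```

In a real spider, the same check can be done by running the XPath against the downloaded response rather than the browser DOM. Guarding against a missing value before calling strip() also avoids an AttributeError when a cell turns out to be empty.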