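python - Relative XPath does not work when used with a "dot" inside a loop
Problem description
I am new to Python and Scrapy. I wrote a spider, but I am having trouble with relative paths. If I do not use the leading dot inside the loop, it prints the same result on every iteration. If I do use the dot inside the loop, the log says items were scraped, but the text is empty.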
import scrapy
from demo_proj.items import JokeItem
from scrapy.loader import ItemLoader
from scrapy import Selector


class JokesSpider(scrapy.Spider):
    name = 'jokes'
    allowed_domains = ['kitco.com']
    start_urls = [
        'https://www.kitco.com/'
    ]

    def parse(self, response):
        for joke in response.xpath("//div[@class='top15']"):
            l = ItemLoader(item=JokeItem(), selector=joke)
            l.add_xpath('news', ".//div[@class='top15']/a/h3")
            l.add_xpath('time', ".//div[@class='top15']/span[@class='post-date']")
            l.add_xpath('source', ".//div[@class='top15']/span[@class='source']")
            yield l.load_item()
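Solution
The //div[@class='top15'] step inside your for loop is redundant. You already narrowed the selection to that div before entering the loop, so each joke selector is a top15 div itself. Repeating the step makes XPath look for another top15 div nested inside the current one, which finds nothing. Start the paths from the current node instead, and end them with text() so the loader collects the text rather than the element markup. The spider would be: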
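import scrapy
from scrapy.loader import ItemLoader
from demo_proj.items import JokeItem


class JokesSpider(scrapy.Spider):
    name = 'jokes'
    allowed_domains = ['kitco.com']
    start_urls = [
        'https://www.kitco.com/'
    ]

    def parse(self, response):
        # Each joke selector is already one div with class top15.
        for joke in response.xpath("//div[@class='top15']"):
            l = ItemLoader(item=JokeItem(), selector=joke)
            # Paths begin with "./" so they are evaluated relative to joke.
            l.add_xpath('news', "./a/h3/text()")
            l.add_xpath('time', "./span[@class='post-date']/text()")
            l.add_xpath('source', "./span[@class='source']/text()")
            yield l.load_item()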
items.py is:
import scrapy


class JokeItem(scrapy.Item):
    news = scrapy.Field()
    time = scrapy.Field()
    source = scrapy.Field()
Here are a few lines from my log:
{'news': ['The real gold price rally hasn’t even started yet, says analyst who '
'...'],
'source': ['Kitco Video News'],
'time': ['Dec 9']}
2019-12-10 10:08:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kitco.com/>
{'news': ['Who will win the 2020 presidential election? Doug Casey weighs in '
'on ...'],
'source': ['Kitco News'],
'time': ['Dec 9']}
2019-12-10 10:08:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kitco.com/>
{'news': ['What kind of a gold investor are you?'],
'source': ['Kitco News'],
'time': ['Dec 9']}
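To see why repeating the div step inside the loop returns nothing, here is a minimal offline sketch. It makes several assumptions: the markup is a hypothetical stand-in for the page structure, not the real kitco.com HTML, and it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) instead of Scrapy selectors, so that it runs without a network or extra packages. The helper name scrape is illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mimics the structure the spider expects.
HTML = """
<html><body>
  <div class="top15">
    <a href="#"><h3>First headline</h3></a>
    <span class="post-date">Dec 9</span>
    <span class="source">Source A</span>
  </div>
  <div class="top15">
    <a href="#"><h3>Second headline</h3></a>
    <span class="post-date">Dec 9</span>
    <span class="source">Source B</span>
  </div>
</body></html>
""".strip()

root = ET.fromstring(HTML)


def scrape(row_path):
    """Collect headline text per top15 div, using a path relative to that div."""
    results = []
    for div in root.findall(".//div[@class='top15']"):
        results.append([h3.text for h3 in div.findall(row_path)])
    return results


# Repeating the div step searches for a top15 div nested inside each top15 div.
repeated = scrape(".//div[@class='top15']/a/h3")
# Starting from the current div reaches its own children.
relative = scrape("./a/h3")

print(repeated)
print(relative)
```

In this sketch the repeated path yields an empty list for every div, which matches the "scraped but blank" items in the question, while the path that starts at the current node yields one headline per div.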