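xpath - Unable to get next_maintext from the second page
Problem description
page1 and page2 URLs. I want to get all the content from the first URL and only the main text from the second URL, and append that text to the main text from the first URL. Together they are a single article. The function parse_indianexpress_archive_links() returns a list of news article URLs. I get all the results from page1, but in the output the next_maintext column for page2 contains <GET http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/2>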
class spider_indianexpress(scrapy.Spider):
    name = 'indianexpress'
    start_urls = parse_indianexpress_archive_links()

    def parse(self,response):
        items = ScrapycrawlerItem()
        separator = ''
        #article_url = response.xpath("//link[@rel = 'canonical']/@href").extract_first()
        article_url = response.request.url
        date_updated = max(response.xpath("//div[@class = 'story-date']/text()").extract() , key=len)[-27:] #Call max(list, key=len) to return the longest string in list by comparing the lengths of all strings in a list
        if len(date_updated) <=10:
            date_updated = max(response.xpath("//div[@class = 'story-date']/p/text()").extract() , key=len)[-27:]
        headline = response.xpath("(//div[@id = 'ie2013-content']/h1//text())").extract()
        headline=separator.join(headline)
        image_url = response.css("div.storybigpic.ssss img").xpath("@src").extract_first()
        maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        maintext = ' '.join(map(str, maintext))
        maintext = maintext.replace('\r','')
        contd = response.xpath("//div[@class = 'ie2013-contentstory']/p[@align = 'right']/text()").extract_first()
        items['date_updated'] = date_updated
        items['headline'] = headline
        items['maintext'] = maintext
        items['image_url'] = image_url
        items['article_url'] = article_url
        next_page_url = response.xpath("//a[@rel='canonical']/@href").extract_first()
        if next_page_url :
            items['next_maintext'] = scrapy.Request(next_page_url , callback = self.parse_page2)
        yield items

    def parse_page2(self, response):
        next_maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        next_maintext = ' '.join(map(str, next_maintext))
        next_maintext = next_maintext.replace('\r','')
        yield {next_maintext}
Output:
article_url,date_publish,date_updated,description,headline,image_url,maintext,next_maintext
http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/,,"Fri Apr 03 2009, 14:49 hrs ",,Congress approves 2010 budget plan,http://static.indianexpress.com/m-images/M_Id_69893_Obama.jpg,"The Democratic-controlled US Congress on Thursday approved budget blueprints embracing President Barack Obama's agenda but leaving many hard choices until later and a government deeply in the red. With no Republican support, the House of Representatives and Senate approved slightly different, less expensive versions of Obama's $3.55 trillion budget plan for fiscal 2010, which begins on October 1. The differences will be worked out over the next few weeks. Obama, who took office in January after eight years of the Republican Bush presidency, has said the Democrats' budget is critical to turning around the recession-hit US economy and paving the way for sweeping healthcare, climate change and education reforms he hopes to push through Congress this year. Obama, traveling in Europe, issued a statement praising the votes as ""an important step toward rebuilding our struggling economy."" Vice President Joe Biden, who serves as president of the Senate, presided over that chamber's vote. Democrats in both chambers voted down Republican alternatives that focused on slashing massive deficits with large cuts to domestic social spending but also offered hefty tax breaks for corporations and individuals. ""Democrats know that those policies are the wrong way to go,"" House Majority Leader Steny Hoyer told reporters. ""Our budget lays the groundwork for a sustained, shared and job-creating recovery."" But Republicans have argued the Democrats' budget would be a dangerous expansion of the federal government and could lead to unnecessary taxes that would only worsen the country's long-term fiscal situation. ""The Democrat plan to increase spending, to increase taxes, and increase the debt makes no difficult choices,"" said House Minority Leader John Boehner. 
""It's a roadmap to disaster."" The budget measure is nonbinding but it sets guidelines for spending and tax bills Congress will consider later this year. BIPARTISANSHIP ABSENT AGAIN Obama has said he hoped to restore bipartisanship when he arrived in Washington but it was visibly absent on Thursday. ... contd.",<GET http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/2>
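Solution
That's not how Scrapy works (I mean the next_page request); see "How to get request's response object synchronously on Scrapy?".
But in fact you don't need synchronous requests. You only need to check for a next page and pass the current state (item) to the callback that will handle that next page. I'm using cb_kwargs (now the recommended way). If you have an older version, you may need to use request.meta instead.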
import scrapy

class spider_indianexpress(scrapy.Spider):
    name = 'indianexpress'
    start_urls = ['http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/']

    def parse(self,response):
        item = {}
        separator = ''
        #article_url = response.xpath("//link[@rel = 'canonical']/@href").extract_first()
        article_url = response.request.url
        date_updated = max(response.xpath("//div[@class = 'story-date']/text()").extract() , key=len)[-27:] #Call max(list, key=len) to return the longest string in list by comparing the lengths of all strings in a list
        if len(date_updated) <=10:
            date_updated = max(response.xpath("//div[@class = 'story-date']/p/text()").extract() , key=len)[-27:]
        headline = response.xpath("(//div[@id = 'ie2013-content']/h1//text())").extract()
        headline=separator.join(headline)
        image_url = response.css("div.storybigpic.ssss img").xpath("@src").extract_first()
        maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        maintext = ' '.join(map(str, maintext))
        maintext = maintext.replace('\r','')
        contd = response.xpath("//div[@class = 'ie2013-contentstory']/p[@align = 'right']/text()").extract_first()
        item['date_updated'] = date_updated
        item['headline'] = headline
        item['maintext'] = maintext
        item['image_url'] = image_url
        item['article_url'] = article_url
        next_page_url = response.xpath('//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href').extract_first()
        if next_page_url :
            # Do not yield the item yet: hand it to the callback for the next page
            yield scrapy.Request(
                url=next_page_url,
                callback = self.parse_next_page,
                cb_kwargs={
                    'item': item,
                }
            )
        else:
            yield item

    def parse_next_page(self, response, item):
        next_maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        next_maintext = ' '.join(map(str, next_maintext))
        next_maintext = next_maintext.replace('\r','')
        item["maintext"] += ' ' + next_maintext  # keep a space between the text of consecutive pages
        next_page_url = response.xpath('//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href').extract_first()
        if next_page_url :
            yield scrapy.Request(
                url=next_page_url,
                callback = self.parse_next_page,
                cb_kwargs={
                    'item': item,
                }
            )
        else:
            yield item