Scrapy: following links from a JSON object returned by a REST API

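Question

I am writing a spider that should collect every URL linked from this page, including the ones on the later pagination pages: https://www.ibm.com/search?lang=de&cc=de&q=iot. I can get the links by calling the search API.

My problem: I don't know how to follow the links I extract, because Scrapy's link extractors only work on selectors, not on JSON objects.

When I try to fetch a URL with a second request like this: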

url = result.get('url')
content = scrapy.Request(url=url,callback=self.parse_content)

the content variable only contains something like: Request GET http://www-01.ibm.com/support/docview.wss?uid=ibm10884852

Please help. Here is my full code:

import scrapy
import json


class IbmSpiderSpider(scrapy.Spider):
    name = 'ibm_spider'
    start_urls = ['http://www.ibm.com/search?lang=de/']

    def start_requests(self):
        urls_=[]            
        for i in range(0,10):
                urls_.append('https://www-api.ibm.com/api/v1/search/aggr/rest/appid/mh?bookmark=eyJzZXJ2aWNlTmFtZSI6Imtub3dsZWRnZUNlbnRlciIsInRvdGFsIjoyOTMzNSwiY291bnQiOjMsInNtQ291bnQiOjAsIm9mZnNldCI6NiwiZmFpbGVkUGFnZXMiOltdfS17InNlcnZpY2VOYW1lIjoiZXNxcyIsInRvdGFsIjo0MDE3MywiY291bnQiOjE3LCJzbUNvdW50IjoyLCJvZmZzZXQiOjMyLCJmYWlsZWRQYWdlcyI6W119LXsicGFnZSI6MywicXVlcnkiOiJpb3QifQ&cachebust=1559896290619&dict=spelling&fr=60&nr=20&page={0}&query=iot&rc=de&refinement=ibmcom&rmdt=entitled&sm=true&smnr=20MzNSwiY291bnQiOjMsInNtQ291bnQiOjAsIm9mZnNldCI6NiwiZmFpbGVkUGFnZXMiOltdfS17InNlcnZpY2VOYW1lIjoiZXNxcyIsInRvdGFsIjo0MDE3MywiY291bnQiOjE3LCJzbUNvdW50IjoyLCJvZmZzZXQiOjMyLCJmYWlsZWRQYWdlcyI6W119LXsicGFnZSI6MywicXVlcnkiOiJpb3QifQ'.format(i))
        for url in urls_:
            yield scrapy.Request(url=url,callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('resultset').get('searchresults').get('searchresultlist')
        for result in results:
            url = result.get('url')
            content = scrapy.Request(url=url,callback=self.parse_content)
            yield {
                'title':  result.get('title'),
                'url':  url,
                # added to extract Links content
                'content': content
            }

    def parse_content(self,response):
        return response.text

Tags: json, python-3.x, rest, scrapy

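Answer

In your parse method you should yield the content request, not a dict. A Request object only describes a download; Scrapy fetches the page only once the request has been yielded to the engine, and it then passes the response to the request's callback. For example: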

def parse(self, response):
    data = json.loads(response.body)
    results = data.get('resultset').get('searchresults').get('searchresultlist')
    for result in results:
        url = result.get('url')
        yield scrapy.Request(url, self.parse_content, meta={'title': result.get('title')})
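
One thing to watch in the parse method: the chained .get() calls raise AttributeError as soon as one level of the JSON is missing, because .get() returns None for it. The following is a minimal sketch of a more forgiving lookup, assuming the same three key names as in the question. The helper name extract_results and the sample payload are hypothetical and only serve as illustration.

```python
import json


def extract_results(data):
    """Return the list of search results, or [] if any level is missing."""
    node = data
    # Key names taken from the question's parse method.
    for key in ('resultset', 'searchresults', 'searchresultlist'):
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    return node if isinstance(node, list) else []


# Hypothetical payload that mimics the nesting used in the question.
sample = json.loads(
    '{"resultset": {"searchresults": {"searchresultlist": '
    '[{"title": "Doc", "url": "https://example.com/doc"}]}}}'
)

print(extract_results(sample))                # the one-element result list
print(extract_results({"resultset": None}))   # an empty list instead of an error
```

Inside the spider, the loop could then iterate over extract_results(data) instead of the chained lookup.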

In parse_content you then have access to the title (passed along through meta), the URL, and the content of the fetched page:

def parse_content(self, response):
    # Put your own logic here; the title travels with the request via meta
    print(response.meta['title'])
    print(response.url)
    print(response.text)
