For loop doesn't scrape all the items, only one

Problem description

I am trying to scrape a page containing about 20 articles, but for some reason the spider only finds the required information for the first article. How can I get it to scrape every article on the page?

I have tried changing the XPath several times, but I think I am too new to this to pinpoint the problem. When I take all the paths out of the for loop it scrapes everything fine, but in that form I can't export the data to a CSV file.

import scrapy


class AfgSpider(scrapy.Spider):
    name = 'afg'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@id='taxonomy-page-block']")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()

            yield {
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }

Tags: xpath, web-scraping, scrapy

Solution


You can use this code to collect the required information:

import scrapy


class AfgSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.pajhwok.com']  # domains only, no URL path
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.css("div#taxonomy-page-block div.node-article")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()

            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
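To get the yielded items into a CSV file (the original goal), Scrapy's built-in feed exports can write them directly; assuming the spider above is saved inside a Scrapy project, a run like this produces one CSV row per yielded dict:

```shell
# writes one row per yielded item; the columns come from the dict keys
scrapy crawl test -o items.csv
```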

The problem is this line of your code: container = response.xpath("//div[@id='taxonomy-page-block']")

It returns only one element, because an id is supposed to be unique within a page, whereas a class can be shared by several tags. Your for loop therefore iterated over a single container node instead of over the individual articles, so only the first match of each inner XPath was yielded. Selecting the repeated div.node-article nodes inside that container gives the loop one selector per article.
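The id-versus-class point can be illustrated with a self-contained sketch (stdlib only, using a simplified, hypothetical version of the page's markup): the unique id matches exactly once, while the repeated class matches once per article.

```python
# Count how many elements a unique id vs. a repeated class would select.
from html.parser import HTMLParser

# Hypothetical, simplified version of the page's structure.
HTML = """
<div id="taxonomy-page-block">
  <div class="node-article"><h2 class="node-title"><a href="/a1">First</a></h2></div>
  <div class="node-article"><h2 class="node-title"><a href="/a2">Second</a></h2></div>
  <div class="node-article"><h2 class="node-title"><a href="/a3">Third</a></h2></div>
</div>
"""

class Counter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.id_matches = 0      # elements with id="taxonomy-page-block"
        self.class_matches = 0   # elements with class "node-article"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "taxonomy-page-block":
            self.id_matches += 1
        if "node-article" in attrs.get("class", "").split():
            self.class_matches += 1

c = Counter()
c.feed(HTML)
print(c.id_matches, c.class_matches)  # 1 3
```

Looping over the single id match gives one iteration total; looping over the class matches gives one iteration per article, which is why the accepted answer narrows the container to div.node-article.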

