Scrapy XML pipeline

Problem description

I need to build a spider that outputs a separate XML file for each article.

pipelines.py:

from scrapy.exporters import XmlItemExporter
from datetime import datetime

class CommonPipeline(object):
    def process_item(self, item, spider):
        return item

class XmlExportPipeline(object):
    def __init__(self):
        self.files = {}

    def process_item(self, item, spider):
        file = open((spider.name + datetime.now().strftime("_%H%M%S%f.xml")), 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()
        self.exporter.export_item(item)
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        return item

Output:

<?xml version="1.0" encoding="utf-8"?>
    <items>
        <item>
            <text_img> Nelson Argaña. Foto: Gustavo Velázquez 970AM. Hace 1 hora  </text_img>
            <title>Nelson Argaña lamentó que Mario Abdo esté rodeado de corruptos </title>
            <url>https://www.lanacion.com.py/politica/2019/03/23/nelson-argana-lamento-que-mario-abdo-este-rodeado-de-corruptos/</url>
            <content>    Nelson Argaña, hijo de Luis María Arg ...</content>
            <sum_content>4805</sum_content>
            <time>14:30:06</time>
            <date>20190323</date>
        </item>
    </items>

But I need the output to look like this:

<?xml version="1.0" encoding="iso-8859-1"?>
    <article>
        <text_img> Nelson Argaña. Foto: Gustavo Velázquez 970AM. Hace 1 hora  </text_img>
        <title>Nelson Argaña lamentó que Mario Abdo esté rodeado de corruptos </title>
        <url>https://www.lanacion.com.py/politica/2019/03/23/nelson-argana-lamento-que-mario-abdo-este-rodeado-de-corruptos/</url>
        <content>    Nelson Argaña, hijo de Luis María Arg ...</content>
        <sum_content>4805</sum_content>
        <time>14:30:06</time>
        <date>20190323</date>
    </article>

settings.py:

ITEM_PIPELINES = {
    'common.pipelines.XmlExportPipeline': 300,
}
FEED_EXPORTERS_BASE = {
    'xml': 'scrapy.contrib.exporter.XmlItemExporter',
}

I tried adding the following to settings.py:

FEED_EXPORT_ENCODING = 'iso-8859-1'
FEED_EXPORT_FIELDS = ["article"]

But it has no effect: the output still uses UTF-8 and the <items>/<item> wrappers.

I am using Scrapy 1.4.0.

Tags: python, xml, scrapy, pipeline

Solution
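No answer text survives on this page, so what follows is only a sketch of one possible approach, not the original answer. Two observations motivate it. First, FEED_EXPORT_ENCODING and FEED_EXPORT_FIELDS are read by Scrapy's feed exports (the -o option or FEED_URI), not by an exporter that a pipeline constructs by hand, and FEED_EXPORT_FIELDS lists item field names rather than the root element name. Second, XmlItemExporter always writes a root element with one child element per item, so a lone <article> root needs either a custom exporter or manual serialization. The sketch below takes the manual route with the standard library's xml.etree.ElementTree. The names item_to_article_xml and ArticleXmlPipeline are hypothetical, and the code assumes Python 3 and item field names that are valid XML element names.

```python
from datetime import datetime
import xml.etree.ElementTree as ET


def item_to_article_xml(item, encoding="iso-8859-1"):
    """Serialize one item as a single <article> element and return bytes.

    Hypothetical helper: field names are assumed to be valid XML names.
    """
    root = ET.Element("article")
    for name, value in dict(item).items():
        child = ET.SubElement(root, name)
        child.text = "" if value is None else str(value)
    # For encodings other than UTF-8 / US-ASCII, ElementTree prepends the
    # XML declaration itself; characters that the encoding cannot represent
    # are written as numeric character references.
    return ET.tostring(root, encoding=encoding)


class ArticleXmlPipeline(object):
    """Writes every item to its own file, without <items>/<item> wrappers."""

    def process_item(self, item, spider):
        # Same file naming scheme as the pipeline in the question.
        filename = spider.name + datetime.now().strftime("_%H%M%S%f.xml")
        with open(filename, "wb") as f:
            f.write(item_to_article_xml(item))
        return item
```

To try it, the pipeline would be registered in ITEM_PIPELINES in place of XmlExportPipeline, for example as 'common.pipelines.ArticleXmlPipeline': 300, assuming it lives in the same module as the pipeline in the question. ElementTree writes the declaration with single quotes (encoding='iso-8859-1'), which XML parsers treat the same as the double-quoted form.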

