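python - Scrapy: output one CSV file per start URL
Problem description
I want to output one CSV file for each start_url. I wrote a pipeline that outputs only a single file containing the information from all URLs, but I don't know how to output multiple files.
pipelines.py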
import csv

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class CSVPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # One file and one exporter per spider, so all URLs end up in the same CSV.
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['date', 'move', 'bank', 'call', 'price']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        # Copy the exported file to a cleaned version without blank rows.
        print('Starting csv blank line cleaning')
        with open('%s_items.csv' % spider.name, 'r') as f:
            reader = csv.reader(f)
            original_list = list(reader)
            cleaned_list = list(filter(None, original_list))
        with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
            wr = csv.writer(output_file, dialect='excel')
            for data in cleaned_list:
                wr.writerow(data)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class SentimentPipeline(object):
    def process_item(self, item, spider):
        return item
I have been running:
scrapy crawl spider -o spider.csv
Do I need a new command? I'm very new to Scrapy. Thanks!
Solution
You need to create a CSV item pipeline like the following in your pipelines.py file:
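import re

from scrapy.exporters import CsvItemExporter


class PerUrlCsvExportPipeline:
    def open_spider(self, spider):
        # Maps each URL to its (file, exporter) pair.
        self.url_to_exporter = {}

    def close_spider(self, spider):
        for f, exporter in self.url_to_exporter.values():
            exporter.finish_exporting()
            f.close()

    def _exporter_for_item(self, item):
        # Assumes every item carries a 'url' field holding the start URL it
        # came from; the spider has to set that field itself.
        url = item['url']
        if url not in self.url_to_exporter:
            # Assumed naming scheme: derive a file-system-safe name from the
            # URL. Replace it with whatever file name you prefer.
            file_name = re.sub(r'[^A-Za-z0-9]+', '_', url).strip('_')
            f = open('{}.csv'.format(file_name), 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.url_to_exporter[url] = (f, exporter)
        return self.url_to_exporter[url][1]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item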
Then add the pipeline to your settings.py file:
ITEM_PIPELINES = {
    'your_project_name.pipelines.PerUrlCsvExportPipeline': 300,
}
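The one-exporter-per-URL idea behind PerUrlCsvExportPipeline can be tried outside Scrapy. The following is a minimal sketch that uses only the standard library's csv module instead of CsvItemExporter. The names file_name_for and PerUrlCsvWriter, the naming scheme, and the example.com URLs are all assumptions made for illustration; none of them come from Scrapy or from the answer above.

```python
import csv
import os
import re
import tempfile


def file_name_for(url):
    """Turn a URL into a file-system-safe CSV name (hypothetical naming scheme)."""
    return re.sub(r'[^A-Za-z0-9]+', '_', url).strip('_') + '.csv'


class PerUrlCsvWriter:
    """Route rows to one CSV file per URL, opening each file on first use."""

    def __init__(self, out_dir, fieldnames):
        self.out_dir = out_dir
        self.fieldnames = fieldnames
        self.url_to_writer = {}  # url -> (file handle, csv.DictWriter)

    def write(self, row):
        url = row['url']
        if url not in self.url_to_writer:
            # First row for this URL: create its file and write the header.
            path = os.path.join(self.out_dir, file_name_for(url))
            f = open(path, 'w', newline='', encoding='utf-8')
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            writer.writeheader()
            self.url_to_writer[url] = (f, writer)
        self.url_to_writer[url][1].writerow(row)

    def close(self):
        # Close every file that was opened along the way.
        for f, _ in self.url_to_writer.values():
            f.close()


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as out_dir:
        writer = PerUrlCsvWriter(out_dir, ['url', 'price'])
        writer.write({'url': 'https://example.com/a', 'price': '1'})
        writer.write({'url': 'https://example.com/b', 'price': '2'})
        writer.write({'url': 'https://example.com/a', 'price': '3'})
        writer.close()
        print(sorted(os.listdir(out_dir)))
```

As in the pipeline, the dictionary lookup decides which file a row goes to, so rows from different URLs can arrive in any order. Every row must carry the URL it belongs to, which is why the Scrapy version needs a 'url' field on each item.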