How to write only the terminal output that contains specific text to a file

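Problem description

I am trying to use 'scrapy' to crawl web page URLs, but I cannot write the output directly to a file with '>'.

I also tried the 'script' command to capture the text shown on the terminal screen. It works, but it records everything and uses a lot of storage in a short time. I plan to run the crawl overnight, and I am worried that my storage will fill up.

For example, this is text from the terminal: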

2020-11-05 17:22:10 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'seek.bloggang.com': <GET http://seek.bloggang.com>
2020-11-05 17:22:10 [scrapy.core.scraper] ERROR: Spider error processing <GET https://pantip.com/topic/39275448> (referer: https://pantip.com/tag/Marriage_Visa)
Traceback (most recent call last):
  File "/home/noah/.local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/home/noah/.local/lib/python3.6/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/noah/.local/lib/python3.6/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)

I only want the lines that contain a URL (inside the angle brackets), for example:

2020-11-05 17:22:10 [scrapy.core.scraper] ERROR: Spider error processing <GET https://pantip.com/topic/39275448> (referer: https://pantip.com/tag/Marriage_Visa)

Do you have any ideas for this case?

Best regards.

P.S. I have also attached my code.

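import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "spider"
    allowed_domains = ['pantip.com']
    start_urls = ['https://pantip.com']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding state.
        super().__init__(*args, **kwargs)
        self.links = []  # URLs of every page visited so far

    def parse(self, response):
        # Record the URL of the page that was just fetched.
        self.links.append(response.url)
        # Follow every link on the page and parse it the same way.
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)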

Tags: python, linux, terminal, scrapy

Solution
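
One possible approach, sketched under a few assumptions: Scrapy normally sends its log to standard error, which would explain why '>' alone writes nothing to the file. Instead of recording the whole session with 'script', the log can be piped through a small filter that keeps only the lines containing a tag such as <GET https://...>. The script below is a minimal sketch of that idea; the file name filter_urls.py, the function name keep_url_lines, and the regular expression are illustrative choices, not part of Scrapy.

```python
import re
import sys

# Assumption: the wanted lines contain a request tag such as
# "<GET https://example.com/page>". Adjust the pattern if your log differs.
URL_IN_BRACKETS = re.compile(r"<[A-Z]+ https?://[^>\s]+>")


def keep_url_lines(lines):
    """Yield only the lines that contain a '<METHOD url>' tag."""
    for line in lines:
        if URL_IN_BRACKETS.search(line):
            yield line


if __name__ == "__main__":
    # Process the log line by line so a long overnight run is never
    # held in memory, and flush so the output file stays up to date.
    for line in keep_url_lines(sys.stdin):
        sys.stdout.write(line)
        sys.stdout.flush()
```

Assuming the script is saved as filter_urls.py, it could be used as: scrapy crawl spider 2>&1 | python filter_urls.py > urls.log. The 2>&1 part merges standard error into the pipe so the filter sees the log. Note that this keeps every line with such a tag, including the DEBUG "Filtered offsite request" line from the example; if only error lines are wanted, the condition could additionally check that "ERROR" appears in the line.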

