How to crawl a directory full of .html files with Scrapy?

Problem description

I have a folder full of .html files. Is there a way to scrape data from them with Scrapy?

My attempt:

import scrapy
import os

LOCAL_FOLDER = 'html_files/'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

class MySpider(scrapy.Spider):
    name = 'mySpider'
    start_urls = [f"file://{BASE_DIR}/{LOCAL_FOLDER}"]

    def parse(self, response):
        rows = response.xpath('//div[@class="data"]//tbody/tr')
        print(rows)

Directory structure:

html_files/
    ├── b.html
    ├── c.html
    ├── d.html
    ├── e.html
    ├── f.html

Any guidance would be greatly appreciated.

Tags: python, html, web-scraping, scrapy

Solution


I created 4 html files (1.html - 4.html) in an html_files directory. A file:// URL that points at a directory won't return the files inside it, so the spider generates one request per file in start_requests():

import scrapy
import os


class TestSpider(scrapy.Spider):
    name = 'tempspider'
    path = r'html_files'
    # Two directory levels above this spider file (the project root).
    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    def start_requests(self):
        # List the folder via its absolute path so the spider works no
        # matter what the current working directory is, then yield one
        # file:// request per file.
        folder = os.path.join(self.base_dir, self.path)
        for file in os.listdir(folder):
            url = 'file:///' + os.path.join(folder, file)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Print the first text node of each crawled file.
        print(response.xpath('//text()').get())

Output:

[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C1.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C2.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C3.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C4.html> (referer: None)
html 1
html 2
html 3
html 4
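
The spider above just prints the first text node of each file. To extract the table rows from the original attempt, the parse callback in TestSpider can be swapped for something along these lines (a sketch assuming each .html file really contains the div with class "data" from the question's XPath):

    def parse(self, response):
        # Yield one item per table row; response.url records which
        # local file each row came from.
        for row in response.xpath('//div[@class="data"]//tbody/tr'):
            yield {
                'file': response.url,
                'cells': row.xpath('./td//text()').getall(),
            }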

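A side note on the log output: the %5C sequences are URL-escaped Windows backslashes produced by os.path.join(). A variant sketch using pathlib (the spider name local_html is made up here) builds clean file:// URLs on any platform and only picks up .html files:

import scrapy
from pathlib import Path


class LocalHtmlSpider(scrapy.Spider):
    name = 'local_html'

    def start_requests(self):
        # Assumed layout: html_files/ lives two levels above this file,
        # matching the structure in the question.
        folder = Path(__file__).resolve().parent.parent / 'html_files'
        # glob('*.html') skips any non-HTML files in the directory.
        for html_file in folder.glob('*.html'):
            # as_uri() produces a well-formed file:// URL with forward
            # slashes, avoiding the %5C escapes shown in the log above.
            yield scrapy.Request(url=html_file.as_uri(), callback=self.parse)

    def parse(self, response):
        print(response.xpath('//text()').get())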