python - How can I use Scrapy to crawl a directory full of .html files?
Problem description
I have a folder full of .html files. Is there a way to use Scrapy to extract data from them?
My attempt:
import scrapy
import os

LOCAL_FOLDER = 'html_files/'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

class MySpider(scrapy.Spider):
    name = 'mySpider'
    start_urls = [f"file://{BASE_DIR}/{LOCAL_FOLDER}"]

    def parse(self, response):
        rows = response.xpath('//div[@class="data"]//tbody/tr')
        print(rows)
Directory structure:
html_files/
├── b.html
├── c.html
├── d.html
├── e.html
└── f.html
Any guidance would be greatly appreciated.
Solution
I created 4 HTML files (1.html - 4.html) in the html_files directory:
import scrapy
import os

class TestSpider(scrapy.Spider):
    name = 'tempspider'
    path = r'html_files'
    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    def start_requests(self):
        # Request each file in the directory as a file:// URL.
        for file in os.listdir(self.path):
            url = 'file:///' + os.path.join(self.base_dir, self.path, file)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.xpath('//text()').get())
Output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C1.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C2.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C3.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C4.html> (referer: None)
html 1
html 2
html 3
html 4
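As a side note, manually concatenating `'file:///' + os.path.join(...)` produces the percent-encoded backslashes (`%5C`) visible in the log on Windows. `pathlib.Path.as_uri()` builds a well-formed `file://` URL on both Windows and POSIX. A minimal sketch of a helper you could call from `start_requests` (the name `local_html_urls` is my own, not part of Scrapy):

```python
from pathlib import Path

def local_html_urls(folder):
    """Yield a well-formed file:// URL for every .html file in *folder*.

    Path.as_uri() handles drive letters and path separators correctly,
    so no manual 'file:///' string concatenation is needed.
    """
    for html_file in sorted(Path(folder).glob('*.html')):
        yield html_file.as_uri()
```

In the spider above, `start_requests` could then simply iterate `local_html_urls(self.path)` and yield a `scrapy.Request` for each URL.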