python - Scrapy with Splash does not wait for the website to load

Problem description

I am trying to render and scrape an interactive website by calling Splash from a Python script, basically following this tutorial:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "myspider"  # placeholder name; Scrapy requires every spider to define one
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        filename = 'mywebsite-%s.html' % '1'
        with open(filename, 'wb') as f:
            f.write(response.body)

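The output looks fine, but it is missing the part of the page that is loaded via AJAX after a second or two, which is the content I actually need. The strange thing is that if I open the Splash web interface inside the container, enter the same URL, and click the Render button, the response is correct. So the only question is: why is the site not rendered correctly when Splash is called from the Python script?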

Tags: python, scrapy, scrapy-splash, splash-js-render

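Solution

I had already tried adrihanu's suggestion, but it did not work. After a while I wondered what was going on and whether I could run the same script that the Splash UI runs. It turns out that a Lua script can be passed as an argument, and that worked!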

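    # Class attribute on the spider: a Lua script like the one the Splash UI runs.
    script1 = """
        function main(splash, args)
            assert(splash:go(args.url))
            -- args.wait is the 'wait' value passed from SplashRequest below
            assert(splash:wait(args.wait))
            return {
                html = splash:html(),
                png = splash:png(),
                har = splash:har(),
            }
        end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='execute',  # run the Lua script instead of render.html
                                args={
                                    'lua_source': self.script1,
                                    'wait': 0.5,
                                },
                                )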
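
Because the script returns a table, the `execute` endpoint answers with JSON rather than a bare HTML page, and the binary PNG arrives base64-encoded. The helper below is a minimal sketch of how that decoded JSON could be split into its parts. The name `unpack_splash_result` is hypothetical and not part of scrapy-splash, and it assumes the keys `html` and `png` used in the script above.

```python
import base64


def unpack_splash_result(data):
    """Split the decoded JSON result of a Splash Lua script.

    `data` is assumed to be a dict with the optional keys 'html' and 'png',
    matching the table returned by the Lua script above.
    """
    html = data.get("html", "")
    png_b64 = data.get("png")
    # Binary values are base64-encoded when Splash serializes the table to JSON.
    png_bytes = base64.b64decode(png_b64) if png_b64 else None
    return html, png_bytes
```

In a spider callback this could be used as `html, png = unpack_splash_result(response.data)`, assuming the response is the JSON response object that scrapy-splash produces for the `execute` endpoint.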

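A fixed wait of 0.5 seconds may still be too short when the AJAX content takes a second or two to arrive, as described in the question. One possible alternative, which is not part of the original answer, is to poll until a given element appears. The sketch below assumes a Splash version that provides `splash:select`. The names `WAIT_FOR_ELEMENT_LUA`, `build_splash_args`, `selector`, `max_wait`, and `poll` are all hypothetical, and the CSS selector has to be adapted to the target page.

```python
# Lua script that polls for an element instead of sleeping for a fixed time.
# All values under `args` are supplied by the Python helper below.
WAIT_FOR_ELEMENT_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    local waited = 0
    -- splash:select returns nil while the element is not in the DOM yet
    while not splash:select(args.selector) and waited < args.max_wait do
        splash:wait(args.poll)
        waited = waited + args.poll
    end
    return {html = splash:html()}
end
"""


def build_splash_args(selector, max_wait=10.0, poll=0.25):
    """Build the `args` dict for a SplashRequest that uses endpoint='execute'."""
    return {
        "lua_source": WAIT_FOR_ELEMENT_LUA,
        "selector": selector,   # CSS selector of the late-loading element
        "max_wait": max_wait,   # give up after this many seconds
        "poll": poll,           # seconds between checks
    }
```

The result would be passed as `args=build_splash_args("#content")` together with `endpoint='execute'`. Keeping `max_wait` below the Splash request timeout avoids having the whole render aborted before the script returns.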