web-crawler - Does Stormcrawler follow secondary JavaScript page content loads?
Problem description
From my scraped results for webmd.com, it seems it does not, and I suspect it's too much to expect that it would, since that would be very complicated. But I figured I'd ask anyway to double-check.
So, if a page uses JavaScript to load its body after the initial page load, does StormCrawler have any way to wait for this secondary content to load and then scrape the page?
I imagine no crawler does this except very sophisticated ones like those Google or Bing might use, and maybe even they don't, since it would require browser-level intelligence and complexity. Just thinking about how you'd implement behavior like this is anxiety-inducing.
Solution
StormCrawler has a Selenium-based protocol implementation which delegates the navigation to a browser. There is a tutorial on our blog explaining how to use it. I tend to use ChromeDriver with Chrome in visual (non-headless) mode for testing and debugging, then switch to headless in production. Basically, you let the browser deal with the dynamic content. You can even implement navigation actions, e.g. clicking a button or filling in a form. This is useful for crawling specific sites, but the performance is probably not great for a general crawl.
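To make the "let the browser deal with the dynamic content" idea a bit more concrete, here is a minimal sketch of the underlying wait-then-scrape pattern: poll the rendered page until the secondary content shows up, or give up after a timeout. This is not StormCrawler's API or its Selenium protocol code. The class `RenderedContentWait`, the method `waitForContent`, and the fake browser in `main` are all hypothetical names made up for illustration. In a real browser-driven setup, the supplier would be backed by something like Selenium's `WebDriver.getPageSource()`.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper, not part of StormCrawler or Selenium.
class RenderedContentWait {

    /**
     * Polls pageSource until isReady accepts the returned HTML or the timeout elapses.
     * Returns the HTML once ready, or Optional.empty() on timeout.
     */
    static Optional<String> waitForContent(Supplier<String> pageSource,
                                           Predicate<String> isReady,
                                           Duration timeout,
                                           Duration pollInterval) throws InterruptedException {
        long start = System.nanoTime();
        while (true) {
            String html = pageSource.get();
            if (html != null && isReady.test(html)) {
                return Optional.of(html);
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return Optional.empty();
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated browser: the <article> body only appears on the third read,
        // standing in for content that JavaScript injects after the initial load.
        int[] reads = {0};
        Supplier<String> fakeBrowser = () -> ++reads[0] < 3
                ? "<html><body></body></html>"
                : "<html><body><article>text</article></body></html>";

        Optional<String> html = waitForContent(
                fakeBrowser,
                s -> s.contains("<article>"),
                Duration.ofSeconds(2),
                Duration.ofMillis(50));

        System.out.println(html.isPresent()
                ? "content loaded after " + reads[0] + " reads"
                : "timed out");
    }
}
```

The timeout is what makes this expensive at scale: every page can hold a browser instance for up to the full wait, which is one reason this approach suits targeted crawls better than a general crawl.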