Does Stormcrawler follow secondary JavaScript page content loads?

Question

Judging from my scraped results for webmd.com, it seems it doesn't, and I guess it's too much to expect that it would, since that would be very complicated. But I figured I'd ask anyway to double-check.

So, if I have a page that uses JavaScript to load its body after the initial page load, does Stormcrawler have any method by which it will wait for this secondary content to load and then scrape the page?

I imagine no crawler does this except very high-level, complicated crawlers like what Google or Bing might use - or maybe even they don't, since it would require browser-level intelligence and complexity. The thought of how you'd even implement behavior like that is daunting.

Tags: web-crawler, nutch, stormcrawler

Solution


StormCrawler has a Selenium-based protocol implementation which delegates the navigation to a browser. There is a tutorial on our blog explaining how to use it. I tend to use ChromeDriver and test with Chrome in visual mode for debugging, then switch to headless in production. Basically, you let the browser deal with the dynamic content. You can even implement navigation actions, e.g. clicking a button or filling in a form. This is useful for crawling specific sites, but the performance is probably not great for a general crawl.
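To illustrate what "letting the browser deal with the dynamic content" means in practice, here is a minimal sketch using the plain Selenium Java API directly (not StormCrawler's own classes): it starts a headless Chrome via ChromeDriver, loads a page, waits for a dynamically injected element to appear, and then reads the fully rendered HTML. The URL and the CSS selector are placeholders you would adapt to the site being crawled.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;

public class DynamicPageFetch {

    public static void main(String[] args) {
        // Run Chrome headless in production; drop this argument to watch
        // the browser in visual mode while testing and debugging.
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");

        WebDriver driver = new ChromeDriver(options);
        try {
            // Placeholder URL: the page loads its body via JavaScript.
            driver.get("https://www.example.com/article");

            // Wait until the JavaScript-loaded content is present in the DOM
            // before reading the page source. The selector is hypothetical.
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("div.article-body")));

            // The rendered DOM, including the secondary content, is now available.
            String renderedHtml = driver.getPageSource();
            System.out.println(renderedHtml.length() + " characters of rendered HTML");
        } finally {
            driver.quit();
        }
    }
}
```

Within StormCrawler itself you don't write this loop by hand: the Selenium-based protocol handles the browser session, and the blog tutorial covers how to wire it up in the crawler's configuration. The sketch above is only meant to show the waiting-and-rendering step that a browser-backed fetch adds compared to a plain HTTP fetch.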

