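java - How to use HtmlUnit + jsoup to scrape a website that loads its content dynamically with JavaScript
Problem description
https://www.reddit.com/r/buildapcsales/top/ takes about 3 seconds to load all of its content. With jsoup alone I can only scrape the first 7 threads, because the remaining threads are loaded a few seconds later. I am trying to have HtmlUnit load the entire page and then use jsoup to scrape all of the thread titles.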
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
Page page = webClient.getPage(url.toString());
WebResponse response = page.getWebResponse();
String content = response.getContentAsString();
// webClient.getOptions().setJavaScriptEnabled(true);
// webClient.getOptions().setThrowExceptionOnScriptError(true);
// webClient.waitForBackgroundJavaScript(50000);
// webClient.wait(5000);
// HtmlPage page = webClient.getPage(url.toString());
Whenever I set JavaScriptEnabled to true, I get a flood of errors. If I set it to false, there are no errors, but jsoup still only finds 7 threads.
WARNING: Script is not JavaScript (type: 'application/json', language: ''). Skipping execution.
Feb 09, 2020 4:54:36 PM com.gargoylesoftware.htmlunit.javascript.DefaultJavaScriptErrorListener scriptException
SEVERE: Error during JavaScript execution
======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.EvaluatorException]
com.gargoylesoftware.htmlunit.ScriptException: syntax error (https://www.redditstatic.com/desktop2x/vendors~Governance~Reddit.791bf381e13bfdc452ab.js#1)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:882)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:624)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:537)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:354)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:713)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:679)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:103)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1104)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:984)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:361)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:234)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:301)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:560)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:419)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:336)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:488)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:469)
    at RedditScraper.main(RedditScraper.java:40)
These are just some of the first few errors.
Solution
I tried to do this with HtmlUnit. Then I tried Selenium, and it worked like a charm.
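For anyone who wants a starting point, here is a minimal sketch of the Selenium + jsoup combination. It is not the exact code I ran. It assumes Selenium 4 and jsoup are on the classpath, that a Chrome driver can be found on the machine, and that thread titles are rendered as h3 elements (that selector is a guess about Reddit's markup, so check it against the live page). The class name RedditTitleScraper and the helpers extractTitles and renderPage are made up for illustration.

```java
import java.time.Duration;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RedditTitleScraper {

    // Assumption: thread titles are <h3> elements. Adjust to the real markup.
    static final String TITLE_SELECTOR = "h3";

    /** Parses already-rendered HTML with jsoup and returns the text of each match. */
    static List<String> extractTitles(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssSelector).eachText();
    }

    /**
     * Opens the URL in Chrome, waits until more than minCount elements match
     * the selector (or the timeout expires), and returns the rendered DOM as HTML.
     */
    static String renderPage(String url, String cssSelector, int minCount, Duration timeout) {
        ChromeOptions options = new ChromeOptions();
        // Assumption: a recent Chrome that understands the new headless mode.
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            new WebDriverWait(driver, timeout).until(
                    ExpectedConditions.numberOfElementsToBeMoreThan(
                            By.cssSelector(cssSelector), minCount));
            return driver.getPageSource();
        } finally {
            driver.quit(); // always close the browser, even if the wait times out
        }
    }

    public static void main(String[] args) {
        // Wait for more than the 7 threads that are present before the scripts run.
        String html = renderPage("https://www.reddit.com/r/buildapcsales/top/",
                TITLE_SELECTOR, 7, Duration.ofSeconds(15));
        extractTitles(html, TITLE_SELECTOR).forEach(System.out::println);
    }
}
```

The browser executes the page's scripts itself, so jsoup only has to parse the finished DOM returned by getPageSource(). Waiting on an element count rather than sleeping for a fixed time means the scraper continues as soon as the extra threads appear.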