java - Nutch - Parsing custom HTML elements

Problem description

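I am trying to crawl pages and index (with Solr) only specific parts of them.

So far, with the default configuration, I am crawling and indexing the pages I want, but in Solr I only get two fields, title and content. The content field holds the text of my page, but it is not exactly the text I want.

What I want is a new field that contains the content of a specific div: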

<div class="myDiv"> Content I want to index </div>
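For illustration, here is a minimal sketch of the extraction step itself, independent of any particular plugin. It assumes the page has already been parsed into a W3C DOM tree (Nutch's HTML parse filters are handed a DOM fragment, so the same traversal should carry over) and uses the JDK's XML parser as a stand-in, which only accepts well-formed markup. The class name `DivExtractor` and its methods are hypothetical, not part of Nutch or the Extractor Plugin.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every <div> carrying a given class.
public class DivExtractor {

    // Stand-in for the crawler's HTML parser; requires well-formed markup.
    public static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Returns the trimmed text of each matching div, in document order.
    public static List<String> extractByClass(Node root, String cssClass) {
        List<String> result = new ArrayList<>();
        collect(root, cssClass, result);
        return result;
    }

    private static void collect(Node node, String cssClass, List<String> out) {
        if (node instanceof Element) {
            Element el = (Element) node;
            // The class attribute may list several space-separated names.
            List<String> classes =
                    Arrays.asList(el.getAttribute("class").trim().split("\\s+"));
            if ("div".equalsIgnoreCase(el.getTagName()) && classes.contains(cssClass)) {
                out.add(el.getTextContent().trim());
                return; // nested text is already included, so stop descending
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), cssClass, out);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
                + "<div class=\"myDiv\"> Content I want to index </div>"
                + "<div class=\"other\">ignored</div>"
                + "</body></html>");
        System.out.println(extractByClass(doc, "myDiv"));
    }
}
```

In a real setup the extracted string would still have to be attached to the parse metadata and mapped to a Solr field; that wiring depends on the plugin and is not shown here.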

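What I have found so far is the Extractor Plugin, which seems to be what I want.

After following its instructions, I cannot parse the data: I get the following error, and I don't understand what is wrong.

I am using Nutch 1.15.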

java.lang.Exception: java.lang.LinkageError: loader constraint violation: when resolving method "org.slf4j.impl.StaticLoggerBinder.getLoggerFactory()Lorg/slf4j/ILoggerFactory;" the class loader (instance of org/apache/nutch/plugin/PluginClassLoader) of the current class, org/slf4j/LoggerFactory, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/slf4j/impl/StaticLoggerBinder, have different Class objects for the type org/slf4j/ILoggerFactory used in the signature
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)

Tags: java, solr, nutch

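Solution

It seems that the slf4j-api bundled with the plugin is an older version that does not match the one Nutch uses, so the same slf4j classes end up loaded by two different class loaders. At least, that is how I understand it.

To fix it, I simply removed this line from /plugins/extractor/plugin.xml:

<library name="slf4j-api-1.7.5.jar"/>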
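One way to check a diagnosis like this is to print which class loader and which jar a class actually came from. The sketch below uses only standard JDK calls; the class name `ClassOrigin` is hypothetical. Passing it `org.slf4j.LoggerFactory.class` from inside plugin code would be the relevant check here, but that is an assumption about how one would use it, not something the original answer did.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
public class ClassOrigin {

    // Describes the defining class loader and code location of the given class.
    public static String describe(Class<?> type) {
        ClassLoader loader = type.getClassLoader();
        // A null loader means the class came from the JVM's bootstrap loader.
        String loaderName = (loader == null) ? "bootstrap" : loader.toString();

        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "unknown"
                : source.getLocation().toString();

        return type.getName() + " loaded by " + loaderName + " from " + location;
    }

    public static void main(String[] args) {
        System.out.println(describe(String.class));
        System.out.println(describe(ClassOrigin.class));
    }
}
```

If the same class name shows up with two different loaders and two different jar locations, the JVM treats them as two distinct classes, which is the situation a loader constraint violation complains about.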
