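lucene - How do I tell whether field content is stored in Compass (Lucene)?
Question
I am trying to find out whether a legacy application that builds a Compass 2.2 index stores the contents of its fields. I can open the index with luke.net and, as far as I can tell, it does not store the fields: it only returns an id, presumably used elsewhere to select from a database.
I have seen this Lucene question: Lucene Field.Store.YES vs Field.Store.NO
How can I tell whether this Compass application builds its index with the equivalent of lucene.net's Field.Store.NO? This is the compass.cfg.xml: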
<compass-core-config
    xmlns="http://www.opensymphony.com/compass/schema/core-config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.opensymphony.com/compass/schema/core-config
        http://www.opensymphony.com/compass/schema/compass-core-config.xsd">
  <compass name="default">
    <connection>
      <!-- index path from a file dataUpdate.properties -->
      <file path="/" />
    </connection>
    <searchEngine>
      <analyzer name="default" type="CustomAnalyzer" analyzerClass="myclass.beans.search.PerFieldAnalyzer" >
        <!-- example :
        <setting name="PerField-fieldname" value="org.apache.lucene.analysis.standard.StandardAnalyzer" />
        <setting name="PerFieldConfig-stopwords-fieldname" value="no:" />
        <setting name="PerFieldConfig-stopwords-fieldname" value="yes:aa,bb" />
        -->
        <setting name="PerField-symbol" value="org.apache.lucene.analysis.standard.StandardAnalyzer" />
        <setting name="PerFieldConfig-stopwords-symbol" value="no:" />
        <setting name="PerField-isin" value="org.apache.lucene.analysis.standard.StandardAnalyzer" />
        <setting name="PerFieldConfig-stopwords-isin" value="no:" />
        <setting name="PerField-tipo_opzione" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-settore_cod" value="org.apache.lucene.analysis.KeywordAnalyzer" />
        <setting name="PerField-trend_medio" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-trend_breve" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-trend_lungo" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-tipo_sts_cod" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-valuta" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-sottotipo_tit" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-tabella_rt" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-market" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-cod_segmento" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-tipo_tit" value="org.apache.lucene.analysis.KeywordAnalyzer"/>
        <setting name="PerField-radiocor" value="org.apache.lucene.analysis.standard.StandardAnalyzer" />
        <setting name="PerFieldConfig-stopwords-radiocor" value="no:" />
      </analyzer>
    </searchEngine>
    <mappings>
      <class name="myclass.tserver.beans.search.SearchIndex" />
    </mappings>
    <settings>
      <setting name="compass.transaction.lockTimeout" value="180" />
    </settings>
  </compass>
</compass-core-config>
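Does value="no:" mean that the value is not stored, or that it is not treated as a "stop word"? And does, for example, value="org.apache.lucene.analysis.standard.StandardAnalyzer" mean that it is stored?
This is the analyzer it appears to use: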
package myclass.tserver.beans.search;

import myclass.tserver.ejb.StubWrapper;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.Arrays;
import java.util.Collections;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.compass.core.CompassException;
import org.compass.core.config.CompassConfigurable;
import org.compass.core.config.CompassSettings;

public class PerFieldAnalyzer extends PerFieldAnalyzerWrapper implements CompassConfigurable {
    private static final String FIELD_PREFIX = "PerField-";
    private static final String FIELD_CONFIG_PREFIX = "PerFieldConfig-";
    private static final String STOP_WORDS_PREFIX = "stopwords-";
    private static final String NO_STOP_WORDS_PREFIX = "no-stopwords-";

    public PerFieldAnalyzer() {
        super(new StandardAnalyzer());
    }

    public void configure(CompassSettings settings) throws CompassException {
        for (Object obj : settings.getProperties().keySet()) {
            if (obj != null && obj instanceof String && ((String) obj).startsWith(FIELD_PREFIX)) {
                String field = ((String) obj).substring(FIELD_PREFIX.length());
                String value = settings.getSetting((String) obj);
                if (value != null) {
                    String stopwordsParameter = settings.getSetting(FIELD_CONFIG_PREFIX + STOP_WORDS_PREFIX + field);
                    String[] stopwords = null;
                    if (stopwordsParameter != null) {
                        if (stopwordsParameter.trim().toLowerCase().startsWith("no:"))
                            // no stopwords
                            stopwords = new String[] {};
                        else if (stopwordsParameter.trim().toLowerCase().startsWith("yes:"))
                            // stopwords
                            stopwords = stopwordsParameter.trim().substring(4).split(",");
                    } else
                        // default stopwords of the StandardAnalyzer
                        stopwords = null;
                    try {
                        Analyzer analyzer = getAnalyzer(value, stopwords);
                        addAnalyzer(field, analyzer);
                    } catch (Exception e) {
                        new CompassException("Unable to set analyzer for field " + field + " : ", e);
                    }
                }
            }
        }
    }

    private Analyzer getAnalyzer(String classname, String[] stopwords) throws ClassNotFoundException, SecurityException,
            NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException,
            InvocationTargetException {
        Class<Analyzer> myclass = (Class<Analyzer>) Class.forName(classname);
        if (stopwords == null) {
            Constructor<Analyzer> myConstructor = myclass.getConstructor();
            return (Analyzer) myConstructor.newInstance();
        } else {
            Constructor<Analyzer> myConstructor = myclass.getConstructor(String[].class);
            return (Analyzer) myConstructor.newInstance((Object)stopwords);
        }
    }
}
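Answer
The simplest way to find out which fields are stored for a Lucene document is to open the index with Lucene, read in a document, and look at the document's field list. Fields that were indexed but not stored do not appear in that list.
Here is an example I wrote for you in Lucene.Net 4.8, which should give you a good idea of how to check which fields are stored for a document. If you are using Java rather than C#, your syntax will of course differ somewhat, and you will be using an older version of Lucene. But this code should get you most of the way there.
In this example two documents are added, each with three fields. Only two of the three fields are stored, even though all three are indexed. I have added a comment in the code at the point where you can see which fields are stored for each document. In this example, each document will have only two fields in its d.Fields list, because only two fields are stored.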
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;
using Xunit;

[Fact]
public void StoreFieldsList() {
    Directory indexDir = new RAMDirectory();
    Analyzer standardAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
    IndexWriterConfig indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, standardAnalyzer);
    IndexWriter writer = new IndexWriter(indexDir, indexConfig);

    // First document: all three fields are indexed, only the first two are stored.
    Document doc = new Document();
    doc.Add(new StringField("examplePrimaryKey", "001", Field.Store.YES));
    doc.Add(new TextField("exampleField", "Unique gifts are great gifts.", Field.Store.YES));
    doc.Add(new TextField("notStoredField", "Some text to index only.", Field.Store.NO));
    writer.AddDocument(doc);

    // Second document, same field layout.
    doc = new Document();
    doc.Add(new StringField("examplePrimaryKey", "002", Field.Store.YES));
    doc.Add(new TextField("exampleField", "Everyone is gifted.", Field.Store.YES));
    doc.Add(new TextField("notStoredField", "Some text to index only. Two.", Field.Store.NO));
    writer.AddDocument(doc);
    writer.Commit();

    // Read every document back and walk its field list.
    DirectoryReader reader = writer.GetReader(applyAllDeletes:true);
    for (int i = 0; i < reader.NumDocs; i++) {
        Document d = reader.Document(i);
        for (int j = 0; j < d.Fields.Count; j++) {
            IIndexableField field = d.Fields[j];
            string fieldName = field.Name; //<--This field is a stored field for this document.
        }
    }
}
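As for the value="no:" settings in the question: judging only from the PerFieldAnalyzer code posted there, they appear to control stop words and have nothing to do with storage. Whether a field is stored would be decided elsewhere, presumably in the mapping of the SearchIndex class, which is not shown. The following is a minimal Java sketch of just that branch logic, pulled out so it can be run without Compass or Lucene. The class name StopwordsSetting and the method parse are hypothetical names chosen for illustration and are not part of the original application.

```java
import java.util.Arrays;

public class StopwordsSetting {

    // Mirrors the stop-word branch of PerFieldAnalyzer.configure from the question.
    // A null result means "let the analyzer use its own default stop words".
    static String[] parse(String setting) {
        if (setting == null) {
            // No PerFieldConfig-stopwords-<field> entry at all.
            return null;
        }
        String trimmed = setting.trim();
        String lower = trimmed.toLowerCase();
        if (lower.startsWith("no:")) {
            // Empty list: no word is filtered out as a stop word.
            return new String[] {};
        }
        if (lower.startsWith("yes:")) {
            // Explicit comma-separated stop-word list after the "yes:" prefix.
            return trimmed.substring(4).split(",");
        }
        // Unrecognized prefix: the original code leaves the array null as well.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("no:")));
        System.out.println(Arrays.toString(parse("yes:aa,bb")));
        System.out.println(Arrays.toString(parse(null)));
    }
}
```

In all three cases the result only picks the constructor used to build the analyzer for that field. None of them touches a store flag, so these settings cannot tell you whether a field's content is kept in the index.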