AppEngine full-text document index search using the stemming operator

Problem description

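I am evaluating AppEngine document index full-text search and have run into problems with the stemming operator "~". I created an index of several test documents, each of which has a title field. Some example values of that field are: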

"Houses Desks Tables"
"referer image vod event"
"events with cats and dogs and"
"names very interesting days"

I am using Java, and my indexing and query code looks like this:

Document doc = Document.newBuilder().setId(key)
    .addField(Field.newBuilder().setName("title").setText(title))
    .addField(Field.newBuilder().setName("type").setText(type))            
    .addField(Field.newBuilder().setName("username").setText(username))
    .build();
DocumentSearchIndexService.getInstance().indexDocument(indexName, doc);
IndexSpec indexSpec = IndexSpec.newBuilder().setName(indexName).build();
Index index = SearchServiceFactory.getSearchService().getIndex(indexSpec);
return index.search("title = ~"+searchText);

However, the results only ever match the exact singular or plural form that was indexed:

query cat, return nothing
query dog, return nothing
query name, return nothing
query house, return nothing

query cats, return "events with cats and dogs and"
query dogs, return "events with cats and dogs and"
query names, return "names very interesting days"
query houses, return "Houses Desks Tables"

So I am really lost: either I misunderstand how entries are matched and returned, or my query is constructed incorrectly.

Tags: java, google-app-engine, full-text-search

Solution


Note that stemming is not implemented if you are using the Java Development Server for Java 8 in the standard environment.

If you deploy your application on App Engine, use the Utils.java class (linked in the original answer) to index your documents correctly.
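Because stemming behaves differently on the local development server and on a deployed application, code that relies on stemmed matches may want to know where it is running. The following is only a minimal sketch: the class name RuntimeCheck and the method isDeployed are hypothetical, and it assumes the App Engine standard runtime sets the system property com.google.appengine.runtime.environment to "Production" when deployed and to "Development" on the local server.

```java
public final class RuntimeCheck {
  // Assumption: the App Engine standard runtime sets this system property
  // to "Production" when deployed and "Development" on the local dev server.
  static final String ENV_PROPERTY = "com.google.appengine.runtime.environment";

  private RuntimeCheck() {}

  /** Returns true only when the property reports a deployed (production) runtime. */
  public static boolean isDeployed() {
    return "Production".equals(System.getProperty(ENV_PROPERTY));
  }

  public static void main(String[] args) {
    if (isDeployed()) {
      System.out.println("Deployed: stemmed queries such as \"title = ~cat\" can be checked here.");
    } else {
      System.out.println("Not deployed: expect only exact matches for \"~\" queries.");
    }
  }
}
```

A caller could use isDeployed() to skip assertions about stemmed results when running locally, rather than treating the exact-match behaviour as a bug in the query.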

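I cloned the java-docs-samples repository for Google Cloud Platform, went to the appengine-java8/search folder, and modified the code of the SearchServlet.java class as follows so that it includes a query with the stemming operator "~":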

...
  @Override
  public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    PrintWriter out = resp.getWriter();
    Document doc =
        Document.newBuilder()
            .setId("theOnlyPiano")
            .addField(Field.newBuilder().setName("product").setText("cats and dogs"))
            .addField(Field.newBuilder().setName("maker").setText("Yamaha"))
            .addField(Field.newBuilder().setName("price").setNumber(4000))
            .build();
    try {
      Utils.indexADocument(SEARCH_INDEX, doc);
    } catch (InterruptedException e) {
      // ignore
    }
    // [START search_document]
    final int maxRetry = 3;
    int attempts = 0;
    int delay = 2;
    while (true) {
      try {
        String searchText = "cat";
        String queryString = "product = ~"+searchText;
        Results<ScoredDocument> results = getIndex().search(queryString);

        // Iterate over the documents in the results
        for (ScoredDocument document : results) {
          // handle results
          out.print("product: " + document.getOnlyField("product").getText());
          //out.println(", price: " + document.getOnlyField("price").getNumber());
        }
      } catch (SearchException e) {
        if (StatusCode.TRANSIENT_ERROR.equals(e.getOperationResult().getCode())
            && ++attempts < maxRetry) {
          // retry
          try {
            Thread.sleep(delay * 1000);
          } catch (InterruptedException e1) {
            // ignore
          }
          delay *= 2; // easy exponential backoff
          continue;
        } else {
          throw e;
        }
      }
      break;
    }
    // [END search_document]
    // We don't test the search result below, but we're fine if it runs without errors.
    out.println(" Search performed");
    Index index = getIndex();
    // [START simple_search_1]
    index.search("rose water");
    // [END simple_search_1]
    // [START simple_search_2]
    index.search("1776-07-04");
    // [END simple_search_2]
    // [START simple_search_3]
    // search for documents whose product matches the stem of "cat" and that cost less than $5000
    index.search("product = ~cat AND price < 5000");
    // [END simple_search_3]
  }
}

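With this I was able to verify that the stemming operator "~" works for plural forms (such as cats, dogs, and so on). Note, however, that as mentioned in the documentation, the stemming algorithm has its limitations.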
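The query strings in this thread are built by concatenating a field name, " = ~", and the search word. A small helper can keep that in one place. This is a minimal sketch, not part of the samples repository: the class StemQueries and the method stemQuery are hypothetical names, and it assumes the operator is meant to prefix a single word, as in every query shown here, so it rejects blank or multi-word input.

```java
public final class StemQueries {
  private StemQueries() {}

  /**
   * Builds a query of the form "<field> = ~<word>", the same shape as
   * "product = ~cat" in the servlet. Assumes a single search word.
   */
  public static String stemQuery(String field, String word) {
    if (field == null || field.trim().isEmpty()) {
      throw new IllegalArgumentException("field name must not be blank");
    }
    if (word == null) {
      throw new IllegalArgumentException("search word must not be null");
    }
    String trimmed = word.trim();
    // Reject empty input and anything containing whitespace (more than one word).
    if (trimmed.isEmpty() || trimmed.matches(".*\\s.*")) {
      throw new IllegalArgumentException("expected a single search word");
    }
    return field.trim() + " = ~" + trimmed;
  }

  public static void main(String[] args) {
    System.out.println(stemQuery("product", "cat"));
    System.out.println(stemQuery("title", " houses "));
  }
}
```

The returned string can be passed to index.search(...) in place of the inline concatenation. Whether a given word actually matches its variants still depends on the service's stemming algorithm and on running against a deployed application.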

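Note: if you want to replicate the steps I followed, do not forget to comment out the test section of the SearchServletTest.java class before deploying the application to App Engine with mvn appengine:deploy. The file should look like this: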

...
  @After
  public void tearDown() {
    helper.tearDown();
  }

  @Test
  public void doGet_successfulyInvoked() throws Exception {
  //  servletUnderTest.doGet(mockRequest, mockResponse);
  //  String content = responseWriter.toString();
  //  assertWithMessage("SearchServlet response").that(content).contains("maker: Yamaha");
  //  assertWithMessage("SearchServlet response").that(content).contains("price: 4000.0");
  }
}

