首页 > 解决方案 > Elasticsearch 无法使用 Java API 查询获取超过 10 个文档

问题描述

正在从一个名为documentsfrom that filepath 的索引读取文件路径,并读取文件并将这些文件内容索引到另一个名为documents_attachmentusing java code 的索引中。

因此,在第一个过程中,一次无法获取多个10记录,它只提供索引中10的记录。我的索引中document有多个100000记录。doucment

我如何一次获取所有100000记录。

我已经尝试过使用searchSourceBuilder.size(10000);它的索引直到10K记录不超过这个,并且这种方法不允许我提供超过10000大小的内容。

请在下面找到我正在使用的 java 代码。

public class DocumentIndex {

private final static String INDEX = "documents";  
private final static String ATTACHMENT = "document_attachment"; 
private final static String TYPE = "doc";
private static final Logger logger = Logger.getLogger(Thread.currentThread().getStackTrace()[0].getClassName());

public static void main(String args[]) throws IOException {


    RestHighLevelClient restHighLevelClient = null;
    Document doc=new Document();

    logger.info("Started Indexing the Document.....");

    try {
        restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }


    //Fetching Id, FilePath & FileName from Document Index. 
    SearchRequest searchRequest = new SearchRequest(INDEX); 
    searchRequest.types(TYPE);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    QueryBuilder qb = QueryBuilders.matchAllQuery();
    searchSourceBuilder.query(qb);
    //searchSourceBuilder.size(10000); 
    searchRequest.source(searchSourceBuilder);
    SearchResponse searchResponse = null;
    try {
         searchResponse = restHighLevelClient.search(searchRequest);
    } catch (IOException e) {
        e.getLocalizedMessage();
    }

    SearchHit[] searchHits = searchResponse.getHits().getHits();
    long totalHits=searchResponse.getHits().totalHits;
    logger.info("Total Hits --->"+totalHits);


    File all_files_path = new File("d:\\All_Files_Path.txt");
    File available_files = new File("d:\\Available_Files.txt");
    File missing_files = new File("d:\\Missing_Files.txt");
    all_files_path.deleteOnExit();
    available_files.deleteOnExit();
    missing_files.deleteOnExit();
    all_files_path.createNewFile();
    available_files.createNewFile();
    missing_files.createNewFile();

    int totalFilePath=1;
    int totalAvailableFile=1;
    int missingFilecount=1;

    Map<String, Object> jsonMap ;
    for (SearchHit hit : searchHits) {

        String encodedfile = null;
        File file=null;

        Map<String, Object> sourceAsMap = hit.getSourceAsMap();


        if(sourceAsMap != null) {  
            doc.setId((int) sourceAsMap.get("id"));
            doc.setApp_language(String.valueOf(sourceAsMap.get("app_language")));
        }

        String filepath=doc.getPath().concat(doc.getFilename());



        try(PrintWriter out = new PrintWriter(new FileOutputStream(all_files_path, true))  ){
            out.println("FilePath Count ---"+totalFilePath+":::::::ID---> "+doc.getId()+"File Path --->"+filepath);
        }

        file = new File(filepath);
        if(file.exists() && !file.isDirectory()) {
            try {
                  try(PrintWriter out = new PrintWriter(new FileOutputStream(available_files, true))  ){
                        out.println("Available File Count --->"+totalAvailableFile+":::::::ID---> "+doc.getId()+"File Path --->"+filepath);
                        totalAvailableFile++;
                    }
                FileInputStream fileInputStreamReader = new FileInputStream(file);
                byte[] bytes = new byte[(int) file.length()];
                fileInputStreamReader.read(bytes);
                encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
                fileInputStreamReader.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
        else
        {
            PrintWriter out = new PrintWriter(new FileOutputStream(missing_files, true));
            out.close();
            missingFilecount++;
        }

        jsonMap = new HashMap<>();
        jsonMap.put("id", doc.getId());
        jsonMap.put("app_language", doc.getApp_language());
        jsonMap.put("fileContent", encodedfile);

        String id=Long.toString(doc.getId());

        IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
                .source(jsonMap)
                .setPipeline(ATTACHMENT);

        PrintStream printStream = new PrintStream(new File("d:\\exception.txt"));
        try {
            IndexResponse response = restHighLevelClient.index(request);

        } catch(ElasticsearchException e) {
            if (e.status() == RestStatus.CONFLICT) {
            }
            e.printStackTrace(printStream);
        }

        totalFilePath++;


    }

    logger.info("Indexing done.....");
}

}

标签: javaelasticsearchelastic-stack

解决方案


如果您有足够的内存,请将索引设置index.max_result_window从 10000 增加到您需要的数字。

请参阅https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings

但是请注意,这不会无限扩展。搜索请求占用堆内存和时间与 from + size 成正比。此设置用于限制该内存,如果将其设置得太高,您将耗尽内存。

最简单的设置方法是通过 REST API:

PUT /my-index/_settings
{
    "index" : {
        "max_result_window" : 150000
    }
}

推荐阅读