Perform a JOIN operation efficiently in real time on two data sources produced by Elasticsearch?

Problem description

We have two very large, flat Elasticsearch indices: variant and genotype. Each document in the genotype index has a variantId field that links it to a document in the variant index. We need a regular JOIN: for each variant document in a list, fetch all matching genotype documents as a list. Elasticsearch cannot do this efficiently. So we are wondering whether such a JOIN can be performed outside ES by a separately run tool or engine (Kafka, Spark; we are not sure which). We could issue two ES queries and feed the results into the engine, which would output the joined result. What would be the best tool for this? Could anyone point to resources, or explain how this particular problem could be addressed?

Previously we tried storing the genotypes as a nested field in the variant documents, but because there are too many of both and there is a hard limit on the number of nested fields, we had to abandon that idea.
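
To make the desired result concrete, here is a minimal sketch of the join done on the client side. It assumes the two result sets have already been fetched from ES into plain Python lists; the `id`, `gene`, `sampleId` and `call` fields and both function names are hypothetical, and only `variantId` comes from the description above. The second query could be restricted to the variants of interest with a `terms` query on `variantId`, which is what `terms_query` builds.

```python
from collections import defaultdict


def terms_query(variant_ids):
    """Build a query body selecting genotype docs for the given variant ids."""
    return {"query": {"terms": {"variantId": list(variant_ids)}}}


def join_variants_with_genotypes(variants, genotypes):
    """Hash join: attach to each variant the list of its genotype docs."""
    # Build phase: group genotype documents by the variant they refer to.
    by_variant = defaultdict(list)
    for genotype in genotypes:
        by_variant[genotype["variantId"]].append(genotype)
    # Probe phase: one lookup per variant; variants without genotypes get [].
    return [
        {**variant, "genotypes": by_variant.get(variant["id"], [])}
        for variant in variants
    ]


# Stand-ins for the hits returned by the two ES queries (made-up data).
variants = [
    {"id": "v1", "gene": "BRCA1"},
    {"id": "v2", "gene": "TP53"},
]
genotypes = [
    {"variantId": "v1", "sampleId": "s1", "call": "0/1"},
    {"variantId": "v1", "sampleId": "s2", "call": "1/1"},
]

genotype_query = terms_query(v["id"] for v in variants)
joined = join_variants_with_genotypes(variants, genotypes)
```

This keeps the whole genotype result set in memory, so it only fits when the list of variants per request is small. An engine such as Spark would apply the same build-and-probe idea across partitions.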

Tags: elasticsearch, inner-join

Solution

