How do I ingest an entire XML database into Elasticsearch?

Problem Description

Suppose I have 20 XML files that make up an entire database. Is it possible to ingest all 20 of these XML files into Elasticsearch? If so, what options are available?

Tags: elasticsearch

Solution


For Python 3, I suggest using xmltodict:

pip install xmltodict elasticsearch

I assume the XML files contain records like this:

<records>
    <record>...</record>
    ...
    <record>...</record>
</records>

So they have to be split into individual records.
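For reference, this is roughly the dictionary xmltodict builds for the structure above (the <id> and <name> fields here are just made-up placeholders):

import xmltodict

doc = xmltodict.parse("""
<records>
    <record><id>1</id><name>alice</name></record>
    <record><id>2</id><name>bob</name></record>
</records>
""")

# doc["records"]["record"] is a list of dicts, one per <record>, with all values as strings:
# [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
# (with only a single <record>, xmltodict returns one dict instead of a list)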

Create a script named "load.py" with the following content:

import sys
import json
import xmltodict
from elasticsearch import Elasticsearch

INDEX = "xmlfiles"
TYPE = "record"

def xml_to_actions(xmlcontent):
    # xmltodict nests the repeated <record> elements under ["records"]["record"];
    # a single <record> parses to a dict rather than a list, so normalize first
    records = xmlcontent["records"]["record"]
    if not isinstance(records, list):
        records = [records]
    for record in records:
        # bulk format: one action metadata line followed by one document source line
        yield '{ "index" : { "_index" : "%s", "_type" : "%s" }}' % (INDEX, TYPE)
        yield json.dumps(record, default=str)

e = Elasticsearch()  # no args, connect to localhost:9200
if not e.indices.exists(index=INDEX):
    raise RuntimeError('index does not exist, use `curl -X PUT "localhost:9200/%s"` and try again' % INDEX)

for f in sys.argv[1:]:  # skip sys.argv[0], the script name itself
    with open(f, "rt") as fin:
        # join the action/source lines into a newline-delimited bulk body
        r = e.bulk(body="\n".join(xml_to_actions(xmltodict.parse(fin.read()))))  # bulk() returns a dict
        print(f, not r["errors"])

Run it with: python load.py xml1.xml xml2.xml ... xml20.xml
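The script refuses to run against a missing index (see the RuntimeError above), so if you have not created it yet, one minimal way is the curl command from that error message, followed by the loader:

curl -X PUT "localhost:9200/xmlfiles"
python load.py xml1.xml xml2.xml ... xml20.xml

This creates the index with default settings on the local node the script assumes; define explicit mappings first if the records need specific field types.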

