How to convert a spaCy NER dataset format to Flair format?

Question

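I labeled a dataset with dataturks to train spaCy's NER, and everything works fine. However, I just realized that Flair uses a different format, and I'm wondering whether there is a way to convert my spaCy NER JSON dataset to Flair's format: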

George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

whereas spaCy's format looks like this:

[("George Washington went to Washington",
{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]})]

Tags: python, nlp, spacy, named-entity-recognition, flair

Answer


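Flair expects one token per line followed by its tag, with an empty line between sentences. spaCy's biluo_tags_from_offsets converts character offsets into per-token tags in the BILUO scheme, so you can use it to write that file: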

import spacy
# spaCy v2 import; in spaCy v3 this helper is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

ents = [("George Washington went to Washington",{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]}),
         ("Uber blew through $1 million a week", {'entities':[(0, 4, 'ORG')]}),
       ]

with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        for word,tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        f.write("\n")

Output:

George U-PER
Washington U-PER
went O
to O
Washington U-LOC

Uber U-ORG
blew O
through O
$ O
1 O
million O
a O
week O
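
The sample in the question uses B-/I- prefixes rather than BILUO's U-/L- prefixes. If that style is what you need, the tags can be rewritten before they are written to the file. The following is a minimal sketch, not part of spaCy or Flair; `biluo_to_bio` is a hypothetical helper name:

```python
def biluo_to_bio(tags):
    """Convert BILUO tags to BIO: U- becomes B-, L- becomes I-."""
    converted = []
    for tag in tags:
        if tag.startswith("U-"):
            converted.append("B-" + tag[2:])
        elif tag.startswith("L-"):
            converted.append("I-" + tag[2:])
        else:
            # B-, I- and O tags are already valid BIO and pass through unchanged
            converted.append(tag)
    return converted

print(biluo_to_bio(["U-PER", "U-PER", "O", "O", "U-LOC"]))
# ['B-PER', 'B-PER', 'O', 'O', 'B-LOC']
```

To get Washington tagged I-PER, as in the question's sample, the name would have to be annotated as a single span (0, 17, 'PER'). Two separate spans convert to two B-PER tags.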

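Note that this seems to be sufficient if you only train NER. If you also want to add POS tags, you'll need to create a mapping from the Universal POS tags to Flair's simplified scheme. For example: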

tag_mapping = {'PROPN':'N','VERB':'V','ADP':'P','NOUN':'N'} # create your own
with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        try:
            for word,tag in zip(doc, biluo):
                # Raises KeyError for any POS tag missing from tag_mapping
                f.write(f"{word} {tag_mapping[word.pos_]} {tag}\n")
                # Alternative that falls back to 'None' instead of raising:
                # f.write(f"{word} {tag_mapping.get(word.pos_,'None')} {tag}\n")
        except KeyError:
            print(f"'{word.pos_}' tag is not defined in tag_mapping")
        f.write("\n")

Output:

'SYM' tag is not defined in tag_mapping
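
The commented-out line in the snippet hints at a variant that never raises: look up each POS tag with a fallback value. Here is a minimal sketch of that idea, assuming the tokens are already available as (text, Universal POS, NER tag) triples so it runs without a spaCy model. The helper name `write_columns` and the fallback value "X" are illustrative assumptions, not part of spaCy or Flair:

```python
import io

tag_mapping = {"PROPN": "N", "VERB": "V", "ADP": "P", "NOUN": "N"}  # create your own

def write_columns(sentences, out, fallback="X"):
    """Write 'token POS NER' lines, with a blank line after each sentence."""
    for sentence in sentences:
        for text, pos, ner in sentence:
            # dict.get avoids the KeyError for POS tags missing from tag_mapping
            out.write(f"{text} {tag_mapping.get(pos, fallback)} {ner}\n")
        out.write("\n")

# Hand-written triples standing in for spaCy tokens (assumed input shape)
sentences = [
    [("Uber", "PROPN", "U-ORG"), ("blew", "VERB", "O"), ("$", "SYM", "O")],
]
buffer = io.StringIO()
write_columns(sentences, buffer)
print(buffer.getvalue())
```

With this approach every sentence is written in full, whereas the try/except version stops writing a sentence at the first unmapped tag.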
