stanford-nlp - Stanford CoreNLP NER output

Problem description

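I have been using grep and awk to extract named entities from the Stanford CRF-NER 'inline XML' output for English text, and I would like to use the same larger workflow for other human languages.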

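I have been running some trials with French (Spanish seems to throw a Java error for me, which is another story). Using

java -cp stanford-corenlp-4.0.0/stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models-french.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-french.properties -file french.txt -outputFormat text

I get the standard text output, in which each sentence carries every type of annotation, including multi-word entities correctly grouped together, like this: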

Extracted the following NER entity mentions:
Puget Sound LOC I-LOC:0.9822963367809222
lac Washington  LOC I-LOC:0.9908561818309122
Canada  LOC I-LOC:0.9804363858330243
États-Unis  LOC I-LOC:0.9973224740712531

I know I could parse this, but it seems like a lot of wasted processing when all I really want is a list of the entities in the whole file.
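For reference, pulling only the mention blocks out of that text output could look roughly like the sketch below. It is a minimal illustration, not part of CoreNLP: the function name `entity_mentions` is hypothetical, and it assumes that each mention line ends with a type column and a probability column (as in the sample above) and that a block ends at a blank line or at the next line starting with "Sentence #".

```python
HEADER = "Extracted the following NER entity mentions:"

def entity_mentions(lines):
    """Yield (text, type) pairs from CoreNLP '-outputFormat text' output.

    Assumption: every mention line looks like
    '<entity text> <TYPE> <TAG:probability>', separated by tabs or spaces.
    """
    in_block = False
    for raw in lines:
        line = raw.strip()
        if line == HEADER:
            in_block = True
            continue
        if not in_block:
            continue
        # Assumed end of a mention block: blank line or next sentence header.
        if not line or line.startswith("Sentence #"):
            in_block = False
            continue
        # Split from the right so multi-word entity text stays in one piece.
        parts = line.rsplit(None, 2)
        if len(parts) == 3:
            text, ent_type, _probabilities = parts
            yield text, ent_type
```

Called as `entity_mentions(open("french.txt.out", encoding="utf-8"))`, where the file name is only an example, it yields one pair per mention and ignores every other annotation in the file.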

I have also been able to get CoNLL-style output using

java -cp stanford-corenlp-4.0.0/stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models-french.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-french.properties -file french.txt -output.columns word,ner -outputFormat conll

which gives:

Puget   I-LOC
Sound   I-LOC
et  O
le  O
lac I-LOC
Washington  I-LOC
,   O
à   O
environ O
155 O
km  O
à   O
le  O
sud O
de  O
la  O
frontière   O
entre   O
le  O
Canada  I-LOC
et  O
les O
États-Unis  I-LOC
.   O

Besides being a bit messy, this breaks multi-word entities apart, which makes them impractical to stitch back together at scale.
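To make the stitching problem concrete, here is a minimal sketch of the obvious approach: merge consecutive tokens that carry the same non-O tag. The names `read_tokens` and `group_entities` are hypothetical, and the input is assumed to be the two-column word/ner output shown above. Because the tags carry no boundary information, two distinct entities of the same type that happen to be adjacent would be merged into one, so this is an approximation rather than a reliable reconstruction.

```python
from itertools import groupby

def read_tokens(lines):
    """Yield (word, label) pairs; anything that is not a two-column line
    (for example the blank line between sentences) counts as a break."""
    for raw in lines:
        fields = raw.split()
        if len(fields) == 2:
            word, ner = fields
            # Drop a prefix such as 'I-' so that 'I-LOC' becomes 'LOC'.
            label = "O" if ner == "O" else ner.split("-", 1)[-1]
            yield word, label
        else:
            yield None, "O"

def group_entities(lines):
    """Return (entity text, label) pairs for each run of identical labels."""
    entities = []
    for label, run in groupby(read_tokens(lines), key=lambda pair: pair[1]):
        if label != "O":
            entities.append((" ".join(word for word, _ in run), label))
    return entities
```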

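I would prefer to get inline XML (e.g. <LOCATION>Puget</LOCATION><LOCATION>Sound</LOCATION>), since I have already developed a workflow that uses it. If that is not possible, is there at least a way to get TSV output (like the conll output above) that groups multi-word entities together the way the text output does?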
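As an illustration of the kind of output I am after, the sketch below renders a list of (word, label) pairs as inline XML with each multi-word entity wrapped in a single element. The function name `to_inline_xml` is hypothetical and this is not a CoreNLP feature; it also uses the tag as the element name (LOC rather than LOCATION) and rejoins tokens with single spaces, so the original spacing around punctuation is not preserved.

```python
from itertools import groupby
from xml.sax.saxutils import escape

def to_inline_xml(tagged):
    """Render (word, label) pairs as text with inline entity elements.

    Assumption: label is 'O' for non-entity tokens and a bare type such
    as 'LOC' otherwise; consecutive tokens with one label form one entity.
    """
    pieces = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        # Escape &, < and > so the result stays well-formed.
        text = escape(" ".join(word for word, _ in run))
        pieces.append(text if label == "O" else f"<{label}>{text}</{label}>")
    return " ".join(pieces)
```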

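I have looked into the entity mentions annotator, but I could not figure it out, and if it requires training I would rather not use it. The grouping in the default text output is good enough for my needs.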

Tags: stanford-nlp, named-entity-recognition
