首页 > 解决方案 > Stanza 没有像预期的那样标记句子;我可以使用换行符作为启发式吗?

问题描述

我正在尝试为单词对齐任务预处理我的文本数据。

我有一个句子的文本文件。每个句子都换行:

a man in an orange hat starring at something . 
a boston terrier is running on lush green grass in front of a white fence . 
a girl in karate uniform breaking a stick with a front kick . 
five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background . 
people are fixing the roof of a house . 
a man in light colored clothing photographs a group of men wearing dark suits and hats standing around a woman dressed in a strapless gown .

我正在使用 Stanza 来标记句子:

! pip install stanza
import stanza
stanza.download("en", model_dir="/content/drive/MyDrive/Internship/")
nlp_en = stanza.Pipeline("en", dir= "/content/drive/MyDrive/Internship/", processors = "tokenize, pos, lemma, mwt")

with open("/content/drive/MyDrive/Internship/EN_sample.txt", "r", encoding='utf8') as english:
  english = english.read()

doc_en = nlp_en(english)

en_token = []
for i, sentence in enumerate(doc_en.sentences):
    list_of_tokens = [sent.text for sent in sentence.tokens]
    en_token.append(list_of_tokens)

我的预期输出是:

[["a", "man", "in", "an", "orange", "hat", "starring", "at", "something", "."],  ["a", "boston", "terrier", "is", "running", "on", "lush", "green", "grass", "in", "front", "of", "a", "white", "fence", "."],  
["a", "girl", "in", "karate", "uniform", "breaking", "a", "stick", "with", "a", "front", "kick", "."], 
["five", "people", "wearing", "winter", "jackets", "and", "helmets", "stand", "in", "the", "snow", ",", "with", "snowmobiles", "in", "the", "background", "."], 
["people", "are", "fixing", "the", "roof", "of", "a", "house", "."], 
["a", "man", "in", "light", "colored", "clothing", "photographs", "a", "group", "of", "men", "wearing", "dark", "suits", "and", "hats", "standing", "around", "a", "woman", "dressed", "in", "a", "strapless", "gown", "."]] 

本质上,一个列表列表,每个句子都在自己的列表中,并且它的单词被标记化。

但是,我得到的输出是这样的:

[["a", "man", "in", "an", "orange", "hat", "starring", "at", "something", "."],  ["a", "boston", "terrier", "is", "running", "on", "lush", "green", "grass", "in", "front", "of", "a", "white", "fence", "."],  
["a", "girl", "in", "karate", "uniform", "breaking", "a", "stick", "with", "a", "front", "kick", ".", "five", "people", "wearing", "winter", "jackets", "and", "helmets", "stand", "in", "the", "snow", ",", "with", "snowmobiles", "in", "the", "background", ".", "people", "are", "fixing", "the", "roof", "of", "a", "house", "."], 
["a", "man", "in", "light", "colored", "clothing", "photographs", "a", "group", "of", "men", "wearing", "dark", "suits", "and", "hats", "standing", "around", "a", "woman", "dressed", "in", "a", "strapless", "gown", "."]]

在某些情况下,Stanza 似乎忽略了句子边界。

有人知道如何解决这个问题吗?

由于每个句子都以换行符开头,是否可以简单地在每个换行符处强制一个新列表,然后执行单词标记?如果是,我会怎么做?

提前感谢您的任何帮助和建议。

标签: pythonnlptokenizestanza

解决方案


推荐阅读