Is there a way to generalize the orths argument of spaCy's retokenizer.split?

Problem description

I am trying to fix wrongly merged Spanish words in a text file, and I am using spaCy's retokenizer.split; however, I would like to generalize the orths argument passed to retokenizer.split. I have the following code:

doc= nlp("the wordsare wronly merged and weneed split them") #example
words = ["wordsare"] # Example: words to be split
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start],1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths= [str(doc[start]),str(doc[end])]
    retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split=[token.text for token in doc]
print(token_split) 

But when I write the orths this way, orths = [str(doc[start]), str(doc[end])], instead of ["words", "are"], I get this error:

ValueError: [E117] The newly split tokens must match the text of the original token. New orths: wordsarewronly. Old text: wordsare.

I would like some help with generalizing this, because I want the code to fix not only the word wordsare but also weneed and any other such words that the file may contain.

Tags: python, split, nlp, spacy

Solution


What I would change in your example:

  1. Change words = ["wordsare"] to words = ["wordsare", "weneed"]. This is the list of wrongly merged words.

  2. Add a rule that maps each word in that list to its split: splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}

  3. Change orths = [str(doc[start]), str(doc[end])] to orths = splits[doc[start:end].text]. This is the list of pieces that replaces the matched token. Your original [str(doc[start]), str(doc[end])] does not make much sense: since end is exclusive, doc[end] is the token after the match, so the joined orths ("wordsare" + "wronly") no longer match the text of the original token, which is exactly what the E117 error complains about.

  4. Move the retokenizer.split call inside the for loop.

  5. Consider adding another dictionary for the attrs as well (see the sketch right after this list).
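
For point 5, a minimal sketch of what such a dictionary could look like; the word_attrs mapping and its POS/DEP values below are illustrative assumptions, not something given in the original answer:

word_attrs = {
    "wordsare": {"POS": ["NOUN", "AUX"], "DEP": ["nsubjpass", "auxpass"]},
    "weneed":   {"POS": ["PRON", "VERB"], "DEP": ["nsubj", "aux"]},
}
# Inside the retokenize loop you would then look the attributes up per match,
# just like the splits: attrs = word_attrs[doc[start:end].text]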

Once you have that, you have a working, generalized example:

import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")

doc = nlp("the wordsare wronly merged and weneed split them")  # example
words = ["wordsare", "weneed"]  # Example: words to be split
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
# spaCy 2.x signature; in spaCy 3.x this becomes matcher.add("Terminology", patterns)
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)

with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start], 1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths = splits[doc[start:end].text]
        retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split = [token.text for token in doc]
print(token_split)
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']

Note that if you only care about tokenization, you can do the same thing in a simpler and even faster way:

[w for tok in doc for w in (splits[tok.text] if tok.text in words else [tok.text])]
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']

Also note that in the first example the attrs are fixed and therefore assigned incorrectly in some cases. You could fix that with yet another dictionary, but a neater and cleaner way to get a fully functional pipeline is to redefine tokenization (here by overriding nlp.make_doc) and let spacy do the rest:

from spacy.tokens import Doc

def make_doc_with_splits(txt):
    # Expand each wrongly merged word into its split form; keep all other tokens unchanged.
    return Doc(nlp.vocab, words=[w for tok in nlp.tokenizer(txt)
                                 for w in (splits[tok.text] if tok.text in words else [tok.text])])

nlp.make_doc = make_doc_with_splits
doc2 = nlp("the wordsare wronly merged and weneed split them")
for tok in doc2:
    print(f"{tok.text:<10}", f"{tok.pos_:<10}", f"{tok.dep_:<10}")
the        DET        det       
words      NOUN       nsubjpass 
are        AUX        auxpass   
wronly     ADV        advmod    
merged     VERB       ROOT      
and        CCONJ      cc        
we         PRON       nsubj     
need       VERB       aux       
split      VERB       conj      
them       PRON       dobj 
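
If the file contains many different merged words, building the splits dictionary by hand quickly becomes tedious. A minimal sketch of one way to generate it automatically, assuming you can supply a set of correctly spelled words (the known_words set and the guess_splits helper are hypothetical names, not part of spaCy), reusing the nlp object from the example above:

known_words = {"the", "words", "are", "we", "need", "merged", "and", "split", "them"}

def guess_splits(text, known_words):
    # Propose a two-way split for every token whose two halves are both known words.
    splits = {}
    for tok in nlp.tokenizer(text):
        if tok.text in known_words:
            continue
        for i in range(1, len(tok.text)):
            left, right = tok.text[:i], tok.text[i:]
            if left in known_words and right in known_words:
                splits[tok.text] = [left, right]
                break
    return splits

print(guess_splits("the wordsare wronly merged and weneed split them", known_words))
# {'wordsare': ['words', 'are'], 'weneed': ['we', 'need']}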
