I'm trying to fix wrongly merged Spanish words from a text file using spaCy's retokenizer.split, but I'd like to generalize the orths argument passed to retokenizer.split. I have the following code:

doc= nlp("the wordsare wronly merged and weneed split them") #example
words = ["wordsare"] # Example: words to be split
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start],1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths= [str(doc[start]),str(doc[end])]
    retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split=[token.text for token in doc]

But when I set orths this way, orths= [str(doc[start]),str(doc[end])], instead of ["words","are"], I get this error:

ValueError: [E117] The newly split tokens must match the text of the original token. New orths: wordsarewrongly. Old text: wordsare.
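To see why this error appears, here is a minimal plain-Python sketch of the check the message describes: the pieces in orths must concatenate back to the text of the token being split. It assumes whitespace tokenization as a stand-in for the spaCy Doc, and the helper name rejoins is made up for illustration.

```python
# Plain-Python stand-in for the spaCy tokens (whitespace tokenization assumed).
tokens = "the wordsare wronly merged and weneed split them".split()

# The match span for "wordsare" is tokens[start:end], so `end` is exclusive.
start, end = 1, 2

# What the question's code builds: the matched token plus the token AFTER the match.
bad_orths = [tokens[start], tokens[end]]
# What split() needs: pieces that concatenate back to the matched token.
good_orths = ["words", "are"]


def rejoins(token_text, orths):
    """Return True if the pieces concatenate exactly to the original token text."""
    return "".join(orths) == token_text


print("".join(bad_orths))                   # wordsarewronly
print(rejoins(tokens[start], bad_orths))    # False
print(rejoins(tokens[start], good_orths))   # True
```

Because end is the exclusive end of the match, doc[end] is the next token rather than the second half of the merged word, which is why the joined orths come out longer than the old text.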


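  1. Change words = ["wordsare"] to words = ["wordsare","weneed"]. This is the list of wrongly merged words.

  2. Add the rules that map each word in that list to its split: splits = {"wordsare":["words","are"], "weneed":["we","need"]}

  3. Change orths= [str(doc[start]),str(doc[end])] to orths= splits[doc[start:end].text]. This is the list of pieces that replaces the match that was found. Your original [str(doc[start]),str(doc[end])] doesn't make much sense, because doc[end] is the token after the match.

  4. Move retokenizer.split inside the loop.

  5. Consider adding another dictionary for attrs.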
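Steps 2 and 5 can be sketched as two lookup tables plus a sanity check. This is only an illustration: attrs_by_word and validate are hypothetical names, and the POS/DEP values are copied from the tagger output shown at the end of this answer rather than being values you must use.

```python
# Step 2: merged word -> its pieces.
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}

# Step 5 (hypothetical table): merged word -> attributes, one value per piece.
attrs_by_word = {
    "wordsare": {"POS": ["NOUN", "AUX"], "DEP": ["nsubjpass", "auxpass"]},
    "weneed": {"POS": ["PRON", "VERB"], "DEP": ["nsubj", "aux"]},
}


def validate(splits, attrs_by_word):
    """Return a list of problems; an empty list means the tables are consistent."""
    problems = []
    for word, pieces in splits.items():
        joined = "".join(pieces)
        # The pieces must rejoin to the original token text (the E117 condition).
        if joined != word:
            problems.append(f"{word}: pieces join to {joined}")
        # Each attribute list should have exactly one value per piece.
        for name, values in attrs_by_word.get(word, {}).items():
            if len(values) != len(pieces):
                problems.append(f"{word}: {name} has {len(values)} values for {len(pieces)} pieces")
    return problems


print(validate(splits, attrs_by_word))  # []
```

Inside the retokenize loop, the attributes could then be looked up the same way as the pieces, for example attrs = attrs_by_word[doc[start:end].text].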


import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")

doc= nlp("the wordsare wronly merged and weneed split them") #example
words = ["wordsare","weneed"] # Example: words to be split
splits = {"wordsare":["words","are"], "weneed":["we","need"]}
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)  # spaCy v2 signature; in v3 use matcher.add("Terminology", patterns)
matches = matcher(doc)

with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start],1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths= splits[doc[start:end].text]           
        retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split=[token.text for token in doc]
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']


# Token texts only: expand each merged token into its pieces (flat list, no retokenizer needed)
[piece for tok in doc for piece in splits.get(tok.text, [tok.text])]
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']


from spacy.tokens import Doc
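def split_merged(txt):
    # Tokenize as usual, then expand each known merged token into its pieces
    pieces = [p for tok in nlp.tokenizer(txt) for p in splits.get(tok.text, [tok.text])]
    return Doc(nlp.vocab, pieces)
# Override make_doc so the tagger and parser run on the already-split tokens
nlp.make_doc = split_merged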
doc2 = nlp("the wordsare wronly merged and weneed split them")
for tok in doc2:
    print(f"{tok.text:<10}", f"{tok.pos_:<10}", f"{tok.dep_:<10}")
the        DET        det       
words      NOUN       nsubjpass 
are        AUX        auxpass   
wronly     ADV        advmod    
merged     VERB       ROOT      
and        CCONJ      cc        
we         PRON       nsubj     
need       VERB       aux       
split      VERB       conj      
them       PRON       dobj 
