python - Is there a way to generalize the orths argument of Spacy's retokenizer.split?
Problem description
I am trying to fix wrongly merged Spanish words from a text file, and I am using Spacy's retokenizer.split. However, I would like to generalize the orths argument passed to retokenizer.split. I have the following code:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("the wordsare wronly merged and weneed split them")  # example
words = ["wordsare"]  # example: words to be split
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start], 1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths = [str(doc[start]), str(doc[end])]
        retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split = [token.text for token in doc]
print(token_split)
But when I build the orths this way, as orths = [str(doc[start]), str(doc[end])] instead of ["words", "are"], I get this error:

ValueError: [E117] The newly split tokens must match the text of the original token. New orths: wordsarewronly. Old text: wordsare.
I would like some help generalizing this, because I want the code to fix not only the word wordsare but also weneed and any other merged words the file may contain.
Solution
What I would change in your example is this line:

words = ["wordsare"]

to

words = ["wordsare", "weneed"]

That is your list of misspelled (merged) words. Then add a dictionary mapping each word in that list to its split:

splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}

and replace

orths = [str(doc[start]), str(doc[end])]

with

orths = splits[doc[start:end].text]

That gives you the list of splits to substitute for each found match. Your original [str(doc[start]), str(doc[end])] doesn't make much sense inside the retokenizer.split loop. Consider adding another dictionary for attrs as well.

Once you have that, you have a working, generalized example:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("the wordsare wronly merged and weneed split them")  # example
words = ["wordsare", "weneed"]  # example: words to be split
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start], 1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths = splits[doc[start:end].text]
        retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split = [token.text for token in doc]
print(token_split)
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
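As suggested above, the fixed attrs can be generalized the same way as the orths. A minimal sketch, assuming hypothetical pos_map, dep_map, and split_attrs names (these names and the tag values are illustrative, not from the original answer):

```python
# Hypothetical per-word maps; the names and the tag values are
# assumptions introduced for illustration.
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}
pos_map = {"wordsare": ["NOUN", "AUX"], "weneed": ["PRON", "VERB"]}
dep_map = {"wordsare": ["nsubjpass", "auxpass"], "weneed": ["nsubj", "aux"]}

def split_attrs(merged):
    """Return the orths and per-token attrs for one merged word."""
    return splits[merged], {"POS": pos_map[merged], "DEP": dep_map[merged]}

orths, attrs = split_attrs("weneed")
print(orths)  # ['we', 'need']
print(attrs)  # {'POS': ['PRON', 'VERB'], 'DEP': ['nsubj', 'aux']}
```

Inside the retokenizer loop you would then call orths, attrs = split_attrs(doc[start:end].text) before retokenizer.split, so every merged word gets its own tags instead of the fixed ones.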
Note that if you only care about the tokenization, you can do the same thing in a simpler and even faster way:
[t for tok in doc for t in (splits[tok.text] if tok.text in words else [tok.text])]
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
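The same flattening logic can be tried without spacy (or the language model) by running it over a plain list of token strings; dict.get with a one-element default list is equivalent to the if/else test above. The tokens list here just mimics the example sentence:

```python
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}
tokens = ["the", "wordsare", "wronly", "merged", "and", "weneed", "split", "them"]

# Replace each merged token with its parts, leaving other tokens untouched.
fixed = [part for tok in tokens for part in splits.get(tok, [tok])]
print(fixed)  # ['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
```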
Also note that in the first example the attrs are fixed and, in some cases, wrongly assigned. You could fix that by building another dictionary, but a neater and cleaner way to get a fully functional pipeline is to redefine the tokenizer and let spacy do the rest for you:
from spacy.tokens import Doc

# Override make_doc so merged words are split before the pipeline runs.
nlp.make_doc = lambda txt: Doc(
    nlp.vocab,
    [i for l in [splits[tok.text] if tok.text in words else [tok.text]
                 for tok in nlp.tokenizer(txt)] for i in l],
)
doc2 = nlp("the wordsare wronly merged and weneed split them")
for tok in doc2:
    print(f"{tok.text:<10}", f"{tok.pos_:<10}", f"{tok.dep_:<10}")
the        DET        det
words      NOUN       nsubjpass
are        AUX        auxpass
wronly     ADV        advmod
merged     VERB       ROOT
and        CCONJ      cc
we         PRON       nsubj
need       VERB       aux
split      VERB       conj
them       PRON       dobj
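The one-line lambda above is dense; the same override can be written with a named helper. The split_token_texts function below is a readable refactor introduced here for illustration (its name is an assumption, not code from the original answer), and its pure-Python part runs without spacy installed:

```python
# A readable refactor of the make_doc lambda above; split_token_texts
# is a hypothetical helper name introduced for illustration.
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}
words = list(splits)

def split_token_texts(token_texts):
    """Expand every merged word into its parts, preserving order."""
    out = []
    for text in token_texts:
        out.extend(splits[text] if text in words else [text])
    return out

# With spacy installed, the override would then read (sketch):
# nlp.make_doc = lambda txt: Doc(
#     nlp.vocab, split_token_texts([t.text for t in nlp.tokenizer(txt)])
# )

print(split_token_texts(["the", "wordsare", "wronly"]))  # ['the', 'words', 'are', 'wronly']
```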