Lemmatizing names and nicknames with SpaCy

Question

I want a chart of the words that occur in a given text. The code below works well, but it treats "Mathew" and "Mat" as two different words.

How can I make SpaCy treat them as the same word?

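import string
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Setup assumed here; the original post does not show how these were defined.
nlp = spacy.load('en_core_web_sm')
stopwords = STOP_WORDS
punctuations = string.punctuation

def cleanup_text(docs):
    """Lemmatize each document and drop pronouns, stopwords and punctuation."""
    texts = []
    for counter, doc in enumerate(docs, start=1):
        if counter % 100 == 0:
            print('Processed {} out of {}'.format(counter, len(docs)))
        doc = nlp(doc, disable=['parser', 'ner'])
        # '-PRON-' is the lemma spaCy 2.x assigns to pronouns
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
        texts.append(' '.join(tokens))
    return pd.Series(texts)

def make_barplot_for_author(texts):
    """Plot the 25 most frequent words of the cleaned texts."""
    text_clean = cleanup_text(texts)
    # Flatten all cleaned documents into a single list of words
    text_clean = ' '.join(text_clean).split()
    # Drop the possessive marker that the tokenizer splits off
    text_clean = [word for word in text_clean if word != "'s"]
    text_counts = Counter(text_clean)
    NUM_WORDS = 25
    most_common = text_counts.most_common(NUM_WORDS)
    text_common_words = [word for word, _ in most_common]
    text_common_counts = [count for _, count in most_common]
    plt.figure(figsize=(15, 12))
    sns.barplot(x=text_common_counts, y=text_common_words)
    plt.title('Words that Apo use frequently', fontsize=20)
    plt.show()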

Tags: nlp, spacy

Answer


Well, you need to lemmatize the text, for example with NLTK, before passing it to the function.
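
A general-purpose lemmatizer reduces inflected forms and is unlikely to know that "Mat" is short for "Mathew", so the names may still need an explicit mapping. The following is only a minimal sketch of that idea, assuming a hand-written alias table; `NICKNAMES` and `normalize_names` are hypothetical names for illustration and are not part of SpaCy or NLTK.

```python
from collections import Counter

# Hypothetical alias table: maps lower-cased nicknames to a canonical name.
# Not part of SpaCy or NLTK; extend it for the names in your own corpus.
NICKNAMES = {
    "mat": "mathew",
    "matt": "mathew",
}

def normalize_names(tokens, aliases=NICKNAMES):
    """Lower-case each token and replace known nicknames with the canonical name."""
    return [aliases.get(tok.lower(), tok.lower()) for tok in tokens]

tokens = "Mathew said that Mat would call".split()
counts = Counter(normalize_names(tokens))
print(counts["mathew"])  # 2: "Mathew" and "Mat" are now counted together
```

In the question's code, the same mapping could be applied to the `tokens` list inside `cleanup_text`, just before the tokens are joined back into a string.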

