python-3.x - POS pattern mining with spaCy
Problem description
I am trying to extract linguistic features from text using spaCy in Python 3. My input looks like this:
Sent_id  Text
1        I am exploring text analytics using spacy
2        amazing spacy is going to help me
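I am looking for output like the following, obtained by extracting words as trigram/bigram phrases that match specific POS patterns I provide, such as NOUN VERB NOUN, ADJ NOUN, and so on, while keeping the dataframe structure. If a sentence contains more than one phrase, the record must be duplicated, once for each phrase.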
Sent_id  Text                                       Feature                   Pattern
1        I am exploring text analytics using spacy  exploring text analytics  VERB NOUN NOUN
1        I am exploring text analytics using spacy  analytics using spacy     NOUN VERB NOUN
2        amazing spacy is going to help me          amazing spacy             ADJ NOUN
Solution
The code is explained in the comments.
import spacy
import pandas as pd
import re
# Load spacy model once and reuse
nlp = spacy.load('en_core_web_sm')
# The dataframe with text
df = pd.DataFrame({
    'Sent_id': [1, 2],
    'Text': ["I am exploring text analytics using spacy", "amazing spacy is going to help me"]
})
# Patterns we are interested in
patterns = ["VERB NOUN", "NOUN VERB NOUN"]
# Convert each pattern into a regular expression over word_!POS tokens
re_patterns = [" ".join([r"(\w+)_!" + pos for pos in p.split()]) for p in patterns]
def extract(nlp, text, patterns, re_patterns):
    """Extract the pieces of text matching the POS patterns in patterns.

    args:
        nlp: Loaded spaCy model object
        text: The input text
        patterns: The list of patterns to search for
        re_patterns: The patterns converted into regexes
    returns: A list of pairs [t, p] where
        t is the part of text matching the pattern p in patterns
    """
    doc = nlp(text)
    matches = list()
    # Encode every token as word_!POS so a regex can match on the tags
    text_pos = " ".join([token.text + "_!" + token.pos_ for token in doc])
    for i, pattern in enumerate(re_patterns):
        for result in re.findall(pattern, text_pos):
            matches.append([" ".join(result), patterns[i]])
    return matches
# Test it
print(extract(nlp, "A sleeping cat and walking dog", patterns, re_patterns))
# Returns
# [['sleeping cat', 'VERB NOUN'], ['walking dog', 'VERB NOUN']]
# Extract the matched patterns
df['matches'] = df['Text'].apply(lambda x: extract(nlp,x,patterns,re_patterns))
# Convert the list of tuples into rows
df = df.matches.apply(pd.Series).merge(df, right_index=True, left_index=True).drop(["matches"], axis=1)\
    .melt(id_vars=['Sent_id', 'Text'], value_name="matches").drop("variable", axis=1)
# Add the matched text and matched patterns into new columns
df[['matched_text','matched_pattern']]= df.matches.apply(pd.Series)
# Drop the column and cleanup
df = df.drop("matches", axis = 1).sort_values('Sent_id')
df = df.drop_duplicates(subset =["matched_text", "matched_pattern"], keep='last')
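Output (sentence 2 has no match because neither entry in the patterns list applies to it; the ADJ NOUN pattern from the question is not in that list):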
   Sent_id  Text                                       matched_text           matched_pattern
0  1        I am exploring text analytics using spacy  exploring text         VERB NOUN
2  1        I am exploring text analytics using spacy  using spacy            VERB NOUN
4  1        I am exploring text analytics using spacy  analytics using spacy  NOUN VERB NOUN
1  2        amazing spacy is going to help me          NaN                    NaN
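One detail of the regex approach is worth illustrating: re.findall returns non-overlapping matches, so two phrases of the same pattern that share a token would yield only the first one. The sketch below is a variation on the answer's idea, not the answer's own code. It wraps each pattern in a lookahead so that overlapping phrases are also found, and it builds one record per match, as the question asks. To stay runnable without a spaCy model it works on a hand-tagged list (TAGGED is an assumption standing in for token.text and token.pos_), and the helper names to_regex and find_phrases are hypothetical.

```python
import re

# Hypothetical, hand-written (word, POS) pairs standing in for spaCy output;
# a real run would take them from token.text and token.pos_.
TAGGED = [
    ("I", "PRON"), ("am", "AUX"), ("exploring", "VERB"), ("text", "NOUN"),
    ("analytics", "NOUN"), ("using", "VERB"), ("spacy", "NOUN"),
]
PATTERNS = ["VERB NOUN NOUN", "NOUN VERB NOUN"]


def to_regex(pattern):
    """Turn a pattern such as 'VERB NOUN' into a regex over word_!POS tokens."""
    body = " ".join(r"(\w+)_!" + pos for pos in pattern.split())
    # The lookahead consumes no characters, so overlapping phrases are found.
    # (?<!\S) and (?!\S) keep every match aligned to whole tokens.
    return re.compile(r"(?<!\S)(?=" + body + r"(?!\S))")


def find_phrases(tagged, patterns):
    """Return (phrase, pattern) tuples for every match, overlaps included."""
    text_pos = " ".join(word + "_!" + pos for word, pos in tagged)
    found = []
    for pattern in patterns:
        for match in to_regex(pattern).finditer(text_pos):
            found.append((" ".join(match.groups()), pattern))
    return found


# One record per matched phrase: the sentence id is repeated for each match.
rows = [
    {"Sent_id": 1, "Feature": phrase, "Pattern": pattern}
    for phrase, pattern in find_phrases(TAGGED, PATTERNS)
]
for row in rows:
    print(row)
```

With real spaCy output the tags may differ from the hand-written ones here, so the matched phrases depend on the model that is loaded.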