python - How do I find NLP word counts and plot them?

Question

I am doing some NLP work.

My original dataframe is df_all:

Index    Text
1        Hi, Hello, this is mike, I saw your son playing in the garden...
2        Besides that, sometimes my son studies math for fun...
3        I cannot believe she said that. she always says such things...

I converted the text into a bag-of-words (BOW) dataframe.

So my dataframe df_BOW now looks like this:

Index    Hi   This   my   son   play   garden ...
1        3    6      3    0     2       4
2        0    2      4    4     3       1
3        0    2      0    7     3       0

I want to find how many times each word occurs in the corpus, so I tried:

cnt_pro = df_all['Text'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(cnt_pro.index, cnt_pro.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Word', fontsize=12)
plt.xticks(rotation=90)
plt.show();

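I expected this to show the most frequent words,

but the chart I get does not show any useful information.

How can I fix this?

Tags: python, nlp, seaborn

Answer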

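I'm not sure how you created df_BOW, but it is not an ideal format for plotting. It is easier to start again from the text in df_all: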

import pandas as pd

# Sample data: one document per row in the "text" column
df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

Similar to RF Adriaansen's answer, we can use a regular expression to extract the words, but here we use only pandas methods:

counts = df_all["text"].str.findall(r"(\w+)").explode().value_counts()

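Series.str.findall: applies the regular expression (\w+) to capture every word. This returns a Series of lists.
Series.explode: turns each element of those lists into its own row.
Series.value_counts: returns a Series with the count of each unique value.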
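To make the three steps easier to follow, here is a minimal sketch that runs them one at a time on the sample data and keeps the intermediate results. The names word_lists and words are only illustrative.

```python
import pandas as pd

# Same sample text as df_all in the answer
df_all = pd.DataFrame(
    {
        "text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

# Step 1: one list of words per row
word_lists = df_all["text"].str.findall(r"(\w+)")

# Step 2: one word per row (the original row index is repeated)
words = word_lists.explode()

# Step 3: how often each distinct word occurs, most frequent first
counts = words.value_counts()

print(word_lists.iloc[1])  # the words of the second sentence
print(len(words))          # total number of words in the corpus
print(counts["son"])       # occurrences of "son"
```

Note that the match is case-sensitive, so "Hi" and "hi" would be counted as different words. If that is not wanted, lowercasing the column first with Series.str.lower is one option.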

counts is a Series whose index holds the words and whose values hold the counts:

son          2
she          2
I            2
...
says         1
garden       1
math         1
Name: text, dtype: int64
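For contrast, here is a short sketch of what the call in the question computes. value_counts on the raw text column counts whole strings, not words, so with three distinct sentences every "count" is 1, which is probably why the chart looked empty of information. The column name "Text" mirrors the question, and the sentences are the sample ones from this answer.

```python
import pandas as pd

# Column named "Text" as in the question; sentences taken from the answer's sample
df_all = pd.DataFrame(
    {
        "Text": [
            "Hi, Hello, this is mike, I saw your son playing in the garden",
            "Besides that, sometimes my son studies math for fun",
            "I cannot believe she said that. she always says such things",
        ]
    }
)

# Counts identical rows: the index is made of full sentences, not words
row_counts = df_all["Text"].value_counts()

print(row_counts.index.tolist())
print(row_counts.tolist())
```

Each bar in such a plot is labelled with an entire sentence and has height 1, whereas counts above has one entry per word.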

Then plot it:

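import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(6, 5))
# One bar per word: words on the x-axis, counts on the y-axis
sns.barplot(x=counts.index, y=counts.values, ax=ax)
ax.set_ylabel('Number of Occurrences', fontsize=12)
ax.set_xlabel('Word', fontsize=12)
ax.xaxis.set_tick_params(rotation=90)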

If you only want the N most common words, you can use nlargest like this:

top_10 = counts.nlargest(10)

and plot it the same way.
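If a table like df_BOW already exists, the per-word totals can also be read from it directly by summing each column. The sketch below assumes df_BOW has one row per document and one column per word, and rebuilds only the six columns visible in the question (the columns elided by "..." are left out).

```python
import pandas as pd

# Hypothetical reconstruction of df_BOW from the values shown in the question
df_BOW = pd.DataFrame(
    {
        "Hi": [3, 0, 0],
        "This": [6, 2, 2],
        "my": [3, 4, 0],
        "son": [0, 4, 7],
        "play": [2, 3, 3],
        "garden": [4, 1, 0],
    },
    index=[1, 2, 3],
)

# Sum down each column to get the corpus-wide total per word,
# then sort so the most frequent words come first
bow_counts = df_BOW.sum().sort_values(ascending=False)

print(bow_counts)
```

bow_counts has the same shape as counts (words in the index, totals in the values), so it can be passed to the same plotting code or to nlargest.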
