How to count word frequencies for each row in a dataset

Problem description

I have a column of text in a dataset like this:

Text
This is a long string of words
words have many types
each type represents one thing
thing are different
where are these words

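I want to count the word frequencies for each row across the whole column. My expected result is something like this, or in some other format: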

Text                               Count
this is a long string of words     this:1, is:1, a:1, long:1.....
words have many types              words:3, have:1....
each type represents one thing     ......
thing are different                thing:2, are:2
where are these words              .......

How can I do this with Python?

Tags: python, text, count, word-frequency

Solution


Try Counter:

from collections import Counter
# Lowercase each row, split it on whitespace, then count the words within that row.
df["Count"] = df['Text'].str.lower().str.split().apply(Counter)

>>> df
                             Text                                              Count
0  This is a long string of words  {'this': 1, 'is': 1, 'a': 1, 'long': 1, 'strin...
1           words have many types     {'words': 1, 'have': 1, 'many': 1, 'types': 1}
2  each type represents one thing  {'each': 1, 'type': 1, 'represents': 1, 'one':...
3             thing are different             {'thing': 1, 'are': 1, 'different': 1}
4           where are these words     {'where': 1, 'are': 1, 'these': 1, 'words': 1}
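
The output above counts words within each row, so every count is 1. The expected result in the question (words:3, thing:2, are:2) looks more like counts taken over the whole column. If that is the intent, the following is a minimal sketch under that assumption. The sample DataFrame is rebuilt from the question's rows, and the names tokens, total and CountText are illustrative, not part of the original answer.

```python
from collections import Counter

import pandas as pd

# Sample data rebuilt from the rows shown in the question.
df = pd.DataFrame({
    "Text": [
        "This is a long string of words",
        "words have many types",
        "each type represents one thing",
        "thing are different",
        "where are these words",
    ]
})

# Tokenize once: lowercase, then split on whitespace.
tokens = df["Text"].str.lower().str.split()

# Column-wide frequencies: a single Counter over every word in every row.
total = Counter(word for row in tokens for word in row)

# For each row, look up the column-wide count of each of its words.
df["Count"] = tokens.apply(lambda row: {word: total[word] for word in row})

# Optional: render the counts as "word:count, ..." text
# (CountText is a hypothetical column name).
df["CountText"] = df["Count"].apply(
    lambda counts: ", ".join(f"{w}:{n}" for w, n in counts.items())
)
```

With this sample data, the row "thing are different" gets thing:2, are:2, different:1. Splitting on whitespace treats "type" and "types" as different words and leaves any punctuation attached to its word, so real text may need a stricter tokenizer.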
