python - Get word frequencies across all rows of a text column
Problem description
Given the (simplified) DataFrame
import pandas as pd
texts = pd.DataFrame({"description":["This is one text","and this is another one"]})
print(texts)
description
0 This is one text
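1 and this is another one
I want to create a Series holding the frequency of each word that appears in the description column.
The expected result should look like this: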
counts
this 2
is 2
one 2
text 1
and 1
another 1
I tried
print(pd.Series(' '.join(str(texts.description)).split(' ')).value_counts())
but got
139
e 8
t 7
i 6
n 5
o 5
s 5
d 3
a 3
h 3
p 2
: 2
c 2
r 2
\n 2
T 1
0 1
j 1
x 1
1 1
N 1
m 1
, 1
y 1
b 1
dtype: int64
Solution
Your code fails because str(texts.description) returns:
'0 This is one text\n1 and this is another one\nName: description, dtype: object'
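That is the string representation of the Series, almost the same text that print(texts.description) displays. When you call ' '.join(str(texts.description)), that string is iterated character by character, so a space is inserted between every pair of characters. split(' ') then returns single characters (and empty strings where the original text had spaces) instead of words.
Try: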
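A minimal sketch of the failure described above, using the question's own DataFrame. The variable names (as_text, chars, all_words) are only illustrative. It also shows that joining the column itself, rather than its str() form, gives the word list the original one-liner was after.

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# str() on a Series returns its printed representation, including the
# index labels and the "Name: ..." footer, not the cell values
as_text = str(texts.description)

# Iterating a str yields single characters, so join() puts a space
# between every character; each original space becomes two empty strings
chars = " ".join("one text").split(" ")
print(chars)  # ['o', 'n', 'e', '', '', 't', 'e', 'x', 't']

# Iterating the Series yields the cell values, so this joins whole texts
all_words = " ".join(texts.description).lower().split()
print(pd.Series(all_words).value_counts())
```

Calling split() without an argument splits on any run of whitespace, which avoids the empty strings that split(' ') produces.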
(texts.description
.str.lower()
.str.split(expand=True)
.stack().value_counts()
)
Output:
this 2
one 2
is 2
another 1
and 1
text 1
dtype: int64
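As a possible variant, not part of the original answer, the same counts can be produced with Series.explode (assuming pandas 0.25 or later) instead of split(expand=True) followed by stack(). The to_frame("counts") call is added here only to get the column name shown in the expected result; the order of words with equal counts may differ from the listing in the question.

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# Lowercase, split each row on whitespace into a list of words,
# then explode so that every word gets its own row
words = texts["description"].str.lower().str.split().explode()

# Count the words and name the resulting column "counts"
counts = words.value_counts().to_frame("counts")
print(counts)
```

This avoids building the wide intermediate DataFrame that expand=True creates, which may matter when the texts vary a lot in length.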