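python-3.x - How to get the frequency of specific words for each row in a DataFrame
Problem description
I am trying to write a function that gets the frequency of specific words from a DataFrame. I am using Pandas to read a CSV file into a DataFrame and NLTK to tokenize the text. I can get the counts for the whole column, but I am struggling to get the frequencies for each row. Here is what I have so far.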
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from collections import defaultdict
words = [
    "robot",
    "automation",
    "collaborative",
    "Artificial Intelligence",
    "technology",
    "Computing",
    "autonomous",
    "automobile",
    "cobots",
    "AI",
    "Integration",
    "robotics",
    "machine learning",
    "machine",
    "vision systems",
    "systems",
    "computerized",
    "programmed",
    "neural network",
    "tech",
]
count = defaultdict(int)


def analyze(file):
    # Count every occurrence of a target word across the whole "Text" column.
    df = pd.read_csv(file)
    for text in df["Text"]:
        tokenize_text = word_tokenize(text)
        for w in tokenize_text:
            if w in words:
                count[w] += 1


analyze("Articles/AppleFilter.csv")
print(count)
Output:
defaultdict(<class 'int'>, {'automation': 283, 'robot': 372, 'robotics': 194, 'machine': 220, 'tech': 41, 'systems': 187, 'technology': 246, 'autonomous': 60, 'collaborative': 18, 'automobile': 6, 'AI': 158, 'programmed': 12, 'cobots': 2, 'computerized': 3, 'Computing': 1})
Goal: get the frequency for each row
{'automation': 5, 'robot': 1, 'robotics': 1, ...
{'automobile': 1, 'systems': 1, 'technology': 1, ...
{'AI': 1, 'cobots': 1, 'computerized': 3, ...
CSV file format:
Title | Text | URL
What I tried:
count = defaultdict(int)
df = pd.read_csv("AppleFilterTest01.csv")
for text in df["Text"].iteritems():
    for row in text:
        print(row)
        if row in words:
            count[w] += 1
print(count)
Output:
defaultdict(<class 'int'>, {})
I would appreciate any guidance, tips, or help. Thank you.
Solution
Here is a simple solution that uses collections.Counter:
Example to copy/paste:
0,review_body
1,this is the first 8 issues of the series. this is the first 8 issues of the series.
2,I've always been partial to immutable laws. I've always been partial to immutable laws.
3,This is a book about first contact with aliens. This is a book about first contact with aliens.
4,This is quite possibly *the* funniest book. This is quite possibly *the* funniest book.
5,The story behind the book is almost better than your mom. The story behind the book is almost better than your mom.
Import the essentials:
import pandas as pd
from collections import Counter
df = pd.read_clipboard(header=0, index_col=0, sep=',')
Then use .str.split() followed by apply() with Counter:
df1 = df.review_body.str.split().apply(lambda x: Counter(x))
print(df1)
0
1 {'this': 2, 'is': 2, 'the': 4, 'first': 2, '8'...
2 {'I've': 2, 'always': 2, 'been': 2, 'partial':...
3 {'This': 2, 'is': 2, 'a': 2, 'book': 2, 'about...
4 {'This': 2, 'is': 2, 'quite': 2, 'possibly': 2...
5 {'The': 2, 'story': 2, 'behind': 2, 'the': 2, ...
Use dict(Counter(x)) inside apply(), .to_dict() at the end, and so on to get the output format you need.
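As a rough sketch of how those pieces might fit the original question, the snippet below counts only the target words in each row and returns plain dicts. The sample DataFrame, the shortened `words` set, and the helper name `count_target_words` are made up for illustration. It splits on whitespace rather than using NLTK's `word_tokenize`, so punctuation stays attached to tokens, and multi-word entries such as "machine learning" would never match a single token.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the asker's CSV; only the "Text" column matters here.
df = pd.DataFrame(
    {
        "Text": [
            "The robot uses automation and more automation",
            "AI and robotics drive technology",
            "Nothing relevant here",
        ]
    }
)

# A shortened version of the asker's word list (single-word entries only).
words = {"robot", "automation", "AI", "robotics", "technology"}


def count_target_words(text):
    """Count only the whitespace-separated tokens of `text` found in `words`."""
    return dict(Counter(token for token in text.split() if token in words))


# One dict per row, then a plain {row_index: {word: count}} mapping.
per_row = df["Text"].apply(count_target_words)
result = per_row.to_dict()
print(result)
```

Rows with no matching words come back as empty dicts, which keeps the result aligned with the DataFrame index.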
Hope this helps.