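pandas - How to compute tf-idf scores from a data frame column and extract the words above a minimum score threshold
Problem description
I took one column of a dataset in which every row holds a text description. I am trying to find the words whose tf-idf score is greater than some value n. However, the code returns a matrix of scores. How can I sort and filter the scores and see the corresponding words?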
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the descriptions of the 'Shiraz' variety
tempDataFrame = wineData.loc[wineData.variety == 'Shiraz', 'description'].reset_index()
# Lowercase the text (TfidfVectorizer also lowercases by default)
tempDataFrame['description'] = tempDataFrame['description'].str.lower()
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
score = tfidf.fit_transform(tempDataFrame['description'])
Sample Data:
description
This tremendous 100% varietal wine hails from Oakville and was aged over
three years in oak. Juicy red-cherry fruit and a compelling hint of caramel
greet the palate, framed by elegant, fine tannins and a subtle minty tone in
the background. Balanced and rewarding from start to finish, it has years
ahead of it to develop further nuance. Enjoy 2022–2030.
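Solution
Since the full wine description column is not available, the sample text you provided is split into three sentences to build a data frame with a single column named 'Description' and three rows. That column is passed to the tf-idf vectorizer, and a new data frame is built with the features (words) as columns and their scores as values. The result is then filtered with pandas.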
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
doc = ['This tremendous 100% varietal wine hails from Oakville and was aged over \
three years in oak.', 'Juicy red-cherry fruit and a compelling hint of caramel \
greet the palate, framed by elegant, fine tannins and a subtle minty tone in \
the background.', 'Balanced and rewarding from start to finish, it has years \
ahead of it to develop further nuance. Enjoy 2022–2030.']
df_1 = pd.DataFrame({'Description': doc})
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
score = tfidf.fit_transform(df_1['Description'])
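# New data frame: one row per document, one column per word, values are tf-idf scores
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names_out())
# Highest score of each word across all documents, keeping only words above 0.3
max_scores = df.max()
tokens_above_threshold = max_scores[max_scores > 0.3].sort_values(ascending=False)
tokens_above_threshold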
Out[29]:
wine 0.341426
oak 0.341426
aged 0.341426
varietal 0.341426
hails 0.341426
100 0.341426
oakville 0.341426
tremendous 0.341426
nuance 0.307461
rewarding 0.307461
start 0.307461
enjoy 0.307461
develop 0.307461
balanced 0.307461
ahead 0.307461
2030 0.307461
2022 0.307461
finish 0.307461
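The filter above reports each word's highest score across all documents, so it does not show which row a word came from, and `toarray()` builds a dense matrix that can get large for a full description column. As a possible variation, the sketch below reads the non-zero entries straight from the sparse matrix and keeps one row per (document, word) pair. The names `feature_names`, `long_df` and `above` are made up for this illustration, and the helper only exists to cope with older scikit-learn versions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This tremendous 100% varietal wine hails from Oakville and was aged over "
    "three years in oak.",
    "Juicy red-cherry fruit and a compelling hint of caramel greet the palate, "
    "framed by elegant, fine tannins and a subtle minty tone in the background.",
    "Balanced and rewarding from start to finish, it has years ahead of it to "
    "develop further nuance. Enjoy 2022–2030.",
]

tfidf = TfidfVectorizer(analyzer="word", stop_words="english")
score = tfidf.fit_transform(docs)  # sparse matrix, shape (n_docs, n_words)


def feature_names(vectorizer):
    """Hypothetical helper: return the vocabulary in column order."""
    # get_feature_names_out() exists from scikit-learn 1.0 onwards;
    # fall back to the older get_feature_names() otherwise.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


words = feature_names(tfidf)

# COO format exposes the (row, column, value) triplet of every non-zero score
coo = score.tocoo()
long_df = pd.DataFrame({
    "doc": coo.row,                       # index of the source row
    "word": [words[j] for j in coo.col],  # token for that column
    "tfidf": coo.data,                    # tf-idf score
})

threshold = 0.3
above = (long_df[long_df["tfidf"] > threshold]
         .sort_values(["doc", "tfidf"], ascending=[True, False])
         .reset_index(drop=True))
print(above)
```

The `doc` column holds the positional index of each row passed to `fit_transform`, so with a real data frame it can be joined back to the original rows, assuming the index was reset beforehand.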