python - Understanding the top n tf-idf features in TfidfVectorizer
Question
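I am trying to understand scikit-learn's TfidfVectorizer a bit better. The code below has two documents, doc1 = The car is driven on the road and doc2 = The truck is driven on the highway. Calling fit_transform generates a vectorized matrix of tf-idf weights.
According to the tf-idf value matrix, shouldn't highway, truck, car be the top words instead of highway, truck, driven, since highway = truck = car = 0.63 and driven = 0.44?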
#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer= lambda x:x.split(),stop_words = 'english')
response = vectorizer.fit_transform(tn)
feature_array = np.array(vectorizer.get_feature_names_out()) #list of features (get_feature_names() on scikit-learn < 1.0)
print(feature_array)
print(response.toarray())
sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)
#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)
['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672 0.44943642 0. 0.6316672 0. ]
[0. 0.44943642 0.6316672 0. 0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']
Answer
As the result shows, the tf-idf matrix does give the higher score to highway, truck, car (and road):
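import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(tn)
# get_feature_names() was removed in scikit-learn 1.2; use it only on versions before 1.0
terms = vectorizer.get_feature_names_out()
pd.DataFrame(response.toarray(), columns=terms)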
        car    driven   highway      road     truck
0  0.631667  0.449436  0.000000  0.631667  0.000000
1  0.000000  0.449436  0.631667  0.000000  0.631667
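For anyone wondering where 0.63 and 0.45 come from, here is a rough sketch that recomputes the matrix by hand. It assumes the TfidfVectorizer defaults (smooth_idf=True, norm='l2', raw term counts as tf) and that 'the', 'is' and 'on' are removed as English stop words. The names docs, idf and tfidf_row are only illustrative and are not part of scikit-learn.

```python
import math

# Term counts per document after stop-word removal (assumed: 'the', 'is', 'on' dropped).
docs = [
    {"car": 1, "driven": 1, "road": 1},
    {"truck": 1, "driven": 1, "highway": 1},
]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term.
df = {t: sum(t in doc for doc in docs) for t in vocab}

# Smoothed idf, as used when smooth_idf=True: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Compute tf * idf for one document, then L2-normalise the row."""
    raw = [doc.get(t, 0) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print([round(v, 6) for v in row])
```

Since driven appears in both documents, its idf is ln(3/3) + 1 = 1, while the terms that appear in only one document get ln(3/2) + 1 ≈ 1.405. After L2 normalisation this gives the lower weight for driven that you see in the table.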
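The problem is the further check you do after flattening the array. np.argsort sorts each row separately and in ascending order, so the flattened result lists the column indices of the first document from lowest to highest score, followed by those of the second document. Taking the first three entries therefore returns the three lowest-scoring terms of the first document (highway, truck, driven). To get the highest score across all rows, you can do the following instead: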
max_scores = response.toarray().max(0).argsort()
np.array(terms)[max_scores[-4:]]
array(['car', 'highway', 'road', 'truck'], dtype='<U7')
The top scores belong to the feature names that have a score of 0.63 in the dataframe.
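If you want the top terms per document as well as overall, something along these lines might help. It is only a sketch: the matrix and feature names are copied from the output in the question rather than recomputed, and top_n_per_document and top_n_overall are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np

# Feature names and tf-idf matrix copied from the output shown in the question.
terms = np.array(["car", "driven", "highway", "road", "truck"])
scores = np.array([
    [0.6316672, 0.44943642, 0.0, 0.6316672, 0.0],
    [0.0, 0.44943642, 0.6316672, 0.0, 0.6316672],
])

def top_n_per_document(matrix, names, n):
    """Return the n highest-scoring terms for each row, best first."""
    # argsort is ascending along each row, so reverse it and keep the first n.
    order = np.argsort(matrix, axis=1)[:, ::-1][:, :n]
    return [list(names[row]) for row in order]

def top_n_overall(matrix, names, n):
    """Return the n terms with the highest score in any document."""
    best = matrix.max(axis=0)  # column-wise maximum over all documents
    order = np.argsort(best)[::-1][:n]
    return list(names[order])

print(top_n_per_document(scores, terms, 2))
print(top_n_overall(scores, terms, 4))
```

Terms with equal scores (here the 0.63 ties) can come back in either order, so treat the result as a set when several terms are tied.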