Cosine similarity on a TF-IDF normalized column raises a MemoryError

Problem description

This is my dataset:

                                                URI                 name  \
0        <http://dbpedia.org/resource/Digby_Morrell>        Digby Morrell   
1       <http://dbpedia.org/resource/Alfred_J._Lewy>       Alfred J. Lewy   
2        <http://dbpedia.org/resource/Harpdog_Brown>        Harpdog Brown   
3  <http://dbpedia.org/resource/Franz_Rottensteiner>  Franz Rottensteiner   
4               <http://dbpedia.org/resource/G-Enka>               G-Enka   

                                                text  
0  digby morrell born 10 october 1979 is a former...  
1  alfred j lewy aka sandy lewy graduated from un...  
2  harpdog brown is a singer and harmonica player...  
3  franz rottensteiner born in waidmannsfeld lowe...  
4  henry krvits born 30 december 1974 in tallinn ...  

This is my code. The cosine_similarity call does not work here: it raises a MemoryError. How can I solve this?

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# df is the DataFrame shown above (columns: URI, name, text)
cv = TfidfVectorizer(min_df=1, stop_words='english')
X_cv = cv.fit_transform(df['text'])  # sparse TF-IDF matrix, one row per document

# This line raises MemoryError: the result is a dense n_docs x n_docs array
cosine_sim = cosine_similarity(X_cv)

Tags: scikit-learn, tf-idf, cosine-similarity

Solution
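No answer is recorded on this page, so what follows is only a hedged sketch of one common workaround. The likely cause is that cosine_similarity(X_cv) returns a dense float64 array with one row and one column per document, which needs about 8 * n * n bytes (roughly 20 GB for 50,000 documents; the real row count is not given in the question). If only the most similar documents per row are needed, the similarities can be computed one block of rows at a time and reduced to the top k matches before moving on. The function name top_k_similar, its parameters k and chunk_size, and the toy sentences are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(X, k=5, chunk_size=1000):
    """Hypothetical helper: for every row of X, return the indices and
    cosine similarities of its k most similar other rows.

    Only a (chunk_size x n_rows) dense block is held in memory at a time,
    instead of the full (n_rows x n_rows) matrix. Assumes n_rows >= 2.
    """
    n = X.shape[0]
    k = min(k, n - 1)
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Dense block of similarities between this chunk and all rows.
        sims = cosine_similarity(X[start:stop], X)
        # Mask out each row's similarity with itself.
        sims[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Pick the k largest per row (unordered), then sort those k.
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part_sims = np.take_along_axis(sims, part, axis=1)
        order = np.argsort(-part_sims, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sims, order, axis=1)
    return top_idx, top_sim


# Toy stand-in for df['text'] (illustrative sentences, not the real data).
texts = [
    "australian rules footballer and coach",
    "blues singer and harmonica player",
    "retired australian rules footballer",
    "canadian singer and harmonica player",
]
X = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(texts)
idx, sim = top_k_similar(X, k=1, chunk_size=2)
print(idx.ravel())  # index of the nearest other document for each row
```

With the question's variables this would be called as top_k_similar(X_cv, k=10), choosing chunk_size so that chunk_size * n * 8 bytes fits comfortably in RAM. If the full matrix really is required, cosine_similarity(X_cv, dense_output=False) keeps the result sparse, though that only helps when most document pairs share no terms.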

