machine-learning - How do I find the 15 most frequently used words in spam emails?
Problem description
I have trained a linear SVM to classify emails as spam or non-spam based on their words. First, I convert each email into processed text with the following code:
import re

def processEmail(email):
    email = email.lower()
    # replace strings like <html> with a space
    email = re.sub(r"<[^<>]+>", " ", email)
    # replace numbers with the string "number"
    email = re.sub(r"[0-9]+", "number", email)
    # replace anything that starts with http:// or https:// with httpaddr
    email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)
    # replace strings with @ in the middle with emailaddr
    email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)
    # replace $ with dollar
    email = re.sub(r"[$]+", "dollar", email)
    # remove >, commas and question marks
    email = re.sub(r"[>,?]", "", email)
    print("--------------------------------Pre-processed Email------------------------")
    print(email)
    return email
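A quick sanity check of these substitutions on a sample string (the sample text below is made up for illustration, not taken from the email set):

```python
import re

# Illustrative sample input; apply the same substitutions in the same order
email = "<html>Visit http://spam.example.com NOW for $100! Reply to win@ex.com</html>".lower()
email = re.sub(r"<[^<>]+>", " ", email)                 # strip HTML tags
email = re.sub(r"[0-9]+", "number", email)              # normalize numbers
email = re.sub(r"(http|https)://[^\s]*", "httpaddr", email)  # normalize URLs
email = re.sub(r"[^\s]+@[^\s]+", "emailaddr", email)    # normalize email addresses
email = re.sub(r"[$]+", "dollar", email)                # normalize dollar signs
print(email)
```

Note that the number substitution runs before the URL one, so digits inside a URL would be normalized first; here `$100` becomes `$number` and then `dollarnumber`.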
I have a bag of words / vocabulary of common words, which I convert into a dictionary using:
def getVocabDict():
    vocab_txt = open("C:/Users/dynam/Desktop/Coursera AndrewNg/machine-learning-ex6/machine-learning-ex6/ex6/vocab.txt","r")
    vocab_dict = {}
    for line in vocab_txt:
        (key, val) = line.split()  # default splitting is on whitespace
        vocab_dict.update({key: val})
    return vocab_dict
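For reference, each line of the Coursera ex6 vocab.txt has the form `<index> <word>`, so the dictionary ends up keyed by index strings. A tiny made-up sample mirroring that format:

```python
# Two hypothetical vocab.txt lines ("<index> <word>"); the real file is the one above
sample = "1 aa\n2 ab\n"
vocab_dict = {}
for line in sample.splitlines():
    key, val = line.split()   # key = index string, val = word
    vocab_dict[key] = val
print(vocab_dict)
```

This matters later: looking words up requires the index string as the key, not the word or a coefficient value.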
After this, I convert the email into tokens using:
import nltk

def email2Token(Iemail):
    # initialize the stemmer
    stemmer = nltk.stem.porter.PorterStemmer()
    email = processEmail(Iemail)
    # split the email into individual words
    tokens = re.split(r"[ @$/#.\-:&*+=\[\]?!(){},'\">_<;%\n]", email)
    print("------------------------Email after splitting into individual words/tokens------------------")
    print(tokens)
    # apply the Porter stemmer to each word
    stemmed_tokens = []
    for token in tokens:
        stemmed_token = stemmer.stem(token)
        stemmed_tokens.append(stemmed_token)
        print("---------stemmed token-------------")
        print(stemmed_token)
    return stemmed_tokens
Then I convert the email into a feature vector, where each element indicates whether the corresponding word of my vocabulary dictionary appears in the email:
import numpy as np

def email2featureVec(Iemail, vocab_dict):
    n = len(vocab_dict)
    emailrec = email2Token(Iemail)
    print("---------The tokens received by the feature vector converter-----------")
    print(emailrec)
    email_feature = np.zeros((n, 1))
    # map each vocabulary word to its 0-based slot (vocab.txt indices are 1-based)
    word2idx = {word: int(idx) - 1 for idx, word in vocab_dict.items()}
    for token in emailrec:
        if token in word2idx:
            email_feature[word2idx[token], 0] = 1
    print("--------------------------Email feature vec----------------------------------")
    print(email_feature)
    return email_feature
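A toy sketch of the intended mapping (each vocabulary word owns a fixed slot in the vector, and a token in the email switches that slot on), using a made-up 3-word vocabulary and made-up tokens:

```python
import numpy as np

# Hypothetical vocabulary in the same {"index": "word"} shape as getVocabDict returns
vocab_dict = {"1": "buy", "2": "now", "3": "meeting"}
# invert it: word -> 0-based slot in the feature vector
word2idx = {word: int(idx) - 1 for idx, word in vocab_dict.items()}

tokens = ["buy", "now", "unknownword"]   # stand-in for the stemmed tokens
features = np.zeros((len(vocab_dict), 1))
for t in tokens:
    if t in word2idx:                    # words outside the vocabulary are ignored
        features[word2idx[t], 0] = 1
print(features.flatten())                # [1. 1. 0.]
```

The slot is determined by the word's vocabulary index, not by the word's position in the email.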
Finally, I create a linear SVM model and train it on the training set X and its labels y:
from sklearn import svm

# creating an instance of an SVM with C = 0.1
linear_svm = svm.SVC(C=0.1, kernel="linear")
# fitting the SVM to our X matrix given labels y
linear_svm.fit(X, y.flatten())
Now, how can I get the 15 most important words for classifying spam? I think I have to use the coefficients to find them, but my coefficients are:
for i in linear_svm.coef_:
    for j in i:
        print(j)
0.007932077307221794
0.015633235616866917
0.055464916277558125
-0.013416103446075411
-0.06619756700850743
0.03659516600411697
0.18337597875664702
-0.02488628335729145 and so on ........
I tried:
sorted_arr = np.sort(linear_svm.coef_, axis=None)[::-1]
for i in sorted_arr:
    print(vocab_dict[(i)])
But this raises an error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-32-9027571acfa4> in <module>()
1 sorted_arr = np.sort(linear_svm.coef_,axis = None)[::-1]
2 for i in sorted_arr:
----> 3 print(vocab_dict[(i)])
KeyError: 0.5006137361746403
Solution
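The `KeyError` happens because `np.sort` returns the coefficient *values*, while `vocab_dict` is keyed by index strings. Use `np.argsort` to get the *indices* of the largest coefficients and map those back to words. A minimal sketch with a made-up five-word vocabulary and stand-in coefficients (with the real model, replace `coefs` with `linear_svm.coef_` and take `[:15]` instead of `[:3]`):

```python
import numpy as np

# Hypothetical vocabulary in the {"index": "word"} shape returned by getVocabDict
vocab_dict = {"1": "buy", "2": "now", "3": "hello", "4": "meeting", "5": "free"}
coefs = np.array([[0.8, 0.5, -0.3, -0.6, 0.9]])    # stand-in for linear_svm.coef_

# indices of the largest positive coefficients = most spam-indicative features
top_idx = np.argsort(coefs.flatten())[::-1][:3]    # use [:15] for the top 15
# vocab.txt indices are 1-based, so column i of coef_ maps to key str(i + 1)
top_words = [vocab_dict[str(i + 1)] for i in top_idx]
print(top_words)
```

Because the keys of `vocab_dict` are the 1-based indices from vocab.txt, `str(i + 1)` converts a 0-based column of `coef_` back into its dictionary key. Sorting descending puts the most positive coefficients (the strongest spam indicators for the positive class) first.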