Unable to preprocess data for binary text classification with Naive Bayes and multiple features

Problem Description

I have a binary classification problem where I want to split my data into two groups: car companies and non-car companies. I crawled websites and extracted the following features (simplified):

  1. domain: the website I crawled
  2. asn: the autonomous system number of the server
  3. robots: whether the website has a robots.txt enabled
  4. email: the email address of the website owner
  5. diff_days_stand: the number of days the website has been online
  6. html_title: the parsed HTML title of the website

I tried a baseline model with X being "html_title" and y being "carcompany" and reached an accuracy of 0.95, which is very good. I chose ComplementNB over MultinomialNB because I know the final data to be classified will be imbalanced. I would like to add more features (columns) to the prediction, even though I know the conditional independence assumption may be violated.
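For context, a minimal sketch of what such a text-only baseline could look like (the dataframe name `data` and the split settings below are placeholders, not my exact setup):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

# `data` is a placeholder dataframe with the columns "html_title" and "carcompany"
X_train, X_test, y_train, y_test = train_test_split(
    data["html_title"], data["carcompany"], test_size=0.25, random_state=53)

cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)   # learn the vocabulary from the training titles
X_test_counts = cv.transform(X_test)         # reuse the same vocabulary for the test titles

cb = ComplementNB()
cb.fit(X_train_counts, y_train)
print(cb.score(X_test_counts, y_test))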

However, I cannot get the preprocessing (including the dataframe handling) to work. After reading up on NB again, I now have doubts, so my questions are:

  1. Can Naive Bayes be used with multiple features (columns)?
  2. Can Naive Bayes be used for text classification with mixed feature types (strings, integers, booleans)? What if I convert them all to strings?
  3. Is my code wrong? Where?

Thanks in advance :)

Import the packages

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

Create the data

dummy = {"domain":["a.de","b.de","c.de","d.de","e.de","f.de","g.de","h.de","i.de","j.de","k.de","l.de","m.de","n.de","o.de","p.de","q.de","r.de","s.de","t.de","u.de","v.de","w.de","x.de","y.de","z.de","aa.de","bb.de","cc.de"],
"asn":["123","789","491","238","148","369","123","458","231","549","894","153","654","658","987","369","258","147","852","963","741","652","365","547","785","985","589","632","456"],
"robots":["True","Test","False","True","False","False","False","False","True","False","False","True","False","True","True","Test","False","True","True","True","False","True","True","False","False","True","False","False","False"],
"email":["@a.de","@b.de","@c.de","@d.de","@e.de","@f.de","@g.de","@h.de","@i.de","@j.de","@k.de","@l.de","@m.de","@n.de","@o.de","@p.de","@q.de","@r.de","@s.de","@t.de","@u.de","@v.de","@w.de","@x.de","@y.de","@z.de","@aa.de","@bb.de","@cc.de"],
"diff_days_stand":["0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.1"],
"html_title":["audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes"]}
dummy = pd.DataFrame(dummy)
stopwords = ['a','ab','aber','ach','acht']

Convert the data to strings (not sure whether converting ints and booleans to strings is the right thing to do)

list1 = ['domain', 'asn', 'robots', 'email', 'diff_days_stand', 'html_title'] 
for i in list1:
    dummy[i] = dummy[i].astype(str)

Prepare the training data

train_t = dummy.loc[0:9,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
train_f = dummy.loc[10:19,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
rest    = dummy.loc[20:30, ("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()

train_t["carcompany"] = 1
train_f["carcompany"] = 0
train_tot = train_f.append(train_t)
train_tot = train_tot.drop(labels="index", axis=1)

y = train_tot["carcompany"]
X_train, X_test, y_train, y_test = train_test_split(train_tot, y , test_size=0.25, random_state=53)

This is where the problem occurs

cv = CountVectorizer(stop_words=stopwords)
X_train_transformed =  cv.fit_transform(X_train)
X_test_transformed = cv.transform(X_test)

X_train_transformed is only a 4x4 sparse matrix. It should be larger, with the additional features included.

cb = ComplementNB(alpha=1.0, fit_prior=True, class_prior=None, norm=False)
cb.fit(X_train_transformed, y_train, sample_weight=None)

pred = cb.predict(X_test_transformed)
score = cb.score(X_test_transformed, y_test)

Depending on what I tried, I also got the following messages:

ValueError: Found input variables with inconsistent numbers of samples: [7, 15]

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Tags: dataframe, scikit-learn, naive-bayes, countvectorizer

Solution


You need to vectorize your text data column (html_title, I believe), not the whole X_train. When you pass the entire DataFrame, CountVectorizer iterates over it and effectively sees the column names as the documents, which is why the resulting matrix is so small:

cv = CountVectorizer(stop_words=stopwords)
X_train_transformed = cv.fit_transform(X_train['html_title'])   # fit only on the text column

# turn the sparse count matrix into a dataframe with one column per vocabulary word
textual_feature = pd.DataFrame(X_train_transformed.toarray(),
                               columns=cv.get_feature_names_out())

Now add to this dataframe the other features that you think will improve the model's predictive power.
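For example, here is a rough sketch of one way to do that, using sparse matrices instead of the dense dataframe above (assumptions: X_train / X_test still contain the original columns, the extra columns are treated as categorical strings, and the columns listed in cat_cols are only an illustration):

import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

# vectorize the test titles with the vocabulary fitted above
X_test_transformed = cv.transform(X_test['html_title'])

# one-hot encode the extra (categorical string) columns, fitting on the training rows only
cat_cols = ['asn', 'robots', 'diff_days_stand']   # illustrative choice of extra columns
ohe = OneHotEncoder(handle_unknown='ignore')
cat_train = ohe.fit_transform(X_train[cat_cols])
cat_test = ohe.transform(X_test[cat_cols])

# place the word counts and the one-hot columns side by side
X_train_full = sp.hstack([X_train_transformed, cat_train]).tocsr()
X_test_full = sp.hstack([X_test_transformed, cat_test]).tocsr()

cb = ComplementNB(alpha=1.0)
cb.fit(X_train_full, y_train)
print(cb.score(X_test_full, y_test))

One-hot encoding keeps every feature non-negative, which is what ComplementNB (like MultinomialNB) expects; a standardized numeric column with negative values would need a different treatment, for example binning.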

