dataframe - Cannot preprocess data for binary text classification with Naive Bayes and multiple features
Problem description
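I have a binary classification problem: I want to split my data into two groups, car companies and non-car companies. I crawled websites and extracted the following features (simplified):
- domain: the website I crawled
- asn: the autonomous system number of the server
- robots: whether the website has a robots.txt enabled
- email: the email address of the website owner
- diff_days_stand: the number of days the website has been online
- html_title: the parsed HTML title of the website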
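I tried a baseline model where X is "html_title" and y is "carcompany", and it reached an accuracy of 0.95, which is quite good. I chose Complement NB over Multinomial NB because I know the final data to be classified will be imbalanced. I would like to add more features (columns) to the prediction, even though I know the conditional independence assumption may be violated.
However, I can't get the preprocessing (including the dataframe) to work. After reading up on NB again I now have doubts, so my questions are:
- Can Naive Bayes be used with multiple features (columns)?
- Can Naive Bayes be used for text classification with features of mixed types (string, integer, boolean)? What if I convert them all to strings?
- Is my code wrong? Where?
Thanks in advance :)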
Importing packages
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
Creating the data
dummy = {"domain":["a.de","b.de","c.de","d.de","e.de","f.de","g.de","h.de","i.de","j.de","k.de","l.de","m.de","n.de","o.de","p.de","q.de","r.de","s.de","t.de","u.de","v.de","w.de","x.de","y.de","z.de","aa.de","bb.de","cc.de"],
"asn":["123","789","491","238","148","369","123","458","231","549","894","153","654","658","987","369","258","147","852","963","741","652","365","547","785","985","589","632","456"],
"robots":["True","Test","False","True","False","False","False","False","True","False","False","True","False","True","True","Test","False","True","True","True","False","True","True","False","False","True","False","False","False"],
"email":["@a.de","@b.de","@c.de","@d.de","@e.de","@f.de","@g.de","@h.de","@i.de","@j.de","@k.de","@l.de","@m.de","@n.de","@o.de","@p.de","@q.de","@r.de","@s.de","@t.de","@u.de","@v.de","@w.de","@x.de","@y.de","@z.de","@aa.de","@bb.de","@cc.de"],
"diff_days_stand":["0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.1"],
"html_title":["audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes"]}
dummy = pd.DataFrame(dummy)
stopwords = ['a','ab','aber','ach','acht']
Converting the data to strings (not sure whether converting ints and booleans to strings is correct)
list1 = ['domain', 'asn', 'robots', 'email', 'diff_days_stand', 'html_title']
for i in list1:
    dummy[i] = dummy[i].astype(str)
Preparing the training data
train_t = dummy.loc[0:9,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
train_f = dummy.loc[10:19,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
rest = dummy.loc[20:30, ("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
train_t["carcompany"] = 1
train_f["carcompany"] = 0
train_tot = pd.concat([train_f, train_t])  # DataFrame.append was removed in pandas 2.0
train_tot = train_tot.drop(labels="index", axis=1)
y = train_tot["carcompany"]
X_train, X_test, y_train, y_test = train_test_split(train_tot, y , test_size=0.25, random_state=53)
This is where the problem is
cv = CountVectorizer(stop_words=stopwords)
X_train_transformed = cv.fit_transform(X_train)
X_test_transformed = cv.transform(X_test)
X_train is a 4x4 sparse matrix. It should be bigger, with the additional features.
cb = ComplementNB(alpha=1.0, fit_prior=True, class_prior=None, norm=False)
cb.fit(X_train_transformed, y_train, sample_weight=None)
pred = cb.predict(X_test_transformed)
score = cb.score(X_test_transformed, y_test)
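Depending on what I tried, I also got the following messages:
ValueError: Found input variables with inconsistent numbers of samples: [7, 15]
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
AttributeError: 'numpy.ndarray' object has no attribute 'lower'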
Solution
You need to vectorize only your text column (html_title, I assume), not the whole X_train.
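As background, here is a minimal sketch of why passing the whole frame goes wrong. Iterating a pandas DataFrame yields its column labels, so CountVectorizer treats each column name as one document, which is consistent with the 7 in the ValueError above. The two-column frame below is a hypothetical stand-in for X_train, not the question's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-column frame standing in for X_train.
frame = pd.DataFrame({
    "domain": ["a.de", "b.de", "c.de"],
    "html_title": ["audi bmw mercedes", "apple dell acer", "audi bmw mercedes"],
})

# Iterating a DataFrame yields column labels, not rows.
print(list(frame))  # ['domain', 'html_title']

# So the vectorizer sees two "documents": the two column names.
whole = CountVectorizer().fit_transform(frame)
print(whole.shape)  # (2, 2)

# Selecting the text column gives one document per row.
title_only = CountVectorizer().fit_transform(frame["html_title"])
print(title_only.shape)  # (3, 6)
```

Selecting the single text column, as the code that follows does, gives one row per website.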
cv = CountVectorizer(stop_words=stopwords)
X_train_transformed = cv.fit_transform(X_train['html_title'])
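# Keep X_train's index so that columns added from X_train later align row by row.
textual_feature = pd.DataFrame(X_train_transformed.toarray(), columns=cv.get_feature_names_out(), index=X_train.index)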
Now add to this dataframe the other features that you think can improve the model's predictive power.
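One possible way to combine the text column with the other columns, instead of concatenating dataframes by hand, is scikit-learn's ColumnTransformer. The sketch below is not part of the original answer. It assumes that asn and robots are treated as categories (one-hot encoded) and that diff_days_stand is numeric and scaled into [0, 1], because ComplementNB rejects negative inputs. The toy data and the names toy, preprocess, model and new_sites are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the question's data (values are illustrative only).
toy = pd.DataFrame({
    "asn": ["123", "789", "123", "458", "789", "458"],
    "robots": ["True", "False", "True", "False", "False", "True"],
    "diff_days_stand": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "html_title": ["audi bmw mercedes"] * 3 + ["apple dell acer"] * 3,
    "carcompany": [1, 1, 1, 0, 0, 0],
})
X = toy.drop(columns="carcompany")  # keep the label out of the features
y = toy["carcompany"]

preprocess = ColumnTransformer([
    # A plain string selects one column as a 1-D series, which is
    # what CountVectorizer expects.
    ("title", CountVectorizer(), "html_title"),
    # Categorical columns become non-negative indicator columns;
    # categories unseen during fit are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["asn", "robots"]),
    # Scale the numeric column into [0, 1] and clip new values to that range.
    ("num", MinMaxScaler(clip=True), ["diff_days_stand"]),
])

model = make_pipeline(preprocess, ComplementNB())
model.fit(X, y)

# A hypothetical unseen website with an ASN that was not in the training data.
new_sites = pd.DataFrame({
    "asn": ["999"],
    "robots": ["True"],
    "diff_days_stand": [0.5],
    "html_title": ["bmw audi dealer"],
})
pred = model.predict(new_sites)
print(pred)
```

Whether the extra columns actually help is an empirical question. Comparing cross-validated scores with and without them would be a reasonable check, since the independence assumption mentioned in the question may not hold.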