Complement Naive Bayes and weighted class in sklearn

Problem description

I'm trying to implement a complement naive Bayes classifier using sklearn. My data has very imbalanced classes (30k samples of class 0 and 6k samples of class 1), and I'm trying to compensate for this using class weights.

Here is the shape of my dataset:

[image: shape of the dataset]

I tried to use the compute_class_weight function to calculate the weights and then pass them to the fit function when training my model:

import numpy as np
import seaborn as sn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.naive_bayes import ComplementNB

#Import the csv data
data = pd.read_csv('output_pt900.csv')

#Create the header of the csv file
header = []

for x in range(0,2500):
    header.append('pixel' + str(x))
header.append('status')

#Add the header to the csv data
data.columns = header

#Replace the b's and the f's in the status column by 0 and 1 
data['status'] = data['status'].replace('b',0)
data['status'] = data['status'].replace('f',1)

print(data)

#Drop the NaN values
data = data.dropna()

#Separate the features variables and the status
y = data['status']
x = data.drop('status',axis=1)

#Split the original dataset into two other: train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

all_together = y_train.to_numpy()
unique_classes = np.unique(all_together)

c_w = class_weight.compute_class_weight('balanced', unique_classes, all_together)

clf = ComplementNB()

clf.fit(x_train,y_train, c_w)

y_predict = clf.predict(x_test)

cm = confusion_matrix(y_test, y_predict)

svm = sn.heatmap(cm, cmap='Blues', annot=True, fmt='g')
figure=svm.get_figure()
figure.savefig('confusion_matrix_cnb.png', dpi=400)
plt.show()

but I got this error:

ValueError: sample_weight.shape == (2,), expected (29752,)!

Does anyone know how to use class weights in sklearn models?

Tags: python, machine-learning, scikit-learn, naivebayes

Solution


compute_class_weight returns an array whose length equals the number of unique classes; each entry is the weight to assign to instances of the corresponding class. So with 2 unique classes, c_w has length 2 and holds the weights for samples labeled 0 and 1, in that order.
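To make those two numbers concrete: scikit-learn documents the 'balanced' heuristic as n_samples / (n_classes * count of each class). The following is a minimal pure-Python sketch of that arithmetic, using the rounded class counts from the question (30k and 6k). Those counts are an assumption for illustration; the actual training split will differ slightly.

```python
from collections import Counter

# Rounded class counts from the question; assumed for illustration only
labels = [0] * 30000 + [1] * 6000

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

print(balanced_weights)  # {0: 0.6, 1: 3.0}
```

The minority class gets the larger weight, so one weight per class is exactly what you would expect here, not one per sample.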

When calling fit, the sample_weight argument expects one weight per training sample, which explains the error: you passed 2 weights for 29752 samples. To fix this, use the c_w array returned by compute_class_weight to build an array of per-sample weights, for example with [c_w[i] for i in all_together]. This indexing works here because the labels are the integers 0 and 1, which match the positions in c_w. Your fit call would then look something like:

clf.fit(x_train, y_train, sample_weight=[c_w[i] for i in all_together])
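For a self-contained check of this approach, here is a hedged sketch on synthetic data. The array shapes, the random features, and the names weight_of and sample_weight are made up for illustration and are not from the question. It passes classes and y by keyword, which recent scikit-learn versions require for compute_class_weight, and it compares the manual expansion against compute_sample_weight, which builds the per-sample array in one call.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

rng = np.random.default_rng(0)

# Synthetic stand-in for the pixel data: non-negative features, imbalanced labels
X = rng.integers(0, 256, size=(120, 5))
y = np.array([0] * 100 + [1] * 20)

classes = np.unique(y)
c_w = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Map each label to its class weight; a dict avoids relying on labels being 0..n-1
weight_of = dict(zip(classes, c_w))
sample_weight = np.array([weight_of[label] for label in y])

# compute_sample_weight performs the same expansion in a single call
assert np.allclose(sample_weight, compute_sample_weight("balanced", y))

clf = ComplementNB()
clf.fit(X, y, sample_weight=sample_weight)
```

Using a label-to-weight dictionary rather than c_w[i] is a design choice for safety: it keeps working if the labels are strings such as 'b' and 'f' or integers that do not start at 0.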
