Passing pandas NumPy arrays as feature vectors in scikit learn?

Problem description

I have a vector of 5 different values that I use as my sample value, and the label is a single integer of 0, 1, or 3. The machine learning algorithms work when I pass an array as a sample, but I get this warning. How do I pass feature vectors without getting this warning?

import numpy as np
from numpy import random

from sklearn import neighbors
from sklearn.model_selection import train_test_split
import pandas as pd

filepath = 'test.csv'

# example label values
index = [0,1,3,1,1,1,0,0]

# example sample arrays
data = []
for i in range(len(index)):
    d = []
    for i in range(6):
        d.append(random.randint(50,200))
    data.append(d)

feat1 = 'brightness'
feat2, feat3, feat4 = ['h', 's', 'v']
feat5 = 'median hue'
feat6 = 'median value'

features = [feat1, feat2, feat3, feat4, feat5, feat6]

df = pd.DataFrame(data, columns=features, index=index)
df.index.name = 'state'

with open(filepath, 'a') as f:
    df.to_csv(f, header=f.tell() == 0)

states = pd.read_csv(filepath, usecols=['state'])

df_partial = pd.read_csv(filepath, usecols=features)

states = states.astype(np.float32)
states = states.values
labels = states

samples = np.array([])
for i, row in df_partial.iterrows():
    r = row.values
    samples = np.vstack((samples, r)) if samples.size else r

n_neighbors = 5

test_size = .3
labels, test_labels, samples, test_samples = train_test_split(labels, samples, test_size=test_size)
clf1 = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf1 = clf1.fit(samples, labels)

score1 = clf1.score(test_samples, test_labels)

print("Here's how the models performed \nknn: %d %%" %(score1 * 100))

Warning:

"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). clf1 = clf1.fit(samples, labels)"

sklearn documentation for fit(self, X, Y)

Tags: pandas, scikit-learn, python-3.5

Solution


Try replacing

states = states.values with states = states.values.flatten()

or

clf1 = clf1.fit(samples, labels) with clf1 = clf1.fit(samples, labels.flatten()).

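states = states.values keeps the correct labels from your pandas DataFrame, but each label sits in its own row, so the result is a column vector of shape (n_samples, 1). Calling .flatten() puts all of the labels in a single row, which gives the 1-D array of shape (n_samples,) that the warning asks for. (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.ndarray.flatten.html)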
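The shape difference can be checked directly. The snippet below is a minimal sketch that assumes a small hand-built DataFrame as a stand-in for the 'state' column read from test.csv; the variable names column, flat, raveled and series are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv(filepath, usecols=['state']).
states = pd.DataFrame({"state": [0, 1, 3, 1, 1, 1, 0, 0]})

column = states.values            # 2-D column vector: one label per row
flat = states.values.flatten()    # 1-D copy of the same labels
raveled = states.values.ravel()   # also 1-D; this is what the warning text suggests
series = states["state"].values   # selecting the single column is 1-D from the start

print(column.shape)   # (8, 1)
print(flat.shape)     # (8,)
```

All three 1-D variants hold the same labels, so any of them should be usable as y.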

In the scikit-learn documentation for KNeighborsClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), the example shows the labels stored in a single row: y = [0, 0, 1, 1]
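To confirm that a flattened y fits without the warning, any DataConversionWarning raised during fit can be recorded and counted. This is a rough sketch, not the asker's exact pipeline: the data is randomly generated, n_neighbors=3 is an arbitrary choice, and fit_and_collect is a hypothetical helper written only for this check.

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.neighbors import KNeighborsClassifier

# Made-up data shaped like the question's: 8 samples with 6 features each.
rng = np.random.RandomState(0)
samples = rng.randint(50, 200, size=(8, 6)).astype(np.float64)
# Labels as a column vector of shape (8, 1), as states.values produces.
labels = np.array([[0], [1], [3], [1], [1], [1], [0], [0]], dtype=np.float32)


def fit_and_collect(y):
    """Fit a KNN classifier on y and return it with any DataConversionWarning raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(samples, y)
    relevant = [w for w in caught if issubclass(w.category, DataConversionWarning)]
    return clf, relevant


_, column_warnings = fit_and_collect(labels)            # column-vector y
clf, flat_warnings = fit_and_collect(labels.flatten())  # 1-D y

print("warnings with column vector:", len(column_warnings))
print("warnings with flattened y:", len(flat_warnings))
```

The prediction step is unaffected by the change: clf.predict still returns one label per sample.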

