Outlier prediction with categorical data in Python's scikit-learn library

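Question

I am trying to make predictions on my own input values. I am using Python's scikit-learn library with Isolation Forest as the algorithm. I don't know what I am doing wrong, but whenever I try to transform my input data I get an error. The error occurs on this line: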

    input_par = encoder.transform(val)#ERROR

This is the error: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I also tried this, but I still get an error:

    input_par = encoder.transform([val])#ERROR

This is the error: ValueError: Specifying the columns using strings is only supported for pandas DataFrames

What am I doing wrong, and how can I fix this error? Also, should I use OneHotEncoder, LabelEncoder, or CountVectorizer?

Here is my code:

import pandas as pd

from sklearn.ensemble import IsolationForest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3,3]

df = pd.DataFrame({'my text': textual_data,
                   'num data': num_data})
x = df

# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
#encoder = ColumnTransformer(transformers=[('lab', LabelEncoder(), ['my text'])])

x = encoder.fit_transform(x)

isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)

list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

for val in list_of_val:

    input_par = encoder.transform(val)#ERROR

    outlier = model.predict(input_par)
    #print(outlier)

    if outlier[0] == -1:
        print('Values', val, 'are outliers')

    else:
        print('Values', val, 'are not outliers')

Edit:

I also tried this:

list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

for val in list_of_val:

    input_par = encoder.transform(pd.DataFrame({'my text': val[0],
                                               'num data': val[1]}))

But I get this error:

ValueError: If using all scalar values, you must pass an index

Tags: python, scikit-learn

Answer


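I'll try to list some observations you might find useful:

  • LabelEncoder converts non-numeric values into integer labels; scikit-learn intends it for target labels (y) rather than input features. OneHotEncoder takes numeric or non-numeric categorical data and turns it into a one-hot encoding. Both are meant for values drawn from a fixed set of categories (such as the classes of a supervised learning problem), not for free text.
  • As I understand it, you are trying to predict outliers (anomaly detection). It is not clear to me whether the connection between the utterances and the integers is simply hard-coded or whether you want to generate it somehow. If it is the latter, you cannot do that with the encoders mentioned above, because you fit them on some data (which normally should be labels) and then try to transform new, unrelated data (ValueError: y contains previously unseen labels). You can, however, get around that error by setting OneHotEncoder's handle_unknown parameter to 'ignore' (from the documentation: "Whether to raise an error or ignore if an unknown categorical feature is present during transform"). Even if you could achieve what you want with one of these encoders, ...
  • I assume you attach high values to "negative" utterances (even though 'wrong' does not correspond to 65 in your training data) and small values to "positive" ones. If you assume the integer for each utterance is already known, you can train the model on examples considered "positive" and present the "negative" examples (outliers) only at test time. You would not train an IsolationForest on both "positive" and "negative" examples; that would be plain binary classification, which a decision tree could model. An intuitive IsolationForest example is available in the scikit-learn documentation. The following code applies this to your problem: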

    import numpy as np
    from sklearn.ensemble import IsolationForest
    
    textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', ...]
    # One known integer per utterance, reshaped to (n_samples, 1) as scikit-learn expects
    integer_connection = [1, 1, 2, 3, 2, 2, 3, 1, 3, 4, 1, 2, 1, 2, 1, 2, 1, 1]
    integer_encoded = np.array([[n] for n in integer_connection])
    
    # Fit only on the integers of the "positive" (normal) examples
    isolation_forest = IsolationForest(contamination='auto')
    isolation_forest.fit(integer_encoded)
    
    list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
    
    # Split the new samples into their text part and their numeric part
    text_vals = [d[0] for d in list_of_val]
    numeric_vals = np.array([[d[1]] for d in list_of_val])
    
    print(integer_encoded, numeric_vals)
    
    # predict returns -1 for outliers and 1 for inliers
    outliers = isolation_forest.predict(numeric_vals)
    print(outliers)
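
The unseen-category behaviour described in the second observation can be checked directly. This is a minimal sketch, assuming a small made-up training list (`train`) that is not part of the original question: LabelEncoder raises on a value it did not see during fit, while OneHotEncoder with handle_unknown='ignore' encodes it as an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical training categories, chosen only for illustration
train = ['amazing', 'wrong', 'thats good']

# LabelEncoder refuses to transform a value it has not seen during fit
label_encoder = LabelEncoder().fit(train)
try:
    label_encoder.transform(['good work'])
    label_error = None
except ValueError as exc:
    label_error = str(exc)
print(label_error)

# OneHotEncoder expects a 2D array of shape (n_samples, n_features)
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehot_encoder.fit(np.array(train).reshape(-1, 1))

# The unseen 'good work' becomes a row of zeros; the known 'wrong' gets its usual column
new_values = np.array(['good work', 'wrong']).reshape(-1, 1)
encoded = onehot_encoder.transform(new_values).toarray()
print(encoded)
```

An all-zero row avoids the exception, but it also means every unseen utterance looks identical to the model, which is one reason these encoders are a poor fit for free text.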
    
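  • In general, I don't think your approach is right for outlier prediction on natural-language utterances. For what you are trying to do in this particular example, I would recommend word-vector similarity from spaCy, or perhaps a simple bag-of-words approach.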
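
A bag-of-words variant could look roughly like the sketch below. It assumes CountVectorizer with default settings and a fixed random_state for repeatability; neither choice comes from the original post. With only seven training sentences the predictions are not meaningful, so treat this purely as an illustration of the mechanics.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good']

# Turn each utterance into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(textual_data)

# Fit the forest on the word-count vectors (random_state is an assumption for repeatability)
forest = IsolationForest(contamination='auto', random_state=0)
forest.fit(X_train)

# New utterances are mapped onto the vocabulary learned above;
# words outside that vocabulary are silently dropped
new_texts = ['good work', 'you are wrong', 'this was amazing']
X_new = vectorizer.transform(new_texts)

predictions = forest.predict(X_new)  # -1 marks an outlier, 1 an inlier
for text, label in zip(new_texts, predictions):
    print(text, '->', 'outlier' if label == -1 else 'inlier')
```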

  • If you don't care about any of these points and just want working code, here is my version of what you are trying to do:

    import numpy as np
    
    from sklearn.ensemble import IsolationForest
    from sklearn.preprocessing import OneHotEncoder
    
    
    textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
    
    
    encodings = {}
    
    num_data = [4, 1, 3, 2, 65, 3, 3]
    
    
    # One-hot encode both columns; the mixed array is stored as strings,
    # so the numbers are treated as categories as well
    onehot_encoder = OneHotEncoder(handle_unknown='ignore')
    onehots = onehot_encoder.fit_transform(np.array([[utt, no] for utt, no in zip(textual_data, num_data)]))
    
    # Remember which encoded row belongs to which original (text, number) pair
    for i, l in enumerate(onehots):
        original_label = (textual_data[i], num_data[i])
        encodings[original_label] = l
    
    print(encodings)
    
    isolation_forest = IsolationForest(contamination='auto')
    model = isolation_forest.fit(onehots)
    
    list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
    
    
    # Unseen categories are encoded as all zeros because of handle_unknown='ignore'
    test_encoded = onehot_encoder.transform(np.array(list_of_val))
    print(test_encoded)
    
    outliers = isolation_forest.predict(test_encoded)
    print(outliers)
    
    for i, outlier in enumerate(outliers):
        if outlier == -1:
            print('Values', list_of_val[i], 'are outliers')
    
        else:
            print('Values', list_of_val[i], 'are not outliers')
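
For the specific errors in the question, the likely cause is the shape and type of what is passed to encoder.transform. A ColumnTransformer that selects columns by name needs a pandas DataFrame with those same column names. A plain list such as val is one-dimensional, [val] has no column names, and pd.DataFrame built from scalars needs the values wrapped in lists (or an index). The sketch below keeps the question's ColumnTransformer and builds a DataFrame for the new samples. Adding handle_unknown='ignore' and random_state=0 are assumptions here, needed because the new utterances were never seen during fit and to make runs repeatable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

columns = ['my text', 'num data']
textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3, 3]
df = pd.DataFrame({'my text': textual_data, 'num data': num_data})

# One-hot encode the text column and pass the numeric column through unchanged
encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['my text'])],
    remainder='passthrough',
)
x = encoder.fit_transform(df)

model = IsolationForest(contamination='auto', random_state=0).fit(x)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]

# Build a DataFrame with the same column names used during fit
new_df = pd.DataFrame(list_of_val, columns=columns)
encoded_new = encoder.transform(new_df)

outliers = model.predict(encoded_new)
for val, outlier in zip(list_of_val, outliers):
    if outlier == -1:
        print('Values', val, 'are outliers')
    else:
        print('Values', val, 'are not outliers')
```

To transform one sample at a time inside the original loop, pd.DataFrame([val], columns=columns) produces the one-row DataFrame the encoder expects.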
    
