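python - Outlier prediction with categorical data in Python's scikit-learn library
Question
I am trying to make predictions on my own data. I am using the Python scikit-learn library with Isolation Forest as the algorithm. I don't know what I am doing wrong, but I always get an error when I try to transform my input data. The error is raised on this line: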
input_par = encoder.transform(val)#ERROR
This is the error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I also tried this, but I still get an error:
input_par = encoder.transform([val])#ERROR
This is the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
What am I doing wrong, and how can I fix this error? Also, should I use OneHotEncoder, LabelEncoder, or CountVectorizer?
Here is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3,3]
df = pd.DataFrame({'my text': textual_data,
                   'num data': num_data})
x = df
# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
#encoder = ColumnTransformer(transformers=[('lab', LabelEncoder(), ['my text'])])
x = encoder.fit_transform(x)
isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)
list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]
for val in list_of_val:
    input_par = encoder.transform(val)#ERROR
    outlier = model.predict(input_par)
    #print(outlier)
    if outlier[0] == -1:
        print('Values', val, 'are outliers')
    else:
        print('Values', val, 'are not outliers')
Edit:
I also tried this:
list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]
for val in list_of_val:
    input_par = encoder.transform(pd.DataFrame({'my text': val[0],
                                                'num data': val[1]}))
But I get this error:
ValueError: If using all scalar values, you must pass an index
Answer
I will try to list some observations you may find useful:
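- LabelEncoder, for example, can be used to convert non-numeric data into numeric labels. OneHotEncoder typically takes numeric or non-numeric data and converts it into a one-hot encoding. Both are typically used to preprocess "labels" (the classes of a supervised learning problem).
- As I understand it, you are trying to predict outliers (anomaly detection). It is not clear to me whether the connection between the utterances and the integers is simply hard-coded, or whether you want to generate that connection somehow. If the latter is what you want, you cannot achieve it with the encoders mentioned above, because you fit them on some data (which should normally be labels) and then try to transform new, unrelated data (ValueError: y contains previously unseen labels). This can, however, be worked around by setting OneHotEncoder's handle_unknown parameter to 'ignore' (from the documentation: "Whether to raise an error or ignore if an unknown categorical feature is present during transform"). Even if you could achieve what you want with one of these encoders, ...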
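The effect of handle_unknown can be checked in isolation. The following is a minimal sketch, not the answerer's code: the three training strings are taken from the question, and the variable names (strict, lenient, unseen) are made up for illustration. It assumes a scikit-learn version in which OneHotEncoder accepts string categories directly.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen while fitting: one string column with three rows.
train = np.array([["amazing"], ["wrong"], ["thats good"]])

strict = OneHotEncoder()                           # default: handle_unknown='error'
lenient = OneHotEncoder(handle_unknown="ignore")   # unseen categories are tolerated
strict.fit(train)
lenient.fit(train)

# An utterance that did not occur in the training data.
unseen = np.array([["good work"]])

try:
    strict.transform(unseen)
    strict_raised = False
except ValueError:
    strict_raised = True

# With handle_unknown='ignore', the unseen category is encoded as an all-zero row.
encoded = lenient.transform(unseen).toarray()
print(strict_raised)
print(encoded)
```

Note that an all-zero row carries no information about the text itself: every unseen utterance is encoded identically, which is one reason one-hot encoding whole sentences is a poor fit for this task.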
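I assume you assign high values to "negative" utterances (even though "wrong" does not correspond to 65 in your training data) and small values to "positive" ones. If you assume you already know the integer for each utterance, you can train the model on the examples considered "positive" and present the "negative" examples (the outliers) only at test time. You would not train an IsolationForest on both "positive" and "negative" examples; that would just be basic binary classification, which can be modeled with a decision tree. An intuitive example of IsolationForest can be seen here. Here is code for your problem: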
import numpy as np
from sklearn.ensemble import IsolationForest

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', ...]
# One integer per utterance, reshaped into a column because scikit-learn expects 2-D input
integer_connection = [1, 1, 2, 3, 2, 2, 3, 1, 3, 4, 1, 2, 1, 2, 1, 2, 1, 1]
integer_connection = np.array([[n] for n in integer_connection])

# On scikit-learn < 0.22, also pass behaviour='new'; the parameter was removed in 0.24
isolation_forest = IsolationForest(contamination='auto')
isolation_forest.fit(integer_connection)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
text_vals = [d[0] for d in list_of_val]
numeric_vals = np.array([[d[1]] for d in list_of_val])
print(integer_connection, numeric_vals)
# predict returns -1 for outliers and 1 for inliers
outliers = isolation_forest.predict(numeric_vals)
print(outliers)
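In general, I think your approach is not the right one for outlier prediction on natural-language utterances. For what you are trying to do in this particular example, I would recommend word-vector similarity from spaCy, or perhaps a simple bag-of-words approach.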
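The bag-of-words suggestion could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's code: it uses CountVectorizer (which the question asks about) for the text, appends the numeric column next to the word counts, and converts everything to a dense array, which is only reasonable for data this small. The names x_train, x_new and predictions, and the fixed random_state, are choices made here for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3, 3]

# Bag of words: one column per word in the training vocabulary.
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(textual_data).toarray()
num_features = np.array(num_data, dtype=float).reshape(-1, 1)
x_train = np.hstack([text_features, num_features])

model = IsolationForest(contamination='auto', random_state=0)
model.fit(x_train)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
# transform() reuses the fitted vocabulary; words never seen in training are dropped.
new_text = vectorizer.transform([text for text, _ in list_of_val]).toarray()
new_num = np.array([[num] for _, num in list_of_val], dtype=float)
x_new = np.hstack([new_text, new_num])

predictions = model.predict(x_new)  # -1 marks an outlier, 1 an inlier
for val, pred in zip(list_of_val, predictions):
    print('Values', val, 'are outliers' if pred == -1 else 'are not outliers')
```

Unlike one-hot encoding whole sentences, new utterances here share features with the training data whenever they share words, so "this was amazing" is not treated as a completely unknown category.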
If you do not care about any of these points and just want working code, here is my version of what you are trying to do:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
encodings = {}
num_data = [4, 1, 3, 2, 65, 3, 3]

# One-hot encode the (text, number) pairs; values unseen during fit encode to all zeros
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehots = onehot_encoder.fit_transform(np.array([[utt, no] for utt, no in zip(textual_data, num_data)]))
# Keep a lookup from each original pair to its encoded row
for i, l in enumerate(onehots):
    original_label = (textual_data[i], num_data[i])
    encodings[original_label] = l
print(encodings)

# On scikit-learn < 0.22, also pass behaviour='new'; the parameter was removed in 0.24
isolation_forest = IsolationForest(contamination='auto')
model = isolation_forest.fit(onehots)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
test_encoded = onehot_encoder.transform(np.array(list_of_val))
print(test_encoded)
outliers = isolation_forest.predict(test_encoded)
print(outliers)
for i, outlier in enumerate(outliers):
    if outlier == -1:
        print('Values', list_of_val[i], 'are outliers')
    else:
        print('Values', list_of_val[i], 'are not outliers')
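For completeness, the three errors in the question all come from how the new samples are passed to the fitted ColumnTransformer. A flat list such as ['good work', 2] is 1-D, hence the reshape message; [val] is 2-D but has no column names, so the string selector 'my text' cannot be resolved; and pd.DataFrame({'my text': val[0], 'num data': val[1]}) is built from scalars only, which pandas rejects without an index. The sketch below is one possible fix and assumes the question's setup is otherwise kept: the new samples are wrapped in a DataFrame with the same column names, and handle_unknown='ignore' is added so unseen utterances do not raise. The names new_df, encoded_new and predictions, and the fixed random_state, are introduced here for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3, 3]
columns = ['my text', 'num data']
df = pd.DataFrame({'my text': textual_data, 'num data': num_data})

# Same transformer as in the question, except that unseen categories are ignored.
encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['my text'])],
    remainder='passthrough',
)
x = encoder.fit_transform(df)
model = IsolationForest(contamination='auto', random_state=0).fit(x)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
# A DataFrame with the training column names lets the string selector 'my text' work.
new_df = pd.DataFrame(list_of_val, columns=columns)
encoded_new = encoder.transform(new_df)
predictions = model.predict(encoded_new)  # -1 marks an outlier, 1 an inlier

for val, pred in zip(list_of_val, predictions):
    print('Values', val, 'are outliers' if pred == -1 else 'are not outliers')
```

Because none of the three test utterances occurs in the training data, their one-hot part is all zeros and the prediction effectively rests on the numeric column alone.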