python - ValueError: X has 57 features per sample; expecting 283 in text classification using ML
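Problem description
I am trying to use this algorithm to classify new data, but it fails because the shape the model was trained on differs from the shape of the data I want to test. How can I fix this?
Text classification using machine learning (songs: Beatles vs. Rolling Stones)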
dataset_url = "https://raw.githubusercontent.com/suzanasvm/MachineLearningProjects/master/conteudo.csv"
import pandas as pd
from collections import Counter
import numpy as np
import sklearn
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings("ignore")
#Reading CSV
classificacoes = pd.read_csv('conteudo.csv', sep=',', encoding='latin-1')
classificacoes = sklearn.utils.shuffle(classificacoes)
#1=Rolling Stones, 2=Rolling Stones
print(classificacoes)
################################################# ################################################# ##########
#Placing texts in a variable
textos = classificacoes['texto']
print(textos)
#Turning words into strings
#each line refers to an array with each of the words separated
palavrasIsoladas = textos.str.lower().str.split()
print(palavrasIsoladas)
#Creating a single array with all unique words (dictionary)
dicionario = set()
for lista in palavrasIsoladas:
    dicionario.update(lista)
print(dicionario)
#prints the total of different words
totalDePalavras = len(dicionario)
print(totalDePalavras)
#Associates each word to a position and stores it in a mapped dictionary
palavraEposicao = dict(zip(dicionario, range(totalDePalavras)))
print(palavraEposicao)
#Function that counts the presence of each single word present in the mapped dictionary, in the entire text
def vetorizarPresencaPalavras(texto, palavraEposicao):
    vetor = [0] * len(palavraEposicao)
    for palavra in texto:
        if palavra in palavraEposicao:
            posicao = palavraEposicao[palavra]
            vetor[posicao] += 1
    return vetor
vetoresDeTexto = [vetorizarPresencaPalavras(texto, palavraEposicao) for texto in palavrasIsoladas]
print(vetoresDeTexto)
#Associates text with Category
Categoria = classificacoes['classificacao']
#Stores text and Categories
x = np.array(vetoresDeTexto)
y = np.array(Categoria)
################################################# ################################################# ##########
#Defines training percentage
porcentagem_de_treino = 0.7
#Sets training data size from training percentage
tamanho_de_treino = int(porcentagem_de_treino * len(y))
tamanho_de_validacao = len(y) - tamanho_de_treino
#Get the training data
treino_dados = x[0:tamanho_de_treino]
treino_marcacoes = y[0:tamanho_de_treino]
#Get the validation data
validacao_dados = x[tamanho_de_treino:]
validacao_marcacoes = y[tamanho_de_treino:]
#Function that evaluates the model with k-fold cross-validation and returns the mean accuracy
def predict(nome, modelo, treino_dados, treino_marcacoes):
    k = 10
    scores = cross_val_score(modelo, treino_dados, treino_marcacoes, cv = k)
    taxa_de_acerto = np.mean(scores)
    msg = "Accuracy {0}: {1}".format(nome, taxa_de_acerto)
    print(msg)
    return taxa_de_acerto
print("\nText Categorization\n")
#Using the OneVsRest classifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
resultados = {}
modeloOneVsRest = OneVsRestClassifier(LinearSVC(random_state = 0))
resultadoOneVsRest = predict("OneVsRest", modeloOneVsRest, treino_dados, treino_marcacoes)
modeloOneVsRest.fit(treino_dados, treino_marcacoes)
resultados[resultadoOneVsRest] = modeloOneVsRest
################################################# ################################################# ##########
#Testing Model with New Data
classificacoes2 = pd.read_csv('previsao.csv', sep=',', encoding='latin-1')
textos2 = classificacoes2['texto']
textos2
palavrasIsoladas2 = textos2.str.lower().str.split()
dicionario2 = set()
for lista in palavrasIsoladas2:
    dicionario2.update(lista)
totalDePalavras2 = len(dicionario2)
palavraEposicao2 = dict(zip(dicionario2, range(totalDePalavras2)))
vetoresDeTexto2 = [vetorizarPresencaPalavras(textos2, palavraEposicao2) for textos2 in palavrasIsoladas2]
x2 = np.array(vetoresDeTexto2)
tamanho_de_treino2 = int(len(x2))
treino_dados2 = x2[0:tamanho_de_treino2]
modeloOneVsRest.fit(treino_dados, treino_marcacoes)
modeloOneVsRest.predict(treino_dados2)
Solution
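The error most likely comes from building a second word-to-position mapping (palavraEposicao2) from the new texts: the model was fitted on vectors with one column per training word (283), while the new vectors have one column per word in previsao.csv (57). A minimal sketch of one way around this is to vectorize the new texts with the mapping built from the training data, so both matrices have the same columns. The toy texts and labels below are hypothetical stand-ins for conteudo.csv and previsao.csv, and the names textos_treino, textos_novos, x_treino and x_novo are illustrative, not from the original code.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


def vetorizarPresencaPalavras(texto, palavraEposicao):
    # Count occurrences of each known word; words outside the mapping are skipped
    vetor = [0] * len(palavraEposicao)
    for palavra in texto:
        if palavra in palavraEposicao:
            vetor[palavraEposicao[palavra]] += 1
    return vetor


# Hypothetical toy data (assumption: stands in for the two CSV files)
textos_treino = ["love me do", "paint it black", "let it be", "start me up"]
marcacoes_treino = [1, 2, 1, 2]
textos_novos = ["let me love", "black paint unknownword"]

# Build the vocabulary ONCE, from the training texts only
palavras_treino = [t.lower().split() for t in textos_treino]
dicionario = sorted({p for lista in palavras_treino for p in lista})
palavraEposicao = {p: i for i, p in enumerate(dicionario)}

x_treino = np.array(
    [vetorizarPresencaPalavras(t, palavraEposicao) for t in palavras_treino]
)
modelo = OneVsRestClassifier(LinearSVC(random_state=0))
modelo.fit(x_treino, marcacoes_treino)

# Vectorize the new texts with the SAME mapping, not a new one
palavras_novas = [t.lower().split() for t in textos_novos]
x_novo = np.array(
    [vetorizarPresencaPalavras(t, palavraEposicao) for t in palavras_novas]
)

# Both matrices now have the same number of columns
print(x_treino.shape, x_novo.shape)
previsoes = modelo.predict(x_novo)
print(previsoes)
```

In the question's code this would amount to passing palavraEposicao instead of palavraEposicao2 when building vetoresDeTexto2; words that never appeared in training are simply ignored. scikit-learn's CountVectorizer follows the same pattern: call fit_transform on the training texts and only transform on the new ones.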