Multi-label classification using MEKA in Java

Problem description

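Can anyone point me to complete documentation for classifying a multi-label dataset with MEKA from Java code? I need to train on 80% of the data first and then test on the remaining 20%. How can I do this with MEKA? It would be really great if someone could help. Here is what my dataset looks like; the first six attributes are the class labels.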

     @attribute IS_PROTECTION_binarized {0,1}
     @attribute IS_PRICING_binarized {0,1}
     @attribute IS_ERROR_binarized {0,1}
     @attribute IS_USAGE_binarized {0,1}
     @attribute IS_COMPATIBILITY_binarized {0,1}
     @attribute IS_RESOURCES_binarized {0,1}
     @attribute text string

     @data
     0,0,1,0,1,0,'keeps crashing since i upgraded my android this game keeps crashing'
     0,0,0,0,0,0,'addictive i first became a fan of this game when i got an app that u had to earn coins to unlock diffrent colored lights how u got coins was to play games and it just happened tbat one of the mini games was this kind of game'
     0,1,0,0,0,0,'ad free port of the original open source game'

Tags: java, multilabel-classification

Solution


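You can use scikit-multilearn (a Python library) for this: its LabelPowerset class handles the problem, and you only need to choose a base multi-class classifier. You will probably need to preprocess the text attribute, though, so wrapping everything in a pipeline may be important.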

from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier

# initialize LabelPowerset multi-label classifier with a RandomForest base classifier
classifier = LabelPowerset(
    classifier=RandomForestClassifier(n_estimators=100),
    require_dense=[False, True]
)

# train (X_train must already be numeric features; y_train holds the six binary label columns)
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)
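
To make the label powerset idea concrete, here is a minimal pure-Python sketch of the transformation: every distinct combination of the six labels is treated as one class, so an ordinary multi-class classifier can be trained on it. The helper names (`fit_powerset`, `to_classes`, `to_labels`) are made up for illustration and are not part of scikit-multilearn; the library's own implementation may differ in its details.

```python
from typing import Dict, List, Tuple

Labels = Tuple[int, ...]


def fit_powerset(y: List[Labels]) -> Dict[Labels, int]:
    """Assign one class id to each distinct label combination, in order of first appearance."""
    combo_to_class: Dict[Labels, int] = {}
    for combo in y:
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)
    return combo_to_class


def to_classes(y: List[Labels], combo_to_class: Dict[Labels, int]) -> List[int]:
    """Replace each label vector with its single class id."""
    return [combo_to_class[combo] for combo in y]


def to_labels(classes: List[int], combo_to_class: Dict[Labels, int]) -> List[Labels]:
    """Map predicted class ids back to full label vectors."""
    class_to_combo = {c: combo for combo, c in combo_to_class.items()}
    return [class_to_combo[c] for c in classes]


# Label vectors of the three rows shown in the question, plus a repeat of the first.
y = [
    (0, 0, 1, 0, 1, 0),
    (0, 0, 0, 0, 0, 0),
    (0, 1, 0, 0, 0, 0),
    (0, 0, 1, 0, 1, 0),
]
mapping = fit_powerset(y)
classes = to_classes(y, mapping)
print(classes)                           # [0, 1, 2, 0]
print(to_labels(classes, mapping) == y)  # True
```

One consequence of this design is that only label combinations seen during training can ever be predicted, which is worth keeping in mind with six labels and a small dataset.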

The pipeline might look like this:

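from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# turn raw text into token counts, reweight them with TF-IDF, then classify
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', classifier),
])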
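
The snippets above assume that X_train, y_train and X_test already exist, while the question asks for an 80%/20% split. In scikit-learn this is usually done with `train_test_split(X, y, test_size=0.2)`; the sketch below shows the same idea using only the standard library. The function name `split_train_test` and the placeholder rows are assumptions made for illustration, not data from the question.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    rows: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of rows reproducibly, then cut it at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder rows: (six binary labels, review text).
rows = [((i % 2, 0, 0, 0, 0, 0), f"placeholder review {i}") for i in range(10)]

train_rows, test_rows = split_train_test(rows)

# Separate labels from text, matching the X_train / y_train naming used above.
y_train = [labels for labels, _ in train_rows]
X_train = [text for _, text in train_rows]
y_test = [labels for labels, _ in test_rows]
X_test = [text for _, text in test_rows]

print(len(X_train), len(X_test))  # 8 2
```

Here X_train and X_test hold raw text, so they suit the pipeline (whose CountVectorizer step consumes strings) rather than the bare classifier, which expects numeric features.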
