Is there an example of using weka's JSONLoader?

Problem description

I'm using the Weka KnowledgeFlow GUI and setting up my flow to train a classification model. Here's what a small sample of my data looks like. The only attribute is the text value. Since this is supervised learning, each tweet/document also has a label/category.

[
  {
    "id": 8.7361726140328e+17,
    "text": "The Joki's on you! Unless you take advantage of 25% off Scarlet Court Chests - on sale now! https:\/\/t.co\/vc1ttPxJWm",
    "category": [
      "dont_care"
    ]
  },
  {
    "id": 8.7329941695388e+17,
    "text": "Don't be a drag - dress like a queen! Scarlet Court Chest Rolls are 25% off! https:\/\/t.co\/O0Ig5bEZdD",
    "category": [
      "dont_care"
    ]
  },
  {
    "id": 8.7328034547452e+17,
    "text": "Join @Inukii and @MezmoreyezTV for Top 5 Console Plays! https:\/\/t.co\/3JmreXSTWp",
    "category": [
      "dont_care"
    ]
  }
]

The exception I get in the log:

11:16:12: [Low] FlowRunner$1697181913|FlowRunner: Launching start point: JSONLoader
11:16:12: [Basic] JSONLoader$17081058|Loading /home/j/_Github-Projects/GameMediaBot/SmiteGame_classified_data.json
11:16:12: [ERROR] JSONLoader$17081058|java.lang.Exception: Can't recover from previous error(s)
weka.core.WekaException: java.lang.Exception: Can't recover from previous error(s)
    at weka.knowledgeflow.steps.Loader.start(Loader.java:178)
    at weka.knowledgeflow.StepManagerImpl.startStep(StepManagerImpl.java:1035)
    at weka.knowledgeflow.BaseExecutionEnvironment$3.run(BaseExecutionEnvironment.java:440)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.Exception: Can't recover from previous error(s)
    at weka.core.converters.JSONLoader.getStructure(JSONLoader.java:242)
    at weka.core.converters.JSONLoader.getDataSet(JSONLoader.java:267)
    at weka.knowledgeflow.steps.Loader.start(Loader.java:172)
    ... 7 more
Caused by: java.lang.Exception: Can't recover from previous error(s)
    at java_cup.runtime.lr_parser.report_fatal_error(lr_parser.java:392)
    at java_cup.runtime.lr_parser.unrecovered_syntax_error(lr_parser.java:539)
    at java_cup.runtime.lr_parser.parse(lr_parser.java:731)
    at weka.core.json.JSONNode.read(JSONNode.java:634)
    at weka.core.converters.JSONLoader.getStructure(JSONLoader.java:234)
    ... 9 more

11:16:12: [Low] JSONLoader$17081058|Interrupted

My flow:

[screenshot of the KnowledgeFlow layout]

Tags: java, weka

Solution
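
As far as I can tell, the parser error happens because weka's JSONLoader is not a generic JSON reader: it only accepts weka's own JSON dataset format (the one weka.core.converters.JSONSaver writes), with an explicit header and data section, so a plain top-level array of objects like the one above is rejected by the grammar. A rough sketch of the shape it expects, reconstructed from JSONSaver output and not authoritative (the care_about label is made up for illustration):

{
  "header": {
    "relation": "tweets",
    "attributes": [
      {"name": "text", "type": "string", "class": false, "weight": 1.0},
      {"name": "category", "type": "nominal", "class": true, "weight": 1.0,
       "labels": ["dont_care", "care_about"]}
    ]
  },
  "data": [
    {"sparse": false, "weight": 1.0,
     "values": ["Join @Inukii and @MezmoreyezTV for Top 5 Console Plays!", "dont_care"]}
  ]
}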


Anyway, I ended up just writing a script to convert my data from JSON to ARFF. I'm not sure what the convention is for choosing attributes for text data; I simply used the 40 most frequent words from the tweet categories I care about. At the end I appended an attribute named class, which acts like an enum of the category labels; that seems to be the convention for training a model like this.
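
For concreteness, the generated ARFF looks roughly like this (the attribute words and the care_about label are made up for illustration; the real attributes are whatever the 40 most frequent words turn out to be, and exact capitalization depends on the liac-arff writer):

@relation dont_care

@attribute sale INTEGER
@attribute chests INTEGER
@attribute console INTEGER
@attribute class {dont_care, care_about}

@data
1,1,0,dont_care
0,0,1,care_about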

See the code on GitHub at https://github.com/jtara1/GameMediaBot/blob/master/transform_to_arff.py, or the same code below.

import re
import json
from os.path import join, dirname, abspath, basename
from collections import Counter, OrderedDict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import arff
import click


@click.command()
@click.argument('file')
@click.option('--dont-care-category',
              type=click.STRING,
              default='dont_care')
@click.option('-a',
              type=click.INT,
              default=40,
              help='Number of attributes. Attrs are the most frequent words '
                   'in the text of the target category')
def transform(file, dont_care_category, a):
    """input example
    [ {id: 123, text: "this is text body", category: ["dont_care"]} ]
    output example
    @relation game_media_bot

    @attribute

    :return:
    """
    classes = set()
    with open(file, 'r') as f:
        data = json.load(f)

    # running word counts over tweets from the categories we care about
    master_vector = Counter()

    for tweet in data:
        classes.add(tweet['category'][0])
        # skip "don't care" tweets when collecting candidate attribute words
        if tweet['category'][0] != dont_care_category:
            master_vector += get_word_vector(tweet)

    print(master_vector)

    # most common words in the text of the target category
    attrs = [(word, 'INTEGER') for word, _ in master_vector.most_common(a)]
    attrs.append(('class', [value for value in classes]))

    arff_data = {
        'attributes': attrs,  # liac-arff style: list of (name, type) pairs
        'data': [],
        'description': '',
        'relation': '{}'.format(dont_care_category)
    }

    for tweet in data:
        word_vector = get_word_vector(tweet)
        # Counter returns 0 for attribute words missing from this tweet
        tweet_data = [word_vector[attr[0]] for attr in attrs[:-1]]
        tweet_data.append(tweet['category'][0])
        arff_data['data'].append(tweet_data)

    out_file = file.replace('.json', '.arff')
    arff_text = arff.dumps(arff_data)
    with open(out_file, 'w') as f:
        f.write(arff_text)


def get_word_vector(tweet):
    stop_words = stopwords.words('english')
    stop_words += ['!', ':', ',', '-', 'https', '/', '\u2026', "'s", "n't",
                   '#', '.', ';', ')', '(', "'re", '&', '?', '%', '@', "'",
                   '...']

    # t.co short-link fragments that survive tokenization
    uri = re.compile(r'(https)?:?//t\.co/.*')

    # split the text into word/punctuation tokens
    words = word_tokenize(tweet['text'])

    # make each word lowercase
    words = [word.lower() for word in words]

    words = list(
        filter(
            lambda word: word not in stop_words and not uri.match(word),
            words
        )
    )

    return Counter(words)


if __name__ == '__main__':
    transform()
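
Two practical notes on running this, both assumptions about the environment rather than anything the script checks: the arff import is the liac-arff package, and word_tokenize/stopwords need NLTK's punkt and stopwords data downloaded once. After that it's a plain click CLI:

pip install liac-arff nltk click
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
python transform_to_arff.py SmiteGame_classified_data.json -a 40

And a quick sanity check of get_word_vector (stopwords like "on" and "now" plus the "!" get filtered; the exact result depends on your NLTK stopword list):

>>> get_word_vector({'text': "Scarlet Court Chests on sale now!"})
Counter({'scarlet': 1, 'court': 1, 'chests': 1, 'sale': 1})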
