Splitting JSON data into train and test sets

Problem description

I am trying to fit a CNN to the HuffPost news dataset (https://www.kaggle.com/rmisra/news-category-dataset). The dataset is in JSON format. My data looks like this:

[{"category": "CRIME", "headline": "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV", "authors": "Melissa Jeltsen", "link": "https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89", "short_description": "She left her husband. He killed their children. Just another day in America.", "date": "2018-05-26"} , {"category": "ENTERTAINMENT", "headline": "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song", "authors": "Andy McDonald", "link": "https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201", "short_description": "Of course it has a song.", "date": "2018-05-26"} ]

This is the code I am trying. It is based on https://www.kaggle.com/kredy10/simple-lstm-for-text-classification. I want to fit an LSTM on this data.

import pandas as pd
import json
with open('News_Category_Dataset_v2.json', 'r') as f:
    train = json.load(f)

Now I want to split the data into training and test sets, but I don't know how to split data that is held in an array. Can anyone help?

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.15)

Tags: python, json, machine-learning, lstm

Solution


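Here is how I did it: I first used train_test_split to split the data into training (70%) and test (30%) sets, then applied the same function to the test portion to split it in half into test (50%) and validation (50%) sets. That leaves roughly 70% for training, 15% for validation, and 15% for testing.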

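from sklearn.model_selection import train_test_split

# 'file_name' is a placeholder; this approach assumes one record per line
with open('file_name') as f:
    lines = f.readlines()

# 70% train, 30% held out
train, test = train_test_split(lines, test_size=0.3)
# split the held-out 30% evenly into validation and test
val, test = train_test_split(test, test_size=0.5)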
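
If the file is parsed with json.load into a list of records, as in the question, the same two-step split can probably be applied to separate X and Y lists. The following is only a sketch under assumptions: the records are a small made-up stand-in for the real dataset, "headline" is taken as the input text and "category" as the label, and random_state is set only to make the run repeatable.

```python
import json
from sklearn.model_selection import train_test_split

# Made-up stand-in for the parsed dataset: a JSON array of records that use
# the same "category" and "headline" keys as the sample in the question.
raw = json.dumps([
    {"category": "CRIME" if i % 2 == 0 else "ENTERTAINMENT",
     "headline": f"example headline {i}"}
    for i in range(20)
])
records = json.loads(raw)

# Build features and labels as two parallel lists.
X = [r["headline"] for r in records]
Y = [r["category"] for r in records]

# First split: 70% train, 30% held out.
X_train, X_rest, Y_train, Y_rest = train_test_split(
    X, Y, test_size=0.3, random_state=42)

# Second split: halve the held-out part into validation and test.
X_val, X_test, Y_val, Y_test = train_test_split(
    X_rest, Y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Passing X and Y together keeps each headline paired with its label after shuffling. Note that the readlines approach above yields raw text lines, so each line would still need json.loads before the fields can be accessed.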

Hope this helps!

