Splitting a JSON file into different time intervals with Python

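Problem description

For a current research project, I am trying to split a JSON file into different time intervals. Based on the "Date" object, I want to analyze the contents of the JSON file by quarter, i.e. January 1 - March 31, April 1 - June 30, and so on.

Ideally, the code would pick the oldest date in the file and add quarterly intervals on top of it. I have researched this but have not yet found any helpful approach.

Is there a smart way to include this in the code? The JSON file has the following structure: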

[
{"No":"121","Stock Symbol":"A","Date":"05/11/2017","Text Main":"Sample text"}
]

The relevant excerpt of the existing code looks like this:

import pandas as pd

file = pd.read_json (r'Glassdoor_A.json')
data = json.load(file)

# Create an empty dictionary
d = dict()

# processing:
for row in data:
    line = row['Text Main']
    # Remove the leading spaces and newline character
    line = line.strip()

    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()

    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))

    # Split the line into time intervals
    line.sort_values(by=['Date'])
    line.tshift(d, int = 90, freq=timedelta, axis='Date')

    # Split the line into words
    words = line.split(" ")

    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])

    # Count the total number of words
    total = sum(d.values())
    print(d[key], total)

Tags: python, json, nlp

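Solution

Below is a solution to the problem. By assigning a start and an end date and comparing the JSON "Date" field against them, the data can be sliced with Pandas.

Important: the data must be normalized, and the dates must be converted to Pandas datetime format before any processing.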
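To illustrate the note above, here is a minimal sketch of the normalization and date conversion on inline sample data. The records are hypothetical, and the day/month/year format is an assumption (the question's sample date "05/11/2017" is ambiguous on its own), so adjust the format string if the file uses month/day/year.

```python
import pandas as pd

# Hypothetical records shaped like the JSON in the question
records = [
    {"No": "121", "Stock Symbol": "A", "Date": "05/11/2017", "Text Main": "Sample text"},
    {"No": "122", "Stock Symbol": "A", "Date": "31/03/2018", "Text Main": "More text"},
]

# Flatten the list of dicts into a DataFrame
df = pd.json_normalize(records)

# Assumption: day/month/year. An explicit format avoids pandas guessing
# month-first for ambiguous strings such as "05/11/2017".
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

print(df["Date"].dt.month.tolist())
```

With the explicit format, "05/11/2017" is read as 5 November rather than 11 May.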

import json

import pandas as pd


# Load the JSON records and flatten them into a DataFrame
with open("Glassdoor_A.json", "r") as file:
    data = json.load(file)
df = pd.json_normalize(data)

# Assumption: dates are written day/month/year, as the end date 31/03/2018 suggests
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")


# Filter by date: first quarter of 2018
start_date = pd.Timestamp(2018, 1, 1)
end_date = pd.Timestamp(2018, 3, 31)

after_start_date = df["Date"] >= start_date
before_end_date = df["Date"] <= end_date

between_two_dates = after_start_date & before_end_date
filtered_dates = df.loc[between_two_dates]

print(filtered_dates)
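
The code above selects one hand-picked quarter. To cover the whole file, as the question asks, a possible sketch is to label every row with its quarter and then count words per quarter. The sample records, the column names "Quarter" and "Window", and the helpers window_index and count_words are all hypothetical, and the day/month/year format is again an assumption. Two labelings are shown: calendar quarters, and 3-month windows counted from the oldest date in the data.

```python
import string
from collections import Counter

import pandas as pd

# Hypothetical sample records shaped like the JSON in the question
records = [
    {"No": "1", "Stock Symbol": "A", "Date": "05/11/2017", "Text Main": "Good pay, good team"},
    {"No": "2", "Stock Symbol": "A", "Date": "20/01/2018", "Text Main": "Long hours"},
    {"No": "3", "Stock Symbol": "A", "Date": "10/02/2018", "Text Main": "Good benefits"},
]

df = pd.DataFrame(records)
# Assumption: dates are day/month/year
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Option 1: calendar quarters (Jan-Mar, Apr-Jun, ...), e.g. "2018Q1"
df["Quarter"] = df["Date"].dt.to_period("Q").astype(str)

# Option 2: 3-month windows counted from the oldest date in the data
oldest = df["Date"].min()


def window_index(date):
    """Return 0 for the first 3 months after `oldest`, 1 for the next 3, and so on."""
    index = 0
    while date >= oldest + pd.DateOffset(months=3 * (index + 1)):
        index += 1
    return index


df["Window"] = df["Date"].apply(window_index)


def count_words(texts):
    """Count lowercase words across texts, ignoring punctuation."""
    table = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.strip().lower().translate(table).split())
    return counts


# One word-count table per calendar quarter
counts_by_quarter = {
    quarter: count_words(group["Text Main"])
    for quarter, group in df.groupby("Quarter")
}

for quarter, counts in counts_by_quarter.items():
    print(quarter, dict(counts), "total:", sum(counts.values()))
```

Grouping by "Window" instead of "Quarter" gives the counts for the windows anchored at the oldest date.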
