How to extract columns from a dictionary that has no keys

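Problem description

I tried several resources on converting a dictionary into a DataFrame, but the problem is that this is a strange dictionary.

It is not shaped like key: {}, key: {}, and so on.

The data has many fields, but the goal is to extract only what is inside the dict {}. Including the date as well would be a plus, if possible.

Data: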

id,client,source,status,request,response,queued,created_at,updated_at
54252,sdf,https://asdasdadadad,,"{
    "year": "2010",
    "casa": "aca",
    "status": "p",
    "Group": "57981",
}",,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
54252,msc-lp,https://discover,,"{
    "year": "27",
    "casa": "Na",
    "status": "p",
    "Group": "57981",
}"

My attempts:

#attempt 1
with open('data.csv') as fd:
    pairs = (line.split(None) for line in fd)
    res   = {int(pair[0]):pair[1] for pair in pairs if len(pair) == 2 and pair[0].isdigit()}



#attempt 2
import json

# reading the JSON data using json.load()
file = 'data.json'
with open(file) as train_file:
    dict_train = json.load(train_file)

# converting json dataset from dictionary to dataframe
train = pd.DataFrame.from_dict(dict_train, orient='index')
train.reset_index(level=0, inplace=True)


#attempt 3
df = pd.read_csv("data.csv")
df = df.melt(id_vars=["index", "Date"], var_name="variables",value_name="values")

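None of these worked because of the odd shape of the data.

Expected output:
Every item in the dictionary, with each key becoming a column of the df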

Date                     year  casa  status    Group
2020-05-02 11:06:17     2010   aca    p       57981
2020-05-02 11:06:17      27     Na    p       57981

Tags: python, json, pandas, dictionary

Solution

Format the data as a valid CSV structure:

id,client,source,status,request,response,queued,created_at,updated_at
54252,sdf,https://asdasdadadad,,'{ "ag": "2010", "ca": "aca", "ve": "p", "Group": "57981" }',,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
54252,msc-lp,https://discover,,'{ "ag": "27", "ca": "Na", "ve": "p", "Group": "57981" }',,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
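If the file cannot be fixed by hand, the reformatting step could be scripted. The following is only a rough sketch: `normalize_request_blocks` is a hypothetical helper name, and it assumes every dictionary is wrapped as `"{ ... }"`, contains no nested braces, and has no single quotes in its values. It collapses each multi-line block onto one line, drops the trailing comma that makes the original block invalid JSON, and switches the outer quotes to single quotes.

```python
import re

def normalize_request_blocks(raw: str) -> str:
    """Rewrite each "{ ... }" block as a one-line, single-quoted JSON field."""
    def repl(match: re.Match) -> str:
        # Collapse newlines and indentation into single spaces.
        body = " ".join(match.group(1).split())
        # Drop the trailing comma before the closing brace (invalid in JSON).
        body = body.rstrip(",").rstrip()
        return "'{ " + body + " }'"

    # Non-greedy match from the opening "{ to the first closing }".
    return re.sub(r'"\{(.*?)\}"', repl, raw, flags=re.DOTALL)

# Small sample in the same shape as the data in the question.
sample = '54252,sdf,https://asdasdadadad,,"{\n    "year": "2010",\n    "Group": "57981",\n}",,1\n'
cleaned = normalize_request_blocks(sample)
print(cleaned)
```

The cleaned text can then be written to a file and read with `quotechar="'"`.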

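The code below also handles the worst case, a request value that is not valid JSON: the error is printed and the value becomes None. Please check it.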

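import json
import pandas as pd

def parse_column(data):
    # Parse the JSON text of one "request" cell; return None if it is invalid.
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None


# quotechar="'" lets the single-quoted JSON field keep its commas and double quotes.
df = pd.read_csv('tmp.csv', converters={"request": parse_column}, quotechar="'")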
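After this, the request column holds Python dicts, not separate columns. One possible way to reach the expected output is sketched below, using the sample rows from above read from an in-memory string instead of tmp.csv. The names `CSV_TEXT`, `records`, and `expanded` are illustrative, and using created_at for the Date column is an assumption. Because the quote character is `'`, the date fields keep their literal double quotes, so they are stripped explicitly.

```python
import io
import json
import pandas as pd

CSV_TEXT = '''id,client,source,status,request,response,queued,created_at,updated_at
54252,sdf,https://asdasdadadad,,'{ "ag": "2010", "ca": "aca", "ve": "p", "Group": "57981" }',,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
54252,msc-lp,https://discover,,'{ "ag": "27", "ca": "Na", "ve": "p", "Group": "57981" }',,1,"2020-05-02 11:06:17","2020-05-02 11:06:17"
'''

def parse_column(data):
    # Return the parsed dict, or None when the cell is not valid JSON.
    try:
        return json.loads(data)
    except (TypeError, ValueError):
        return None

df = pd.read_csv(io.StringIO(CSV_TEXT), converters={"request": parse_column}, quotechar="'")

# Rows that failed to parse become empty dicts so they still produce a row.
records = [r if isinstance(r, dict) else {} for r in df["request"]]
expanded = pd.DataFrame(records, index=df.index)

# Assumption: created_at is the date wanted; strip the leftover double quotes.
expanded.insert(0, "Date", df["created_at"].str.strip('"'))
print(expanded)
```

Building the frame from a list of dicts keeps every value as the string found in the JSON, so a value such as "Na" is not converted to a missing value.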
