Reading a CSV with JSON features

Problem description

I am trying to read a large CSV that contains JSON features (located here). The first 100 rows or so of the file look like this:

Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan

I followed this question to parse the location column. That solution basically defines a helper as:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

and then

df=pd.read_csv('data.csv', nrows=100,converters={'location':CustomParser},header=0)

I get the following error related to the JSON format:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

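Q1: How can I parse the location feature into new columns?

Q2 (for the general case): for nrows>100 in the data, the last features (labelA and labelB) also contain JSON, with different keys and values. How can I read the whole CSV while parsing every feature that includes JSON (even partially)?

Tags: python, json, pandas

Solution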


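Fix the file:

  • Unfortunately, the file is difficult to read because each row contains a dict whose key-value pairs are separated by commas, the same character that separates the columns.
  • The simplest way to solve this is to change the separator outside each dict from , to |
  • The following code reads the existing file
    • It assumes the first row is the header and converts it with .replace(',', '|')
    • The remaining rows use a regular expression to replace the commas outside {}
    • Each row is written to a new file.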
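
To see why the comma is a problem, the short sketch below splits one sample row on every comma, as a simplified stand-in for a comma-delimited reader. The variable names are illustrative only. The header declares four columns, but the row breaks into eight pieces, and the piece that lands in the location position is only a fragment of the dict.

```python
# One data row from the sample file, split naively on every comma.
header = 'Time,location,labelA,labelB'
row = ('2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10",'
       '"error":7.0,"lat":17.8},nan,nan')

header_fields = header.split(',')
row_fields = row.split(',')

print(len(header_fields))  # 4
print(len(row_fields))     # 8
print(row_fields[1])       # {"lng":12.9
```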

Code:

Data:

Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan

File fix:

import re
from pathlib import Path

p = Path.cwd() / 'test.csv'
p2 = Path.cwd() / 'test2.csv'

with p.open('r') as f:
    with p2.open('w') as f2:
        for cnt, line in enumerate(f):
            if cnt == 0:
                line = line.replace(',', '|')
            else:
                line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
            f2.write(line)
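
The regular expression can be checked on individual lines before running it over a whole file. The sketch below applies the same pattern to two shortened sample rows held in memory; the constant name and the sample strings are assumptions made for this illustration.

```python
import re

# Same pattern as in the file-fix loop above: match a comma only when the next
# brace after it is an opening one, or when no brace follows at all.
OUTSIDE_DICT_COMMA = re.compile(r',(?=(((?!\}).)*\{)|[^\{\}]*$)')

with_dicts = '2019-09-10,{"lng":12.9,"lat":17.8},{"ack":123,"bar":456}\n'
with_nans = '2019-09-10,{"lng":12.9,"lat":17.8},nan,nan\n'

fixed_dicts = OUTSIDE_DICT_COMMA.sub('|', with_dicts)
fixed_nans = OUTSIDE_DICT_COMMA.sub('|', with_nans)

print(fixed_dicts, end='')  # 2019-09-10|{"lng":12.9,"lat":17.8}|{"ack":123,"bar":456}
print(fixed_nans, end='')   # 2019-09-10|{"lng":12.9,"lat":17.8}|nan|nan
```

Because the pattern only looks at the next brace, it appears to rely on the dicts being flat: a comma that comes right before a nested dict inside a value would also be replaced. It also assumes that | does not occur in the data itself.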

New file:

Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan

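Parse the new file:

  • The columns will now be separated correctly by .read_csv
  • However, location, labelA and labelB are of type str
    • Use ast.literal_eval to convert them to dict
    • literal_eval will not work on nan, so replace nan with {}
  • for col in df.columns[1:]: loops through each column, and:
    • try-except catches any column that is not properly formed
    • converts the values from str to dict
    • separates the keys into columns
    • concats the new columns to the existing dataframe
    • drops the old column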
import pandas as pd
from ast import literal_eval

df = pd.read_csv('test2.csv', sep='|')
print(df)

       Time                                                             location                 labelA                 labelB
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN


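for col in df.columns[1:]:
    try:
        # Replace NaN with an empty dict literal so literal_eval can parse every row.
        # Assigning the result back avoids an in-place fillna on a column selection.
        df[col] = df[col].fillna('{}')
        # Convert each str to a dict
        df[col] = df[col].apply(literal_eval)
        # Expand the dict keys into columns and append them to the dataframe
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        # Drop the original str column
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')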


print(df)

       Time   lng    alt        time  error   lat    ack    bar    foo    bar
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
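
The result above ends up with two columns named bar, one from labelA and one from labelB. As a hedged illustration of one way to keep them apart, the pandas-free sketch below reads pipe-separated text with the standard csv module and prefixes each key with the name of its source column. The flatten_row helper, the shortened sample rows and the column_key naming scheme are assumptions made for this example, not part of the original answer.

```python
import csv
import io
import json

# Shortened stand-in for the pipe-separated file produced by the file fix.
raw = (
    'Time|location|labelA|labelB\n'
    '2019-09-10|{"lng":12.9,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}\n'
    '2019-09-10|{"lng":12.9,"lat":17.8}|nan|nan\n'
)

def flatten_row(row):
    """Expand JSON cells into '<column>_<key>' entries and skip 'nan' cells."""
    flat = {}
    for col, value in row.items():
        if value.startswith('{'):
            for key, item in json.loads(value).items():
                flat[f'{col}_{key}'] = item
        elif value != 'nan':
            flat[col] = value
    return flat

# QUOTE_NONE keeps the double quotes inside the JSON cells untouched.
reader = csv.DictReader(io.StringIO(raw), delimiter='|', quoting=csv.QUOTE_NONE)
rows = [flatten_row(row) for row in reader]

print(rows[0]['labelA_bar'], rows[0]['labelB_bar'])  # 456 456
print(rows[1])
# {'Time': '2019-09-10', 'location_lng': 12.9, 'location_lat': 17.8}
```

A list of dicts like rows can be passed to pd.DataFrame. Within the pandas loop itself, calling .add_prefix(f'{col}_') on the expanded frame before the concat should give a similar result.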

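Notes on literal_eval:

  • Pandas has methods for importing data in many forms, such as dict and list.
  • However, read_csv does not interpret containers (e.g. dict) well; they are read as strings unless you specify the converters parameter (pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval})).
  • literal_eval does not work on columns that mix containers with strings or NaN, unless the string is purely numeric (e.g. '8654')
  • The part of the code above first replaces every nan with a {} so that literal_eval does not raise an error.
  • Given the following mixed column example: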
column_a
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
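  • literal_eval will raise an error on the string rows (ValueError: malformed node or string for a bare word, or SyntaxError for text that is not a valid Python expression):
    • The difference between the two solutions is that the other solution fixes one column, whereas this solution is implemented to fix all columns and removes the need to read only the first 100 rows.
    • You can drop the loop that fixes all columns and fix only the location column, provided it contains nothing but dicts. Use the following code: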
df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)
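
The behaviour described in the notes above can be checked without pandas. The sketch below runs literal_eval over a small mixed list and falls back to the original value when parsing fails; the to_dict_or_keep helper and the sample values are assumptions made for illustration.

```python
from ast import literal_eval

def to_dict_or_keep(value):
    """Return the parsed literal, or the original value if it cannot be parsed."""
    try:
        return literal_eval(value)
    except (SyntaxError, ValueError):
        return value

# A mixed column: a dict string, free text, a numeric string and a 'nan' string.
column_a = ['{"ack":123,"bar":456}', 'some string', '8654', 'nan']
parsed = [to_dict_or_keep(v) for v in column_a]

print(parsed)
# [{'ack': 123, 'bar': 456}, 'some string', 8654, 'nan']
```

Applied with df['column_a'].apply(to_dict_or_keep), a helper like this would leave the non-dict rows as they are instead of failing on the whole column, at the cost of a column that still holds mixed types.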

Note on the real data:

  • The location column is not properly formed
    • '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
  • This is the expected form:
    • '{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'

Fix the location column:

  • location is named Position in the real data
def fix_pos(x):
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x

df.Position = df.Position.apply(lambda x: fix_pos(x))
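
As a quick check of the replacements, the sketch below runs fix_pos on the malformed sample string shown earlier and compares the result with the expected form. The function is repeated so the snippet runs on its own; the variable names are illustrative.

```python
import json

def fix_pos(x):
    # Same replacements as fix_pos above, repeated so the snippet is self-contained.
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x

malformed = '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
expected = '{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'

fixed = fix_pos(malformed)
print(fixed == expected)          # True
print(json.loads(fixed)['time'])  # 2019-09-10T12:09:58Z
```

Since these are plain substring replacements, they would also change any value that happens to contain alt or lat, so it is worth checking a few fixed rows before applying the function to the whole column.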
  • Use the following loop on the real data file.
  • Zeit, device, Text and Type do not need to be processed
  • Position is at index 4.
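for col in df.columns[4:]:
    try:
        # Replace NaN with an empty dict literal so literal_eval can parse every row
        df[col] = df[col].fillna('{}')
        # Convert each str to a dict
        df[col] = df[col].apply(literal_eval)
        # Expand the dict keys into columns and append them to the dataframe
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        # Drop the original str column
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        # Report the column that could not be parsed and carry on with the next one
        print(f'{col}: {e}')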
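  • The loop that applies literal_eval to all columns has been updated with try-except
    • If there is an exception, the column name and the error message are printed
    • The real data has 64 columns in total, most of which are Furchtbar

Errors:

  • These are the errors for all columns in the provided csv file.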
device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)
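
The messages above fall into two groups. "malformed node or string" with an _ast.Name object is the ValueError that literal_eval raises for a bare word, while "unexpected EOF while parsing" and "invalid syntax" are SyntaxErrors from text that is not a complete Python expression. The sketch below reproduces both kinds with made-up fragments; the failure_kind helper and the samples are assumptions for illustration, and the exact message wording depends on the Python version.

```python
from ast import literal_eval

def failure_kind(value):
    """Return the name of the exception literal_eval raises, or None on success."""
    try:
        literal_eval(value)
    except (SyntaxError, ValueError) as e:
        return type(e).__name__
    return None

samples = ['{"lng":12.9,"lat":17.8}',  # complete dict
           '{"lng":12.9',              # dict cut off at a comma
           '"lat":17.8}',              # tail of a dict
           'nan']                      # bare word
kinds = [failure_kind(s) for s in samples]

print(kinds)
# [None, 'SyntaxError', 'SyntaxError', 'ValueError']
```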
