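python - Reading a CSV with JSON features
Problem description
I am trying to read a large CSV that contains JSON features (the location column here). The first, say, 100 rows of the file look like this: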
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
I followed this question to parse the location column. The solution basically defines a helper as:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
and then
df=pd.read_csv('data.csv', nrows=100,converters={'location':CustomParser},header=0)
I get the following error, which is related to the JSON format:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
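Q1: How can I parse the location feature into new columns?
Q2 (for the general case): for nrows>100, the last features (labelA and labelB) also contain JSON, with different keys and values. How can I read the whole CSV by parsing every feature that includes JSON (even partially)?
Solution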
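Fixing the file:
- Unfortunately, the file is difficult to read because every row contains a dict whose key-value pairs are separated by commas, the same character that separates the columns.
- The easiest way to solve the problem is to change the separator outside each dict from , to |.
- The following code reads the existing file
  - It assumes the first line is the header and uses .replace(',', '|') on it
  - In the remaining lines, a regular expression replaces every , outside of {} with |
  - Each line is written to a new file.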
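To see what the regular expression does before running it on a whole file, it can be applied to a single row. The sketch below uses the same pattern as the answer; the shortened row and the name OUTSIDE_DICT_COMMA are made up for illustration, and it assumes the dicts are not nested.

```python
import re

# Same pattern as in the answer: match a comma that is followed either by an
# opening brace before any closing brace, or by no braces up to the end of the line.
OUTSIDE_DICT_COMMA = re.compile(r',(?=(((?!\}).)*\{)|[^\{\}]*$)')

# Shortened, made-up row in the same shape as the sample data.
row = '2019-09-10,{"lng":12.9,"alt":413.0},{"ack":123,"bar":456},nan'

fixed = OUTSIDE_DICT_COMMA.sub('|', row)
print(fixed)
# 2019-09-10|{"lng":12.9,"alt":413.0}|{"ack":123,"bar":456}|nan
```

The commas inside each dict are left alone, so the dict strings stay parseable afterwards.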
Code:
Data:
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
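- Path.cwd() assumes the file is in the current working directory; if that is not the case, Path('c:/some_path_to_my_file') / 'file_name.poo' can be used
- pathlib is part of the standard library
- Python 3's pathlib Module: Taming the File System
Fixing the file: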
import re
from pathlib import Path

p = Path.cwd() / 'test.csv'    # existing file
p2 = Path.cwd() / 'test2.csv'  # new, pipe-separated file

with p.open('r') as f, p2.open('w') as f2:
    for cnt, line in enumerate(f):
        if cnt == 0:
            # header row: every comma is a column separator
            line = line.replace(',', '|')
        else:
            # data rows: only replace commas that are outside of {}
            line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
        f2.write(line)
New file:
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
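Before handing the new file to pandas, it may be worth checking that every row now splits into as many fields as the header and that each dict-like field is valid JSON. The following is only a sketch: it works on a hard-coded list of lines instead of test2.csv, the names new_lines and problems are invented here, and it assumes no value contains a | character.

```python
import json

# A few lines in the shape of the new file (normally read from test2.csv).
new_lines = [
    'Time|location|labelA|labelB',
    '2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}',
    '2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan',
]

header = new_lines[0].split('|')
problems = []  # (line number, description) for every row that looks wrong

for number, line in enumerate(new_lines[1:], start=2):
    fields = line.split('|')
    if len(fields) != len(header):
        problems.append((number, 'wrong number of fields'))
        continue
    for name, value in zip(header, fields):
        # only dict-like fields are checked; plain values such as nan are skipped
        if value.startswith('{'):
            try:
                json.loads(value)
            except json.JSONDecodeError as exc:
                problems.append((number, f'{name}: {exc}'))

print(problems)
```

An empty list means the rows are consistent with the header; anything else points at the line and column that still need fixing.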
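Parsing the new file:
- The columns are now separated correctly by .read_csv
- However, the location, labelA and labelB columns are str
- Use ast.literal_eval to convert them to dict
  - literal_eval will not work on nan, so replace nan with {}
- Use for col in df.columns[1:]: to loop through each column, and:
  - try-except will catch any columns that are not properly formed
  - convert the values from str to dict
  - separate the keys into columns
  - concat the new columns to the existing dataframe
  - drop the old column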
import pandas as pd
from ast import literal_eval
df = pd.read_csv('test2.csv', sep='|')
print(df)
Time location labelA labelB
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
for col in df.columns[1:]:
    try:
        # literal_eval cannot parse NaN, so use an empty dict instead
        df[col] = df[col].fillna('{}')
        df[col] = df[col].apply(literal_eval)
        # expand the dict keys into columns, then drop the original column
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')
print(df)
Time lng alt time error lat ack bar foo bar
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
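Note that the result above contains two columns named bar, because labelA and labelB share that key. One possible way around this, sketched here as a variation and not as part of the original answer, is to prefix each expanded column with the name of the column it came from. The small dataframe is made up for illustration.

```python
from ast import literal_eval

import pandas as pd

# Made-up frame in the shape produced by read_csv on the fixed file.
df = pd.DataFrame({
    'Time': ['2019-09-10', '2019-09-10'],
    'labelA': ['{"ack":123,"bar":456}', None],
    'labelB': ['{"foo":123,"bar":456}', None],
})

for col in ['labelA', 'labelB']:
    # missing values become empty dicts, strings become real dicts
    dicts = df[col].fillna('{}').apply(literal_eval)
    # one column per key, prefixed with the source column name
    expanded = pd.DataFrame(dicts.tolist(), index=df.index).add_prefix(f'{col}_')
    df = pd.concat([df.drop(columns=[col]), expanded], axis=1)

print(df.columns.tolist())
# ['Time', 'labelA_ack', 'labelA_bar', 'labelB_foo', 'labelB_bar']
```

With unique column names, later selections such as df['labelA_bar'] return a single Series instead of a two-column frame.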
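Notes on literal_eval:
- Pandas has methods for importing data in many forms, such as dict or list.
- However, read_csv does not interpret containers (e.g. dict) well; they are read as strings unless you specify the converters parameter (pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval})).
- literal_eval will not work on a column made up of containers mixed with strings or NaN, unless the string consists only of a number (e.g. '8654')
  - The code above therefore first replaces every nan with {} so that literal_eval does not raise an error.
- Given the following mixed column example: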
column_a
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
- literal_eval will throw ValueError: malformed node or string: (or a SyntaxError, depending on the text)
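If a mixed column like column_a has to be parsed anyway, one option is to wrap literal_eval in a helper that falls back to an empty dict. This is a minimal sketch; to_dict is a hypothetical name, and returning {} for non-dict values is an assumption about what is wanted, since it silently discards the original text.

```python
from ast import literal_eval


def to_dict(value):
    """Hypothetical helper: parse a dict-like string, fall back to {} otherwise."""
    try:
        parsed = literal_eval(value)
    except (SyntaxError, ValueError):
        # raised for text such as 'some string' or 'nan'
        return {}
    # a purely numeric string such as '8654' parses, but not to a dict
    return parsed if isinstance(parsed, dict) else {}


column_a = ['{"ack":123,"bar":456}', 'some string', 'nan', '8654']
parsed_column = [to_dict(value) for value in column_a]
print(parsed_column)
# [{'ack': 123, 'bar': 456}, {}, {}, {}]
```

A helper like this could also be passed through the converters parameter of read_csv in place of literal_eval.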
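- The difference between the two solutions is that the other solution fixes one column, while this solution is implemented to fix all columns and removes the need to read only the first 100 rows.
- You can drop the loop that fixes all columns and fix only the location column, if it contains only dicts. Use the following code: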
df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)
Note on the real data:
- The location column is not properly formed: '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
- This is the expected form:
'{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'
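The difference between the two forms can be confirmed with json.loads, which accepts only the second string. The helper name is_valid_json below is made up for this sketch; the two strings are the ones shown above.

```python
import json

malformed = '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
expected = '{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'


def is_valid_json(text):
    """Hypothetical helper: True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(malformed))  # False: keys such as alt are not quoted
print(is_valid_json(expected))   # True
```

Running such a check over a column is a quick way to find out which rows need repairing before any parsing is attempted.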
Fixing the location column:
- The location column is named Position in the real data
def fix_pos(x):
    # map each malformed fragment to its well-formed replacement
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x

df['Position'] = df['Position'].apply(fix_pos)
- Use the following loop for the real data file. Zeit, device, Text & Type do not need to be processed; Position is at index 4.
for col in df.columns[4:]:
    try:
        # literal_eval cannot parse NaN, so use an empty dict instead
        df[col] = df[col].fillna('{}')
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')
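- The loop that applies literal_eval to all columns has been updated with try-except
  - If there is an exception, the column name and the error message are printed.
  - The real data has 64 columns in total, most of which are Furchtbar.
Errors:
These are the errors for all of the columns in the provided csv file.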
device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)