首页 > 解决方案 > 读取 JSON 文件并转换为 Panda DataFrame

问题描述

我有一个包含这样的 JSON 格式数据的文本文件。

{"accountNumber": "737265056", "customerId": "737265056", "creditLimit": 5000.0, "availableMoney": 5000.0, "transactionDateTime": "2016-08-13T14:27:32", "transactionAmount": 98.55, "merchantName": "Uber", "acqCountry": "US", "merchantCountryCode": "US", "posEntryMode": "02", "posConditionCode": "01", "merchantCategoryCode": "rideshare", "currentExpDate": "06/2023", "accountOpenDate": "2015-03-14", "dateOfLastAddressChange": "2015-03-14", "cardCVV": "414", "enteredCVV": "414", "cardLast4Digits": "1803", "transactionType": "PURCHASE", "echoBuffer": "", "currentBalance": 0.0, "merchantCity": "", "merchantState": "", "merchantZip": "", "cardPresent": false, "posOnPremises": "", "recurringAuthInd": "", "expirationDateKeyInMatch": false, "isFraud": false}
{"accountNumber": "737265056", "customerId": "737265056", "creditLimit": 5000.0, "availableMoney": 5000.0, "transactionDateTime": "2016-10-11T05:05:54", "transactionAmount": 74.51, "merchantName": "AMC #191138", "acqCountry": "US", "merchantCountryCode": "US", "posEntryMode": "09", "posConditionCode": "01", "merchantCategoryCode": "entertainment", "cardPresent": true, "currentExpDate": "02/2024", "accountOpenDate": "2015-03-14", "dateOfLastAddressChange": "2015-03-14", "cardCVV": "486", "enteredCVV": "486", "cardLast4Digits": "767", "transactionType": "PURCHASE", "echoBuffer": "", "currentBalance": 0.0, "merchantCity": "", "merchantState": "", "merchantZip": "", "posOnPremises": "", "recurringAuthInd": "", "expirationDateKeyInMatch": false, "isFraud": false}
{"accountNumber": "737265056", "customerId": "737265056", "creditLimit": 5000.0, "availableMoney": 5000.0, "transactionDateTime": "2016-11-08T09:18:39", "transactionAmount": 7.47, "merchantName": "Play Store", "acqCountry": "US", "merchantCountryCode": "US", "posEntryMode": "09", "posConditionCode": "01", "merchantCategoryCode": "mobileapps", "currentExpDate": "08/2025", "accountOpenDate": "2015-03-14", "dateOfLastAddressChange": "2015-03-14", "cardCVV": "486", "enteredCVV": "486", "cardLast4Digits": "767", "transactionType": "PURCHASE", "echoBuffer": "", "currentBalance": 0.0, "merchantCity": "", "merchantState": "", "merchantZip": "", "cardPresent": false, "posOnPremises": "", "recurringAuthInd": "", "expirationDateKeyInMatch": false, "isFraud": false}

我想要的是得到一个像这样的熊猫数据框

 accountNumber  customerId  creditLimit availableMoney  transactionDateTime transactionAmount   merchantName    acqCountry  merchantCountryCode posEntryMode    ... echoBuffer  currentBalance  merchantCity    merchantState   merchantZip cardPresent posOnPremises   recurringAuthInd    expirationDateKeyInMatch    isFraud
0   737265056   737265056   5000    5000.0  2016-08-13T14:27:32 98.55   Uber    US  US  02  ... NaN 0.0 NaN NaN NaN False   NaN NaN False   False

当我执行下面的代码时,只是读取 JSON 并转换为 Pandas,我没有得到值;但每个单元格也包括字典。我也试过 json_normalize,但问题不是所有列都对齐。任何帮助,将不胜感激。

import json
import pandas as pd
with open('test.txt', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
0   1   2   3   4   5   6   7   8   9   ... 19  20  21  22  23  24  25  26  27  28
0   {"accountNumber": "737265056"   "customerId": "737265056"   "creditLimit": 5000.0   "availableMoney": 5000.0    "transactionDateTime": "2016-08-13T14:27:32"    "transactionAmount": 98.55  "merchantName": "Uber"  "acqCountry": "US"  "merchantCountryCode": "US" "posEntryMode": "02"    ... "echoBuffer": ""    "currentBalance": 0.0   "merchantCity": ""  "merchantState": "" "merchantZip": ""   "cardPresent": false    "posOnPremises": "" "recurringAuthInd": ""  "expirationDateKeyInMatch": false   "isFraud": false}
1   {"accountNumber": "737265056"   "customerId": "737265056"   "creditLimit": 5000.0   "availableMoney": 5000.0    "transactionDateTime": "2016-10-11T05:05:54"    "transactionAmount": 74.51  "merchantName": "AMC #191138"   "acqCountry": "US"  "merchantCountryCode": "US" "posEntryMode": "09"    ... "transactionType": "PURCHASE"   "echoBuffer": ""    "currentBalance": 0.0   "merchantCity": ""  "merchantState": "" "merchantZip": ""   "posOnPremises": "" "recurringAuthInd": ""  "expirationDateKeyInMatch": false   "isFraud": false}

标签: pythonjsonpandasdataframe

解决方案


问题是您的“JSON”文件有点无效。基本上,每条记录都应该用逗号分隔,并且应该包含在这样的列表中 -

[{"accountNumber": "737265056", "customerId": "737265056", "creditLimit": 5000.0, "availableMoney": 5000.0, "transactionDateTime": "2016-08-13T14:27:32", "transactionAmount": 98.55, "merchantName": "Uber", "acqCountry": "US", "merchantCountryCode": "US", "posEntryMode": "02", "posConditionCode": "01", "merchantCategoryCode": "rideshare", "currentExpDate": "06/2023", "accountOpenDate": "2015-03-14", "dateOfLastAddressChange": "2015-03-14", "cardCVV": "414", "enteredCVV": "414", "cardLast4Digits": "1803", "transactionType": "PURCHASE", "echoBuffer": "", "currentBalance": 0.0, "merchantCity": "", "merchantState": "", "merchantZip": "", "cardPresent": false, "posOnPremises": "", "recurringAuthInd": "", "expirationDateKeyInMatch": false, "isFraud": false},
{"accountNumber": "737265056", "customerId": "737265056", "creditLimit": 5000.0, "availableMoney": 5000.0, "transactionDateTime": "2016-10-11T05:05:54", "transactionAmount": 74.51, "merchantName": "AMC #191138", "acqCountry": "US", "merchantCountryCode": "US", "posEntryMode": "09", "posConditionCode": "01", "merchantCategoryCode": "entertainment", "cardPresent": true, "currentExpDate": "02/2024", "accountOpenDate": "2015-03-14", "dateOfLastAddressChange": "2015-03-14", "cardCVV": "486", "enteredCVV": "486", "cardLast4Digits": "767", "transactionType": "PURCHASE", "echoBuffer": "", "currentBalance": 0.0, "merchantCity": "", "merchantState": "", "merchantZip": "", "posOnPremises": "", "recurringAuthInd": "", "expirationDateKeyInMatch": false, "isFraud": false},
{"accountNumber": "737265056", "customerId": "737265056", "creditLimit": 5000.0, "availableMoney": 5000.0, "transactionDateTime": "2016-11-08T09:18:39", "transactionAmount": 7.47, "merchantName": "Play Store", "acqCountry": "US", "merchantCountryCode": "US", "posEntryMode": "09", "posConditionCode": "01", "merchantCategoryCode": "mobileapps", "currentExpDate": "08/2025", "accountOpenDate": "2015-03-14", "dateOfLastAddressChange": "2015-03-14", "cardCVV": "486", "enteredCVV": "486", "cardLast4Digits": "767", "transactionType": "PURCHASE", "echoBuffer": "", "currentBalance": 0.0, "merchantCity": "", "merchantState": "", "merchantZip": "", "cardPresent": false, "posOnPremises": "", "recurringAuthInd": "", "expirationDateKeyInMatch": false, "isFraud": false}]

假设该片段保存在一个名为 的文件中test.json,则以下内容应该有效 -

>>> import pandas as pd
>>> df = pd.read_json('test.json')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 29 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   accountNumber             3 non-null      int64
 1   customerId                3 non-null      int64
 2   creditLimit               3 non-null      int64
 3   availableMoney            3 non-null      int64
 4   transactionDateTime       3 non-null      object
 5   transactionAmount         3 non-null      float64
 6   merchantName              3 non-null      object
 7   acqCountry                3 non-null      object
 8   merchantCountryCode       3 non-null      object
 9   posEntryMode              3 non-null      int64
 10  posConditionCode          3 non-null      int64
 11  merchantCategoryCode      3 non-null      object
 12  currentExpDate            3 non-null      object
 13  accountOpenDate           3 non-null      object
 14  dateOfLastAddressChange   3 non-null      object
 15  cardCVV                   3 non-null      int64
 16  enteredCVV                3 non-null      int64
 17  cardLast4Digits           3 non-null      int64
 18  transactionType           3 non-null      object
 19  echoBuffer                3 non-null      object
 20  currentBalance            3 non-null      int64
 21  merchantCity              3 non-null      object
 22  merchantState             3 non-null      object
 23  merchantZip               3 non-null      object
 24  cardPresent               3 non-null      bool
 25  posOnPremises             3 non-null      object
 26  recurringAuthInd          3 non-null      object
 27  expirationDateKeyInMatch  3 non-null      bool
 28  isFraud                   3 non-null      bool
dtypes: bool(3), float64(1), int64(10), object(15)
memory usage: 761.0+ bytes
>>>

编辑:跟进回应 -

存在如何编辑字典文件以使其成为有效 JSON 的问题。

假设文件被调用test.txt并且看起来像原始格式的文件(每行都有字典,没有逗号分隔,开头和结尾没有括号)。

import json

with open('test.txt', 'r') as file:
    data = file.readlines()

data = [json.loads(d) for d in data]
df = pd.DataFrame(data)

现在你有了你的DataFrame,之后一切都一样了......


推荐阅读