首页 > 解决方案 > 在加载到 json (to_json) 之前从 pandas 数据帧迭代器中删除元素

问题描述

我正在尝试从 data_json 字符串中删除电子邮件和用户 ID,因为它是先前提取的,我不希望这些字段列出两次,工作 python 脚本:

credentials = service_account.Credentials.from_service_account_file('/keys/json_poc.txt')
project_id = 'myproject'
bq_conn = bigquery.Client(credentials=credentials,project=project_id)


 # setup sql query
bq_sql = ("""
select email , userid, * except (email , userid) from dataset.usertable
""")

# load up dataframe
df = bq_conn.query(bq_sql).to_dataframe()

# iterate over rows
for i, row in df.iterrows():
    
    # pull out email and userid from df
    email = df.loc[i]['email']
    userid = df.loc[i]['userid']
    
    # build json data fields for iterable
    data_json = df.loc[i].to_json(orient="columns") # remove fields
    
    # json format {"email": "email@yahoo.com","dataFields": {},"userId": "99999"}
    final_json = '{"email": "' + email + '","dataFields": ' + data_json + ',"userId": "' + userid + '"}'
    r = requests.request(method="POST", url=url, headers=headers, data=final_json)
    print('added: ' + email)

API (GET) 的输出如下所示:

b'{"user":{"email":"email@yahoo.com","dataFields":{"email":"email@yahoo.com","userId":"99999","StateofResidence":"CA",......,"categoryid":99},"userId":"99999"}}'

上面你可以看到 email 和 userid 字段被列出了两次。我们将为每个用户提供大约 500 个属性,因此通过手动提取元素来构建 json 字符串是不切实际的(另外我希望能够在不更新 python 的情况下向源表添加列)。

任何想法将不胜感激。

标签: pythonjsonpandasdataframe

解决方案


整理了一下,只需要改如下:

def myconverter(o):
if isinstance(o, datetime.datetime):
    return o.__str__()

bq_sql = ("""select email , userid, * except (email , userid) from dataset.usertable 
""")

df = bq_conn.query(bq_sql).to_dataframe()

for i, row in df.iterrows():

    payload_dict = {"email": row["email"], "dataFields": row.drop(labels=['userid', 'email']).to_dict(), "userid": row["userid"]} 

    json_payload = json.dumps(payload_dict, default = myconverter) #cast datetime
    
    r = requests.request(method="POST", url=url, headers=headers, data=json_payload)
    print(r.status_code)
    print(r.content)

推荐阅读