Pandas DataFrame: how to manage the automatic float conversion of columns containing NaN

Problem description

I know that the null value NaN in pandas is a float, which is why integers are converted to floats in a column that contains NaN.
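To make the behaviour concrete, here is a minimal sketch of the upcast described above. The series names are made up for illustration and are not part of my script.

```python
import numpy as np
import pandas as pd

# An integer column with a missing value is stored as float64,
# because NaN itself is a float.
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)                  # float64
print(with_nan.astype(str).tolist())   # ['1.0', '2.0', 'nan']

# The nullable "Int64" extension dtype keeps integers and uses pd.NA
# for the missing value instead of NaN.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)                  # Int64
```

The second print shows the symptom: once the column is float, converting it to text yields `1.0` rather than `1`.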

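But I have a script that merges several data sources to produce an xpt file. My problem is that the layout of the xpt file is defined by SDTM, an international standard.

Laboratory results can be integers, floats and possibly text (positive, negative). In SDTM I have 2 result columns (LBORRES and LBSTRESN): one is char (LBORRES) and the other is numeric (LBSTRESN).

The char column must store integers (without decimals), floats and strings. Is that possible?

I have no problem with the numeric column. But in the char column, integers are converted to floats.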

import numpy as np
import pandas as pd

# sdtm_variables: list of SDTM column names, defined elsewhere (not shown)
df = pd.DataFrame(columns=sdtm_variables)

# reference table = randomisation
df_randomisation = pd.read_csv('./csv/source/randomisation.csv', delimiter=",").fillna('NULL')
df_randomisation = df_randomisation.loc[:, df_randomisation.columns != 'redcap_repeat_instance']  # exclude column redcap_repeat_instance as randomisation forms are unique
df_randomisation = df_randomisation.loc[:, df_randomisation.columns != 'redcap_event_name']  # exclude column redcap_event_name as randomisation forms are unique
df_randomisation = df_randomisation.loc[:, df_randomisation.columns != 'ID']  # exclude column ID as randomisation forms are unique
df = pd.merge(df, df_randomisation, left_on='pat_ide', right_on='pat_ide', how='outer').fillna('NULL')


# 1. raw data retrieved

df_laboratory = pd.read_csv('./csv/source/biologie_labo.csv', delimiter=",").fillna('NULL')
# df_laboratory = df_laboratory[['pat_ide','redcap_event_name','redcap_repeat_instance','ID']]
# merge, replace missing values, then sort once (the original sorted both before and after fillna)
df = pd.merge(df, df_laboratory, left_on='pat_ide', right_on='pat_ide', how='outer').fillna('NULL').sort_values(by=['pat_ide', 'lab_dat'])

# creation of empty dataframe where will be add lines for each diseases for a patient
tmp_df_laboratory = pd.DataFrame(columns = sdtm_variables)
tmp_df_laboratory['ran_trt'] = None


# df = df[(df['pat_ide'] == 'BFBO001')]
# print(df)

# list of biological analysis
labos = pd.read_excel('LABO.xls',sheet_name='EXAMS')

for index, row in df.iterrows(): # .convert_dtypes() is supposed to stop integers from becoming floats when values are missing (NaN is a float, so the column is converted to float), but it does not work here
    for k,v in labos.iterrows(): 
        if row['ID'] != 'NULL':
            # Biological analysis with unit select by user
            # and not corresponding to: Coagulation (Taux de prothrombine, INR) et analyses sérologique Hépatite et HIV
            # if f"{v['VAR']}"[-4:] != '_uni' and f"{v['VAR']}" not in ['lab_pro','lab_inr','lab_hbs','lab_hcv','lab_vih','lab_hcg','lab_gro']:   
            # if f"{v['VAR']}" in ['lab_pro','lab_inr','lab_hbs','lab_hcv','lab_vih','lab_hcg','lab_gro']: 
            if f"{v['CATEGORY']}" == 'NFS': 
                tmp_df_laboratory = tmp_df_laboratory.append({
                    ...
                    'LBORRES' : str(row[f"{v['VAR']}"]) if row[f"{v['VAR']}"] != 'NULL' else '',
                    'LBSTRESN' : row[f"{v['VAR']}"] if row[f"{v['VAR']}"] != 'NULL' else np.nan,
                },ignore_index=True) 
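For illustration, the sketch below shows one way the `LBORRES` text could be built from a value that has already been parsed as a float. `format_result` is a hypothetical helper, not part of the script above, and the sample values are invented. It assumes that a whole-number float such as `12.0` was originally recorded as `12`, which may not hold for every laboratory value. The rows are collected in a list because `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. Unlike the loop above, this sketch writes NaN rather than the text into `LBSTRESN` when the result is not numeric.

```python
import math

import numpy as np
import pandas as pd


def format_result(value):
    """Hypothetical helper: text for a char column, without a trailing '.0'."""
    if value is None:
        return ""
    if isinstance(value, str):
        # 'NULL' is the placeholder that the script uses for missing values.
        return "" if value == "NULL" else value
    if isinstance(value, float):  # also true for numpy.float64
        if math.isnan(value):
            return ""
        if value.is_integer():
            return str(int(value))
    return str(value)


# Invented sample values: a whole-number float, a decimal, text, and a missing value.
rows = []
for value in [12.0, 4.5, "positive", "NULL"]:
    rows.append({
        "LBORRES": format_result(value),
        # Assumption: text results have no numeric equivalent.
        "LBSTRESN": np.nan if isinstance(value, str) else value,
    })

result = pd.DataFrame(rows)
print(result["LBORRES"].tolist())  # ['12', '4.5', 'positive', '']
```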
            


Expected output (note that the last row is empty)

[image: expected output]

Tags: python, pandas

Solution
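One possible approach, sketched under assumptions rather than taken from the original script: read the CSV with `dtype=str` so that pandas keeps the text exactly as written, use that text for `LBORRES`, and derive `LBSTRESN` with `pd.to_numeric(..., errors="coerce")`. The in-memory CSV and the column `lab_hem` are hypothetical stand-ins for `biologie_labo.csv`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for ./csv/source/biologie_labo.csv
csv_text = (
    "pat_ide,lab_hem,lab_hbs\n"
    "P001,12,negative\n"
    "P002,,positive\n"
    "P003,7,\n"
)

# Default parsing: lab_hem has a missing value, so it becomes float64.
default = pd.read_csv(io.StringIO(csv_text))
print(default["lab_hem"].astype(str).tolist())  # ['12.0', 'nan', '7.0']

# dtype=str keeps every cell as the text found in the file.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Char column: the original text, with an empty string for missing values.
lborres = raw["lab_hem"].fillna("")
print(lborres.tolist())                         # ['12', '', '7']

# Numeric column: anything that is not a number becomes NaN.
lbstresn = pd.to_numeric(raw["lab_hem"], errors="coerce")
hbs_numeric = pd.to_numeric(raw["lab_hbs"], errors="coerce")
```

With this layout the integer `12` never passes through a float column, so no `.0` is added to the char value, while text results such as `positive` simply have no numeric counterpart.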

