Python analysis of a prescription dataset: loop terminates early and no output file is written

Problem description

I have this input data:

id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500

The output should look like this:

drug_name,num_prescriber,total_cost
CHLORPROMAZINE,2,3000
BENZTROPINE MESYLATE,1,1500
AMBIEN,2,300

Instead, I get the following output:

AMBIEN 2 300
CHLORPROMAZINE 0 0
BENZTROPINE MESYLATE 0 0

Any suggestions would be greatly appreciated! My code is below:

import re

fileHandle = """
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
"""
input_data = re.sub(r'(\n)', r',\1', fileHandle)   
fields = input_data.split(',')    
del fields[0]

NumberOfRows = round(len(fields)/5)
NumberOfCols = 5
length_of_fields=len(fields)

# Expected output: drug_name,number_of_prescribers,total_cost
# drug_name at indices 3 (title), 8, 13, 18, 23, 28
# drug_cost at indices 4 (title), 9, 14, 19, 24, 29


#outputfile = open('/output/top_cost_drug.txt','w') 

# get list of drug names
i=8
j=0
drug_name_list=list()
drug_name_indices=list()
while i<=length_of_fields:
    drug_name_list.append(fields[i])
    drug_name_indices.append(i)
    i=i+5;
    j=j+1;

# find unique names (same as drug_name_list but without repetition)

unique_drug_list = list()


# traverse all elements
for x in drug_name_list:
    # check whether x already exists in unique_drug_list
    if x not in unique_drug_list:
        unique_drug_list.append(x)

i=0
j=0
number_of_unique_drugs=len(unique_drug_list)
unique_cost_list=[0]*number_of_unique_drugs
number_of_prescribers = [0]*number_of_unique_drugs


#while i<len(drug_name_list):   
#    while j<number_of_unique_drugs:
#        if drug_name_list[i]==unique_drug_list[j]:
#            drug_name_index=drug_name_indices[i]
#            cost_of_drug=int(fields[drug_name_index+1])
#            unique_cost_list[j]=int(unique_cost_list[j])+cost_of_drug
#            number_of_prescribers[j]=number_of_prescribers[i]+1
#        j=j+1
#    i=i+1

while j<number_of_unique_drugs:
    while i<len(drug_name_list):   
        if drug_name_list[i]==unique_drug_list[j]:
            drug_name_index=drug_name_indices[i]
            cost_of_drug=int(fields[drug_name_index+1])
            unique_cost_list[j]=int(unique_cost_list[j])+cost_of_drug
            number_of_prescribers[j]=number_of_prescribers[i]+1
        i=i+1
    j=j+1

# print output values
counter=0
print("drug_name,number_of_prescribers,total_cost \n")
while counter<number_of_unique_drugs:
    print(unique_drug_list[counter], number_of_prescribers[counter], unique_cost_list[counter])
    print("\n")
    counter=counter+1

Also, I changed the print statements to outputfile.write, but no output file is produced. Why is that?

outputfile = open('/output/top_cost_drug.txt','w')

# print output values
counter=0
outputfile.write("drug_name,number_of_prescribers,total_cost \n")
while counter<number_of_unique_drugs:
    outputfile.write(unique_drug_list[counter],',', number_of_prescribers[counter],',', unique_cost_list[counter])
    print("\n")
    counter=counter+1

Tags: python, dataframe

Solution


To get the expected output, use the pandas groupby() aggregation method:

df.groupby("drug_name").drug_cost.agg(["count", "sum"])

                      count   sum
drug_name                        
AMBIEN                    2   300
BENZTROPINE MESYLATE      1  1500
CHLORPROMAZINE            2  3000
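The table above has the right numbers, but not the column names or the ordering shown in the question. The following is a minimal sketch of one way to get there, assuming pandas 0.25 or later (for named aggregation) and assuming that each input row stands for one prescriber. The name `summary` and the trimmed-down DataFrame are introduced here for illustration only.

```python
import pandas as pd

# Only the columns needed for the summary; values copied from the question.
df = pd.DataFrame({
    "id": [1000000001, 1000000002, 1000000003, 1000000004, 1000000005],
    "drug_name": ["AMBIEN", "AMBIEN", "CHLORPROMAZINE",
                  "CHLORPROMAZINE", "BENZTROPINE MESYLATE"],
    "drug_cost": [100, 200, 1000, 2000, 1500],
})

summary = (
    df.groupby("drug_name")
      # Named aggregation: output column = (input column, function)
      .agg(num_prescriber=("id", "count"), total_cost=("drug_cost", "sum"))
      # Most expensive drug first, as in the expected output
      .sort_values("total_cost", ascending=False)
      # Turn the drug_name index back into an ordinary column
      .reset_index()
)

print(summary.to_csv(index=False))
```

Counting rows is only the same as counting prescribers if no prescriber appears twice for the same drug; if that can happen in the real data, a distinct count on a prescriber key would probably be the safer choice.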

To write to a file, use to_csv():

df.groupby("drug_name").drug_cost.agg(["count", "sum"]).to_csv("output.csv")
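On the missing output file in the question, two things in the snippet would plausibly explain it. A file object's write() accepts exactly one string, so calling it with several comma-separated arguments raises a TypeError. Also, open() in 'w' mode fails if the parent directory (here /output at the filesystem root) does not exist. The sketch below shows a pandas-free alternative using the standard csv module. The helper names `summarize` and `write_summary` and the temporary output location are hypothetical, chosen only for this example.

```python
import csv
import io
import os
import tempfile
from collections import defaultdict

RAW = """id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
"""

def summarize(text):
    """Return (drug_name, num_prescriber, total_cost) tuples, highest cost first."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        counts[row["drug_name"]] += 1          # one row is assumed to be one prescriber
        totals[row["drug_name"]] += int(row["drug_cost"])
    rows = [(name, counts[name], totals[name]) for name in totals]
    return sorted(rows, key=lambda r: r[2], reverse=True)

def write_summary(rows, path):
    # The with-block closes (and flushes) the file even if an error occurs.
    # newline="" lets the csv module control line endings.
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["drug_name", "num_prescriber", "total_cost"])
        writer.writerows(rows)

rows = summarize(RAW)
# Hypothetical location: a directory that is known to exist.
out_path = os.path.join(tempfile.gettempdir(), "top_cost_drug.txt")
write_summary(rows, out_path)
```

csv.writer converts the integers to text itself, so no manual string concatenation is needed. With plain write() the fields would have to be joined into a single string first.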

Data:

import pandas as pd
data = {'id': {0: 1000000001,
  1: 1000000002,
  2: 1000000003,
  3: 1000000004,
  4: 1000000005},
 'prescriber_last_name': {0: 'Smith',
  1: 'Garcia',
  2: 'Johnson',
  3: 'Rodriguez',
  4: 'Smith'},
 'prescriber_first_name': {0: 'James',
  1: 'Maria',
  2: 'James',
  3: 'Maria',
  4: 'David'},
 'drug_name': {0: 'AMBIEN',
  1: 'AMBIEN',
  2: 'CHLORPROMAZINE',
  3: 'CHLORPROMAZINE',
  4: 'BENZTROPINE MESYLATE'},
 'drug_cost': {0: 100, 1: 200, 2: 1000, 3: 2000, 4: 1500}}

df = pd.DataFrame(data)

df
           id prescriber_last_name prescriber_first_name  \
0  1000000001                Smith                 James   
1  1000000002               Garcia                 Maria   
2  1000000003              Johnson                 James   
3  1000000004            Rodriguez                 Maria   
4  1000000005                Smith                 David   

              drug_name  drug_cost  
0                AMBIEN        100  
1                AMBIEN        200  
2        CHLORPROMAZINE       1000  
3        CHLORPROMAZINE       2000  
4  BENZTROPINE MESYLATE       1500  
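As for why the nested while loops in the question stop after the first drug: the inner counter i is set to 0 only once, before the outer loop starts. After the first pass it already equals len(drug_name_list), so the inner loop body never runs again for the remaining drugs, which leaves their counts and costs at 0. The count update also reads number_of_prescribers[i] where index j is meant. Below is a sketch of the corrected loops, with the parsed lists written out by hand. `drug_cost_list` is a simplification introduced here in place of the index lookups into `fields`.

```python
drug_name_list = ["AMBIEN", "AMBIEN", "CHLORPROMAZINE",
                  "CHLORPROMAZINE", "BENZTROPINE MESYLATE"]
# Assumed: costs aligned position by position with drug_name_list
drug_cost_list = [100, 200, 1000, 2000, 1500]

# Unique drug names, in order of first appearance
unique_drug_list = []
for name in drug_name_list:
    if name not in unique_drug_list:
        unique_drug_list.append(name)

unique_cost_list = [0] * len(unique_drug_list)
number_of_prescribers = [0] * len(unique_drug_list)

j = 0
while j < len(unique_drug_list):
    i = 0  # reset for every drug; the original set this once, outside both loops
    while i < len(drug_name_list):
        if drug_name_list[i] == unique_drug_list[j]:
            unique_cost_list[j] += drug_cost_list[i]
            number_of_prescribers[j] += 1  # index j, not i
        i += 1
    j += 1

for name, count, cost in zip(unique_drug_list, number_of_prescribers, unique_cost_list):
    print(f"{name},{count},{cost}")
```

Two for loops over the lists would avoid the manual counters altogether, which is usually how this class of bug is sidestepped in Python.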
