Printing dictionary values to a new CSV file

Problem description

I have a CSV file that looks like this:

year,gender,age,country
2002,F,9-10,CO
2002,F,9-10,CO
2002,M,9-10,CO
2002,F,9-10,BR
2002,M,11-15,BR
2002,F,11-15,CO
2003,F,9-10,CO
2003,M,9-10,CO
2003,F,9-10,BR
2003,M,9-10,CO
2004,F,11-15,BR
2004,F,11-15,CO
2004,F,9-10,BR
2004,F,9-10,CO

I want to produce an output file like this:

year,gender,age,country,population
2002,F,9-10,CO,2
2002,M,9-10,CO,1
2002,F,9-10,BR,1
2002,M,9-10,BR,0
2002,F,11-15,CO,1
2002,M,11-15,CO,0
2002,F,11-15,BR,0
2002,M,11-15,BR,1
2003,F,9-10,CO,1
2003,M,9-10,CO,1
2003,F,9-10,BR,1
2003,M,9-10,BR,0
2003,F,11-15,CO,0
2003,M,11-15,CO,0
2004,F,9-10,CO,1
2004,M,9-10,CO,0
2004,F,9-10,BR,1
2004,M,9-10,BR,0
2004,F,11-15,CO,1
2004,M,11-15,CO,0
2004,F,11-15,BR,1
2004,M,11-15,BR,0

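Basically, I want to print the number of females for each year, age group, and country, so year, gender, age, and country will be the dictionary key. Some years have no data for a particular country, or no data for a particular age group in a particular country. For example, in 2003 there is no data for females aged 11-15 in country CO. In that case the population should be 0. Some years also have no data at all for one gender. For example, 2004 has no male data for any age group or country, but I still want those rows printed in the output file with a population of 0.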

Below is the Python code I wrote, but it does not work, and I don't know how to handle the missing data and print it as 0 in the population field.

import csv
import os
import sys
from operator import itemgetter, attrgetter
import math
from collections import Counter

# Create dictionary to hold the data
valDic = {}

# Read data into dictionary
with open(sys.argv[1], "r",) as inputfile:
    readcsv = csv.reader(inputfile, delimiter = ',')    
    next(readcsv)
    for line in readcsv:
        key = line[0] + line[1] + line[2] + line[3]
        year = line[0]
        gender = line[1]
        age = line[2]
        country = line[3]
        if key in valDic:
            key = key + 1
        else:
            valDic[key] = [year, gender, age, country, 0] # 0s are placeholder for running sum and itemCount
    inputfile.close()  

newcsvfile = []

for key in valDic:
    newcsvfile.append([valDic[key][0], valDic[key][1], valDic[key][2], valDic[key][3], len(valDic[key])])

newcsvfile = sorted(newcsvfile)
newcsvfile = [["year", "gender", "age", "country", "population"]] 

with open(sys.argv[2], "w") as outputfile:
    writer = csv.writer(outputfile)
    writer.writerows(newcsvfile)        

Tags: python, pandas, numpy, data-cleaning

Solution


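We can store each combination of year, gender, age, and country as a tuple and use it as the dictionary key. We also maintain a set of the unique values seen for each of those fields. After reading the file, we iterate over every possible combination of those values; if a combination is missing from the data (for example, 2004 has rows for females but none for males), we add it with a count of 0.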
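The idea can be tried on a handful of in-memory rows before touching any files. This is a minimal sketch rather than the script from this answer: the `rows` list and the `fill_missing` helper are hypothetical names introduced here, and `collections.Counter` plus `itertools.product` stand in for the explicit dictionary and nested loops.

```python
from collections import Counter
from itertools import product


def fill_missing(rows):
    """Count each row tuple and report 0 for combinations that never occur.

    `rows` is assumed to be a list of equal-length tuples of strings,
    e.g. (year, gender, age, country). Hypothetical helper for illustration.
    """
    if not rows:
        return {}
    counts = Counter(tuple(row) for row in rows)
    # Unique values seen in each column position
    columns = [sorted(set(col)) for col in zip(*rows)]
    # product() yields the Cartesian product of the unique values;
    # looking up a missing key in a Counter returns 0 without storing it.
    return {key: counts[key] for key in product(*columns)}


rows = [
    ("2002", "F", "9-10", "CO"),
    ("2002", "F", "9-10", "CO"),
    ("2002", "M", "11-15", "BR"),
]
result = fill_missing(rows)
```

With these three rows there are one year, two genders, two age groups, and two countries, so `result` holds eight keys; the combinations that never appear in `rows` map to 0.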

Demo:

import csv
import sys

# Number of rows seen for each (year, gender, age, country) combination
valDic = {}

# Unique values seen in each column
years, genders, ages, countries = set(), set(), set(), set()

# Read data into dictionary
with open(sys.argv[1], "r", newline="") as inputfile:

    reader = csv.reader(inputfile, delimiter=",")
    next(reader)  # skip the header row

    for row in reader:

        key = (row[0], row[1], row[2], row[3])

        years.add(key[0])
        genders.add(key[1])
        ages.add(key[2])
        countries.add(key[3])

        # Start at 0 the first time a combination is seen, then count it
        valDic[key] = valDic.get(key, 0) + 1


# Add missing combinations with a count of 0
for y in years:
    for g in genders:
        for a in ages:
            for c in countries:
                key = (y, g, a, c)
                if key not in valDic:
                    valDic[key] = 0

# Prepare new CSV: header row first, then one row per combination
newcsvfile = [["year", "gender", "age", "country", "population"]]

# Keys are tuples of strings, so they sort lexicographically
for key, val in sorted(valDic.items()):
    newcsvfile.append([key[0], key[1], key[2], key[3], val])

with open(sys.argv[2], "w", newline="") as outputfile:
    writer = csv.writer(outputfile)
    writer.writerows(newcsvfile)

Output:

year,gender,age,country,population
2002,F,11-15,BR,0
2002,F,11-15,CO,1
2002,F,9-10,BR,1
2002,F,9-10,CO,2
2002,M,11-15,BR,1
2002,M,11-15,CO,0
2002,M,9-10,BR,0
2002,M,9-10,CO,1
2003,F,11-15,BR,0
2003,F,11-15,CO,0
2003,F,9-10,BR,1
2003,F,9-10,CO,1
2003,M,11-15,BR,0
2003,M,11-15,CO,0
2003,M,9-10,BR,0
2003,M,9-10,CO,2
2004,F,11-15,BR,1
2004,F,11-15,CO,1
2004,F,9-10,BR,1
2004,F,9-10,CO,1
2004,M,11-15,BR,0
2004,M,11-15,CO,0
2004,M,9-10,BR,0
2004,M,9-10,CO,0
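
Because the keys are tuples of strings, `sorted` orders them lexicographically, which is why `11-15` appears before `9-10` in the output above. If numeric order of the age groups is preferred, a custom sort key is one option. The sketch below assumes that every year is an integer string and that every age value has the form `low-high` with integer bounds; `age_sort_key` is a hypothetical helper, not part of the answer's script.

```python
def age_sort_key(key):
    """Sort key for (year, gender, age, country) tuples of strings.

    Assumes year looks like "2002" and age looks like "9-10";
    hypothetical helper for illustration.
    """
    year, gender, age, country = key
    low, high = (int(part) for part in age.split("-"))
    # Compare the year and the age bounds as numbers, the rest as strings
    return (int(year), gender, low, high, country)


keys = [
    ("2002", "F", "11-15", "BR"),
    ("2002", "F", "9-10", "CO"),
    ("2002", "F", "9-10", "BR"),
]
ordered = sorted(keys, key=age_sort_key)
```

Here `ordered` lists both `9-10` rows before the `11-15` row. In a script like the one above, the same key could be applied to the dictionary items with `sorted(valDic.items(), key=lambda item: age_sort_key(item[0]))`.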
