首页 > 解决方案 > 如何在列中填充空值?

问题描述

我需要读取一个 csv 文件并根据关系类型和 LastName 在“电话、电子邮件、地址、城市、州、邮编”列中填充空/空值,然后写入一个新的 csv 文件。示例:如果一个人的 relationshipType 是“Employer”并且他的家属共享相同的姓氏并且如果家属没有“Phone,Email,address,city,state,zip”,我应该用 Employer 的“Phone,Email”填充空值,地址,城市,州,邮编”。我面临两个主要问题,1)姓氏对于不同的人/家庭(即第 5-10 行)可能相同,因此每当“关系”更改为“雇主”时,循环应该中断并继续 2)在某些情况下, Dependent Lastname 不会与 Employer Lastname 相同,但如果它位于相同的姓氏之间,我们仍然应该填充空值(即,

我将无法使用 python pandas,因为其余代码/程序完全基于 python 2.7。

输入格式/表格如下所示,带有空单元格(csv 文件):

[FirstName  LastName    DoB Relationship    Phone   Email   Address City    State   Zip
Hannah  Kahnw   9/12/1972   Employer    1457871452  hann@gmail.com  Han Ave hannas  UT  563425
Michel  Kahnw   2/9/1993    Dependent                       
Jonaas  Kahnw   2/22/1997   Dependent                       
Mikkel  Nielsen 1/25/1976   Employer    4509213887  mik@yah.com 887 Street  neil    NY  72356
Magnus  Nielsen 9/20/1990   Dependent                       
Ulrich  Nielsen 9/12/1983   Employer    7901234516  ulric@mail.com  Ulric Build mavric  KS  421256
kathari Nielsen 10/2/2003   Dependent                       
kathy   storm   12/12/1999  Dependent                       
kiiten  Nielsen 6/21/1999   Dependent                       
Elisab  Doppler 2/22/1987   Employer    5439001211  elop@amaz.com   Elisa apart Elis    AR  758475
Peterp  Doppler 1/25/1977   Employer    6847523758  per@dail.com    park Ave    Pete    PT  415253
bartos  Doppler 9/21/1990   Dependent][1]                       

输出格式应如下所示:

FirstName   LastName    DoB Relationship    Phone   Email   Address City    State   Zip
Hannah  Kahnw   9/12/1972   Employer    1457871452  hann@gmail.com  Han Ave hannas  UT  563425
Michel  Kahnw   2/9/1993    Dependent   1457871453  hann@gmail.com  Han Ave hannas  UT  563426
Jonaas  Kahnw   2/22/1997   Dependent   1457871454  hann@gmail.com  Han Ave hannas  UT  563427
Mikkel  Nielsen 1/25/1976   Employer    4509213887  mik@yah.com 887 Street  neil    NY  72356
Magnus  Nielsen 9/20/1990   Dependent   4509213888  mik@yah.com 888 Street  neil    NY  72357
Ulrich  Nielsen 9/12/1983   Employer    7901234516  ulric@mail.com  Ulric Build mavric  KS  421256
kathari Nielsen 10/2/2003   Dependent   7901234517  ulric@mail.com  Ulric Build mavric  KS  421257
kathy   storm   12/12/1999  Dependent   7901234518  ulric@mail.com  Ulric Build mavric  KS  421258
kiiten  Nielsen 6/21/1999   Dependent   7901234519  ulric@mail.com  Ulric Build mavric  KS  421259
Elisab  Doppler 2/22/1987   Employer    5439001211  elop@amaz.com   Elisa apart Elis    AR  758475
Peterp  Doppler 1/25/1977   Employer    6847523758  per@dail.com    park Ave    Pete    PT  415253
bartos  Doppler 9/21/1990   Dependent   6847523759  per@dail.com    park Ave    Pete    PT  415254


import csv
from collections import namedtuple


def get_info(file_path):

    # Read data from file and convert to list of namedtuples
    # dictionary to use to fill in missing information from others.
    with open(file_path, 'rb') as fin:
        csv_reader =  csv.reader(fin, skipinitialspace=True)

        header = next(csv_reader)
        Record = namedtuple('Record', header)

        addr_dict = {}
        data = [header]

        for rec in (Record._make(row) for row in csv_reader):
            if rec.Email or rec.Phone or rec.Address or rec.City or rec.State or rec.Zip:
                addr_dict.setdefault(rec.LastName, []).append(rec)  # Remember it.

    # Try to fill in missing data from any other records with same Address.
    for i, row in enumerate(data[1:], 1):
        if not (row.Phone and row.Email and rec.Address and rec.City and rec.State and rec.Zip):  # Info missing?
            # Try to copy it from others at same address.
            updated = False
            for other in addr_dict.get(row.LastName, []):
                if not row.Phone and other.Phone:
                    row = row._replace(Phone=other.Phone)
                    updated = True
                if not row.Email and other.Email:
                    row = row._replace(Email=other.Email)
                    updated = True
                if not row.Address and other.Address:
                    row = row._replace(Address=other.Address)
                    updated = True
                if not row.City and other.City:
                    row = row._replace(City=other.City)
                    updated = True
                if not row.Zip and other.Zip:
                    row = row._replace(Zip=other.Zip)
                    updated = True
                if row.Phone and row.Email and rec.Address and rec.City and rec.State and rec.Zip:  # Info now filled in?
                    break

            if updated:
                data[i] = row

    return data


INPUT_FILE = 'null_cols.csv'
OUTPUT_FILE = 'fill_cols.csv'

data = get_info(INPUT_FILE)

with open(OUTPUT_FILE, 'wb') as fout:
    writer = csv.DictWriter(fout, data[0])  # First elem has column names.
    writer.writeheader()
    for row in data[1:]:
        writer.writerow(row._asdict())

#(i got this code from earlier question which i asked in S.O This script doesn't include relationshiptype logic and also it doesn't consider the Duplicate LastName issue)

谢谢您的帮助 !!

标签: pythonpython-2.7csvparsing

解决方案


好的,有一种使用 ffill 的简单方法,如下所示,但是,这仅在家属始终按行顺序跟随其雇主的情况下才有效,这在您提供的数据示例中就是这种情况,因此以下代码有效:

import pandas as pd

#read in your csv file
df = pd.read_csv('fileName.csv')


#loop over columns and replace nans with the most recent value available
for c in df.columns:
    df[c].fillna(method='ffill', inplace=True)

#write out your df back to csv
df.to_csv('newFile.csv', index=False)

推荐阅读