pandas groupby and populate columns based on values

Problem description

Given a dataframe (https://pastebin.com/MdqWz4Ke):

import numpy as np
import pandas as pd

# some data
data3 = [["Alex","Tampa","A23","1","Ax","Red"],
     ["Alex","Tampa","A23","1","Ay","Blue"],
     ["Alex","Tampa","B43","1","Bx","Green"],
     ["Alex","Tampa","B43","1","By","White"],
     ["Alex","Tampa","C55","1","Cx","Red"],
     ["Alex","Tampa","C55","1","Cy","White"],
     ["Alex","Tampa","C55","2","Cx","Purple"],
     ["Alex","Tampa","C55","2","Cy","Black"],
     ["Tim","San Diego","A23","1","Ax","Green"],
     ["Tim","San Diego","A23","1","Ay","Black"],
     ["Tim","San Diego","B43","1","Bx","Yellow"],
     ["Tim","San Diego","B43","1","By","Black"],
     ["Tim","San Diego","C55","1","Cx","Pink"],
     ["Tim","San Diego","C55","1","Cy","Orange"],
     ["Tim","San Diego","A23","2","Ax","Green"],
     ["Tim","San Diego","A23","2","Ay","Red"],
     ["Tim","San Diego","B43","2","Bx",""],
     ["Tim","San Diego","B43","2","By",""],        
     ["Mark","Houston","A23","1","Ax","Purple"],
     ["Mark","Houston","A23","1","Ay","Yellow"],
     ["Mark","Houston","B43","1","Bx","Gray"],
     ["Mark","Houston","B43","1","By","White"],
     ["Mark","Houston","C55","1","Cx",""],
     ["Mark","Houston","C55","1","Cy",""],        
     ["Anthony","Seattle","A23","","Ax","Orange"],
     ["Anthony","Seattle","A23","","Ay","Black"],
     ["Anthony","Seattle","B43","","Bx","Red"],
     ["Anthony","Seattle","B43","","By","Black"],
     ["Anthony","Seattle","C55","","Cx","Blue"],
     ["Anthony","Seattle","C55","","Cy","Pink"]]

# create dataframe
df3 = pd.DataFrame(data3,columns=[
        "Name","City","Domain","Sequence","Group","Value"])

How can I compare the values within groups and use those values to conditionally populate columns?

# add Compared columns
df3["Compared Group"] = ""
df3["Compared Value"] = ""

# replace empty or whitespace-only strings with np.nan
df3.replace(r"^\s*$", np.nan, regex=True, inplace=True)

# fillna for missing Sequence and Value
df3.fillna({"Sequence":"N/A","Value":"NULL"},inplace=True)

# expected result
result = [["Alex","Tampa","A23","1","Ax","Red","Ay","Blue"],
          ["Alex","Tampa","B43","1","Bx","Green","By","White"],
          ["Alex","Tampa","C55","1","Cx","Red","Cy","White"],
          ["Alex","Tampa","C55","2","Cx","Purple","Cy","Black"],
         ["Tim","San Diego","A23","1","Ax","Green","Ay","Black"],
         ["Tim","San Diego","A23","2","Ax","Green","Ay","Red"],
         ["Tim","San Diego","B43","1","Bx","Yellow","By","Black"],
         ["Tim","San Diego","B43","2","Bx","NULL","By","NULL"],        
         ["Tim","San Diego","C55","1","Cx","Pink","Cy","Orange"],        
         ["Mark","Houston","A23","1","Ax","Purple","Ay","Yellow"],
         ["Mark","Houston","B43","1","Bx","Gray","By","White"],
         ["Mark","Houston","C55","1","Cx","NULL","Cy","NULL"],
         ["Anthony","Seattle","A23","","Ax","Orange","Ay","Black"],
         ["Anthony","Seattle","B43","","Bx","Red","By","Black"],
         ["Anthony","Seattle","C55","","Cx","Blue","Cy","Pink"]]

result_df = pd.DataFrame(result,columns=[
        "Name","City","Domain","Sequence","Group",
        "Value","Compared Group","Compared Value"])

Notes:

I created a dictionary that maps the Group values

# map groups with dictionary
group_dict = {"Ax":"Ay","Bx":"By","Cx":"Cy"}

and created a groupby object

# groupby
grouped = df3.groupby(["Name","Sequence","Domain","Group"], group_keys=False)

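My initial plan was to use .loc to populate the Compared columns, possibly together with map and the dictionary, but when I try to access the values in a group...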

for name in df3["Name"]:
    print(grouped.get_group((name,"Ax")))

I get the following error:

ValueError: must supply a a same-length tuple to get_group with multiple grouping keys

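I assume this is because not all groups contain the same number and type of Group values (for example, Tim has Ax in Sequence 1 and 2, while Alex has Ax only in Sequence 1). I am not sure how to proceed from here to merge and transform these rows.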
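
As a side note on the error itself: the message appears to come from the length of the key rather than from uneven groups. The groupby object was built on four keys, so get_group expects a four-element tuple in the same order. A minimal sketch, using a small hypothetical subset of the data above (the names small, grouped_small and alex_ax are made up for illustration):

```python
import pandas as pd

# Hypothetical three-row subset of the sample data, for illustration only
rows = [["Alex", "Tampa", "A23", "1", "Ax", "Red"],
        ["Alex", "Tampa", "A23", "1", "Ay", "Blue"],
        ["Tim", "San Diego", "A23", "2", "Ax", "Green"]]
cols = ["Name", "City", "Domain", "Sequence", "Group", "Value"]
small = pd.DataFrame(rows, columns=cols)

keys = ["Name", "Sequence", "Domain", "Group"]
grouped_small = small.groupby(keys)

# The tuple must supply one value per grouping key, in the same order as `keys`
alex_ax = grouped_small.get_group(("Alex", "1", "A23", "Ax"))
print(alex_ax)

# To inspect every group without knowing the keys in advance, iterate instead
for key, frame in grouped_small:
    print(key, len(frame))
```

Iterating over the groupby object avoids having to construct the key tuples by hand.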

Tags: python-3.x, pandas, group-by

Solution


Given your sample data, where the rows of each group alternate between the x and y entries, you can do the following:

def myfunc(x):
    # take rows 0, 2, 4, ... (the "x" entries) and renumber them 0, 1, 2, ...
    df1 = x.iloc[::2].reset_index(drop=True)

    # take rows 1, 3, 5, ... (the "y" entries) and renumber them the same way
    df2 = x.iloc[1::2].reset_index(drop=True)

    # merge on the index to place the two dataframes side by side
    return df1.merge(df2, left_index=True, right_index=True)

# group by the other columns and select only ['Group', 'Value']
(df3.groupby(['Name', 'City', 'Domain', 'Sequence'])[['Group','Value']]
 .apply(myfunc)                      # pair up the rows within each group
 .reset_index(level=-1, drop=True)   # drop the inner index created by myfunc
 .reset_index()                      # turn the grouping columns back into regular columns
)

Output:

       Name       City Domain Sequence Group_x Value_x Group_y Value_y
0      Alex      Tampa    A23        1      Ax     Red      Ay    Blue
1      Alex      Tampa    B43        1      Bx   Green      By   White
2      Alex      Tampa    C55        1      Cx     Red      Cy   White
3      Alex      Tampa    C55        2      Cx  Purple      Cy   Black
4   Anthony    Seattle    A23      N/A      Ax  Orange      Ay   Black
5   Anthony    Seattle    B43      N/A      Bx     Red      By   Black
6   Anthony    Seattle    C55      N/A      Cx    Blue      Cy    Pink
7      Mark    Houston    A23        1      Ax  Purple      Ay  Yellow
8      Mark    Houston    B43        1      Bx    Gray      By   White
9      Mark    Houston    C55        1      Cx    NULL      Cy    NULL
10      Tim  San Diego    A23        1      Ax   Green      Ay   Black
11      Tim  San Diego    A23        2      Ax   Green      Ay     Red
12      Tim  San Diego    B43        1      Bx  Yellow      By   Black
13      Tim  San Diego    B43        2      Bx    NULL      By    NULL
14      Tim  San Diego    C55        1      Cx    Pink      Cy  Orange
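
The approach above relies on the rows of each group alternating between the x and y entries. If that ordering cannot be guaranteed, one alternative sketch is to pair the rows explicitly with the group_dict mapping from the question and a self-merge. The helper name pair_groups and the small sample frame are assumptions made for illustration; the output column names follow the expected result in the question.

```python
import pandas as pd

group_dict = {"Ax": "Ay", "Bx": "By", "Cx": "Cy"}
keys = ["Name", "City", "Domain", "Sequence"]

def pair_groups(df, mapping):
    """Put each 'x' row next to the row of its mapped 'y' group (hypothetical helper)."""
    # rows whose Group is a key of the mapping form the left side
    left = df[df["Group"].isin(list(mapping))].copy()
    left["Compared Group"] = left["Group"].map(mapping)
    # the mapped rows are renamed so they can be merged in as the right side
    right = df[df["Group"].isin(list(mapping.values()))].rename(
        columns={"Group": "Compared Group", "Value": "Compared Value"})
    # a left merge keeps 'x' rows even when no matching 'y' row exists
    return left.merge(right, on=keys + ["Compared Group"], how="left")

# small sample, deliberately not in alternating order
sample = pd.DataFrame(
    [["Alex", "Tampa", "A23", "1", "Ay", "Blue"],
     ["Alex", "Tampa", "B43", "1", "Bx", "Green"],
     ["Alex", "Tampa", "A23", "1", "Ax", "Red"],
     ["Alex", "Tampa", "B43", "1", "By", "White"]],
    columns=keys + ["Group", "Value"])

paired = pair_groups(sample, group_dict)
print(paired)
```

Because the pairing is done on the key columns plus the mapped group name, the row order of the input does not matter. Applied to df3 after the fillna step, this should give one row per Name, City, Domain and Sequence combination, assuming every x row has exactly one matching y row.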
