How to groupby when there is a pattern by group but not an ID?

Problem description

I have a dataset of individuals organized by family. The families always (for this example) follow the same order: father, mother, daughter, and son (the children can repeat). I would like to group the data by family and create a dataset containing only the children, with some information from their parents. Usually, I would use groupby and transform to create this new dataset. However, I don't have a family ID, and the final dataset will not have the same shape as the original because it excludes the parents.

Here is a reproducible minimal example, the desired output, and what I have attempted so far:

import pandas as pd

df = pd.DataFrame({'name': ['Stephen Dedalus', 'Helena', 'Atena', 'Odisseu', 'Filoctetes', 'Medeia', 'Antígona', 'Tebas', 'Aqulies'],
                   'age': [30, 32, 15, 16, 45, 43, 12, 13, 6],
                   'can_read': ['yes', 'no', 'yes'] * 3,
                   'nationality': ['Irish', 'Greek', 'Brazilian'] * 3,
                   'parental_situation': ['chefe', 'dependente-esposa', 'filha_dependente', 'filho_dependente',
                                          'chefe', 'dependente-esposa', 'filha_dependente', 'filha_dependente',
                                          'filho_dependente']})
df

So, my data is like:

   name             age  can_read  nationality  parental_situation
0  Stephen Dedalus  30   yes       Irish        chefe
1  Helena           32   no        Greek        dependente-esposa
2  Atena            15   yes       Brazilian    filha_dependente
3  Odisseu          16   yes       Irish        filho_dependente
4  Filoctetes       45   no        Greek        chefe
5  Medeia           43   yes       Brazilian    dependente-esposa
6  Antígona         12   yes       Irish        filha_dependente
7  Tebas            13   no        Greek        filha_dependente
8  Aqulies          6    yes       Brazilian    filho_dependente

Desired output:

name      age  can_read  nationality  father_canread  mother_canread  father_nation  mother_nation
Atena     15   yes       Brazilian    yes             no              Irish          Greek
Odisseu   16   yes       Irish        yes             no              Irish          Greek
Antígona  12   yes       Irish        no              yes             Greek          Brazilian
Tebas     13   no        Greek        no              yes             Greek          Brazilian
Aquiles   6    yes       Brazilian    no              yes             Greek          Brazilian

What I have been trying

My solution for creating an ID variable for each family is the following (very inefficient):

df['id'] = 2  # default: last family
for k in df.index:
    # number of family heads ('chefe') among the k rows before row k
    n_heads = list(df['parental_situation'][:k]).count('chefe')
    if n_heads in (1, 2):
        df.loc[k - 1, 'id'] = n_heads  # .loc avoids chained assignment

Then I create a function to get the information from parents, and proceed to merge and append all sub-datasets:

def get_parents_info(family_id, father=True):
    # Select one parent of the given family and rename its columns with a father_/mother_ prefix.
    role, prefix = ('chefe', 'father') if father else ('dependente-esposa', 'mother')
    return (df[(df['parental_situation'] == role) & (df['id'] == family_id)]
            .drop(['name', 'age', 'parental_situation'], axis=1)
            .rename(columns={'can_read': prefix + '_canread', 'nationality': prefix + '_nationality'})
            .reset_index(drop=True))

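# Merge each family's rows with its parents' info, then stack the families.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
final = pd.concat([df.merge(get_parents_info(i, father=False)).merge(get_parents_info(i, father=True))
                   for i in (1, 2)], ignore_index=True)

# Build the children mask on `final` itself so that it aligns with final's index.
condition_children = final['parental_situation'].isin(['filha_dependente', 'filho_dependente'])

final[condition_children]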

Which gives me exactly what I want. But I think there must be a more efficient and elegant way to do this (my dataframe is rather big). Furthermore, it would be nice to have a more flexible approach, since in my real data the sequence of parental information is not so well behaved: the mother can be the family chief (if the man is dead, for example), or she can come first even when a man is the family chief.

Any help will be appreciated!

P.S.: sorry, chefe is equivalent to father and dependente-esposa is equivalent to mother. The remaining values are all children.

Tags: python, pandas, pandas-groupby

Solution


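There might be a more efficient way, but with a bit of fiddling to assign each family a unique group, followed by a pivot and an extra column level for the children, you can get what you want, even if the result is not quite as flat as you'd like...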

# rename to mother/father and we can assume NaN to be child
df['category'] = df['parental_situation'].map({'chefe': 'father', 'dependente-esposa': 'mother'})

# assign a family group
is_parent = df['category'].notna()
# fill_value=False keeps the shifted mask boolean (no NaN in the first row)
df.loc[is_parent, 'group'] = (is_parent ^ is_parent.shift(fill_value=False)).astype(int)
df['group'] = df['group'].fillna(0).cumsum()

# Pivot accordingly
parents = df[is_parent].pivot(index='group', columns='category')
# Add extra level to child columns
children = df[~is_parent].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])

# Do a join...
new = children.join(parents)

This gives you:

           name   age can_read nationality parental_situation category             name            age        can_read        nationality            parental_situation                   
          child child    child       child              child    child           father  mother father mother   father mother      father     mother             father             mother
group                                                                                                                                                                                     
1.0       Atena    15      yes   Brazilian   filha_dependente      NaN  Stephen Dedalus  Helena     30     32      yes     no       Irish      Greek              chefe  dependente-esposa
1.0     Odisseu    16      yes       Irish   filho_dependente      NaN  Stephen Dedalus  Helena     30     32      yes     no       Irish      Greek              chefe  dependente-esposa
2.0    Antígona    12      yes       Irish   filha_dependente      NaN       Filoctetes  Medeia     45     43       no    yes       Greek  Brazilian              chefe  dependente-esposa
2.0       Tebas    13       no       Greek   filha_dependente      NaN       Filoctetes  Medeia     45     43       no    yes       Greek  Brazilian              chefe  dependente-esposa
2.0     Aqulies     6      yes   Brazilian   filho_dependente      NaN       Filoctetes  Medeia     45     43       no    yes       Greek  Brazilian              chefe  dependente-esposa

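You may want to swap levels or drop some columns before joining, but this should give you a general approach that gets you close.

A further idea is to restrict the values in the pivot and explicitly select the columns of the children frame, for example: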
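The family-group step can also be looked at on its own. The sketch below is not part of the original answer: it uses a made-up `roles` Series (hypothetical data) in which one family lists the mother first and another has a single parent, and it assumes that every family's parent rows come before its children. Under that assumption, a family starts at any parent row whose previous row is not a parent, and a running count of those starts gives the family number.

```python
import pandas as pd

# Hypothetical roles column: the second family lists the mother first,
# and the third family has a single parent.
roles = pd.Series(
    ["father", "mother", "child", "child",
     "mother", "father", "child",
     "mother", "child", "child"]
)

is_parent = roles.isin(["father", "mother"])

# A family starts at a parent row whose previous row is not a parent.
# fill_value=False treats the (missing) row before the first one as a non-parent.
starts = is_parent & ~is_parent.shift(fill_value=False)

# The running count of starts labels every row with its family number.
family_id = starts.cumsum()
print(family_id.tolist())
```

Because only the transition from "not a parent" to "parent" matters, the order of father and mother inside a family does not affect the grouping. A family without any parent row would be merged into the previous one, so that case would need separate handling.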

# Pivot accordingly
parents = df[is_parent].pivot(index='group', columns='category', values=['can_read', 'nationality'])
# Add extra level to child columns
children = df.loc[~is_parent, ['group', 'name', 'age', 'can_read', 'nationality']].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])

Which then gives you:

           name   age can_read nationality can_read        nationality           
          child child    child       child   father mother      father     mother
group                                                                            
1.0       Atena    15      yes   Brazilian      yes     no       Irish      Greek
1.0     Odisseu    16      yes       Irish      yes     no       Irish      Greek
2.0    Antígona    12      yes       Irish       no    yes       Greek  Brazilian
2.0       Tebas    13       no       Greek       no    yes       Greek  Brazilian
2.0     Aqulies     6      yes   Brazilian       no    yes       Greek  Brazilian

...which is closer to your desired result.
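If a fully flat table like the desired output is needed, one possible variation is to flatten the pivoted parent columns into single names before joining. This is a minimal sketch rather than the answer's exact code: the `short` mapping and the names `role`, `group` and `flat` are introduced here for illustration, and the sketch assumes that each family has exactly one father and one mother whose rows come before the children.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Stephen Dedalus", "Helena", "Atena", "Odisseu", "Filoctetes",
             "Medeia", "Antígona", "Tebas", "Aqulies"],
    "age": [30, 32, 15, 16, 45, 43, 12, 13, 6],
    "can_read": ["yes", "no", "yes"] * 3,
    "nationality": ["Irish", "Greek", "Brazilian"] * 3,
    "parental_situation": ["chefe", "dependente-esposa", "filha_dependente", "filho_dependente",
                           "chefe", "dependente-esposa", "filha_dependente", "filha_dependente",
                           "filho_dependente"],
})

# Rows that do not map to father/mother are treated as children.
role = df["parental_situation"].map({"chefe": "father", "dependente-esposa": "mother"})
is_parent = role.notna()

# Family number: running count of parent rows that follow a non-parent row.
group = (is_parent & ~is_parent.shift(fill_value=False)).cumsum()

# One row per family, one column per (field, parent) pair.
parents = (
    df[is_parent]
    .assign(group=group, role=role)
    .pivot(index="group", columns="role", values=["can_read", "nationality"])
)

# Flatten the two-level columns into single names such as "father_canread".
short = {"can_read": "canread", "nationality": "nation"}  # illustrative naming
parents.columns = [f"{who}_{short[field]}" for field, who in parents.columns]

# Attach each family's parent columns to its children.
children = df.loc[~is_parent, ["name", "age", "can_read", "nationality"]].assign(group=group)
flat = children.join(parents, on="group").drop(columns="group").reset_index(drop=True)
print(flat)
```

The join on `group` copies each family's parent columns onto its children without looping over families, so the same code should work for any number of families that satisfy the stated assumption.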

