python - How to groupby when there is a pattern by group but not an ID?
问题描述
I have a dataset of individuals organized by families. The families always (for this example) follow an order:father, mother, daughter, and son (the children can repeat). I would like to groupby data by family and create a dataset with only children with some information from the parents. Usually, I would use groupby and transform to create this new dataset. However, I don't have an ID for family and the final dataset will not be the same shape of the original dataset since it will exclude parents.
Here is a replicable minimal example,the desired output and what I have been attempting so far:
df=pd.DataFrame({'name':['Stephen Dedalus','Helena','Atena','Odisseu','Filoctetes','Medeia','Antígona','Tebas','Aqulies'],
'age':[30,32,15,16,45,43,12,13,6],
'can_read':['yes','no','yes']*3, 'nationality':['Irish','Greek','Brazilian']*3,
'parental_situation':['chefe', 'dependente-esposa','filha_dependente','filho_dependente',
'chefe', 'dependente-esposa','filha_dependente','filha_dependente',
'filho_dependente']})
df
So, my data is like:
name age can_read nationality parental_situation
0 Stephen Dedalus 30 yes Irish chefe
1 Helena 32 no Greek dependente-esposa
2 Atena 15 yes Brazilian filha_dependente
3 Odisseu 16 yes Irish filho_dependente
4 Filoctetes 45 no Greek chefe
5 Medeia 43 yes Brazilian dependente-esposa
6 Antígona 12 yes Irish filha_dependente
7 Tebas 13 no Greek filha_dependente
8 Aqulies 6 yes Brazilian filho_dependente
Desired output:
name age can_read nationality father_canread mother_canread father_nation mother_nation
Atena 15 yes Brazilian yes no Irish Greek
Odisseu 16 yes Irish yes no Irish Greek
Antígona 12 yes Irish no yes Greek Brazilian
Tebas 13 no Greek no yes Greek Brazilian
Aquiles 6 yes Brazilian no yes Greek Brazilian
What I have been trying
My solution to create the ID variable for each family is the folowing (very inneficient):
df['id']=2 #lastfamily
for k in df.index:
if list(df['parental_situation'][:k]).count('chefe')==1:
df['id'][k-1]=1
elif list(df['parental_situation'][:k]).count('chefe')==2:
df['id'][k-1]=2
Then I create a function to get the information from parents, and proceed to merge and append all sub-datasets:
def get_parents_info(family_id, father=True):
if father:
return df[(df['parental_situation']=='chefe') & (df['id']==family_id)].drop(['name','age','parental_situation'], axis=1).rename(columns={'can_read':'father_canread',
'nationality':'father_nationality'}).reset_index(drop=True)
elif father==False:
return df[(df['parental_situation']=='dependente-esposa') & (df['id']==family_id)].drop(['name','age',
'parental_situation'], axis=1).rename(columns={'can_read':'mother_canread',
'nationality':'mother_nationality'}).reset_index(drop=True)
condition_children = (df['parental_situation']=='filha_dependente') | (df['parental_situation']=='filho_dependente')
final=df.merge(get_parents_info(1,father=False)).merge(get_parents_info(1,father=True)).append(df.merge(get_parents_info(2,father=False)).merge(get_parents_info(2,father=True)))
final[condition_children]
Which gives me exactly what I want. But I think there must be a more efficient and elegant way to do that (my dataframe is rather a big one). Furthermore,it will be nice if there is a more flexible approach since in my real data the sequence of parental information is not so well behaved. The mother can be the family chief (if the man is dead, for example) or she can come first, even if there is a man as the family chief.
Any help will be appreciated!
P.S: sorry, chefe is equivalent to father, dependente-esposa is equivalent to mother. The remaining are all children
解决方案
可能有一种更有效的方法,但是通过一些摆弄来为每个家庭分配一个独特的组,并为孩子们旋转和添加一个列,即使它可能不像你想要的那样平坦,你也可以获得你想要的东西...
# rename to mother/father and we can assume NaN to be child
df['category'] = df['parental_situation'].map({'chefe': 'father', 'dependente-esposa': 'mother'})
# assign a family group
is_parent = df['category'].notna()
df.loc[is_parent, 'group'] = (is_parent ^ is_parent.shift()).astype(int)
df['group'] = df['group'].fillna(0).cumsum()
# Pivot accordingly
parents = df[is_parent].pivot('group', 'category')
# Add extra level to child columns
children = df[~is_parent].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])
# Do a join...
new = children.join(parents)
这给了你:
name age can_read nationality parental_situation category name age can_read nationality parental_situation
child child child child child child father mother father mother father mother father mother father mother
group
1.0 Atena 15 yes Brazilian filha_dependente NaN Stephen Dedalus Helena 30 32 yes no Irish Greek chefe dependente-esposa
1.0 Odisseu 16 yes Irish filho_dependente NaN Stephen Dedalus Helena 30 32 yes no Irish Greek chefe dependente-esposa
2.0 Antígona 12 yes Irish filha_dependente NaN Filoctetes Medeia 45 43 no yes Greek Brazilian chefe dependente-esposa
2.0 Tebas 13 no Greek filha_dependente NaN Filoctetes Medeia 45 43 no yes Greek Brazilian chefe dependente-esposa
2.0 Aqulies 6 yes Brazilian filho_dependente NaN Filoctetes Medeia 45 43 no yes Greek Brazilian chefe dependente-esposa
您可能希望在加入之前进行交换级别或删除一些列,但这应该为您提供一种让您接近的通用方法。
进一步的想法是限制枢轴中的值并明确使用子框架中的列,例如:
# Pivot accordingly
parents = df[is_parent].pivot('group', 'category', ['can_read', 'nationality'])
# Add extra level to child columns
children = df.loc[~is_parent, ['group', 'name', 'age', 'can_read', 'nationality']].set_index('group')
children.columns = pd.MultiIndex.from_product([children.columns, ['child']])
然后会给你:
name age can_read nationality can_read nationality
child child child child father mother father mother
group
1.0 Atena 15 yes Brazilian yes no Irish Greek
1.0 Odisseu 16 yes Irish yes no Irish Greek
2.0 Antígona 12 yes Irish no yes Greek Brazilian
2.0 Tebas 13 no Greek no yes Greek Brazilian
2.0 Aqulies 6 yes Brazilian no yes Greek Brazilian
...这更接近您想要的结果。
推荐阅读
- javascript - Javascript 在具有基础 6 的 Django 模板中不起作用
- reactjs - 使用 React 上下文将对象传递给 context.provider 的值时如何避免重新渲染
- kubernetes - Minikube - 如何获取前端 Web 应用程序的后端 IP 地址/主机?
- ruby - ruby 有没有办法在不重复实例名称的情况下应用实例方法
- powershell - 每个循环的Powershell等待输出
- python - 脚本未显示在 GIMP 插件浏览器中
- ios - GMSCoordinateBounds 是什么意思?它是如何工作的?
- c# - 如何将三元运算符用于 WinForm 对象属性?
- python - 尝试运行烧瓶应用程序时的非法硬件指令
- node.js - TDengine nodejs 连接器安装失败