python-3.x - Merging Pandas DataFrames with frequency counts
Problem description
I have a DataFrame (df1) containing student details, for example:
Student ID  Course Code  Mark
1           C001         88
1           C002         71
2           C003         67
3           C002         92
3           C001         66
3           C004         70
4           C004         65
and another DataFrame (df2) containing:
WR ID     K ID  Course Code
SP-RS-01  K001  C002, C004
SP-RS-01  K004  C002
SP-RS-02  K005
SP-RS-03  K004  C003, C004
SP-RS-03  K006  C001
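To make the example reproducible, the two tables above could be built roughly as follows. This is only a sketch: it assumes the blank Course Code for K005 is a missing value, and that Student ID and Mark are integers.

```python
import pandas as pd

# Sample data copied from the two tables in the question.
df1 = pd.DataFrame({
    'Student ID': [1, 1, 2, 3, 3, 3, 4],
    'Course Code': ['C001', 'C002', 'C003', 'C002', 'C001', 'C004', 'C004'],
    'Mark': [88, 71, 67, 92, 66, 70, 65],
})

df2 = pd.DataFrame({
    'WR ID': ['SP-RS-01', 'SP-RS-01', 'SP-RS-02', 'SP-RS-03', 'SP-RS-03'],
    'K ID': ['K001', 'K004', 'K005', 'K004', 'K006'],
    # Assumption: the empty Course Code cell for K005 is treated as missing.
    'Course Code': ['C002, C004', 'C002', None, 'C003, C004', 'C001'],
})
```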
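Now I need a DataFrame that contains, for each Student ID, the K IDs and WR IDs that correspond to the courses they took. If a value occurs more than once, its count could be noted (for example as a dictionary). So, something like this perhaps: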
Student ID  Courses           KID                   WR ID
1           C001, C002        K006, K001, K004      SP-RS-03
2           C003              K004                  SP-RS-01, SP-RS-03
3           C001, C002, C004  K001x2, K006, K004x2  SP-RS-01, SP-RS-03
4           C004              K004                  SP-RS-01, SP-RS-03
How can I do this?
Solution
You can use:
# first flatten the values that are separated by ", " (one row per course)
s = (df2.set_index(['WR ID', 'K ID'])['Course Code']
        .str.split(r',\s+', expand=True)
        .stack()
        .reset_index(level=2, drop=True)
        .rename('Course Code')
     )
# print(s)

# aggregate WR ID and K ID into lists per Course Code
df2 = (df2.drop('Course Code', axis=1)
          .join(s, on=['WR ID', 'K ID'])
          .groupby('Course Code')
          .agg(list)
          .reset_index()
       )
print(df2)
   Course Code                 WR ID          K ID
0         C001            [SP-RS-03]        [K006]
1         C002  [SP-RS-01, SP-RS-01]  [K001, K004]
2         C003            [SP-RS-03]        [K004]
3         C004  [SP-RS-01, SP-RS-03]  [K001, K004]
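As a possible alternative to the split/stack/join step above, the same per-course lists can probably be obtained with `Series.str.split` followed by `DataFrame.explode`. This sketch assumes pandas 0.25 or newer (where `explode` is available); the names `exploded` and `per_course` are illustrative and not part of the original answer.

```python
import pandas as pd

df2 = pd.DataFrame({
    'WR ID': ['SP-RS-01', 'SP-RS-01', 'SP-RS-02', 'SP-RS-03', 'SP-RS-03'],
    'K ID': ['K001', 'K004', 'K005', 'K004', 'K006'],
    # Assumption: the blank Course Code for K005 is a missing value.
    'Course Code': ['C002, C004', 'C002', None, 'C003, C004', 'C001'],
})

# Split each cell into a list of course codes, then emit one row per code.
exploded = (
    df2.assign(**{'Course Code': df2['Course Code'].str.split(r',\s*')})
       .explode('Course Code')
       .dropna(subset=['Course Code'])  # drop rows without any course
)

# Collect the WR IDs and K IDs of every course into lists.
per_course = (
    exploded.groupby('Course Code')[['WR ID', 'K ID']]
            .agg(list)
            .reset_index()
)
print(per_course)
```

With this sample data, `per_course` should hold the same four rows (C001 to C004) as the output shown above.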
from collections import Counter

# flatten the nested lists, count each value with Counter,
# and append "xN" to values that occur more than once
f = lambda x: ', '.join(f'{k}x{v}' if v > 1 else k
                        for k, v in Counter([z for y in x for z in y]).items())

# merge both frames and aggregate again, one row per student
df = (df1.merge(df2, on='Course Code', how='left')
         .groupby('Student ID')
         .agg({'Course Code': ', '.join,
               'WR ID': f,
               'K ID': f})
         .reset_index()
      )
print(df)
   Student ID       Course Code                   WR ID                  K ID
0           1        C001, C002    SP-RS-03, SP-RS-01x2      K006, K001, K004
1           2              C003                SP-RS-03                  K004
2           3  C002, C001, C004  SP-RS-01x3, SP-RS-03x2  K001x2, K004x2, K006
3           4              C004      SP-RS-01, SP-RS-03            K001, K004
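To see what the aggregation function `f` does in isolation, here is a small standalone sketch of the same logic written as a named function. The name `format_counts` is hypothetical; the input mimics the two WR ID lists that student 1 receives after the merge.

```python
from collections import Counter

def format_counts(list_of_lists):
    """Flatten nested lists and join the distinct values,
    appending 'xN' to any value that occurs N > 1 times."""
    flat = [item for sub in list_of_lists for item in sub]
    counts = Counter(flat)  # keeps first-seen order on Python 3.7+
    return ', '.join(f'{k}x{v}' if v > 1 else k for k, v in counts.items())

# Student 1 took C001 and C002, which contribute these WR ID lists.
result = format_counts([['SP-RS-03'], ['SP-RS-01', 'SP-RS-01']])
print(result)  # SP-RS-03, SP-RS-01x2
```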
Edit:
The problem is caused by missing values; the solution is to replace them with empty lists:
from collections import Counter

# flatten the nested lists, count each value with Counter,
# and append "xN" to values that occur more than once
f = lambda x: ', '.join(f'{k}x{v}' if v > 1 else k
                        for k, v in Counter([z for y in x for z in y]).items())

# merge first, then replace missing values with empty lists
# (NaN is the only value for which x == x is False)
df = df1.merge(df2, on='Course Code', how='left')
# note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
df[['WR ID', 'K ID']] = df[['WR ID', 'K ID']].applymap(lambda x: x if x == x else [])

# aggregate again, one row per student
df = (df.groupby('Student ID')
        .agg({'Course Code': ', '.join,
              'WR ID': f,
              'K ID': f})
        .reset_index()
      )
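The `x if x == x else []` trick relies on NaN being the only value that is not equal to itself. A minimal pure-Python sketch of the idea, using a made-up list of cells (`cells` and `cleaned` are illustrative names) in place of a DataFrame column:

```python
nan = float('nan')

# Hypothetical column after a left merge: lists where a course matched,
# NaN where the student's course has no entry in df2.
cells = [['K001'], nan, ['K004', 'K001']]

# NaN != NaN, so only the missing cell is replaced by an empty list.
cleaned = [x if x == x else [] for x in cells]
print(cleaned)  # [['K001'], [], ['K004', 'K001']]

# Every cell is now iterable, so flattening no longer fails on a float.
flat = [z for y in cleaned for z in y]
print(flat)  # ['K001', 'K004', 'K001']
```

Without the replacement, the flattening step would raise a `TypeError` because a float NaN cannot be iterated.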