python - Merging DataFrames with existing values
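Question
I would like to merge two columns of unequal length in a pandas DataFrame.
I have tried many merge, concat, and join approaches, but none of them worked.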
import pandas as pd

keyList = ["Clone", "Chain", "Fragment", "R0", "R1", "R2"]
dataDict = {key: [] for key in keyList}
# Example for different list length
plist1 = ["ABCD", "DJFZ", "DHRZ"]
plist2 = ["ABCD", "DJFZ", "DHRZ", "JGJZ"]
filelist = ["E2_VH_Fab_R0.fasta", "E2_VH_scFV_R0.fasta", "E2_VH_Fab_R1.fasta", "E2_VH_scFV_R1.fasta", "E2_VH_Fab_R2.fasta"]
# Subsets are:
# E1 || E2 with VH || VL with Fab || scFV with R0 || R1 || R2
for datafile in filelist:
    # Get list with emits from class function (clseq is not shown in the question)
    peptidelist = clseq.processEmits()
    # Split filename into its 4 parts: clone, chain, fragment, round
    sclone, schain, sfragment, sround = datafile.split('.')[0].split('_')
    # Iterate through peptide list and add the subsets into the dict
    for peptide in peptidelist:
        dataDict.setdefault("Clone", []).append(sclone)
        dataDict.setdefault("Chain", []).append(schain)
        dataDict.setdefault("Fragment", []).append(sfragment)
        # Set other rounds to "NaN" so all columns keep the same length
        if "R0" in sround:
            dataDict.setdefault("R0", []).append(peptide)
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")
        elif "R1" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append(peptide)
            dataDict.setdefault("R2", []).append("NaN")
        elif "R2" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append(peptide)
        else:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")
# Attempted merge into dtframe (defined elsewhere in the asker's code)
dtframe.merge(pd.DataFrame(dataDict), on=['Clone', 'Chain', 'Fragment'], how='inner')
The problem is that my lists have different lengths. I would like to merge them into one DataFrame and fill the remaining cells with NaN.
This:
0 E2 VH Fab r0 nan
1 E2 VH Fab r0 nan
2 E2 VH Fab r0 nan
3 E2 VH Fab r0 nan
4 E2 VH Fab r0 nan
5 E2 VH Fab r0 nan
and this:
0 E2 VH Fab nan r1
1 E2 VH Fab nan r1
2 E2 VH Fab nan r1
3 E2 VH Fab nan r1
4 E2 VH Fab nan r1
5 E2 VH Fab nan r1
6 E2 VH Fab nan r1
7 E2 VH Fab nan r1
should result in:
0 E2 VH Fab r0 r1
1 E2 VH Fab r0 r1
2 E2 VH Fab r0 r1
3 E2 VH Fab r0 r1
4 E2 VH Fab r0 r1
5 E2 VH Fab r0 r1
6 E2 VH Fab nan r1
7 E2 VH Fab nan r1
Note that all my data fields are strings.
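Solution
This is a job for combine_first. Set the index to the three columns you want to merge on, plus an additional level created with cumcount. cumcount numbers the rows within each group, so rows are matched by position and the index stays unique for real data with many different groups.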
df1['idx'] = df1.groupby(['Clone', 'Chain', 'Fragment']).cumcount()
df2['idx'] = df2.groupby(['Clone', 'Chain', 'Fragment']).cumcount()
df1 = df1.set_index(['Clone', 'Chain', 'Fragment', 'idx'])
df2 = df2.set_index(['Clone', 'Chain', 'Fragment', 'idx'])
df1.combine_first(df2).reset_index()
# Clone Chain Fragment idx R0 R1
#0 E2 VH Fab 0 r0 r1
#1 E2 VH Fab 1 r0 r1
#2 E2 VH Fab 2 r0 r1
#3 E2 VH Fab 3 r0 r1
#4 E2 VH Fab 4 r0 r1
#5 E2 VH Fab 5 r0 r1
#6 E2 VH Fab 6 NaN r1
#7 E2 VH Fab 7 NaN r1
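One caveat, assuming the frames are built by the loop in the question: that loop appends the text "NaN", and combine_first only fills real missing values, so the placeholder strings would be kept as data. A minimal sketch of converting them first is below; the frame `raw` and its peptide values are hypothetical stand-ins for the loop's output.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the loop's output, with the literal text "NaN".
raw = pd.DataFrame({
    "Clone": ["E2", "E2"],
    "Chain": ["VH", "VH"],
    "Fragment": ["Fab", "Fab"],
    "R0": ["ABCD", "NaN"],
    "R1": ["NaN", "ABCD"],
})

# The string "NaN" is ordinary data, so pandas does not count it as missing.
missing_before = int(raw["R1"].isna().sum())

# Replace the placeholder text with a real missing value before combining.
clean = raw.replace("NaN", np.nan)
missing_after = int(clean["R1"].isna().sum())

print(missing_before, missing_after)
```

Appending np.nan (or None) in the loop instead of the string would avoid the conversion step altogether.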
Input frames used above:
df1
  Clone Chain Fragment  R0   R1
0    E2    VH      Fab  r0  NaN
1    E2    VH      Fab  r0  NaN
2    E2    VH      Fab  r0  NaN
3    E2    VH      Fab  r0  NaN
4    E2    VH      Fab  r0  NaN
5    E2    VH      Fab  r0  NaN
df2
  Clone Chain Fragment   R0  R1
0    E2    VH      Fab  NaN  r1
1    E2    VH      Fab  NaN  r1
2    E2    VH      Fab  NaN  r1
3    E2    VH      Fab  NaN  r1
4    E2    VH      Fab  NaN  r1
5    E2    VH      Fab  NaN  r1
6    E2    VH      Fab  NaN  r1
7    E2    VH      Fab  NaN  r1
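To try the approach end to end, the two input frames can be rebuilt and combined in one script. This is a sketch under assumptions: the helpers `build` and `with_position` are hypothetical names introduced here for illustration, and the missing cells are real np.nan values rather than strings.

```python
import numpy as np
import pandas as pd

keys = ["Clone", "Chain", "Fragment"]

def build(n_rows, r0, r1):
    """Hypothetical helper: a frame with n_rows identical rows."""
    return pd.DataFrame({
        "Clone": ["E2"] * n_rows,
        "Chain": ["VH"] * n_rows,
        "Fragment": ["Fab"] * n_rows,
        "R0": [r0] * n_rows,
        "R1": [r1] * n_rows,
    })

df1 = build(6, "r0", np.nan)  # six rows, only R0 filled
df2 = build(8, np.nan, "r1")  # eight rows, only R1 filled

def with_position(df):
    """Index by the key columns plus each row's position within its group."""
    return df.assign(idx=df.groupby(keys).cumcount()).set_index(keys + ["idx"])

# Missing cells in df1 are filled from df2; rows only in df2 are appended.
combined = with_position(df1).combine_first(with_position(df2)).reset_index()
print(combined)
```

Using assign rather than writing df1['idx'] in place leaves the original frames unmodified, which makes the snippet safe to rerun.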