Concatenating 55 data frames with the same number of columns but different numbers of rows, and filling missing values with zero

Problem description

Suppose I have 55 files, each with 2 columns but a different number of rows. I concatenate them with the following code.

import os
import pandas as pd

path = r'/data/user/files'
files = os.listdir(path)
file_score = [os.path.join(path, i) for i in files if i.endswith('tped')]
score = [pd.read_csv(x, sep='\t', header=0) for x in file_score]
score = pd.concat(score, axis=1)

The resulting score data frame now looks like this:

  gene  file1   gene    file2   gene    file3   gene    file4   gene    file5   
0   A1BG    5.014479    A1BG    6.268099    A1BG    5.014479    A1BG    5.014479    A1BG    5.014479    ... A1BG    6.268099    A1BG    5.014479    A1BG    5.014479    A1BG    5.014479    A1BG    5.014479
1   A1BG-AS1    7.082578    A1BG-AS1    7.082578    A1BG-AS1    7.082578    A1BG-AS1    7.082578    A1BG-AS1    7.082578    ... A1BG-AS1    7.082578    A1BG-AS1    7.082578    A1BG-AS1    7.082578    A1BG-AS1    7.082578    A1BG-AS1    7.082578
2   A1CF    NaN A2M -2.851459   A2M -2.851459   A2M -2.851459   A2M -2.851459   ... A2M -2.604416   A1CF    NaN A2M -2.851459   A2M -2.851459   A2M -2.851459
3   A2M -11.405835  A2ML1   -0.007012   A2ML1   -0.010518   A2ML1   -0.010518   A2ML1   -0.007012   ... A2ML1   -0.007012   A2M -2.851459   A2ML1   -0.010518   A2ML1   5.705464    A2ML1   -0.007012
4   A2ML1   0.569222    AAAS    NaN AAAS    -3.693289   A4GALT  NaN AAAS    NaN ... A3GALT2 1.174647    A2ML1   -0.007012   A3GALT2 -0.141380   A4GALT  NaN A4GALT  NaN

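What I need is gene as the index and the file* columns as the columns of my final data frame. The set of genes differs from file to file. Even so, I need gene as the index, with the missing values in each file column filled with zero. I am not sure how to achieve this. A simple set_index did not work for me.

Any suggestions are appreciated. Thanks.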

Tags: python, pandas, dataframe, pivot-table

Solution


import os
import pandas as pd

path = r'/data/user/files'
files = os.listdir(path)
file_score = [os.path.join(path, i) for i in files if i.endswith('tped')]

# Read each file with its gene column as the index
# (assumes gene names are unique within each file)
score = [pd.read_csv(x, sep='\t', header=0, index_col='gene') for x in file_score]
# Rows are now aligned by gene rather than by row position, so no
# duplicated gene columns are created and none need to be dropped
score = pd.concat(score, axis=1)
# Fill all N/As with 0
score = score.fillna(0)
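
To see why aligning on gene matters, here is a minimal sketch on toy data. pd.concat with axis=1 matches rows by index label, so frames that still carry the default 0, 1, 2, ... index are paired by row position rather than by gene. The two in-memory tables below are hypothetical stand-ins for the real files, and the values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the tab-separated files. The gene
# sets only partly overlap, and file1 has an empty value for A1CF.
file1 = "gene\tfile1\nA1BG\t5.0\nA1CF\t\nA2M\t-11.4\n"
file2 = "gene\tfile2\nA1BG\t6.2\nA2M\t-2.8\nAAAS\t1.5\n"

# Read each table with gene as the index so rows are matched by gene name.
frames = [
    pd.read_csv(io.StringIO(text), sep="\t", header=0, index_col="gene")
    for text in (file1, file2)
]

# axis=1 puts the file columns side by side; the default outer join keeps
# the union of all genes and leaves NaN where a file lacks a gene.
# fillna(0) then zeroes those gaps (and any NaN already in a file).
score = pd.concat(frames, axis=1).fillna(0).sort_index()
print(score)
```

Each gene appears once in the index, and a gene missing from a file gets 0 in that file's column. If a gene could appear more than once within a single file, the duplicates would need to be resolved (for example by aggregating them) before concatenating.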
