Pandas dataframe: merge files by common columns

Problem description

I have a collection of files with some common columns that I want to join. In my real problem there are several unique columns and several common columns. In this toy example, I have a set of a files and a set of b files; each set has its own unique column, and both share a common column c.

$ for ii in $(ls *.dat) ; do echo " "; echo $ii ; cat $ii ; done

a1.dat
a,c
4,8
1,10
2,3

a2.dat
a,c
1,2
3,4

b1.dat
b,c
2,8
2,10
1,3

b2.dat
b,c
.2,2
.8,4

I want to sweep through these files and merge them into a single dataframe. Here's what I've tried so far: I concat the first files to make sure all of the column names are collected, then merge in the remaining files. When I merge with how='inner' instead, an empty dataframe is returned.

$ cat s.py 
import pandas as pd
dat = pd.DataFrame()
for ii in [1, 2]:
    for jj in ['a', 'b']:
        d = pd.read_csv('%s%i.dat' % (jj, ii))
        if ii == 1: dat = pd.concat([dat, d])
        else: dat = pd.merge(dat, d, how='outer')
print(dat)

$ python s.py
     a    b   c
0  4.0  NaN   8
1  1.0  NaN  10
2  2.0  NaN   3
3  NaN  2.0   8
4  NaN  2.0  10
5  NaN  1.0   3
6  1.0  NaN   2
7  3.0  NaN   4
8  NaN  0.2   2
9  NaN  0.8   4

This is not my desired output, and I don't see how to make this work better. The desired output is:

     a    b   c
0  4.0  2.0   8
1  1.0  2.0  10
2  2.0  1.0   3
3  1.0  0.2   2
4  3.0  0.8   4

Tags: python, pandas, dataframe

Solution

There are two steps:

First, concatenate all the files of the same type into a single DataFrame:

import pandas as pd

# Read every file of each type and stack them vertically into one DataFrame per type.
df = {}
for k in ['a', 'b']:
    df[k] = pd.concat([
            pd.read_csv('%s%d.dat' % (k, i)) for i in [1, 2]
            ], axis=0)
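
For the sample files above, df['a'] simply stacks a1.dat on top of a2.dat (the original row indexes are kept, since ignore_index is not passed), so printing it shows something like:

print(df['a'])
#    a   c
# 0  4   8
# 1  1  10
# 2  2   3
# 0  1   2
# 1  3   4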

Then merge the concatenated DataFrames on the shared column 'c':

result = df['a'].merge(df['b'], on='c')[['a', 'b', 'c']]
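
Putting the two steps together, a minimal self-contained version might look like this (it simply combines the two snippets above and assumes the four .dat files from the question are in the working directory):

import pandas as pd

# Step 1: stack both 'a' files into one DataFrame and both 'b' files into another.
df = {}
for k in ['a', 'b']:
    df[k] = pd.concat([pd.read_csv('%s%d.dat' % (k, i)) for i in [1, 2]],
                      axis=0)

# Step 2: join on the shared column 'c' and put the columns in the desired order.
result = df['a'].merge(df['b'], on='c')[['a', 'b', 'c']]
print(result)

merge defaults to how='inner', which works here because every value of c appears in both concatenated frames; for the sample data this yields the same five rows as the desired output (column a keeps an integer dtype because no NaN values are introduced).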
