python - Pandas dataframe: merge files by common columns
Problem description
I have a collection of files that have some common columns that I want to join. In my real problem, there are several dissimilar and common columns. In this toy example, I have a set of a files and a set of b files; each set has its own unique column, and all of them share an identical c column.
$ for ii in $(ls *.dat) ; do echo " "; echo $ii ; cat $ii ; done
a1.dat
a,c
4,8
1,10
2,3
a2.dat
a,c
1,2
3,4
b1.dat
b,c
2,8
2,10
1,3
b2.dat
b,c
.2,2
.8,4
I want to sweep through these files and merge them into a single dataframe. Here's what I've tried so far. I concat the first files to make sure I have all of the column names collected then merge the remaining files. When I merge by "inner", an empty dataframe is returned.
$ cat s.py
import pandas as pd
dat = pd.DataFrame()
for ii in [1, 2]:
    for jj in ['a', 'b']:
        d = pd.read_csv('%s%i.dat' % (jj, ii))
        if ii == 1: dat = pd.concat([dat, d])
        else: dat = pd.merge(dat, d, how='outer')
print(dat)
$ python s.py
     a    b   c
0  4.0  NaN   8
1  1.0  NaN  10
2  2.0  NaN   3
3  NaN  2.0   8
4  NaN  2.0  10
5  NaN  1.0   3
6  1.0  NaN   2
7  3.0  NaN   4
8  NaN  0.2   2
9  NaN  0.8   4
This is not my desired output, and I don't understand how to make this work. The desired output is
     a    b   c
0  4.0  2.0   8
1  1.0  2.0  10
2  2.0  1.0   3
3  1.0  0.2   2
4  3.0  0.8   4
Solution
There are two steps:
First, concatenate all files of the same type into a single DataFrame:
import pandas as pd

# One concatenated DataFrame per file type, keyed by 'a' and 'b'
df = {}
for k in ['a', 'b']:
    df[k] = pd.concat([
        pd.read_csv('%s%d.dat' % (k, i)) for i in [1, 2]
    ], axis=0)
Then merge the concatenated frames on the shared column 'c':
result = df['a'].merge(df['b'], on='c')[['a', 'b', 'c']]
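To try the two steps without the .dat files on disk, here is a minimal self-contained sketch. It assumes the sample data from the question and uses io.StringIO in place of real files. The names files, load and frames are hypothetical helpers introduced only for this illustration.

```python
import io

import pandas as pd

# Sample data copied from the question; each string stands in for one .dat file.
files = {
    'a1.dat': "a,c\n4,8\n1,10\n2,3\n",
    'a2.dat': "a,c\n1,2\n3,4\n",
    'b1.dat': "b,c\n2,8\n2,10\n1,3\n",
    'b2.dat': "b,c\n.2,2\n.8,4\n",
}


def load(name):
    """Hypothetical helper: parse one in-memory 'file' as CSV."""
    return pd.read_csv(io.StringIO(files[name]))


# Step 1: stack all files of the same type (a*, b*) into one frame each.
frames = {}
for k in ['a', 'b']:
    frames[k] = pd.concat(
        [load('%s%d.dat' % (k, i)) for i in [1, 2]],
        ignore_index=True,
    )

# Step 2: join the two stacked frames on the shared column only.
result = frames['a'].merge(frames['b'], on='c')[['a', 'b', 'c']]
print(result)
```

Passing on='c' is the key difference from the attempt in the question. When no on argument is given, pd.merge joins on every column the two frames have in common. After the first concat, dat already holds a, b and c, so later merges key on (a, c) or (b, c) rather than on c alone. No rows match on those pairs, which is why the inner merge came back empty and the outer merge only appended rows. Note that in this sketch column a stays integer-valued (4 rather than 4.0), because the inner merge introduces no missing values.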