Is there some way to find the intersection of header names across multiple CSV files?

Problem description

I am trying to loop through all CSV files in a folder and find the header names that appear in every file. I think the code would start like this, though it certainly needs work.

import glob
import pandas as pd

csvs = glob.glob('C:\\my_path\\' + '*.csv')

master_set = set()

for file in csvs:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    master_set = master_set.intersection(cols)

print(master_set)

This just loops through the files in a folder. What I want is to compare the headers of all CSV files in one folder, find the headers common to all of them (the intersection), and print that result. Does that make sense? I will need to do a UNION of all these files at some point, so I am trying to determine the best way to get the set of common headers, the lowest common denominator of the whole data series.

So, if I have 4 files with this schema:

colA colB colC colD colE

And, I have one file with this schema:

colA colC colE colX colX

Then this is what I want to see:

colA colC colE

Tags: python, python-3.x

Solution


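Yes, you can do this, but you need to loop over the list of files and store the result as you go. For the two-file case in the example, here is the code.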

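import pandas as pd
# Read the two example files
df1 = pd.read_csv("File1.csv")
df2 = pd.read_csv("File2.csv")
# Collect each file's column names as a set
setA = set(df1.columns)
setB = set(df2.columns)
# Headers present in both files
common = setA.intersection(setB)
print(common)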
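
The two-file case extends to a whole folder by collecting one set of headers per file and intersecting them all at once. The following is a minimal sketch, not the original answer's code: the helper names `read_header` and `common_headers` and the folder name `my_path` are hypothetical, and it reads only the first row of each file with the standard `csv` module instead of loading the full file with pandas (`pd.read_csv(path, nrows=0).columns` should serve the same purpose if you prefer to stay in pandas).

```python
import csv
import glob
import os


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as handle:
        # next(..., []) yields an empty header for an empty file
        return next(csv.reader(handle), [])


def common_headers(paths):
    """Return the set of column names present in every file in `paths`."""
    header_sets = [set(read_header(p)) for p in paths]
    if not header_sets:
        return set()
    # Intersect all sets together. Starting from an empty set instead
    # would always produce an empty result.
    return set.intersection(*header_sets)


if __name__ == "__main__":
    folder = "my_path"  # hypothetical folder name, replace with your own
    csv_paths = glob.glob(os.path.join(folder, "*.csv"))
    print(sorted(common_headers(csv_paths)))
```

Note that the loop in the question starts from `master_set = set()`, and intersecting anything with an empty set stays empty, which is why the sketch intersects the per-file sets directly. Once the common headers are known, they can be passed as `usecols` to `pd.read_csv` before concatenating the files.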
