How to use pandas to map a filename to the columns extracted from multiple Excel files

Problem description

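I am trying to extract all the column names from multiple Excel files and then map the filename to each extracted column, but I am stuck on "TypeError: Index does not support mutable operations".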

Here are my two files:

Fund_Data.xlsx:

   FUND ID  FUND NAME           AMOUNT   client code  Price   description  Trade Date  Trade Datetime
0  10101    Holdings company A  10000.5  1234         124.3   abcd         2020-08-19  2020-08-19 12:30:00
1  20202    Holdings company B  -2000.5  192          -24.2   abcd         2020-08-20  2020-08-20 12:30:00
2  30303    Holdings company C  3000.5   123          192     NaN          2020-08-21  2020-08-21 12:30:00
3  10101    Holdings company A  10000    1234567      5.5     NaN          2020-08-22  2020-08-22 12:30:00
4  20202    Holdings company B  10000.5  9999         3.887   abcd         2020-08-23  2020-08-23 12:30:00

Stocks.xlsx:

ID   STOCK                     VALUE
1   3i                         100
2   Admiral Group              200
3   Anglo American             300
4   Antofagasta                100
5   Ashtead                    200
6   Associated British Foods   300
7   AstraZeneca                400
8   Auto Trader Group          500
9   Avast                      600

Here is my code so far:

import pandas as pd
import os

f = []
directory = 'C:/Users/rrai020/Documents/Python Scripts/DD'
for (dirpath, dirnames, filenames) in os.walk(directory):
    for x in filenames:
        if x.endswith('xlsx'):
            f.append(x)
#f = ['Fund_Data.xlsx', 'Stocks.xlsx'] created a list from filenames in directory ^^^

data = pd.DataFrame() # initialize empty df
for filename in f:
    df = pd.read_excel(filename, dtype = object, ignore_index=True).columns # read in each excel to df
    df['filename'] = filename # add a column with the filename
    data = data.append(df) # add all small df's to big df 

print(data)

I am trying to get the following output (or something similar):

Field Name       Filename
FUND ID          Fund_Data.xlsx
FUND NAME        Fund_Data.xlsx
AMOUNT           Fund_Data.xlsx
client code      Fund_Data.xlsx
Price            Fund_Data.xlsx
description      Fund_Data.xlsx
Trade Date       Fund_Data.xlsx
Trade Datetime   Fund_Data.xlsx
Trade time       Fund_Data.xlsx
ID               Stocks.xlsx
STOCK            Stocks.xlsx
VALUE            Stocks.xlsx

I would like the code to be flexible, so that it works for more than the 2 files I have here. Apologies if this is trivial, I am still learning!

Tags: python, pandas, dataframe

Solution


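The problem is the object you are appending. Taking .columns from the result of pd.read_excel gives you an Index, not a DataFrame, and an Index is immutable, so the assignment df['filename'] = filename raises the TypeError. Inside the loop, we need to build a DataFrame with Field Name and Filename columns for each file, and then add it to data.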
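To see where the error comes from, here is a minimal sketch that reproduces it without any Excel files. The small in-memory DataFrame is a hypothetical stand-in for the contents of Stocks.xlsx; only the assignment pattern is taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the contents of one Excel file
df = pd.DataFrame({"ID": [1, 2], "STOCK": ["3i", "Admiral Group"], "VALUE": [100, 200]})

cols = df.columns  # this is a pandas Index, not a DataFrame

try:
    # Same pattern as df['filename'] = filename in the question
    cols["filename"] = "Stocks.xlsx"
except TypeError as exc:
    message = str(exc)
    print(message)
```

Assigning a new column works on a DataFrame, but not on the Index that holds its column labels.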

Here is one option:

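frames = []
for filename in f:
    # read in each excel file (assumes the files are in the current working directory)
    df = pd.read_excel(filename, dtype=object)

    # create a dataframe with (Field Name, Filename) columns for the current file
    x = pd.DataFrame({'Field Name': df.columns, 'Filename': filename})

    # collect it; DataFrame.append was removed in pandas 2.0, so we concatenate below
    frames.append(x)

# combine the per-file dataframes and renumber the index from 0
data = pd.concat(frames, ignore_index=True)

data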

Output:

          Field Name        Filename
0            FUND ID  Fund_Data.xlsx
1          FUND NAME  Fund_Data.xlsx
2             AMOUNT  Fund_Data.xlsx
3        client code  Fund_Data.xlsx
4  Price description  Fund_Data.xlsx
5         Trade Date  Fund_Data.xlsx
6     Trade Datetime  Fund_Data.xlsx
7                 ID     Stocks.xlsx
8              STOCK     Stocks.xlsx
9              VALUE     Stocks.xlsx
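Since the question asks for something that works for any number of files, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: field_name_table and its read_header parameter are hypothetical names, the paths passed in are full paths (the loop in the question stores bare filenames, which only open if the files sit in the current working directory), and nrows=0 is used so that only the header row is parsed.

```python
from pathlib import Path

import pandas as pd


def field_name_table(paths, read_header=None):
    """Return one row per (column name, file name) pair for the given files."""
    if read_header is None:
        # nrows=0 parses the header row only, which is all we need here
        def read_header(path):
            return pd.read_excel(path, nrows=0)

    frames = [
        pd.DataFrame({"Field Name": list(read_header(p).columns),
                      "Filename": Path(p).name})
        for p in paths
    ]
    if not frames:
        # pd.concat raises on an empty list, so return an empty table instead
        return pd.DataFrame(columns=["Field Name", "Filename"])
    return pd.concat(frames, ignore_index=True)
```

With the directory variable from the question, a call could look like field_name_table(sorted(Path(directory).rglob("*.xlsx"))), which also picks up files in subfolders. The read_header parameter is there so a different reader (for example one for CSV files) can be swapped in.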
