Efficiently renaming DataFrame columns with complex logic

Problem description

I have two questions. The first, which I want to solve with the code below, is concatenating the sheets of an xlsx file:

import os
import pandas as pd

shared_BM_NL_Q2_DNS = r'Shared_BM_NL_Q2_DNS.xlsx'
sheet_names = ['client31_KPN', 'client32_T-Mobile', 'client33_Vodafone']
cols = ['A:AB', 'A:AB', 'A:AB']
df = {}
for ws, c in zip(sheet_names, cols):
    df[ws] = pd.read_excel(shared_BM_NL_Q2_DNS, sheet_name = ws, usecols = c)

The second: I would like to read all columns in the worksheets using the following line:

cols = ['A:AB', 'A:AB', 'A:AB']

Note: the worksheets have columns with the same names.

I would also like a better and shorter way to do the following:

# shared_BM_NL_Q2_DNS
shared_BM_NL_Q2_DNS_df1.columns = shared_BM_NL_Q2_DNS_df1.columns.str.replace(' ', '_')
shared_BM_NL_Q2_DNS_df1.columns = shared_BM_NL_Q2_DNS_df1.columns.str.replace('\n', '')
shared_BM_NL_Q2_DNS_df1.columns = shared_BM_NL_Q2_DNS_df1.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
shared_BM_NL_Q2_DNS_df1.columns = map(str.lower, shared_BM_NL_Q2_DNS_df1.columns)

shared_BM_NL_Q2_DNS_df2.columns = shared_BM_NL_Q2_DNS_df2.columns.str.replace(' ', '_')
shared_BM_NL_Q2_DNS_df2.columns = shared_BM_NL_Q2_DNS_df2.columns.str.replace('\n', '')
shared_BM_NL_Q2_DNS_df2.columns = shared_BM_NL_Q2_DNS_df2.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
shared_BM_NL_Q2_DNS_df2.columns = map(str.lower, shared_BM_NL_Q2_DNS_df2.columns)

shared_BM_NL_Q2_DNS_df3.columns = shared_BM_NL_Q2_DNS_df3.columns.str.replace(' ', '_')
shared_BM_NL_Q2_DNS_df3.columns = shared_BM_NL_Q2_DNS_df3.columns.str.replace('\n', '')
shared_BM_NL_Q2_DNS_df3.columns = shared_BM_NL_Q2_DNS_df3.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
shared_BM_NL_Q2_DNS_df3.columns = map(str.lower, shared_BM_NL_Q2_DNS_df3.columns)
dataframes2 = [shared_BM_NL_Q2_DNS_df1, shared_BM_NL_Q2_DNS_df2, shared_BM_NL_Q2_DNS_df3]
join2 = pd.concat(dataframes2).reset_index(drop=True)

The code above comes from my old code, before the update, which looked like this:

import os
import pandas as pd

shared_BM_NL_Q2_DNS = 'Shared_BM_NL_Q2_DNS.xlsx'

shared_BM_NL_Q2_DNS_df1 = pd.read_excel(os.path.join(os.path.dirname(__file__), shared_BM_NL_Q2_DNS), sheet_name='client31_KPN')
shared_BM_NL_Q2_DNS_df2 = pd.read_excel(os.path.join(os.path.dirname(__file__), shared_BM_NL_Q2_DNS), sheet_name='client32_T-Mobile')
shared_BM_NL_Q2_DNS_df3 = pd.read_excel(os.path.join(os.path.dirname(__file__), shared_BM_NL_Q2_DNS), sheet_name='client33_Vodafone')

#shared_BM_NL_Q2_DNS
shared_BM_NL_Q2_DNS_df1.columns = shared_BM_NL_Q2_DNS_df1.columns.str.replace(' ', '_')
shared_BM_NL_Q2_DNS_df1.columns = shared_BM_NL_Q2_DNS_df1.columns.str.replace('\n', '')
shared_BM_NL_Q2_DNS_df1.columns = shared_BM_NL_Q2_DNS_df1.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
shared_BM_NL_Q2_DNS_df1.columns = map(str.lower, shared_BM_NL_Q2_DNS_df1.columns)

shared_BM_NL_Q2_DNS_df2.columns = shared_BM_NL_Q2_DNS_df2.columns.str.replace(' ', '_')
shared_BM_NL_Q2_DNS_df2.columns = shared_BM_NL_Q2_DNS_df2.columns.str.replace('\n', '')
shared_BM_NL_Q2_DNS_df2.columns = shared_BM_NL_Q2_DNS_df2.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
shared_BM_NL_Q2_DNS_df2.columns = map(str.lower, shared_BM_NL_Q2_DNS_df2.columns)

shared_BM_NL_Q2_DNS_df3.columns = shared_BM_NL_Q2_DNS_df3.columns.str.replace(' ', '_')
shared_BM_NL_Q2_DNS_df3.columns = shared_BM_NL_Q2_DNS_df3.columns.str.replace('\n', '')
shared_BM_NL_Q2_DNS_df3.columns = shared_BM_NL_Q2_DNS_df3.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
shared_BM_NL_Q2_DNS_df3.columns = map(str.lower, shared_BM_NL_Q2_DNS_df3.columns)
dataframes2 = [shared_BM_NL_Q2_DNS_df1, shared_BM_NL_Q2_DNS_df2, shared_BM_NL_Q2_DNS_df3]
join2 = pd.concat(dataframes2).reset_index(drop=True)

#Edit:

I tried to write something close to what I want, as in the following code:

for ws, c in zip(sheet_names, cols):
    df[ws] = pd.read_excel(shared_BM_NL_Q2_DNS, sheet_name = ws, usecols = c)

    df[ws].columns = df[ws].columns.str.replace(' ', '_')
    df[ws].columns = df[ws].columns.str.replace('\n', '')
    df[ws].columns = df[ws].columns.str.replace(r"[^a-zA-Z\d\_]+", "")
    df[ws].columns = map(str.lower, df[ws].columns)

    join2 = pd.concat(ws).reset_index(drop=True)

But I get the following error:

Traceback (most recent call last):
  File "D:/Python Projects/MyAuditPy/pd_read.py", line 29, in <module>
    join2 = pd.concat(ws).reset_index(drop=True)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\reshape\concat.py", line 271, in concat
    op = _Concatenator(
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\reshape\concat.py", line 306, in __init__
    raise TypeError(
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "str"

Tags: python, pandas

Solution


First, I try to avoid assigning to the .columns attribute directly. The risk of making a mistake is too high.

Here is what I would do:


def renamer(c):
    # I'm assuming this does what you want. hard to tell without knowing
    # what your input and output looks like.
    return (
        c.strip().split(' ')[-1].lower()
    )

df = pd.concat([
    pd.read_excel(shared_BM_NL_Q2_DNS, sheet_name=ws, usecols=c)
      .rename(columns=renamer)
    for ws, c in zip(sheet_names, cols)
], ignore_index=True)  # ignore_index=True already renumbers the rows, so no reset_index is needed
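
If the goal is to keep the exact cleanup from the question (spaces to underscores, drop newlines, drop anything outside letters, digits and underscores, then lowercase), the renamer could look something like the sketch below. The name clean_column_name and the small in-memory frames are made up for illustration; with the real workbook, the frames would come from pd.read_excel instead. It also shows why the edit in the question failed: pd.concat(ws) passes a sheet name (a str), while pd.concat wants a list of DataFrames.

```python
import re

import pandas as pd


def clean_column_name(name):
    """Apply the question's four cleanup steps to a single column name."""
    name = str(name).replace(' ', '_').replace('\n', '')
    name = re.sub(r'[^a-zA-Z\d_]+', '', name)  # keep only letters, digits, underscores
    return name.lower()


# Hypothetical stand-ins for two of the worksheets (assumed column names).
# With the real file these would be read via
# pd.read_excel(shared_BM_NL_Q2_DNS, sheet_name=ws).
frames = {
    'client31_KPN': pd.DataFrame({'Site ID': [1], 'Cell\nName (A)': ['x']}),
    'client32_T-Mobile': pd.DataFrame({'Site ID': [2], 'Cell\nName (A)': ['y']}),
}

# rename(columns=callable) returns a new frame; nothing is assigned to .columns.
cleaned = {ws: frame.rename(columns=clean_column_name) for ws, frame in frames.items()}

# Concatenate once, after the loop, passing the DataFrames themselves
# rather than a sheet name string.
join2 = pd.concat(list(cleaned.values()), ignore_index=True)

print(list(join2.columns))
```

As for reading every column: usecols defaults to None in pd.read_excel, which parses all columns, so the cols list can probably be dropped if no column filtering is wanted. Passing a list to sheet_name returns a dict of DataFrames keyed by sheet name, which would fit the frames dict used here.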

   
