Python Pandas merge, melt, and pivot_table for a data pipeline

Problem description

I have a data pipeline written in Python and Pandas. It works, which I like. I'm comfortable in Python; my Pandas, at this point, is not so strong. As a student of Python, though, I'm always looking for more idiomatic approaches rather than relying on the loops and logic I might use in other languages. In particular, I'm always amazed at how Pandas methods can operate over whole Series and DataFrames instead of building loops around them.

Here is the source file. I also have demographic information that comes from an API call. That code has been truncated from my example, but assume data['record'] holds my demographic data; I can merge it in successfully with pd.merge().

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|StudentFirstName|StudentMiddleName|StudentLastName|UniqueIdentifier|Grade|GOM   |SchoolYear|Fall_September|Fall_SeptemberDateGiven|Fall_SeptemberNationalPercentileRank|Winter_January|Winter_JanuaryDateGiven|Winter_JanuaryNationalPercentileRank|Spring_May|Spring_MayDateGiven|Spring_MayNationalPercentileRank|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                |                 |               |100             |1    |LNF   |2017      |29            |9/11/2017              |14                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |100             |1    |LSF   |2017      |31            |9/11/2017              |51                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |100             |1    |M-COMP|2017      |8             |9/20/2017              |48                                  |15            |2/5/2018               |17                                  |42        |5/8/2018           |65                              |
|                |                 |               |100             |1    |MNM   |2017      |9             |9/11/2017              |36                                  |7             |2/1/2018               |3                                   |5         |5/8/2018           |1                               |
|                |                 |               |100             |1    |NIM   |2017      |35            |9/11/2017              |34                                  |62            |2/1/2018               |52                                  |51        |5/8/2018           |18                              |
|                |                 |               |100             |1    |NWF   |2017      |28            |9/11/2017              |37                                  |69            |2/1/2018               |71                                  |31        |5/8/2018           |5                               |
|                |                 |               |100             |1    |OCM   |2017      |62            |9/11/2017              |30                                  |89            |2/1/2018               |58                                  |94        |5/8/2018           |51                              |
|                |                 |               |100             |1    |PSF   |2017      |14            |9/11/2017              |10                                  |              |                       |                                    |49        |5/8/2018           |33                              |
|                |                 |               |100             |1    |QDM   |2017      |23            |9/11/2017              |57                                  |31            |2/1/2018               |46                                  |26        |5/8/2018           |15                              |
|                |                 |               |100             |1    |R-CBM |2017      |8             |9/11/2017              |36                                  |17            |2/1/2018               |22                                  |29        |5/8/2018           |15                              |
|                |                 |               |200             |1    |LNF   |2017      |47            |9/11/2017              |51                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |LSF   |2017      |47            |9/11/2017              |86                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |M-COMP|2017      |27            |9/22/2017              |92                                  |34            |2/2/2018               |67                                  |47        |5/8/2018           |88                              |
|                |                 |               |200             |1    |MNM   |2017      |11            |9/11/2017              |48                                  |23            |2/1/2018               |80                                  |21        |5/9/2018           |55                              |
|                |                 |               |200             |1    |NIM   |2017      |56            |9/11/2017              |81                                  |80            |2/1/2018               |95                                  |80        |5/9/2018           |92                              |
|                |                 |               |200             |1    |NWF   |2017      |63            |9/11/2017              |87                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |OCM   |2017      |107           |9/11/2017              |                                    |109           |2/1/2018               |                                    |109       |5/9/2018           |                                |
|                |                 |               |200             |1    |PSF   |2017      |50            |9/11/2017              |73                                  |              |                       |                                    |          |                   |                                |
|                |                 |               |200             |1    |QDM   |2017      |28            |9/11/2017              |75                                  |38            |2/1/2018               |78                                  |40        |5/9/2018           |84                              |
|                |                 |               |200             |1    |R-CBM |2017      |40            |9/11/2017              |80                                  |76            |2/1/2018               |80                                  |84        |5/9/2018           |65                              |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
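The demographic merge mentioned above can be sketched like this; the frames here are hypothetical miniatures, not the real data:

```python
import pandas as pd

# Hypothetical miniature frames standing in for the real score and demographic data.
scores = pd.DataFrame({
    'UniqueIdentifier': [100, 200, 300],
    'GOM': ['LNF', 'LSF', 'NWF'],
})
demographics = pd.DataFrame({
    'student_number': [100, 200],
    'last_name': ['Smith', 'Jones'],
    'first_name': ['Ann', 'Bob'],
})

# Left join keeps every score row; indicator=True adds a _merge column
# that flags rows with no matching demographic record.
merged = scores.merge(demographics, how='left',
                      left_on='UniqueIdentifier', right_on='student_number',
                      indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
```

The `_merge` column makes it easy to check (or drop) score rows that did not match a real student.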

Because of where this data is headed, I need to split it into two files: one for reading data and one for math data. Which file a row goes into is determined by the GOM column; roughly half of the GOM values are reading measures and the other half are math.
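One loop-free way to do that split is a boolean mask with isin(); the GOM lists here mirror the subject groupings used later in the script:

```python
import pandas as pd

# Hypothetical GOM-to-subject lists mirroring the groupings in the script.
reading_goms = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
math_goms = ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP']

df = pd.DataFrame({'GOM': ['LNF', 'M-COMP', 'NWF', 'QDM'],
                   'Fall_September': [29, 8, 28, 23]})

# isin() builds a boolean mask, so no row loop is needed for the split.
reading = df[df['GOM'].isin(reading_goms)]
math = df[df['GOM'].isin(math_goms)]
```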

Here is what the two output files look like. Reading:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date      |LNF 1|LNF 2|LNF 3    |LSF 1|LSF 2|LSF 3   |PSF 1|PSF 2|PSF 3   |NWF 1|NWF 2|NWF 3  |R-CBM 1|R-CBM 2|R-CBM 3|MAZE 1|MAZE 2|MAZE 3|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100           |         |          |1    |Fall  |9/11/2017 |29   |14   |Level 2  |31   |51   |Level 3 |14   |10   |Level 1 |28   |37   |Level 3|8      |36     |Level 3|      |      |      |
|100           |         |          |1    |Spring|5/8/2018  |     |     |         |     |     |        |49   |33   |Level 3 |31   |5    |Level 1|29     |15     |Level 2|      |      |      |
|100           |         |          |1    |Winter|2/1/2018  |     |     |         |     |     |        |     |     |        |69   |71   |Level 3|17     |22     |Level 2|      |      |      |
|200           |         |          |1    |Fall  |9/11/2017 |47   |51   |Level 3  |47   |86   |Level 4 |50   |73   |Level 3 |63   |87   |Level 4|40     |80     |Level 4|      |      |      |
|200           |         |          |1    |Spring|5/9/2018  |     |     |         |     |     |        |     |     |        |     |     |       |84     |65     |Level 3|      |      |      |
|200           |         |          |1    |Winter|2/1/2018  |     |     |         |     |     |        |     |     |        |     |     |       |76     |80     |Level 4|      |      |      |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And math:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date     |OCM 1|OCM 2|OCM 3  |NIM 1|NIM 2|NIM 3  |QDM 1|QDM 2|QDM 3  |MNM 1|MNM 2|MNM 3  |M-COMP 1|M-COMP 2|M-COMP 3|M-CAP 1|M-CAP 2|M-CAP 3|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100           |         |          |1    |Fall  |9/11/2017|62   |30   |Level 3|35   |34   |Level 3|23   |57   |Level 3|9    |36   |Level 3|8       |48      |Level 3 |       |       |       |
|100           |         |          |1    |Spring|5/8/2018 |94   |51   |Level 3|51   |18   |Level 2|26   |15   |Level 2|5    |1    |Level 1|42      |65      |Level 3 |       |       |       |
|100           |         |          |1    |Winter|2/1/2018 |89   |58   |Level 3|62   |52   |Level 3|31   |46   |Level 3|7    |3    |Level 1|15      |17      |Level 2 |       |       |       |
|200           |         |          |1    |Fall  |9/11/2017|107  |     |       |56   |81   |Level 4|28   |75   |Level 3|11   |48   |Level 3|27      |92      |Level 5 |       |       |       |
|200           |         |          |1    |Spring|5/8/2018 |109  |     |       |80   |92   |Level 5|40   |84   |Level 4|21   |55   |Level 3|47      |88      |Level 4 |       |       |       |
|200           |         |          |1    |Winter|2/1/2018 |109  |     |       |80   |95   |Level 5|38   |78   |Level 4|23   |80   |Level 4|34      |67      |Level 3 |       |       |       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The trickiest part is finding the right date for each subject. Kids take all of the GOMs at roughly the same time, but the incoming file gives a different date for each GOM, while the system this data feeds into expects one date per subject (each subject being made up of several GOMs). I resorted to writing a function that iterates through them, but this could probably be done more easily.
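If the per-GOM date columns are kept as real datetime64 values (rather than pre-formatted strings), the row-wise minimum can be taken in a single vectorized call; the column names below are hypothetical stand-ins:

```python
import pandas as pd

# Toy frame with one date column per GOM (hypothetical names), assuming the
# dates are kept as datetime64 instead of pre-formatted strings.
df = pd.DataFrame({
    'LNF Date': pd.to_datetime(['2017-09-11', None]),
    'R-CBM Date': pd.to_datetime(['2017-09-12', '2017-09-13']),
})

date_cols = ['LNF Date', 'R-CBM Date']
# min(axis=1) takes the row-wise minimum and skips NaT automatically,
# replacing a per-row custom function like mindate().
df['Date'] = df[date_cols].min(axis=1)
```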

Normally I run this script for one season at a time and comment out the others, but right now I'm importing last year's data and doing all three seasons at once.

A quick explanation of the output columns: for each GOM, 1 is the numeric score, 2 is the percentile, and 3 is the alphabetic description of the performance level. It's a terrible way to organize the file, but it's what the vendor requires.
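As an aside, the percentile-to-level mapping (column 3) can be expressed as a single pd.cut call instead of a row-wise apply; the bin edges below assume the 0/11/26/76/91 thresholds used in the script's percentile2level function:

```python
import pandas as pd

percentiles = pd.Series([95, 80, 50, 15, 3, None])

# Bin edges mirror the percentile2level thresholds in the script:
# 0-10 -> Level 1, 11-25 -> Level 2, 26-75 -> Level 3,
# 76-90 -> Level 4, 91-100 -> Level 5.  Missing percentiles stay NaN.
levels = pd.cut(percentiles, bins=[-1, 10, 25, 75, 90, 100],
                labels=['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'])
```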

I'd like to accomplish this in fewer steps, or at least in a more idiomatic way. I find myself undoing things that pivot_table produces just so I can work with the df again in the ways I know, which are still quite limited.
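One possible way to avoid part of the melt-then-pivot round trip is pd.wide_to_long, which unpivots groups of columns in a single step, provided the columns are first renamed into a stub_suffix pattern. This is only a sketch on a hypothetical miniature frame:

```python
import pandas as pd

# Hypothetical miniature frame with columns already renamed to a
# stub_Season pattern (Score_Fall, Rank_Fall, ...).
df = pd.DataFrame({
    'student_number': [100],
    'GOM': ['LNF'],
    'Score_Fall': [29], 'Rank_Fall': [14],
    'Score_Winter': [17], 'Rank_Winter': [22],
})

# wide_to_long unpivots each stub_suffix group into rows in one call,
# with the suffix ('Fall', 'Winter') landing in the j column.
long = pd.wide_to_long(df, stubnames=['Score', 'Rank'],
                       i=['student_number', 'GOM'], j='Season',
                       sep='_', suffix=r'\w+').reset_index()
```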

What worries me most is lines 65-77, where I melt and pivot_table; I'm not sure whether that could be done more efficiently. Also, lines 108-111 feel really clumsy from a Python perspective. I just didn't want to type out the GOMs again; I'd like to generate the list LNF 1, LNF 2, LNF 3, LSF 1, ... quickly without retyping them. There has to be a better way to do this.
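For generating those headers without retyping, a nested list comprehension does it in one line (goms here is a hypothetical subset; the real lists live in the script's subjects dict):

```python
import itertools

# goms here is a hypothetical subset standing in for the full subject lists.
goms = ['LNF', 'LSF']

# A nested comprehension generates the '<GOM> <n>' headers without retyping.
headings = [f'{gom} {i}' for gom in goms for i in (1, 2, 3)]

# itertools.product gives the same pairs if you prefer it spelled out.
headings2 = [f'{gom} {i}' for gom, i in itertools.product(goms, (1, 2, 3))]
```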

If you have any suggestions for improving this process, whether in plain Python or (especially) Pandas, please let me know.

import pandas as pd
import os
import json
import glob
import re
import datetime
import numpy as np
import itertools

# build a list of seasons to process.  usually only one.
seasons = ['Fall', 'Winter', 'Spring']

# Read demographic data into dataframe
# data['record'] comes from something that was truncated for simplicity
demographics = pd.DataFrame(data['record'], columns=['student_number', 'state_studentnumber', 'last_name', 'first_name', 'dob'])

# iterate through score files
for file in glob.glob('To Do/*.csv'):
    # read score data into dataframe
    aimsweb = pd.read_csv(file)

    aimsweb["Fall_SeptemberDateGiven"] = pd.to_datetime(aimsweb['Fall_SeptemberDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Winter_JanuaryDateGiven"] = pd.to_datetime(aimsweb['Winter_JanuaryDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Spring_MayDateGiven"] = pd.to_datetime(aimsweb['Spring_MayDateGiven']).dt.strftime('%m/%d/%Y')

    # perform left outer join of score data to demographics.  this ensures these are real students.
    merged = aimsweb.merge(demographics, how='left', left_on=['UniqueIdentifier', 'StudentLastName', 'StudentFirstName'], right_on=['student_number', 'last_name', 'first_name'], indicator=True)

    # rename table
    renames = {
        'Fall_September': 'Fall_1',
        'Fall_SeptemberDateGiven': 'Fall_Date',
        'Fall_SeptemberNationalPercentileRank': 'Fall_2',
        'Winter_January': 'Winter_1',
        'Winter_JanuaryDateGiven': 'Winter_Date',
        'Winter_JanuaryNationalPercentileRank': 'Winter_2',
        'Spring_May': 'Spring_1',
        'Spring_MayDateGiven': 'Spring_Date',
        'Spring_MayNationalPercentileRank': 'Spring_2',
    }

    merged.rename(index=str, columns=renames, inplace=True)

    # function for converting percentile into level 1-5
    def percentile2level(percentile):
        if percentile >= 91:
            return 'Level 5'
        if percentile >= 76:
            return 'Level 4'
        if percentile >= 26:
            return 'Level 3'
        if percentile >= 11:
            return 'Level 2'
        if percentile >= 0:
            return 'Level 1'

    # convert percentile to level
    for season in ['Fall', 'Winter', 'Spring']:
        merged[season + '_3'] = merged[season + '_2'].apply(percentile2level)

    # melt data from columns into rows
    merged = merged.melt(id_vars = ['student_number', 'GOM', 'last_name', 'first_name', 'Grade'],value_vars=['Fall_Date', 'Fall_1', 'Fall_2', 'Fall_3', 'Winter_Date', 'Winter_1', 'Winter_2', 'Winter_3', 'Spring_Date', 'Spring_1', 'Spring_2', 'Spring_3'])

    # split the variable into season and attribute
    merged[['Season', 'Attribute']] = merged['variable'].str.split('_', n=1, expand=True)

    # pivot data back to columns
    merged = pd.pivot_table(merged, index=['student_number', 'Season', 'last_name', 'first_name', 'Grade'], values=['value'], columns=['GOM', 'Attribute'], aggfunc=np.max, fill_value='')

    # condense the levels and reset index to get back to flat df
    merged.columns = merged.columns.droplevel(0)
    merged.columns = [' '.join(col).strip() for col in merged.columns.values]       
    merged.reset_index(inplace=True)

    # build subject-GOM mappings
    subjects = {
        'Reading': ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE'],
        'Mathematics': ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP'],
    }

    # function for finding the min non-null date from a list
    def mindate(row, headings):     
        dates = list(row[headings])
        dates = list(filter(None, dates))
        if len(dates) > 0:
            return min(dates)       
        return ""

    # iterate through two subjects
    for subject in subjects:
        # build list of date headers for all GOMs
        headings = [gom + " Date" for gom in subjects[subject]]

        # work on a copy so each subject starts from the same frame
        df = merged.copy()

        # create date column with value of minimum from list of dates
        df['Date'] = df.apply(mindate, axis=1, args=(headings,))        

        # filter out records with no real date
        df = df[ df['Date'] != 'NaT' ]

        # build a list of GOM headers with 1, 2, 3
        headings = [gom + " " + str(i) for gom, i in itertools.product(subjects[subject], [1, 2, 3])]

        # save csv, keeping only the requested seasons and output columns
        out = df[df['Season'].isin(seasons)]
        out_name = os.path.splitext(os.path.basename(file))[0] + "_" + "".join(seasons) + "_" + subject + ".csv"
        out[['student_number', 'last_name', 'first_name', 'Grade', 'Season', 'Date'] + headings].to_csv("./" + out_name, index=False)

Tags: python, pandas, dataframe, pivot-table, melt
