Indexing a Pandas dataframe using a string column instead of the datetime index

Problem description

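I have a dataframe (df2) that consists of 30 years of weather data (hourly records in the sample below). This data is repeated for multiple runs (see run_file_year). Here is a sample of the dataframe: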

                                    Date  DHI  ...      WD    run_file_year
Date                                           ...                         
1991-01-01 00:00:00  01/01/1991 00:00:00  0.0  ...  281.70  1991_r1_r10i2p1
1991-01-01 01:00:00  01/01/1991 01:00:00  0.0  ...  281.01  1991_r1_r10i2p1
1991-01-01 02:00:00  01/01/1991 02:00:00  0.0  ...  274.43  1991_r1_r10i2p1
1991-01-01 03:00:00  01/01/1991 03:00:00  0.0  ...  280.94  1991_r1_r10i2p1
1991-01-01 04:00:00  01/01/1991 04:00:00  0.0  ...  272.53  1991_r1_r10i2p1
...                                  ...  ...  ...     ...              ...
2021-12-31 19:00:00  31/12/2021 19:00:00  0.0  ...  289.06   2021_r5_r9i2p1
2021-12-31 20:00:00  31/12/2021 20:00:00  0.0  ...  301.39   2021_r5_r9i2p1
2021-12-31 21:00:00  31/12/2021 21:00:00  0.0  ...  301.30   2021_r5_r9i2p1
2021-12-31 22:00:00  31/12/2021 22:00:00  0.0  ...  313.21   2021_r5_r9i2p1
2021-12-31 23:00:00  31/12/2021 23:00:00  0.0  ...  313.29   2021_r5_r9i2p1

My current code is below (the lines flagged with >>>>>> are the ones in question):

import numpy as np
import pandas as pd

df2 = pd.DataFrame(df2, columns=['dry_bulb_temp', 'dew_point_temp', 'WS', 'GIR',
                                 'max_temp', 'min_temp', 'max_dew_point',
                                 'min_dew_point', 'max_wind'])


for i in range(12):
    c, Q = selectYear(df2, i + 1, config)



def selectYear(d, m, config):
    """
    Use the Sandia method, to select the most typical year of data
    for the given month
    """
    d = d[d.index.month == m]  # >>>>>> line in question <<<<<<
    n_bins = config['cdf_bins']
    weights = dict(config['weights'])
    total = weights.pop('total')

    score = dict.fromkeys(d.index.year, 0)
    fs = dict.fromkeys(weights)
    cdfs = dict.fromkeys(weights)
    i = 0
    x2 = np.zeros((len(weights), 30))

    for w in weights:
        cdfs[w] = dict([])
        fs[w] = dict([])

        # Calculate the long term CDF for this weight
        cdfs[w]['Long-Term'], bin_edges = cdf(d, w, n_bins)

        # Bin midpoints: left edge plus half the bin width
        x = bin_edges[:-1] + np.diff(bin_edges) / 2
        x2[i, :] = x
        i += 1

        for yr in set(d.index.year):  # >>>>>> line in question <<<<<<
            dy = d[d.index.year == yr]
            #print(dy)

            # calculate the CDF for this weight for specific year
            cdfs[w][yr], b = cdf(dy, w, bin_edges)

            # Finkelstein-Schafer statistic (mean absolute difference between
            # the long-term CDF and this year's CDF)
            fs[w][yr] = np.mean(abs(cdfs[w]['Long-Term'] - cdfs[w][yr]))

            # Add weighted FS value to score for this year
            score[yr] += fs[w][yr] * weights[w] / total

    # select the top 5 years ordered by their weighted scores
    top5 = sorted(score, key=score.get)[:5]

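Currently my code filters the data by month and then compares the data for each year. In other words, every year's January is evaluated (its CDF is calculated) and then ranked.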

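The problem is that, because there are multiple runs, there are multiple instances of January 2001. My code currently merges their data instead of treating January 2001 run 1 and January 2001 run 2 as separate entities to compare. My question: is there a way to index using my column "run_file_year" (which is a string) and have the code iterate over all the run_file_year values (without listing them)?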

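Currently the dataframe d is filtered by month and then by year. I would like to know whether I can index by the run_file_year column instead of by year, without having to iterate through all the items in it?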

Tags: python, pandas, dataframe

Solution
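One possible approach, offered as a sketch rather than a definitive answer: keep the month filter as it is, but replace the loop over `d.index.year` with `DataFrame.groupby('run_file_year')`, which yields one sub-frame per distinct string value without the values having to be listed. This assumes `run_file_year` is still a column of the frame passed to `selectYear` (the column selection in the question's code would need to keep it). The `cdf` helper below is a hypothetical stand-in for the question's own `cdf` function, and the toy data and run names are made up for illustration.

```python
import numpy as np
import pandas as pd


def cdf(frame, column, bins):
    """Hypothetical stand-in for the question's cdf(): empirical CDF on histogram bins."""
    counts, edges = np.histogram(frame[column], bins=bins)
    return np.cumsum(counts) / counts.sum(), edges


# Toy data (made up): two runs that both cover January and February 2001.
days = pd.date_range("2001-01-01", "2001-02-28", freq="D")
runs = ["2001_r1_r10i2p1", "2001_r2_r10i2p1"]
rng = np.random.default_rng(0)
df2 = pd.concat([
    pd.DataFrame(
        {"dry_bulb_temp": rng.normal(5.0, 2.0, len(days)), "run_file_year": run},
        index=days,
    )
    for run in runs
])

m = 1
d = df2[df2.index.month == m]  # same month filter as in selectYear

# Long-term CDF over all runs for this month; reuse its bin edges per run.
long_term, bin_edges = cdf(d, "dry_bulb_temp", 10)

fs = {}
# groupby yields (key, sub-frame) pairs, one per distinct run_file_year value.
for run_id, dy in d.groupby("run_file_year"):
    run_cdf, _ = cdf(dy, "dry_bulb_temp", bin_edges)
    # Finkelstein-Schafer statistic for this run
    fs[run_id] = float(np.mean(np.abs(long_term - run_cdf)))

# Run identifiers ordered from most to least typical
ranked = sorted(fs, key=fs.get)
print(ranked)
```

Inside `selectYear`, the same idea would mean initialising the score dictionary with `dict.fromkeys(d['run_file_year'].unique(), 0)` and keying `cdfs`, `fs`, and `score` by the run string instead of by `yr`, so that January 2001 of run 1 and January 2001 of run 2 are scored separately.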

