Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler

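Problem description

I am working on a problem where I pull the data for all ProductIDs and then loop through the dataframe over the unique ProductIDs to run a set of functions on each one.

Here, item is the ProductID/Item number: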

# loop through the big dataframe to get the sub-dataframe for each unique ID
for item in df2['Item Nbr'].unique():
    # fetch item data
    df = df2.loc[df2['Item Nbr'] == item]
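The same per-item subsets can also be obtained with groupby, which yields each key together with its sub-dataframe. This is a minimal sketch on a tiny made-up frame (the values are invented for illustration and are not the original data). Note that groupby skips rows whose key is NaN by default.

```python
import pandas as pd

# Made-up stand-in for the big dataframe (values are illustrative only).
df2 = pd.DataFrame({
    'Item Nbr': [1084, 1084, 19421],
    'Sales Qty': [None, None, 2.0],
})

subsets = {}
# groupby yields (key, sub-dataframe) pairs, one per unique 'Item Nbr'.
for item, df in df2.groupby('Item Nbr'):
    subsets[item] = df
    print(item, df.shape)
```

Each df here holds the same rows as df2.loc[df2['Item Nbr'] == item] would for that item.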

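Then I have a set of custom Python functions. The first pass through the loop (for one product ID) works fine, but when the loop moves on to the next product ID, I get the error below, even though I am sure the data being extracted is correct:

Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler.

However, the shapes of X_train and y_train are: (268215, 6) (268215,)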
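One way to see why the scaler reports a different shape from the one printed for X_train is to fit the pipeline's transformers one at a time and record the shape each one outputs. The sketch below is only an illustration: trace_shapes is a hypothetical helper, and the small random array and the three-step pipeline stand in for the real data and model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def trace_shapes(pipeline, X, y=None):
    """Fit each transformer in turn and record the shape it outputs.

    Hypothetical debugging helper; the final estimator is skipped.
    """
    shapes = [('input', X.shape)]
    for name, step in pipeline.steps[:-1]:
        X = step.fit_transform(X, y)
        shapes.append((name, X.shape))
    return shapes


# Illustrative data only: 20 rows, 6 feature columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 6))
y_demo = rng.normal(size=20)

demo_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('model', LinearRegression()),
])

shapes = trace_shapes(demo_pipeline, X_demo, y_demo)
for name, shape in shapes:
    print(name, shape)
```

Running the same helper on the real pipeline would show which step, if any, hands an array with zero columns to the next one.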

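Code snippet (additional information):

This is a huge file, but the initial big dataframe has

[362988 rows x 7 columns] for the first product and [268215 rows x 7 columns] for the second product

Extended code:

The big dataframe with two unique product IDs: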

biqQueryData = get_item_data(verbose=True)

Loop over each unique product ID to extract the subset of the dataframe that belongs to it:

for item in biqQueryData['Item Nbr'].unique():
    df = biqQueryData.loc[biqQueryData['Item Nbr'] == item]

    try:
        df_model = model_all_stores(df, item, n_jobs=n_jobs,
                                    train_model=train_model,
                                    test_model=test_model,
                                    tune_model=tune_model,
                                    export_model=export_model,
                                    output=export_demand)

The function model_all_stores:

def model_all_stores(df_raw, item_nbr, n_jobs=1, train_model=False, 
                     test_model=False,  export_model=False, output=False,
                     tune_model=False):
    """Models demand for specified item.

    Predict the demand of specified item for all stores. Does not 
    filter for predict hidden demand (the function get_hidden_demand 
    should be used for this.)

    Output: data frame output
    """

    # ML model hyperparameters 
    impute_with = 'median'
    n_estimators = 100
    min_samples_split = 3 
    min_samples_leaf = 3
    max_depth = None

    # load data and subset traited and valid
    dfnew = subset_traited_valid(df_raw)

    # get known demand
    df_ma = get_demand(dfnew)
    # impute missing sales data
    median_sales = df_ma['Sales Qty'].median()
    df_ma['Sales Qty'] = df_ma['Sales Qty'].fillna(median_sales)

    # add moving average features
    df_ma = df_ma.sort_values('Gregorian Days')
    window_list = [7 * x for x in [1, 2, 4, 8, 16, 52]]
    for w in window_list:
        grouped = df_ma.groupby('Store Nbr')['Sales Qty'].shift(1)
        rolling = grouped.rolling(window=w, min_periods=1).mean()
        df_ma['MA' + str(w)] = rolling.reset_index(0, drop=True)

    X_full = df_ma.loc[:, 'MA7':].values
    #print(X_full.shape)
    # use full data if not testing/tuning
    rows_for_model = df_ma['Known Demand'].notnull()
    X = df_ma.loc[rows_for_model, 'MA7':].values
    y = df_ma.loc[rows_for_model, 'Known Demand'].values
    X_train, y_train = X, y 
    print(X_train.shape, y_train.shape)

    if train_model:
        # instantiate model components
        imputer = Imputer(missing_values='NaN', strategy=impute_with, axis=0)
        scale = StandardScaler()
        pca = PCA()
        forest = RandomForestRegressor(n_estimators=n_estimators,
                                       max_features='sqrt',
                                       min_samples_split=min_samples_split,
                                       min_samples_leaf=min_samples_leaf,
                                       max_depth=max_depth,
                                       criterion='mse',
                                       random_state=42,
                                       warm_start=True,
                                       n_jobs=n_jobs)
        # pipeline for model
        pipeline_steps = [('imputer', imputer),
                          ('scale', scale),
                          ('pca', pca),
                          ('forest', forest)]
        regr = Pipeline(pipeline_steps)

        regr.fit(X_train, y_train)

It fails here.
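One possible explanation, offered as an assumption rather than a confirmed diagnosis: the imputer runs before the scaler, and scikit-learn's imputers discard columns that contain no observed values at fit time. If every MA column were entirely NaN for one product, the array would enter the pipeline with 6 columns and reach StandardScaler with none. The sketch below uses SimpleImputer from current scikit-learn as a stand-in for the legacy Imputer(axis=0) in the question, and a small all-NaN array as a hypothetical X_train.

```python
import warnings

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train: 4 rows, 6 feature columns, all NaN.
X = np.full((4, 6), np.nan)
print(X.shape)  # (4, 6)

imputer = SimpleImputer(strategy='median')
with warnings.catch_warnings():
    # Some versions warn about features with no observed values.
    warnings.simplefilter('ignore')
    X_imputed = imputer.fit_transform(X)
print(X_imputed.shape)  # columns with no observed values are dropped

try:
    StandardScaler().fit(X_imputed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

If this is what happens, checking df_ma.loc[:, 'MA7':].isna().all() for the failing product before calling fit would show which feature columns have no observed values.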

Data snippet:

biqQueryData (the entire dataframe)

364174,1084,2019-12-12,,,,0.0

......

364174,1084,2019-12-13,,,,0.0

188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0

......

188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0

Subset DF 1:

364174,1084,2019-12-12,,,,0.0 .....

364174,1084,2019-12-13,,,,0.0

Subset DF 2:

188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0

......

188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0

Any help here would be great! Thanks.

Tags: python-3.x, pandas, numpy, machine-learning, data-science

Solution

