python-3.x - Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler
Question
I am working on a problem in which I pull the data for all ProductIDs and then loop over the DataFrame, one unique ProductID at a time, to run a set of functions on each subset.
Here, item is the ProductID/Item number:
    # loop through the big DataFrame to get the subset for each unique ID
    for item in df2['Item Nbr'].unique():
        # fetch item data
        df = df2.loc[df2['Item Nbr'] == item]
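For reference, the same per-item split can be sketched with groupby, which yields each ID together with its rows in one pass. This is a minimal sketch on a made-up three-row frame: the values and the 'Sales Qty' column are illustrative, not the real data. Note that each subset keeps the row labels of the big DataFrame, so the index of a later subset does not start at 0.

```python
import pandas as pd

# Toy stand-in for df2; the values are invented for illustration only
df2 = pd.DataFrame({'Item Nbr': [1084, 1084, 19421],
                    'Sales Qty': [None, None, 2.0]})

subsets = {}
for item, df in df2.groupby('Item Nbr'):
    # df holds only the rows for this item and keeps its original row labels
    subsets[item] = df
    print(item, df.shape, list(df.index))
```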
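I then have a set of custom Python functions. The first pass through the loop (the first ProductID) works fine, but when the loop moves on to the next ProductID I get the error below, even though I am confident that the extracted data is correct:
Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler.
However, the shapes of X_train and y_train are: (268215, 6) (268215,)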
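A shape of (268215, 6) only says that six columns exist, not that they contain any values. A quick check that may help here, sketched under the assumption that X_train is a 2-D float NumPy array as in the post, is to list the columns that are entirely NaN. The helper name all_nan_columns is hypothetical and the array below is made up.

```python
import numpy as np

def all_nan_columns(X):
    """Return the indices of columns in a 2-D float array that contain only NaN."""
    return np.flatnonzero(np.isnan(X).all(axis=0))

# Made-up array: 3 rows, 6 columns, every value missing
X_train = np.full((3, 6), np.nan)
print(X_train.shape)             # the shape still reports 6 columns
print(all_nan_columns(X_train))  # indices of the columns with no observed values
```

If this returns all six indices for the failing product, the features are empty before the pipeline ever runs.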
Code snippet (additional info):
The file is huge, but the initial big DataFrame has
[362988 rows x 7 columns] for the first product and [268215 rows x 7 columns] for the second product.
Extended code:
The big DataFrame, which contains two unique product IDs:
    biqQueryData = get_item_data(verbose=True)
Loop over each unique product ID to extract the subset of the DataFrame that belongs to it:
    for item in biqQueryData['Item Nbr'].unique():
        df = biqQueryData.loc[biqQueryData['Item Nbr'] == item]
        try:
            df_model = model_all_stores(df, item, n_jobs=n_jobs,
                                        train_model=train_model,
                                        test_model=test_model,
                                        tune_model=tune_model,
                                        export_model=export_model,
                                        output=export_demand)
The function model_all_stores:
    def model_all_stores(df_raw, item_nbr, n_jobs=1, train_model=False,
                         test_model=False, export_model=False, output=False,
                         tune_model=False):
        """Models demand for specified item.

        Predict the demand of specified item for all stores. Does not
        filter for predict hidden demand (the function get_hidden_demand
        should be used for this.)

        Output: data frame output
        """
        # ML model hyperparameters
        impute_with = 'median'
        n_estimators = 100
        min_samples_split = 3
        min_samples_leaf = 3
        max_depth = None

        # load data and subset traited and valid
        dfnew = subset_traited_valid(df_raw)

        # get known demand
        df_ma = get_demand(dfnew)

        # impute missing sales data
        median_sales = df_ma['Sales Qty'].median()
        df_ma['Sales Qty'] = df_ma['Sales Qty'].fillna(median_sales)

        # add moving average features
        df_ma = df_ma.sort_values('Gregorian Days')
        window_list = [7 * x for x in [1, 2, 4, 8, 16, 52]]
        for w in window_list:
            grouped = df_ma.groupby('Store Nbr')['Sales Qty'].shift(1)
            rolling = grouped.rolling(window=w, min_periods=1).mean()
            df_ma['MA' + str(w)] = rolling.reset_index(0, drop=True)

        X_full = df_ma.loc[:, 'MA7':].values
        #print(X_full.shape)

        # use full data if not testing/tuning
        rows_for_model = df_ma['Known Demand'].notnull()
        X = df_ma.loc[rows_for_model, 'MA7':].values
        y = df_ma.loc[rows_for_model, 'Known Demand'].values
        X_train, y_train = X, y
        print(X_train.shape, y_train.shape)

        if train_model:
            # instantiate model components
            imputer = Imputer(missing_values='NaN', strategy=impute_with, axis=0)
            scale = StandardScaler()
            pca = PCA()
            forest = RandomForestRegressor(n_estimators=n_estimators,
                                           max_features='sqrt',
                                           min_samples_split=min_samples_split,
                                           min_samples_leaf=min_samples_leaf,
                                           max_depth=max_depth,
                                           criterion='mse',
                                           random_state=42,
                                           warm_start=True,
                                           n_jobs=n_jobs)

            # pipeline for model
            pipeline_steps = [('imputer', imputer),
                              ('scale', scale),
                              ('pca', pca),
                              ('forest', forest)]
            regr = Pipeline(pipeline_steps)
            regr.fit(X_train, y_train)
This is where it fails.
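The shape printed for X_train is what enters the pipeline, not what StandardScaler receives, because the imputer runs first. One way six columns can become zero in between: with the mean or median strategy, scikit-learn's imputer discards any column that had no observed values at fit time. The sketch below reproduces the same error message on a made-up all-NaN array. It is a hypothesis about the cause, not a confirmed diagnosis, and it uses SimpleImputer, the current replacement for the older Imputer class used in the post.

```python
import warnings

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up input: 4 rows, 6 feature columns, every value missing
X = np.full((4, 6), np.nan)

with warnings.catch_warnings():
    # the imputer warns that it is skipping features without observed values
    warnings.simplefilter('ignore')
    X_imputed = SimpleImputer(strategy='median').fit_transform(X)

print(X.shape, '->', X_imputed.shape)

try:
    StandardScaler().fit(X_imputed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

If this is what happens in the real pipeline, the question becomes why all six MA columns are empty for the second product only.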
Data snippet:
biqQueryData (the whole DataFrame)
364174,1084,2019-12-12,,,,0.0
......
364174,1084,2019-12-13,,,,0.0
188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0
......
188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0
Subset DF 1:
364174,1084,2019-12-12,,,,0.0 .....
364174,1084,2019-12-13,,,,0.0
Subset DF 2:
188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0
......
188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0
Any help here would be great! Thanks.
Solution
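One hypothesis that fits the symptoms, offered as a sketch rather than a confirmed answer: in the moving-average loop, groupby(...).shift(1) returns a plain Series, so the following rolling() is no longer computed per store, and reset_index(0, drop=True) replaces the row labels with 0..n-1. Assigning that Series back aligns on labels, so a subset whose labels do not overlap 0..n-1 gets all-NaN MA columns. Whether this happens in the real data depends on the row labels returned by subset_traited_valid and get_demand, which the post does not show. The frame below is made up, the column names follow the post, and sorting by store as well as day is a choice made for this sketch.

```python
import pandas as pd

# Made-up subset whose row labels start at 100, as they might for a later
# product sliced out of a larger frame
df_ma = pd.DataFrame({
    'Store Nbr':      [1, 1, 1, 2, 2, 2],
    'Gregorian Days': [1, 2, 3, 1, 2, 3],
    'Sales Qty':      [2.0, 4.0, 6.0, 10.0, 20.0, 30.0],
}, index=range(100, 106))
df_ma = df_ma.sort_values(['Store Nbr', 'Gregorian Days'])
w = 2

# Pattern from the post: shift() returns a plain Series, so rolling() is not
# per store, and reset_index replaces labels 100..105 with 0..5
shifted = df_ma.groupby('Store Nbr')['Sales Qty'].shift(1)
rolling = shifted.rolling(window=w, min_periods=1).mean()
df_ma['MA_post'] = rolling.reset_index(level=0, drop=True)  # no shared labels

# Variant: roll within each store; the result is indexed by
# (Store Nbr, original label), so dropping level 0 restores the row labels
per_store = (shifted.groupby(df_ma['Store Nbr'])
                    .rolling(window=w, min_periods=1)
                    .mean())
df_ma['MA_fixed'] = per_store.reset_index(level=0, drop=True)

print(df_ma)
```

In this toy frame MA_post is entirely NaN while MA_fixed holds the per-store moving average of the previous days' sales, with NaN only on each store's first day.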