RandomSurvivalForest: y must be a structured array with the first field being a binary class event indicator

Problem description

I'm not sure why I'm getting the following error.

 y must be a structured array with the first field being a binary class event indicator and the second field the time of the event/censoring
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-2136632501118727> in <module>
      5                            n_jobs=-1,
      6                            random_state=0)
----> 7 rsf.fit(X_train, y_train)

/databricks/python/lib/python3.8/site-packages/sksurv/ensemble/forest.py in fit(self, X, y, sample_weight)
    235         self
    236         """
--> 237         X, event, time = check_arrays_survival(X, y)
    238 
    239         self.n_features_ = X.shape[1]

/databricks/python/lib/python3.8/site-packages/sksurv/util.py in check_arrays_survival(X, y, **kwargs)
    192         Time of event or censoring.
    193     """
--> 194     event, time = check_y_survival(y)
    195     kwargs.setdefault("dtype", numpy.float64)
    196     X = check_array(X, ensure_min_samples=2, **kwargs)

/databricks/python/lib/python3.8/site-packages/sksurv/util.py in check_y_survival(y_or_event, allow_all_censored, *args)
    132 
    133         if not isinstance(y, numpy.ndarray) or y.dtype.fields is None or len(y.dtype.fields) != 2:
--> 134             raise ValueError('y must be a structured array with the first field'
    135                              ' being a binary class event indicator and the second field'
    136                              ' the time of the event/censoring')

ValueError: y must be a structured array with the first field being a binary class event indicator and the second field the time of the event/censoring

I tried converting the data type to boolean, and also converting the data to an array. My data looks like this:

   below  day_of_quarter
0      0              87
1      1              38
2      0              18
3      1              84
4      0              64

Here is my code, which uses the scikit-survival (sksurv) package. The data should be set up for survival analysis.

from pyspark.sql.functions import col
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest

# `data` is a Spark DataFrame; pull the two columns into pandas.
df = data.select(col('below'), col('day_of_quarter')).toPandas()
x = df.day_of_quarter
y = df.below.astype(bool)
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)
rsf = RandomSurvivalForest(n_estimators=1000,
                           min_samples_split=10,
                           min_samples_leaf=15,
                           max_features="sqrt",
                           n_jobs=-1,
                           random_state=0)
rsf.fit(X_train, y_train)  # raises the ValueError shown above

Tags: python, pandas, numpy, scikit-learn, random-forest

Solution
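
The traceback shows the condition that fails: `y` must be a `numpy.ndarray` whose dtype has exactly two fields, the event indicator first and the time second. In the question, `y` is a plain boolean Series holding only `below`, and `day_of_quarter` (which looks like the time) is passed as `X`. The sketch below illustrates one way to build such a target with NumPy alone. It assumes `below` is the event indicator and `day_of_quarter` is the time; the helper names `passes_sksurv_y_check` and `make_survival_target` and the field names `event` and `time` are made up for illustration and are not part of scikit-survival.

```python
import numpy as np

def passes_sksurv_y_check(y):
    """Mirror the condition visible in the traceback (hypothetical helper)."""
    return (
        isinstance(y, np.ndarray)
        and y.dtype.fields is not None
        and len(y.dtype.fields) == 2
    )

def make_survival_target(event, time):
    """Pack event indicator and time into a two-field structured array.

    Field order matters: event indicator first, time second.
    The field names are arbitrary choices for this sketch.
    """
    y = np.empty(len(event), dtype=[("event", bool), ("time", np.float64)])
    y["event"] = np.asarray(event).astype(bool)
    y["time"] = np.asarray(time, dtype=np.float64)
    return y

# Sample rows copied from the question.
below = np.array([0, 1, 0, 1, 0])
day_of_quarter = np.array([87, 38, 18, 84, 64])

plain = below.astype(bool)                       # what the question passes as y
y = make_survival_target(below, day_of_quarter)  # structured target

print(passes_sksurv_y_check(plain))  # False: a plain bool array has no fields
print(passes_sksurv_y_check(y))      # True
```

With scikit-survival installed, `sksurv.util.Surv.from_arrays(event=..., time=...)` builds the same kind of array. The feature matrix `X` passed to `rsf.fit` would then need to be a separate 2-D array of covariates; the post only shows these two columns, so which columns of `data` should serve as features cannot be recovered from it.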

