AWS SageMaker ValueError: Unsupported dtype object on array when using strings and dates

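Question

I have a CSV file that I am trying to run RCF (Random Cut Forest) on. If I include dates or strings in the CSV, I get the error shown below. If I restrict it to integer and float fields, the script runs fine. Is there a way to handle dates and strings? I have seen the taxi example from AWS, and it has dates just like mine.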

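import pandas as pd
import sagemaker
from sagemaker import RandomCutForest

# parse_dates=True only tries to parse the index, so the date column is still read as strings
eventData = pd.read_csv(data_location, delimiter=",", header=None, parse_dates=True)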

print('Starting RCF Training')
# specify general training job information
rcf = RandomCutForest(role=sagemaker.get_execution_role(),
                      instance_count=1,
                      instance_type='ml.m4.xlarge',
                      data_location=data_location,
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      base_job_name="ad-rcf",
                      num_samples_per_tree=512,
                      num_trees=50)

rcf.fit(rcf.record_set(eventData.values))

CSV data that fails

392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01
691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01
392507,1613744,1/2/2020 19:20,1577236815,1797,3.30E+01,-9.67E+01
392507,1613744,1/29/2020 19:04,1577264188,1797,3.30E+01,-9.67E+01

Error output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-ba19bf5d66a2> in <module>
---> 21 rcf.fit(rcf.record_set(eventData.values))
     22 
     23 print('Done RCF Training')

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in record_set(self, train, labels, channel, encrypt)
    281         logger.debug("Uploading to bucket %s and key_prefix %s", bucket, key_prefix)
    282         manifest_s3_file = upload_numpy_to_s3_shards(
--> 283             self.instance_count, s3, bucket, key_prefix, train, labels, encrypt
    284         )
    285         logger.debug("Created manifest file %s", manifest_s3_file)

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
    443                 s3.Object(bucket, key_prefix + file).delete()
    444         finally:
--> 445             raise ex
    446 
    447 

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
    424                     write_numpy_to_dense_tensor(file, shard, label_shards[shard_index])
    425                 else:
--> 426                     write_numpy_to_dense_tensor(file, shard)
    427                 file.seek(0)
    428                 shard_index_string = str(shard_index).zfill(len(str(len(shards))))

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
    154             )
    155         resolved_label_type = _resolve_type(labels.dtype)
--> 156     resolved_type = _resolve_type(array.dtype)
    157 
    158     # Write each vector in array into a Record in the file object

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
    288     if dtype == np.dtype("float32"):
    289         return "Float32"
--> 290     raise ValueError("Unsupported dtype {} on array".format(dtype))
    291 
    292 

ValueError: Unsupported dtype object on array

Tags: python, amazon-sagemaker

Answer


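I figured out my problem: RCF cannot handle dates and strings. This page for AWS's Kinesis product covers the same Random Cut Forest algorithm: https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html It states: "The algorithm accepts the DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT data types."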
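One possible workaround is to turn the date column into numeric features before building the record set. The sketch below is only an illustration under several assumptions: the helper name `to_numeric_features` is hypothetical, column index 2 holds the date (as in the sample rows from the question), the dates use the `month/day/year hour:minute` layout seen there, and hour of day plus day of week are useful features for the data. Any remaining non-numeric columns are simply dropped.

```python
import io

import numpy as np
import pandas as pd

# Two of the sample rows from the question, inlined so the sketch is self-contained.
SAMPLE_CSV = (
    "392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01\n"
    "691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01\n"
)


def to_numeric_features(df, date_col):
    """Hypothetical helper: replace one date column with numeric features."""
    out = df.copy()
    # Assumed date layout, taken from the sample rows: month/day/year hour:minute.
    ts = pd.to_datetime(out[date_col], format="%m/%d/%Y %H:%M")
    out = out.drop(columns=[date_col])
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0
    # Keep only numeric columns so the resulting array does not have dtype object.
    numeric = out.select_dtypes(include="number")
    return numeric.to_numpy(dtype="float64")


event_data = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None)
features = to_numeric_features(event_data, date_col=2)
print(features.dtype, features.shape)
```

An array built this way could then be passed to `rcf.record_set(...)` in place of `eventData.values`; that step is not run here because it needs an AWS session.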

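What is misleading about AWS's NYC Taxi example is that it uses .value, which refers only to the value column of the data. In effect, the example drops the date as an RCF feature. It does not help that .values on the array also works and looks very similar to .value.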
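The difference between the two attributes can be seen on a small frame. This is a minimal sketch using a hypothetical two-row stand-in for the taxi data (the column names `timestamp` and `value` are assumed to match the example; the rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the taxi data: one string timestamp column, one numeric column.
taxi_like = pd.DataFrame(
    {
        "timestamp": ["2014-07-01 00:00:00", "2014-07-01 00:30:00"],
        "value": [10844, 8127],
    }
)

# .value selects only the numeric "value" column, so the array has an integer dtype.
single_column = taxi_like.value.to_numpy().reshape(-1, 1)

# .values converts the whole frame, strings included, so the array has dtype object.
whole_frame = taxi_like.values

print(single_column.dtype)
print(whole_frame.dtype)
```

The second array is the kind that triggers "Unsupported dtype object on array" in the traceback above.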

