ML Engine (Google Cloud Platform): parsing features from a string in the deployed model

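Question

I am using TensorFlow on ML Engine (Google Cloud Platform) to solve a regression problem. I need to send ML Engine a string tensor containing a date, for example "2018/06/05 23:00", and have my deployed model extract the basic features from it: year, month, day, and hour. For the example above, that would be (2018, 06, 05, 23). The catch is that this has to happen inside the model deployed on ML Engine, not in an intermediate API.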
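For reference, the intended mapping can be written down in plain Python. This is only a sketch of the expected input and output: it is not TensorFlow code and cannot run inside an exported graph, and the function name and the date_sep parameter are hypothetical.

```python
def extract_date_features(date_time, date_sep="/"):
    """Split a 'YYYY/MM/DD HH:MM' string into integer (year, month, day, hour)."""
    date_part, time_part = date_time.split(" ")
    year, month, day = (int(p) for p in date_part.split(date_sep))
    # Minutes are parsed away here because only the hour is wanted as a feature.
    hour = int(time_part.split(":")[0])
    return year, month, day, hour


print(extract_date_features("2018/06/05 23:00"))  # (2018, 6, 5, 23)
```

Note that converting to integers drops the leading zeros, so "06" becomes 6.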

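First, I adapted the census model tutorial to my regression problem: https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction

In that tutorial, the model is deployed to ML Engine from the terminal with gcloud commands such as gcloud ml-engine models create $MODEL_NAME ...

Below is the way I found to manipulate a string tensor containing a date in order to get the features: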

import tensorflow as tf
import numpy as np 
date_time = tf.placeholder(shape=(1,), dtype=tf.string, name="ph_date_time")

INPUT_COLUMNS=["year", "month", "day", "hour"]


split_date_time = tf.string_split(date_time, ' ')

date = split_date_time.values[0]
time = split_date_time.values[1]

split_date = tf.string_split([date], '-')
split_time = tf.string_split([time], ':')

year = split_date.values[0]
month = split_date.values[1]
day = split_date.values[2]
hours = split_time.values[0]
minutes = split_time.values[1]

year = tf.string_to_number(year, out_type=tf.int32, name="year_temp")
month = tf.string_to_number(month, out_type=tf.int32, name="month_temp")
day = tf.string_to_number(day, out_type=tf.int32, name="day_temp")
hours = tf.string_to_number(hours, out_type=tf.int32, name="hour_temp")
minutes = tf.string_to_number(minutes, out_type=tf.int32, name="minute_temp")

year = tf.expand_dims(year, 0, name="year")
month = tf.expand_dims(month, 0, name="month")
day = tf.expand_dims(day, 0, name="day")
hours = tf.expand_dims(hours, 0, name="hours")
minutes = tf.expand_dims(minutes, 0, name="minutes")

# Collect the feature tensors in a plain Python list; np.append is meant for
# NumPy arrays rather than symbolic tensors.
features = [year, month, day, hours]

# this would be the actual features to the deployed model
actual_features = dict(zip(INPUT_COLUMNS, features))



with tf.Session() as sess:
    year, month, day, hours, minutes = sess.run([year, month, day, hours, minutes], feed_dict={date_time: ["2018-12-31 22:59"]})
    print("Year =", year)
    print("Month =", month)
    print("Day =", day)
    print("Hours =", hours)
    print("Minutes =", minutes)

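The problem is that I don't know how to tell ML Engine to use the parsing above. I know it has something to do with the input_fn used to define the model or the serving_input_fn used to export it, but I'm not sure whether I have to paste my code into both or only one of them. Any suggestion would be appreciated, and sorry if the explanation is unclear.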

Tags: python, tensorflow, machine-learning, google-cloud-platform, google-cloud-ml

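Answer


The general pattern to follow is (see the documentation):

  1. Create an input_fn for training, typically using a tf.data.Dataset. The input_fn should call helper functions to do the data transformations, like those in your code. The output will be a dictionary mapping feature names to batches of values.
  2. Define FeatureColumns for the items in the output of your input_fn. If necessary, do things like feature crosses, bucketization, etc.
  3. Instantiate an estimator (e.g. DNNRegressor), passing the FeatureColumns to the constructor.
  4. Create an input_fn specifically for serving, with one or more tf.placeholder inputs whose outer dimension is a variable batch size (None). Invoke the same helper functions from (1) to do the transformations. Return a tf.estimator.export.ServingInputReceiver with the placeholder(s) as the inputs and a dict that looks identical to the dict from (1).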
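The key point of steps (1) and (4) is that both paths go through one shared helper, so training and serving see identically shaped feature dicts. A minimal framework-free sketch of that idea follows. It uses plain Python lists instead of tensors, and the names parse_dates_plain, training_features, and serving_features are hypothetical stand-ins, not TensorFlow APIs.

```python
FEATURE_NAMES = ("year", "month", "day", "hours", "minutes")


def parse_dates_plain(date_strings):
    """Shared transformation: batch of 'YYYY-MM-DD HH:MM' strings -> dict of columns."""
    features = {name: [] for name in FEATURE_NAMES}
    for s in date_strings:
        date_part, time_part = s.split(" ")
        tokens = date_part.split("-") + time_part.split(":")
        for name, token in zip(FEATURE_NAMES, tokens):
            features[name].append(int(token))
    return features


def training_features(lines):
    # Stands in for step (1): batches read from training files.
    return parse_dates_plain(lines)


def serving_features(request_batch):
    # Stands in for step (4): batches arriving in a prediction request.
    return parse_dates_plain(request_batch)


train = training_features(["2018-12-31 22:59", "2018-01-23 2:09"])
serve = serving_features(["2019-03-04 07:15"])
print(set(train) == set(serve))  # True: same feature names on both paths
print(serve)
```

Because the transformation lives in one place, a change to the parsing cannot make the serving features drift away from what the model was trained on.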

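Your particular case needs a few extra details. First, you have hard-coded a batch size of 1 into your placeholder, and the code that follows carries that assumption forward. Your placeholder must instead have shape=[None].

Unfortunately, because your code was written assuming a shape of 1, expressions such as split_date_time.values[0] are no longer valid: with a variable batch size, values holds the tokens of every row in the batch, one after another. I've added a helper function in the code below to deal with this.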
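To see why indexing into values breaks once the batch holds more than one string, here is a plain-Python imitation of the flat, row-major token list and of the regrouping that fixes it. It is a sketch only: split_values and fixed_split_plain are hypothetical names that mimic the behavior with lists, not TensorFlow calls.

```python
def split_values(batch, delimiter):
    """Mimic the flat, row-major values list produced by splitting a whole batch."""
    return [token for s in batch for token in s.split(delimiter)]


def fixed_split_plain(batch, delimiter, num_cols):
    """Regroup the flat values into one row of num_cols tokens per input string."""
    values = split_values(batch, delimiter)
    return [values[i:i + num_cols] for i in range(0, len(values), num_cols)]


batch = ["2018-12-31 22:59", "2018-01-23 2:09"]

values = split_values(batch, " ")
print(values)                # ['2018-12-31', '22:59', '2018-01-23', '2:09']
print(values[0], values[1])  # covers only the first row of the batch

rows = fixed_split_plain(batch, " ", 2)
dates = [row[0] for row in rows]  # column 0 of every row
times = [row[1] for row in rows]  # column 1 of every row
print(dates)                 # ['2018-12-31', '2018-01-23']
print(times)                 # ['22:59', '2:09']
```

Selecting a column across all rows is what the slices such as [:, 0] do on the reshaped tensor.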

Here is some code that will hopefully work for you:

import tensorflow as tf

# tf.string_split returns a SparseTensor. When using a variable batch size,
# this can be difficult to further manipulate. In our case, we don't need
# a SparseTensor, because we have a fixed number of elements each split.
# So we do the split and convert the SparseTensor to a dense tensor.
def fixed_split(batched_string_tensor, delimiter, num_cols):
    # When splitting a batch of elements, the values array is row-major, e.g.
    # ["2018-01-02", "2019-03-04"] becomes ["2018", "01", "02", "2019", "03", "04"].
    # So we simply split the string then reshape the array to create a dense
    # matrix with the same rows as the input, but split into columns, e.g.,
    # [["2018", "01", "02"], ["2019", "03", "04"]]
    split = tf.string_split(batched_string_tensor, delimiter)
    return tf.reshape(split.values, [-1, num_cols])


def parse_dates(dates):  
    split_date_time = fixed_split(dates, ' ', 2)

    date = split_date_time[:, 0]
    time = split_date_time[:, 1]

    # The flat split values cycle through year, month, day for each row;
    # fixed_split reshapes them into one row of three columns per input string.
    split_date = fixed_split(date, '-', 3)
    split_time = fixed_split(time, ':', 2)

    year = split_date[:, 0]
    month = split_date[:, 1]
    day = split_date[:, 2]
    hours = split_time[:, 0]
    minutes = split_time[:, 1]

    year = tf.string_to_number(year, out_type=tf.int32, name="year_temp")
    month = tf.string_to_number(month, out_type=tf.int32, name="month_temp")
    day = tf.string_to_number(day, out_type=tf.int32, name="day_temp")
    hours = tf.string_to_number(hours, out_type=tf.int32, name="hour_temp")
    minutes = tf.string_to_number(minutes, out_type=tf.int32, name="minute_temp")

    return {"year": year, "month": month, "day": day, "hours": hours, "minutes": minutes}


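BATCH_SIZE = 32  # Assumed example value; the original snippet left BATCH_SIZE undefined.


def training_input_fn():
    filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
    dataset = tf.data.TextLineDataset(filenames)
    # batch() returns a new dataset, so the result has to be reassigned.
    dataset = dataset.batch(BATCH_SIZE)
    iterator = dataset.make_one_shot_iterator()
    # Only features are produced here; a real Estimator input_fn would also return labels.
    return parse_dates(iterator.get_next())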


def serving_input_fn():
    date_strings = tf.placeholder(dtype=tf.string, shape=[None], name="date_strings")
    features = parse_dates(date_strings)
    return tf.estimator.export.ServingInputReceiver(features, date_strings)


with tf.Session() as sess:
    date_time_list = ["2018-12-31 22:59", "2018-01-23 2:09"]

    date_strings = tf.placeholder(dtype=tf.string, shape=[None], name="date_strings")
    features = parse_dates(date_strings)


    fetches = [features[k] for k in ["year", "month", "day", "hours", "minutes"]]
    year, month, day, hours, minutes = sess.run(fetches, feed_dict={date_strings: date_time_list})
    print("Year =", year)
    print("Month =", month)
    print("Day =", day)
    print("Hours =", hours)
    print("Minutes =", minutes)
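One caveat follows from the reshape in fixed_split: it relies on every string splitting into exactly num_cols tokens. If the token counts differ but their total still happens to be divisible by num_cols, the reshape succeeds and silently mixes rows. The plain-Python sketch below illustrates that failure mode with a hypothetical check_fixed_columns helper that a client could run before sending a request. It is an illustration under those assumptions, not part of the code above.

```python
def check_fixed_columns(batch, delimiter, num_cols):
    """Raise ValueError if any string does not split into exactly num_cols tokens."""
    bad = [s for s in batch if len(s.split(delimiter)) != num_cols]
    if bad:
        raise ValueError(
            "expected %d tokens per row, malformed rows: %r" % (num_cols, bad)
        )


good = ["2018-12-31 22:59", "2018-01-23 2:09"]
check_fixed_columns(good, " ", 2)  # well-formed batch, no exception

# One row has too few tokens and one has too many, yet the total is 4,
# so regrouping into rows of 2 would succeed and pair tokens from different rows.
bad = ["2018-12-31", "2018-01-23 2:09 extra"]
flat = [token for s in bad for token in s.split(" ")]
print(len(flat))  # 4

try:
    check_fixed_columns(bad, " ", 2)
except ValueError as err:
    print("rejected:", err)
```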
