How to generate all combinations of a contextual conversation dataset using tensorflow datasets

Problem description

Suppose I have a tsv dataset file of conversations of arbitrary length, where the messages are tab-separated and each line represents a complete conversation:

Hi\tHow are you?\tIm doing well
This is a conversation?\tYes.\tHuh\tIts also a test

I want to create a tensorflow dataset from it containing all conversation combinations in order, as shown below (I'll separate inputs from targets with \t, and separate individual messages within an input with /b):

Hi\tHow are you?
Hi/bHow are you?\tIm doing well
This is a conversation?\tYes.
This is a conversation?/bYes.\tHuh
This is a conversation?/bYes./bHuh\tIts also a test

Essentially I'm looking for this implementation, but inside a tensorflow dataset:

def convertline(text, max_length=20):
    text = text.split("\t")  # split the conversation by tabs
    inputs, targets = [], []  # empty lists for inputs and targets
    for y in range(1, len(text)):  # iterate through the split conversation
        x = max(y - max_length, 0)  # start of the window, clamped at 0
        inputs.append("/b".join(text[x:y]))  # the current window, joined by /b
        targets.append(text[y])  # the next message is the target
    return [{"inputs": i, "targets": t} for i, t in zip(inputs, targets)]  # zip into dicts

with open("testfile.txt", "r") as f:  # open the file
    for line in f:  # read it line by line
        print(convertline(line.strip()))  # run the function and print its results
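If keeping the windowing in plain Python is acceptable, one option is to feed a generator built on the same logic into `tf.data.Dataset.from_generator`. This is a minimal sketch assuming TF 2.x; `pair_generator` and the sample `testfile.txt` are illustrative, not part of the original code:

```python
import tensorflow as tf

def convert_line(text, max_length=20):
    # Same sliding-window logic as convertline above, as a generator.
    msgs = text.split("\t")
    for y in range(1, len(msgs)):
        x = max(y - max_length, 0)
        yield {"inputs": "/b".join(msgs[x:y]), "targets": msgs[y]}

def pair_generator(path):
    # Yield one {"inputs", "targets"} dict per window, across all lines.
    with open(path) as f:
        for line in f:
            yield from convert_line(line.strip())

# Hypothetical sample file matching the question's data.
with open("testfile.txt", "w") as f:
    f.write("Hi\tHow are you?\tIm doing well\n")

ds = tf.data.Dataset.from_generator(
    lambda: pair_generator("testfile.txt"),
    output_signature={"inputs": tf.TensorSpec([], tf.string),
                      "targets": tf.TensorSpec([], tf.string)})
```

The drawback is that the generator runs in Python, so it won't parallelize the way native `tf.data` ops do.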

which returns:

[{'inputs': 'Hi', 'targets': 'How are you?'}, {'inputs': 'Hi/bHow are you?', 'targets': 'Im doing well'}]
[{'inputs': 'This is a conversation?', 'targets': 'Yes.'}, {'inputs': 'This is a conversation?/bYes.', 'targets': 'Huh'}, {'inputs': 'This is a conversation?/bYes./bHuh', 'targets': 'Its also a test'}]

Here is what I have so far:

def dataset(split, shuffle_files=False):
    # Load lines from the text file as examples.
    ds = tf.data.TextLineDataset(nq_tsv_path[split])
    # Split each "<question>\t<answer>" example into (question, answer) tuple.
    # This definitely won't work, and is most likely where the code to generate sliding windows should be
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Map the dataset into dicts of questions and answers
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds

Tags: tensorflow, tensorflow-datasets

Solution
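One way to express the same sliding-window logic entirely with native `tf.data` and `tf.strings` ops is sketched below, assuming TF 2.x. `split_line` and the sample `testfile.txt` are illustrative names; the idea is that `flat_map` turns each conversation line into its own small dataset of input/target pairs:

```python
import tensorflow as tf

def split_line(line, max_length=20):
    # Split one conversation line into its tab-separated messages.
    messages = tf.strings.split(line, "\t")

    def context(y):
        # Join up to `max_length` preceding messages with the /b separator.
        start = tf.maximum(y - max_length, 0)
        return tf.strings.reduce_join(messages[start:y], separator="/b")

    # One training pair per message after the first.
    indices = tf.range(1, tf.shape(messages)[0])
    inputs = tf.map_fn(context, indices, fn_output_signature=tf.string)
    return tf.data.Dataset.from_tensor_slices(
        {"inputs": inputs, "targets": messages[1:]})

# Hypothetical sample file matching the question's data.
with open("testfile.txt", "w") as f:
    f.write("Hi\tHow are you?\tIm doing well\n"
            "This is a conversation?\tYes.\tHuh\tIts also a test\n")

ds = tf.data.TextLineDataset("testfile.txt").flat_map(split_line)
```

Because everything runs as graph ops, this version avoids Python-level iteration and composes with the usual `tf.data` pipeline stages (`shuffle`, `batch`, `prefetch`). `fn_output_signature` on `tf.map_fn` requires TF 2.3 or newer.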
