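tensorflow - How to generate all combinations of a contextual conversation dataset using TensorFlow datasets
Problem description
Suppose I have a TSV dataset file of conversations of arbitrary length, where the messages are separated by tabs and each line represents one complete conversation: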
Hi\tHow are you?\tIm doing well
This is a conversation?\tYes.\tHuh\tIts also a test
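I want to create a TensorFlow dataset from this that contains every in-order combination of each conversation, as shown below (I separate inputs and targets with \t, and individual messages with \b, written as /b in the examples and code):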
Hi\tHow are you?
Hi/bHow are you?\tIm doing well
This is a conversation?\tYes.
This is a conversation?/bYes.\tHuh
This is a conversation?/bYes./bHuh\tIts also a test
Essentially, I want to do the following, but inside a TensorFlow dataset:
def convertline(text, max_length=20):
    text = text.split("\t")  # split the conversation by tabs
    inputs, targets = [], []  # create empty lists for inputs and targets
    for y in range(1, len(text)):  # iterate through the split conversation
        x = y - max_length if y - max_length >= 0 else 0  # get the starting index; if it's negative, use 0 instead
        inputs.append("/b".join(text[x:y]))  # append the current window, joined by /b, to the inputs
        targets.append(text[y])  # append the target
    return [{"inputs": inputs[i], "targets": targets[i]} for i in range(len(inputs))]  # pair them up as dicts of inputs and targets

with open("testfile.txt", "r") as f:  # open a file
    line = f.readline()  # read the file line by line
    while line:
        print(convertline(line.strip()))  # run the function and print its results
        line = f.readline()
This returns:
[{'inputs': 'Hi', 'targets': 'How are you?'}, {'inputs': 'Hi/bHow are you?', 'targets': 'Im doing well'}]
[{'inputs': 'This is a conversation?', 'targets': 'Yes.'}, {'inputs': 'This is a conversation?/bYes.', 'targets': 'Huh'}, {'inputs': 'This is a conversation?/bYes./bHuh', 'targets': 'Its also a test'}]
This is what I have so far:
def dataset(split, shuffle_files=False):
    # Load lines from the text file as examples.
    ds = tf.data.TextLineDataset(nq_tsv_path[split])
    # Split each "<question>\t<answer>" example into (question, answer) tuple.
    # This definitely won't work, and is most likely where the code to generate sliding windows should be
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Map the dataset into dicts of questions and answers
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
Solution
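One possible approach, offered as a minimal sketch rather than a definitive answer: split each line into messages with tf.strings.split, then use flat_map so that a single conversation expands into one example per target message. The names SEP, line_to_examples and conversation_dataset are hypothetical and chosen for illustration; the sketch assumes TensorFlow 2.x with eager execution and mirrors the windowing logic of convertline from the question.

```python
import tensorflow as tf

SEP = "/b"  # hypothetical constant: the message separator used in the question's examples


def line_to_examples(line, max_length=20):
    """Expand one tab-separated conversation into a dataset of inputs/targets dicts."""
    messages = tf.strings.split(line, "\t")       # 1-D string tensor of messages
    n = tf.size(messages, out_type=tf.int64)      # number of messages in this conversation

    def make_example(y):
        y = tf.cast(y, tf.int32)                  # index of the target message
        x = tf.maximum(y - max_length, 0)         # window start, clamped at 0
        context = tf.strings.reduce_join(messages[x:y], axis=0, separator=SEP)
        return {"inputs": context, "targets": messages[y]}

    # One example for every message after the first one.
    return tf.data.Dataset.range(1, n).map(make_example)


def conversation_dataset(path, max_length=20):
    """Read conversations line by line and flatten them into sliding-window examples."""
    lines = tf.data.TextLineDataset(path)
    return lines.flat_map(lambda line: line_to_examples(line, max_length))
```

Iterating over conversation_dataset("testfile.txt") should then produce the same input/target pairs as convertline, as byte strings. Because everything stays inside tf.data ops, the pipeline can still be shuffled, batched, or prefetched afterwards; a pure-Python generator wrapped with tf.data.Dataset.from_generator would be an alternative if the graph-mode slicing feels too restrictive.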