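python - How to speed up text translation?
Problem description
Is there a way to speed up the processing?
I have to translate three text fields in each of ~126k samples. The estimated time for this task is over 96 hours: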
import pickle
from deep_translator import GoogleTranslator
from tqdm import tqdm

def translate(text):
    return GoogleTranslator(
        source='english',
        target='portuguese').translate(text)

def translate_samples(samples):
    translated_sample = []
    for sample in tqdm(samples):
        translated_sample.append({
            "idx": sample["idx"],
            "qs1": translate(sample["qs1"]),
            "qs2": translate(sample["qs2"]),
            "ans": translate(sample["ans"]),
            "cls": sample["cls"]
        })
    return translated_sample

def perform_tasks():
    with open("resource/dataset/aug.pkl", "rb") as samples_file:
        samples = pickle.load(samples_file)
    translated_sample = translate_samples(samples)
    with open("resource/dataset/aug_pt_br.pkl", "wb") as samples_file:
        pickle.dump(translated_sample, samples_file)

if __name__ == '__main__':
    perform_tasks()

# 0%| | 36/126738 [00:36<96:12:38, 2.14s/it]
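Can you give me some pointers?
Solution
You can try other pre-trained models available in the Hugging Face library. The sample code below might save about 75 hours on your dataset. You can also try combining batch processing with a GPU for much better performance.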
from transformers import MarianTokenizer, MarianMTModel
import time

# NOTE: this checkpoint translates French to English; swap in a checkpoint
# for your own language pair (English to Portuguese in the question).
model_name = 'Helsinki-NLP/opus-mt-fr-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Assumed to be defined beforehand: `data` (e.g. a pandas DataFrame),
# `source_cols` and `target_cols` (parallel lists of column names).

# function to get the chunks out of the entire data
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

for i, col in enumerate(source_cols):
    start = time.time()
    translations = []
    batch_no = 0
    for source_text in chunks(data[col].tolist(), 500):
        batch_no += 1
        print('batch %d tokenization started' % batch_no)
        # encode the source language text; calling the tokenizer returns a
        # dict of tensors, which is what generate(**batch) expects
        batch = tokenizer(source_text, return_tensors='pt', padding=True, truncation=True)
        # predict the output token ids
        print('batch %d prediction started.' % batch_no)
        outputs = model.generate(**batch)
        print('batch %d decoding started.' % batch_no)
        # batch_decode turns every generated sequence back into a string
        decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(decoded_output)
        print('batch %d completed' % batch_no)
    data[target_cols[i]] = translations
    end = time.time()
    print('%.2f hours taken for verbatim %s' % ((end - start)/3600, col))
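The batching and GPU suggestion can also be applied directly to the list-of-dicts layout from the question. What follows is only a minimal sketch under some assumptions: the helper names `make_marian_translator` and `translate_samples`, the `FIELDS` tuple, and the batch size of 32 are hypothetical choices for illustration, and you still have to supply a checkpoint name that matches your language pair. The model is moved to the GPU when one is available, and each text field is translated in fixed-size batches.

```python
FIELDS = ("qs1", "qs2", "ans")  # text fields to translate, taken from the question


def chunks(lst, n):
    """Yield successive slices of lst with at most n items each."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]


def make_marian_translator(model_name):
    """Return a function that translates a list of strings in one batch.

    Hypothetical helper: loads a MarianMT checkpoint and moves it to the
    GPU if one is available. Imports are local so the rest of the module
    can be used without torch/transformers installed.
    """
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    def translate_batch(texts):
        # Tokenize the whole batch at once and move the tensors to the device.
        batch = tokenizer(
            texts, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model.generate(**batch)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch


def translate_samples(samples, translate_batch, batch_size=32):
    """Translate FIELDS of every sample, batch by batch; other keys are copied."""
    translated = [dict(sample) for sample in samples]  # leave the input untouched
    for field in FIELDS:
        texts = [sample[field] for sample in samples]
        results = []
        for chunk in chunks(texts, batch_size):
            results.extend(translate_batch(chunk))
        for sample, text in zip(translated, results):
            sample[field] = text
    return translated
```

A call would look like `translate_samples(samples, make_marian_translator(model_name))`, with `model_name` set to a checkpoint for your language pair. Passing the batch function in as an argument keeps the batching logic independent of the model, so it can be tried first with a stub before any weights are downloaded. The best batch size depends on text length and GPU memory, so treat 32 as a starting point.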