python - SpeechBrain: dataio_prepare function with a CSV file
Problem description
I am currently following the ASRfromScratch tutorial, but I am trying to make it work with the Fluent Speech Commands dataset (https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/). I got through the Tokenizer and Language Model parts without any problem, but I am struggling with the SpeechRecognizer part. I modified the dataio_prepare function, but I am not sure this is the right approach:
import torch
import speechbrain as sb


def dataio_prepare(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.

    Arguments
    ---------
    hparams : dict
        This dictionary is loaded from the `train.yaml` file, and it includes
        all the hyperparameters needed for dataset construction and loading.

    Returns
    -------
    datasets : dict
        Dictionary containing "train", "valid", and "test" keys that correspond
        to the DynamicItemDataset objects.
    """
    # Define audio pipeline. In this case, we simply read the path contained
    # in the variable wav with the audio reader.
    @sb.utils.data_pipeline.takes("path")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(path):
        """Load the audio signal. This is done on the CPU in the `collate_fn`."""
        sig = sb.dataio.dataio.read_audio('../fluent_speech_commands_dataset/' + path)
        return sig

    # Define text processing pipeline. We start from the raw text and then
    # encode it using the tokenizer. The tokens with BOS are used for feeding
    # the decoder during training, the tokens with EOS for computing the cost
    # function. The tokens without BOS or EOS are for computing the CTC loss.
    @sb.utils.data_pipeline.takes("transcription")
    @sb.utils.data_pipeline.provides(
        "words", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
    )
    def text_pipeline(transcription):
        """Processes the transcriptions to generate proper labels"""
        yield transcription
        tokens_list = hparams["tokenizer"].encode_as_ids(transcription)
        yield tokens_list
        tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
        yield tokens_bos
        tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
        yield tokens_eos
        tokens = torch.LongTensor(tokens_list)
        yield tokens

    # Define datasets from the csv data manifest files
    # Define datasets sorted by ascending lengths for efficiency
    datasets = {}
    data_folder = hparams["data_folder"]
    for dataset in ["train", "valid", "test"]:
        datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_csv(
            csv_path=hparams[f"{dataset}_annotation"],
            replacements={"data_root": data_folder},
            dynamic_items=[audio_pipeline, text_pipeline],
            output_keys=[
                "id",
                "sig",
                "words",
                "tokens_bos",
                "tokens_eos",
                "tokens",
            ],
        )
        hparams[f"{dataset}_dataloader_opts"]["shuffle"] = False

    # Sorting training data with ascending order makes the code much
    # faster because we minimize zero-padding. In most of the cases, this
    # does not harm the performance.
    if hparams["sorting"] == "ascending":
        datasets["train"] = datasets["train"].filtered_sorted(sort_key="length")
        hparams["train_dataloader_opts"]["shuffle"] = False
    elif hparams["sorting"] == "descending":
        datasets["train"] = datasets["train"].filtered_sorted(
            sort_key="length", reverse=True
        )
        hparams["train_dataloader_opts"]["shuffle"] = False
    elif hparams["sorting"] == "random":
        hparams["train_dataloader_opts"]["shuffle"] = True
        pass
    else:
        raise NotImplementedError(
            "sorting must be random, ascending or descending"
        )
    return datasets
For clarification, the .csv file looks like this:
ID,path,speakerId,transcription,action,object,location
0,wavs/speakers/2BqVo8kVB2Skwgyb/0a3129c0-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Change language,change language,none,none
1,wavs/speakers/2BqVo8kVB2Skwgyb/0ee42a80-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Resume,activate,music,none
2,wavs/speakers/2BqVo8kVB2Skwgyb/144d5be0-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Turn the lights on,activate,lights,none
3,wavs/speakers/2BqVo8kVB2Skwgyb/1811b6e0-4474-11e9-a9a5-5dbec3b8816a.wav,2BqVo8kVB2Skwgyb,Switch on the lights,activate,lights,none
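Before training, it can help to confirm that a manifest like the one above has the columns the pipelines rely on. The following is a minimal sketch using only the standard library, not part of SpeechBrain. It assumes that from_csv expects an ID column and that the pipelines in this post take the path and transcription columns; check_manifest and the two-row sample are hypothetical and only illustrate the idea.

```python
import csv
import io

# Assumption: "ID" is required by from_csv; "path" and "transcription"
# are the columns consumed by audio_pipeline and text_pipeline in this post.
REQUIRED_COLUMNS = ("ID", "path", "transcription")


def check_manifest(csv_file, required=REQUIRED_COLUMNS):
    """Return (missing_columns, ids_of_rows_with_empty_transcription)."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in required if name not in header]
    empty_rows = []
    if "transcription" in header:
        for row in reader:
            # An empty transcription yields an empty token list later on.
            if not (row["transcription"] or "").strip():
                empty_rows.append(row.get("ID"))
    return missing, empty_rows


# Hypothetical sample: the second row has an empty transcription.
sample = io.StringIO(
    "ID,path,speakerId,transcription,action,object,location\n"
    "0,wavs/a.wav,spk1,Change language,change language,none,none\n"
    "1,wavs/b.wav,spk1,,activate,music,none\n"
)
missing, empty_rows = check_manifest(sample)
print("missing columns:", missing)
print("rows with empty transcription:", empty_rows)
```

For a real manifest, the same function can be called on a file opened with open(path, newline="").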
I also removed the lines corresponding to the pretraining stage, because I did not know how to make them work with my own dataset:
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])
My problem is that the model-fitting stage fails on the very first batch, and I do not know how to fix it:
(Polette) aurelienmarchal@aurelienmarchal-X556UQ:~/Stage/Polette/speech_recognizer$ python3 train.py train.yaml --batch_size=2
../noise/rirs_noises.zip exists. Skipping download
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/CRDNN_BPE_960h_LM/42
speechbrain.core - Info: ckpt_interval_minutes arg from hparam file is used
speechbrain.core - 171.8M trainable parameters in ASR
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
0%| | 0/11566 [00:00<?, ?it/s]
speechbrain.core - Exception:
Traceback (most recent call last):
File "train.py", line 452, in <module>
asr_brain.fit(
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/core.py", line 1011, in fit
for batch in t:
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/tqdm/std.py", line 1133, in __iter__
for obj in iterable:
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/dataio/batch.py", line 125, in __init__
padded = PaddedData(*padding_func(values, **padding_kwargs))
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/utils/data_utils.py", line 415, in batch_pad_right
padded, valid_percent = pad_right_to(
File "/home/aurelienmarchal/.local/lib/python3.8/site-packages/speechbrain/utils/data_utils.py", line 353, in pad_right_to
valid_vals.append(tensor.shape[j] / target_shape[j])
ZeroDivisionError: division by zero
Also, my train.yaml looks like this:
# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN
# Decoder: GRU + beamsearch + RNNLM
# Tokens: 500 BPE
# losses: CTC+ NLL
# Training: mini-librispeech
# Pre-Training: librispeech 960h
# Authors: Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga, Samuele Cornell 2020
# # ############################################################################
# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 42
__set_seed: !apply:torch.manual_seed [!ref <seed>]
# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set data_folder with the local path
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.
# It allows you to read the data much faster without slowing down the shared filesystem.
data_folder: ../fluent_speech_commands_dataset # In this case, data will be automatically downloaded here.
data_folder_rirs: ../noise # noise/ris dataset will automatically be downloaded here
output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
# Language model (LM) pretraining
# NB: To avoid mismatch, the speech recognizer must be trained with the same
# tokenizer used for LM training. Here, we download everything from the
# speechbrain HuggingFace repository. However, a local path pointing to a
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead. E.g if you want to use your own LM / tokenizer.
pretrained_path: ../language_model/results/RNNLM/save/CKPT+2021-05-12+15-27-08+00/
# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../fluent_speech_commands_dataset/data/train_data.csv
valid_annotation: ../fluent_speech_commands_dataset/data/valid_data.csv
test_annotation: ../fluent_speech_commands_dataset/data/test_data.csv
# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: random
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1
# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
valid_dataloader_opts:
    batch_size: !ref <batch_size>
test_dataloader_opts:
    batch_size: !ref <batch_size>
# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40
# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 500 # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0
unk_index: 0
# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25
# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>
# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>
# Feature normalization (mean and std)
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <data_folder_rirs>
    babble_prob: 0.0
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
# Adds speech change + time and frequency dropouts (time-domain implementation).
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]
# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>
    use_rnnp: False
# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
# Attention-based RNN decoder.
decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>
# Linear transformation on the top of the encoder.
ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>
# Linear transformation on the top of the decoder.
seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>
# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True
# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>
# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor
# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
    encoder: !ref <encoder>
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>
    normalize: !ref <normalize>
    env_corrupt: !ref <env_corrupt>
    lm_model: !ref <lm_model>
# Gathering all the submodels in a single model object.
model: !new:torch.nn.ModuleList
    - - !ref <encoder>
      - !ref <embedding>
      - !ref <decoder>
      - !ref <ctc_lin>
      - !ref <seq_lin>
# This is the RNNLM that is used according to the Huggingface repository
# NB: It has to match the pre-trained RNNLM!!
lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True # For inference
# Beamsearch is applied on the top of the decoder. If the language model is
# given, a language model is applied (with a weight specified in lm_weight).
# If ctc_weight is set, the decoder uses CTC + attention beamsearch. This
# improves the performance, but slows down decoding. For a description of
# the other parameters, please see the speechbrain.decoders.S2SRNNBeamSearchLM.
# It makes sense to have a lighter search during validation. In this case,
# we don't use the LM and CTC probabilities during decoding.
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <valid_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    temperature: !ref <temperature>
# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well.
# Please, remove this part if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearchLM
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    language_model: !ref <lm_model>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    lm_weight: !ref <lm_weight>
    ctc_weight: !ref <ctc_weight_decode>
    temperature: !ref <temperature>
    temperature_lm: !ref <temperature_lm>
# This function manages learning rate annealing over the epochs.
# We here use the NewBoB algorithm, that anneals the learning rate if
# the improvements over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0
# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8
# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True
# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        normalizer: !ref <normalize>
        counter: !ref <epoch_counter>
# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
#pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
#    collect_in: !ref <save_folder>
#    loadables:
#        lm: !ref <lm_model>
#        tokenizer: !ref <tokenizer>
#        model: !ref <model>
#    paths:
#        lm: !ref <pretrained_path>/lm.ckpt
#        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
#        model: !ref <pretrained_path>/asr.ckpt
That's it! If you have an idea of how to do this, or if you are comfortable with the SpeechBrain library, please let me know! Thank you for reading my post.
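Solution
It has been about two months since you posted this question, so you may have solved it already. I ran into the same problem, so I am posting how I solved it in case it helps others.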
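I was trying to train a model on the LibriSpeech Bengali dataset. In my case, the problem was the tokenizer. In this line of text_pipeline, we call the tokenizer defined in the YAML file: tokens_list = hparams["tokenizer"].encode_as_ids(transcription). We also call it during decoding. If you look at the YAML file, you will find that the tokenizer is a bare sentencepiece.SentencePieceProcessor. It therefore does not use the .model file that was created when the tokenizer was trained (in the question's YAML, the pretrainer block that would have loaded tokenizer.ckpt into it is commented out). As a result, it produces empty lists when tokenizing, so target_shape[j] becomes 0 and causes the ZeroDivisionError. So I modified text_pipeline as follows: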
import sentencepiece

def text_pipeline(words):
    yield words
    # Load the trained SentencePiece model explicitly from the path given
    # in the YAML file. Note that this reloads the model for every item;
    # creating the tokenizer once, outside the function, avoids that cost.
    tokenizer = sentencepiece.SentencePieceProcessor(model_file=hparams['tokenizer_model'])
    tokens_list = tokenizer.encode_as_ids(words)
    yield tokens_list
    tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
    yield tokens_bos
    tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
    yield tokens_eos
    tokens = torch.LongTensor(tokens_list)
    yield tokens
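The division that fails in the question's traceback (tensor.shape[j] / target_shape[j]) can be illustrated without SpeechBrain. The sketch below is a simplified stand-in, not the library's code: valid_fractions is a hypothetical helper that divides each sequence length by the longest length in the batch, which is roughly what right-padding does when it records how much of each padded item is real data.

```python
def valid_fractions(lengths):
    """Fraction of each padded sequence that is real data (simplified stand-in)."""
    # The padded length is the longest sequence in the batch.
    target = max(lengths)
    return [n / target for n in lengths]


# A normal batch: two token lists of length 3 and 5.
print(valid_fractions([3, 5]))

# A batch in which every token list is empty: the padded length is 0.
try:
    valid_fractions([0, 0])
except ZeroDivisionError as err:
    print("ZeroDivisionError:", err)
```

With batch_size=2, as in the question's command line, two empty token lists are enough to trigger the error on the first batch.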
In the YAML file, I pointed the tokenizer_model field to the .model file, like this:
tokenizer_model: path/to/1000_unigram.model
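Because the failure described above only surfaces later, in the padding code, a fail-fast check around the tokenizer call can make a misconfigured tokenizer easier to spot. This is a minimal sketch under stated assumptions: encode_checked is a hypothetical helper, and the two tokenizer classes are made-up stand-ins that only share the encode_as_ids method name with SentencePiece, so the example runs without a trained model.

```python
class UnloadedTokenizer:
    """Hypothetical stand-in: mimics a tokenizer that returns no ids."""

    def encode_as_ids(self, text):
        return []


class WordLengthTokenizer:
    """Hypothetical stand-in: one made-up id per whitespace-separated word."""

    def encode_as_ids(self, text):
        return [len(word) for word in text.split()]


def encode_checked(tokenizer, text):
    """Encode text and fail fast if the tokenizer produced nothing."""
    tokens_list = tokenizer.encode_as_ids(text)
    if not tokens_list:
        raise ValueError(
            f"tokenizer returned no tokens for {text!r}; "
            "check that the trained .model file was loaded"
        )
    return tokens_list


print(encode_checked(WordLengthTokenizer(), "Turn the lights on"))
try:
    encode_checked(UnloadedTokenizer(), "Turn the lights on")
except ValueError as err:
    print("caught:", err)
```

Inside text_pipeline, the same check could wrap the encode_as_ids call so that the error names the offending transcription instead of ending in a ZeroDivisionError.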
You may also need to change the tokenizer used in the decoding stage of the compute_objectives function. You also need to keep the following lines:
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])
These lines collect your language model's .ckpt file (creating a symbolic link to it) and load it. However, you do not need to load pretrained models for the ASR model and the tokenizer, so change the following block:
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
        model: !ref <model>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
        model: !ref <pretrained_path>/asr.ckpt
to:
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
    paths:
        lm: path/to/language/model.ckpt
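However, if you want to use a pretrained ASR model and/or tokenizer, you can change this block accordingly.
If you have not solved the problem yet, I hope this helps.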