TensorFlow Dataset - How to play/convert WAV files (int64)?

Problem description

I want to test the following dataset: https://www.tensorflow.org/datasets/catalog/speech_commands

When I load and play the audio, I only get what sounds like random noise.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds
import IPython.display as ipd


ds, ds_info = tfds.load('speech_commands', shuffle_files=False, with_info=True)
ds_info
tfds.core.DatasetInfo(
    name='speech_commands',
    full_name='speech_commands/0.0.2',
    description="""
    An audio dataset of spoken words designed to help train and evaluate keyword
    spotting systems. Its primary goal is to provide a way to build and test small
    models that detect when a single word is spoken, from a set of ten target words,
    with as few false positives as possible from background noise or unrelated
    speech. Note that in the train and validation set, the label "unknown" is much
    more prevalent than the labels of the target words or background noise.
    One difference from the release version is the handling of silent segments.
    While in the test set the silence segments are regular 1 second files, in the
    training they are provided as long segments under "background_noise" folder.
    Here we split these background noise into 1 second clips, and also keep one of
    the files for the validation set.
    """,
    homepage='https://arxiv.org/abs/1804.03209',
    data_path='C:\\Users\\abc\\tensorflow_datasets\\speech_commands\\0.0.2',
    download_size=2.37 GiB,
    dataset_size=9.07 GiB,
    features=FeaturesDict({
        'audio': Audio(shape=(None,), dtype=tf.int64),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=12),
    }),
    supervised_keys=('audio', 'label'),
    splits={
        'test': <SplitInfo num_examples=4890, num_shards=4>,
        'train': <SplitInfo num_examples=106497, num_shards=128>,
        'validation': <SplitInfo num_examples=121, num_shards=1>,
    },
    citation="""@article{speechcommandsv2,
       author = {{Warden}, P.},
        title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
      journal = {ArXiv e-prints},
      archivePrefix = "arXiv",
      eprint = {1804.03209},
      primaryClass = "cs.CL",
      keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
        year = 2018,
        month = apr,
        url = {https://arxiv.org/abs/1804.03209},
    }""",
)

The audio files are int64 arrays with a sample rate of 16000. I could not find any information on how to play the files in this dataset. With other datasets I was able to play WAV sounds. One difference is that those datasets use float arrays, while this one uses int arrays. Maybe I am missing a conversion step?

ds_list = list(ds['validation'])

idx = -1
audio, label = ds_list[idx]['audio'], ds_list[idx]['label']
ipd.Audio(audio, rate=16_000)

I have obviously tried several indices in the dataset, but I always get noise. A single audio entry looks like this: tf.Tensor([ -112 1285 -2002 ... -140 1000 -595], shape=(16000,), dtype=int64)

Thanks :)

Tags: python, tensorflow, audio

Solution


According to the source code, the description page [1] states:

Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech.

At first, I could only play a noisy wav file, just as you have shown. I then modified my code based on [2] to produce a cleaner sound.

I use the following code to convert the tensor to WAV format.

import numpy as np
import scipy.io.wavfile as wavfile
import tensorflow as tf
import tensorflow_datasets as tfds

# load the speech_commands dataset (train/validation/test splits)
ds = tfds.load('speech_commands', split=['train', 'validation', 'test'],
               shuffle_files=True)

# convert from the tfds format to plain Python lists
ds_train = list(ds[0])
ds_val = list(ds[1])

# convert from int64 tensors to numpy float32 in roughly [-1, 1]
sc1 = ds_train[0]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max
sv1 = ds_val[0]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max

# save as WAV files
wavfile.write('sc_train_1.wav', 16000, sc1)
wavfile.write('sc_val_1.wav', 16000, sv1)
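
As a quick check (not part of the original answer), the written files can be read back with scipy.io.wavfile.read and played in a Jupyter notebook; the file names below simply follow the ones used above.

# sketch: read one of the saved WAV files back and play it in the notebook
import IPython.display as ipd
import scipy.io.wavfile as wavfile

rate, data = wavfile.read('sc_train_1.wav')   # rate should be 16000, data is float32
ipd.Audio(data, rate=rate)                    # renders an audio player in the notebook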

The trick is to convert the int64 values to float32 and divide by the maximum value of np.int16: .astype(np.float32) / np.iinfo(np.int16).max

Now I can hear a much cleaner sound than with the previous int64 format.
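
For completeness, here is a minimal sketch of the same idea applied directly to the question's snippet, playing the normalized float array in the notebook without writing a WAV file first. It assumes the ds_list and idx variables from the question are still in scope.

import numpy as np
import IPython.display as ipd

# normalize the int64 samples to float32 in roughly [-1, 1] before playback
audio = ds_list[idx]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max
ipd.Audio(audio, rate=16_000)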

[1] https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/audio/speech_commands.py
[2] https://github.com/google-research/google-research/blob/master/non_semantic_speech_benchmark/train_and_eval_sklearn_small_tfds_dataset.ipynb
