Audio Augmentation in TensorFlow and PyTorch

Source: Deephub Imba


This article is approximately 2,100 words long; estimated reading time is 9 minutes.
It introduces two methods for applying augmentation to audio datasets in TensorFlow, as well as torchaudio's built-in equivalents for PyTorch.


For image-related tasks, common data augmentation methods include rotating, blurring, or resizing images. The inherent properties of images make augmentation especially intuitive compared to other data types: we can see at a glance how a specific image has been transformed and judge the effect with the naked eye. Although augmentation is most common in the image domain, it can be applied in other fields as well. This article introduces data augmentation methods for the audio domain.
In this article, we will describe two methods of applying augmentation to datasets in TensorFlow. The first method directly modifies the data; the second method does this during the forward propagation of the network. Additionally, we will introduce how to achieve the same functionality using the built-in methods of torchaudio.

Direct Audio Augmentation

First, we generate an artificial audio dataset. Rather than loading a pre-existing dataset, we replicate one of librosa's bundled example clips as many times as needed:
import librosa
import tensorflow as tf


def build_artificial_dataset(num_samples: int) -> tf.data.Dataset:
    data = []
    sampling_rates = []

    # Replicate the same bundled librosa example clip num_samples times.
    for _ in range(num_samples):
        y, sr = librosa.load(librosa.ex('nutcracker'))
        data.append(y)
        sampling_rates.append(sr)

    # Pair each waveform with its sampling rate.
    features_dataset = tf.data.Dataset.from_tensor_slices(data)
    labels_dataset = tf.data.Dataset.from_tensor_slices(sampling_rates)
    dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))

    return dataset

ds = build_artificial_dataset(10)
This builds a tf.data.Dataset object; plain NumPy arrays would work just as well, so choose based on your actual needs.
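For example, a minimal sketch of the NumPy-only route, assuming the data and sampling_rates lists from inside build_artificial_dataset:

import numpy as np

# Sketch: keep the samples as NumPy arrays instead of a tf.data.Dataset.
# np.stack requires equal-length clips, which holds here because every
# sample is the same librosa example clip.
data_array = np.stack(data)            # shape: (num_samples, num_data_points)
sr_array = np.asarray(sampling_rates)  # shape: (num_samples,)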
Now that the small dataset is ready, we can start applying augmentation. For simplicity, this article uses the audiomentations library, and we will only use three transforms: PitchShift, Shift, and AddGaussianNoise. The first shifts the pitch (PitchShift); the second shifts the data in time (Shift), which can be thought of as rolling the audio; for example, a dog's bark might be moved five seconds later. The last transform makes the signal noisier, increasing the challenge for the neural network. Next, we combine all three augmentation functions into a pipeline:
from audiomentations import Compose, AddGaussianNoise, PitchShift, Shift

# Each transform is applied with probability p=0.5, so every pass through
# the pipeline can produce a different combination of augmentations.
augmentations_pipeline = Compose(
    [
        AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
        PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
        Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
    ]
)
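Outside of tf.data, the composed pipeline can also be applied directly to a raw NumPy waveform, which is handy for spot-checking the transforms. A minimal sketch:

# Sketch: augment one raw waveform directly.
y, sr = librosa.load(librosa.ex('nutcracker'))
augmented = augmentations_pipeline(samples=y, sample_rate=sr)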
Before feeding in the data, some glue code must be written. Because we are working with a Dataset object, this code tells TensorFlow to temporarily convert the tensors into NumPy arrays before handing them to the augmentation pipeline:
def apply_pipeline(y, sr):
    shifted = augmentations_pipeline(samples=y, sample_rate=sr)
    return shifted


@tf.function
def tf_apply_pipeline(feature, sr):
    """Applies the augmentation pipeline to audio files.
    @param feature: audio data
    @param sr: sampling rate
    @return: augmented audio data
    """
    # tf.numpy_function temporarily converts the tensors to NumPy arrays,
    # runs apply_pipeline eagerly, and wraps the result back into a tensor.
    augmented_feature = tf.numpy_function(
        apply_pipeline, inp=[feature, sr], Tout=tf.float32, name="apply_pipeline"
    )

    return augmented_feature, sr


def augment_audio_dataset(dataset: tf.data.Dataset):
    return dataset.map(tf_apply_pipeline)
With these helper functions, we can augment our dataset. Finally, we add a dimension at the end to convert each audio sample from (num_data_points,) to (num_data_points, 1), indicating that we have mono audio:
ds = augment_audio_dataset(ds)
ds = ds.map(lambda y, sr: (tf.expand_dims(y, axis=-1), sr))
This completes the direct audio data augmentation.
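As a quick sanity check (a sketch, using the ds built above), we can pull one example and confirm the shape:

# Sketch: inspect one augmented example.
for augmented, sr in ds.take(1):
    print(augmented.shape)  # (num_data_points, 1)
    print(sr.numpy())       # 22050, librosa's default sampling rate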

Audio Augmentation During Forward Propagation

Compared to the previous method, augmenting the audio inside the network shifts the computational load to the forward pass.
To achieve this, we use the kapre library, which provides custom TensorFlow layers. Here we use its Mel spectrogram layer, which accepts raw (i.e., unmodified) audio data and computes the Mel spectrogram on the GPU.
Although this is not directly related to data augmentation, it has two benefits:
  • We can optimize the parameters for generating spectrograms during hyperparameter searches without having to repeatedly convert audio into spectrograms.

  • The transformations run directly on the GPU, so they are faster both in raw speed and because the data already resides in device memory.

First, we load the audio layers provided by the kapre library. These layers take raw audio data and compute the spectrogram representation:
import kapre

# Assumption: input_shape matches the mono samples built earlier,
# i.e. (num_data_points, 1); None allows variable-length clips.
input_shape = (None, 1)
input_layer = tf.keras.layers.Input(shape=input_shape, dtype=tf.float32)

melspectrogram = kapre.composed.get_melspectrogram_layer(
    n_fft=1024,
    return_decibel=True,
    n_mels=256,
    input_data_format='channels_last',
    output_data_format='channels_last')(input_layer)
Then, we add an augmentation layer from the spec-augment package, which implements the SpecAugment paper [1]. SpecAugment masks out blocks of the spectrogram along the time and frequency axes. Hiding parts of the input this way forces the network to rely on other features, improving its generalization to unseen data:
from spec_augment import SpecAugment

spec_augment = SpecAugment(freq_mask_param=27,   # F in the paper
                           time_mask_param=100,  # T in the paper
                           n_freq_mask=1,        # mF in the paper
                           n_time_mask=2,        # mT in the paper
                           mask_value=-1)(melspectrogram)
Finally, for our case, we attach an untrained residual network with an arbitrary ten output classes to classify the data:
spec_augment = tf.keras.applications.resnet_v2.preprocess_input(spec_augment)
core = tf.keras.applications.resnet_v2.ResNet152V2(
    input_tensor=spec_augment,
    include_top=False,
    pooling="avg",
    weights=None,  # untrained: random initialization
)
core = core.output

output = tf.keras.layers.Dense(units=10)(core)  # logits for 10 classes

resnet_model = tf.keras.Model(inputs=[input_layer], outputs=[output], name="audio_model")
Thus, we have a deep neural network that can augment audio data during forward propagation.
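As a minimal, purely illustrative sketch, the model could be compiled like this; note that the toy dataset built earlier carries sampling rates as labels, so you would zip in real class labels (0-9) before actually calling fit:

# Illustrative sketch only; assumes a properly labeled dataset.
resnet_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# resnet_model.fit(labeled_ds.batch(8), epochs=5)  # labeled_ds is hypothetical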

torchaudio

The methods described above are for TensorFlow; what about PyTorch? Here you can use the official torchaudio package directly.
torchaudio implements TimeStretch, TimeMasking, and FrequencyMasking in its torchaudio.transforms module (imported as T below). Let's look at the code provided by the official documentation.
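The official snippets assume import torchaudio.transforms as T plus the tutorial's get_spectrogram and plot_spectrogram helpers. A minimal sketch of what those helpers might look like (the file path and plotting details are assumptions):

import torch
import torchaudio
import torchaudio.transforms as T
import matplotlib.pyplot as plt

def get_spectrogram(path="speech.wav", n_fft=400, power=2.0):
    # power=None yields a complex spectrogram, which TimeStretch requires.
    waveform, _ = torchaudio.load(path)
    return T.Spectrogram(n_fft=n_fft, power=power)(waveform)

def plot_spectrogram(spec, title=None, aspect="auto", xmax=None):
    # Plot the log-scaled magnitude of a (freq, time) spectrogram slice.
    fig, ax = plt.subplots()
    ax.set_title(title or "Spectrogram")
    im = ax.imshow(10.0 * torch.log10(spec + 1e-10).numpy(),
                   origin="lower", aspect=aspect)
    if xmax is not None:
        ax.set_xlim((0, xmax))
    fig.colorbar(im, ax=ax)
    plt.show()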
TimeStretch:
spec = get_spectrogram(power=None)  # complex spectrogram, as TimeStretch requires
stretch = T.TimeStretch()

rate = 1.2
spec_ = stretch(spec, rate)
plot_spectrogram(spec_[0].abs(), title=f"Stretched x{rate}", aspect='equal', xmax=304)

plot_spectrogram(spec[0].abs(), title="Original", aspect='equal', xmax=304)

rate = 0.9
spec_ = stretch(spec, rate)
plot_spectrogram(spec_[0].abs(), title=f"Stretched x{rate}", aspect='equal', xmax=304)

[Figure: spectrograms before and after time stretching]

TimeMasking:
torch.random.manual_seed(4)

spec = get_spectrogram()
plot_spectrogram(spec[0], title="Original")

masking = T.TimeMasking(time_mask_param=80)
spec = masking(spec)

plot_spectrogram(spec[0], title="Masked along time axis")

[Figure: spectrogram masked along the time axis]

FrequencyMasking:

torch.random.manual_seed(4)

spec = get_spectrogram()
plot_spectrogram(spec[0], title="Original")

masking = T.FrequencyMasking(freq_mask_param=80)
spec = masking(spec)

plot_spectrogram(spec[0], title="Masked along frequency axis")

[Figure: spectrogram masked along the frequency axis]
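Since these transforms are ordinary torch.nn modules, they can also be chained into a single preprocessing pipeline for training. A minimal sketch:

import torch.nn as nn

# Sketch: compute the spectrogram and apply both masking transforms in one go.
train_transform = nn.Sequential(
    T.Spectrogram(n_fft=400),
    T.TimeMasking(time_mask_param=80),
    T.FrequencyMasking(freq_mask_param=80),
)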

Conclusion

In this blog post, we introduced audio augmentation methods for two mainstream deep learning frameworks. If you work with TensorFlow, you can try either of the two approaches shown here; if you prefer PyTorch, the official torchaudio package covers the same ground.

References

[1] D. S. Park et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," Proc. Interspeech 2019, 2019.

https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html

Editor: Wang Jing
