Transfer Learning for Speaker Verification

Speaker recognition is the task of recognizing a person’s identity from his or her voiceprint. It has been gaining much popularity among machine learning and speech researchers in today’s digital time. It has become more relevant in recent times as digital technology strives to ease the day-to-day affairs of human lives. From access control systems to remote authentication, voice biometry finds applications across enterprises and institutions.

With increased availability of data and powerful computing resources, researches have been drawn towards adopting deep learning techniques for addressing this problem. Deep learning techniques have, indeed, proven to be the best among all techniques available today. One such popular technique used for speaker verification involve extracting speaker embeddings from a TDNN (Time Delay Neural Network). These embeddings, popularly known as xvectors, are further modelled using a backend classifier like PLDA (Probabilistic Linear Discriminant Analysis) model to implement the system.

One common issue with the xvector model, or, for that matter, any such model, is that it works better if tested on data from the same domain as that of the training data. The performance tends to deteriorate for data from a different domain. Training the backend classifier with some in-domain data has been found to be useful in such scenario. The other solution is to transfer learn the model if enough in-domain data is available. And given that there are some good models available for download (free for research), the case for transfer learning becomes even stronger.

One trained xvector model is available at Kaldi-ASR. The details of the training data are available at the website. It can be easily downloaded and tested for any specific scenario. We can transfer learn this model to suit our data condition better using the kaldi toolkit. Here are the important steps involved in transfer-learning the xvector model:

Change the nnet3 config file to incorporate the changes in terms of the number of output classes/labels (speakers) in the training data.
Reinitialize and resize the output layer of the pre-trained model network so that it may be trained with the in-domain data.
Optionally, we may freeze the weights of some of the layers and train weights of particular layers only. This is a good idea if data proves insufficient to train the weights of all the layers.

The nnet3 config file inside “xvector_nnet_1a” folder of the downloaded directory may be referred to for creating a new config file that can help us to reinitialize and resize the final layer. We are required to create a config file that would look something like this:-

component name=output.affine type=NaturalGradientAffineComponent input-dim=512 output-dim=num_speakers param-stddev=0.0 bias-stddev=0.0 max-change=1.5 component-node name=output.affine component=output.affine input=tdnn7.batchnorm component name=output.log-softmax type=LogSoftmaxComponent dim=num_speakers component-node name=output.log-softmax component=output.log-softmax input=output.affine output-node name=output input=output.log-softmax objective=linear

First, we need to change the value for number of speakers in the config file (denoted as num_speakers in the above example) with the count of speakers in our training data. Then, “nnet3-copy” functionality may be used as given below to bring re-initialization and resizing into effect.

nnet3-copy –nnet-config=new_config exp/xvector_nnet_1a/final.raw exp/exp_fold/0.raw

Sometimes, it may be helpful to freeze the weights of some layers and retrain others. To freeze the weights of any of the layers, we may use the “–edits” option of “nnet-copy” to set the learning rate to 0. For example, to set the learning rate of layer 1 to 0, we may perform the following operation:

nnet3-copy –edits=”set-learning-rate-factor name=tdnn1.affine learning-rate=0” exp/xvector_nnet_1a/final.raw exp/exp_fold/0.raw

We can then run local/nnet3/xvector/tuning/run_xvector_1a.sh from stage 6, setting “train_stage” option to 0. Before that the egs data needs to be readied using the in-domain data running run_xvector_1a.sh (till stage < 6).

This should help us to transfer learn the xvector model with in-domain data. It may be, however, noted that most researchers suggest training the backend model like PLDA (Probabilistic Linear Discriminant Analysis) on xvectors with in-domain data, instead of transfer learning the xvector model itself. Nevertheless, if there is enough data available, it could be still helpful to transfer learn the TDNN network.