Speech Recognition Method Based on Multi-Task Loss with Additional Language Model


DOI:10.3969/j.issn.1671-7775.2023.05.010
Open Science (Resource Service) Identifier Code (OSID)
Citation Format: Liu Yongli, Zhang Shaoyang, Wang Yuheng, et al. Speech Recognition Method Based on Multi-Task Loss with Additional Language Model[J]. Journal of Jiangsu University (Natural Science Edition), 2023, 44(5):564-569.
Fund Project: Shaanxi Provincial Key Industry Innovation Chain (Group) Project (2021ZDLGY07-06)
Author Introduction:
Liu Yongli (1997—), Male, from Pingliang, Gansu, Master’s Student ([email protected]), mainly engaged in research on artificial intelligence and big data.
Zhang Shaoyang (1971—), Male, from Xiangfen, Shanxi, Professor ([email protected]), mainly engaged in research on intelligent transportation and big data.

Speech Recognition Method Based on Multi-Task Loss with Additional Language Model

Liu Yongli1, Zhang Shaoyang1, Wang Yuheng1, Xie Yi2
(1. School of Information Engineering, Chang’an University, Xi’an, Shaanxi 710064, China; 2. Operation Management Branch of Shaanxi Transportation Holding Group Co., Ltd., Xi’an, Shaanxi 710065, China)
Abstract: To address the problems that the overly flexible alignment of Attention adapts poorly to complex environments and that simple end-to-end models make insufficient use of language features, a speech recognition method based on multi-task loss with an additional language model is studied. By analyzing the characteristics of speech signals, features containing more information are selected for training. On the basis of the Attention-based Conformer end-to-end model, the model is trained with a multi-task loss in which CTC loss assists the pure Conformer (Attention) model, yielding the Conformer-CTC speech recognition model. On the basis of the Conformer-CTC model, by analyzing and comparing the characteristics and performance of several language models, the Transformer language model is added to the training of the above model through a re-scoring mechanism, finally yielding the Conformer-CTC-Transformer speech recognition model. Experiments on the AISHELL-1 dataset show that the Conformer-CTC model reduces the character error rate (CER) on the test set by 0.49% compared with the pure Conformer (Attention) model, and the Conformer-CTC-Transformer model reduces the CER on the test set by 0.79% compared with the pure Conformer (Attention) model. CTC loss can improve the adaptability of Attention alignment in complex environments, and adding the Transformer language model through re-scoring further improves recognition accuracy by 0.30%. Compared with existing end-to-end models, the Conformer-CTC-Transformer model achieves better recognition performance, indicating its effectiveness.
Keywords:Speech recognition; deep learning; language model; multi-task loss; Conformer; Transformer; CTC
Speech recognition method based on multi-task loss with additional language model
LIU Yongli1, ZHANG Shaoyang1, WANG Yuheng1, XIE Yi2
(1. School of Information Engineering, Chang′an University, Xi′an, Shaanxi 710064, China; 2. Operation Management Branch of Shaanxi Transportation Holding Group Co., Ltd., Xi′an, Shaanxi 710065, China)
Abstract: To solve the problems that the overly flexible alignment of Attention was poorly adaptable in complex environments and that language features were not fully utilized by simple end-to-end models, a speech recognition method based on multi-task loss with an additional language model was investigated. By analyzing the characteristics of the speech signal, the features containing more information were selected for training. Based on the Attention-based Conformer end-to-end model, the model was trained with a multi-task loss in which CTC loss assisted the pure Conformer (Attention) model, and the Conformer-CTC speech recognition model was obtained. Based on the Conformer-CTC model, by analyzing and comparing the characteristics and effects of several language models, the Transformer language model was added to the training of the above model through a re-scoring mechanism, and the Conformer-CTC-Transformer speech recognition model was obtained. Experiments on these models were completed on the AISHELL-1 data set. The results show that compared with the pure Conformer (Attention) model, the character error rate (CER) of the Conformer-CTC model on the test set is reduced by 0.49%, and the CER of the Conformer-CTC-Transformer model on the test set is reduced by 0.79% compared with the pure Conformer (Attention) model. The adaptability of Attention alignment in complex environments can be improved by CTC loss, and after re-scoring the Conformer-CTC model with the Transformer language model, the recognition accuracy can be increased by a further 0.30%. Compared with some existing end-to-end models, the recognition effect of the Conformer-CTC-Transformer model is better, indicating that the model is effective.
Key words: speech recognition; deep learning; language model; multi-task loss; Conformer; Transformer; CTC
CLC Number: TP391.9
Document Code: A
Article Number: 1671-7775(2023)05-0564-06
Received Date: 2022-07-13
Speech recognition technology is an important branch of speech signal processing and involves many disciplines, including acoustics, phonetics, and computer science, making it an interdisciplinary technology. It studies how to convert speech signals into text: through computer technology, machines automatically convert an input speech signal into the corresponding text output. Since the field emerged in the 1950s, several representative approaches have appeared, such as methods based on acoustics and phonetics, on pattern matching, and on deep learning. Pattern-matching methods are currently the most mature, and among them the DNN-HMM model based on hidden Markov models performs well. Deep-learning-based methods are currently popular and adapt well to complex environments. Deep learning uses deep graphs with multiple processing layers to build high-level abstractions of data, and its strong adaptability and ability to handle complex environments have driven its rapid development. In recent years, automatic speech recognition (ASR) has gradually shifted from hybrid models based on deep neural networks to end-to-end models.
End-to-end models mainly fall into two types: connectionist temporal classification (CTC)-based models and attention-based encoder-decoder models. Their training process is simple, outputting text directly without separately trained acoustic and language models, and their single-network structure is more compact than that of traditional hybrid models; nevertheless, as relatively recent methods they still have shortcomings. First, because end-to-end models do not use a language model, they fail to fully exploit language features, which limits recognition accuracy. Second, the two main implementations each have weaknesses: CTC-based models cannot effectively model the dependencies between words in the label sequence, while Attention initially attends too broadly, making training difficult to converge, and places no restriction on the alignment between frames and labels, resulting in many ineffective computations. Researchers have therefore combined Attention and CTC: literature [5] uses a hybrid CTC/Attention architecture for accented Mandarin recognition, and literature [6] combines Transformer with CTC for end-to-end speech recognition.
To address these issues, and considering that CTC can directly optimize the likelihood of the target output sequence given the input sequence while literature [7] shows that the Conformer end-to-end model outperforms the Transformer end-to-end model, this paper takes the Attention-based Conformer end-to-end model as the basis and uses CTC loss to assist its training, obtaining the Conformer-CTC speech recognition model. Furthermore, by analyzing and comparing the characteristics and performance of several language models, the Transformer language model is added to the training of the above model through a re-scoring mechanism, finally yielding the Conformer-CTC-Transformer speech recognition model. Finally, tests are conducted on the open-source 178-hour dataset (AISHELL-1) and compared with some existing end-to-end models.

1 Model Structure

The end-to-end speech recognition model consists of a single neural network and is trained by directly optimizing a loss function on the target objective, which improves training efficiency. The main implementations include CTC-based end-to-end models and Attention-based encoder-decoder models. Experimental data in literature [8] indicate that the character error rate (CER) of the Attention-based encoder-decoder model is lower than that of CTC-based end-to-end models; in those experiments, the Attention-based encoder-decoder model reduced the word error rate (WER) from 8.3% to 7.4%. These results show that the Attention-based encoder-decoder speech recognition model outperforms the CTC-based end-to-end model. Currently, well-performing Attention-based encoder-decoder models include Transformer and Conformer [7], and literature [7] shows that the Conformer model performs notably better than the Transformer model on LibriSpeech.
Based on the above analysis, the basic network structure of the speech recognition model in this paper selects Conformer, using CTC for assistance during training, and adding the Transformer language model for re-scoring during model decoding. The final structure of the ASR model is shown in Figure 1.
Figure 1 ASR Model Structure

2 Determination of Each Module Structure in the Model

2.1 Speech Signal Features and Language Models

In the speech processing process, commonly used speech signal features include spectrogram features, Mel-frequency cepstral coefficients, and FBank features. Since this paper adopts neural network modeling, the features suitable for neural network modeling among speech signal features are FBank features and spectrogram features, as shown in Figure 2.
Figure 2 FBank Feature Map and Spectrogram Feature Map
In the extraction of speech signal features, and in contrast to FBank extraction, spectrogram extraction does not pass the signal through a Mel filter bank, so the spectrogram covers the full spectral content of the speech signal. The spectrogram therefore retains more of the original information than FBank features, and its extraction process is simpler, although FBank features are currently the most widely used. To compare the two further, this paper extracts both types of features and trains the Conformer end-to-end ASR model based on a Transformer encoder under identical conditions, in order to select the better-performing feature for subsequent experiments.
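As a rough illustration of the difference between the two features, the following sketch extracts both from the same utterance with librosa; the file path, frame length, hop length, and number of Mel bands are illustrative values, not parameters taken from this paper.

```python
import numpy as np
import librosa

# Illustrative example: file name and frame parameters are assumptions.
wav, sr = librosa.load("example.wav", sr=16000)

# Spectrogram (log power spectrum): STFT magnitude with no Mel filter bank,
# so the full linear-frequency resolution is kept.
stft = librosa.stft(wav, n_fft=400, hop_length=160, win_length=400)  # 25 ms / 10 ms
spectrogram = librosa.power_to_db(np.abs(stft) ** 2)

# FBank (log Mel filter-bank energies): the same power spectrum passed through
# a Mel filter bank, giving a lower-dimensional, perceptually warped feature.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
fbank = librosa.power_to_db(mel)

print(spectrogram.shape, fbank.shape)  # e.g. (201, T) vs (80, T)
```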
The test results of the experimental models on the AISHELL-1 dataset are shown in Table 1.
Table 1 Test results of models trained with different speech signal features (unit: %)
As seen in Table 1, the model trained with spectrogram features extracted from the dataset has a lower CER than the model trained with FBank features, indicating relatively better recognition performance in the experiments.
The language model represents the relationships between words and is built from text. Its main types are rule-based language models, statistical language models, and neural network language models, of which neural network language models are the most commonly used. Among neural network language models, Transformer and long short-term memory (LSTM) networks are applied most frequently. Transformer supports parallel computation and achieves better performance, but is weaker than LSTM at establishing dependency relationships; LSTM handles temporal dependencies well, but its high memory consumption limits its use in resource-constrained environments (e.g., portable devices). To select the better model for subsequent experiments, both language models are trained for the same number of iterations under identical conditions and tested on the AISHELL-1 dataset, with the results shown in Table 2.
Table 2 Test results of different neural network language models
As seen in Table 2, the Transformer language model trained during the experiment has a lower Loss value and a smaller perplexity on the AISHELL-1 dataset compared to the LSTM language model, resulting in better model performance.
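For reference, the two metrics reported in Table 2 are directly related: if L is the average cross-entropy loss per token (in nats), the perplexity is PPL = exp(L), so a lower loss implies a smaller perplexity. This relation is the standard definition and is stated here only for clarity; it is not a result of the paper.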

2.2 Conformer Network and Multi-Task Loss

The Conformer model was proposed by Google in 2020 and evolved from the Transformer and CNN. The self-attention layer in the Transformer network excels at extracting global dependency information from sequences, while CNN effectively captures local feature information. Combining the advantages of both, the Conformer enhances the performance of the Transformer in speech recognition through convolution, efficiently modeling both the local and the global features of audio sequences. The decoder of the original Conformer network is a single-layer LSTM, while its encoder is an improved version of the Transformer encoder, as shown in Figure 3.
Figure 3 Conformer Encoder Network Structure
As seen in Figure 3, the Conformer encoder adds a convolution layer for extracting local features and splits the feed-forward network layer into two half-step layers. The core of the Conformer network is its encoder, which processes an input vector xi as follows:
xi′ = xi + ½FFN(xi),  (1)
xi″ = xi′ + MHSA(xi′),  (2)
xi‴ = xi″ + Conv(xi″),  (3)
yi = Layernorm(xi‴ + ½FFN(xi‴)),  (4)
In these equations, FFN(·) denotes processing by the feed-forward network layer; MHSA(·) denotes processing by the multi-head self-attention layer; Conv(·) denotes processing by the convolution layer; and Layernorm(·) denotes processing by the layer normalization layer.
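As an informal illustration of Eqs. (1)-(4), the following PyTorch sketch implements one Conformer encoder block with the residual half-step FFNs, self-attention, and convolution in that order; the layer sizes are illustrative, and the convolution module is simplified to a single depth-wise convolution rather than the full gated convolution module of the original design.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One encoder block following Eqs. (1)-(4); sizes are illustrative."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048, kernel=15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, d_ff), nn.SiLU(),
                                  nn.Linear(d_ff, d_model))
        self.norm_mhsa = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.dwconv = nn.Conv1d(d_model, d_model, kernel,
                                padding=kernel // 2, groups=d_model)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, d_ff), nn.SiLU(),
                                  nn.Linear(d_ff, d_model))
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)               # Eq. (1): half-step FFN
        h = self.norm_mhsa(x)
        x = x + self.mhsa(h, h, h)[0]            # Eq. (2): multi-head self-attention
        c = self.norm_conv(x).transpose(1, 2)    # to (batch, d_model, time) for Conv1d
        x = x + self.dwconv(c).transpose(1, 2)   # Eq. (3): depth-wise convolution
        x = x + 0.5 * self.ffn2(x)               # Eq. (4): second half-step FFN,
        return self.norm_out(x)                  # followed by layer normalization

# Example: a batch of 8 sequences, 100 frames, 256-dim features.
block = ConformerBlock()
out = block(torch.randn(8, 100, 256))            # -> (8, 100, 256)
```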
Since the decoder of the original Conformer network is a single-layer LSTM, which limits overall performance, a stronger Transformer decoder is used in its place. The Transformer decoder improves decoding accuracy through the attention mechanism, but it converges slowly and places no restriction on the alignment between frames and labels, resulting in many ineffective computations. CTC can effectively solve the alignment problem: it does not require pre-segmenting the input sequence and aligns the output sequence with the input sequence directly over time, so that speech frames roughly correspond to their text labels. Thus, by changing the network structure, fusion decoding can be achieved, allowing CTC to assist the Attention mechanism in the decoding task. This not only lets the hybrid model exploit label dependencies but also accelerates the convergence of the attention decoder. When the Attention and CTC losses are combined into a unified loss value, errors can be computed from this unified loss, enabling simultaneous training of the CTC loss module, the Attention decoder module, and the Conformer encoder module. Literature [5] proposes a hybrid loss function; a similar hybrid loss function can be constructed here:
Lhyb(yn|y1:n-1)=λlog PCTC(y|x)+(1-λ)log PAttention(y|x,h1:T),
(5)
In this equation, the first term on the right-hand side is the CTC loss and the second term is the Attention loss; λ is the interpolation weight, with λ∈[0,1].
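A minimal PyTorch sketch of this multi-task loss is given below, assuming the CTC branch outputs frame-level log-probabilities and the Transformer decoder outputs token-level logits; all tensor names and padding conventions are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn as nn

# Sketch of the hybrid loss in Eq. (5); lam = 0.3 matches the CTC weight
# reported in Section 3.1, the rest is an assumed setup.
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padded positions

def hybrid_loss(ctc_log_probs, input_lens, ctc_targets, target_lens,
                att_logits, att_targets, lam=0.3):
    # CTC branch: ctc_log_probs has shape (T, batch, vocab) over encoder frames.
    l_ctc = ctc_criterion(ctc_log_probs, ctc_targets, input_lens, target_lens)
    # Attention branch: att_logits has shape (batch, U, vocab); CrossEntropyLoss
    # expects the class dimension second, hence the transpose.
    l_att = att_criterion(att_logits.transpose(1, 2), att_targets)
    # Eq. (5): weighted combination of the two negative log-likelihoods.
    return lam * l_ctc + (1.0 - lam) * l_att
```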

2.3 Re-scoring Mechanism of the Additional Language Model

Although the end-to-end ASR model converts speech signals directly into the corresponding text, it does not use a language model containing textual information, and thus neglects semantic information. Moreover, because Chinese contains many homophones and the amount of training corpus is limited, the final recognition performance suffers. The language model, as a prior in speech recognition, can effectively judge whether a hypothesis matches semantic and grammatical conventions, and its training transcripts are easy to obtain, so its training cost is low. Therefore, incorporating a language model for re-scoring in the end-to-end model can effectively improve recognition accuracy; this is typically achieved through shallow fusion during decoding, with the fusion formula as follows:
score(yn|y1:n-1,h1:T)=Lhyb(yn|y1:n-1,h1:T) + αlog plm(yn|y1:n-1),
(6)
In this equation, α is the weight coefficient, with α∈[0,1]; Lhyb(yn|y1:n-1,h1:T) is the joint decoding loss when the history sequence is y1:n-1 and the decoder output is h1:T; plm(yn|y1:n-1) is the probability the language model assigns to yn given the history y1 to yn-1.
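The re-scoring step of Eq. (6) can be sketched as follows: each candidate produced by joint CTC/Attention beam-search decoding has its score interpolated with an external language model score before the best hypothesis is chosen. The candidate list format, the lm.score interface, and the value of alpha are assumptions for illustration, not the paper's actual API.

```python
import torch

def rescore_with_lm(candidates, lm, alpha=0.3):
    """candidates: list of (token_ids, hybrid_score) pairs from joint decoding.
    lm.score(token_ids) is assumed to return the summed log-probability
    sum_n log p_lm(y_n | y_1:n-1) of the hypothesis under the language model."""
    best_tokens, best_score = None, float("-inf")
    for tokens, hybrid_score in candidates:
        with torch.no_grad():
            fused = hybrid_score + alpha * lm.score(tokens)  # Eq. (6)
        if fused > best_score:
            best_tokens, best_score = tokens, fused
    return best_tokens, best_score
```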

3 Experimental Results and Analysis

3.1 Experimental Data and Basic Configuration

The dataset used in the experiment is the open-source 178-hour dataset (AISHELL-1), recorded by 400 speakers with accents from different regions of China, covering 11 fields of social life with high speech quality.
To obtain better corpus data and improve training, the experiment augments the data by perturbing the audio speed; data augmentation here refers to operations applied to the dataset before model training. The features extracted from the speech signals are spectrogram features, which are used to train the model. The Conformer encoder has 12 layers with a feed-forward dimension of 2048 per layer; the attention module uses 4 heads with a dimension of 256 per layer. The CTC weight in the hybrid decoding is set to 0.3 to assist training. The additional language model is the Transformer language model, which is used for re-scoring to obtain better results.
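A small sketch of the speed-based augmentation is shown below, using torchaudio's sox effects; the perturbation factors 0.9, 1.0, and 1.1 are the common choice for this technique and are not necessarily the exact values used in the paper, and the file path is illustrative.

```python
import torchaudio

def speed_perturb(wav, sr, factor):
    # "speed" changes playback speed; "rate" resamples back to the original
    # sample rate so all augmented copies share the same sampling rate.
    effects = [["speed", f"{factor}"], ["rate", f"{sr}"]]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(wav, sr, effects)
    return out

wav, sr = torchaudio.load("example.wav")  # illustrative file name
augmented = [speed_perturb(wav, sr, f) for f in (0.9, 1.0, 1.1)]
```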

3.2 Experimental Analysis

This paper sequentially verifies the performance of the Conformer-CTC model, Conformer-CTC-LSTM model, and Conformer-CTC-Transformer model on the AISHELL-1 dataset, comparing their performance with mainstream end-to-end models in recent years. The comparison curve of the CER for each ASR model is shown in Figure 4, and the experimental result data is shown in Table 3.
Table 3 Experimental results of different ASR models on the AISHELL-1 dataset (unit: %)
Figure 4 Comparison Curve of CER for Each ASR Model
As observed in Figure 4, the CER curves of the three models differ little during training, and all converge quickly. The validation CER curve of the Conformer-CTC-Transformer model decreases fastest over epochs 0 to 4, indicating that it converges more easily, and it maintains the lowest CER in the subsequent epochs.
The experimental data in Table 3 show that the Conformer-CTC model, trained with CTC loss assistance, reduces the CER on the test set by 0.49% compared with the pure Conformer (Attention) end-to-end model. Re-scoring with the additional language model yields the best results: the Conformer-CTC-Transformer model achieves a lower CER than the Conformer-CTC-LSTM model, reduces the CER by 0.30% compared with the Conformer-CTC model, and reduces the CER by 0.79% compared with the pure Conformer (Attention) end-to-end model on the test set. These results indicate that CTC loss can to some extent remedy the deficiencies of the pure Conformer (Attention) end-to-end model during training, and that re-scoring with the additional language model further improves recognition accuracy.
The data in Table 3 shows that compared to other end-to-end models, the Conformer-CTC-Transformer model further reduces the character error rate, clearly demonstrating its effectiveness.

4 Conclusion

1) In the tests on the AISHELL-1 dataset, the Conformer-CTC model trained with CTC loss assistance reduces the character error rate by 0.49% compared with the pure Conformer (Attention) end-to-end model, improving recognition accuracy. The results indicate that, when CTC loss assists the training of the Attention-based encoder-decoder model, it alleviates to some extent the slow convergence and the lack of constraints on the frame-label alignment of the pure Conformer (Attention) end-to-end model.
2) Re-scoring with the additional language model on top of the Conformer-CTC model further improves recognition accuracy and reduces the CER on the AISHELL-1 dataset. The Conformer-CTC-Transformer model achieves a character error rate of 4.90% on the test set, the best result in the experiments.
3) Comparison with other end-to-end models indicates that training the pure Conformer (Attention) end-to-end model with CTC loss assistance and then re-scoring with the additional language model gives the final Conformer-CTC-Transformer model higher recognition accuracy, demonstrating its effectiveness.
References
[1] Dokuz Y, Tufekci Z. Mini-batch sample selection strategies for deep learning based speech recognition [J]. Applied Acoustics, DOI: 10.1016/j.apacoust.2020.107573.
[2] Yu Kun, Zhang Shaoyang, Hou Jiazhen, et al. Current Status and Prospects of Speech Recognition and End-to-End Technology [J]. Computer Systems Applications, 2021, 30(3): 14-23.
[3] Deng Huizhen. Speech Recognition Based on Local Self-Attention CTC [D]. Harbin: Heilongjiang University, 2021.
[4] Das A, Li J Y, Zhao R, et al. Advancing connectionist temporal classification with attention modeling [C]∥Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 4769-4773.
[5] Yang Wei, Hu Yan. Hybrid CTC/Attention Architecture for End-to-End Multi-Accent Mandarin Speech Recognition [J]. Application Research of Computers, 2021, 38(3): 755-759.
[6] Xie X K, Chen G, Sun J, et al. TCN-Transformer-CTC for End-to-End Speech Recognition [J]. Application Research of Computers, 2022, 39(3): 699-703.
[7] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-Augmented Transformer for Speech Recognition [C]∥Proceedings of the Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2020: 5036-5040.
[8] Bahdanau D, Chorowski J, Serdyuk D, et al. End-to-End Attention-Based Large Vocabulary Speech Recognition [C]∥Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4945-4949.
[9] Biadsy F, Weiss R J, Moreno P J, et al. Parrotron: An End-to-End Speech-to-Speech Conversion Model and Its Applications to Hearing-Impaired Speech and Speech Separation [C]∥Proceedings of the Annual Conference of the International Speech Communication Association. Lous Tourils, Baixas, France: International Speech Communication Association, 2019: 4115-4119.
[10] Ma R, Liu Q, Yu K. Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training [C]∥Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 62-69.
[11] Ge Y Z, Xu X, Yang S R, et al. Survey on Sequence Data Augmentation [J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1207-1219.
[12] Yao Z Y, Wu D, Wang X, et al. WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit [C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 2093-2097.
[13] Zhu X C, Zhang F, Gao L, et al. Research on Speech Recognition Based on Residual Network and Gated Convolution Network [J]. Computer Engineering and Applications, 2022, 58(7): 185-191.
[14] Liang C D, Xu M L, Zhang X L. Transformer-Based End-to-End Speech Recognition with Residual Gaussian-Based Self-Attention [C]∥Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: International Speech Communication Association, 2021: 1495-1499.
(Editor-in-Chief: Liang Jiafeng)