Multi-Dialect Voice Recognition Method for the Railway Sector

0 Introduction

As an important piece of national infrastructure, the railway has combined its intelligent customer service systems with cloud computing, big data, and artificial intelligence technologies, improving service efficiency and passenger experience. The railway industry has been exploring an intelligent upgrade of the 12306 customer service system since 2018, and the intelligent customer service system was fully deployed by the end of 2023. However, the intelligent voice navigation subsystem still has difficulty recognizing railway-specific terminology and dialects. To address these issues, this research proposes a multi-dialect voice recognition method that requires no switching between dialects and integrates railway domain knowledge. The method comprises a dialect recognition model based on the RepVGG network, an optimized Transformer speech recognition model, and an LSTM language model trained on a railway domain text corpus, and it aims to improve the performance of multi-dialect voice recognition systems in the railway sector.

1 Technical Route and Dataset Construction

This section introduces the technical route and dataset construction for multi-dialect voice recognition in the railway sector. The technical route builds on an intelligent knowledge base for railway passenger transport: railway domain knowledge is used to construct a unified modeling unit for the speech recognition model, which is then applied to the railway passenger transport domain through audio processing, feature extraction, and text output. The knowledge base consists of knowledge management and knowledge construction, which together form material knowledge, traditional knowledge, and intelligent knowledge. Two datasets are built from this knowledge base: a dialect voice dataset and a railway domain text dataset. The dialect voice dataset is constructed by filtering corpora, extracting special vocabulary, and building a sentence + slot corpus; its total duration is approximately 47,500 hours. The railway domain text dataset is constructed by filtering key data and calculating relevance with statistical methods; it contains about 180,000 entries covering services such as electronic tickets and ticket inquiries. To ensure reasonable coverage, the data are divided proportionally into training, validation, and test sets and then tokenized.
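The sentence + slot construction and the proportional split can be illustrated with a minimal Python sketch. The templates, slot values, the 8:1:1 ratio, and the function names `expand_templates` and `split_dataset` are all hypothetical; the summary does not state the actual templates or split proportions.

```python
import itertools
import random


def expand_templates(templates, slots):
    """Fill every {slot} placeholder with each candidate value (hypothetical helper)."""
    sentences = []
    for template in templates:
        # Collect the slot names that this template actually uses.
        names = [name for name in slots if "{" + name + "}" in template]
        # Enumerate every combination of values for those slots.
        for values in itertools.product(*(slots[name] for name in names)):
            sentences.append(template.format(**dict(zip(names, values))))
    return sentences


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into train/validation/test sets (ratios are an assumption)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_valid = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


# Illustrative templates and slot values, not taken from the paper.
templates = [
    "I want to refund my ticket from {origin} to {destination}",
    "Is there an electronic ticket for train {train}",
]
slots = {
    "origin": ["Beijing", "Chengdu"],
    "destination": ["Guangzhou", "Shanghai", "Wuhan", "Xi'an"],
    "train": ["G1", "D2"],
}

corpus = expand_templates(templates, slots)
train, valid, test = split_dataset(corpus)
print(len(corpus), len(train), len(valid), len(test))
```

Fixing the random seed keeps the split reproducible, which matters when the same test set is reused to compare models.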

2 Design of Multi-Dialect Voice Recognition Model Integrating Railway Knowledge

This section introduces the design of the multi-dialect voice recognition model that integrates railway knowledge. The workflow has four steps: 1) the dialect recognition model identifies language information from the audio features; 2) the language information is fed into the Transformer-based multi-dialect voice recognition model, which uses two-pass decoding; 3) the first-pass results of the two-pass decoding are re-scored with the LSTM-based railway domain language model; 4) the decoding results from steps 2 and 3 are combined, and the highest-scoring N-best results are output as the final recognition result. The dialect recognition model is built on RepVGG, a lightweight convolutional neural network, to predict the dialect and assist the speech recognition model in decoding. The multi-dialect voice recognition model improves on the original Transformer by adding a language residual module and a dialect vector, which strengthen its ability to distinguish and extract multilingual information. The model is trained with a hybrid CTC/Attention structure and uses multi-task learning to train the speech recognition task and the language recognition task jointly. In addition, to address the difficulty of recognizing regional railway terminology, an LSTM-based railway domain language model is trained specifically to assist the multi-dialect voice recognition model in decoding, which improves recognition accuracy.
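The re-scoring in steps 3 and 4 can be sketched as a weighted combination of log-probability scores over an N-best list. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hypothesis` fields, the linear combination, the weights `ctc_weight` and `lm_weight`, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    ctc_score: float        # log-probability from the first (CTC) pass
    attention_score: float  # log-probability from the attention decoder
    lm_score: float         # log-probability from the domain language model


def combined_score(hyp, ctc_weight=0.3, lm_weight=0.5):
    """Linearly combine the three scores (the weights are assumptions)."""
    return (ctc_weight * hyp.ctc_score
            + (1.0 - ctc_weight) * hyp.attention_score
            + lm_weight * hyp.lm_score)


def rescore(hyps, **weights):
    """Return the N-best list sorted from best to worst combined score."""
    return sorted(hyps, key=lambda h: combined_score(h, **weights), reverse=True)


# Made-up scores: the acoustic passes slightly prefer the wrong word,
# while the domain language model strongly prefers the railway term.
n_best = [
    Hypothesis("change my ticket", ctc_score=-4.0, attention_score=-3.0, lm_score=-2.0),
    Hypothesis("change my thicket", ctc_score=-3.5, attention_score=-2.8, lm_score=-9.0),
]

print(rescore(n_best)[0].text)                 # with the language model
print(rescore(n_best, lm_weight=0.0)[0].text)  # acoustic scores only
```

In this toy case the domain language model overturns the acoustic ranking, which is the effect the railway language model is meant to have on domain terminology.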

3 Experimental Results and Analysis

This section presents the experimental results and analysis for the multi-dialect voice recognition method. The evaluation metrics are Character Error Rate (CER) and Sentence Error Rate (SER). The experimental environment is specified in terms of CPU, GPU, memory, and operating system, and the experiments use the PyTorch framework and the WeNet toolkit. The network parameter settings cover the acoustic input network, the encoder, the decoder, and the dialect recognition model, as well as the number of layers and the hidden dimension of the LSTM-based railway domain language model. Training sets parameters such as batch_size, accum_grad, grad_clip, and epoch, and uses an auxiliary weight for the CTC loss. The results show that the improved multi-dialect voice recognition model, Rep-Transformer, achieves recognition accuracy above 90% for Mandarin, Sichuan dialect, and Cantonese, effectively realizing cross-dialect recognition. Although single-language accuracy decreased slightly, cross-dialect recognition accuracy improved significantly. After the domain language model was integrated, both CER and SER decreased significantly for all three language varieties, which validates the proposed method and provides theoretical and technical support for making railway customer service systems more intelligent.
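For reference, the two metrics can be computed as in the following Python sketch. It assumes the standard definitions (CER as total character-level edit distance divided by total reference characters, SER as the fraction of sentences that do not match the reference exactly); the summary does not give the paper's exact formulas, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, deletion, insertion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character Error Rate over a set of sentence pairs."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return errors / total_chars


def ser(refs, hyps):
    """Sentence Error Rate: share of sentences with at least one error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


# Illustrative reference and recognized sentences (not from the paper's test set).
refs = ["退票", "查询车票"]
hyps = ["退票", "查询车费"]

print(cer(refs, hyps), ser(refs, hyps))
```

SER is the stricter of the two: a single wrong character marks the whole sentence as an error, so it is normally higher than CER on the same output.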

4 Conclusion

This study proposes a multi-dialect voice recognition method that integrates railway domain knowledge. It improves the Transformer-based speech recognition model by incorporating a RepVGG-based dialect recognition model and an LSTM-based railway domain language model. On a self-built dataset, the Rep-Transformer model achieves recognition accuracy above 90% for Mandarin, Sichuan dialect, and Cantonese, and it effectively improves the recognition of professional terminology in the railway sector. The proposed multi-task training of speech recognition and language recognition avoids the cumbersome steps of training the models independently and then optimizing them jointly. Because constructing the dialect dataset is costly, only three typical languages have been studied, and other dialects have not been explored in depth. From the perspective of full deployment in the railway customer service system, the method still has room for improvement in practicality and scalability. Future work will use transfer learning to study multi-dialect voice recognition algorithms for the railway domain under low-resource conditions, further enhancing the intelligent service capabilities of railway customer service systems.

The above content is automatically generated by AI for reference only. For details, see “China Railway” 2025, Issue 1.

Related Information

Authors:

Yang Lipeng, Institute of Electronic Computing Technology, China Academy of Railway Sciences.

Hu Conggang, Institute of Electronic Computing Technology, China Academy of Railway Sciences.

Chen Hualong, Institute of Electronic Computing Technology, China Academy of Railway Sciences.

Han Keke, Institute of Electronic Computing Technology, China Academy of Railway Sciences.

Liu Feng, Passenger Transport Department, China State Railway Group Co., Ltd.

Zhang Zhike, Passenger Transport Department, China State Railway Group Co., Ltd.

Citation: Yang Lipeng, Hu Conggang, Chen Hualong, et al. Multi-Dialect Voice Recognition Method for the Railway Sector [J]. China Railway, 2025(1): 30-39.
