Decoding Natural Language from Brain Activity: Tasks and Cutting-Edge Methods

Reprinted from | Harbin Institute of Technology SCIR
Author | Zhao Yi

1 Introduction

Language is not only a tool for human communication but also the foundation of thought and intelligence. Understanding how the brain represents and processes language information is key to revealing the essence of human intelligence. With the rapid development of technologies such as brain-computer interfaces (BCIs), we now have the potential to decode natural language directly from brain activity. This research direction matters not only for cognitive science and neuroscience, but it also offers new hope to people who have lost the ability to communicate due to neurodegenerative disease or trauma. Its development will greatly expand our understanding of how the human brain processes language and may open up entirely new modes of communication.

The greatest demand for decoding natural language from brain activity comes from patients with movement and language disorders caused by acute or degenerative injuries to the pyramidal tract or lower motor neurons. When these disorders are particularly severe, as in locked-in syndrome (LIS), patients may lose motor control entirely and become unable to initiate or maintain communication on their own, limited to answering simple questions with slight movements such as blinking or eye movements. BCI technology provides a bridge between the brain and the outside world: it reads the signals generated by the brain and converts them into outputs for a desired cognitive task, allowing people who cannot speak because of motor impairment to communicate through their brain signals alone, without moving any part of the body.

Significant progress has been made on several BCI paradigms that assist such patients in communicating, including the P300, steady-state visual evoked potentials (SSVEP), and motor imagery (MI). P300 and SSVEP paradigms use external stimuli, such as flashing screens or auditory beeps, to evoke distinguishable brain patterns, while motor-imagery-based systems identify the brain's spontaneous motor intentions without external stimulation. However, these paradigms typically output text only in the form of "thought typing" and cannot match the speed and flexibility of spoken communication: in everyday conversation, the number of words exchanged per minute can reach roughly seven times the rate of thought typing. Decoding natural language from brain activity, and more specifically from brain activity during speech or imagined speech, therefore offers a substantial speed advantage over earlier BCI paradigms and allows patients to communicate with less effort.

2 Data Collection

To capture the signals generated by the brain during speech or imagined speech, a variety of neuroimaging methods have been applied. These mainly include non-invasive methods such as electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI), as well as invasive methods such as electrocorticography (ECoG). Invasive methods offer high spatial and temporal resolution together with a high signal-to-noise ratio (SNR), but their medical risks limit widespread clinical and everyday use. As a result, decoding brain activity from non-invasive recordings has become a major focus of research.


Figure 1 Comparison of Various Neuroimaging Methods

2.1 ECoG

Electrocorticography (ECoG) is an invasive neural recording technique that measures electrical activity on the surface of the cerebral cortex via electrode arrays implanted in the subdural space. The electrodes are typically platinum-iridium disks embedded in soft silicone sheets. ECoG recordings have high spatial and temporal resolution and provide accurate information about brain activity. Owing to this accuracy and its high signal-to-noise ratio, ECoG is widely used in clinical neuroscience, particularly for localizing seizure foci in patients with drug-resistant epilepsy and for mapping cortical areas critical to brain function that must be preserved during resective surgery. A major advantage of ECoG is that it can cover a large area of the cortical surface while retaining sufficient spatial resolution, which is of significant value for studying widely distributed neural networks such as those underlying language and motor control.

Figure 2 Schematic Diagram of ECoG

2.2 EEG

Electroencephalography (EEG) is a widely used non-invasive neural recording technique that measures the electrical signals generated by brain activity through electrodes placed on the scalp. EEG is primarily used to monitor and study the brain's electrophysiological activity, particularly for diagnosing and researching epilepsy, sleep disorders, brain injury, and various neurological diseases. As a non-invasive method, EEG has high temporal resolution and can capture rapid changes in brain electrical activity with millisecond-level timing, which is very useful for studying how the brain processes information over short time scales. However, EEG has relatively low spatial resolution, making it difficult to localize electrical activity to specific brain areas precisely and limiting its use in fine-grained brain mapping. Another limitation is its low signal-to-noise ratio (SNR): the target components are hard to distinguish from background activity arising from muscle or organ activity, eye movements, or blinks. Despite these issues, EEG's non-invasiveness, portability, and low cost make it an extremely important tool in neuroscience, clinical neurology, and brain-computer interface research.

Figure 3 Schematic Diagram of EEG

2.3 MEG

Magnetoencephalography (MEG) is a non-invasive neuroimaging technique that measures brain activity by recording changes in the magnetic fields produced by neuronal activity. At the cellular level, the electrochemical properties of individual neurons cause charged ions to flow through the cells, and the net effect of this slow ionic current generates an electromagnetic field. Although the field produced by a single neuron is negligible, when a large number of neurons in a given area are activated together, a magnetic field measurable outside the head can be produced. Because these neuromagnetic signals are extremely weak, MEG scanners rely on superconducting sensors and must operate inside magnetically shielded rooms. MEG captures the temporal dynamics of brain activity with sub-millisecond precision and localizes neural activity more accurately than EEG. Despite its relatively demanding operating conditions, these advantages in spatial and temporal resolution make it an extremely important tool in neuroscience and clinical research.

Figure 4 Schematic Diagram of MEG

2.4 fMRI

Functional magnetic resonance imaging (fMRI) detects changes in brain activity using blood-oxygen-level-dependent (BOLD) contrast, which exploits the different magnetic properties of oxygenated and deoxygenated hemoglobin in the blood. When a brain region becomes active it requires more oxygen, and blood flow to that region increases to deliver more oxygenated hemoglobin. Oxygenated hemoglobin is diamagnetic (essentially magnetically neutral), while deoxygenated hemoglobin is paramagnetic, so as blood flow to a region increases, the BOLD signal there rises. fMRI has high spatial resolution but low temporal resolution: a single scan can measure roughly 100,000 voxels, whereas MEG systems typically have fewer than 300 sensors, yet a brief pulse of neural activity causes the BOLD response to rise and fall over roughly 10 seconds. For naturally spoken English, each brain image may therefore be influenced by more than 20 words, which makes decoding continuous language from fMRI an ill-posed problem. Despite this challenge, several works have explored and attempted this direction.

Figure 5 Schematic Diagram of fMRI

3 Cutting-Edge Work

Below, several recent works on decoding natural language from brain activity are introduced. Mainstream approaches [1,2,3] decode text end-to-end from brain activity, typically adopting an encoder-decoder architecture to map brain signals to continuous text. With the emergence of pre-trained language models, cutting-edge works [2,3] have begun applying them to brain activity decoding, usually as decoders trained jointly with randomly initialized encoders. Other work [4] decodes brain activity in a non-end-to-end fashion. Beyond text decoding, some work [5] studies aligning brain signals with high-quality representations generated by pre-trained models, thereby mapping brain signals into the well-structured vector space formed by those models' outputs.

3.1 End-to-End Decoding

Machine translation of cortical activity to text with an encoder-decoder framework (Nature Neuroscience 2020) [1] Before this work, most studies decoding natural language from brain activity were limited to isolated phonemes or monosyllabic words; work on decoding continuous text was rare and performed poorly. The article casts the problem as machine translation, treating brain signals as the source language and the corresponding continuous text as the target language, thereby transferring methods from the machine translation field to brain activity decoding.

The article designed a simple encoder-decoder neural network to decode continuous text from ECoG signals. As shown in the figure below, the model first applies strided convolution along the time dimension of the raw ECoG signals to extract temporal features and downsample them to 16 Hz, then feeds the result into an LSTM-based encoder-decoder network that decodes continuous text. To guide the encoder toward meaningful representations, in addition to training the model end-to-end, the article adds an auxiliary loss during training that forces the model to predict, at each time step, the audio representation of the corresponding speech from the encoder's hidden states (Mel-frequency cepstral coefficients, MFCCs, serve as the low-level audio representation).

Figure 6 Encoder-Decoder Framework Proposed in the Article
Each subject repeatedly read aloud a set of 30 to 50 sentences while ECoG was recorded simultaneously from about 250 electrodes over the cortex around the lateral fissure. The proposed method substantially improved accuracy over previous studies: for some participants the average word error rate (WER) dropped to 7%, far better than the error rates above 60% reported previously, providing an important reference for subsequent research.
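To make the architecture concrete, here is a minimal PyTorch sketch of this style of model; the layer sizes, vocabulary, and downsampling factor are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ECoGToText(nn.Module):
    """Minimal sketch of a conv + LSTM encoder-decoder for ECoG-to-text.

    A strided temporal convolution extracts features and downsamples the
    signal; an LSTM encoder summarizes it; an LSTM decoder emits words; an
    auxiliary head regresses MFCCs at each downsampled step to guide the
    encoder. All sizes are illustrative.
    """
    def __init__(self, n_electrodes=250, hidden=256, vocab=2000, n_mfcc=13):
        super().__init__()
        self.conv = nn.Conv1d(n_electrodes, hidden, kernel_size=12, stride=12)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.mfcc_head = nn.Linear(hidden, n_mfcc)   # auxiliary MFCC prediction
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, ecog, tokens):
        # ecog: (batch, electrodes, time), tokens: (batch, words)
        feats = self.conv(ecog).transpose(1, 2)      # (batch, T', hidden)
        enc, state = self.encoder(feats)
        mfcc_pred = self.mfcc_head(enc)              # auxiliary output per step
        dec, _ = self.decoder(self.embed(tokens), state)
        return self.out(dec), mfcc_pred

model = ECoGToText()
logits, mfcc_pred = model(torch.randn(4, 250, 1200),
                          torch.randint(0, 2000, (4, 12)))
# Training would combine word-level cross-entropy on logits with an MSE
# term comparing mfcc_pred to the true MFCCs of the spoken audio.
```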

Open Vocabulary Electroencephalography-to-Text Decoding and Zero-Shot Sentiment Classification (AAAI 2022) [2] In neuroscience and brain-computer interface research, collecting brain activity data faces a series of challenges, so the resulting datasets tend to be small, which has become an important constraint on related research and applications. Because of this lack of training data, earlier work on decoding natural language from brain activity was often limited to small, closed vocabularies and generalized poorly to words and sentences outside the training set. This work is the first to use a pre-trained language model (the article uses BART [6]) for continuous text decoding from EEG signals. Leveraging the pre-trained model's grasp of syntactic features, semantic features, and long-distance dependencies, the work expands the vocabulary to about 50,000 words (the vocabulary size of BART) while maintaining good generalization despite data scarcity.

The article views the human brain as a special kind of text encoder and proposes a novel framework called BrainBART. The framework treats the EEG feature sequence as an encoding of continuous text and maps it, through an additional encoder, to BART's embedding-layer representation, as shown in the figure below; training minimizes the cross-entropy loss of text reconstruction. The article also proposes a zero-shot sentiment classification method that first converts the EEG feature sequence into text and then predicts the sentiment label with a text classifier.

Figure 7 BrainBART Framework
This work used the ZuCo dataset [7,8], which contains EEG and eye-tracking data recorded while subjects performed natural reading tasks. BrainBART achieved a BLEU-1 score of 40.1% on continuous text decoding and an F1 score of 55.6% on zero-shot three-class sentiment classification, significantly outperforming supervised baselines.
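As an illustration of the idea, the following sketch maps word-level EEG features into BART's embedding space using the HuggingFace transformers library; the feature dimension and the design of the additional encoder are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Hypothetical additional encoder: projects per-word EEG features into
# BART's 1024-dim embedding space (dimensions are illustrative).
eeg_dim = 840
eeg_encoder = nn.Sequential(
    nn.Linear(eeg_dim, 1024),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
        num_layers=2,
    ),
)

eeg_feats = torch.randn(2, 20, eeg_dim)   # (batch, words, EEG features)
labels = tokenizer(
    ["he opened the door", "it was raining outside"],
    return_tensors="pt", padding=True,
).input_ids

# EEG-derived embeddings take the place of BART's token embeddings;
# the model is trained to reconstruct the original sentence.
out = bart(inputs_embeds=eeg_encoder(eeg_feats), labels=labels)
out.loss.backward()   # text-reconstruction cross-entropy
```

For the zero-shot sentiment pipeline, the sentences decoded by the trained model would then simply be passed to an ordinary text sentiment classifier.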

UniCoRN: Unified Cognitive Signal Reconstruction Bridging Cognitive Signals and Human Language (ACL 2023) [3] Although continuous text decoding from EEG has seen some success, research on generating continuous text from fMRI remains scarce, mainly because of fMRI's low temporal resolution. Previous fMRI decoding methods typically relied on features extracted from predefined regions of interest (ROIs), failing to exploit time-series information and often overlooking the importance of effective encoding. To address these issues, and to avoid building a separate, complex pipeline for each brain-signal modality, the article proposes UniCoRN, a universal decoding framework applicable to brain signals of various modalities. UniCoRN adopts an encoder-decoder framework that leverages the strong decoding ability of pre-trained language models and builds an effective encoder through snapshot and sequence reconstruction, allowing the model to capture temporal dependencies both within individual snapshots and across snapshot sequences and thereby extract as much information as possible from the brain signals.

The overall framework is introduced below using fMRI decoding as an example. UniCoRN consists of two stages: brain signal reconstruction, which trains an encoder for a given brain-signal modality, and brain signal decoding, which converts the first stage's representations into natural language. The underlying idea is to treat each snapshot of brain signals (such as a single fMRI frame) as a word-level token of the "language spoken by the human brain", obtain embeddings for this language with the encoder, and then translate it into real human language, much like a traditional machine translation task. The reconstruction stage is subdivided into snapshot reconstruction and sequence reconstruction, training the encoder to integrate both the internal features of each snapshot and the temporal relationships across the sequence. As shown in the figure, the snapshot reconstruction phase (phase 1) encodes each fMRI frame with a snapshot encoder, taking reconstruction of the original frame as the training objective; the sequence reconstruction phase (phase 2) feeds the encoded representations of consecutive fMRI frames into a sequence encoder to produce sequence-aware representations and continues training with the same objective. After reconstruction, the decoder used to rebuild the original fMRI frames is replaced with a text decoder for the final text generation (phase 3). The article uses BART as the text decoder and trains with cross-entropy loss.

Figure 8 UniCoRN Framework
UniCoRN achieved a BLEU-4 score of 34.77% on continuous text decoding from fMRI (Narratives dataset [9]) and 62.90% on continuous text decoding from EEG (ZuCo dataset [7]), surpassing previous baselines. The results indicate that decoding language from fMRI is feasible and that a unified architecture can effectively decode brain signals of different modalities.
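The two reconstruction phases can be sketched as follows; the voxel count, dimensions, and module choices here are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnapshotAutoencoder(nn.Module):
    """Phase 1: encode each fMRI frame and reconstruct it."""
    def __init__(self, n_voxels=4096, dim=1024):
        super().__init__()
        self.enc = nn.Linear(n_voxels, dim)
        self.dec = nn.Linear(dim, n_voxels)   # replaced by BART in phase 3

    def forward(self, frames):                # (batch * frames, voxels)
        z = self.enc(frames)
        return z, self.dec(z)

class SequenceEncoder(nn.Module):
    """Phase 2: model temporal dependencies across snapshot embeddings."""
    def __init__(self, dim=1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.seq = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, z_seq):                 # (batch, frames, dim)
        return self.seq(z_seq)

snap, seq = SnapshotAutoencoder(), SequenceEncoder()
frames = torch.randn(2, 8, 4096)              # 2 sequences of 8 fMRI frames
z, recon = snap(frames.flatten(0, 1))         # phase 1: snapshot reconstruction
loss1 = F.mse_loss(recon, frames.flatten(0, 1))
z_seq = seq(z.view(2, 8, -1))                 # phase 2: sequence reconstruction
# Phase 3 would drop the reconstruction decoder and feed z_seq into BART as
# inputs_embeds, training with cross-entropy as in the sketch above.
```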

3.2 Non-End-to-End Decoding

Semantic reconstruction of continuous language from non-invasive brain recordings (Nature Neuroscience 2023) [4] This work proposes a method to reconstruct, from fMRI signals, the natural-language auditory stimuli a subject is hearing or imagining. Doing so requires overcoming fMRI's low temporal resolution, so instead of decoding end-to-end, the proposed decoder generates candidate word sequences, evaluates how likely each candidate is to have elicited the currently measured brain response, and selects the best candidate.

The framework of the method is shown in the figure below. Three subjects listened to 16 hours of narrative stories while their blood-oxygen-level-dependent (BOLD) fMRI responses were recorded. For each subject, the article trained an encoding model that predicts the brain response from the semantic representation of the text stimulus. To reconstruct language from brain activity, the article uses a beam search algorithm that generates candidate sequences word by word: the decoder maintains the most likely candidate sequences, and when activity in the brain's auditory and language regions indicates a new word, a language model proposes the most likely continuations of each candidate. The trained encoding model then scores how likely each continuation is to have elicited the measured brain response, and the most likely continuations are retained. Experiments show the method's identification accuracy is significantly above chance, demonstrating its effectiveness.

Figure 9 Proposed fMRI Signal Decoding Method
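A schematic sketch of this candidate-scoring loop is given below; `lm.propose` and `encoding_model.predict` are hypothetical helpers standing in for the paper's language model and subject-specific encoding model, and the Gaussian scoring is a simplification:

```python
import numpy as np

def decode(bold_responses, lm, encoding_model, beam_width=10, n_continuations=5):
    """Schematic beam search guided by an encoding model.

    lm.propose(text, k) is a hypothetical helper returning k likely next
    words from a language model; encoding_model.predict(text) is a
    hypothetical subject-specific model mapping text to a predicted BOLD
    pattern.
    """
    beams = [("", 0.0)]                        # (candidate text, cumulative score)
    for bold in bold_responses:                # one step per detected new word
        scored = []
        for text, score in beams:
            for word in lm.propose(text, n_continuations):
                candidate = (text + " " + word).strip()
                predicted = encoding_model.predict(candidate)
                ll = -np.sum((bold - predicted) ** 2)   # Gaussian log-likelihood
                scored.append((candidate, score + ll))
        beams = sorted(scored, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]                         # best full reconstruction
```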

3.3 Signal Alignment Research

Decoding speech from non-invasive brain recordings (arXiv 2022, Meta AI) [5] This work proposes a data-driven method that uses a single architecture to decode natural language from MEG or EEG signals. The article introduces a convolutional neural network as the brain-signal encoder and trains it with a contrastive objective to align its output with the deep audio representations produced by the pre-trained self-supervised speech model wav2vec 2.0 [10].

In principle, the brain-signal encoder could be trained with a regression loss to predict the Mel-frequency cepstral coefficients of the corresponding audio, using the encoder's output as the brain-signal representation. In practice, however, the article observes that representations produced by this direct regression are often dominated by indistinguishable broadband components. The article therefore hypothesizes that regression is an ineffective loss and replaces it with the contrastive loss of the CLIP model [11], originally designed to match and align deep representations of text and images across modalities. It further argues that Mel-frequency cepstral coefficients, which capture only low-level properties of sound, are unlikely to match the richness of brain activity, and so replaces them with the output representations of wav2vec 2.0, which encode multiple levels of linguistic features and have been shown to relate linearly to brain activation. Finally, the article proposes a CNN encoder for brain activity that accounts for differences between subjects.

Figure 10 Proposed Brain Signal Alignment Method
The article validated the model on four public MEG/EEG datasets [12,13,14,15], demonstrating that it can identify the matching audio segment from 3 seconds of MEG/EEG signal (i.e., zero-shot decoding), achieving top-10 accuracies of up to 72.5% on MEG and 19.1% on EEG. Although the experiments were limited to decoding audio segments and individual words, the methods and ideas can serve as a foundation for follow-up work and transfer readily to many tasks, including continuous text decoding.
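The alignment objective can be illustrated with a generic CLIP-style symmetric contrastive (InfoNCE) loss; this is a sketch of the idea, not the paper's exact formulation or pooling scheme:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(brain_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE loss aligning brain and audio representations.

    brain_emb: (batch, dim) outputs of the MEG/EEG encoder, one per segment.
    audio_emb: (batch, dim) pooled wav2vec 2.0 features of the matching audio.
    Matching pairs share an index; all other pairs in the batch act as
    negatives.
    """
    brain_emb = F.normalize(brain_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = brain_emb @ audio_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(32, 768), torch.randn(32, 768))
```

At test time, the same similarity matrix supports zero-shot decoding: a segment of brain activity is matched against the audio candidates by ranking their similarities.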

4 Conclusion

This article has reviewed the task of decoding natural language from brain activity and its cutting-edge methods. The continued development of these methods not only deepens our understanding of the interaction between language and the brain but also lays a solid foundation for advanced brain-computer interface technologies. Despite significant progress, the field still faces challenges such as scarce brain activity data and the low signal-to-noise ratio of non-invasive methods, which limit practical applicability. Future work needs to acquire higher-quality, larger-scale brain activity data while innovating on algorithms and models to make the most of the limited data available. Finally, interdisciplinary collaboration across neuroscience, linguistics, and computer science will provide new perspectives on the complex mechanisms by which the brain processes language, pushing the field toward more precise and practical applications.

References

[1] Makin J G, Moses D A, Chang E F. Machine translation of cortical activity to text with an encoder-decoder framework[J]. Nature Neuroscience, 2020, 23(4): 575-582.
[2] Wang Z, Ji H. Open vocabulary electroencephalography-to-text decoding and zero-shot sentiment classification[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(5): 5350-5358.
[3] Xi N, Zhao S, Wang H, et al. UniCoRN: Unified cognitive signal reconstruction bridging cognitive signals and human language[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023: 13277-13291.
[4] Tang J, LeBel A, Jain S, et al. Semantic reconstruction of continuous language from non-invasive brain recordings[J]. Nature Neuroscience, 2023: 1-9.
[5] Défossez A, Caucheteux C, Rapin J, et al. Decoding speech from non-invasive brain recordings[J]. arXiv preprint arXiv:2208.12266, 2022.
[6] Lewis M, Liu Y, Goyal N, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7871-7880.
[7] Hollenstein N, Rotsztejn J, Troendle M, et al. ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading[J]. Scientific Data, 2018, 5(1): 1-13.
[8] Hollenstein N, Troendle M, Zhang C, et al. ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation[C]//Proceedings of the 12th Language Resources and Evaluation Conference. 2020: 138-146.
[9] Nastase S A, Liu Y F, Hillman H, et al. The "Narratives" fMRI dataset for evaluating models of naturalistic language comprehension[J]. Scientific Data, 2021, 8(1): 250.
[10] Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[J]. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.
[11] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[12] Schoffelen J M, Oostenveld R, Lam N H L, et al. A 204-subject multimodal neuroimaging dataset to study language processing[J]. Scientific Data, 2019, 6(1): 17.
[13] Gwilliams L, King J R, Marantz A, et al. Neural dynamics of phoneme sequencing in real speech jointly encode order and invariant content[J]. bioRxiv, 2020: 2020.04.04.025684.
[14] Broderick M P, Anderson A J, Di Liberto G M, et al. Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech[J]. Current Biology, 2018, 28(5): 803-809.e3.
[15] Brennan J R, Hale J T. Hierarchical structure guides rapid linguistic predictions during naturalistic listening[J]. PLoS ONE, 2019, 14(1): e0207741.