Author丨Zhou Peng
Affiliation丨Tencent
Research Direction丨Natural Language Processing, Knowledge Graph

Background
Method
▲ Figure 1. Overall Architecture of K-BERT
K-BERT first queries the knowledge graph with the entities mentioned in the input sentence and injects the matched triples into the sentence, turning it into a sentence tree (Figure 1). Once the sentence tree is obtained, a problem arises: BERT-like models can only take a sequential token input, so the graph-structured sentence tree cannot be fed into BERT directly. If the sentence tree is simply flattened into a sequence, the structural information is inevitably lost. K-BERT's solution consists of two clever mechanisms: soft positions and the visible matrix. Let's take a closer look at how they work.
As we know, before a sentence is fed into BERT, a position index is attached to every token to indicate where it sits in the sentence, for example "Tim(0) Cook(1) is(2) visiting(3) Beijing(4) now(5)". Without position encoding, BERT has no notion of word order and degenerates into a bag-of-words model.
In K-BERT, the sentence tree is first flattened. For example, the sentence tree in Figure 2 becomes "[CLS] Tim Cook CEO Apple is visiting Beijing capital China is_a City now" after flattening.
▲ Figure 2. Soft Position and Hard Position
Clearly, the flattened sentence is jumbled and hard to read, so K-BERT uses soft position indices to restore the order information of the sentence tree, i.e. "[CLS](0) Tim(1) Cook(2) CEO(3) Apple(4) is(3) visiting(4) Beijing(5) capital(6) China(7) is_a(6) City(7) now(6)". Note that "CEO(3)" and "is(3)" share the position index 3 because both of them directly follow "Cook(2)".
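To make the flattening and soft-position bookkeeping concrete, here is a minimal Python sketch. It is not the official K-BERT code; the function name `flatten_with_soft_positions` and the `branches` input format are assumptions chosen for illustration.

```python
# Minimal sketch: flatten a sentence tree into tokens plus soft positions.
# Each injected triple becomes a branch appended right after its entity token,
# and branch tokens continue counting from the entity's position.

def flatten_with_soft_positions(tokens, branches):
    """tokens: original sentence tokens, e.g. ["[CLS]", "Tim", ...].
    branches: hypothetical format mapping a token index to the list of
    branches (each a list of tokens) injected at that token."""
    flat_tokens, soft_pos = [], []
    for i, tok in enumerate(tokens):
        flat_tokens.append(tok)
        soft_pos.append(i)
        for branch in branches.get(i, []):
            for offset, branch_tok in enumerate(branch, start=1):
                flat_tokens.append(branch_tok)
                soft_pos.append(i + offset)
    return flat_tokens, soft_pos


sentence = ["[CLS]", "Tim", "Cook", "is", "visiting", "Beijing", "now"]
triples = {2: [["CEO", "Apple"]],                        # (Cook, CEO, Apple)
           5: [["capital", "China"], ["is_a", "City"]]}  # Beijing's two triples
print(flatten_with_soft_positions(sentence, triples))
# tokens: [CLS] Tim Cook CEO Apple is visiting Beijing capital China is_a City now
# soft  :  0    1   2    3   4     3  4        5       6       7     6    7    6
```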
However, soft positions alone are not enough: the model might then mistakenly believe that "Apple(4)" comes right after "is(3)", which is not the case. The biggest highlight of K-BERT is the Mask-Transformer, which uses the visible matrix to bring the structural information of the graph (or tree) into the model.
Recall that in BERT's self-attention, a token's representation is computed from its context. The core idea of the Mask-Transformer is that a token's representation should come only from the context on its own branch, while tokens on different branches should not influence each other. This is achieved with the visible matrix. The visible matrix for the sentence tree in Figure 2 is shown in Figure 3: there are 13 tokens in total, giving a 13×13 matrix, where red means the two corresponding tokens are visible to each other and white means they are not.
▲ Figure 3. Visible Matrix
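Continuing the sketch above, the visible matrix can be built directly from the flattened indices: tokens of the original sentence see all other sentence tokens plus the branch tokens attached to themselves, while a branch token sees only its own branch and the trunk token it hangs from. The helper below (`build_visible_matrix` and its `branch_spans` format are hypothetical) is a hedged illustration of that rule, not the official implementation.

```python
import numpy as np

def build_visible_matrix(trunk_idx, branch_spans):
    """trunk_idx: flattened indices of the original sentence tokens.
    branch_spans: list of (anchor_index, [branch_indices]) pairs, where the
    anchor is the trunk token the branch hangs from."""
    n = len(trunk_idx) + sum(len(b) for _, b in branch_spans)
    visible = np.zeros((n, n), dtype=bool)
    # Tokens of the original sentence are all mutually visible.
    for i in trunk_idx:
        for j in trunk_idx:
            visible[i, j] = True
    # A branch and its anchor token are mutually visible.
    for anchor, branch in branch_spans:
        group = [anchor] + branch
        for i in group:
            for j in group:
                visible[i, j] = True
    return visible


# Figure 2 example: trunk = [CLS] Tim Cook is visiting Beijing now;
# branches = (Cook -> CEO Apple), (Beijing -> capital China), (Beijing -> is_a City).
vm = build_visible_matrix([0, 1, 2, 5, 6, 7, 12],
                          [(2, [3, 4]), (7, [8, 9]), (7, [10, 11])])
print(vm.shape)  # (13, 13), matching the 13-token flattened sequence
```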
Now that we have the visible matrix, how is it used? This is exactly what the Mask-Transformer does: in the visible matrix M, the entries for mutually visible token pairs (the red points) are set to 0, and the entries for mutually invisible pairs (the white points) are set to negative infinity. M is then added to the attention scores before the softmax in self-attention, as shown in the following formula.
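Reconstructed here in the paper's notation, with $h^{i}$ the hidden states at layer $i$ and $M$ the visible matrix:

$$Q^{i+1},\; K^{i+1},\; V^{i+1} = h^{i}W_q,\; h^{i}W_k,\; h^{i}W_v$$

$$S^{i+1} = \mathrm{softmax}\!\left(\frac{Q^{i+1}{K^{i+1}}^{\top} + M}{\sqrt{d_k}}\right), \qquad h^{i+1} = S^{i+1}V^{i+1}$$

where $M_{jk} = 0$ if token $j$ and token $k$ are mutually visible, and $M_{jk} = -\infty$ otherwise.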
The formula above is a simple modification of BERT's self-attention: M is added and everything else is unchanged. If two tokens are mutually invisible, their attention weight S[i,j] becomes 0, so their hidden states h have no influence on each other. In this way, the structural information of the sentence tree is injected into BERT.
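As a concrete illustration, here is a minimal single-head PyTorch sketch of this masked self-attention; the shapes and parameter names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mask_self_attention(h, W_q, W_k, W_v, visible):
    """h: (seq_len, d_model) hidden states of the previous layer.
    visible: (seq_len, seq_len) boolean visible matrix."""
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    d_k = q.size(-1)
    # M: 0 where tokens are mutually visible, -inf where they are not.
    M = torch.zeros_like(visible, dtype=h.dtype).masked_fill(~visible, float("-inf"))
    scores = (q @ k.transpose(-2, -1) + M) / d_k ** 0.5
    S = F.softmax(scores, dim=-1)   # S[i, j] = 0 whenever j is invisible to i
    return S @ v                    # invisible tokens contribute nothing to the output
```

Since every token is at least visible to itself (the diagonal of the visible matrix), no row of the scores is entirely negative infinity, so the softmax stays well defined.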
In summary, the process of Mask-Transformer receiving the sentence tree as input is shown in Figure 5.
▲ Figure 5. Input Process of Sentence Tree
This corresponds to the structure diagram in the original paper, shown in Figure 6: a sentence tree is represented by a token sequence carrying its content and a visible matrix carrying its structural information.
▲ Figure 6. Conversion of Sentence Tree into Embedding Representation and Visible Matrix
As Figure 6 shows, apart from the soft positions and the visible matrix, everything else stays consistent with Google BERT, which gives K-BERT a very useful property: compatibility with the parameters of BERT-like models. K-BERT can directly load publicly released pre-trained checkpoints such as Google BERT, Baidu ERNIE, and Facebook RoBERTa without any additional pre-training, saving users a significant amount of computational resources.
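As a rough sketch of what this compatibility allows, the snippet below approximates the idea with the Hugging Face transformers API rather than the official K-BERT code (an assumption; the authors use their own training framework): a public BERT checkpoint is loaded unchanged, the soft positions are passed as `position_ids`, and the visible matrix is passed as a per-example seq_len × seq_len attention mask. It reuses `build_visible_matrix` from the sketch above.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")   # any public BERT checkpoint

# Flattened tokens, soft positions and visible matrix from the sketches above
# (relation tokens such as "is_a" fall back to [UNK] in this toy example).
flat_tokens = ["[CLS]", "tim", "cook", "ceo", "apple", "is", "visiting",
               "beijing", "capital", "china", "is_a", "city", "now"]
soft_pos = [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 6, 7, 6]
vm = build_visible_matrix([0, 1, 2, 5, 6, 7, 12],
                          [(2, [3, 4]), (7, [8, 9]), (7, [10, 11])])

input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(flat_tokens)])
position_ids = torch.tensor([soft_pos])
attention_mask = torch.tensor(vm, dtype=torch.float).unsqueeze(0)  # (1, 13, 13)

outputs = model(input_ids=input_ids,
                position_ids=position_ids,
                attention_mask=attention_mask)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 13, 768])
```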
Experimental Results
Next, let's look at the experimental results of K-BERT. The paper uses three knowledge graphs: CN-DBpedia, HowNet, and a self-built medical knowledge graph (MedicalKG). The evaluation tasks fall into two categories: open-domain tasks and domain-specific tasks. There are eight open-domain tasks: Book review, Chnsenticorp, Shopping, Weibo, XNLI, LCQMC, NLPCC-DBQA, and MSRA-NER. The results are shown in the following tables.
K-BERT shows a slight improvement over Google BERT on the open-domain tasks, but the gain is not significant. A likely reason is that open-domain tasks do not depend heavily on background knowledge.
To test the effect on tasks that do require background knowledge, the authors evaluated four domain-specific tasks: financial Q&A, legal Q&A, financial entity recognition, and medical entity recognition. The results are shown in the figure below.
The improvement on domain-specific tasks is much more pronounced, since these tasks rely heavily on background knowledge. Overall, knowledge graphs are well suited to enhancing tasks that require background knowledge, while on open-domain tasks that do not, the gains are often limited.
This work has been accepted to AAAI 2020. The authors also point out that K-BERT still has open problems, such as how to keep the model robust when the knowledge graph is of poor quality, and how to eliminate incorrect associations caused by polysemous entity names during entity linking. The authors hope to bring structured knowledge graphs into the NLP community, but much work remains; K-BERT is not yet perfect and will continue to be updated, and everyone is welcome to follow its progress.
Postscript
References
[1] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[2] Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; and Liu, Q. 2019. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129.
[3] Xu, B.; Xu, Y.; Liang, J.; Xie, C.; Liang, B.; Cui, W.; and Xiao, Y. 2017. CN-DBpedia: A never-ending Chinese knowledge extraction system. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, 428–438.
[4] Dong, Z.; Dong, Q.; and Hao, C. 2006. Hownet and the computation of meaning.