By / Yuling Yu, People’s Bank of China, Liaoning Branch
Hao Zhang, Gao Dengk, People’s Bank of China, Fushun Branch
Since its release in November 2022, the AI chatbot ChatGPT has rapidly gained popularity thanks to its outstanding language-processing capabilities. The model can be integrated into a wide range of fields, and its development potential and application scenarios have fired the imagination of the capital markets and drawn considerable attention. In the tech sector, the emergence of ChatGPT has set off a new wave in artificial intelligence, with major technology companies at home and abroad releasing their own AI products built on natural language models.
In today's information age, AI systems based on natural language models offer a new approach to the digital transformation of the banking industry. According to public reports, several banks have announced partnerships built around a well-known domestic tech company's AI system. Once this technology is put into practice, it will bring more efficient and convenient business systems to the banking industry. At the same time, we must recognize that building any intelligent model requires massive data support, and the data the banking industry handles bears on national financial security and the security of personal information. The application of new technologies will therefore demand a stricter data governance system to ensure the accuracy, completeness, and security of banking data. In short, while AI systems based on natural language models open new development opportunities for the banking industry, they also place higher demands on its data governance system.
Application Scenarios of Natural Language Models in the Banking Industry
1. Risk Identification and Control
Currently, banking institutions generally possess basic risk control capabilities through internal data models that analyze account anomalies, but the acquisition and processing of external information still relies mainly on staff conducting due diligence. AI natural language models can automatically aggregate large volumes of internal and external data and capture, analyze, mine, and draw inferences from textual information. With this technology, risk analysis and assessment can help banks identify potential risks and support decision-making with effective predictions. For example, analyzing text from the internet can surface potential credit risks, or the reputations of customers and institutions can be assessed comprehensively to determine risk levels.
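As an illustration only, the sketch below shows how external text snippets about a counterparty might be aggregated into a simple risk flag. The keyword weights, thresholds, and function names are hypothetical placeholders standing in for a trained NLP risk model, not a production design.

```python
# A minimal, illustrative sketch of text-based risk screening.
# The keyword list and scoring rule are hypothetical stand-ins for a
# trained NLP classifier; they are not a real risk model.

from dataclasses import dataclass

NEGATIVE_SIGNALS = {
    "lawsuit": 3, "default": 5, "overdue": 4,
    "fraud": 5, "downgrade": 3, "layoffs": 2,
}

@dataclass
class RiskReport:
    counterparty: str
    score: int
    level: str

def screen_counterparty(name: str, news_snippets: list[str]) -> RiskReport:
    """Aggregate external text and flag potential credit-risk signals."""
    score = 0
    for text in news_snippets:
        lowered = text.lower()
        for keyword, weight in NEGATIVE_SIGNALS.items():
            if keyword in lowered:
                score += weight
    level = "high" if score >= 8 else "medium" if score >= 4 else "low"
    return RiskReport(counterparty=name, score=score, level=level)

if __name__ == "__main__":
    snippets = [
        "Supplier X faces a lawsuit over overdue payments.",
        "Rating agency announces downgrade of Supplier X bonds.",
    ]
    print(screen_counterparty("Supplier X", snippets))
```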
2. Customer Service
Traditional customer service robots rely mainly on keyword matching to interpret customer questions and return templated answers; as a result, they often fail to grasp what customers actually need and frequently give unhelpful answers. With a natural language model, a customer service robot can hold a genuine human-machine dialogue: customers describe, in ordinary everyday language, the situation that led to their problem, and the AI system analyzes their intent, accurately understands their needs, and responds in a targeted way, with phrasing closer to human conversation and therefore easier for customers to follow. Customer service robots can also recommend personalized financial products and services based on the needs customers describe, helping banks support customers' asset management and financial planning goals.
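For illustration, the sketch below maps a free-form customer message to an intent and a targeted reply. The intent phrases, reply texts, and function names are hypothetical stand-ins for a real natural language understanding component.

```python
# Illustrative sketch of intent recognition for a banking service bot.
# The intent rules and responses below are hypothetical stand-ins for a
# trained natural language understanding model.

INTENT_RULES = {
    "lost_card": ["lost my card", "card stolen", "can't find my card"],
    "wealth_planning": ["save for", "invest", "retirement", "financial plan"],
    "loan_inquiry": ["mortgage", "loan", "borrow"],
}

RESPONSES = {
    "lost_card": "I can freeze the card immediately and order a replacement.",
    "wealth_planning": "Based on your goal, here are products matching your horizon and risk level.",
    "loan_inquiry": "Let me check the loan products you may be eligible for.",
}

def classify_intent(message: str) -> str:
    """Match the customer's everyday wording against known intents."""
    text = message.lower()
    for intent, phrases in INTENT_RULES.items():
        if any(p in text for p in phrases):
            return intent
    return "unknown"

def reply(message: str) -> str:
    intent = classify_intent(message)
    return RESPONSES.get(intent, "Could you tell me a bit more about what you need?")

if __name__ == "__main__":
    print(reply("I want to save for my daughter's tuition over five years."))
```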
3. Rapid Content Generation
With the support of natural language processing technology, artificial intelligence has also shown disruptive capabilities in content generation. For routine tasks such as planning proposals, summary reports, and notices, AI can quickly produce text of a quality that meets normal office needs, greatly improving the efficiency of written work. Some companies are also experimenting internally with using AI to generate the basic framework of web pages; test results indicate that the development cycle for individual product features has been shortened from several days to a few hours.
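As a rough sketch under stated assumptions, the example below shows the prompt-assembly step such drafting tools typically rely on. The names `build_summary_prompt` and `call_model` are hypothetical; `call_model` merely stands in for whichever approved text-generation service an institution actually uses.

```python
# Illustrative sketch of templated prompt assembly for office drafting.
# `call_model` is a hypothetical placeholder for an institution's approved
# text-generation service; only the prompt-building logic is shown.

def build_summary_prompt(meeting_notes: str, audience: str) -> str:
    """Assemble a drafting instruction from fixed requirements and raw notes."""
    return (
        "You are drafting an internal bank notification.\n"
        f"Audience: {audience}\n"
        "Summarize the following meeting notes into three concise bullet points, "
        "then add one sentence on next steps.\n\n"
        f"Notes:\n{meeting_notes}"
    )

def call_model(prompt: str) -> str:
    # Placeholder: in practice this would call the institution's model service.
    return f"[draft generated from prompt of {len(prompt)} characters]"

if __name__ == "__main__":
    notes = "Discussed Q3 data governance review; agreed to pilot automated report drafting."
    print(call_model(build_summary_prompt(notes, audience="branch operations staff")))
```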
Existing Risks
Although NLP technology presents huge opportunities in the banking sector, there are also inevitable risks and challenges in data governance.
1. Leakage of Sensitive Information
The training of artificial intelligence relies on feeding it large volumes of data. Depending on the model's application scenarios, some sensitive information (e.g., personal identity details, financial records) may be transmitted to the model. Data leakage risks can arise at several points in this process.
First, there is the traditional risk of data leakage. Any lapse in the management of data transmission, use, or destruction can lead to information leakage. In particular, banks currently tend to apply cutting-edge technologies through joint development between internal teams and external partners, and AI training requires a great deal of human intervention; both factors increase the number of people who handle sensitive data and the frequency of that contact, creating leakage risks.
Second, artificial intelligence is highly automated, and the content generation process is not fully controllable. During generation, the model may reproduce, without restriction, sensitive data it absorbed during training, leaking that data; malicious actors may even extract such information deliberately through leading questions.
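One hedged illustration of a mitigation: the sketch below applies a redaction filter to generated text before it leaves the system, masking strings that look like phone numbers, citizen ID numbers, or bank card numbers. The regular expressions are illustrative assumptions and by no means exhaustive.

```python
# Minimal sketch of an output redaction filter applied to generated text
# before it is returned to a user. The patterns below are illustrative
# assumptions (mainland phone, citizen ID, and card formats), not exhaustive.

import re

REDACTION_PATTERNS = {
    "phone": re.compile(r"\b1\d{10}\b"),          # 11-digit mobile numbers
    "id_card": re.compile(r"\b\d{17}[\dXx]\b"),   # 18-character citizen ID numbers
    "card_no": re.compile(r"\b\d{16,19}\b"),      # bank card numbers
}

def redact(generated_text: str) -> str:
    """Mask sensitive patterns so training-data fragments cannot leak verbatim."""
    clean = generated_text
    for label, pattern in REDACTION_PATTERNS.items():
        clean = pattern.sub(f"[REDACTED {label}]", clean)
    return clean

if __name__ == "__main__":
    risky_output = "The customer's number is 13800138000 and card 6222021234567890123."
    print(redact(risky_output))
```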
2. Algorithm Bias
Because training data differ in source and type, algorithms may exhibit bias in text classification, natural language understanding, and machine translation. In addition, the machine learning algorithms used in AI applications may themselves contain biases or misinterpretations, producing wrong answers or failing to address the user's intent. For banks, if such algorithmic errors undermine balance and fairness, the newly generated data will inevitably be distorted, potentially leading to wrong guidance or conclusions during risk assessment.
3. Data Pollution Issues
A natural language model can automatically generate large amounts of new data, and the generated data's high similarity to, and correlation with, the source data makes it harder to classify, identify, and archive. If newly generated data enters the existing data system without strict review, it is very likely to pollute the source data, and the entire data governance system will inevitably be affected.
4. Legal Compliance Issues
In addition to legal risks in data security, the content generated by natural language models also carries legal risks.
First, generated content may give rise to commercial infringement. This kind of artificial intelligence generates content chiefly by mining source data, performing statistical analysis, and making certain modifications and recombinations. If the data sources drawn on during generation are not commercially licensed, the generated content will carry infringement risks.
Second, when customer service robots use the model to recommend services or products, whether the automatically generated content complies with the relevant legal regulations, and whether the recommended products carry the corresponding legal effect, are potential legal risk points that may lead to disputes.
Recommendations for Data Governance
Undoubtedly, NLP technology provides new ideas and solutions for the digital transformation of the banking industry, but we must recognize that the hidden dangers it presents may harm industry development and even affect national financial security. This requires strengthening the construction of the data governance system while using new technologies.
1. Clarify the Security Boundaries of Data
Unlike other business systems, which can be tested with simulated data in a test environment, AI training can achieve accuracy and usability only if real data are used, which places higher demands on data security management. Clarifying the security boundaries of sensitive data means retrieving only the minimum data necessary, preventing unnecessary sensitive data from being accessed without authorization, and reducing the likelihood that sensitive data leak through the AI system's output.
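A minimal sketch of this idea, assuming hypothetical field names, is shown below: only whitelisted fields pass into the training corpus, and direct identifiers are pseudonymized on the way in.

```python
# Illustrative sketch of data minimization before records enter a training set:
# only whitelisted fields pass through, and direct identifiers are pseudonymized.
# Field names and the salt handling are hypothetical simplifications.

import hashlib

ALLOWED_FIELDS = {"customer_id", "product_type", "complaint_text", "region"}
IDENTIFIER_FIELDS = {"customer_id"}

def pseudonymize(value: str, salt: str = "rotate-me-per-project") -> str:
    """Replace a direct identifier with a salted, truncated hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def minimize_record(record: dict) -> dict:
    """Keep only fields needed for training and pseudonymize identifiers."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    for field in IDENTIFIER_FIELDS & kept.keys():
        kept[field] = pseudonymize(str(kept[field]))
    return kept

if __name__ == "__main__":
    raw = {
        "customer_id": "C-10086",
        "name": "Zhang San",              # dropped: not needed for training
        "phone": "13800138000",           # dropped: sensitive and unnecessary
        "complaint_text": "The transfer page keeps timing out.",
        "product_type": "mobile banking",
        "region": "Liaoning",
    }
    print(minimize_record(raw))
```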
2. Establish a Reliable Training System
During training, developers must pay attention not only to the accuracy of the answers NLP technology produces but also to the balance and fairness of the generated content, correcting the model as a whole to reduce biases introduced during training and to avoid erroneous results. Because the training process relies mainly on manual labeling by trainers, building the right team is equally important: training personnel should be selected not only for technical skill but also with regard to their values and the standards they apply when labeling.
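As one illustrative way to check balance, the toy sketch below compares positive-outcome rates across groups in model outputs and flags large gaps for human review. The grouping attribute and the gap threshold are assumptions for demonstration, not regulatory values.

```python
# Toy sketch of a balance check on model outputs: compare positive-outcome
# rates across groups and flag large gaps for review. The threshold and the
# grouping attribute are illustrative assumptions.

from collections import defaultdict

def approval_rates(decisions: list[dict]) -> dict[str, float]:
    """Compute the share of positive outcomes per group."""
    totals, approvals = defaultdict(int), defaultdict(int)
    for d in decisions:
        totals[d["group"]] += 1
        approvals[d["group"]] += int(d["approved"])
    return {g: approvals[g] / totals[g] for g in totals}

def flag_disparity(decisions: list[dict], max_gap: float = 0.1) -> bool:
    """Return True if the gap between best- and worst-treated groups is too large."""
    rates = approval_rates(decisions)
    gap = max(rates.values()) - min(rates.values())
    print(f"approval rates: {rates}, gap: {gap:.2f}")
    return gap > max_gap

if __name__ == "__main__":
    sample = (
        [{"group": "A", "approved": True}] * 8 + [{"group": "A", "approved": False}] * 2
        + [{"group": "B", "approved": True}] * 5 + [{"group": "B", "approved": False}] * 5
    )
    print("needs review:", flag_disparity(sample))
```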
3. Strengthen the Review and Management of Newly Generated Data
Given the inherent advantages of AI systems in data generation, they can produce massive amounts of data in a very short time, significantly increasing the difficulty of data governance. The banking industry has extremely strict requirements for data security, stability, and accuracy. In light of this contradiction, isolating source data and newly generated data may be a relatively prudent solution.
For newly generated data, a data review should be conducted first. Only after confirming the accuracy of the data can the reviewed data be merged with the source data. For data that has not been reviewed, caution should be exercised in its application and output to avoid potential data pollution caused by newly generated data.
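The sketch below illustrates one possible shape of this isolation: generated records sit in a quarantine store until a reviewer approves them, and only approved records are merged into the source dataset. The record structure and review step are simplified assumptions.

```python
# Minimal sketch of isolating newly generated data from the source dataset:
# generated records stay in quarantine until a reviewer approves them.
# The record structure and review workflow are simplified assumptions.

from dataclasses import dataclass, field

@dataclass
class GeneratedRecord:
    record_id: str
    content: str
    reviewed: bool = False
    approved: bool = False

@dataclass
class DataStore:
    source_data: list[str] = field(default_factory=list)
    quarantine: list[GeneratedRecord] = field(default_factory=list)

    def add_generated(self, record: GeneratedRecord) -> None:
        self.quarantine.append(record)  # never written straight into source data

    def review(self, record_id: str, approved: bool) -> None:
        for rec in self.quarantine:
            if rec.record_id == record_id:
                rec.reviewed, rec.approved = True, approved

    def merge_approved(self) -> None:
        """Move only reviewed-and-approved records into the source dataset."""
        for rec in [r for r in self.quarantine if r.reviewed and r.approved]:
            self.source_data.append(rec.content)
            self.quarantine.remove(rec)

if __name__ == "__main__":
    store = DataStore(source_data=["verified customer FAQ"])
    store.add_generated(GeneratedRecord("g-001", "model-drafted FAQ entry"))
    store.review("g-001", approved=True)
    store.merge_approved()
    print(store.source_data, len(store.quarantine))
```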
4. Apply New Technologies Reasonably Within a Legal Framework
Any application of new technology may raise new legal issues. From a data governance perspective, however, breaking NLP technology down into its component processes helps identify the applicable laws on intellectual property protection, data security, and consumer rights. The “Regulations on the Management of Algorithm Recommendation for Internet Information Services,” issued in 2022, state clearly that AI algorithms must adhere to mainstream socialist values and must not be used to harm national security or the public interest, disrupt economic or social order, or infringe on the legitimate rights and interests of others. Developers should therefore build legal norms into the design framework from the outset. Compliance thinking should run through the entire lifecycle of an AI system, combining algorithm regulation with data regulation: every stage, from model establishment, training, deployment, and iteration through to system retirement, must operate within a legal framework and under effective supervision. In addition, legal research on AI ethics and ownership should be pursued to close existing theoretical gaps and clarify legal responsibilities.
Conclusion
In summary, NLP technology has extensive prospects and application scenarios in the banking industry, providing a more efficient and convenient business system. However, there are also potential risks and challenges in data governance.
Therefore, the banking industry needs to establish a strict data governance system when applying this technology, clarify data security boundaries, strengthen the review and management of newly generated data, and reasonably apply new technologies within a legal framework to ensure the accuracy, completeness, and security of data. Only in this way can the role of NLP technology be better utilized, providing more reliable support for the digital transformation of the banking industry.
(This article was published in the “Financial Electronics” magazine, February 2024 issue.)