Exploring the Application of AIGC in System Operations and Maintenance

By Wang Wanting, Fu Hui, Yan Xiaofei, and Chen Ruoyu, Bank of China Software Center

With the rapid evolution of distributed architectures and the gradual adoption of cloud-native technologies, the IT systems of large commercial banks now combine agile and stable characteristics and run in parallel across complex architectures that span cloud and on-premises environments. This complexity places higher demands on the efficiency and agility of IT operations and maintenance. In recent years, the rapid development of artificial intelligence (AI), and in particular the emergence of AI-generated content (AIGC) technology, has become a catalyst for efficient IT operations and maintenance. AIGC offers a rich matrix of capabilities, such as text generation, intelligent analysis, and content recommendation, and handles complex tasks well. The Bank of China Software Center is actively exploring the application of AIGC in the operations and maintenance field. Focusing on service applications, it is building a large model framework for operations and maintenance that aims at intelligent fault resolution and performance optimization of IT systems, improving operational efficiency and providing strong technical support for stable business operations.

AIGC Operations and Maintenance System

The Bank of China Software Center combines industry large model technology with its experience in operating and maintaining banking IT systems, and is gradually advancing the intelligent transformation and upgrading of its operations and maintenance system. The effort covers a range of intelligent operations and maintenance services, including platform technology support, an enterprise-level observability system, a unified configuration center, alarm analysis and intelligent disposal, and information report generation. The Center has proposed and built a framework for AGI (Artificial General Intelligence) operations and maintenance on top of enterprise-level shared resources: general and domain-specific large model libraries, AGI operations and maintenance scenario libraries, problem/event knowledge bases, algorithm libraries, and user control systems. The framework provides unified acquisition of operations and maintenance information, full-link tracking of system applications, intelligent fault diagnosis and analysis, automated control of change implementation, and real-time decision-making capabilities, and it delivers a more efficient service experience to users through a unified service desk. The AGI operations and maintenance architecture is shown in Figure 1.

Figure 1 AGI Operations and Maintenance Framework

Under the AGI operations and maintenance framework, how can truly integrated operations and maintenance be achieved with existing enterprise-level IT systems? First, intelligent information generation: generative AI models link the various enterprise databases and analysis systems to produce system inspection reports, event analysis reports, resource usage reports, and system configuration item reports in real time. Second, interactive AI: prompts are tuned to constrain the content that AI models generate, and the models are connected to RPA systems and other tools to enable command-based conversational operations and maintenance. Third, intelligent alarms and self-healing: analytical large models analyze observable operations and maintenance data and automatically invoke automation tools to resolve system alarms.

Figure 2 illustrates an example of AGI operations and maintenance scenarios, in which basic operations and maintenance needs are met by invoking large model API services at the MaaS layer, calling enterprise-level tools, and tuning prompts. For instance, Q&A-type large models can answer technical consulting questions and recommend solutions for handling operations tickets, while task-oriented large models can display real-time monitoring information and dispatch and execute tasks automatically.

Figure 2 Example of AGI Operations and Maintenance Scenario

1. Multi-channel Information Collection, Building the Foundation of Operations and Maintenance Data. Before the army moves, data must lead. Collecting and integrating data is the foundation of all operations and maintenance work, and the training phase of large models requires many types of data. The enterprise observability system integrates existing monitoring systems such as Zabbix, Prometheus, ELK, and SkyWalking, while the enterprise unified configuration center combines various automation script tools. Together they provide multi-channel, high-efficiency data collection. Their management scope covers basic resources, the platform layer, and the application layer, and the data collected includes operational logs, monitoring metrics, link information, and operations knowledge bases, with stable support for collecting operations and maintenance data from tens of thousands of partitions.

The domain large model is problem-oriented: it periodically analyzes and evaluates the collected data, optimizes the preprocessing algorithms, reduces interference from redundant data, and highlights the observable operations and maintenance data that matters. It also generates optimization suggestions for the existing rule-based or machine-learning-based alarm algorithms, so that the methods of alarm generation, aggregation, and convergence can be adjusted promptly and alarm quality improves. The alarm resolution strategies that the domain model generates at the same time are attached to the early warning notifications, providing a “housekeeper-style” service for data aggregation and global presentation.
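
To make alarm aggregation and convergence concrete, the following minimal Python sketch merges repeated alarms from the same source and metric when they arrive within a fixed time window. The `Alarm` fields, the `converge_alarms` function, and the 300-second default window are hypothetical choices for illustration and are not taken from the Center's system; in the framework described here, parameters of this kind are what the domain model would suggest adjusting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str       # hypothetical: host or partition that raised the alarm
    metric: str       # hypothetical: metric name, e.g. "cpu"
    timestamp: float  # seconds since some reference point


def converge_alarms(alarms, window_seconds=300.0):
    """Group alarms that share (source, metric) and arrive close together.

    An alarm joins the latest group for its key if it arrives within
    window_seconds of that group's most recent alarm; otherwise it
    starts a new group.
    """
    groups = []
    latest_group = {}  # (source, metric) -> index of its latest group
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.source, alarm.metric)
        idx = latest_group.get(key)
        if idx is not None and alarm.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alarm)
        else:
            groups.append([alarm])
            latest_group[key] = len(groups) - 1
    return groups


alarms = [
    Alarm("db01", "cpu", 0.0),
    Alarm("db01", "cpu", 100.0),
    Alarm("app02", "mem", 50.0),
    Alarm("db01", "cpu", 1000.0),
]
group_sizes = [len(g) for g in converge_alarms(alarms)]
print(group_sizes)
```

In this example the two `db01` CPU alarms that are 100 seconds apart collapse into one notification, while the later one at 1000 seconds is treated as a new incident.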

2. Full-link Observability, Precisely Troubleshooting System Faults. Full coverage, dynamic perception. The AGI operations and maintenance framework builds on the enterprise observability system and the enterprise unified configuration center and is driven by domain large models and algorithm libraries. It performs link-level aggregation analysis and fault diagnosis across multiple dimensions, such as system architecture, network topology, and applications. By instrumenting key points along each path and coloring faulty links, the large volume of collected basic data is connected end to end, and the models go through multiple rounds of training and parameter tuning. Multi-source historical data from the link is combined to establish a baseline profile of the application’s operating status. Real-time link data from the business cluster is dynamically matched against this baseline profile, and fault information is reported when the two diverge. This makes the business link context transparent, puts “link as a service” and “fault as discovery” within easy reach, and provides intelligent “dual vision” for precisely troubleshooting systems and applications.
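
The idea of matching real-time link data against a baseline can be sketched in a few lines of Python. This is a deliberately simple statistical stand-in (mean and standard deviation of historical latency per link segment), not the model-trained baseline the framework uses; the function names, the segment names, and the threshold `k` are assumptions made for illustration. Each history list is assumed to be non-empty.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """history maps a segment name to a non-empty list of past latencies (ms)."""
    return {segment: (mean(values), pstdev(values)) for segment, values in history.items()}


def find_anomalous_segments(baseline, live, k=3.0):
    """Return segments whose live latency exceeds mean + k * stddev."""
    anomalies = []
    for segment, latency in live.items():
        if segment not in baseline:
            continue  # no history yet, so nothing to compare against
        mu, sigma = baseline[segment]
        if latency > mu + k * sigma:
            anomalies.append(segment)
    return anomalies


history = {
    "gateway": [10, 12, 11, 9],
    "core-db": [20, 22, 21, 19],
}
baseline = build_baseline(history)
live = {"gateway": 11, "core-db": 80, "new-service": 5}
print(find_anomalous_segments(baseline, live))
```

Here only `core-db` is flagged: its live latency of 80 ms is far above its historical range, while `gateway` stays within its baseline and `new-service` has no history to compare against.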

Fallen petals are not without feelings; they turn into spring mud to better protect the flowers. Once the fault analysis model has completed its diagnosis using key fault links, abnormal indicators, log data, and the historical event/problem knowledge base, the report generation model autonomously reviews the logs, indicators, and other information for the faulty link, summarizes the issues, generates a fault summary report, and places it in the database of items pending disposal. After verification by the expert system, the report is stored in the knowledge base and used as basic data for model optimization and iteration, continuously improving the model’s diagnostic capabilities.

3. Change Implementation Control, Achieving Efficient and Agile Process Management. Guided by the principles of “safety, agility, and efficiency,” the processes for changes, emergencies, and service requests are simplified and made more flexible, establishing lightweight, convenient processes suited to cloud-native environments. General large models and vertical domain large models, combined with robotic process automation (RPA) and automation tools, can automate change configuration and deployment. The model collects and analyzes system configuration information, environmental requirements, and application characteristics, generates configuration files suited to the specific environment and application, and verifies those files. Based on system configuration information and deployment strategies, it then generates an automated deployment plan that includes the steps, their sequence, and their dependencies. Automation tools distribute the configuration files and deploy the applications according to this plan. After a deployment or change completes, verification steps are executed; if issues are found, the system automatically rolls back to the previous available state according to predefined rollback strategies, so that change processes and their implementation follow the “natural way” under intelligent operations and maintenance.
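
The plan, verify, and roll back sequence described above can be sketched as follows. This is an illustrative outline only: `plan_order`, `apply_change`, and the step names are hypothetical, the apply and rollback callables stand in for real automation tools, and the dependency ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def plan_order(dependencies):
    """dependencies maps each step to the set of steps it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def apply_change(steps, dependencies, verify):
    """Run steps in dependency order; roll back everything on a failed check.

    steps maps a step name to an (apply_fn, rollback_fn) pair.
    verify(name) returns True if the step's result is acceptable.
    """
    done = []
    for name in plan_order(dependencies):
        apply_fn, _ = steps[name]
        apply_fn()
        done.append(name)
        if not verify(name):
            # Undo completed steps in reverse order to restore the prior state.
            for finished in reversed(done):
                steps[finished][1]()
            return False, done
    return True, done


log = []
names = ["config", "db_migrate", "app"]
steps = {
    n: (lambda n=n: log.append("apply:" + n), lambda n=n: log.append("rollback:" + n))
    for n in names
}
dependencies = {"config": set(), "db_migrate": {"config"}, "app": {"db_migrate"}}

# Simulate a failed verification on the last step.
ok, attempted = apply_change(steps, dependencies, verify=lambda name: name != "app")
print(ok, log)
```

Rolling back in reverse order of application is a common design choice because later steps usually depend on the state created by earlier ones.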

The intelligent change process is illustrated in Figure 3. Given a user’s description of the change, the general large model decomposes it into tasks and distributes them to domain large models, which execute the sub-tasks concurrently. Constraining the output format of the large model allows it to connect to different enterprise tools, such as databases, search tools, and RPA systems, for real-time information retrieval and task execution. When the tasks are complete, the relevant information is fed back to the person responsible for the change or recorded for later review.

Figure 3 Intelligent Change Process
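
As a rough illustration of how a constrained output format lets a model drive enterprise tools, the sketch below assumes the model has been prompted to reply with a JSON object of the form `{"tool": ..., "target": ...}`. The tool names, the `TOOLS` registry, and the `dispatch` function are hypothetical placeholders; real tools would wrap databases, search services, or RPA systems.

```python
import json


# Hypothetical tools; each stands in for a call to a real enterprise system.
def query_database(target):
    return f"rows for {target}"


def run_rpa_job(target):
    return f"rpa job started on {target}"


TOOLS = {"query_database": query_database, "run_rpa_job": run_rpa_job}


def dispatch(model_output):
    """Parse a model reply and call the named tool, rejecting anything else."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return {"status": "rejected", "reason": "not valid JSON"}
    if not isinstance(request, dict):
        return {"status": "rejected", "reason": "expected a JSON object"}
    name = request.get("tool")
    target = request.get("target")
    if not isinstance(name, str) or name not in TOOLS:
        return {"status": "rejected", "reason": "unknown tool"}
    if not isinstance(target, str):
        return {"status": "rejected", "reason": "missing target"}
    return {"status": "ok", "result": TOOLS[name](target)}


print(dispatch('{"tool": "query_database", "target": "core_ledger"}'))
print(dispatch("restart everything"))
```

Only tools in the registry can be invoked, so free-form or unexpected model output is rejected rather than executed.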

4. Intelligent Operations, Providing Decision Support for Operations and Maintenance. Plan strategically, win from afar. The intelligent service layer of the AGI operations and maintenance framework simplifies system operations, maintenance, and changes through a “one-to-one” conversational approach and provides better-founded decision support by integrating global information. For resource allocation, the model analyzes system operating data, predicts future resource demand, responds to user needs in real time, and helps the operations and maintenance team formulate resource allocation and expansion plans, achieving autonomous awareness and elastic scaling of cloud resource usage. To counter network threats and attacks, AIGC models assist in designing and generating network security protection systems and vulnerability mitigation strategies for multi-cloud environments. Drawing on network inspection and fault diagnosis reports, combined with comprehensive traffic collection, honeypot control, access control, and well-designed model prompts, the barriers and gaps between different systems and applications are bridged. The resulting security defense reports are consolidated in the security operations center, establishing a multi-dimensional, defense-in-depth network security system and enabling intelligent analysis and decision-making for security defense strategies.
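
For the resource allocation case, the prediction step can be illustrated with a deliberately simple stand-in: a least-squares linear trend fitted to recent usage samples. The article does not specify the forecasting method, so the functions `forecast_usage` and `needs_expansion`, the evenly spaced samples, and the 80% headroom threshold are all assumptions made for this sketch.

```python
def forecast_usage(samples, steps_ahead):
    """Fit a straight line to evenly spaced samples and extrapolate it."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)


def needs_expansion(samples, steps_ahead, capacity, headroom=0.8):
    """Flag expansion when forecast usage exceeds a fraction of capacity."""
    return forecast_usage(samples, steps_ahead) > capacity * headroom


usage = [10, 20, 30, 40]  # e.g. CPU cores in use over four equal periods
print(forecast_usage(usage, steps_ahead=2))
print(needs_expansion(usage, steps_ahead=2, capacity=70))
```

A production system would likely need to account for seasonality and bursts; the linear trend is used here only to show where a forecast feeds into an expansion decision.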

Those who do not plan for the overall situation cannot strategize for a single domain. The Bank of China has multiple production and testing data centers in different locations. With a multi-model AGI operations and maintenance framework built from general and industry-specific models, operations and maintenance engineers can obtain accurate guidance without traveling on site and can efficiently complete collaborative operations and maintenance tasks across multiple locations and centers.

Problems and Challenges

AIGC large models have already shown their potential in the operations and maintenance field, but face some challenges that cannot be ignored in practical applications.

1. Content Uncertainty. Content generated by large models may be limited by the quality or quantity of the training data and by the reliability of the model, which can lead to biases or errors in the output. Prompts need to be tuned within the large model application framework to improve generation quality, for example by constraining the output to JSON format.
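
One way to act on a JSON output constraint is to validate each reply and ask again when it does not conform. The sketch below is an assumption-laden illustration: `model` is any callable that takes a prompt string and returns text (a stub is used here in place of a real model API), and `REQUIRED_KEYS` is a hypothetical report schema.

```python
import json

REQUIRED_KEYS = {"summary", "severity"}  # hypothetical schema for a fault report


def generate_structured(model, prompt, max_attempts=3):
    """Call the model until it returns a JSON object with the required keys."""
    for _ in range(max_attempts):
        reply = model(prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data
        # Restate the constraint before retrying.
        prompt += "\nReply only with a JSON object containing the keys: summary, severity."
    return None  # caller decides how to handle repeated failures


# Stub model: the first reply ignores the format, the second one conforms.
replies = iter(["Sorry, the disk is full.", '{"summary": "disk full", "severity": "high"}'])
result = generate_structured(lambda prompt: next(replies), "Summarize the alarm.")
print(result)
```

Returning `None` after the final attempt keeps malformed content from flowing into downstream tools.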

2. Privacy and Security Issues. Banking business systems involve large amounts of sensitive information and critical data, so appropriate security and privacy protection measures are required to prevent unauthorized access and data breaches. Examples include model alignment through reinforcement learning from human feedback (RLHF) and security checks on content both before and after model generation.

3. User Trust Issues. The decision-making process of large models is often opaque, which can make their decisions hard to interpret and trust. Gaining user acceptance and trust takes time and demonstrated results.

Conclusion and Outlook

A new era brings new opportunities and challenges. As the application scenarios of AIGC technology continue to be explored, numerous large model frameworks have emerged in the industry, covering fields such as code development and system security. The Bank of China Software Center will continue to deepen its research into AIGC technology, making it the preferred tool for testing and operations personnel in root cause analysis, fault troubleshooting and prediction, and system iteration and optimization. At the same time, the Center will closely follow industry developments in AIGC and the open-source community, and will explore applications of the technology in scenarios such as anti-money laundering in banking, intelligent investment advisory, and analysis of fraud in the gray and black industries, providing solid support for the secure, stable, and efficient operation of the business.

(This article was published in the “Financial Electrification” January 2024 issue)