By Wang Wanting, Fu Hui, Yan Xiaofei, Chen Ruoyu, Bank of China Software Center
With the rapid evolution of distributed architecture and the steady adoption of cloud-native technologies, the IT systems of large commercial banks now combine agile and stable characteristics. Under a complex architecture in which on-cloud and off-cloud operations run in parallel and centralized and distributed systems coexist, the demand for more efficient and agile IT operations keeps growing. In recent years, the rapid development of artificial intelligence (AI), and in particular the emergence of AI-generated content (AIGC) technologies, has become a catalyst for efficient IT system operations. AIGC has given rise to a rich matrix of capabilities (text generation, intelligent analysis, content recommendation, and more) and has demonstrated powerful performance on complex tasks. The Bank of China Software Center has been actively exploring the application of AIGC in the operations field, focusing on service applications and building a large-model framework for operations, with the goals of intelligent fault resolution and performance optimization of IT systems, thereby improving operational efficiency and providing strong technical support for stable business operations.
AIGC Operations System

Under the AIGC operations framework, how can true integrated operations be achieved with existing enterprise-level IT systems? First, intelligent information generation links existing enterprise databases, analysis systems, and other sources through generative AI models to produce real-time system inspection reports, event analysis reports, resource usage reports, and system configuration item reports. Second, interactive AI lets models generate content through optimized prompts and link to RPA system tools, enabling command-driven conversational operations. Third, intelligent alerting and self-healing use analytical large models to intelligently analyze observable operational data and automatically remediate system alarms with automation tools.
Figure 2 shows an example of AIGC operations scenarios. By calling large-model API services at the MaaS layer, invoking enterprise-level tools, and tuning prompts, basic operational scenario requirements can be met. For instance, Q&A large models can support technical consulting and recommend how to handle operational tickets, while task-oriented large models can provide real-time monitoring information and automated task dispatching.
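As a minimal sketch of the conversational-operations pattern described above, the snippet below shows how a model prompted to emit constrained JSON could be mapped onto enterprise tool calls. The tool names, JSON schema, and handler bodies are illustrative assumptions, not real enterprise APIs.

```python
import json

# Hypothetical tool registry: maps an intent the model may emit to a tool
# handler. Real handlers would call RPA or automation endpoints.
TOOL_REGISTRY = {
    "query_metrics": lambda args: f"metrics for {args['host']}",
    "restart_service": lambda args: f"restart issued for {args['service']}",
}

def dispatch(model_output: str) -> str:
    """Parse the model's constrained JSON output and invoke the mapped tool."""
    call = json.loads(model_output)
    tool = TOOL_REGISTRY.get(call["tool"])
    if tool is None:
        return "unknown tool"
    return tool(call.get("args", {}))

# A model prompted to emit constrained JSON might return this string:
print(dispatch('{"tool": "restart_service", "args": {"service": "payment-gw"}}'))
```

Constraining the model to a fixed JSON schema is what makes the "command-driven" linkage safe: the dispatcher only ever executes tools it already knows.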

1. Multi-channel Information Collection to Build the Foundation of Operations Data. Before the troops move, provisions go first: data collection and integration is the foundation of all operations work, and the training phase of large models requires support from many types of data. The enterprise observability system integrates existing monitoring systems such as Zabbix, Prometheus, ELK, and SkyWalking, while the enterprise unified configuration center incorporates a variety of automation script tools. Together they enable multi-channel, high-efficiency data collection covering basic configuration, system and application operation logs, monitoring metrics, link information, operations knowledge bases, and more, effectively supporting the collection of operational data across tens of thousands of partitions.
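To make one of these collection channels concrete, the sketch below builds a Prometheus instant-query request and flattens the standard response into (labels, value) pairs. The endpoint URL is an assumed placeholder; the response shape follows Prometheus's documented HTTP API, and the sample body here is hand-written for illustration.

```python
import json
from urllib.parse import urlencode

# Assumed internal endpoint; replace with the real Prometheus address.
PROM_URL = "http://prometheus.example.internal/api/v1/query"

def build_query_url(promql: str) -> str:
    """Compose an instant-query URL for a PromQL expression."""
    return PROM_URL + "?" + urlencode({"query": promql})

def parse_instant_query(body: str):
    """Flatten a Prometheus instant-query JSON body into (labels, value) pairs."""
    data = json.loads(body)
    return [(r["metric"], float(r["value"][1]))
            for r in data["data"]["result"]]

# Hand-written sample response in the documented vector format.
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"instance":"app01"},"value":[1700000000,"0.42"]}]}}')
print(parse_instant_query(sample))
```

Normalizing every channel (Zabbix, ELK, traces) into such label/value pairs is what lets the downstream models consume heterogeneous data uniformly.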
The domain-specific large model is problem-oriented: by regularly analyzing and assessing the collected data, it optimizes preprocessing algorithms to reduce the interference of redundant data and highlight meaningful observable operational data. The domain model also generates optimization suggestions for existing rule-based or machine-learning alarm algorithms, promptly adjusting how alarms are generated, aggregated, and converged, which effectively improves alarm quality. Alarm-resolution strategies generated by the domain model are appended to early-warning notifications, delivering a "concierge-style" service of data aggregation and global presentation.
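One convergence rule of the kind the domain model might suggest can be sketched as follows: collapse repeated alarms from the same source within a time window into a single aggregated alarm with a count. The alarm field names and window size are illustrative assumptions.

```python
from collections import defaultdict

def converge(alarms, window_s=300):
    """Group alarms by (host, check, time window) and emit one record per group."""
    buckets = defaultdict(list)
    for a in alarms:
        key = (a["host"], a["check"], a["ts"] // window_s)
        buckets[key].append(a)
    return [{"host": h, "check": c, "count": len(v), "first_ts": v[0]["ts"]}
            for (h, c, _), v in sorted(buckets.items())]

raw = [
    {"host": "db01", "check": "cpu", "ts": 100},
    {"host": "db01", "check": "cpu", "ts": 160},   # duplicate within window
    {"host": "app01", "check": "disk", "ts": 120},
]
print(converge(raw))
```

The point of letting a model tune such rules is that the grouping key and window are exactly the parameters it can adjust when alarm quality drifts.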
2. Full-link Observability to Accurately Isolate System Failures. Comprehensive coverage, dynamic perception. The AIGC operations framework builds on the enterprise observability system and the enterprise unified configuration center, driven by domain large models and algorithm libraries, and conducts link-level aggregation analysis and fault diagnosis across dimensions such as system architecture, network topology, and applications. Key-path embedding and fault-chain coloring interconnect the vast amount of collected basic data; multiple rounds of model training and parameter tuning then establish a baseline profile of each application's operational state from multi-source historical data on the link. Real-time link data from business clusters is dynamically matched against this baseline profile and fault information is reported, making the business link context transparent, turning "link as a service" and "fault as discovery" into readily available capabilities, and providing an intelligent pair of eyes for precise fault isolation in systems and applications.
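The baseline-matching step can be illustrated with a deliberately simple statistical stand-in: learn a per-metric (mean, stdev) baseline from historical link data and flag live samples that deviate beyond a z-score threshold. The metric names, sample values, and threshold are assumptions; the production framework would use trained models rather than this toy statistic.

```python
import statistics

def build_baseline(history):
    """Per-metric (mean, stdev) profile from historical samples."""
    return {m: (statistics.mean(v), statistics.stdev(v))
            for m, v in history.items()}

def match(baseline, live, z_threshold=3.0):
    """Return live metrics deviating from the baseline beyond z_threshold."""
    anomalies = {}
    for m, value in live.items():
        mean, stdev = baseline[m]
        if stdev > 0 and abs(value - mean) / stdev > z_threshold:
            anomalies[m] = value
    return anomalies

history = {
    "latency_ms": [10, 11, 9, 10, 12, 10],
    "error_rate": [0.01, 0.02, 0.01, 0.02, 0.01, 0.015],
}
baseline = build_baseline(history)
print(match(baseline, {"latency_ms": 45, "error_rate": 0.012}))
```

Only the latency spike is reported; the error rate stays within its learned band, which is the "dynamic matching" behavior the text describes.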
Fallen petals are not without feeling; they turn into spring mud to nurture the flowers. After the fault-analysis model completes diagnosis over key fault chains, abnormal indicators, log data, and the historical event/problem knowledge base, the report-generation model autonomously reviews fault-chain logs, indicators, and related information, summarizes the issues, generates a fault summary report, and moves it into the pending-disposal library. Once verified by the expert system, the information is stored in the knowledge base and serves as foundational data for model optimization, continually enhancing the model's diagnostic capability.
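The report-generation step amounts to assembling the diagnostic evidence into a structured prompt for a summarization model. The sketch below stops at prompt assembly (the model call itself is out of scope); the field names, prompt wording, and sample evidence are all illustrative assumptions.

```python
# Template combining the evidence sources named in the text: fault chain,
# abnormal indicators, logs, and historical events. Wording is illustrative.
FAULT_REPORT_PROMPT = """You are an IT operations analyst.
Given the fault chain below, summarize root cause, impact, and a suggested fix.

Key fault chain: {chain}
Abnormal indicators: {indicators}
Related log excerpts: {logs}
Similar historical events: {history}
"""

def build_prompt(evidence: dict) -> str:
    """Render the evidence dict into the report-generation prompt."""
    return FAULT_REPORT_PROMPT.format(
        chain=" -> ".join(evidence["chain"]),
        indicators=", ".join(evidence["indicators"]),
        logs="; ".join(evidence["logs"]),
        history="; ".join(evidence["history"]),
    )

evidence = {
    "chain": ["gateway", "order-svc", "db01"],
    "indicators": ["db01 connection pool exhausted"],
    "logs": ["TimeoutError acquiring connection"],
    "history": ["2023-08 pool exhaustion after batch job"],
}
print(build_prompt(evidence))
```

Keeping the evidence structured like this is also what makes expert verification tractable: each section of the generated report can be traced back to a named source.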
3. Change Implementation Control for Efficient and Agile Process Management. Guided by the principles of "safety, agility, and efficiency," changes, emergencies, and service requests are simplified and made flexible to establish lightweight, convenient processes suited to cloud-native environments. General-purpose large models, vertical-domain large models, and robotic process automation (RPA) tools can automate change configuration and deployment. The model collects and analyzes system configuration information, environmental requirements, and application characteristics to generate configuration files suited to a specific environment and application, and then validates those files. Based on system configuration information and deployment strategies, it generates an automated deployment plan detailing steps, sequence, and dependencies; automation tools then distribute the configuration files and deploy the applications according to that plan. After deployment or change completion, validation steps are executed, and if issues are found the system automatically restores the previous available state according to pre-defined rollback strategies, so that intelligent change processes and implementations proceed as naturally as water finding its course.
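The generate-deploy-validate-rollback loop described above can be sketched as a small control function. The step names, the injected handlers, and the simulated smoke-test failure are assumptions; in practice the handlers would invoke the enterprise automation tools.

```python
def deploy_with_rollback(plan, execute, validate, rollback):
    """Run ordered deployment steps; on a failed validation, undo in reverse."""
    done = []
    for step in plan:
        execute(step)
        done.append(step)
        if not validate(step):
            for s in reversed(done):
                rollback(s)  # restore the previous available state
            return "rolled back"
    return "deployed"

plan = ["distribute-config", "deploy-app", "smoke-test"]
log = []
result = deploy_with_rollback(
    plan,
    execute=lambda s: log.append(("exec", s)),
    validate=lambda s: s != "smoke-test",   # simulate a failing smoke test
    rollback=lambda s: log.append(("undo", s)),
)
print(result, log)
```

Undoing in reverse order is the key property: it respects the step dependencies recorded in the generated deployment plan.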
The intelligent change process is illustrated in Figure 3. Given the user's description of the change, the general large model decomposes the change into tasks and distributes them to domain large models for concurrent execution. By constraining the large model's output, the framework links different enterprise tools, such as databases, search tools, and RPA systems, for real-time information retrieval and task execution. Upon task completion, the relevant information is fed back to the change owner or recorded for future review.
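The decompose-and-dispatch pattern in Figure 3 can be sketched as below: the general model's decomposition is stubbed as a fixed task list, and the domain handlers run concurrently via a thread pool. Handler names and the decomposition are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the domain large models; real handlers would call model APIs
# and enterprise tools (databases, search, RPA).
DOMAIN_HANDLERS = {
    "config": lambda task: f"config updated: {task}",
    "deploy": lambda task: f"deployed: {task}",
    "verify": lambda task: f"verified: {task}",
}

def decompose(change_request: str):
    """Stub for the general model's task decomposition into (domain, task) pairs."""
    return [("config", change_request),
            ("deploy", change_request),
            ("verify", change_request)]

def run_change(change_request: str):
    """Dispatch sub-tasks to domain handlers concurrently; collect results in order."""
    subtasks = decompose(change_request)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(DOMAIN_HANDLERS[d], t) for d, t in subtasks]
        return [f.result() for f in futures]

print(run_change("upgrade payment-gw to v2.3"))
```

The collected results are what would be fed back to the change owner or written to the review record.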

Challenges and Issues
Conclusion and Outlook
Highly Recommended by “China Information Security” Magazine
