By Wang Wanting, Fu Hui, Yan Xiaofei, Chen Ruoyu, Bank of China Software Center
With the rapid evolution of distributed architecture and the steady adoption of cloud-native technologies, the IT systems of large commercial banks now combine agile and stable characteristics. Under a complex architecture in which on-cloud and off-cloud operations run in parallel and centralized and distributed systems coexist, the demand for more efficient and agile IT operations keeps growing. In recent years, the rapid development of artificial intelligence (AI), and in particular the emergence of AI-generated content (AIGC) technologies, has become a catalyst for efficient IT system operations. AIGC has given rise to a rich matrix of capabilities (text generation, intelligent analysis, content recommendation, and more) and has demonstrated powerful performance on complex tasks. The Bank of China Software Center has been actively exploring the application of AIGC in the operations field, focusing on service applications and building a large-model framework for operations, with the goals of intelligent fault resolution and performance optimization of IT systems, thereby improving operational efficiency and providing strong technical support for stable business operations.
AIGC Operations System

Under the AIGC operations framework, how can true integrated operations be achieved with existing enterprise-level IT systems? First, intelligent information generation links existing enterprise databases, analysis systems, and other sources through generative AI models to produce real-time system inspection reports, event analysis reports, resource usage reports, and system configuration item reports. Second, interactive AI lets models generate content through optimized prompts and link to RPA system tools, enabling command-driven conversational operations. Third, intelligent alerting and self-healing use analytical large models to intelligently analyze observable operational data and automatically remediate system alarms with automation tools.
Figure 2 shows an example of AIGC operations scenarios. By calling large-model API services at the MaaS layer, invoking enterprise-level tools, and tuning prompts, basic operational scenario requirements can be met. For instance, Q&A large models can support technical consulting and recommend how to handle operational tickets, while task-oriented large models can provide real-time monitoring information and automated task dispatching.
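As a minimal sketch of the conversational-operations pattern described above, the snippet below shows how a model prompted to emit constrained JSON could be mapped onto enterprise tool calls. The tool names, JSON schema, and handler bodies are illustrative assumptions, not real enterprise APIs.

```python
import json

# Hypothetical tool registry: maps an intent the model may emit to a tool
# handler. Real handlers would call RPA or automation endpoints.
TOOL_REGISTRY = {
    "query_metrics": lambda args: f"metrics for {args['host']}",
    "restart_service": lambda args: f"restart issued for {args['service']}",
}

def dispatch(model_output: str) -> str:
    """Parse the model's constrained JSON output and invoke the mapped tool."""
    call = json.loads(model_output)
    tool = TOOL_REGISTRY.get(call["tool"])
    if tool is None:
        return "unknown tool"
    return tool(call.get("args", {}))

# A model prompted to emit constrained JSON might return this string:
print(dispatch('{"tool": "restart_service", "args": {"service": "payment-gw"}}'))
```

Constraining the model to a fixed JSON schema is what makes the "command-driven" linkage safe: the dispatcher only ever executes tools it already knows.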

1. Multi-channel Information Collection to Build the Foundation of Operations Data. Before the troops move, provisions go first: data collection and integration is the foundation of all operations work, and the training phase of large models requires support from many types of data. The enterprise observability system integrates existing monitoring systems such as Zabbix, Prometheus, ELK, and SkyWalking, while the enterprise unified configuration center incorporates a variety of automation script tools. Together they enable multi-channel, high-efficiency data collection covering basic configuration, system and application operation logs, monitoring metrics, link information, operations knowledge bases, and more, effectively supporting the collection of operational data across tens of thousands of partitions.
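To make one of these collection channels concrete, the sketch below builds a Prometheus instant-query request and flattens the standard response into (labels, value) pairs. The endpoint URL is an assumed placeholder; the response shape follows Prometheus's documented HTTP API, and the sample body here is hand-written for illustration.

```python
import json
from urllib.parse import urlencode

# Assumed internal endpoint; replace with the real Prometheus address.
PROM_URL = "http://prometheus.example.internal/api/v1/query"

def build_query_url(promql: str) -> str:
    """Compose an instant-query URL for a PromQL expression."""
    return PROM_URL + "?" + urlencode({"query": promql})

def parse_instant_query(body: str):
    """Flatten a Prometheus instant-query JSON body into (labels, value) pairs."""
    data = json.loads(body)
    return [(r["metric"], float(r["value"][1]))
            for r in data["data"]["result"]]

# Hand-written sample response in the documented vector format.
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"instance":"app01"},"value":[1700000000,"0.42"]}]}}')
print(parse_instant_query(sample))
```

Normalizing every channel (Zabbix, ELK, traces) into such label/value pairs is what lets the downstream models consume heterogeneous data uniformly.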
The domain-specific large model is problem-oriented: by regularly analyzing and assessing the collected data, it optimizes preprocessing algorithms to reduce the interference of redundant data and highlight meaningful observable operational data. The domain model also generates optimization suggestions for existing rule-based or machine-learning alarm algorithms, promptly adjusting how alarms are generated, aggregated, and converged, which effectively improves alarm quality. Alarm-resolution strategies generated by the domain model are appended to early-warning notifications, delivering a "concierge-style" service of data aggregation and global presentation.
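One convergence rule of the kind the domain model might suggest can be sketched as follows: collapse repeated alarms from the same source within a time window into a single aggregated alarm with a count. The alarm field names and window size are illustrative assumptions.

```python
from collections import defaultdict

def converge(alarms, window_s=300):
    """Group alarms by (host, check, time window) and emit one record per group."""
    buckets = defaultdict(list)
    for a in alarms:
        key = (a["host"], a["check"], a["ts"] // window_s)
        buckets[key].append(a)
    return [{"host": h, "check": c, "count": len(v), "first_ts": v[0]["ts"]}
            for (h, c, _), v in sorted(buckets.items())]

raw = [
    {"host": "db01", "check": "cpu", "ts": 100},
    {"host": "db01", "check": "cpu", "ts": 160},   # duplicate within window
    {"host": "app01", "check": "disk", "ts": 120},
]
print(converge(raw))
```

The point of letting a model tune such rules is that the grouping key and window are exactly the parameters it can adjust when alarm quality drifts.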
2. Full-link Observability to Accurately Isolate System Failures. Comprehensive coverage, dynamic perception. The AIGC operations framework builds on the enterprise observability system and the enterprise unified configuration center, driven by domain large models and algorithm libraries, and conducts link-level aggregation analysis and fault diagnosis across dimensions such as system architecture, network topology, and applications. Key-path embedding and fault-chain coloring interconnect the vast amount of collected basic data; multiple rounds of model training and parameter tuning then establish a baseline profile of each application's operational state from multi-source historical data on the link. Real-time link data from business clusters is dynamically matched against this baseline profile and fault information is reported, making the business link context transparent, turning "link as a service" and "fault as discovery" into readily available capabilities, and providing an intelligent pair of eyes for precise fault isolation in systems and applications.
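The baseline-matching step can be illustrated with a deliberately simple statistical stand-in: learn a per-metric (mean, stdev) baseline from historical link data and flag live samples that deviate beyond a z-score threshold. The metric names, sample values, and threshold are assumptions; the production framework would use trained models rather than this toy statistic.

```python
import statistics

def build_baseline(history):
    """Per-metric (mean, stdev) profile from historical samples."""
    return {m: (statistics.mean(v), statistics.stdev(v))
            for m, v in history.items()}

def match(baseline, live, z_threshold=3.0):
    """Return live metrics deviating from the baseline beyond z_threshold."""
    anomalies = {}
    for m, value in live.items():
        mean, stdev = baseline[m]
        if stdev > 0 and abs(value - mean) / stdev > z_threshold:
            anomalies[m] = value
    return anomalies

history = {
    "latency_ms": [10, 11, 9, 10, 12, 10],
    "error_rate": [0.01, 0.02, 0.01, 0.02, 0.01, 0.015],
}
baseline = build_baseline(history)
print(match(baseline, {"latency_ms": 45, "error_rate": 0.012}))
```

Only the latency spike is reported; the error rate stays within its learned band, which is the "dynamic matching" behavior the text describes.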
Fallen petals are not without feeling; they turn into spring mud to nurture the flowers. After the fault-analysis model completes diagnosis over key fault chains, abnormal indicators, log data, and the historical event/problem knowledge base, the report-generation model autonomously reviews fault-chain logs, indicators, and related information, summarizes the issues, generates a fault summary report, and moves it into the pending-disposal library. Once verified by the expert system, the information is stored in the knowledge base and serves as foundational data for model optimization, continually enhancing the model's diagnostic capability.
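The report-generation step amounts to assembling the diagnostic evidence into a structured prompt for a summarization model. The sketch below stops at prompt assembly (the model call itself is out of scope); the field names, prompt wording, and sample evidence are all illustrative assumptions.

```python
# Template combining the evidence sources named in the text: fault chain,
# abnormal indicators, logs, and historical events. Wording is illustrative.
FAULT_REPORT_PROMPT = """You are an IT operations analyst.
Given the fault chain below, summarize root cause, impact, and a suggested fix.

Key fault chain: {chain}
Abnormal indicators: {indicators}
Related log excerpts: {logs}
Similar historical events: {history}
"""

def build_prompt(evidence: dict) -> str:
    """Render the evidence dict into the report-generation prompt."""
    return FAULT_REPORT_PROMPT.format(
        chain=" -> ".join(evidence["chain"]),
        indicators=", ".join(evidence["indicators"]),
        logs="; ".join(evidence["logs"]),
        history="; ".join(evidence["history"]),
    )

evidence = {
    "chain": ["gateway", "order-svc", "db01"],
    "indicators": ["db01 connection pool exhausted"],
    "logs": ["TimeoutError acquiring connection"],
    "history": ["2023-08 pool exhaustion after batch job"],
}
print(build_prompt(evidence))
```

Keeping the evidence structured like this is also what makes expert verification tractable: each section of the generated report can be traced back to a named source.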
3. Change Implementation Control for Efficient and Agile Process Management. Guided by the principles of "safety, agility, and efficiency," changes, emergencies, and service requests are simplified and made flexible to establish lightweight, convenient processes suited to cloud-native environments. General-purpose large models, vertical-domain large models, and robotic process automation (RPA) tools can automate change configuration and deployment. The model collects and analyzes system configuration information, environmental requirements, and application characteristics to generate configuration files suited to a specific environment and application, and then validates those files. Based on system configuration information and deployment strategies, it generates an automated deployment plan detailing steps, sequence, and dependencies; automation tools then distribute the configuration files and deploy the applications according to that plan. After deployment or change completion, validation steps are executed, and if issues are found the system automatically restores the previous available state according to pre-defined rollback strategies, so that intelligent change processes and implementations proceed as naturally as water finding its course.
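The generate-deploy-validate-rollback loop described above can be sketched as a small control function. The step names, the injected handlers, and the simulated smoke-test failure are assumptions; in practice the handlers would invoke the enterprise automation tools.

```python
def deploy_with_rollback(plan, execute, validate, rollback):
    """Run ordered deployment steps; on a failed validation, undo in reverse."""
    done = []
    for step in plan:
        execute(step)
        done.append(step)
        if not validate(step):
            for s in reversed(done):
                rollback(s)  # restore the previous available state
            return "rolled back"
    return "deployed"

plan = ["distribute-config", "deploy-app", "smoke-test"]
log = []
result = deploy_with_rollback(
    plan,
    execute=lambda s: log.append(("exec", s)),
    validate=lambda s: s != "smoke-test",   # simulate a failing smoke test
    rollback=lambda s: log.append(("undo", s)),
)
print(result, log)
```

Undoing in reverse order is the key property: it respects the step dependencies recorded in the generated deployment plan.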
The intelligent change process is illustrated in Figure 3. Given the user's description of the change, the general large model decomposes the change into tasks and distributes them to domain large models for concurrent execution. By constraining the large model's output, the framework links different enterprise tools, such as databases, search tools, and RPA systems, for real-time information retrieval and task execution. Upon task completion, the relevant information is fed back to the change owner or recorded for future review.
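The decompose-and-dispatch pattern in Figure 3 can be sketched as below: the general model's decomposition is stubbed as a fixed task list, and the domain handlers run concurrently via a thread pool. Handler names and the decomposition are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the domain large models; real handlers would call model APIs
# and enterprise tools (databases, search, RPA).
DOMAIN_HANDLERS = {
    "config": lambda task: f"config updated: {task}",
    "deploy": lambda task: f"deployed: {task}",
    "verify": lambda task: f"verified: {task}",
}

def decompose(change_request: str):
    """Stub for the general model's task decomposition into (domain, task) pairs."""
    return [("config", change_request),
            ("deploy", change_request),
            ("verify", change_request)]

def run_change(change_request: str):
    """Dispatch sub-tasks to domain handlers concurrently; collect results in order."""
    subtasks = decompose(change_request)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(DOMAIN_HANDLERS[d], t) for d, t in subtasks]
        return [f.result() for f in futures]

print(run_change("upgrade payment-gw to v2.3"))
```

The collected results are what would be fed back to the change owner or written to the review record.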

Challenges and Issues
Conclusion and Outlook
Highly Recommended by “China Information Security” Magazine
