Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

A system—whether an organism, enterprise, company, or computer program—spends effort feeding back past events to the present because it is a more economical approach for the system to respond to the future. To foresee the future, one must understand the past..

— Kevin Kelly, “Out of Control”

This statement also applies to the field of database operations and maintenance. Traditional monitoring systems have matured in collecting, storing, and displaying vast amounts of server/operating system/database metrics, allowing DBAs to grasp the “current” operational status in real-time. However, these systems have historically been insufficient in analyzing “the past,” particularly regarding the analysis of “valuable” historical data and the correlation between metrics. This has led to their use primarily for “post-event firefighting,” and the tools available for identifying root causes have been limited, relying heavily on the DBA’s own skills and experience.

With the development of technology and the increasing complexity of database environments across various industries, CloudFun Technology has been exploring smarter and end-to-end solutions in the operations and maintenance field. Below, we will introduce some capabilities of the new version of the AgileX database monitoring and diagnostic system through several typical scenarios in operations and maintenance.

01

Root Cause Analysis of Metric Fluctuations

Heinrich’s Law(Heinrich’s law) states that behind every serious accident, there are invariably 29 minor accidents and 300 near misses, as well as 1000 accident hazards. This also applies to database operations and maintenance scenarios; the occurrence of any major accident is not coincidental, but rather an inevitable result of the accumulation and evolution of various small problems.

Timely investigation and handling of accident signs and symptoms is a key method to prevent major accidents. Traditional monitoring systems generally set alarm thresholds for a limited number of metrics, notifying DBAs with alarms when thresholds are exceeded, which is a relatively simplistic and delayed approach, lacking the ability to identify risks in advance. To address this pain point, AgileX analyzes historical data of metrics, identifies data characteristics, and can recognize potential risks earlier. By correlating metrics and drilling down, it provides DBAs with auxiliary intelligent analysis capabilities based on traditional alarm functions.

QPS is one of the important metrics for measuring database performance. The following chart shows the data trend of a customer’s database instance QPS over a week.

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

AgileX automatically analyzes data and records data characteristics, creating a load profile for the database instance. Based on this “load profile,” AgileX detected a stability change in QPS on December 12 (an abnormal increase or decrease outside the reasonable range, hereafter referred to as an anomaly). Due to the large alarm threshold settings, no alarm was triggered.
Application of Statistical Science and Machine Learning in AgileX Database Monitoring System
Anomaly indicates a potential problem may arise. AgileX continues to analyze, capturing the corresponding time-window time series dataset, and based on the data distribution characteristics, selects related metrics from other metrics (including but not limited to positive correlation, negative correlation, time-lag correlation, etc.):
  • Through a sliding time window, 30 minutes prior, the proportion of the wait event “enq: TM – contention” showed an anomaly;

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

  • Similarly, the execution count of DML statements also showed an anomaly;
Application of Statistical Science and Machine Learning in AgileX Database Monitoring System
  • Correlating with the TopSQL list, the proportion of DELETE/UPDATE/INSERT statements was relatively high during this time window;

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

At this point, based on the above logical chain, AgileX infers that lock contention caused the QPS anomaly.
  • Entering the blocking session page, one can view the blocking and blocked relationships between sessions through a tree diagram, quickly identifying the source of the blockage;
Application of Statistical Science and Machine Learning in AgileX Database Monitoring System
  • Further comparing SQL texts, it was found that these DML statements all operated on the same business table;

Thus, the root cause of the failure has been basically restored: Concurrent DML operations on the same table caused lock contention, leading to a backlog of numerous DML and query statements, reducing database performance. Once lock resources are released, the backlog of statements can be executed, resulting in a QPS anomaly. Similar issues can easily be overlooked in traditional operations and maintenance models that rely on alerts. AgileX’s data analysis capabilities can assist DBAs in identifying metric anomalies in advance and attempt to analyze related data to pinpoint root causes, preventing problems from worsening, effectively supplementing traditional alert functions.

02

Capacity Management

In addition to performance issues, space capacity management is also one of the daily tasks of DBAs. However, as the complexity of database systems and the scale of data continue to increase, tablespace management faces more and more challenges. Expansion plans are often slow to respond; index adjustments or partition changes always worry about affecting business… how to make effective decisions in a data-driven manner becomes a key issue.

The following chart shows the capacity changes of a certain customer’s database over the past half month:

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

Traditional database monitoring management systems typically list tablespace-related information (such as total capacity, used capacity, remaining capacity, usage percentage, etc.).

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

However, customers still “complain” that capacity management always consumes considerable manpower for data comparison and analysis. Therefore, based on this pain point, AgileX provides more “intelligent” capacity management capabilities. It continues to learn historical data characteristics, predicting the remaining available days for each tablespace, providing data support for early expansion; offering “correlation” information, identifying the tablespace that contributes the most to capacity changes, making space cleanup more targeted; at the same time, identifying tablespace capacity anomalies, as abnormal increases or decreases may indicate business changes or potential risks. As shown in the following figure:

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

In this example, AgileX detected that the second largest tablespace B (real name omitted for security reasons) contributed the most to the total capacity change, as shown below:

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

The capacity change of tablespace B is correlated with the total capacity change at 62%

Therefore, in addition to listing basic tablespace capacity information, AgileX can also provide the following suggestions:

  • Tablespace A has the largest used capacity; it is recommended to prioritize analysis and cleanup of this tablespace;
  • Tablespace B has the highest correlation with total capacity changes; it is recommended to “focus” on the data growth of this tablespace;

  • Tablespace C and D are expected to have available days of less than 7 days; it is recommended to expand or clean up data as soon as possible;

Through AgileX’s more “intelligent” tablespace capacity management capabilities, DBAs can save a significant amount of time in space management scenarios.

03

Alarm Threshold Recommendations

Alarms are the basic capabilities of monitoring products. In addition to providing conventional functions, AgileX also attempts to solve another common problem in operations and maintenance scenarios.

Due to the varying performance of infrastructure across different database instances and the differing scales/types of business they carry, a universal alarm configuration template often struggles to meet the personalized needs of different instances. This leads to DBAs spending effort manually adjusting these thresholds.

For example, consider the number of active sessions. The following chart shows the historical trend of the active session count metric for a customer’s database instance, fluctuating around 1200.

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

Due to using the default template, the alarm threshold for the session count metric was set to 5000, which may cause DBAs to overlook potential risks. AgileX, based on historical session count analysis, identifies data characteristics and provides recommendations on the 【Intelligent Baseline】 and 【Alarm Configuration】 pages: The session count alarm threshold is too high; recommended threshold: 1800

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

This can help DBAs effectively and scientifically adjust the alarm thresholds for various key metrics.

In most industries, the technical reserves of operations and maintenance teams struggle to keep pace with the ever-evolving database environment. Especially in the financial sector, there are many users who use more than five databases, with instance counts exceeding 200. Traditional operations and maintenance models are increasingly unable to meet the needs of operations and maintenance teams. However, with the development of technology, capabilities such as root cause analysis, trend prediction, and threshold recommendation are not only possible but necessary. CloudFun Technology’s AgileX database monitoring and diagnostic system, based on statistical science and machine learning techniques, can intelligently analyze existing monitoring data, discover data characteristics, profile database instances, and quickly identify potential risks, thereby assisting system and database experts in improving the likelihood of pinpointing root causes and operational efficiency, allowing technical personnel to focus on more complex and creative tasks, which may be the true meaning of technology.

About CloudFun Technology

Zhejiang CloudFun Network Technology Co., Ltd. (abbreviated: CloudFun Technology) is a professional comprehensive service provider for databases. The CloudFun team has accumulated years of experience in database, cloud computing, and other fields, independently developing enterprise-level products such as database security control, intelligent operations and maintenance, SQL code auditing, and providing one-stop professional services such as domestic database consulting, design, implementation, and operations management.

We hope to help enterprises continuously improve the security compliance level and operational efficiency of their database systems in the process of advancing toward multi-cloud strategies and digital reforms through the concept of “building data security and creating data value” and the model of “product + service,” unleashing the imagination of the data-driven era.

Previous Articles of Interest

Helping Hangzhou Bank Build a Safer and More Convenient Data Change Process

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

Technical Sharing | Linux Performance Optimization

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

OB Enterprise Edition Data Synchronization OMS Deployment Practice

Application of Statistical Science and Machine Learning in AgileX Database Monitoring System
Application of Statistical Science and Machine Learning in AgileX Database Monitoring System

Leave a Comment