Research on Financial Fraud Prediction Model of Listed Companies Based on XGBoost

Author Introduction

Zhou Weihua, Institute of Digital Finance, Chinese Academy of Fiscal Sciences

Zhai Xiaofeng, Institute of Digital Finance, Chinese Academy of Fiscal Sciences

Tan Haowei, Institute of Digital Finance, Chinese Academy of Fiscal Sciences

Research Background

In recent years, the Central Committee of the Communist Party of China and the State Council have attached great importance to enforcing financial discipline, strengthening financial supervision, regulating the order of financial auditing, improving the quality of accounting information, and curbing financial fraud. However, traditional fraud detection methods rely on simple analysis and modeling of financial statement data, which makes it difficult to uncover fraud at listed companies. In the existing literature, both domestic and international, mainstream research concentrates on the characteristic factors that influence financial fraud. The statistical models used are mainly Logistic and linear regression models, and the research objective is the explanatory power of these factors rather than out-of-sample predictive power. With the development of artificial intelligence, applying big data and machine learning methods to build financial fraud prediction models has attracted great interest in the academic community.

The goal of this study is to construct a financial fraud prediction model for listed companies using big data and machine learning methods, based on the financial and non-financial data disclosed by Chinese listed companies. In constructing the model, we focus on four key issues. First, the learning sample is divided into a fraud sample set and a non-fraud sample set. The fraud sample set is relatively easy to construct from penalties issued by the China Securities Regulatory Commission, while the non-fraud sample set is prone to selection bias because of interference from gray samples. Second, feature selection must take both domain knowledge and statistical properties into account to avoid the "curse of dimensionality" and overfitting, either of which could weaken the model's generalization ability. Third, a balance must be struck between fit on the learning sample and generalization ability, so that the model adapts well out of sample and delivers genuinely good predictive performance. Finally, to address the "black box" nature and weak interpretability of machine learning models, we improve the model's interpretability through the SHAP method and visualized feature importance analysis.
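The idea behind SHAP-style attribution mentioned above can be illustrated with a minimal sketch. It computes exact Shapley values for a toy scoring function by enumerating feature orderings, then averages absolute values across samples as one common form of global attribution. Everything here is an assumption for illustration: `toy_score`, its coefficients, the feature values, and the all-zero baseline are hypothetical, and this brute-force approach is not the tree-specific algorithm that SHAP tooling typically applies to XGBoost models, nor the authors' implementation.

```python
from itertools import permutations


def toy_score(features):
    """Hypothetical fraud score; the coefficients are invented for illustration."""
    pledge_ratio, related_txn, roa = features
    return (0.5 * pledge_ratio + 0.3 * related_txn
            + 0.2 * pledge_ratio * related_txn - 0.4 * roa)


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to a single baseline.

    Averages each feature's marginal contribution over every ordering in
    which features are switched from the baseline value to the value in x.
    Cost grows factorially, so this is only usable for a few features.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orders) for p in phi]


def global_importance(attributions):
    """Mean absolute attribution per feature across samples."""
    n = len(attributions)
    return [sum(abs(row[j]) for row in attributions) / n
            for j in range(len(attributions[0]))]


# Hypothetical samples: (equity pledge ratio, related-transaction reliance, ROA)
samples = [[1.0, 1.0, 0.5], [0.2, 0.0, 1.0]]
baseline = [0.0, 0.0, 0.0]

local = [shapley_values(toy_score, x, baseline) for x in samples]
for x, phi in zip(samples, local):
    # Local attribution: contributions sum to score(x) - score(baseline)
    print(x, [round(p, 3) for p in phi])
print("global:", [round(g, 3) for g in global_importance(local)])
```

Local attribution explains one company's score, while the global view ranks features across the whole sample. The interaction term in the toy model is split evenly between the two features involved, which is the behavior Shapley values guarantee.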

Main Findings

Big data and machine learning methods provide a research framework for predicting financial fraud at listed companies. This paper uses the XGBoost method to construct a financial fraud prediction model, called Xscore, that suits the characteristics of Chinese listed companies. We use A-share listed companies from 2000 to 2020 as the research sample to construct training, test, and out-of-sample (prediction) sets, and select a system of feature indicators that characterizes each company from four aspects: corporate governance, financial supervision, financial indicators, and corporate operations. Through feature aggregation, feature combination, and feature discretization, we expand, filter, and reduce the feature set to arrive at the 27 optimal feature indicators, then train the Xscore model with the XGBoost algorithm.

Model comparison, analysis, and application lead to the following main conclusions. First, the Xscore model significantly outperforms the traditional financial fraud prediction models Cscore and Fscore. Xscore has better precision, recall, separation of positive and negative samples, and stability. In out-of-sample prediction it achieves an overall recall of 79%, a precision of 42%, an AUC of 0.90, a KS statistic of 0.65, and a population stability index (PSI) of 0.09. Second, using the SHAP method, we visualize feature contributions in the Xscore model and improve its interpretability through global and local attribution, so that the model is both strong in prediction and interpretable. Third, by defining and applying an Xscore scorecard, we make the model more usable, and combining the scorecard with the SHAP method makes its predictions more intuitive. Empirical tests show that the fraud hit rate in the top 5% high-score segment is as high as 78%, and that the top 20% high-score segment captures 64% of the fraud samples in the overall sample. In other words, rejecting the highest-scoring 20% of samples filters out 64% of the fraud samples.
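For readers unfamiliar with the evaluation measures reported above, the following is a minimal sketch of how a KS statistic, a PSI, and the hit and capture rates of a top-scoring segment are commonly computed. The function names, the toy scores and labels, and the bin edges are assumptions made for illustration; the paper's exact binning and sample definitions are not given in this summary.

```python
from math import log


def ks_statistic(scores, labels):
    """Largest gap between the cumulative shares of fraud (label 1) and
    non-fraud (label 0) samples when ranked from highest to lowest score."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume all samples tied at this score before measuring the gap
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            cum_pos += ranked[j][1]
            cum_neg += 1 - ranked[j][1]
            j += 1
        best = max(best, abs(cum_pos / pos - cum_neg / neg))
        i = j
    return best


def top_segment_stats(scores, labels, fraction):
    """Return (hit_rate, capture_rate) for the highest-scoring `fraction`.

    hit_rate: share of the segment that is fraud.
    capture_rate: share of all fraud samples that fall in the segment.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k, hits / sum(labels)


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two score samples, using the
    given interior bin edges; empty bins are floored at eps."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


# Toy data for illustration only (not from the paper)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print("KS:", ks_statistic(scores, labels))
print("top 20% (hit, capture):", top_segment_stats(scores, labels, 0.2))
print("PSI vs. itself:", psi(scores, scores, edges=[0.25, 0.5, 0.75]))
```

The capture rate corresponds to the statement that rejecting the top 20% of scores filters out a given share of fraud samples, while the hit rate corresponds to the fraud share within a segment. A PSI near zero indicates that the score distribution has barely shifted between the two samples; by a common rule of thumb, values below roughly 0.1 are treated as stable.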

Policy Implications

1. Promote corporate accounting information quality supervision based on big data.

Corporate accounting information is not only a "barometer" of economic development but also an important basis for macroeconomic decision-making. High-quality accounting information is the foundation of high-quality economic development. Traditional supervision of accounting information quality is largely limited to inspecting part of the accounting data of a specific enterprise for a specific year. In the digital economy, data, algorithms, and computing power have become core technologies. Supervising corporate accounting information quality with a big data mindset and strengthening digital financial supervision is the direction of future development. By building accounting big data at the national, industry, and enterprise levels, we can extend supervision of accounting information quality along spatial, temporal, and structural dimensions, and use algorithms and computing power to strengthen that supervision, further improving the quality of corporate accounting information in China.

2. Promote big data auditing to enhance the quality of CPA practice.

The "Accounting Informationization Development Plan (2021-2025)" proposes to gradually achieve the goals of remote auditing, big data auditing, and intelligent auditing. With unstructured accounting data increasing significantly and auditing work growing more urgent, the traditional auditing model, which relies mainly on checking accounts, faces severe challenges. The profession should use big data and artificial intelligence technologies, replace manual audit risk assessments with machine learning models, adopt a big data auditing mindset, expand audit coverage, improve audit efficiency, and establish a comprehensive, authoritative, and efficient audit supervision system.

3. Strengthen the disclosure of information on equity pledges and related transactions.

Our global attribution analysis of the Xscore model found that the proportion of equity pledged and the degree of reliance on related transactions are important feature variables for identifying financial fraud. We recommend that regulators give close attention to revising and improving the policies governing disclosure of equity pledges and related transactions, requiring enterprises to provide timely, full, and comprehensive information on them, including pledge ratios, pledge periods, uses of funds, transaction prices, fair values, and matters of market concern. Continuously improving the quality and transparency of information disclosure can guide all sectors of society to strengthen external supervision, deter motives for corporate financial fraud, and control corporate financial fraud risk.

Original Publication

Published in: Quantitative Economics and Technical Economics Research, Issue 7, 2022

Research on Financial Fraud Prediction Model of Listed Companies Based on XGBoost

Editorial Department of Quantitative Economics and Technical Economics Research

WeChat ID: jqte-cass
