?p=34635

Analyst: Lingzi Lu

Customer churn is a serious problem that exists in various industries, and it has also attracted the attention of numerous telecom service providers—because the cost of acquiring a new customer far exceeds the cost of retaining an existing customer（Click “Read the Original” at the end for complete data）。

Related Video

Therefore, it is crucial to explore relevant indicators that may significantly impact customer churn and to predict the future based on existing information. This can help service providers identify potential risks of customer churn and propose targeted plans, thereby effectively improving customer retention rates. The key challenge of the project lies in identifying the most influential factors on customer churn and evaluating various machine learning models to select the best one for prediction.

Solution

Task/Objective

Explore relevant indicators that may significantly impact customer churn; select suitable machine learning models for prediction.

Data Source

The telecom customer churn dataset used for analysis（See the end for free access to the data） covers service categories, demographic characteristics, and account information of over 7,000 telecom business users. The dataset contains a total of 20 variables. The dependent variable is the customer churn indicator, recording whether a customer has churned in the last month. The other 19 independent variables describe customers from multiple perspectives.

To better understand these characteristics and analyze their relationship with customer churn, I categorized them into three groups. As shown in the figure below, the first category describes the basic characteristics of customers, such as gender, whether they are elderly, and whether they have partners or relatives. These features are closely related to the user’s daily life and may influence their churn behavior. The second category pertains to the characteristics of telecom contracts. This category describes contract duration, whether paperless billing is used, payment methods, and numerical features such as monthly fees and total costs. These features reflect customer choices related to telecom contracts and play a crucial role in customer churn. For example, customers who choose long-term contracts may have a lower chance of churning. The final category focuses on service functions, highlighting indicators related to telephone and internet services.

Exploratory Data Analysis

Exploratory data analysis was conducted through visualizing categorical variables with stacked bar charts and studying the distribution of numerical variables, box plot visualization, and correlation calculations. Based on this, features significantly impacting the target churn variable were selected. Regarding categorical variables, first, elderly customers without partners or relatives have a higher churn rate. Second, customers who choose fiber optic network services over other network services have a higher churn rate. Third, customers with monthly contracts, paperless billing, and electronic check payment methods have a higher churn rate. For numerical variables, customers with shorter total tenure and lower total payments are more likely to leave. Charges are positively correlated with total tenure. The findings provide valuable references for machine learning research and have certain commercial application value.

Data Processing

All categorical variables were converted to dummy variables, and all numerical variables underwent z-normalization to eliminate differences in magnitude between different variables.

Construction

Before building the model, the data was randomly split into a training set and a testing set, with a ratio of 3:1; during model tuning, 5-fold cross-validation was used.

Modeling

KNN

The K-Nearest Neighbors (KNN) classifier is one of the most classic and intuitive methods in supervised learning. The concept of the nearest point is defined by Euclidean distance, so it is necessary to transform features into a standard range to avoid the dominance of some large-range features. Since the data has already undergone z-normalization, this will not be an issue for the model.

During parameter adjustment, different values of K from 1 to 100 were tried. The prediction accuracy of 5-fold cross-validation is shown in the figure below. Ultimately, the optimal K was selected as 66.

Naive Bayes

Logistic Regression

I performed backward selection, forward selection, and stepwise selection on features through 5-fold cross-validation, where the accuracy shown in the table is calculated by averaging performance on the validation dataset. Based on their performance, stepwise selection was ultimately chosen as the model.

According to the p-values of the logistic regression coefficients, total tenure, monthly fee, and telephone service are the top three significant variables. Total tenure and telephone service have a negative impact on customer churn, while monthly fee has a positive impact. In other words, if a new customer has a short total tenure, lacks telephone service, and pays a high monthly fee, they are likely to churn. This is intuitive, as long-term customers are more loyal. Moreover, without telephone service, customers’ reliance on a specific telecom service provider is not strong. Additionally, high monthly fees may deter customers from continuing to use a telecom company’s services.

LDA/QDA

I implemented two forms of discriminant analysis: Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). The former is more flexible than the latter but does not require estimating too many parameters.

Random Forest

A forest is built randomly, consisting of many decision trees, where each decision tree in the random forest is uncorrelated. Once the forest is obtained, when a new input sample enters, each decision tree in the forest makes a judgment on which class the sample should belong to (for classification algorithms), and then the class chosen the most is predicted as that class for the sample.

Support Vector Machine

The Support Vector Machine finds a hyperplane to separate classes with the maximum margin. To some extent, it is similar to Logistic Regression and LDA, but Support Vector Machine performs quite well when classes are almost separable. Furthermore, kernel support vector machines can draw nonlinear boundaries, making the modeling very flexible. The most common kernel functions are linear, polynomial, Gaussian RBF (radial), and Sigmoid.

During parameter adjustment, different kernel types can be chosen, as well as different values for the penalty parameter c. To obtain a more accurate estimate of model prediction performance, 5-fold cross-validation was employed.

Click the title to view previous content

Customer Churn Prediction in Telecom: KNN, Naive Bayes, Logistic Regression, and More

Data Sharing | R Language Decision Tree and Random Forest Classification of Telecom User Churn Data and Parameter Tuning, ROC Curve Visualization

Swipe left to see more

CART

Classification and Regression Trees (CART) are tree-based models that involve stratifying and partitioning the prediction space. The splitting behavior can be summarized as a tree, also known as a decision tree.

In the classification tree, I used the Gini index to measure the purity of nodes, and then recursively performed binary splits on the nodes to search for the optimal partition. Finally, I stopped at leaf nodes based on certain stopping conditions.

However, due to overfitting, classification trees may perform well on training sets but poorly on testing sets. Therefore, we need to prune the tree to make it less complex for better performance on test data. The method I adopted is Cost Complexity Pruning (CCP) and 5-fold cross-validation.

Random Forest

Trees are easy to understand and visualize, but they sometimes lack good predictive accuracy. To address this, we adopted some ensemble learning methods to improve their performance. Random forests take it a step further by randomly selecting features for bagged trees. This selection can break the correlation of bagged trees, making the average result of the trees more reliable. In the randomForest() function, ntree represents the number of decision trees included in the random forest, and max_depth represents the number of nodes from the root node to the furthest leaf node. They are very useful when tuning parameters; we tried setting ntree from 50 to 500 with a step of 50 and max_depth from 10 to 100 with a step of 10. The performance of models with different ntree and max_depth is shown in the figure below.

In the random forest model, we can also know the importance of different variables by averaging the reduction in the Gini index. The most important variables found are TotalCharges, MonthlyCharges, and tenure.

XGBoost

Boosting is another ensemble learning method to improve model performance. The boosting algorithm handles underlying training data by iteratively re-weighting the observations of the training data. Among all boosting algorithms, XGBoosting (Extreme Gradient Boosting) is considered one of the most powerful models. It can use regularization to avoid overfitting, effectively handle missing data, and build trees at extremely fast speeds.

Neural Network

Feedforward neural networks, as a powerful deep learning method, are also used to enhance predictive performance. Feedforward neural networks are a type of neural network where there is no recursive relationship between units, and information only propagates forward in the network. The network is trained through forward and backward computation phases. The R package “nnet” is used to construct feedforward neural networks with hidden layers.

Model Comparison

Metric Selection

Precision represents the proportion of correctly predicted churn customers among all predictions. Telecom companies want to ensure that the cost of retaining these customers is economically effective. In other words, they are likely to spend money to retain FN+TN customers, but in reality, only TN customers have churned. Therefore, Precision is an important metric for measuring the effectiveness of customer retention programs.

Recall represents the proportion of correctly predicted churn customers among all churn customers. For telecom companies, the more customers with churn tendencies are retained, the better. Therefore, Recall is an effective metric for measuring the ability to successfully identify churn customers.

Comparison

Based on observations from the exploratory data analysis section and the importance metrics in logistic regression and random forests, telecom companies should pay special attention to key variables such as total tenure, monthly fee, and telephone service. If a customer is new, pays high monthly fees, or lacks telephone service, the telecom company is likely to lose that user. Except for the Naive Bayes model and the CART model, the performance of other models is quite good and very similar to each other. If one were to choose one as a recommended method for telecom companies to apply customer churn prediction, the feedforward neural network model may be chosen, as it performs relatively consistently across different metrics.

Based on the above conclusions, the following recommendations are proposed from a business perspective:

1. Based on the feedforward neural network model, create a list of users with high churn rates. Conduct real-time monitoring, user surveys, and follow-ups on listed users.

2. For new customers, discounts such as half-year coupons can be offered to navigate through the peak churn period.For fiber optic users and those with additional streaming TV and movie services, the focus should be on improving network experience and value-added service experience.Enhancing the technical department to improve network metrics is necessary.Additionally, the company should commit to providing target users with free network upgrades and monthly free TV and movie services to increase user stickiness.

3. For value-added services such as online security, online backup, device protection, and technical support, the company should offer discounts primarily based on user referrals and promotions, such as a free experience for the first month or half-year.

4. For monthly contracted users, launch annual contract payment discount activities to convert monthly contracted users into annual contracted users.

About the Analyst

Here, I would like to sincerely thank Lingzi Lu for her contributions to this article, focusing on data cleaning, machine learning, financial data analysis, etc. She is proficient in R and Python.

Data Acquisition

Reply “Telecom on the public account backend to get complete data for free.

This article’s analyzed data is shared in the member group. Scan the QR code below to join the group!

Customer Churn Prediction in Telecom: KNN, Naive Bayes, Logistic Regression, and More

Click “Read the Original” at the end

to get the complete data.。

This article is excerpted from “Customer Churn Prediction in Telecom: KNN, Naive Bayes, Logistic Regression, LDA/QDA, Random Forest, Support Vector Machine, CART, Neural Networks”.

Customer Churn Prediction in Telecom: KNN, Naive Bayes, Logistic Regression, and More

Click the title to view previous content

From Decision Trees to Random Forest: R Language Credit Card Default Analysis Credit Data Example

PYTHON User Churn Data Mining: Building User Profiles with Logistic Regression, XGBOOST, Random Forest, Decision Trees, Support Vector Machines, Naive Bayes, and KMEANS Clustering

Python Time Series Modeling and Predictive Analysis of Store Data Using LSTM and XGBoost

PYTHON Ensemble Machine Learning: Classification and Regression with ADABOOST, Decision Trees, Logistic Regression Ensemble Models and Grid Search Hyperparameter Optimization

R Language Ensemble Models: Boosting Trees, Random Forests, Constrained Least Squares Weighted Average Model Fusion Analysis of Time Series Data

Python Time Series Modeling and Predictive Analysis of Store Data Using LSTM and XGBoost

R Language PCA, Logistic Regression, Decision Trees, and Random Forests Analysis of Heart Disease Data with High-Dimensional Visualization

R Language Tree-Based Methods: Decision Trees, Random Forests, Bagging, Boosting Trees

R Language Classification Prediction of Credit Data Set Using Logistic Regression, Decision Trees, and Random Forests

SPSS Modeler Using Decision Trees and Neural Networks to Predict ST Stock

R Language Using Linear Models, Regression Decision Trees to Automatically Combine Feature Factor Levels

R Language Custom Gini Coefficient CART Regression Decision Tree Implementation

R Language Using RLE, SVM, and RPART Decision Trees for Time Series Prediction

Python Decision Trees and Random Forests to Predict NBA Winners in Scikit-learn

Python Using Scikit-learn and Pandas Decision Trees for Iris Flower Data Classification Modeling and Cross-Validation

R Language Nonlinear Models: Polynomial Regression, Local Regression, Kernel Smoothing, Generalized Additive Models (GAM) Analysis

R Language Using Standard Least Squares OLS, Generalized Additive Models (GAM), and Spline Functions for Logistic Regression Classification

R Language ISLR Salary Data Polynomial Regression and Spline Regression Analysis

R Language Polynomial Regression, Local Regression, Kernel Smoothing, and Spline Regression Models

R Language Using Poisson Regression, GAM Spline Curve Models to Predict the Number of Cyclists

R Language Quantile Regression, GAM Spline Curves, Exponential Smoothing, and SARIMA for Power Load Time Series Forecasting

R Language Spline Curves, Decision Trees, Adaboost, and Gradient Boosting (GBM) Algorithms for Regression, Classification, and Dynamic Visualization

How to Build Ensemble Models in Machine Learning Using R?

R Language ARMA-EGARCH Models, Ensemble Prediction Algorithms for SPX Actual Volatility Forecasting

In Python Deep Learning Keras, Calculate Neural Network Ensemble Model

R Language ARIMA Ensemble Model for Time Series Analysis

R Language Based on Bagging Classification of Logistic Regression, Decision Trees, Forest Analysis of Heart Disease Patients

R Language Tree-Based Methods: Decision Trees, Random Forests, Bagging, Boosting Trees

R Language Based on Bootstrap Linear Regression Prediction Confidence Interval Estimation Method

R Language Using Bootstrap and Incremental Methods to Calculate Generalized Linear Model (GLM) Prediction Confidence Intervals

R Language Spline Curves, Decision Trees, Adaboost, Gradient Boosting (GBM) Algorithms for Regression, Classification, and Dynamic Visualization

Python Time Series Modeling and Predictive Analysis of Store Data Using LSTM and XGBoost

R Language Random Forests, Logistic Regression to Predict Heart Disease Data and Visualization Analysis

R Language PCA, Logistic Regression, Decision Trees, and Random Forests Analysis of Heart Disease Data with High-Dimensional Visualization

Matlab Establishing SVM, KNN, and Naive Bayes Models for Classification and Drawing ROC Curves

Matlab Using Quantile Random Forest (QRF) Regression Trees to Detect Outliers

Customer Churn Prediction in Telecom: KNN, Naive Bayes, Logistic Regression, and More

Full Link: https://tecdat.cn/?p=34635

Analyst: Lingzi Lu

About the Analyst

This article’s analyzed data is shared in the member group. Scan the QR code below to join the group!

Leave a Comment Cancel reply