Author: Smith | Translated by: Fang’s Mantou
1. Introduction
Using machine learning to predict the next period’s price or direction from past stock prices is nothing new, and by itself it does not yield meaningful predictions. In this article, we turn the time series data of a collection of assets into a simple classification problem and see whether machine learning models can do better at predicting the direction of the next period. The strategy is to invest in exactly one asset each day: the asset for which the machine learning model is most confident that the price will rise over the next period.
2. Importing Related Libraries
library(PerformanceAnalytics)
library(quantmod)   # getSymbols() and Delt(), used below
library(data.table)
library(dplyr)
library(tibble)
library(TTR)
library(tidyr)
library(tidyquant)
library(tsfeatures)
library(rsample)
library(purrr)
library(stringr)
library(tibbletime) # tsibble clashes with the base R index() function
library(xgboost)
library(rvest)
library(knitr)      # kable(), used for the summary tables
library(kableExtra) # kable_styling()
We predefine a few objects and set the ticker symbols of the companies we want to download. We randomly select 30 stocks from the S&P 500.
set.seed(1234)
##########################################################
Scale_Me <- function(x){
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
###########################################################
start_date <- "2017-01-01"
end_date <- "2020-01-01"
url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
symbols <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="constituents"]') %>%
  html_table() %>%
  .[[1]] %>%
  filter(!str_detect(Security, "Class A|Class B|Class C")) %>% # removes firms with Class A, B & C shares
  sample_n(30) %>%
  pull(Symbol)
#symbols <- c(
#  'GOOG', 'MSFT', 'HOG', 'AAPL', 'FB',
#  'AMZN', 'EBAY', 'IBM', 'NFLX', 'NVDA',
#  'TWTR', 'WMT', 'XRX', 'INTC', 'HPE'
#  )
3. Data
Download the data and store it in a new environment.
dataEnv <- new.env()
getSymbols(symbols,
           from = start_date,
           to = end_date,
           #src = "yahoo",
           #adjust = TRUE,
           env = dataEnv)
## [1] "LEG" "NLSN" "SLB" "CHTR" "C" "REGN" "CCI" "SYK" "ROP" "RL"
## [11] "CERN" "CMG" "GS" "CAT" "MSI" "BR" "VRSK" "PNC" "KEYS" "PHM"
## [21] "FB" "BKR" "ABMD" "WYNN" "DG" "ADI" "GL" "TSCO" "FLS" "CDW"
Once the data is downloaded and stored in the new environment, we clean it, combine the individual series into a single data frame, calculate each asset’s daily returns, and create a binary up/down indicator, which is what the classification model will try to predict.
df <- eapply(dataEnv, function(x){
  as.data.frame(x) %>%
    rename_all(function(n){
      gsub("^(\\w+)\\.", "", n, perl = TRUE)
    }) %>%
    rownames_to_column("date")
}) %>%
  rbindlist(idcol = TRUE) %>%
  mutate(date = as.Date(date)) %>%
  group_by(.id) %>%
  tq_mutate(
    select = Adjusted,
    mutate_fun = periodReturn,
    period = "daily",
    type = "arithmetic"
  ) %>%
  mutate(
    Adj_lag = lag(Adjusted),
    chng_Adj = ifelse(Adjusted > Adj_lag, 1, 0)
  ) %>%
  select("date", ".id", "Adjusted", "daily.returns", "chng_Adj", "Open", "High", "Low", "Close") %>%
  as_tibble() %>%
  as_tbl_time(index = date) %>%
  setNames(c("date", "ID", "prc", "ret", "chng", "open", "high", "low", "close")) %>%
  drop_na(chng)
The first few observations of the data are as follows:
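A quick way to view them, using the df object built above:
head(df, 10) # print the first ten rows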
We can use the nest() function to put the data into a convenient nested tibble, which we can then map over with map() and apply the rsample package’s rolling_origin() function to. This way each asset gets its own rolling_origin() splits, with no overlap or mixing across assets, and we use these splits to create time series features for each period.
nested_df <- df %>%
  mutate(duplicate_ID = ID) %>%
  nest(-ID)
We split each asset’s time series into a collection of rolling splits, where every analysis() set contains 100 observations and the corresponding assessment() set contains a single observation. Normally, analysis() would become our training data and assessment() our testing data, but here we use the rolling_origin() function to help create the time series features.
rolled_df <- map(nested_df$data,
                 ~ rolling_origin(.x,
                                  initial = 100,
                                  assess = 1,
                                  cumulative = FALSE,
                                  skip = 0))
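As a quick sanity check on the objects just created, the first split of the first asset should hold 100 rows in analysis() and 1 row in assessment() (first_split is an illustrative name, not used later):
first_split <- rolled_df[[1]]$splits[[1]] # illustration only
nrow(analysis(first_split))   # 100
nrow(assessment(first_split)) # 1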
4. Time Series Functions
To create the time series features we use the tsfeatures package; the feasts package offers similar functionality. For this model, we only select a handful of functions of interest from the tsfeatures package.
functions <- c(
  "entropy",         # measures the "forecastability" of a series: low values = high signal-to-noise, large values = difficult to forecast
  "stability",       # means/variances are computed for all tiled windows; stability is the variance of the means
  "lumpiness",       # lumpiness is the variance of the variances
  "max_level_shift", # finds the largest mean shift between two consecutive windows (returns two values: size and time index of the shift)
  "max_var_shift",   # finds the largest variance shift between two consecutive windows (returns two values: size and time index of the shift)
  "max_kl_shift",    # finds the largest shift in Kullback-Leibler divergence between two consecutive windows (returns two values: size and time index of the shift)
  "crossing_points"  # number of times a series crosses the mean line
)
Rather than chaining everything through map(), we found it easier to wrap the remaining steps in a function, function(SYMB), which performs the following operations on each asset in our data:
First, we take the out-of-sample t+1 (assessment()) data of every split and bind these rows into a single data frame. Next, we apply the functions named in the functions string, via the tsfeatures package, to each in-sample analysis() window (each containing 100 observations), giving one row of features per window, and bind these rows together. We then use bind_cols() to bind the columns of the two data sets. Finally, we rename the chng variable to Y and use ~str_c("X", seq_along(.)) to give the time series feature columns generic names, so we can add functions to the functions string without having to rename variables by hand for the model to work.
Once this is done, we apply the rolling_origin() function a second time to create the machine learning data set. The first rolling_origin() was used to fold the time series on a rolling basis, taking the previous 100 days of data and computing the tsfeatures functions on them, much like computing rolling means or standard deviations with the zoo package’s rollapply().
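To make this concrete, here is a minimal sketch, using the rolled_df and functions objects defined above, of the features computed for a single 100-day window of the first asset (the same call the function below makes for every window):
# illustration only: one row of time series features for one 100-day window
analysis(rolled_df[[1]]$splits[[1]]) %>%
  pull(ret) %>%
  tsfeatures(functions)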
Next, we split the data into the predictor matrices X_train and X_test and the corresponding response vectors Y_train and Y_test. The xgboost package requires its own matrix class, created with xgb.DMatrix(); this is what dtrain and dtest construct.
Then we set the XGBoost parameters and fit the XGBoost model. Strictly speaking, proper cross-validation should be performed at this point, but cross-validation for time series is tricky because random folds leak future information; we will introduce a suitable method in a later article.
Once the model is trained, we start making predictions.
5. Model Application
The function for calculating all of this is as follows:
Prediction_Model <- function(SYMB){

  data <- bind_cols(
    map(rolled_df[[SYMB]]$splits, ~ assessment(.x)) %>%
      bind_rows(),
    map(rolled_df[[SYMB]]$splits, ~ analysis(.x)) %>%
      map(., ~ tsfeatures(.x[["ret"]], functions)) %>%
      bind_rows()
  ) %>%
    rename(Y = chng) %>%
    rename_at(vars(-c(1:9)), ~ str_c("X", seq_along(.)))

  ml_data <- data %>%
    as_tibble() %>%
    rolling_origin(
      initial = 200,
      assess = 1,
      cumulative = FALSE,
      skip = 0)
  X_train <- map(
    ml_data$splits, ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>%
      as.matrix()
  )
  Y_train <- map(
    ml_data$splits, ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>%
      as.matrix()
  )
  X_test <- map(
    ml_data$splits, ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>%
      as.matrix()
  )
  Y_test <- map(
    ml_data$splits, ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>%
      as.matrix()
  )
  #############################################################
  dtrain <- map2(
    X_train, Y_train, ~ xgb.DMatrix(data = .x, label = .y, missing = NaN)
  )
  dtest <- map(
    X_test, ~ xgb.DMatrix(data = .x, missing = NaN)
  )

  # Parameters:
  params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1,
                 "subsample" = 1, "objective" = "binary:logistic", "gamma" = 1, "lambda" = 1,
                 "alpha" = 0, "max_delta_step" = 0, "colsample_bylevel" = 1, "eval_metric" = "auc",
                 "set.seed" = 176)

  # Train one XGBoost model per split
  xgb.model <- map(
    dtrain, ~ xgboost(params = params, data = .x, nrounds = 10)
  )
  xgb.pred <- map2(
    .x = xgb.model,
    .y = dtest,
    .f = ~ predict(.x, newdata = .y, type = 'prob')
  )

  preds <- cbind(plyr::ldply(xgb.pred, data.frame),
                 plyr::ldply(Y_test, data.frame)) %>%
    setNames(c("pred_probs", "actual")) %>%
    bind_cols(plyr::ldply(map(ml_data$splits, ~ assessment(.x)))) %>%
    rename(ID = duplicate_ID) %>%
    #select(pred_probs, actual, date, ID, prc, ret) %>%
    as_tibble()

  return(preds)
}
We can apply the above model, training and testing on each of our assets, as follows:
Sys_t_start <- Sys.time()
Resultados <- lapply(seq_along(rolled_df), Prediction_Model)
Sys_t_end <- Sys.time()
round(Sys_t_end - Sys_t_start, 2)
The Resultados output is a list with one element per asset in our data. The first few observations for the first asset in the list are as follows:
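They can be inspected from the Resultados list created above, for example:
head(Resultados[[1]]) # first rows of predictions for the first asset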
It includes the predicted probabilities from XGBoost, the actual observed outcome, the date of the out-of-sample test observation, the observed stock price, the computed daily return (a copy of the observation), the OHLC data collected from Yahoo and, finally, the time series features we constructed, renamed X1, X2 and so on.
The goal of this strategy is to invest each day in the asset with the highest predicted probability of appreciating. In other words, if on day t the model predicts that GOOG will close above its previous close with probability 0.78 and that AMZN will rise with probability 0.53, we invest in GOOG that day. We only ever invest in the single asset with the highest predicted probability of going up.
We therefore create a new data frame named top_assets, which keeps, for each day, the asset with the highest predicted probability.
top_assets <- plyr::ldply(Resultados) %>%
  #select(pred_probs, actual, date, open, high, low, close, prc, ret) %>%
  group_by(date) %>%
  arrange(desc(pred_probs)) %>%
  dplyr::slice(1) %>%
  ungroup() %>%
  select(date, everything()) %>%
  rename(score = pred_probs) %>%
  select(-actual)
6. Strategy Evaluation
The first 10 days of strategy investment are as follows:
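They can be printed from the top_assets data frame built above, for example:
head(top_assets, 10) # the first 10 daily picks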
The score column is the predicted probability, for the selected asset, that its price closes above the previous close; the ID column gives the ticker of the asset we invest in that day.
Next, we benchmark this pick-the-best-predicted-winner strategy against the S&P 500, so we download the index data.
top_assets <- xts(top_assets[,c(2:ncol(top_assets))], order.by = top_assets$date) # put top_assets into xts format
# Analyse strategy
getSymbols("SPY",
from = start_date,
to = end_date,
src = "yahoo")
## [1] "SPY"
#detach("package:tsibble", unload = TRUE) # tsibble clashes with the base R index() function
SPY$ret_Rb <- Delt(SPY$SPY.Adjusted)
SPY <- SPY[index(SPY) >= min(index(top_assets))]
RaRb <- cbind(top_assets, SPY)
From here we can compare the strategy with the S&P 500 index. The model has not yet been extended to include short selling or to construct multi-asset portfolios of the top N assets.
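For illustration, a top-N variant could be sketched by keeping the N assets with the highest predicted probability each day and equal-weighting their daily returns. This is only a sketch with an arbitrary N = 3 and is not part of the backtest below:
# hypothetical top-3 equal-weight portfolio (not used in the results that follow)
top_n_assets <- plyr::ldply(Resultados) %>%
  group_by(date) %>%
  arrange(desc(pred_probs)) %>%
  dplyr::slice(1:3) %>%            # the 3 highest-probability assets per day
  summarise(ret = mean(ret)) %>%   # equal-weight daily return
  ungroup()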
Plotting the performance of the strategy:
charts.PerformanceSummary(RaRb[, c("ret", "ret_Rb")], geometric = FALSE, wealth.index = FALSE,
main = "Strategy vs. Market")
Relevant Metrics:
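The original post does not show the call that produced the table below; the ratios listed match the output of PerformanceAnalytics’ table.DrawdownsRatio() and table.DownsideRiskRatio(), so a plausible (assumed) reconstruction is:
# assumed reconstruction -- not shown in the original post
rbind(
  table.DrawdownsRatio(RaRb[, c("ret", "ret_Rb")]),
  table.DownsideRiskRatio(RaRb[, c("ret", "ret_Rb")])
)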
## ret ret_Rb
## Sterling ratio 0.1870 0.3879
## Calmar ratio 0.2551 0.5884
## Burke ratio 0.2251 0.5344
## Pain index 0.0955 0.0283
## Ulcer index 0.1189 0.0455
## Pain ratio 0.7337 4.0290
## Martin ratio 0.5891 2.5027
## daily downside risk 0.0111 0.0066
## Annualised downside risk 0.1768 0.1044
## Downside potential 0.0054 0.0029
## Omega 1.0722 1.1601
## Sortino ratio 0.0351 0.0714
## Upside potential 0.0058 0.0034
## Upside potential ratio 0.7027 0.6124
## Omega-sharpe ratio 0.0722 0.1601
Drawdown information:
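Again the generating call is not shown in the original; the layout matches table.Drawdowns() applied to each return series, so a plausible (assumed) reconstruction is:
# assumed reconstruction -- not shown in the original post
lapply(RaRb[, c("ret", "ret_Rb")], function(x) table.Drawdowns(x, top = 5))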
## $ret
## From Trough To Depth Length To Trough Recovery
## 1 2018-08-31 2019-01-03 2019-09-16 -0.2746 261 85 176
## 2 2019-11-06 2019-12-03 <NA> -0.1300 39 19 NA
## 3 2019-10-01 2019-10-18 2019-10-29 -0.0810 21 14 7
## 4 2018-03-22 2018-04-20 2018-05-09 -0.0773 34 21 13
## 5 2018-08-10 2018-08-15 2018-08-20 -0.0474 7 4 3
##
## $ret_Rb
## From Trough To Depth Length To Trough Recovery
## 1 2018-09-21 2018-12-24 2019-04-12 -0.1935 140 65 75
## 2 2019-05-06 2019-06-03 2019-06-20 -0.0662 33 20 13
## 3 2018-03-15 2018-04-02 2018-06-04 -0.0610 56 12 44
## 4 2019-07-29 2019-08-05 2019-10-25 -0.0602 64 6 58
## 5 2018-06-13 2018-06-27 2018-07-09 -0.0300 18 11 7
Comparing the returns:
chart.Boxplot(RaRb[,c("ret", "ret_Rb")], main = "Returns")
Comparative Return Statistics:
table.Stats(RaRb[, c("ret", "ret_Rb")]) %>%
  t() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
##                     ret     ret_Rb
## Observations        453     453
## NAs                    0       0
## Minimum             -0.0669 -0.0324
## Quartile 1          -0.0068 -0.0030
## Median               0.0006  0.0006
## Arithmetic Mean      0.0004  0.0005
## Geometric Mean       0.0003  0.0004
## Quartile 3           0.0087  0.0054
## Maximum              0.0642  0.0505
## SE Mean              0.0007  0.0004
## LCL Mean (0.95)     -0.0011 -0.0004
## UCL Mean (0.95)      0.0018  0.0013
## Variance             0.0002  0.0001
## Stdev                0.0156  0.0091
## Skewness            -0.2542 -0.2949
## Kurtosis             2.8842  3.6264
Sharpe:
lapply(RaRb[, c("ret", "ret_Rb")], function(x){SharpeRatio(x)})
## $ret
## ret
## StdDev Sharpe (Rf=0%, p=95%): 0.025027498
## VaR Sharpe (Rf=0%, p=95%): 0.015346462
## ES Sharpe (Rf=0%, p=95%): 0.009618405
##
## $ret_Rb
## ret_Rb
## StdDev Sharpe (Rf=0%, p=95%): 0.05152014
## VaR Sharpe (Rf=0%, p=95%): 0.03218952
## ES Sharpe (Rf=0%, p=95%): 0.01913213
Plotting the risk-return scatter plot:
chart.RiskReturnScatter(RaRb[, c("ret", "ret_Rb")],
Rf=.03/252, scale = 252,
main = "Risk - Return over the period")
Plotting rolling returns, risks, and Sharpe performance:
charts.RollingPerformance(RaRb[, c("ret", "ret_Rb")],
Rf=.03/12,
colorset = c("red", rep("darkgray",5), "orange", "green"), lwd = 2)
Annualized Returns:
lapply(RaRb[, c("ret")],function(x){periodReturn(
x, period = 'yearly', type = 'arithmetic')})
## $ret
## yearly.returns
## 2018-12-31 -1.855083
## 2019-12-31 -1.475181
lapply(RaRb[, c("ret_Rb")],function(x){periodReturn(
x, period = 'yearly', type = 'arithmetic')})
## $ret_Rb
## yearly.returns
## 2018-12-31 -9.0376638
## 2019-12-31 -0.7226497