Using a machine learning algorithm to pick the stocks that are most likely to produce a good return is a bit like seeking the opinion of an investment consultant. However, it can be unsettling to make your investment decision after listening to just one consultant. This is the moment to get second opinions and hire more investment advisors to make sure the investment idea is reliable, doable, and profitable.
Ensemble learning applies the same principle: you consult other machine learning algorithms to confirm the predictions made by one model. Once you have collected the final outputs from all of these models, you can relax in your nice chair like a big boss, analyze the results, and make your important and sacred decision.
Motivation
In the previous articles 1, 2, 3, 4 and 5, we built a machine learning script to predict the winners in the stock market using only the XGBoost model. Nevertheless, there are many other algorithms out there for us to try and evaluate, so the question suddenly becomes much more complex: we need to build multiple machine learning models, use GridSearch to find the best hyperparameters, train/fit many different models, evaluate each model with the same metrics, pick the best-performing model for us to use, and …….
How are we going to do this?
Ensemble learning
Ensemble learning is a method that combines the predictions from different machine learning models. We call these models weak learners because, compared with our final model, each of them contributes only a part of the effort that produces the final predictions. In other words, the ensemble model becomes a more powerful predictor by using a strong learner to assemble the results from many weak learners, so that the final predictor can smooth out the variance across individual models and also prevent a single model from overfitting. Below is the list of ensemble learning techniques:
Different types of ensemble learning techniques
Pause!! Let’s narrow it down
Among these ensemble learning techniques, Bagging and Boosting are the most commonly known. They are even baked into modern machine learning algorithms such as AdaBoost or the XGBoost model that we used in our previous articles. However, covering all of these techniques would probably bore you to death, so we're going to introduce two of them in this article: Average Voting and Stacking. Also, as explaining the basic theory is not my strength, I'll put less effort into the theory and more effort into the details of the backtests and the code.
Average Voting
As the name implies, average voting averages the predicted scores/probabilities from your weak learners and outputs the final scores/probabilities. For example, suppose you have trained three weak-learner classifier models, and each produces a predicted probability of getting a positive return tomorrow.
Classifier Model | Stock 1 | Stock 2 | Stock 3 |
---|---|---|---|
A | 0.9 | 0.9 | 0.7 |
B | 0.7 | 0.3 | 0.7 |
C | 0.6 | 0.7 | 0.7 |
Averaged probability | 0.73 | 0.63 | 0.70 |
Probabilities of getting a positive return tomorrow (Soft Voting)
If we looked at models A, B, and C individually, we would probably end up buying Stock 2, as models A and C both give it a relatively high probability of a positive return. After applying the average voting technique, however, the probability of Stock 2 drops to 63% and Stock 1 tops it, indicating that Stock 1 actually has a higher probability of a positive return than the other two stocks. This is so-called Soft Voting.
There is also Hard Voting, which counts binary votes, True or False, instead of the probabilities. Taking the same example as above, we add one more rule: a model votes 1 (True) only when its probability is at least 0.7. The final result turns out quite different.
Classifier Model | Stock 1 | Stock 2 | Stock 3 |
---|---|---|---|
A | 0.9 (1) | 0.9 (1) | 0.7 (1) |
B | 0.7 (1) | 0.3 (0) | 0.7 (1) |
C | 0.6 (0) | 0.7 (1) | 0.7 (1) |
Votes | 2 positive & 1 negative | 2 positive & 1 negative | 3 positive |
Votes for a positive return tomorrow (Hard Voting)
Counting the voters who vote positive, the final winner is Stock 3, since all three models think it is going to receive a positive return tomorrow. Hard Voting would therefore recommend Stock 3, yet Soft Voting would recommend Stock 1. The concept is quite straightforward, but this technique does help the model mitigate the impact of the high variance of any single model.
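Here is a minimal NumPy sketch that reproduces the soft-voting and hard-voting results from the two tables above (the 0.7 threshold for hard voting is the one assumed in the example):

```python
import numpy as np

# Predicted probabilities of a positive return tomorrow
# (rows: models A, B, C; columns: Stock 1, Stock 2, Stock 3).
proba = np.array([
    [0.9, 0.9, 0.7],   # model A
    [0.7, 0.3, 0.7],   # model B
    [0.6, 0.7, 0.7],   # model C
])

# Soft voting: average the probabilities per stock and pick the highest.
soft_scores = proba.mean(axis=0)                  # [0.733, 0.633, 0.700]
print("Soft voting pick: Stock", soft_scores.argmax() + 1)   # Stock 1

# Hard voting: each model votes 1 if its probability is at least 0.7, then count the votes.
votes = (proba >= 0.7).sum(axis=0)                # [2, 2, 3]
print("Hard voting pick: Stock", votes.argmax() + 1)         # Stock 3
```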
Stacking
Stacking processes the predictions from the weak learners in a more advanced way than average voting. It treats the outputs of its weak learners as features and stacks them together into secondary training data. This secondary training data is then used as the input for the final estimator (a.k.a. the meta-model), which computes the final prediction.
Stacking technique illustration
As illustrated above, classification models A, B, and C are trained on the same training data and produce predictions A, predictions B, and predictions C. The final estimator treats these predictions as new features and computes the final prediction from them.
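To make the idea concrete, here is a rough sketch of stacking done by hand with scikit-learn: out-of-fold predictions from two simple base models become the columns of the secondary training data, and a logistic regression acts as the meta-model. The toy dataset and the choice of base models here are placeholders, not my actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

# Toy dataset standing in for the real features/labels.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    SVC(probability=True),
]

# Out-of-fold probabilities from each base model become the secondary training data.
meta_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The final estimator (meta-model) is trained on the stacked predictions.
meta_model = LogisticRegression().fit(meta_features, y_train)
```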
Walk through the strategies
Now that we have a general idea of these two ensemble learning techniques, let's move on to the backtests so that we can see the power of ensemble learning. In this series of backtests, we are going to use the same dataset to train 1. XGBoost, 2. LogisticRegression, 3. SVM, and 4. a neural network with two hidden layers. After backtesting each of these models on its own, we will combine them and apply the Average Voting and Stacking techniques respectively to see whether the performance improves.
Universe and training data
I’m still using ZZ500 as our universe and the same set of features as the training data. If you are interested in how to define the universe and which features I’ve been using, check out my previous articles on machine learning and factor analysis.
Backtest timeframe
My backtest timeframe is from 2020-04 ~ 2022-07. For each month, I need 60 months of data as the training data to train the model. Therefore, it requires 27 (validation data) + 60 (training data) = 87 months, or a bit over 7 years, of stock data.
Backtest scenarios
Here are the four models that I employed in this backtest. Again, I'm not a machine learning professor who can turn you into an expert with what I know. Instead, I'll give a quick description of each model and list the materials that helped me understand its basics.
1. XGBoost
This is the decision-tree-based model that I've been using since the first article. The advantage of this algorithm is that it's extremely fast: it took about one-fifth of the time to train compared with the other models. A minimal setup sketch follows the video list below. These are the StatQuest videos that helped me understand what XGBoost is about:
- Gradient Boost Part 1 (of 4): Regression Main Ideas
- Gradient Boost Part 2 (of 4): Regression Details
- Gradient Boost Part 3 (of 4): Classification
- Gradient Boost Part 4 (of 4): Classification Details
- XGBoost Part 1 (of 4): Regression
- XGBoost Part 2 (of 4): Classification
- XGBoost Part 3 (of 4): Mathematical Details
- XGBoost Part 4 (of 4): Crazy Cool Optimizations
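A minimal sketch of how the XGBoost classifier could be set up; the hyperparameters and the X_train/X_test placeholders are illustrative, not the exact values used in the backtest:

```python
from xgboost import XGBClassifier

# Illustrative hyperparameters; in practice these come from GridSearch.
xgb_model = XGBClassifier(
    n_estimators=300,      # number of boosted trees
    max_depth=4,           # depth of each tree
    learning_rate=0.05,    # shrinkage applied to each tree's contribution
    eval_metric="logloss",
)

# X_train, y_train, X_test are placeholders for your own feature matrix and labels.
xgb_model.fit(X_train, y_train)
proba = xgb_model.predict_proba(X_test)[:, 1]   # probability of a positive return tomorrow
```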
2. LogisticRegression
Logistic Regression is very much like the linear regression that I talked about in 【Factor analysis】 Vol. 1. Introduction the idea of factor analysis. It uses various numerical features to predict the probability that a certain event will happen. An activation function (such as the Sigmoid or Softmax) maps the linear combination of features into that probability, and a threshold then turns the probability into a Boolean value that says whether the event is expected to happen. A small sketch follows the list below. Here are the materials for you to learn more about logistic regression:
- StatQuest: Logistic Regression
- Logistic Regression Details Pt1: Coefficients
- Logistic Regression Details Pt 2: Maximum Likelihood
- The SoftMax Derivative, Step-by-Step!!!
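A minimal scikit-learn sketch of this idea; the C value, the 0.5 threshold, and the X_train/X_test placeholders are assumptions for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Logistic regression squeezes a linear combination of the features through a sigmoid
# to obtain a probability between 0 and 1.
logit_model = LogisticRegression(C=1.0, max_iter=1000)   # illustrative parameters
logit_model.fit(X_train, y_train)                        # X_train / y_train: placeholders

proba_up = logit_model.predict_proba(X_test)[:, 1]       # P(positive return tomorrow)
labels = (proba_up >= 0.5).astype(int)                   # threshold into a 0/1 decision
```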
3. SVM
I have introduced the concept of SVM here. SVM can be thought of as a cousin of logistic regression: instead of finding an exact line that separates all the 0's and the 1's, we ask the model to place a hyperplane between the groups so that the data is separated as cleanly as possible, and the method used to construct this hyperplane (often in a higher-dimensional space) is referred to as a 'kernel.' A short sketch follows the list below.
- Support Vector Machines Part 1 (of 3): Main Ideas!!!
- Support Vector Machines Part 2: The Polynomial Kernel (Part 2 of 3)
- Support Vector Machines Part 3: The Radial (RBF) Kernel (Part 3 of 3)
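A minimal sketch of an SVM classifier with an RBF kernel; the parameters and the X_train/X_test placeholders are illustrative:

```python
from sklearn.svm import SVC

# An SVM classifier with an RBF kernel. probability=True is needed later on
# if we want predicted probabilities for soft voting.
svm_model = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
svm_model.fit(X_train, y_train)                   # X_train / y_train: placeholders
proba = svm_model.predict_proba(X_test)[:, 1]     # probability of a positive return
```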
4. Neural Networks
The neural network is a type of deep learning algorithm. It uses numerous nodes to simulate the neurons in a human nervous system: each neuron makes its own small decision, and these decisions are combined into the final decision. Below are the related materials about the NN model, and my model setup is shown after the list:
- TensorFlow Guide
- Tensors for Neural Networks, Clearly Explained!!!
- Neural Networks Pt. 1: Inside the Black Box
- Neural Networks Pt. 2: Backpropagation Main Ideas
- Neural Networks Pt. 3: ReLU In Action!!!
- Neural Networks Pt. 4: Multiple Inputs and Outputs
- Neural Networks Part 5: ArgMax and SoftMax
- Neural Networks Part 6: Cross Entropy
- Neural Networks Part 7: Cross Entropy Derivatives and Backpropagation
- Neural Networks Part 8: Image Classification with Convolutional Neural Networks (CNNs)
```python
def get_model():
    ...
```
My neural network model setup
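A minimal sketch of what a two-hidden-layer Keras model built by get_model() could look like; the layer sizes, activations, optimizer, and the N_FEATURES placeholder are illustrative assumptions rather than my exact configuration:

```python
import tensorflow as tf

N_FEATURES = 20   # placeholder: replace with the number of features in the training data

def get_model():
    # A small feed-forward network with two hidden layers for binary classification.
    # Layer sizes, activations, and the optimizer here are illustrative choices.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_FEATURES,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of a positive return
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```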
5. Average Voting Algorithm
As previously explained, Average Voting essentially averages the predicted scores/probabilities of the participating machine learning models. It is therefore relatively easy to implement: you put your models into a list and pass it to the estimators parameter. The tricky part is that the TensorFlow library behind the neural network model was originally developed by Google, while the VotingClassifier comes from the scikit-learn library; the two are not naturally compatible, and your neural network model can't be tucked into the estimators parameter directly. Fortunately, TensorFlow also provides a function to wrap our NN model into a format that the scikit-learn library can understand. Hence, remember to wrap your NN model before you start building your Average Voting algorithm.
```python
import tensorflow as tf
```
Use `tf.keras.wrappers.scikit_learn` to wrap our NN model
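A minimal wrapping sketch along these lines; the epochs and batch_size values are illustrative, and note that newer TensorFlow releases have moved this wrapper out to the separate scikeras package:

```python
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Wrap the Keras model factory so scikit-learn can treat it like any other classifier.
# epochs and batch_size here are illustrative training settings.
nn_model = KerasClassifier(build_fn=get_model, epochs=20, batch_size=256, verbose=0)
```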
Once you have your models ready, you simply put them together into a list, include the wrapped NN model, and hand the list to the VotingClassifier. Here we use voting='soft' to smooth the variance of the model predictions.
```python
from sklearn.ensemble import VotingClassifier
```
VotingClassifier basic instruction
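A sketch of how the four models could be combined with soft voting; the estimator settings and the X_train/X_test placeholders are illustrative, and nn_model is the wrapped Keras classifier from the previous step:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Named list of (already configured) estimators; nn_model is the wrapped Keras classifier.
estimators = [
    ("xgb", XGBClassifier(eval_metric="logloss")),
    ("logit", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),        # probability=True is required for soft voting
    ("nn", nn_model),
]

# voting='soft' averages the predicted probabilities across the estimators.
voting_model = VotingClassifier(estimators=estimators, voting="soft")
voting_model.fit(X_train, y_train)                 # X_train / y_train: placeholders
proba = voting_model.predict_proba(X_test)[:, 1]
```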
6. Stacking
In our StackingClassifier, we use the XGBoost, Support Vector Machine, and Neural Network models as our base estimators. As the final estimator that produces the final prediction, we use the Logistic Regression model with the parameters it needs. Once the model is instantiated, we can use this instance like any other scikit-learn model, calling fit and predict. Make sure you set the hyperparameters before you build your base learner models.
```python
from sklearn.ensemble import StackingClassifier
```
Basic setup of StackingClassifier
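A sketch of the stacking setup described above; the hyperparameters, the stack_method choice, and the X_train/X_test placeholders are illustrative:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Base learners: set their hyperparameters before handing them to the stack.
base_estimators = [
    ("xgb", XGBClassifier(eval_metric="logloss")),
    ("svm", SVC(probability=True)),
    ("nn", nn_model),                       # the wrapped Keras classifier from earlier
]

# Logistic regression acts as the final estimator (meta-model) on top of the base predictions.
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",           # feed the base models' probabilities to the meta-model
)
stacking_model.fit(X_train, y_train)                 # X_train / y_train: placeholders
proba = stacking_model.predict_proba(X_test)[:, 1]
```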
Backtest results
Backtest results summary
Even though the annual returns of the VotingClassifier and the StackingClassifier are not higher than those of the other machine learning models, the Sharpe Ratio and the Maximum Drawdown are relatively lower. The win rate of the VotingClassifier scenario even increases to 61%, indicating our model is better at picking the stocks that are more likely to gain positive returns. To get a more intuitive sense of how the ensemble learning methods affect our model, let's look at the stratified and return diagrams.
Scenario | Stratified Diagram | Return Diagram |
---|---|---|
XGBoost | ||
Logistic Regression | ||
SVM | ||
Neural Network | ||
Average Voting | ||
Stacking | | |
It's quite clear that our ensemble learning methods (Average Voting and Stacking) fluctuate less than the rest of the models. Comparing the same bear market period from 2022-02 ~ 2022-04, our losses are a lot smaller than those of the non-ensemble methods.
Conclusion
The Average Voting ensemble method seems to produce a better result and improve the predictability of our model. However, it leaves far fewer places where we can step in to fine-tune the model. In contrast, the Stacking ensemble method gives us much more room to search for the best combination of base estimators. Hence, one thing we can try is to use the result of Average Voting as a benchmark and use Stacking as a tool to see whether we can build an even more powerful model to better predict the market.