
2025-07-01-updated-boilertemplate-ml-project



Introduction

This report documents the process of building a machine learning model to predict the short-term, next-day return of SPY. The report is separated into three major sections as instructed.

To answer these questions, additional steps are involved, such as data collection, data processing, and data imputation. These additional steps are also addressed while answering the questions.

The goal of this assignment is to improve our ability to predict when SPY will go up, and hence to maximize the portfolio return by investing in SPY when a positive signal appears.

Libraries Used in this Template

# Libraries for holding the data
import pandas as pd
import numpy as np
import talib as ta

# Libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for training the model
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFECV, RFE
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, roc_auc_score, roc_curve, RocCurveDisplay
import xgboost as xgb


# Libraries for other purposes
from scipy.stats import shapiro, boxcox, yeojohnson, skew
import missingno as msno
from prettytable import PrettyTable

Helper functions

This is a custom class that adjusts the skewness of the dataset while mimicking the fit/transform interface of StandardScaler.

class SkewnessAdjustor:

    def __init__(self):
        self.dont_transform_fields = []
        self.np_log_transform_fields = []
        self.yeo_johnson_transform_fields = []
        self.box_cox_transform_fields = []

        self.yj_transformer = PowerTransformer(
            method='yeo-johnson',
            standardize=False,
            copy=True  # Set to False to perform inplace computation during transformation.
        )

        self.bc_transformer = PowerTransformer(
            method='box-cox',
            standardize=False,
            copy=True  # Set to False to perform inplace computation during transformation.
        )

    def fit_transform(self, data: pd.DataFrame, factor_names: list, inplace=False):
        tmp = data.copy() if not inplace else data

        for factor in factor_names:
            # Skip binary features and features that are already roughly symmetric.
            if len(tmp[factor].unique()) == 2:
                self.dont_transform_fields.append(factor)

            elif abs(tmp[factor].skew()) < 0.5:
                self.dont_transform_fields.append(factor)

            else:
                original_skewness = tmp[factor].skew()

                if (tmp[factor] > 0).all():
                    np_log_skewness = np.log(tmp[factor]).skew()
                    box_cox_skewness = pd.DataFrame(boxcox(tmp[factor])[0]).skew().values[0]
                else:
                    # Log and Box-Cox require strictly positive data.
                    np_log_skewness = float('inf')
                    box_cox_skewness = float('inf')

                yeo_skewness = pd.DataFrame(yeojohnson(tmp[factor].astype(np.float64))[0]).skew().values[0]

                # Pick the transform that yields the smallest absolute skewness.
                method, _ = min(
                    ('original', abs(original_skewness)),
                    ('np', abs(np_log_skewness)),
                    ('yeo', abs(yeo_skewness)),
                    ('box_cox', abs(box_cox_skewness)),
                    key=lambda x: x[1]
                )

                if method == 'original':
                    self.dont_transform_fields.append(factor)
                elif method == 'np':
                    self.np_log_transform_fields.append(factor)
                elif method == 'yeo':
                    self.yeo_johnson_transform_fields.append(factor)
                elif method == 'box_cox':
                    self.box_cox_transform_fields.append(factor)
                else:
                    raise ValueError('Wrong skewness method')

        if self.np_log_transform_fields:
            tmp.loc[:, self.np_log_transform_fields] = np.log(tmp.loc[:, self.np_log_transform_fields])

        if self.yeo_johnson_transform_fields:
            tmp.loc[:, self.yeo_johnson_transform_fields] = self.yj_transformer.fit_transform(tmp.loc[:, self.yeo_johnson_transform_fields])

        if self.box_cox_transform_fields:
            tmp.loc[:, self.box_cox_transform_fields] = self.bc_transformer.fit_transform(tmp.loc[:, self.box_cox_transform_fields])

        return tmp if not inplace else None

    def transform(self, data: pd.DataFrame, factor_names: list, inplace=False):
        tmp = data.copy() if not inplace else data

        if not self.dont_transform_fields and not self.np_log_transform_fields \
                and not self.yeo_johnson_transform_fields and not self.box_cox_transform_fields:
            raise ValueError('This class has not been fitted yet.')

        if self.np_log_transform_fields:
            tmp.loc[:, self.np_log_transform_fields] = np.log(tmp.loc[:, self.np_log_transform_fields])

        if self.yeo_johnson_transform_fields:
            tmp.loc[:, self.yeo_johnson_transform_fields] = self.yj_transformer.transform(tmp.loc[:, self.yeo_johnson_transform_fields])

        if self.box_cox_transform_fields:
            tmp.loc[:, self.box_cox_transform_fields] = self.bc_transformer.transform(tmp.loc[:, self.box_cox_transform_fields])

        return tmp if not inplace else None

B. Feature Selection Using the Funnelling Approach

Features represent the characteristics of the data and are fed as input variables to the machine learning model. To create features for training the model, the fundamental step is to collect and process the data. Once we have the raw data from the market, we can process it further and even derive new features to train our machine learning model. However, we can’t just dump everything we have into the model. Therefore, selecting the most important and relevant features is a crucial step before building and training the model.

B.1 Data Collection

The objective of this ML model is to predict the SPY next-day return. To do that, we need raw data from a broker or data warehouse. I downloaded 10 years of SPY end-of-day market data starting 2014-01-01 from the data service provider Tiingo (spy_asof_20140101.csv).

df = pd.read_csv('./spy_asof_20140101.csv', index_col=0)
df.head()

Raw data

B.2 Organizing and Examining the Raw Data

The ‘close’, ‘high’, ‘low’, ‘open’, and ‘volume’ columns are the raw prices from the market. However, when analyzing the growth from 2014 until today, these raw prices can show large jumps due to dividend and split events. Therefore, we switch to the forward-adjusted prices to make sure we can track the trend of the stock price.

df = df.drop(
    columns=['close', 'high', 'low', 'open', 'volume', 'splitFactor']
).rename(
    columns={
        'adjClose': 'close',
        'adjHigh': 'high',
        'adjLow': 'low',
        'adjOpen': 'open',
        'adjVolume': 'volume'
    }
)
df.head()

Raw data

Now, let’s check whether anything was missing from the data we pulled from the data warehouse.

df.isnull().sum().to_frame().T

Missing Values

B.3 Feature Engineering

Feature engineering is the process of transforming raw data into meaningful and informative features for the ML model to digest, thereby improving the performance of machine learning models.

It usually involves, but is not limited to, the following processes:

  • Creating New Features
  • Transforming Features
  • Selecting/Removing Features
  • Handling Missing/Noisy Data

B.3.1 Labeling the Sample Data

To be able to train and test the ML model, we need to label each sample as either positive (uptrend) or negative (downtrend) so that the model can learn to recognize the pattern and predict the positive samples.

In this research, we define the binary target variable (bool) to be:

  • 1 (Uptrend): Return > 0
  • 0 (Downtrend or Flat): Return ≤ 0.

Here are a few rules considered while we produce the daily return:

  1. To waive the factor of the overnight jump from impacting our desired predictability, we use $\text{close} - \text{open}$ instead of $\text{close} - \text{previous close}$ to simulate the daily return.
  2. We use log return instead of the arithmetic price change as the daily return ($\text{intraday return} = \log\left(\frac{\text{close}}{\text{open}}\right)$).
  3. We also need to produce another column called “predict_log_return” by applying .shift(-1) to “daily_log_return” as the y-label for the training/testing dataset.
  4. Lastly, we label each sample to [0,1] according to the rule we designed above.
    • Note : Here we label the entire dataset, including the test set, for the purpose of ML model performance evaluation.

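The four rules above can be sketched as follows; the tiny DataFrame here is a hypothetical stand-in for the SPY df, to which the same three derived columns are added:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the SPY DataFrame used in this report.
df = pd.DataFrame({'open': [100.0, 101.0, 102.0],
                   'close': [101.0, 100.5, 103.0]})

# Rules 1 & 2: intraday log return, close vs. open (not previous close).
df['daily_log_return'] = np.log(df['close'] / df['open'])

# Rule 3: shift the return back one day to form the prediction target.
df['predict_log_return'] = df['daily_log_return'].shift(-1)

# Rule 4: 1 for uptrend (return > 0), 0 for downtrend or flat.
df['y_label'] = (df['predict_log_return'] > 0).astype(int)
```

Note that the last row has no next-day return (NaN), so it is labeled 0 here and would be dropped along with the other NaN rows later.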

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))

df['daily_log_return'].plot(ax=ax1, color='coral')
ax1.set_title('Daily o-c Log SPY Return')
ax1.set_xlabel('Date')
ax1.set_ylabel('log return')
ax1.tick_params(axis='x', rotation=45)

df['daily_log_return'].plot(kind='box', ax=ax2, vert=False, color='skyblue')
ax2.set_title('Daily o-c Log SPY Return (Box plot)')
ax2.set_xlabel('Value')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Daily log returns

value_counts = df.loc[:, ['y_label']].value_counts()

ax = value_counts.plot.pie(
    subplots=True, legend=True,
    autopct='%1.2f%%',  # Show percentages
    startangle=90,
    wedgeprops={'linewidth': 1, 'edgecolor': 'white'},
    textprops={'fontsize': 8},
    figsize=(6, 3)
)

if isinstance(ax, np.ndarray):
    ax = ax[0]

for i, (count, pct) in enumerate(zip(value_counts, value_counts / sum(value_counts) * 100)):
    ax.text(0, -0.1 - i * 0.1, f"Count: {count}",
            transform=ax.transAxes, ha='center', fontsize=10)

plt.title('Binary Category Distribution with Counts')
plt.ylabel('')
plt.show()

Class distribution

We can tell that this is a relatively balanced dataset.

B.3.2 Creating New Features and Imputation

It is not enough to use only OHLC (open, high, low, close) prices when predicting the expected return of the next day. We need to derive factors that are potentially related to the growth of the future stock price. These factors might include technical indicators, momentum, trend strength, and others.

The factors we derived include:

  • Moving Averages (SMA, EMA).
  • RSI, MACD, Bollinger Bands.
  • Momentum, Volatility measures.
  • Lag Features: Past returns (ROC - Rate of Change).
  • Volume-based Features: Volume SMA.
  • Overnight changes: $\text{Open}-\text{Close}_{\text{previous day}}$, …

It is important to note that indicators like SMA, EMA, and Bollinger Bands are essentially smoothed price lines. These values by themselves don’t represent any trend, momentum, or signal. Therefore, I further processed them into features that might actually be meaningful when predicting the outcome.

for n in [5, 10, 15, 25]:
    df[f'sma{n}'] = ta.SMA(df['close'], n)
    df[f'ema{n}'] = ta.EMA(df['close'], n)
    df[f'roc{n}'] = np.log(df['close'] / (df['close'].shift(n)))
    df[f'volume{n}'] = ta.SMA(df['volume'], n)
    df[f'adx{n}'] = ta.ADX(df['high'], df['low'], df['close'], timeperiod=n)

for n in [5, 10, 15, 25]:
    df[f'sma{n}_diff_close'] = df[f'sma{n}'] - df['close']
    df[f'ema{n}_diff_close'] = df[f'ema{n}'] - df['close']
    df[f'volume{n}_diff_vol'] = df[f'volume{n}'] - df['volume']

for n in [10, 15, 25]:
    df[f'sma{n}_diff_5'] = df[f'sma{n}'] - df['sma5']
    df[f'ema{n}_diff_5'] = df[f'ema{n}'] - df['ema5']
    df[f'volume{n}_diff_volume5'] = df[f'volume{n}'] - df['volume5']

for n in [7, 14]:
    df[f'mfi{n}'] = ta.MFI(df['high'], df['low'], df['close'], df['volume'], timeperiod=n)
    df[f'rsi{n}'] = ta.RSI(df['close'], n)
    df[f'vol{n}'] = df['daily_log_return'].rolling(n).std() * np.sqrt(252)  # Annualized

# The overnight changes could also be an indicator of the next-day change
df['o_m_c_return'] = np.log(df['open'] / df['close'].shift(1))
df['o_m_c_return_lag1'] = df['o_m_c_return'].shift(1)
_, _, df['macd'] = ta.MACD(df['close'], fastperiod=12, slowperiod=26, signalperiod=9)

upperband, middleband, lowerband = ta.BBANDS(df['close'], timeperiod=5, nbdevup=1.5, nbdevdn=1.5, matype=0)
df['bband_uptrend'] = np.where(df['close'] - upperband > 0, 1, 0)
df['bband_downtrend'] = np.where(lowerband - df['close'] > 0, 1, 0)
df['bband_midtrend'] = np.where((df['close'] - upperband <= 0) & (lowerband - df['close'] <= 0), 1, 0)

df['MF_Multiplier'] = ((df['close'] - df['low']) - (df['high'] - df['close'])) / (df['high'] - df['low'])
df['MF_Multiplier'] = df['MF_Multiplier'].fillna(0)
df['MF_Volume'] = df['MF_Multiplier'] * df['volume']
df['ADL'] = df['MF_Volume'].cumsum()  # Cumulative sum
df['ADL_EMA3'] = df['ADL'].ewm(span=3, adjust=False).mean()
df['ADL_EMA10'] = df['ADL'].ewm(span=10, adjust=False).mean()
df['Chaikin_Oscillator'] = df['ADL_EMA3'] - df['ADL_EMA10']

df = df.drop(columns=['MF_Multiplier', 'MF_Volume', 'ADL', 'ADL_EMA3', 'ADL_EMA10'])

df['intra_momentum'] = (df['close'] - df['open']) / (df['high'] - df['close'])

df['divInfluence'] = df['divCash'].ewm(span=5).mean()
# We don't need these anymore since they are more of the smoothed data.
data = df.drop(columns=[
    'close', 'high', 'low', 'open', 'volume', 'divCash', 'sma5', 'sma10', 'sma15', 'sma25',
    'ema5', 'ema10', 'ema15', 'ema25', 'volume5', 'volume10', 'volume15', 'volume25',
])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 3))

msno.matrix(data, ax=ax1, color=(0.2, 0.4, 0.6), sparkline=False)
ax1.set_xticks([])
ax1.set_title('Missing Data Matrix')

np.isinf(data).sum().plot.bar(ax=ax2, color='skyblue')
ax2.set_title('Inf Values per Column')
ax2.set_ylabel('Count')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Missing and infinite values

In the first diagram, “Missing Data Matrix”, we see that the marked spots, representing missing data, are located at the beginning of the data. This indicates that these missing values were generated while we were creating the new features, since rolling-window indicators need a warm-up period. We can simply discard those rows. As for the second diagram, we see that the feature “intra_momentum” has a lot of np.inf values. This is because $\text{high}-\text{close}$ is either 0 or close to 0, indicating that the intraday momentum is huge at the end of the trading hour. Therefore, we can simply replace these values with the biggest finite number in that feature to represent the huge momentum.

# Drop the rows that have missing values
data.dropna(axis=0, inplace=True)

# Replace the inf values with the biggest finite number in that feature
max_finite = data['intra_momentum'].replace([np.inf, -np.inf], np.nan).max()
data['intra_momentum'] = np.where(np.isinf(data['intra_momentum']), max_finite, data['intra_momentum'])

B.3.3. Exploratory Data Analysis (EDA)

The purpose of Exploratory Data Analysis (EDA) is to examine, visualize, and summarize the dataset in order to recognize hidden patterns. It combines various statistical techniques and visual tools to transform raw data into actionable insights that help us further understand the data structure.

One rule of thumb that requires extra attention: always fit your scalers and transformers on the training data, then apply them to the testing data. This practice is essential for avoiding data leakage.
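In its simplest form the rule looks like this, sketched with a plain StandardScaler on hypothetical arrays; the custom transformers used later in this report follow the same fit-on-train, transform-on-test pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split; the scaler only ever sees the training data.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit statistics on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train mean/std
```

Calling fit (or fit_transform) on X_test would let test-set statistics leak into the preprocessing, which is exactly what this rule forbids.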

In this research, we’ve conducted the EDA from the following aspects:

B.3.3.1. Splitting Training and Testing Data

Before starting further processing of the data, we need to split the training/testing data before any data adjustment has happened. This prevents data leakage, where information from the test set influences the training process (e.g., during log transformation, feature scaling, or handling missing values). By splitting first, we ensure the test set remains a completely independent benchmark to evaluate the model’s generalization performance on unseen data, mimicking real-world deployment scenarios.

Also, don’t forget to add a gap between the training and testing datasets as it also prevents data leakage.

X = data.drop(columns=['daily_log_return', 'predict_log_return', 'y_label'])
y = data.loc[:, ['daily_log_return', 'predict_log_return', 'y_label']]

split_rate = 0.8

training_set_size = int(X.shape[0] * split_rate)

X_train = X.iloc[:training_set_size, :]
y_train = y.iloc[:training_set_size, :]
# Leave a one-row gap between the training and testing sets.
X_test = X.iloc[training_set_size + 1:, :]
y_test = y.iloc[training_set_size + 1:, :]

B.3.3.2 Drop Highly Correlated Features

We have produced a lot of features from the raw data, though they are yet to be examined. We need to screen out the features that are too highly correlated with one another. Categorical features are usually excluded from this correlation test. Remember, we use the training dataset to examine the correlation, avoiding data leakage.

categorical_features = ['bband_uptrend', 'bband_downtrend', 'bband_midtrend']
numerical_features = [x for x in X_train.columns if x not in categorical_features]

corr_matrix = X_train[numerical_features].corr().abs()
tria = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
cols_to_drop = [column for column in tria.columns if any(tria[column] > 0.9)]

Once we have identified the columns that are too highly correlated with one another, we drop them from both the training and testing datasets so that the two keep the same features.

X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

B.3.3.3 Handling Near-Zero Returns

For those samples that have relatively small daily log returns, we regard them as noise and would like to mitigate their impact on our machine learning model. There are several ways to pull this off.

  1. We delete those samples from our dataset.
  2. We decrease their weights while training the machine learning model.

Here, we employ a method that combines both approaches mentioned above. Analogous to the winsorization technique that is usually used to limit extreme feature values, what we do here is exponentially decrease the weights of the samples whose predicted log returns sit between the lower and upper bounds. The list of weights derived is preserved and fed to the model when we start training.

B.3.3.3.1 Define Near-Zero Returns
lower_bound = y_train['predict_log_return'].quantile(0.25)
upper_bound = y_train['predict_log_return'].quantile(0.75)
print(upper_bound, lower_bound)
# --> (np.float64(0.004149227614119242), np.float64(-0.0031685340375708798))
B.3.3.3.2 Design Decreasing Weights for Near-Zero Returns
epsilon = 0.0025
abs_returns = np.abs(y_train['predict_log_return'])
weights = np.ones_like(y_train['predict_log_return'])

mask = (y_train['predict_log_return'] >= lower_bound) & (y_train['predict_log_return'] <= upper_bound)

weights[mask] = 1 - np.exp(-abs_returns[mask] / epsilon)

y_train.loc[:, 'weight'] = weights
y_train.head()

Decreasing weights

B.3.3.4 Logarithmic Transformation and Standardization

Many ML algorithms (e.g., linear regression, logistic regression, neural networks) assume features are approximately normally distributed. Skewed data violates this assumption, leading to biased estimates. Therefore, one important step is to determine whether a feature is approximately normally distributed and then apply a log (or power) transformation accordingly.

numerical_features = [x for x in X_train.columns if x not in categorical_features]

n_cols = 5
n_rows = (len(numerical_features) + n_cols - 1) // n_cols  # Round up division

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 2))
axes = axes.flatten()  # Flatten to 1D array for easy iteration

for i, col in enumerate(numerical_features):
    sns.histplot(X_train[col], kde=True, bins=30, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}', fontsize=10)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

# Hide any unused subplots.
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Distribution Plots of All Features')
plt.tight_layout()
plt.show()

Distributions of all features

skewness_adjuster = SkewnessAdjustor()
X_train = skewness_adjuster.fit_transform(X_train, numerical_features)

After adjusting the skewness of the skewed features, we standardize all the features in our training dataset.

standard_scaler = StandardScaler()
X_train.loc[:, numerical_features] = standard_scaler.fit_transform(X_train.loc[:, numerical_features])

Lastly, remember to transform the test data using the scalers we just trained.

X_test = skewness_adjuster.transform(X_test, numerical_features)
X_test.loc[:, numerical_features] = standard_scaler.transform(X_test.loc[:, numerical_features])

B.4 Feature Selection

This step used to be part of the EDA. We separate it out here so that we can focus on the details of the feature selection methods. Let’s walk through the entire process while answering question 2.

The three methods mentioned are 1. Filter, 2. Wrapper, and 3. Embedded Methods. Each takes a different approach to selecting the most important features from the ones we just created.

  1. Filter Method (top-down): The main idea of the filter method is that we compare the features side by side to select the features that are most relevant to the desired outcome (y-label) before training the ML model. The steps are as follows:
    1. First, we remove highly correlated features to avoid feature duplication. (It’s done above)
    2. Use univariate feature selection, such as information gain or the Chi-square test, to select the features most relevant to the y-label.
    3. Remove features that are not suitable.
  2. Wrapper Method (bottom-up): It aims to rank feature importance iteratively by adding or removing features one at a time. There are three common ways to conduct the search:
    1. Forward Selection:
      • Train the model with one feature only, and find out the most important feature.
      • Given the most important feature, you add another feature and train the model to find out the next important feature among the rest.
      • Repeat the above step until you reach the number of features required.
    2. Backward Selection:
      • Similar to Forward Selection. Instead, you start with all features and leave one feature out to find out the least important feature.
      • You remove the least important feature, and then repeat the above step to find out the next least important feature. Remove that feature accordingly.
    3. Exhaustive Selection
      • You evaluate all possible combinations of features and find the set that performs best.
      • Exhaustive selection is usually not recommended when the number of features is large, since it consumes a lot of resources.
  3. Embedded Method: As the name suggests, this method integrates feature selection directly into the ML model training process. The most common embedded strategies are Lasso regularization for linear regression models and information gain for decision tree models.
    1. Build a basic model with L1 Regularization.
    2. Train the model with all the features.
    3. Check the importance rank after the model is trained.

(b) Justify the selection of features retained at each step

b.1. Filter

The SelectKBest API in the sklearn package lets us apply the filter method to find the features we want. We first use f_classif to conduct an ANOVA test and select the top 60% of the numerical features. Then we use mutual_info_classif to rank the features by their information gain, again picking the top 60%. Finally, we take the union of these two feature sets to create a pool of features that is worth further investigation.

k_best_selector = SelectKBest(f_classif, k=int(X_train.columns.shape[0]*0.6))
anova_scores = k_best_selector.fit(X_train, y_train['y_label'].values.flatten())

feature_scores = pd.DataFrame({'Feature': X_train.columns, 'Score': anova_scores.scores_}).set_index('Feature')
feature_scores.sort_values('Score', ascending=False, inplace=True)
feature_scores.plot.bar(figsize=(10,3), title='Feature Scores of ANOVA Test')

ANOVA feature scores

anova_features = anova_scores.get_feature_names_out()
anova_features
# --> array(['roc5', 'roc10', 'roc15', 'roc25', 'sma10_diff_5', 'volume10_diff_volume5', 'volume15_diff_volume5', 'rsi7', 'vol7', 'mfi14', 'vol14', 'o_m_c_return_lag1', 'macd', 'bband_downtrend', 'Chaikin_Oscillator', 'divInfluence'], dtype=object)
k_best_selector_2 = SelectKBest(score_func=mutual_info_classif, k=int(X_train.columns.shape[0]*0.6))
mutual_if_scores = k_best_selector_2.fit(X_train, y_train['y_label'].values)

feature_scores = pd.DataFrame({'Feature': X_train.columns, 'Score': mutual_if_scores.scores_}).set_index('Feature')
feature_scores.sort_values('Score', ascending=False, inplace=True)
feature_scores.plot.bar(figsize=(10,3), title='Feature Scores of Mutual Information Test')

Mutual information feature scores

Looking at this diagram, we select the features that score above 0 instead of the top 60% of the features.
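A minimal sketch of how the filter-method pool could be assembled from the two selectors; the values below are hypothetical stand-ins for the `anova_features` array and the mutual-information `feature_scores` frame computed above:

```python
import pandas as pd

# Hypothetical stand-ins for the filter-method outputs above.
anova_features = ['roc5', 'roc10', 'macd']
feature_scores = pd.DataFrame(
    {'Score': [0.02, 0.0, 0.01]},
    index=pd.Index(['roc5', 'vol7', 'rsi7'], name='Feature'),
)

# Keep MI features with a strictly positive score, then pool with the ANOVA picks.
mi_features = feature_scores.index[feature_scores['Score'] > 0]
filter_features = set(anova_features) | set(mi_features)
```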

b.2. Wrapper

In the wrapper method, we use both forward and backward selection so that we can compare and contrast their results. In the end, we take the union of the two feature sets to obtain the features favored by either direction. The sklearn package has built-in support for both: SequentialFeatureSelector serves the forward method, while RFE and RFECV implement the backward method.

b.2.1 Forward
sfs = SFS(
    estimator=RandomForestClassifier(n_estimators=5),
    n_features_to_select=0.6,
    direction='forward',
    scoring='accuracy',
    cv=5,
)
sfs.fit(X_train, y_train['y_label'].values.flatten())

wrapper_forward_features = sfs.get_feature_names_out()
print("Best subset (Forward):", wrapper_forward_features)
# --> Best subset (Forward): ['adx10' 'roc25' 'adx25' 'volume5_diff_vol' 'sma10_diff_5' 'volume15_diff_volume5' 'mfi7' 'vol7' 'vol14' 'macd' 'bband_uptrend' 'bband_downtrend' 'bband_midtrend' 'Chaikin_Oscillator' 'intra_momentum' 'divInfluence']
b.2.2 Backward
estimator = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    max_iter=1500
)
rfe = RFE(estimator, n_features_to_select=0.6, step=1)
rfe.fit(X_train, y_train['y_label'].values.flatten())

wrapper_backward_features = rfe.get_feature_names_out()
print("Best subset (Backward):", wrapper_backward_features)
# --> Best subset (Backward): ['roc5' 'roc10' 'roc15' 'sma5_diff_close' 'sma10_diff_5' 'volume10_diff_volume5' 'volume25_diff_volume5' 'mfi7' 'vol7' 'o_m_c_return' 'o_m_c_return_lag1' 'macd' 'bband_downtrend' 'bband_midtrend' 'intra_momentum' 'divInfluence']

b.3. Embedded

Lastly, we select a Logistic Regression model with L1 regularization to perform the model training. Feature importance is calculated automatically during training, so we can easily fetch the features that are most relevant to the target movement.

embedded_model = SelectFromModel(
    LogisticRegression(
        penalty='l1', solver='liblinear'
    ),
    max_features=int(len(X_train.columns) * 0.6)
)
embedded_model.fit(X_train, y_train['y_label'].values.flatten())

embedded_features = embedded_model.get_feature_names_out()
print("Best subset (Embedded):", embedded_features)

# --> Best subset (Embedded): ['roc5' 'roc15' 'adx25' 'sma5_diff_close' 'sma10_diff_5' 'volume10_diff_volume5' 'volume25_diff_volume5' 'mfi7' 'vol7' 'o_m_c_return' 'o_m_c_return_lag1' 'macd' 'bband_downtrend' 'bband_midtrend' 'intra_momentum' 'divInfluence']

(c) Provide the final list of selected features

The idea here is that we take the union of the results of the wrapper (forward and backward) methods and the embedded method, and then keep only the features that also appear in the results of the filter method. This hybrid approach offers advantages over a strict intersection of all methods: it retains features that show statistical significance (filter methods) together with model importance from at least one search (wrapper/embedded methods), without being overly restrictive. This balanced approach should yield more stable features than any single method alone, whereas requiring features to pass every selection method might exclude potentially useful predictors.

# `filter_features` is the pooled result of the filter methods above
# (the ANOVA-selected features plus the positive-score mutual-information features).
final_features = list(set(filter_features) & (set(wrapper_forward_features) | set(wrapper_backward_features) | set(embedded_features)))

table = PrettyTable()
table.title = "Final Selected Features"
table.field_names = ["Feature Names", "Description"]
for feature in final_features:
    if 'roc' in feature:
        table.add_row([feature, 'Rate of change indicator'])
    elif 'sma' in feature:
        table.add_row([feature, 'Smoothing average indicator'])
    elif 'Chaikin' in feature:
        table.add_row([feature, 'Chaikin Oscillator'])
    elif 'momentum' in feature:
        table.add_row([feature, 'Momentum factor'])
    elif 'mfi' in feature:
        table.add_row([feature, 'Money Flow Index'])
    elif 'rsi' in feature:
        table.add_row([feature, 'Relative Strength Index'])
    elif 'volume' in feature:
        table.add_row([feature, 'Volume indicator'])
    elif 'bband' in feature:
        table.add_row([feature, 'Bollinger Band indicator'])
    elif 'vol' in feature:
        table.add_row([feature, 'Volatility measure'])
    elif 'adx' in feature:
        table.add_row([feature, 'Average Directional Index'])
    else:
        table.add_row([feature, 'Other indicators'])

print(table)



C. Model Building, Tuning, and Evaluation

After all the steps we’ve taken, we finally have the data prepared and ready to feed to the machine learning model. In this section, we transition from data preparation to the core of machine learning: building, optimizing, and validating predictive models. There are several topics that we’re going to address:

  1. Train our basic model
  2. Fine-tune the hyperparameter
  3. Evaluate the performance against the metrics
  4. Finally, backtest this strategy against the buy-and-hold strategy

This end-to-end pipeline ensures our model is both accurate and generalizable to unseen market conditions.

(a) Build a model to predict positive market moves (uptrend) using the feature subset derived above.

Modern Python ML libraries are relatively mature and easy to use. The table below summarizes the pros and cons of the common gradient boosting models.

Model       | Best For         | Speed    | Handles Categorical Features   | Missing Values
XGBoost     | Accuracy         | Fast     | Needs encoding (mostly)        | ✅ Yes
LightGBM    | Big Data         | Fastest  | Handles well (better than XGB) | ✅ Yes
CatBoost    | Categorical Data | Moderate | Best-in-class                  | ✅ Yes
Sklearn GBM | Baseline         | Slow     | Needs encoding                 | ❌ No

GB Models Comparison Summary

  • For highest accuracy: XGBoost (with hyperparameter tuning).
  • For fastest training: LightGBM (especially on large datasets).
  • For categorical data: CatBoost (minimal preprocessing needed).
  • For simplicity: Scikit-learn’s GradientBoostingClassifier.

Our dataset is relatively balanced, small, and has no missing values according to our observations above. Also, we are pursuing high accuracy in order to maximize our strategy’s performance. I chose XGBoost as the baseline model for this research: XGBoost handles small-to-medium datasets well, has built-in regularization, and trains faster than traditional GBDT. Let’s see how it performs on the dataset we have prepared.

model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    early_stopping_rounds=50,
    verbosity=0,
    random_state=42
)

X_train = X_train.loc[:, final_features]
X_test = X_test.loc[:, final_features]

_ = model.fit(
    X_train, y_train['y_label'].values.flatten(),
    eval_set=[(X_train, y_train['y_label'].values.flatten()), (X_test, y_test['y_label'].values.flatten())],
    sample_weight=y_train.loc[X_train.index, 'weight'],
    verbose=False,
)

(b) Tune the hyperparameters of the estimator to obtain an optimal model

The sklearn package has a built-in API, GridSearchCV, for training the model across multiple cross-validation folds. We use it to train the model multiple times: each run uses the same training dataset but a different parameter set, in order to find the combination that lets the model perform best.

params = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200, 300],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.05, 0.1],
    'reg_lambda': [0, 0.05, 0.1]
}

grid = GridSearchCV(
    estimator=model,
    param_grid=params,
    cv=5,
    scoring='roc_auc',
    verbose=0,
    n_jobs=1
)
grid.fit(
    X_train,
    y_train['y_label'].values.flatten(),
    eval_set=[(X_train, y_train['y_label'].values.flatten()), (X_test, y_test['y_label'].values.flatten())],
    sample_weight=y_train.loc[X_train.index, 'weight'],
    verbose=False,
)

best_model = grid.best_estimator_
print(grid.best_params_)

Here we find that, using roc_auc as the scoring function, the following parameter set allows the model to maximize its predictive performance:

{'gamma': 0.2,
 'learning_rate': 0.1,
 'max_depth': 7,
 'n_estimators': 100,
 'reg_alpha': 0,
 'reg_lambda': 0}

We can also obtain the trained model with the best parameter set via

grid.best_estimator_

which we save as the best_model variable.

(c) Evaluate the Model’s Prediction Quality

Let’s have a look at the performance and quality of our trained model by examining the following metrics.

Firstly, we use the testing dataset to predict the next-day SPY return.

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

(c.1) Log Loss Progress

plt.plot(best_model.evals_result_['validation_0']['logloss'], label='Train Log Loss')
plt.plot(best_model.evals_result_['validation_1']['logloss'], label='Test Log Loss')
plt.legend()
plt.title("Train Log Loss vs. Test Log Loss")
plt.ylabel("Log Loss")
plt.show()

After plotting the log loss on both the training and testing datasets, you’ll clearly see that, even though the training log loss decreases significantly, the test log loss barely decreases and the gap between them widens over time. This is a sign that our model is likely overfitting. However, the samples are few and the training log loss has not yet converged, so the most we can say is that we need more samples to train the model before judging which regime it is in.
