Vitoria Lima

Stock Return Prediction

Predicting the sign of stock returns in a noisy time series tabular dataset

Figure 1: Technical analysis of stock 1

Challenge Goals

This project is based on a Data Challenge by QRT that aims at predicting the return of a stock in the US market using historical data over a recent period of 20 days. The one-day return of a stock j on day t with price \( P_{j}^{t} \) (adjusted for dividends and stock splits) is given by:

\( R_{j}^{t} = \frac{P_{j}^{t}}{P_{j}^{t-1}} - 1 \)
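
As a minimal sketch of the return formula above, the snippet below computes one-day returns from a short price series. The function name `one_day_returns` and the sample prices are illustrative assumptions, not part of the challenge data.

```python
import numpy as np

def one_day_returns(prices):
    """Compute R^t = P^t / P^(t-1) - 1 for every day after the first."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0

# Hypothetical adjusted prices, for illustration only
prices = [100.0, 102.0, 99.96]
returns = one_day_returns(prices)  # one value per consecutive pair of days
```

The first day has no previous price, so the output is one element shorter than the input.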

In this challenge, we consider the residual stock return, which corresponds to the return of a stock with the market impact removed. Historical data consist of residual stock returns and relative volumes, sampled each day over the last 20 business days (approximately one month). The relative volume \( V_{j}^{t} \) at time t of a stock j among the n stocks is defined by:

\( \overline{V}_{j}^{t} = \frac{V_{j}^{t}}{\text{median}(\{V_{j}^{t-1}, \ldots, V_{j}^{t-20}\})} \)

\( V_{j}^{t} = \overline{V}_{j}^{t} - \frac{1}{n} \sum_{i=1}^{n} \overline{V}_{i}^{t} \)

where, on the right-hand side of the first equation, \( V_{j}^{t} \) is the raw volume at time t of stock j. Additional information about each stock, such as its industry and sector, is also provided.
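
The two-step definition of relative volume can be sketched as follows, assuming raw volumes are available as a days-by-stocks array (the challenge itself provides the relative volumes already computed). The function name `relative_volume` and the toy numbers are hypothetical.

```python
import numpy as np

def relative_volume(volumes, window=20):
    """volumes: array of shape (T, n), raw volume per day (rows) and stock (columns).

    Returns the relative volume of each stock on the last day:
    volume divided by the median of the previous `window` days,
    then centered by subtracting the cross-sectional mean.
    """
    volumes = np.asarray(volumes, dtype=float)
    today = volumes[-1]
    history = volumes[-1 - window:-1]           # days t-window .. t-1
    scaled = today / np.median(history, axis=0)  # per-stock normalization
    return scaled - scaled.mean()                # remove the market-wide level

# Toy example with a 3-day window and two stocks
toy = [[10, 20], [10, 20], [10, 20], [20, 20]]
rel = relative_volume(toy, window=3)
```

Because of the centering step, the relative volumes of all stocks on a given day sum to zero.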

The metric considered is the accuracy of the predicted residual stock return sign.
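
For concreteness, the metric can be sketched as the fraction of observations whose predicted sign matches the true sign. The helper `sign_accuracy` is an illustrative assumption, and treating a value of exactly zero as "not positive" is a convention chosen here, not something stated by the challenge.

```python
import numpy as np

def sign_accuracy(y_true, y_pred):
    """Share of observations where predicted and true returns have the same sign."""
    true_up = np.asarray(y_true) > 0   # True when the residual return is positive
    pred_up = np.asarray(y_pred) > 0
    return float(np.mean(true_up == pred_up))

# Illustrative values only
acc = sign_accuracy([0.01, -0.02, 0.03, -0.01], [0.5, 0.2, 0.1, -0.3])
```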

Data Description

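The dataset comprises 46 descriptive features (all float/int values). These include the daily residual returns and relative volumes over the 20-day window described above, along with stock descriptors such as industry and sector.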

The target variable is the sign of the residual stock return at time t (binary). The dataset contains 418,595 observations for training and 198,429 observations for testing.

Feature Engineering

To enhance the dataset and improve prediction accuracy, the following feature engineering techniques were applied:

Volatility Measures

Rolling Standard Deviation:

\( \text{Rolling Std} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (R_i - \overline{R})^2} \)

Volatility of Volatility (Vol of Vol), i.e. the standard deviation of the rolling standard deviation:

\( \text{Vol of Vol} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (\sigma_i - \overline{\sigma})^2} \)
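
A minimal sketch of both volatility measures, assuming the returns of one stock are held in a 1-D array and using a window length chosen for illustration (the window actually used in the project is not stated). The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_std(x, window):
    """Sample standard deviation (ddof=1) over each sliding window of x."""
    windows = sliding_window_view(np.asarray(x, dtype=float), window)
    return windows.std(axis=1, ddof=1)

def vol_of_vol(x, window):
    """Rolling standard deviation of the rolling standard deviation."""
    return rolling_std(rolling_std(x, window), window)

# 20 synthetic daily returns, mirroring the 20-day history in the dataset
rng = np.random.default_rng(0)
rets = rng.normal(scale=0.01, size=20)
vol = rolling_std(rets, window=5)   # 16 values
vov = vol_of_vol(rets, window=5)    # 12 values
```

Each pass shortens the series by `window - 1` points, so a short history limits how large the window can be.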

Figure 2: A look into volatility of stock 5

Technical Indicators

Money Flow Index (MFI):

\( \text{MFI} = 100 - \left( \frac{100}{1 + \frac{\sum(\text{Positive Money Flow})}{\sum(\text{Negative Money Flow})}} \right) \)

Relative Strength Index (RSI):

\( \text{RSI} = 100 - \left( \frac{100}{1 + \frac{\text{Average Gain}}{\text{Average Loss}}} \right) \)
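
The RSI formula can be sketched as below. This version uses plain averages of gains and losses over the last `period` observations; Wilder's smoothed averages are a common alternative, and the source does not say which variant was used. With this dataset the inputs would presumably be the daily residual returns, which is an assumption.

```python
import numpy as np

def rsi(changes, period=14):
    """RSI from a series of changes (price differences or returns), simple-average variant."""
    changes = np.asarray(changes, dtype=float)[-period:]
    avg_gain = np.clip(changes, 0, None).mean()    # mean of positive moves
    avg_loss = np.clip(-changes, 0, None).mean()   # mean magnitude of negative moves
    if avg_loss == 0:
        return 100.0  # no losses in the window: RSI saturates at 100
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)

balanced = rsi([1, -1, 2, -2], period=4)   # gains equal losses
bullish = rsi([3, -1], period=2)           # gains are three times the losses
```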

Accumulation/Distribution Line (ADL):

\( \text{ADL} = \sum \left( \frac{(C - L) - (H - C)}{H - L} \times V \right) \)
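
As an illustration of the ADL formula, the sketch below accumulates the money flow volume day by day. It assumes high, low, close, and volume series are available, which the challenge data described above does not provide directly; how the project derived these inputs is not stated. Setting the multiplier to zero when high equals low is a convention chosen here.

```python
import numpy as np

def adl(high, low, close, volume):
    """Accumulation/Distribution Line as a running sum of money flow volume."""
    high, low, close, volume = (
        np.asarray(a, dtype=float) for a in (high, low, close, volume)
    )
    span = high - low
    # Money flow multiplier in [-1, 1]; left at 0 where high == low
    mult = np.divide(
        (close - low) - (high - close), span,
        out=np.zeros_like(span), where=span != 0,
    )
    return np.cumsum(mult * volume)

# Day 1 closes at the high (multiplier +1), day 2 closes at the low (multiplier -1)
line = adl(high=[10, 12], low=[8, 10], close=[10, 10], volume=[100, 200])
```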

Average True Range (ATR):

\( \text{ATR} = \frac{1}{n} \sum_{i=1}^{n} \text{TR}_i \)

Moving Average Convergence Divergence (MACD):

\( \text{MACD} = \text{EMA}_{12} - \text{EMA}_{26} \)
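
The MACD line can be sketched with a hand-rolled exponential moving average, as below. The recursive EMA with smoothing factor 2 / (span + 1), seeded with the first value, is one common convention and is an assumption here. Since the dataset provides returns rather than prices, a price proxy such as the cumulative product of (1 + return) would be needed, and with only 20 days of history the spans may have to be shorter than 12 and 26.

```python
import numpy as np

def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with the first value."""
    values = np.asarray(values, dtype=float)
    alpha = 2.0 / (span + 1.0)
    out = np.empty(len(values))
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1.0 - alpha) * out[i - 1]
    return out

def macd(prices, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA."""
    return ema(prices, fast) - ema(prices, slow)

flat = macd(np.full(30, 100.0))       # constant prices: no divergence
rising = macd(np.arange(30.0))        # steady uptrend: fast EMA sits above slow EMA
```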

Implementation Structure

Data Cleaning and Preprocessing

Exploratory Data Analysis (EDA)

Figure 3: A look into stock 2 distribution
Figure 4: A look into distributions of sectors

Advanced Feature Engineering

Technical Indicators: To extract information from the dataset, technical indicators were implemented in code. These are standard indicators, such as those defined in the Technical Indicators section above, that are also available in the TA-Lib library.

Figure 5: Technical analysis of stock 1 with RET_5 focus

Statistical Indicators: To extract more information, factor investing techniques were considered. Principal components were computed per SECTOR and per INDUSTRY (aggregated per stock category), both with and without whitening. These features contributed less than the technical indicators to explaining the target variable RET.
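
One way the per-group principal components could be computed is sketched below with scikit-learn, fitting a separate PCA inside each group and comparing whitened and unwhitened scores. The helper `pca_features_by_group`, the column names `RET_1` to `RET_5`, the number of components, and the synthetic data are all assumptions for illustration; the project's actual aggregation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_features_by_group(df, group_col, feature_cols, n_components=2, whiten=False):
    """Fit one PCA per group and return the component scores, aligned with df's index."""
    names = [f"PC{k + 1}_{group_col}" for k in range(n_components)]
    out = pd.DataFrame(index=df.index, columns=names, dtype=float)
    for _, block in df.groupby(group_col):
        pca = PCA(n_components=n_components, whiten=whiten)
        out.loc[block.index, names] = pca.fit_transform(block[feature_cols].to_numpy())
    return out

# Synthetic stand-in for the return columns and sector labels
rng = np.random.default_rng(0)
ret_cols = [f"RET_{i}" for i in range(1, 6)]
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=ret_cols)
df["SECTOR"] = rng.integers(0, 4, size=200)

plain = pca_features_by_group(df, "SECTOR", ret_cols)
white = pca_features_by_group(df, "SECTOR", ret_cols, whiten=True)
```

Whitening rescales each component to unit variance within its group, so the two variants differ only in scale, not in direction.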

Prediction Model

The model is a Random Forest evaluated with stratified cross-validation, so that performance is assessed on held-out folds and the model is not tuned to any single portion of the training set.
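
A minimal sketch of this setup with scikit-learn is shown below. The synthetic features, the number of folds, and the hyperparameters (`n_estimators`, `max_depth`) are placeholders chosen for illustration, not the values used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 10 features, binary target driven by the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# Stratification keeps the share of positive and negative labels similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean_accuracy = scores.mean()  # average held-out accuracy across the five folds
```

Reporting the spread of `scores` alongside the mean gives a rough sense of how stable the accuracy is from fold to fold.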

Results

Outcome: On the leaderboard, this approach ranked 70th out of 399 submissions, placing it in roughly the top 17.5% of submissions (70/399).

Future Work

Future ideas include:

Resources

GitHub Repository