Groundhog Day is quite the entertaining movie. For those of you who have never seen it, Bill Murray plays Phil Connors, who gets caught in a strange time loop in which he relives February 2nd (Groundhog Day) over and over again. In fact, he relives it so many times that he's able to use his knowledge of what will happen to change his behavior for his own benefit. This sort of foresight makes for good entertainment, but it can pose significant problems in the world of machine learning.
Objective & Target Audience of Post
The goal of this post is to present two common sources of error when using quantitative methods to analyze time-series data and to provide examples of how we address those errors at Apteo. It's intended for anyone interested in using data-driven or quantitative methods to analyze time-series data, specifically financial data. Though some of it may be a bit technical, we try to keep it at a high enough level that you can follow the concepts at play.
Time Dependent Data in the Stock Market
Milton creates predictions for financial markets, and today we focus primarily on US stocks. The team behind Milton relies heavily on machine learning (specifically deep neural networks) to predict the future of individual stocks. Like any task that involves predicting far into the future, this process requires data with a significant time-dependent structure.
Working with this type of data presents a specific set of challenges that requires very careful attention, because it’s very easy for lookahead bias to creep into the data science process in subtle ways. Here’s an example we encountered in our early days.
We train a variety of neural networks to predict future stock returns over a variety of time frames. When we first started out, we used a standard data science approach to address this task. We first split our dataset into a training set that contained the first 70–80% of our data (when ordered chronologically), then we used the remaining data as our test set.
The label for each instance was generated by taking the adjusted close of each stock at some point in the past and comparing that to the adjusted close of that stock at some later date (also in the past).
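The setup described above can be sketched as follows. This is a minimal illustration rather than our production code: the function names, the toy price series, and the three-period horizon are all assumptions made for the example.

```python
from datetime import date, timedelta

def make_labels(prices, horizon):
    """Build labeled instances for one stock.

    prices: list of (date, adjusted_close) tuples sorted by date.
    horizon: number of periods between an instance and the price used for its label.
    Returns a list of (instance_date, label_date, forward_return) tuples.
    """
    instances = []
    for i in range(len(prices) - horizon):
        day, start_price = prices[i]
        label_day, end_price = prices[i + horizon]
        instances.append((day, label_day, end_price / start_price - 1.0))
    return instances

def chronological_split(instances, train_fraction=0.8):
    """Use the first train_fraction of the (date-ordered) instances for training."""
    cut = int(len(instances) * train_fraction)
    return instances[:cut], instances[cut:]

# Toy data (hypothetical): ten consecutive periods with a steadily rising price.
prices = [(date(2012, 1, 3) + timedelta(days=i), 100.0 + i) for i in range(10)]
instances = make_labels(prices, horizon=3)
train, test = chronological_split(instances, train_fraction=0.8)
```

Each instance carries two dates here: the date it describes and the later date whose price produced its label. Keeping both around makes it possible to reason about what information a label actually contains.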
At first glance, this may seem acceptable. Our training data contains instances whose dates are before those of any instances in our test data, which means that when we evaluate our network, we evaluate it on unseen data.
However, we subtly introduced lookahead bias, and that bias was affecting our evaluation metrics.
An example may help to illustrate the issue.
Lookahead Bias in the Evaluation Process
Suppose each label measures a stock's return over the following year. If the last training instance in our dataset was for Apple on January 3, 2012 (the first trading day of that year, since the New Year's holiday was observed on Monday, January 2), then we would need Apple’s stock price from January 3, 2013 to create the label for that instance. Now what would happen if we used our trained network to predict the return on Apple’s stock for January 9, 2012?
Even though every training instance is dated before every test instance, the labels of the latest training instances were built from prices recorded well after January 9, 2012. The network has therefore absorbed knowledge about Apple’s stock price on days after January 9, 2012, which it could then use when making predictions for Apple on January 9, 2012.
This subtle issue is actually an example of lookahead bias creeping into the evaluation stage of our data science processes, and it caused our network to appear more accurate than it actually was.
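One way to spot this kind of leak is to compare each training instance's label date with the first date in the test set. The sketch below assumes every instance records both its own date and its label date; the field names, the helper functions, and the sample dates are hypothetical.

```python
from datetime import date

def leaking_instances(train, test_start):
    """Return training instances whose label uses a price from after test_start."""
    return [inst for inst in train if inst["label_date"] > test_start]

def purge(train, test_start):
    """Keep only training instances whose label was fully known by test_start."""
    return [inst for inst in train if inst["label_date"] <= test_start]

# Hypothetical instances with a roughly one-year label horizon.
train = [
    {"ticker": "AAPL", "date": date(2010, 6, 1), "label_date": date(2011, 6, 1)},
    {"ticker": "AAPL", "date": date(2011, 12, 15), "label_date": date(2012, 12, 14)},
]
test_start = date(2012, 1, 9)

leaks = leaking_instances(train, test_start)  # the December 2011 instance
clean_train = purge(train, test_start)        # only the June 2010 instance remains
```

The second instance is dated before the test set begins, yet its label depends on a price from December 2012, so a network trained on it has effectively seen the future relative to January 9, 2012.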
We've seen that lookahead bias can affect both the training and evaluation stages of a model, but that’s not the only gotcha to be aware of when it comes to time-dependent data, especially in the world of investing.
Another issue in finance is that the distribution of future data doesn’t always match the distribution of previous data. In finance, you’ll frequently hear this referred to as “regime changes”. The idea here is that the market can quickly shift from sideways to choppy to bullish to bearish to recession to breakout, all in ways that have never been observed in the past.
When creating time-dependent predictions, this is problematic. Using only a single time period for testing our predictions may not capture the accuracy of our network in different historical regimes.
It’s possible for our network to be accurate today. However, one year ago, a network trained in the exact same way as the one trained today may have been extremely inaccurate.
Had we kept evaluating on a single chronological test set, we would never have seen that effect.
So the natural question is: how do we account for both lookahead bias and changing regimes?
Our answer lies in walk-forward cross-validation.
In machine learning problems without a strong time dependency, it’s quite common to use k-fold cross-validation to evaluate a trained model. The idea behind this technique is fairly simple:
- Select a value for k (for us this is often 10)
- For each of k iterations, hold out a different subset containing 1/k of the data points (the subsets do not overlap) and use that as the test set
- Use the remaining data as the training set
- Train and evaluate your model on the training and test datasets as normal
- Keep track of the metrics on each of the test sets
- When all k models are trained, average the metrics from each of the test sets together to get a final value for the entire model (at this point it would be necessary to train a model on the entire dataset to get the actual model to be used in production)
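The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch: `train_and_score` is a hypothetical callback standing in for whatever trains a model on the training indices and returns a metric on the test indices, and shuffling the data before splitting (common in practice) is omitted for brevity.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k non-overlapping contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, train_and_score):
    """Average the metric returned by train_and_score over all k folds."""
    scores = [train_and_score(tr, te) for tr, te in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)
```

Note that every fold except the last trains on data that comes after its test set, which is exactly what makes plain k-fold unsuitable for time-dependent data.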
For time-dependent data, the same idea is used. The difference, though, lies in how the test and training sets are created in each iteration. Instead of holding out an arbitrary 1/k of the data on each repetition, the end date of the training dataset and the test period that follows it are walked forward on each iteration. The network is trained from the beginning of the entire dataset up until the ending period of the training dataset, which must fall early enough that no training label relies on prices from the test period (this is how we account for the lookahead bias we mentioned above).
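A minimal sketch of this kind of split, assuming one labeled instance per period in chronological order and a label that looks `horizon` periods ahead. The function name and the `min_train` parameter are assumptions for illustration, not a description of our exact implementation.

```python
def walk_forward_splits(n_samples, k, horizon, min_train):
    """Yield (train_indices, test_indices) for k expanding-window folds.

    Sample i is assumed to be labeled with a price from period i + horizon.
    A training sample is kept only if its label period is no later than the
    first test period, so no training label looks past the start of the test set.
    """
    if min_train < 1 or horizon < 0:
        raise ValueError("min_train must be >= 1 and horizon must be >= 0")
    first_test_start = min_train - 1 + horizon
    test_size = (n_samples - first_test_start) // k
    if test_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    for fold in range(k):
        test_start = first_test_start + fold * test_size
        test_end = test_start + test_size
        train_end = test_start - horizon + 1  # exclusive upper bound
        yield list(range(0, train_end)), list(range(test_start, test_end))

# Example: 20 samples, 3 folds, labels that look 2 periods ahead.
folds = list(walk_forward_splits(n_samples=20, k=3, horizon=2, min_train=5))
```

In this example the training set grows from 5 to 9 to 13 samples while the test window slides forward, and each training set stops `horizon` periods before its test window begins. Any samples left over after the last full test window are simply unused.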
When all networks are done training, the metrics for each test set can be combined in a weighted average to give the cross-validated error for the model that was trained on the entire dataset.
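As a sketch of that last step, the snippet below weights each fold's metric by the number of instances in its test set. The choice of weights is an assumption made for this example; other schemes (for instance, weighting recent folds more heavily) are equally possible.

```python
def weighted_average(fold_metrics, fold_sizes):
    """Combine per-fold metrics, weighting each by its test-set size."""
    if len(fold_metrics) != len(fold_sizes):
        raise ValueError("need exactly one size per metric")
    total = sum(fold_sizes)
    if total == 0:
        raise ValueError("fold sizes must not all be zero")
    return sum(m * n for m, n in zip(fold_metrics, fold_sizes)) / total

# Hypothetical per-fold errors and test-set sizes.
cv_error = weighted_average([0.1, 0.2, 0.4], [10, 10, 20])
```

With these made-up numbers, the largest fold dominates, so the combined error sits closer to 0.4 than a plain mean of the three values would.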
Benefits and Disadvantages
Using this strategy, we can account for different regimes and also avoid accidentally introducing lookahead bias into our evaluation process. This gives us a far more realistic estimate of our models’ accuracy than a single chronological split does.
This approach does have its drawbacks, though. For one, training k additional networks takes additional time and GPU resources. In addition, the selection of k can be subjective and requires some underlying understanding of historical data distributions (though it could be argued that a good data scientist would already care about this anyway).
Please Get In Touch!
If you're interested in keeping up with what we're doing, you can sign up for Milton here and subscribe to our newsletter.
Also, please feel free to reach out directly at email@example.com with any questions.
Apteo, the company behind Milton, is made up of curious data scientists, engineers, and financial analysts based in the Flatiron neighborhood in New York City. We have a passion for technology and investing, and we strongly believe that investing is one of the most reliable and effective ways to build long-term wealth. We build AI tools to help informed investors make better decisions.
Apteo, Inc. is not an investment advisor and makes no representation or recommendation regarding investment in any fund or investment vehicle.