While improving machine learning (ML) models and making them more sophisticated is important, a lot of value can be extracted, at the same level of model sophistication, by optimising prediction pre- and post-processing.
By making ML models more sophisticated, I mean progressions such as moving from a decision tree to a random forest to gradient boosting. By prediction pre-processing, I mean feeding the ML model causal data signals. By prediction post-processing, I mean further aligning the ML model's predictions with ground realities.
The kitchen-sink approach to ML modelling is to throw all available variables into the model and let it predict the dependent variable. The assumption is that the model will identify unknown, and hopefully causal, patterns or signals in those variables.
However, ML models are sensitive to how you tune them and to the data they train on. For different tuning parameters and training data, the model latches onto different variables that best predict the dependent variable. The difference in prediction performance across tuning parameters and datasets may be negligible, i.e., the models look identical to an observer. Internally, however, within the ML "black box", they may be making predictions from completely different sets of variables.
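This is easy to demonstrate. The sketch below (my own illustration, not from any particular study; all variable names and data are assumptions) fits two random forests that differ only in their random seed, on data where `x1` is the causal driver and `x2` is merely an in-sample correlate of it. Both models score almost identically out of sample, yet the share of importance each assigns to `x1` versus `x2` can differ between the two fits.

```python
# Sketch: two equally accurate models can rely on different variables.
# Synthetic data and names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 2000
x1 = rng.normal(size=n)                        # the causal driver
x2 = x1 + rng.normal(scale=0.05, size=n)       # an in-sample correlate of x1
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)   # y depends only on x1

X = np.column_stack([x1, x2])
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Identical data, identical hyperparameters -- only the random seed differs.
model_a = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
model_b = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

score_a = r2_score(y_test, model_a.predict(X_test))
score_b = r2_score(y_test, model_b.predict(X_test))

# Test-set performance looks the same to an observer...
print(f"R^2: {score_a:.3f} vs {score_b:.3f}")
# ...but the split of importance between the causal x1 and its correlate x2
# can shift from one fit to the other.
print("importances:", model_a.feature_importances_, model_b.feature_importances_)
```

Since `x2` carries almost the same information as `x1` in-sample, either model would deploy fine until the correlation between them breaks down in the real world.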
This is leaving things to chance. Chance determines whether the ML model has picked up the causal signal from the given variables or is simply using in-sample correlates of that signal to make predictions. By explicitly feeding the ML model the causal signal, you leave less to chance. Even under stringent tuning parameters, the model will prioritise the causal signal because of the strength of its predictive power. This not only reduces model complexity but also improves model performance and reliability, especially when deployed in real time.
Thus, prediction pre-processing is an important step to ensure that your ML model has the best variables to predict from. How does one find the causal signal? By forming a theory or hypothesis based on content expertise and testing it with inferential statistical techniques. Such a causal signal could be an interaction between different sets of variables, an auto-correlated effect, etc.
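To make the interaction case concrete, here is a minimal sketch (my own illustration; the data-generating process is an assumption chosen so the true signal is purely an interaction). A deliberately simple model given only the raw variables must rediscover the interaction on its own and largely fails; the same model given the hypothesised interaction as an explicit feature succeeds.

```python
# Sketch: pre-processing a hypothesised causal signal -- an interaction
# between two variables -- into an explicit feature. Illustrative data only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
x1 = rng.uniform(-1, 1, size=n)
x2 = rng.uniform(-1, 1, size=n)
y = x1 * x2                                 # the true signal is the interaction itself

X_raw = np.column_stack([x1, x2])           # kitchen-sink inputs
X_pre = np.column_stack([x1, x2, x1 * x2])  # causal signal made explicit

train, test = slice(0, 1500), slice(1500, None)

# A deliberately shallow tree: with raw inputs it must rediscover the
# interaction greedily; with the engineered feature it does not have to.
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_raw[train], y[train])
tree_pre = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_pre[train], y[train])

r2_raw = r2_score(y[test], tree_raw.predict(X_raw[test]))
r2_pre = r2_score(y[test], tree_pre.predict(X_pre[test]))
print(f"raw features R^2: {r2_raw:.3f}, with explicit interaction: {r2_pre:.3f}")
```

In practice you would first confirm the interaction with an inferential model (e.g. testing the interaction term's significance in a regression) before promoting it to a feature; here the gap between the two scores stands in for that confirmed hypothesis.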
Prediction post-processing also leads to superior use of ML models for a given level of model sophistication. I will share my thoughts on the importance of prediction post-processing soon.