Importance of Prediction Pre-Processing

While improving machine learning (ML) models and making them more sophisticated is important, a lot of value can be extracted by optimising prediction pre- and post-processing for the same level of model sophistication.

By making ML models more sophisticated, I refer to progressions such as moving from a decision tree to a random forest to gradient boosting. By prediction pre-processing, I refer to feeding the ML model causal data signals. By prediction post-processing, I refer to further aligning ML model predictions with ground realities.

The kitchen sink approach to ML models is to throw all available variables into the model to predict the dependent variable. The assumption is that the ML model will then identify unknown but (hopefully) causal patterns or signals in the given variables.
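
A minimal sketch of the kitchen sink approach, using scikit-learn on a synthetic dataset; the column names, the data-generating process, and the choice of model here are illustrative assumptions, not something from the post:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 10)),
                  columns=[f"x{i}" for i in range(10)])
# The true signal is the interaction x0 * x1; x2 is rewritten as a noisy
# in-sample correlate of that signal, the remaining columns are noise.
df["x2"] = df["x0"] * df["x1"] + rng.normal(scale=0.5, size=1000)
df["y"] = df["x0"] * df["x1"] + rng.normal(scale=0.3, size=1000)

X, y = df.drop(columns=["y"]), df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)  # throw everything in
print(model.score(X_test, y_test))                         # held-out R^2
```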

However, ML models are sensitive to how you tune them and to the data they train on. For different tuning parameters and training data, the ML model latches onto different variables that best predict the dependent variable. The difference in prediction performance across tuning parameters and training datasets may be negligible, i.e., the models look the same to an observer. Internally, however, within the ML “black box”, they may be making predictions from completely different sets of variables.
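
Continuing the synthetic sketch above, one way to see this is to refit the model on different resamples of the training data and compare which variables it leans on; the point is to check whether the same columns top the importance ranking each time, even when the held-out scores look interchangeable:

```python
for seed in (0, 1, 2):
    # Refit on a different 80% resample of the training data
    Xs = X_train.sample(frac=0.8, random_state=seed)
    ys = y_train.loc[Xs.index]
    m = GradientBoostingRegressor(random_state=seed).fit(Xs, ys)
    # Held-out score vs. the three most important variables for this fit
    top = np.argsort(m.feature_importances_)[::-1][:3]
    print(seed, round(m.score(X_test, y_test), 3), list(X.columns[top]))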

This is leaving things to chance. Chance determines whether the ML model has picked up the causal signal from the given variables or is simply using in-sample correlates of the causal signal to make predictions. By explicitly feeding the ML model the causal signal, you leave less to chance. Even under stringent tuning parameters, the ML model will prioritise the causal signal because of the strength of that signal’s predictive power. This not only reduces model complexity but also improves model performance and reliability, especially when deployed in real time.
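
In the synthetic sketch, pre-processing in this sense amounts to handing the model the causal signal (the x0 * x1 interaction) as an explicit column instead of hoping it reconstructs it from the raw inputs; the engineered column name is hypothetical:

```python
X_causal = X.assign(x0_x1=X["x0"] * X["x1"])   # explicit causal feature
Xc_train = X_causal.loc[X_train.index]
Xc_test = X_causal.loc[X_test.index]

m = GradientBoostingRegressor(random_state=0).fit(Xc_train, y_train)
print(m.score(Xc_test, y_test))
# Check whether the engineered column now dominates the importance ranking
print(sorted(zip(m.feature_importances_, X_causal.columns), reverse=True)[:3])
```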

Thus, prediction pre-processing is an important step to ensure that your ML model has the best variables to make predictions from. How does one find the causal signal? By forming a theory or hypothesis based on content expertise and testing it using inferential statistical techniques. Such a causal signal could be an interaction between different sets of variables, an auto-correlated effect, etc.
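
One possible way to run such a test, sketched here with the statsmodels formula API on the same synthetic data; the hypothesis being tested (that x0 and x1 interact) is an assumption of the example, not a prescription:

```python
import statsmodels.formula.api as smf

# "y ~ x0 * x1" expands to x0 + x1 + x0:x1; the x0:x1 row in the output is
# the coefficient and p-value for the hypothesised interaction effect.
fit = smf.ols("y ~ x0 * x1", data=df).fit()
print(fit.summary().tables[1])
```

If the interaction term is statistically and practically significant, it becomes a candidate column to feed the ML model explicitly, as in the sketch above; a lagged dependent variable would play the same role for an auto-correlated effect.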

Prediction post-processing also leads to superior use of ML models for a given level of model sophistication. I will share my thoughts on the importance of prediction post-processing soon.