Monday, 29 January 2018

The Signal and The Noise - Why so many predictions fail but some don't

Why DO so many predictions fail? What makes it so difficult for us to make accurate predictions on topics like earthquakes and climate change, or economics, politics and sports? What can we really do to ensure correct predictions, and to not mistake our capability to explain the past for an ability to predict the future?

Distinguishing the signal from the noise requires both scientific knowledge and self-knowledge.
– Nate Silver

Is it the lack of techniques that makes such predictions impossible? Maybe. We do not have ways to completely represent the true nature of relational dependencies in the world. However, we have found suitable workarounds that can help us draw a close enough picture: we now rely on deep pattern recognition to create forecasting and classification models for complex relationships. It's not the lack of tools either. We now have software capable of performing some extremely heavy number crunching on your average household laptop. And it's definitely not a lack of expertise with these tools. We have the example of Nate Silver, who accurately predicted the outcome in every single state in the 2012 US presidential election with no other tool than a spreadsheet.

Why then is it so difficult to draw accurate predictions? One of the most talked about and least understood pitfalls in data analytics is Bias. Most analytical practitioners who are willing to get their hands dirty with data cleaning try to ensure that any bias incorporated during data collection is removed (Information Bias). Data scientists use L1 and L2 regularization to balance bias against variance in the model itself (Predictive Bias). However, of all the biases, the toughest to manage is the one brought in by the data scientists themselves: Confirmation Bias.
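To make the regularization point concrete, here is a minimal sketch of how an L2 (ridge) penalty trades variance for bias in a toy one-dimensional linear regression. The data, function names and hyperparameters are all illustrative, not from the book:

```python
# Minimal sketch: L2 (ridge) regularization on a one-dimensional linear
# regression, fitted by gradient descent. A larger penalty `lam` shrinks
# the weight toward zero, deliberately accepting some bias to reduce variance.

def fit_ridge(xs, ys, lam, lr=0.01, steps=5000):
    """Fit y ~ w * x by minimizing mean squared error + lam * w**2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error term...
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # ...plus the gradient of the L2 penalty term.
        grad += 2 * lam * w
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]               # roughly y = 2x with noise

w_plain = fit_ridge(xs, ys, lam=0.0)    # ordinary least squares
w_ridge = fit_ridge(xs, ys, lam=10.0)   # heavily regularized
# The penalized weight ends up pulled toward zero relative to the plain fit.
```

An L1 penalty works the same way but adds `lam * abs(w)` instead, which tends to push small weights all the way to zero.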

Ever since I learnt about confirmation bias I’ve started seeing it everywhere.
– Jon Ronson

We are all biased in our ways. Our biases are formed based on where we live, where we grew up, our social circle, our morals, our likes and dislikes and everything else that defines us. In fact, they are also a function of time. As we grow in life and have new experiences, our perspective changes and so do our biases. These biases define what we see as a “signal” and what “noise” we block out.

When we create a model, it is vital that we put our natural biases aside. More often than not, we approach a problem with a solution already in mind, consciously or subconsciously. Eventually, all the evidence we gather for the problem seems to point to the solution we started off with. Take, for example, a college assignment to review a book or research paper. Our immediate acceptance of the correctness of the published word makes us write a positive review every time, and no number of additional sources will shake that sentiment.
[Comic. Source: Dilbert]

So then, problem solved! When we create a model, we should look purely at the data and just let it work its magic.

Or should we?

As we dive deeper into the realms of artificial intelligence, we have started realizing the importance of "Domain Knowledge". This is the knowledge specific to a topic that distinguishes it from other subjects. Some aspects are fairly straightforward: the word "flow" may refer to "cash flow" in financial models but to "water flow" in models predicting groundwater levels. Other aspects are more complex, and "Domain experts" learn them only after years of studying the subject. Understanding the subject and incorporating domain knowledge into the model matters not just for interpreting what the model implies but also for how faithfully the model represents the actual process. Human judgement plays an important role here and should never be discounted.
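The "flow" example above can be sketched as a tiny, hypothetical lookup: the same token resolves to different concepts depending on which domain the model is configured for. The table and function names are purely illustrative:

```python
# Hypothetical sketch: domain knowledge encoded as a sense-disambiguation
# table. The same token means different things in different domains.
DOMAIN_SENSES = {
    "finance": {"flow": "cash flow"},
    "hydrology": {"flow": "water flow"},
}

def interpret(token, domain):
    """Resolve a token to its domain-specific sense, falling back to the
    raw token when the domain (or the token) is unknown."""
    return DOMAIN_SENSES.get(domain, {}).get(token, token)

print(interpret("flow", "finance"))    # cash flow
print(interpret("flow", "hydrology"))  # water flow
```

In a real system this knowledge would come from domain experts and be baked into features or model structure rather than a literal dictionary, but the principle is the same: the model cannot infer these distinctions from the raw token alone.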

Now, domain knowledge is not sacrosanct either. Experts have been known to hold contradictory opinions on the same subject, which has fuelled plenty of prime-time debates. The truth remains that their opinions are governed by their own interpretations of, and biases towards, the topic. And those biased opinions will creep into your models as well.

How, then, do we create good models that are representative of the processes they are built on, and at the same time objective enough not to be biased by their creator?

Maybe the foxes and hedgehogs can offer some advice.

The first principle is that you must not fool yourself – and you are the easiest person to fool.
– Richard Feynman
