Why do so many predictions fail? What
makes it so difficult for us to make accurate predictions on topics like earthquakes
and climate change, or in economics, politics and sports? What can we really
do to ensure correct predictions, and not mistake our capability to explain the
past for an ability to predict the future?
Distinguishing the
signal from the noise requires both scientific knowledge and self-knowledge.
– Nate Silver
Is it a lack of techniques that
makes such predictions impossible? Maybe. We do not have ways to completely represent
the true nature of relational dependencies in the world. However, we have found
suitable workarounds that can help us draw a close-enough picture: we now rely
on deep pattern recognition to build forecasting and classification models for
complex relationships. It is not a lack of tools either. We now have software capable
of performing some extremely heavy number crunching on your average household
laptop. And it is certainly not a lack of expertise with these tools. We have
the example of Nate Silver, who accurately predicted the outcome in
every single state in the 2012 US presidential election with no other tool than a
spreadsheet.
Why then is it so difficult to
draw accurate predictions? One of the most talked about and least understood
pitfalls in data analytics is Bias. Most analytical practitioners who are
willing to get their hands dirty with data cleaning try to ensure that any
bias incorporated during data collection is removed (Information Bias). Data
scientists use L1 and L2 regularization to manage the bias-variance trade-off in the
model itself (Predictive Bias). However, of all the biases, the one toughest to
manage is the bias brought in by the data scientists themselves, called
Confirmation Bias.
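To make the regularization point concrete, here is a minimal sketch of L2 (ridge) regularization using only NumPy. L1 (lasso) has no closed form, so the sketch shows the ridge solution w = (XᵀX + αI)⁻¹Xᵀy; the data, variable names and the α value are illustrative assumptions, not from any particular model discussed here.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form L2-regularized regression.

    Solves (X^T X + alpha * I) w = X^T y. A larger alpha adds
    bias (coefficients shrink toward zero) but reduces variance.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Toy data: y depends only on the first feature; the second is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=50)

w_unreg = ridge_fit(X, y, alpha=0.0)   # ordinary least squares
w_reg = ridge_fit(X, y, alpha=10.0)    # regularized: coefficients shrink
```

The trade-off is visible in the fitted weights: with α = 0 the model chases the data exactly, while a larger α deliberately accepts some bias in exchange for coefficients that are more stable across samples.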
Ever since I learnt
about confirmation bias I’ve started seeing it everywhere.
– Jon Ronson
We are all biased in our ways.
Our biases are formed based on where we live, where we grew up, our social
circle, our morals, our likes and dislikes and everything else that defines us.
In fact, they are also a function of time. As we grow in life and have new
experiences, our perspective changes and so do our biases. These biases define
what we see as a “signal” and what “noise” we block out.
When we create a model, it is
vital that we put our natural biases aside. More often than not, we approach a
problem with a solution in mind, consciously or subconsciously. Eventually, all
the evidence we gather for the problem seems to point to the solution we
started off with. Take, for example, a college assignment to review a book or
research paper. Our immediate acceptance of the correctness of the published
word makes us write a positive review every time, and no number of sources that
we end up consulting will shake that sentiment.
Source: Dilbert
So then, problem solved! When we
create a model, we should look purely at the data and just let it work its
magic.
Or should we?
As we dive deeper into the realms
of artificial intelligence, we have started realizing the importance of “domain
knowledge”. This is the knowledge specific to a topic that sets it apart
from other subjects. Some aspects are fairly straightforward: the word
“flow” may refer to “cash flow” in financial models but to “water flow” in
models predicting groundwater levels. Other aspects are more complex, and
“domain experts” learn them only after years of studying the subject. Understanding the
subject and incorporating domain knowledge into the model is important not
just for interpreting what the model implies but also for how faithfully the
model represents the actual process. Human judgement plays an important role
here and should never be discounted.
Now, domain knowledge is not
sacrosanct either. Experts on various topics have been known to hold contradictory
opinions on the same subject, which has fuelled plenty of prime-time debates. The
truth remains that their opinions are governed by their interpretations of, and
biases towards, the topic. And these biased opinions will creep into your models
as well.
How then do we create good models
that are representative of the processes they are built on and at the same
time objective enough not to have been biased by their creators?
Maybe the foxes and hedgehogs can
offer some advice.
The first principle
is that you must not fool yourself – and you are the easiest person to fool.
– Richard Feynman



