Monday, 29 January 2018

The Signal and The Noise - Why so many predictions fail but some don't

Why DO so many predictions fail? What makes it so difficult for us to make accurate predictions about earthquakes, climate change, economics, politics, or sports? And what can we really do to make reliable predictions, instead of mistaking our ability to explain the past for an ability to predict the future?

Distinguishing the signal from the noise requires both scientific knowledge and self-knowledge.
– Nate Silver

Is it a lack of techniques that makes such predictions impossible? Maybe. We have no way to completely represent the true nature of the relational dependencies in the world. However, we have found suitable workarounds that can help us draw a close-enough picture: we now rely on deep pattern recognition to build forecasting and classification models for complex relationships. It is not a lack of tools either. We now have software capable of performing extremely heavy number crunching on an average household laptop. And it is certainly not a lack of expertise with those tools. We have the example of Nate Silver, who accurately predicted the outcome in every single state in the 2012 US presidential election with no other tool than a spreadsheet.

Why then is it so difficult to make accurate predictions? One of the most talked-about and least understood pitfalls in data analytics is bias. Most analytics practitioners who are willing to get their hands dirty with data cleaning try to ensure that any bias incorporated during data collection is removed (information bias). Data scientists use L1 and L2 regularization to manage the bias-variance trade-off in the model itself (predictive bias). However, of all the biases, the toughest to manage is the one brought in by the data scientists themselves: confirmation bias.
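The L2 idea mentioned above can be sketched in miniature. Here is a pure-Python illustration (the data points and penalty values are invented for the example, and the single-feature, no-intercept setup is a deliberate simplification) of how a ridge penalty deliberately biases a regression weight toward zero, trading a little bias for lower variance:

```python
# Single-feature ridge regression, no intercept. The closed-form
# solution is w = sum(x*y) / (sum(x**2) + lam): a larger penalty
# lam shrinks the fitted weight toward zero (more bias, less variance).

def ridge_weight(xs, ys, lam):
    """Fitted weight for y ~ w*x under an L2 penalty lam."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x plus noise

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_weight(xs, ys, lam), 4))
# The weight shrinks as lam grows: 1.99, then 1.9258, then 1.4925.
```

With `lam = 0` this is ordinary least squares; the penalty only matters when we are willing to accept a slightly "wrong" (biased) weight in exchange for one that fluctuates less across noisy samples.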

Ever since I learnt about confirmation bias I’ve started seeing it everywhere.
– Jon Ronson

We are all biased in our ways. Our biases are formed based on where we live, where we grew up, our social circle, our morals, our likes and dislikes and everything else that defines us. In fact, they are also a function of time. As we grow in life and have new experiences, our perspective changes and so do our biases. These biases define what we see as a “signal” and what “noise” we block out.

When we create a model, it is vital that we put our natural biases aside. More often than not, we approach a problem with a solution already in mind, consciously or subconsciously. Eventually, all the evidence we gather seems to point to the solution we started with. Take, for example, a college assignment to review a book or research paper. Our instinctive acceptance of the published word as correct makes us write a positive review every time, and no number of additional sources will shake that sentiment.
Source: Dilbert
Problem solved, then! When we create a model, we should look purely at the data and just let it work its magic.

Or should we?

As we dive deeper into the realms of artificial intelligence, we have started realizing the importance of “Domain Knowledge”. This is the knowledge specific to a topic that makes it different from another subject. Some aspects are fairly straightforward, like the word “flow” may refer to “cash flow” in financial models whereas “water flow” in models predicting ground water levels. Other aspects may be more complex, which “Domain experts” learn after years of studying the subject. Understanding the subject and incorporating the domain knowledge into the model is important not just in understanding what the model implies but also in how efficiently the model represents the actual process. Human judgement plays an important role here and should never be discounted.

Now domain knowledge is not sacrosanct either. Experts on various topics have been known to have contradictory opinions on the same subject which has led to plenty of Prime Time debates. The truth remains that their opinions are governed by their interpretation and biases towards the topic. And these biased opinions will creep into your models as well.

How, then, do we create good models that are representative of the processes they are built on, yet objective enough not to be biased by their creators?

Maybe the foxes and hedgehogs can offer some advice.

The first principle is that you must not fool yourself – and you are the easiest person to fool.
– Richard Feynman

Wednesday, 29 November 2017

Paper Review: A Few Useful Things to Know about Machine Learning – Pedro Domingos

One look at this paper and a reader begins to wonder about the true scope of Machine Learning. For people who have just started their Machine Learning journey, the content of this paper can prove quite daunting and many of the terms absolutely alien. So it is easy for them to miss the roadmap that Pedro Domingos has laid out for the Machine Learning journey.

Pedro points out that a lot of “folk knowledge” goes into building a good Machine Learning model. Knowledge of the different techniques is a good start towards being successful on this journey; however, it is only one of the skills that a Data Scientist needs to possess.

As many have come to realize (and many more will), the time spent actually doing machine learning is very little compared to the time spent gathering, integrating, cleaning and pre-processing data, and performing trial and error on feature design. The paper brings out 1) the aspects necessary to build a model, 2) common traps and pitfalls that modelers should avoid, and 3) tips on how to get the best out of a machine learning exercise.

To give a snapshot of what was covered in the paper, below is a graphical representation of the roadmap presented.


Friday, 20 October 2017

Paper Review: The Discipline of Machine Learning - Tom M. Mitchell

This paper is a must-read for anyone starting their journey into the world of Data Science and Machine Learning. It introduces the concept of Machine Learning in a very simple and crisp manner, and touches upon the key research questions surrounding the scope of machine learning algorithms and the variety of learning tasks. Back in 2006, data scientists had already made much headway in commercial applications of ML like speech recognition, computer vision, bio-surveillance, etc. (of course, we are still endeavoring to improve the performance and accuracy in these fields today).

Although the paper was written over a decade ago, I believe the ideas expressed summarize much of what is known in the field today. The paper introduces Machine Learning as a process whereby a machine learns from its experience E, and utilizes that learning to improve its performance P at carrying out a defined task T. The concept is fairly simple; the complexity of actually implementing it is a whole different question.
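Mitchell's (T, P, E) framing can be made concrete with a toy example. In the sketch below (the task, the threshold-midpoint rule, and all the numbers are invented purely for illustration), the task T is classifying numbers as “low” or “high”, the experience E is a set of labeled examples, and the performance P is accuracy on a held-out test set. Giving the learner more experience improves its performance at the task:

```python
# Toy learner for Mitchell's (T, P, E) framing.
# T: classify a number as "low" or "high".
# E: labeled (value, label) examples.
# P: accuracy on a held-out test set.
# The learner's only rule: set the decision threshold at the midpoint
# of the mean "low" value and the mean "high" value seen so far.

def learn_threshold(examples):
    """Learn a decision threshold from labeled experience E."""
    lows = [v for v, lab in examples if lab == "low"]
    highs = [v for v, lab in examples if lab == "high"]
    return (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2

def accuracy(threshold, test_set):
    """Performance P: fraction of test items classified correctly."""
    hits = sum((("high" if v >= threshold else "low") == lab)
               for v, lab in test_set)
    return hits / len(test_set)

test_set = [(2, "low"), (4, "low"), (8, "low"),
            (11, "high"), (15, "high"), (19, "high")]

small_E = [(1, "low"), (30, "high")]                  # little experience
big_E = small_E + [(6, "low"), (9, "low"),
                   (11, "high"), (12, "high")]        # more experience

print(accuracy(learn_threshold(small_E), test_set))   # 4/6 correct
print(accuracy(learn_threshold(big_E), test_set))     # 5/6 correct
```

The point is not the (deliberately crude) learning rule, but the structure: once T, P and E are pinned down, “learning” just means P measurably improves as E grows.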

Take, for example, the plethora of personal assistants entering the market recently. Google Home just hit the market to take on its competitor Amazon Echo (Alexa), and Apple HomePod (Siri) will shortly be joining the ranks. Now, what really defines the performance of a personal assistant? One of the key functions of a PA is to provide answers to a user’s queries, so to assess their performance I asked Google and Siri the same question: “What is a neural network?”

Google presented the dictionary definition, “noun – a computer system modelled on the human brain and nervous system.”

Siri took me to the Wikipedia page for Artificial Neural Networks, which starts off with: “Artificial neural networks (ANNs), a form of connectionism, are computing systems inspired by the biological neural networks that constitute animal brains.”

Which definition warrants the better performance rating? Google’s is extremely simplistic, while Siri’s leads to a detailed document. One might argue that the second adds more knowledge and is hence the better choice. Is it really, though? The two definitions are meant for different kinds of users: the wiki page might be perfect for an aspiring data scientist, while the dictionary definition is probably all that a non-technical user needs to put their curiosity to rest.

The paper touches upon numerous aspects of machine learning research and leaves a reader with various avenues to start the exploratory journey into the field.