Is Data Overhyped? Why We Need to Rethink the "New Oil" Metaphor.
The Data Disconnect: Why We Should Be Starving for Insights.
Dunnhumby was formed by husband and wife team Clive Humby and Edwina Dunn in 1989, in the kitchen of their London home.
They worked with Tesco, where they developed the Clubcard, the world's first supermarket loyalty card. This program used data analysis to personalize customer offers and improve Tesco's understanding of customer behavior.
Soon, the company became a global consumer insights business that revolutionized the use of customer data in the retail industry.
In 2006, while giving a talk at a conference, Clive Humby said this:
“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so data must be broken down, refined (analyzed) for it to have value.” –Clive Humby
The "Data is the new oil" metaphor gained traction in the 2010s, suggesting that data, like oil, needs extraction and refinement to be valuable. Wired magazine likened data in the 21st century to oil in the 18th, stating that those who learn to "extract and use it" would become wealthy.
Back in India, Reliance Jio was launched on September 5, 2016. Flip the Jio logo horizontally, and the word "oil" emerges. The design reflects Reliance's evolution from an energy giant into a digital powerhouse, now extracting value not from oil but from the world of data and connectivity.
Data & Oil: The Raw Material, and the Finished Product
The core logic behind the metaphor is that data, like crude oil, requires a systematic process of cleaning and refining to bring out its best. But the "new oil" framing is oversimplified and potentially misleading.
Abundance vs. Scarcity
Oil is finite, scarce, and geographically concentrated. Data can be abundant, easily generated, and widely accessible.
Unlike oil, data can be a by-product of social interactions mediated by digital technologies, captured and processed by a third party.
Data's abundance often diminishes its per-unit value compared to oil. But data has other advantages that set it apart from oil in some crucial ways.
A Non-rivalrous, Non-depleting, and Progenitive Resource
Data can be used by multiple people simultaneously (non-rivalrous) without diminishing its value. If you share a dataset with someone, you don't lose access to it yourself.
Data doesn't get "used up" when consumed. In fact, analyzing and using data often leads to more data being generated, creating a virtuous cycle of information growth.
Two years back, during a trip to Lucknow, I was chatting with Pratyush Mittal, founder of screener.in, India's biggest DIY investment data platform.
I was telling him about the data platform I was building at the time, and he said:
“Data has a compounding effect”
When data is analyzed, shared, and combined with other data, it can lead to new insights, discoveries, and even more data creation.
Value Creation and Extraction
Oil requires significant investment and effort for extraction and refinement, and data is no different. The key difference is that data requires human ingenuity, skills, and tools for analysis and interpretation to unlock the right value. And this human involvement brings a few problems of its own.
Correlation vs. Causation: Impressive but Useless Correlations
In 2012, a study published in the New England Journal of Medicine found a strong correlation between a country's chocolate consumption per capita and the number of Nobel laureates per 10 million people in that country.
Chocolate, particularly dark chocolate, contains flavonoids, which are antioxidants that have been linked to improved cognitive function. It's been suggested that flavonoids may play a role in boosting brainpower.
But, is this correct?
A country's investment in education and research, its socioeconomic conditions, wealth, and cultural factors can also play a significant role in producing Nobel laureates.
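The chocolate–Nobel pattern is a classic confounder story, and it is easy to reproduce in miniature. Below is a sketch with purely synthetic, invented data: a hidden "wealth" variable drives both chocolate consumption and laureate counts, producing a strong correlation between two quantities that never influence each other.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
# Hidden confounder: national wealth (arbitrary units, invented).
wealth = [random.uniform(10, 100) for _ in range(200)]
# Chocolate consumption and laureate counts each depend on wealth
# plus independent noise -- neither causes the other.
chocolate = [0.10 * w + random.gauss(0, 1.0) for w in wealth]
laureates = [0.05 * w + random.gauss(0, 0.5) for w in wealth]

r = pearson(chocolate, laureates)
print(f"correlation: {r:.2f}")  # strong, despite zero causal link
```

The correlation comes out well above 0.8 here, yet removing the wealth variable removes the relationship entirely. That is the trap: correlation measures co-movement, not mechanism.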
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” -Jim Barksdale
Data Trap: Why More Data Doesn't Always Mean Better Results
In the past, collecting large amounts of data was a laborious and time-consuming process. It often involved manual surveys, record-keeping, and limited tools for gathering information. Storing large datasets was also challenging; add to that the cost of analysis, which required specialised skills.
The dramatic decrease in prices for disk space, memory, processing power, and network bandwidth has brought once-costly techniques into the reach of ordinary businesses.
Businesses started thinking, “Whoever has the best algorithms and the most data wins.”
This thinking is leading us into the early age of data overload, or the GIGO problem.
Businesses often collect and store vast amounts of data without a clear strategy for its use, making it difficult to manage, process, and extract meaningful insights from. Even the most powerful algorithms will produce meaningless results if the input data is flawed, biased, or incomplete: "Garbage In, Garbage Out".
Data needs to be relevant and aligned with the specific problem you're trying to solve. Let me bring in a cricketing analogy. While talking to Sidharth Monga, Rohit Sharma said,
“…'Am I breaking a stone to break a stone? Or am I breaking a stone to build a building?' The purpose behind breaking a stone is important.”
Order and Disorder: Exploring Information Entropy
Claude is an AI chatbot, and the name of the family of large language models (LLMs) behind it, developed by Anthropic. It is named after Claude Shannon (1916–2001), an American mathematician and electrical engineer recognised as the "Father of the Information Age."
Shannon joined Bell Laboratories in 1941. He contributed to various projects, including cryptography and fire control systems during World War II. His wartime work on cryptography influenced his later work on information theory.
Information entropy, a concept with roots in thermodynamics, is a cornerstone of Shannon's information theory. It quantifies the uncertainty or randomness associated with a random variable, providing a measure of surprise, or the amount of information gained from observing an outcome.
Data's value lies not in its quantity but in its relevance to the issue at hand. Data that does not reduce uncertainty about the problem is essentially useless, regardless of how much there is.
When entropy is high, it signifies a lack of clear patterns, predictability, or structure, making it difficult to extract meaningful insights and make informed decisions.
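Shannon's formula makes this concrete: H = −Σ p·log₂(p), measured in bits. A minimal sketch comparing a uniform distribution (maximum unpredictability) with a skewed one (clear structure):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four outcomes, all equally likely: maximum uncertainty.
uniform = [0.25, 0.25, 0.25, 0.25]
# Same four outcomes, but one dominates: far more predictable.
skewed = [0.85, 0.05, 0.05, 0.05]

print(f"uniform: {entropy(uniform):.2f} bits")  # 2.00
print(f"skewed:  {entropy(skewed):.2f} bits")   # ~0.85
```

Lower entropy means each observation is less surprising, which is exactly what structure and predictability look like numerically.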
Irrelevant Data Increases Entropy:
Noise vs. Signal: More data often means more noise, irrelevant information that obscures the valuable signals you're looking for. This increases entropy, making it harder to extract meaningful insights.
Example: Imagine analyzing customer reviews. Collecting more reviews that are unrelated to the product or service (e.g., spam, off-topic rants) increases the noise and makes it harder to identify patterns in genuine feedback.
High Entropy Can Hinder Decision-Making:
Uncertainty and Complexity: High entropy means greater uncertainty and complexity. This can make it difficult to interpret data, identify patterns, and make informed decisions, even with powerful analytical tools.
Example: If you're trying to predict customer churn, having a massive dataset with numerous variables (many of which are irrelevant) can make it harder to build an accurate predictive model.
Entropy, in short, is about the importance of relevant data.
While reading the book “Ask Measure Learn” by Lutz Finger and Soumitra Dutta, I came across an interesting pointer:
“…it is not wise to use all data, but it is necessary to carefully select the data sets that have a high causal impact. This process is also commonly referred to as feature selection or regularization.”
This is further illustrated as:
Causation: Is this variable in a causal relationship with the outcome in question?
Error: How easily and cleanly can you measure this variable?
Cost: How available is the data?
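A crude way to act on this checklist is to screen candidate features against the outcome before modelling. The sketch below uses correlation as a cheap stand-in for the causation check (with the caveat from the chocolate–Nobel example: correlation alone can be fooled by confounders). All feature names and data here are invented for illustration, in the spirit of the churn example.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(7)
n = 300
# Synthetic churn-style outcome driven by one real signal.
signal = [random.gauss(0, 1) for _ in range(n)]
outcome = [2 * s + random.gauss(0, 1) for s in signal]

features = {
    "support_tickets": signal,                                    # genuinely related
    "login_gap_days": [s + random.gauss(0, 2) for s in signal],   # weakly related
    "favourite_colour": [random.gauss(0, 1) for _ in range(n)],   # pure noise
}

# Keep only features whose |correlation| with the outcome clears
# a threshold -- a toy version of feature selection.
selected = {name: round(pearson(vals, outcome), 2)
            for name, vals in features.items()
            if abs(pearson(vals, outcome)) > 0.3}
print(selected)
```

The noise feature drops out, keeping the dataset small, relevant, and lower-entropy before any model sees it. Real pipelines add the error and cost checks on top: a strongly causal variable you cannot measure cleanly or obtain cheaply may still not be worth collecting.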
Data Disconnect: Why We Should Be Starving for Insights
I am in the business of helping businesses reduce fraud in the digital space, and have had quite some success in doing so. Let me bring in that lens as a practical use case.
Focus on What Matters.
You have to reduce entropy by increasing the "signal-to-noise ratio" – more useful information, less irrelevant stuff. Instead of collecting every possible piece of information about your customers, focus on the specific data points that are known to be helpful in predicting Fraud or bad behaviour.
Let’s assume you are a digital platform (Fintech offering digital loan, eGame platform giving bonus/promo). What is the MOST common thing a prospective BAD user will do?
He/She will buy a new phone number, and/or create a new email ID to take advantage of the offers.
Now, I know so many players who, in trying to build a 360-degree view of the customer, collect all sorts of data in their attempt to reduce fraud.
Yet something far simpler, the digital history of a phone and email with respect to a name, can do the magic. We call these PEN (Phone + Email + Name) linkages, combining the client's data with ours.
Everything else from that 360-view slide is noise when you are solving for fraud. But when you are solving for affordability (income), propensity, or cross-sell, such data is quite useful.
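To make the PEN idea tangible, here is a toy sketch (all records, identifiers, and scoring rules are invented for illustration, not our production logic): track which names a phone or email has historically been linked to, then flag a new application whose identifiers are either unseen or tied to many different names.

```python
from collections import defaultdict

# Hypothetical historical PEN (Phone + Email + Name) records.
history = [
    ("9000000001", "ravi@example.com",   "Ravi K"),
    ("9000000001", "ravi@example.com",   "Ravi K"),
    ("9000000002", "fresh1@example.com", "A Singh"),
    ("9000000002", "fresh2@example.com", "B Gupta"),
    ("9000000002", "fresh3@example.com", "C Verma"),
]

phone_names = defaultdict(set)
email_names = defaultdict(set)
for phone, email, name in history:
    phone_names[phone].add(name)
    email_names[email].add(name)

def pen_risk(phone, email, name):
    """Crude risk score: identifiers never seen with this name, and
    identifiers linked to many distinct names, both raise suspicion."""
    score = 0
    if name not in phone_names.get(phone, set()):
        score += 1  # phone never linked to this name
    if name not in email_names.get(email, set()):
        score += 1  # email never linked to this name
    # Name churn: a phone that has carried many identities is suspect.
    score += max(0, len(phone_names.get(phone, set())) - 1)
    return score

print(pen_risk("9000000001", "ravi@example.com", "Ravi K"))  # stable identity: 0
print(pen_risk("9000000002", "new@example.com", "D Mehta"))  # churny phone: 4
```

Three fields, one lookup, and the signal-to-noise ratio is already far better than a hundred loosely relevant attributes.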
We should always start with null hypothesis as a starting point for investigation. It helps us be rigorous and avoid drawing conclusions without sufficient evidence.
The Null Hypothesis: The simplest explanation is usually the best one, even if it does not prove the hypothesis you want to prove. It's often called the "boring" hypothesis because it's the statement that nothing interesting is happening.
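One lightweight way to put the null hypothesis to work is a permutation test: assume the "boring" explanation (the group labels don't matter), shuffle the labels many times, and see how often chance alone reproduces the effect you observed. The numbers below are invented for illustration.

```python
import random

random.seed(0)
# Hypothetical A/B metric, e.g. per-user conversion in two variants.
group_a = [random.gauss(0.10, 0.05) for _ in range(100)]
group_b = [random.gauss(0.15, 0.05) for _ in range(100)]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(group_b) - mean(group_a)

# Null hypothesis: labels are interchangeable. Shuffle them
# repeatedly and count how often chance alone produces a gap at
# least as large as the observed one.
pooled = group_a + group_b
trials = 2000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[100:]) - mean(pooled[:100])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / trials
print(f"observed diff: {observed:.3f}, p-value: {p_value:.4f}")
```

A tiny p-value means the boring explanation struggles to account for the data; a large one means you have no evidence of anything interesting, no matter how much data you collected.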
In the end, success is about making business sense of the data for your use cases, not about using particular data, algorithms, or tools.
Investment Perspective:
Data is a powerful resource, but it's not a magic solution. The true value of data lies in our ability to extract meaningful insights and apply them to real-world problems, like solving for fraud or looking into macro data.
“Forget the evidence. Evidence is a list of the material you’ve got. What about the things you haven’t found?” ―Terry Hayes, I Am Pilgrim
Data Deluge vs. Insights Drought: Organisations often find themselves facing a data paradox where they are drowning in data (data deluge) but are unable to derive meaningful insights from it. This is because simply accumulating more data does not automatically translate into better understanding or decision-making.
DIA: Data, Insight, and Action must all happen before value creation can occur. Data is used by people (or systems) to produce an insight, the insight informs an action, and the action results in a valuable outcome. Better processes and products create new value, not the data itself.
The Value of Observation: Ultimately, success in the data-driven world is not about blindly chasing the latest trends or tools, but about understanding the unique characteristics of data, recognizing its limitations, and using it strategically to achieve our goals.
Foundation for Forecasting: Before making any predictions about the future, it's essential to have a clear and accurate perception of the present.
Let me conclude the core of my argument with a self-quote!
"If, in the digital age, data is the crude oil that fuels progress, then analytics is the engine that transforms it into motion, driving innovation and powering the predictions."