After learning to code and build simple CRUD apps, learning to create value from data is a useful skill to add to one's tool belt. This skill set lets you go beyond CRUD applications to apps that have intelligence and provide higher layers of value.
Alongside brand and financial equity, many online businesses create valuable, often unrealized, data opportunities. It could, for example, be argued that the real value of Amazon lies in its having collated the largest crowdsourced database of reviews alongside unparalleled insights into consumer habits. This data may actually be as valuable as its core business. Think about that: could the implicit data your product generates daily be worth more than what pays the bills? Are you sitting on a hidden data play?
Pieter Levels has been generating some very nice correlation plots from data at NomadList, and has built an amazing data resource. This week he asked an amazing question:
Being the rare kind of person who gets really excited by stats, I wanted to put together an answer to this question that goes beyond the limitations of a short response. I think this is useful: first, because of the sincerity of the questioner, and second, because the answer may also be useful to fellow IndieHackers interested in the essence of statistical reasoning.
People who aren't trained in a field have a habit of asking brilliantly naive questions. Brilliant in the sense that someone trained in the field becomes systematically incapable of asking such a direct question. I remember in the first week of a psychology class, a student asked a psychoanalyst why they don't just give people with problems prescriptive advice. If someone overspends on shopping (or drinks excessively), instead of digging into their childhood, teach them how to budget and become disciplined, right? It was a fabulous question that in many ways got at the philosophical heart of the matter, but a question which only someone untrained in psychoanalysis would dare to ask. Pieter's question has this revelatory quality: it's simple and direct but touches the heart of what statistics is about (shamefully, some stats grads couldn't ask a question as good as this).
The point of statistics.
The main use of statistics is to learn things about the world! This can get lost in the sea of techniques and procedures. Statistics is basically an extension of logic. We use logic all the time to teach us things, for example:
1.) I observe that people like to be treated with respect.
2.) I observe that people who treat others respectfully are liked and respected.
3.) I infer that by treating people with respect, I too can become popular.
4.) I build on this inference to conclude that treating workers with respect is the most efficient way to run a company.
Statistics extends logic further by reasoning with numeric observations. Why reason with numbers? Simply, raw numbers have a way of describing things we are unable to intuit directly. For example, how much do you think you spend on coffee or takeout meals in a month? When you actually add this up, the total will likely surprise you. The same goes if you ask people how much time they spend on social media or surfing IndieHackers. Numbers give us truths about the world that we can't see directly.
Remarkably, numerical observations may even tell us that the world works differently from how it appears to work (or from how most people think it works). Traditional baseball scouts value a player based on the athleticism and speed of their swing, yet deep statistical data shows a whole range of other qualities, unseen by the eye, that produce value in a player: a fat, unaesthetic player can have hidden value which the traditional scouts can't see and which the baseball transfer market severely undervalues. The book Moneyball centres around this tension and is a powerful example of how statistical insights can be used for profit and success by those brave enough to apply their unintuitive insights.
Correlation !== causation
If data can provide new unintuitive insights, the first step is to plot things out. We can plot one variable against another, for example belly circumference against annual number of romantic dates. We are interested not in the data before us for its own sake, but in what we can infer and conclude from it about how the world works.
When curious people seeking to understand the world discover interesting correlations, the first thing others will scream, rightly, is that correlation, even when strong, doesn't prove causation. This is a correct statement, but not a useful one. The analyst, in this case Pieter, wants to draw conclusions about the world from data, so it sucks to have this cliché parroted. It would probably be more useful, and certainly equally correct, to say: correlation can suggest causation, but you need more work to be sure! This is also closer to the truth, because entire branches of economics, medicine and sociology depend on the possibility of making solid inferences from observational data.
If you look at any newspaper, almost every week there is a new story linking health to some kind of behavior. Let's look at some examples (links included):
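Alongside the plot, the strength of a relationship like this is usually summarized with a correlation coefficient. Here is a minimal sketch of Pearson's r in plain Python. The numbers are invented purely for illustration (they are not real measurements), and the names pearson_r, belly_cm and dates_per_year are my own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads, always in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented illustrative data, NOT real observations.
belly_cm = [80, 85, 90, 95, 100, 105, 110]
dates_per_year = [14, 12, 13, 9, 8, 6, 5]

r = pearson_r(belly_cm, dates_per_year)
print(f"r = {r:.2f}")  # strongly negative for these made-up numbers
```

A value near -1 or +1 means the points fall close to a straight line; a value near 0 means no linear relationship. It says nothing yet about why the two variables move together.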
1.) Older people with grandchildren live longer than those without.
2.) Time watching television correlates with death.
3.) Eating fish reduces heart disease.
These examples illustrate why we need more than correlation to conclude a causative relationship. TV is correlated with death, but this doesn't mean TV causes death directly. Rather, watching TV is correlated with low physical activity, which is the more plausible causative factor. There may be other causative factors linked to high TV consumption, such as poor diet, lower social class or manual labour.
This type of hidden variable, or confounder, also arises in example 1. It is quite possible that having lots of children (and therefore lots of grandchildren) is an index of health, so having grandchildren here is simply acting as a proxy for health (the real confounder), and not, as reported, as a cause of longevity.
As such, when we spot an interesting correlation, the next question should be: is there a confounding (hidden) variable that explains this relationship? With large datasets, we can automate this question and ask an algorithm whether any of the other variables on file can explain the correlation. The technique that addresses this is multiple regression: multiple linear regression when the outcome is a continuous number (such as life expectancy), and multiple logistic regression when the outcome is yes/no (such as died or survived). So with an apparently strong correlation between a behavior (looking after grandkids) and a health outcome (life expectancy), we would use multiple regression to exclude social class, wealth, activity, sex and preexisting medical conditions/risks as potential confounders. Indeed, any observational study submitted to a respectable journal would be expected to undertake this type of analysis looking for hidden confounders.
If our correlation of interest cannot be explained away by a confounder or other variable after regression analysis, we may be on to something. Yet there are other things we can do to increase our confidence in a possible causative relationship. Can we observe this same relationship in an independent data-set? Can we observe this same relationship using temporal data? Also, and most importantly, does the relationship make common sense? That is, can we explain how the linked factor leads to the observed effect? Affirmative answers to these questions increase the probability of a causative relationship.
Sometimes this process can lead to establishing new causative relationships: learning something new about the world. This was the case with smoking, where the initial observation of a strong correlation between smoking and cancer led to subsequent confirmation of a causative effect. Indeed, what we've summarized above is, in essence, what is called the Bradford Hill criteria for establishing causality from correlational data (indeed, a simple link to these criteria would be the brief answer to Pieter's question!).
Sometimes, even after decades, it still isn't possible to confirm causation. In the case of fish and heart disease, many large experiments with omega-3 oil supplements have been unable to directly establish a causative relationship. This may mean that some substance in fish other than omega-3 oils is the causative factor, or perhaps that a diet high in fish is indicative of people who take care of themselves (that damn confounder again).
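To make the idea of a confounder concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the data is synthetic, the variable names are mine, and the world is deliberately built so that physical activity drives both TV hours and a risk score, while TV has no direct effect on risk at all.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activity, tv_hours, risk = [], [], []
for _ in range(5000):
    a = rng.uniform(0, 10)                             # hidden variable: weekly activity
    activity.append(a)
    tv_hours.append(30 - 2 * a + rng.gauss(0, 3))      # less activity -> more TV
    risk.append(50 - 3 * a + rng.gauss(0, 3))          # less activity -> more risk; TV not involved

# Across everyone, TV and risk look strongly linked.
overall_r = pearson_r(tv_hours, risk)

# Among people with roughly the same activity level, the link mostly disappears.
band = [i for i, a in enumerate(activity) if 4.5 <= a <= 5.5]
band_r = pearson_r([tv_hours[i] for i in band], [risk[i] for i in band])

print(f"all people:          r = {overall_r:.2f}")
print(f"similar activity:    r = {band_r:.2f}")
```

Comparing like with like (here, people in a narrow activity band) is the intuition behind "controlling for" a variable: if the correlation collapses once the suspected confounder is held roughly constant, the confounder was doing the work.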
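Here is a minimal sketch of what that adjustment looks like in code. Because the outcome in this toy example is a continuous score, it uses multiple linear regression (ordinary least squares) rather than logistic regression. The data is synthetic, the helper names solve and ols are my own, and the world is built so that TV has no direct effect on risk. In practice you would likely reach for a statistics library rather than hand-rolling the fit; this is only meant to show the mechanics.

```python
import random

def solve(a, b):
    """Solve a small linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(columns, y):
    """Least-squares fit of y on the predictor columns; returns [intercept, slope1, ...]."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: activity drives both TV hours and risk; TV has no direct effect.
rng = random.Random(1)
activity, tv_hours, risk = [], [], []
for _ in range(5000):
    a = rng.uniform(0, 10)
    activity.append(a)
    tv_hours.append(30 - 2 * a + rng.gauss(0, 3))
    risk.append(50 - 3 * a + rng.gauss(0, 3))

_, naive_tv_slope = ols([tv_hours], risk)                               # TV alone
_, adjusted_tv_slope, activity_slope = ols([tv_hours, activity], risk)  # TV + confounder

print(f"TV slope, unadjusted:        {naive_tv_slope:.2f}")
print(f"TV slope, adjusted:          {adjusted_tv_slope:.2f}")
print(f"activity slope, adjusted:    {activity_slope:.2f}")
```

The unadjusted slope suggests every extra hour of TV raises risk; once activity is in the model, the TV slope shrinks toward zero and the effect lands on activity, where it was put. Note that regression can only adjust for confounders you have actually measured.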
So when Pieter observes an interesting and rather strong correlation like:
The questions to ask are:
1.) Can we rule out obvious confounders (GDP, education, population size) using multiple regression? (Pieter has the data to do this.) The elephant (confounder) in the room here is house prices! We could absolutely use multiple regression to interrogate whether this strong correlation can be explained away by economic strength and house prices.
2.) Can we detect this same relationship in an independent data-set?
3.) Can we get historical data to increase our confidence in the potential causal link? For example, if we take any one city and look over the past ten years, does the relationship between VC capital and rent hold over time? A strong correlation, over several geographically dispersed cities, that holds over time, and that remains after correcting for obvious confounders and when using independent data-sets, provides strong evidence in favor of a causal link.
4.) Does the correlation make logical sense? Can we think of a mechanistic link? In this case, yes, this absolutely makes sense. Even without stats, Bay Area newspapers are full of opinion pieces blaming unaffordable housing on Silicon Valley.
For simplicity, we haven't mentioned issues like data quality, non-linear relationships or how study types affect inference. Yet for the type of study Pieter is undertaking (observational), it is absolutely possible to make solid causative inferences, but only after further work. The most beginner-friendly book for teaching multiple regression I've come across is Andy Field's Discovering Statistics using IBM SPSS (this is one of the best stats books I've come across, period).
Apologies for a longish technical article, but these principles are generally applicable to many questions commonly asked on IH, e.g. how can I tell which landing page design is better, what pricing strategy maximizes revenue, etc.
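As a rough sketch of question 3, here is how one might check whether a relationship holds within each city over time. The data is entirely synthetic (three hypothetical cities, ten years each, with rent constructed to depend on VC funding), so it shows the mechanics of the check and not a real finding. The names pearson_r and yearly_changes are my own.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def yearly_changes(series):
    """Year-over-year differences, which strip out a shared upward trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

rng = random.Random(2)
results = {}
for city in ["city_a", "city_b", "city_c"]:           # hypothetical cities
    vc, rent = [], []
    level = rng.uniform(10, 50)                       # made-up starting VC level
    for _year in range(10):
        level += rng.gauss(5, 4)                      # VC drifts upward with noise
        vc.append(level)
        rent.append(1000 + 8 * level + rng.gauss(0, 5))  # rent built to follow VC
    results[city] = (
        pearson_r(vc, rent),                                   # correlation of levels
        pearson_r(yearly_changes(vc), yearly_changes(rent)),   # correlation of changes
    )

for city, (r_levels, r_changes) in results.items():
    print(f"{city}: levels r = {r_levels:.2f}, yearly changes r = {r_changes:.2f}")
```

The second number is the stricter test. Any two series that both rise over a decade will correlate in levels, so checking whether the year-to-year changes also move together guards against mistaking a shared trend for a link.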