I started working on ANDRE’s grandpa back in 2019, when my insights practice was providing support to international consumer product brands. The idea was to reduce the time it takes to produce data visualizations from survey data, an incredibly time-consuming activity. The new process would extract all combinations of variables, and we would then only have to select the meaningful ones.
We moved from a selection approach, in which each slide would have to be imagined and produced, to a deselection approach, in which the analyst flips through charts and deletes the ones that do not seem useful.
The new approach seems like a small improvement, but it’s fundamental. The human brain is much better at deselection; it’s just the way our neurons have been wired through evolution. And, just like that, our deselection approach reduced chart production time by over 60%.
Still, we were getting tired of sifting through tons of charts, so we decided to prioritize them algorithmically based on how statistically informative they appeared. We started with correlations between variables, which had to be neither too low nor too high, in an attempt to capture interesting middle-ground patterns rather than noise or redundancy. It was an extremely crude heuristic, but it helped.
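To make that first heuristic concrete, here is a minimal sketch in Python. The thresholds and function name are illustrative assumptions, not the values we actually used:

```python
import pandas as pd
from itertools import combinations

def rank_variable_pairs(df, low=0.2, high=0.8):
    """Keep variable pairs whose correlation falls in a 'middle' band:
    too low looks like noise, too high like redundancy.
    The band limits here are illustrative, not ANDRE's actual values."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr().abs()
    scored = []
    for a, b in combinations(numeric.columns, 2):
        r = corr.loc[a, b]
        if low <= r <= high:
            scored.append((a, b, r))
    # Highest correlation within the band first
    return sorted(scored, key=lambda t: -t[2])
```

Charts built from the top-ranked pairs would then be shown first during deselection.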
We kept iterating with various algorithms, looking at entropy gains, various centrality and co-occurrence methods to get better predictions, and we did get to improve the variable selection model. But statistics don’t care much about meaning, and we had to make up for this by hard-coding a list of keywords that our algorithm would check on top of stats to make sure we wouldn’t miss out on things like NPS, purchase intent, or overall satisfaction.
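That keyword safeguard can be sketched as a simple post-processing step; the keyword list and matching logic below are illustrative placeholders:

```python
# Hypothetical list of business-critical keywords; the article names NPS,
# purchase intent, and overall satisfaction as examples.
MUST_KEEP_KEYWORDS = ["nps", "purchase intent", "satisfaction"]

def force_include(variables, selected, keywords=MUST_KEEP_KEYWORDS):
    """Add back any variable whose name matches a business keyword,
    even if the statistical ranking dropped it."""
    selected = list(selected)
    for var in variables:
        name = var.lower()
        if var not in selected and any(k in name for k in keywords):
            selected.append(var)
    return selected
```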
I shared the tool, a simple R/Shiny app called Insights Casa, with a couple of friends, and their feedback was encouraging, so we decided to learn our way through building a SaaS product. Because R is slow, not easily scalable, and most AI code is written in Python, we rewrote the entire code base in Next.js with a Python back-end. Now we had the foundation for a SaaS-able solution.
The product was decent, and we renamed it Chartomat, as its main promise was to produce charts. It might have been a commercial success without the advent of a newcomer in late 2022. ChatGPT changed everything; who would want charts when they can get a full analysis?
By mid-2023, we had added descriptive text above each chart by passing the chart data to GPT-3 with simple prompts. The observations made by GenAI were quite relevant, so we kept optimizing the prompts.
However, the fundamental problem remained: we were selecting data on statistical grounds rather than based on its meaning. We would have to find a way to combine GenAI with our own data processing scripts to get more meaningful suggestions about the data.
We looked at Code Interpreter, the first iteration of what is now called Advanced Data Analysis by OpenAI, and we were quite underwhelmed when we realized that the add-on’s default parameters would only read the first 5 rows of a dataset before making suggestions about the data.
The suggestions provided by the model made sense, however, and they looked more interesting than our statistics-based approach. We had to take a step back and ask: do data analysts, when starting on a new file they know nothing about, begin with advanced statistics? Certainly not!
Metadata is the real starting point of any data analysis project. An analyst would see the file name, type, and size without even thinking about it. This alone, together with their job title, business, and team, provides most of the context necessary to produce a relevant piece of data analysis.
An analyst would typically start by peeking at the data with df.head(), then get a general orientation from the variable names and their content with df.dtypes and value_counts(). So meaning isn’t extracted from the data itself, but from its metadata in the light of a particular context, e.g. to whom and why an analysis is to be delivered.
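Those opening moves can be scripted directly. A minimal sketch with pandas (the function name and the `max_levels` cutoff for "likely categorical" columns are assumptions):

```python
import pandas as pd

def first_look(df, max_levels=10):
    """The analyst's opening moves, scripted: sample rows, dtypes,
    and value counts for low-cardinality (likely categorical) columns."""
    return {
        "head": df.head(),      # what do the rows look like?
        "dtypes": df.dtypes,    # numeric, text, dates?
        "value_counts": {
            col: df[col].value_counts()
            for col in df.columns
            if df[col].nunique() <= max_levels
        },
    }
```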
Once the general context is understood, an analyst would focus on a handful of variables that look interesting, based on their business objectives and the dataset at hand. That’s the life of senior data analysts, really: they look at data in the light of their current knowledge, formulate new hypotheses, test them, and keep only the ones that add value to a logically structured report that serves their customers. Scattered insights are patched together into data stories that are organized into a cohesive whole. Data science is no rocket science, though, and these heuristics we call experience can be scripted.
You would think that prompt engineering is all it takes to extract meaning from data, but it’s not that easy…
First, if you were to use GenAI alone, you would need to feed large amounts of data into the LLM, quickly exceeding its context window. A lot of research is going on in RAG, encoding, and vectorization, and we should expect breakthroughs at some point, but current solutions look cumbersome and are quite token-intensive.
This is not necessarily needed, though; what you really need is the right balance between a statistics-based data flow and a GenAI-based data flow. For every data point, you can decide whether to process it with stats, with AI, or with a mix of both. We implemented the latter: providing the right statistics to GenAI role-playing a senior analyst produces the best output.
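A toy sketch of what such a mixed flow could look like; `llm_comment` stands in for the actual model call, and everything here is illustrative rather than our production pipeline:

```python
def analyse(summaries, llm_comment):
    """Mixed pipeline sketch: compute statistics mechanically, then hand
    only the summary statistics to a GenAI 'senior analyst' for meaning.
    `llm_comment` is a stand-in for a real chat-completion call."""
    results = []
    for s in summaries:
        stats = {"mean": s["mean"], "n": s["n"]}  # trivial, deterministic work
        results.append({
            "variable": s["name"],
            "stats": stats,
            "comment": llm_comment(stats),        # the reasoning step
        })
    return results
```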
Then there is consistency: LLM outputs must be checked and stripped of formatting errors and hallucinations, most likely by another LLM request so that errors can be fixed when they are caught. Another challenge is that a request that works today may fail tomorrow because of inevitable model changes, deliberate or not, so the pipeline needs to be resilient to them. We wouldn’t pretend that our model is 100% antifragile, but at least we are aware of the challenge.
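A minimal sketch of that check-and-repair loop, assuming a generic `call_llm` callable rather than any particular provider’s API:

```python
import json

def robust_json(call_llm, prompt, retries=2):
    """Ask for JSON, validate it, and on failure send the error back to
    the model for a fix-up pass. `call_llm` is a placeholder for any
    text-in, text-out completion call."""
    text = call_llm(prompt)
    for _ in range(retries):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            # Second request: ask the model to repair its own output
            text = call_llm(f"Fix this so it is valid JSON ({err}):\n{text}")
    raise ValueError("LLM output could not be repaired into valid JSON")
```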
Finally, and this is very important, once you start working with black-magic types of AI models, you must think about data security; nobody wants their survey data spilled to competitors. This is another reason why we don’t pass data as-is to GenAI, but only representations of summary data. Basically, we scramble the summary data in such a way that GenAI can reverse-engineer neither the summary nor the raw data, yet can still recommend what action to take next (e.g. select these variables to provide supporting evidence for a particular data story).
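To make the idea concrete, here is a toy version of such scrambling; the tokenization and rescaling scheme are illustrative assumptions, not our actual method:

```python
def scramble(summary):
    """Replace real labels with opaque tokens and pass values only in
    relative form, keeping the mapping locally so that the model's
    recommendations can be translated back. A minimal sketch only."""
    top = max(summary.values())
    mapping, masked = {}, {}
    for i, (label, value) in enumerate(summary.items()):
        token = f"var_{i}"
        mapping[token] = label              # stays on our side, never sent out
        masked[token] = round(value / top, 2)  # relative magnitude, not raw data
    return masked, mapping
```

If the model then suggests "dig into var_0", the local mapping translates that back into the real variable name without the label ever leaving the server.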
We implemented this mixed solution, processing data with advanced statistical and ML algorithms as well as GenAI, because we think it best replicates the iterative process of analysts: running an analysis, checking its output, and deciding what to do next. In my opinion, this is the sweet spot, where the machine reasons where a human would reason and mechanically executes trivial tasks like computing statistics, just as existing analytic software already does.
We would love to hear your feedback if you end up testing ANDRE; the solution is in active development and still error-prone, but it offers a free plan. We wouldn’t mind your vote on ProductHunt on May 19th if you like the project!
ANDRE is a Survey Data Analysis Automation product; it extracts data stories from raw data and summarizes them in executive reports.