Enable self-serve analytics with AI models

People have long wanted to ask questions about data in plain language. The development of SQL (Structured Query Language) was itself a significant step toward making data more accessible to non-technical users. Yet even today it is still uncommon for business users to run ad-hoc queries themselves. The dominant approach remains asking a familiar data analyst to run the corresponding query against the database.

However, with recent advancements in AI, particularly in Large Language Models (LLMs), self-serve analytics is closer than ever.

AI Models and Related Tools Are Evolving Rapidly

Big tech companies are pouring billions of dollars into AI. This year alone, Microsoft, Alphabet, Meta, and Amazon spent $189 billion on capital investment, each allocating billions to AI infrastructure [1]. Goldman Sachs projects that these companies together will invest about $1 trillion in AI over the next 5 years [2].

Given the scale of these investments, AI models should become more powerful and more affordable in the coming years.

Besides these companies, a vast community is building an ecosystem around AI models, making it possible to apply AI to self-service data analytics already today! We will look at some of these tools later in the article.

AI Models in Data Analytics

AI models' ability to query data has advanced greatly since the emergence of ChatGPT.

A new benchmark for text-to-SQL recently arrived — BIRD-bench. Unlike previous benchmarks, it consists of real-world data with all the issues real-world data has, such as invalid and missing values. It contains more than 12,000 questions across 37 professional domains, making it a robust assessment of AI's ability to query data.
Another great thing about this benchmark is that even humans were not able to answer all the questions, averaging 92.96% correct answers.

Below is the best performance achieved by AI models on this benchmark, with the share of correct answers improving from 40% in January 2023 to over 70% by August 2024. If this trend continues, we may see models reach human-level performance within a year!

[Image: best AI model performance on BIRD-bench over time]

What you can have in practice today

Let's see AI in action by asking it questions about the data. For demonstration purposes, the IMDb dataset is used [3].

The overall experience feels natural, and such an application can be a nice addition to predefined BI reports, reducing time to insight.

Technical Details

The application above uses the GPT-4o model, few-shot prompting, and data stored in AWS S3. AWS Athena serves as the query engine, and Chainlit [4] powers the web application. LangChain [5] provides the AI agent capabilities, such as querying the database or executing code generated by the model.

For each question:

  1. Form the relevant context to be used in the prompt
  2. Send a request with that context to OpenAI to generate a SQL query
  3. Execute the query with SQLAlchemy [6]
  4. Compose the answer from the SQL query results
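Below is a minimal sketch of this loop. To keep it self-contained, it calls the OpenAI SDK and SQLAlchemy directly rather than the LangChain agent used in the demo, and the connection string, schema text, and table names are placeholders, not the actual setup.

```python
# Minimal sketch of the question -> SQL -> answer loop.
# Assumes OPENAI_API_KEY is set; schema text, connection string,
# and table names are placeholders rather than the demo's real setup.
from openai import OpenAI
from sqlalchemy import create_engine, text

client = OpenAI()
engine = create_engine("awsathena+rest://...")  # any SQLAlchemy-supported engine works

SCHEMA_CONTEXT = """
Table title_basics(tconst TEXT, primary_title TEXT, start_year INT, genres TEXT)
Table title_ratings(tconst TEXT, average_rating FLOAT, num_votes INT)
"""

def generate_sql(question: str) -> str:
    """Steps 1-2: build the prompt with relevant context and ask the model for SQL."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a data analyst. Return only the SQL query, "
                        "with no markdown or explanations.\n" + SCHEMA_CONTEXT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

def answer(question: str) -> str:
    """Steps 3-4: execute the generated query and summarise the results."""
    sql = generate_sql(question)
    with engine.connect() as conn:
        rows = conn.execute(text(sql)).fetchall()
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the user's question using the query results."},
            {"role": "user", "content": f"Question: {question}\nSQL: {sql}\nResults: {rows}"},
        ],
    )
    return summary.choices[0].message.content

print(answer("What are the top 5 highest-rated movies released after 2015?"))
```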

How It Works

Large Language Models (LLMs) are at the heart of AI. They are trained on vast volumes of data; they take text as input and produce text as output.

But LLMs know nothing about your company's data. For the model to know your data, you have two options:

  1. Train the model with your data (which is expensive)
  2. Leverage RAG (Retrieval-Augmented Generation) technique to dynamically provide relevant data to the model

RAG is a technique that enriches the user's question with relevant information from your knowledge base (such as a data lake, data warehouse, or Confluence) before feeding it to the model.
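As a small illustration of the idea, the sketch below pulls matching table descriptions and business definitions from a metadata store and prepends them to the question. The metadata dictionaries, helper names, and keyword matching are hypothetical and only stand in for a real retrieval step.

```python
# Illustrative RAG step: enrich the question with relevant knowledge
# before calling the model. The metadata below is hypothetical.
TABLE_DOCS = {
    "title_ratings": "IMDb ratings per title: average_rating, num_votes.",
    "title_basics": "Core title attributes: primary_title, start_year, genres.",
}
BUSINESS_GLOSSARY = {
    "popular movie": "A title with num_votes >= 10000.",
}

def retrieve_context(question: str) -> str:
    """Naive keyword retrieval: keep only entries that the question mentions."""
    q = question.lower()
    tables = [doc for name, doc in TABLE_DOCS.items()
              if any(word.rstrip("s") in q for word in name.split("_"))]  # crude stemming
    terms = [f"{term}: {definition}" for term, definition in BUSINESS_GLOSSARY.items() if term in q]
    return "\n".join(tables + terms)

def build_prompt(question: str) -> str:
    """The enriched prompt that gets sent to the LLM."""
    return f"Relevant knowledge:\n{retrieve_context(question)}\n\nQuestion: {question}"

print(build_prompt("What is the average rating of popular movies?"))
```

In production, keyword matching would typically be replaced by semantic search over a vector store, as discussed later in the article.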

[Image: RAG — enriching the user's question with relevant knowledge before sending it to the model]

Modern LLMs have context windows of 128K tokens or even more. 128K tokens is approximately 100K words (around 200 pages of text), which should be enough room for extra knowledge.
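If you want to verify that the extra knowledge actually fits, you can count tokens before sending the request. The sketch below uses the tiktoken library, which is an assumption here rather than part of the demo stack, and falls back to a generic encoding if the model name is unknown.

```python
# Rough check that the extra context fits into the model's window.
# tiktoken is an assumption; it is not part of the demo stack.
import tiktoken

CONTEXT_WINDOW = 128_000  # tokens, typical for current GPT-4-class models

def fits_in_window(prompt: str, model: str = "gpt-4o") -> bool:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # generic fallback encoding
    return len(enc.encode(prompt)) <= CONTEXT_WINDOW

print(fits_in_window("The question plus all the schema text you plan to send"))
```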

[Image: context window sizes of modern LLMs]

With relevant data from the database, the model can provide informed answers, as demonstrated in the demo above.

Adding Agent capabilities to the application

You can think of AI models as the brain, taking text as input and producing text as output. Writing SQL queries is an excellent use case for LLMs, but to execute these queries, the model needs "hands" and "tools".

To perform more complex tasks, the AI model requires additional capabilities, such as planning actions, executing each step, and delivering a complete response. Together, these capabilities make up an AI Agent (or Assistant).

[Image: AI Agent — an LLM combined with planning and tools]

Querying a database is a relatively simple example of an AI Agent. You can go beyond it if you want: for example, building forecasts for the future in addition to reporting historical data, or building charts in addition to text. In such a case, you need to develop functions for forecasting and drawing charts. You then instruct the model how and when to use the provided functions, and the model will leverage them when it seems appropriate, according to the provided instructions.

Similarly, you can add any desired capabilities: just develop the functions and instruct the model to use them.
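As an illustration of how such an instruction looks with the OpenAI API, the sketch below registers a hypothetical forecasting function as a tool; the function and its parameters are made up, and the model decides whether and how to call it.

```python
# Sketch of exposing a custom capability ("tool") to the model.
# forecast_metric and its parameters are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def forecast_metric(metric: str, horizon_months: int) -> dict:
    """Placeholder for your real forecasting logic."""
    return {"metric": metric, "horizon_months": horizon_months, "forecast": [100, 105, 110]}

tools = [{
    "type": "function",
    "function": {
        "name": "forecast_metric",
        "description": "Forecast a business metric for the next N months.",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "horizon_months": {"type": "integer"},
            },
            "required": ["metric", "horizon_months"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Forecast monthly revenue for the next 3 months."}],
    tools=tools,
)

# If the model decided to use the tool, execute it with the arguments it produced.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(forecast_metric(**args))
```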

Challenges you will face when building an AI assistant for self-serve analytics

It's essential to have a solid foundation first, before you can apply AI on top. If humans struggle to understand how to use the data, and there are multiple sources of truth, AI will at best repeat one of them. Proper data models, available metadata, and field descriptions are a must to leverage the full potential of AI.

Below are some challenges you can face when using AI to answer questions about the data.

Model hallucinates about the schema

The model needs to know your data schema, including tables, columns, and data types. However, even with the schema available, the model still has a small chance of hallucinating table or column names.
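One possible mitigation, not used in the demo, is to validate the generated query against the known schema before executing it. The sketch below is only a crude illustration using a regular expression; a real implementation would use a proper SQL parser.

```python
# Crude guard: reject generated SQL that references tables not in the schema.
# A proper implementation would parse the SQL; this is only an illustration.
import re

KNOWN_TABLES = {"title_basics", "title_ratings"}

def references_only_known_tables(sql: str) -> bool:
    referenced = set(re.findall(r"(?:from|join)\s+([a-zA-Z_][\w.]*)", sql, flags=re.IGNORECASE))
    return referenced.issubset(KNOWN_TABLES)

print(references_only_known_tables(
    "SELECT * FROM title_ratings JOIN imaginary_table USING (tconst)"
))  # False: 'imaginary_table' is hallucinated
```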

Model needs to know what the data looks like

Providing the schema is a must, but it's not enough. For numeric values, dates, and Booleans, it is obvious what data is stored there. But for text columns it is not evident, especially when you have plenty of categorical values that the model needs to be aware of, or when date fields are stored as strings.
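One way to give the model this awareness is to sample distinct values from text columns and include them in the context. A minimal sketch, assuming a SQLAlchemy engine and placeholder table and column names:

```python
# Sample categorical values so the model knows what text columns contain.
# Table and column names are placeholders; `engine` is a SQLAlchemy engine.
from sqlalchemy import text

def sample_column_values(engine, table: str, column: str, limit: int = 20) -> list[str]:
    query = text(f"SELECT DISTINCT {column} FROM {table} LIMIT {limit}")
    with engine.connect() as conn:
        return [row[0] for row in conn.execute(query)]

# Example: include genre values in the prompt so the model filters on real categories.
# genres = sample_column_values(engine, "title_basics", "genres")
# context += f"\nColumn title_basics.genres takes values like: {', '.join(genres)}"
```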

Model does not know your company's ways of working

In any working environment there are plenty of nuances that everyone knows and no one mentions. For example, that for any metric you calculate, you only consider active records. Or that when calculating requests per month, only approved ones should be counted. Or you use abbreviations and jargon that are specific to your company. Yet the model does not know these details out of the box, and the data might not reflect them either.
Another source of potential errors is using ambiguous generic terms. For example, you want to know the share of productive time engineers spent per quarter. But what exactly is productive time in your company? There is a good chance you will ask 10 people and receive 10 different answers.

Model struggles with a particular SQL dialect

There are plenty of tools that speak SQL, and plenty of SQL dialects, each with its own features. Some dialects have certain functions, others don't, and in others the same function has a different name. Since the model has seen various dialects during training, it may confuse them and use non-existent functions or the wrong function names. For example, MySQL computes day differences with DATEDIFF, while Athena (Trino) uses date_diff with an explicit unit.
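Two common mitigations are naming the target dialect explicitly in the prompt and transpiling the generated query. The sketch below uses the sqlglot library, which is an assumption here and not part of the demo stack, to rewrite a query into the target dialect.

```python
# Normalise generated SQL into the target dialect.
# sqlglot is an assumption; it is not used in the demo application.
import sqlglot

generated = "SELECT DATEDIFF(release_date, '2020-01-01') FROM title_basics"  # MySQL-style

# Transpile into Trino/Athena syntax so dialect-specific functions are rewritten.
converted = sqlglot.transpile(generated, read="mysql", write="trino")[0]
print(converted)
```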

Implementing RAG for thousands of tables

The more tables you have, the more sophisticated the RAG technique needs to be. A brute-force approach that utilizes the model's entire context window is inefficient for several reasons:

  1. LLM providers such as OpenAI charge per number of input and output tokens, so the more data you send in the request, the more you pay
  2. When most of the data you send to the LLM is irrelevant, it's easier for the model to get confused about what is important and what is noise, leading to incorrect answers
  3. All of the metadata may exceed the context window size, causing parts of the metadata required to answer the question to be dropped
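A common alternative is to embed each table's description once, embed the incoming question, and send only the top-matching tables to the model. The sketch below illustrates this with OpenAI embeddings and cosine similarity; the embedding model name and table descriptions are assumptions, not the demo's actual setup, and in practice the vectors would live in a vector database rather than in memory.

```python
# Retrieve only the most relevant table schemas instead of sending everything.
# The embedding model and table descriptions below are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

TABLE_DOCS = {
    "title_basics": "Movie titles: primary_title, start_year, genres.",
    "title_ratings": "Ratings per title: average_rating, num_votes.",
    "name_basics": "People: primary_name, birth_year, primary_profession.",
    # ...thousands more in a real warehouse
}

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Embed table descriptions once (in practice, store these in a vector database).
table_names = list(TABLE_DOCS)
table_vectors = embed(list(TABLE_DOCS.values()))

def top_tables(question: str, k: int = 3) -> list[str]:
    """Return the k tables whose descriptions are most similar to the question."""
    q = embed([question])[0]
    scores = table_vectors @ q / (np.linalg.norm(table_vectors, axis=1) * np.linalg.norm(q))
    return [table_names[i] for i in np.argsort(scores)[::-1][:k]]

print(top_tables("Which actors appear in the highest-rated movies?"))
```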

Conclusion

Answering questions using your data is a great use case for AI models. It could be the missing block to enable self-service analytics in your company, removing barriers to data insights for non-technical users.

Currently, AI can answer easy and medium-level questions, though more complex queries will require further model fine-tuning.

Remember that foundational work needs to be done beforehand, and various challenges need to be addressed. However, the benefits of self-serve analytics will outweigh the effort involved.

References

  1. Here's a closer look at how much Big Tech is spend...
  2. Goldman Sachs. “GEN AI: TOO MUCH SPEND, TOO LITTLE BENEFIT?”
  3. IMDb Non-Commercial Datasets
  4. Chainlit
  5. LangChain
  6. SQLAlchemy

Written by

Vlad Suvorov

Senior Data Engineer

Vlad is a seasoned Senior Data Engineer specialising in big data solutions. His diverse background equips him to approach projects from multiple angles, effectively bridging the gap between business requirements and technical execution. With several years of leadership experience, Vlad is adept at prioritising tasks and driving projects to successful completion. He is passionate about leveraging big data to empower organisations with data-driven insights and decision-making.