Inside LLMs: Understanding the Technology Behind AI Chatbots

This article explains what happens behind the scenes when you send a prompt to an LLM and it responds with an answer. We will discuss how LLMs are created and how they work under the hood, covering the transformer architecture, attention mechanisms, context windows, and hallucinations.

Please note that this article covers these topics only at a surface level. It should still help you get more out of LLMs, and along the way we will learn some neat tricks for getting more accurate and useful responses.

Author

Adnan Khan

Senior Data Engineer

A Quick History

LLMs are fundamentally probabilistic systems: they generate the most statistically plausible continuation of text, which does not guarantee factual accuracy. They are the culmination of decades of research in AI, yet they represent only one small subset of the wider field.

Artificial intelligence is the broad term that encapsulates the ability of a machine to simulate human intelligence.

Machine learning is a subset of AI in which systems learn patterns from data rather than following hand-written rules, e.g., learning to distinguish dogs from hippos.

Deep learning is a subset of machine learning that uses layered neural networks loosely inspired by biological neurons. The specific way those layers are arranged is called an architecture; Convolutional Neural Networks (CNNs) and Transformers are examples of deep learning architectures.

Today's LLMs grew out of the transformer architecture, applied specifically to text and natural language, e.g., the Generative Pre-trained Transformer (GPT). Newer architectures have since evolved to address some of the limitations of early models. For the purposes of this article, we will only discuss GPTs.

Under the Hood

The Problem

Let’s start with an oversimplification of the problem: given a sequence of words, predict the next word. We solve this with a neural network that we train on text data. Say we train a small neural network (100,000 parameters) on a small book of poems containing merely 5,000 words. Given a line from the book, the model will be able to predict the next word fairly accurately. Now let’s extrapolate. Imagine a large neural network (150 billion parameters) trained on all the text available on the internet. Et voilà, we have a model that can predict the next word for almost any sentence. Will it be perfect? Not really, since several different words can often follow a sequence. But it will become good at selecting a word that is syntactically and semantically appropriate.

Now that we can predict one word, we can just feed the extended sequence back into the model and predict another word, and so on. In other words, we can now generate text, not only a single word.
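
To make the predict-and-feed-back loop concrete, here is a minimal sketch. It assumes a toy "model" that simply counts which word follows which in a one-line poem; the helper names (train_bigram, predict_next, generate) are made up for illustration, and a real LLM replaces the counting with a neural network.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


def train_bigram(text: str) -> Dict[str, Counter]:
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    follows: Dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(model: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


def generate(model: Dict[str, Counter], prompt: str, max_new_words: int = 5) -> str:
    """Predict one word, append it, and feed the longer sequence back in."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        nxt = predict_next(model, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


poem = "the sun sets and the moon rises and the stars appear"
model = train_bigram(poem)
print(generate(model, "the sun", max_new_words=3))  # prints "the sun sets and the"
```

This toy only looks at the last word, whereas an LLM considers the whole sequence, but the generation loop has the same shape.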

Creativity and Hallucinations

Another important detail is that we don’t always have to pick the single most likely word. We can instead sample from, say, the five most likely words at each step (often called top-k sampling). There is a good chance that all of them are semantically and syntactically acceptable, so we get more varied output from the model. By changing how many candidates we sample from, we can make the model more deterministic or more creative.

The second dial for creativity is temperature. A low temperature makes responses more deterministic, and a high temperature makes them more varied. Without going into too much detail, temperature works by dividing the logits by the temperature value before a function such as softmax turns them into a probability distribution. Logits are the raw scores the model assigns to each candidate token, indicating how confident it is that the token comes next.

As you can imagine, the more creative a model gets, the more likely it is to make up facts. These fabrications are called hallucinations. They can occur even at deterministic settings, but a general rule of thumb is to use more deterministic settings when strict factual accuracy matters, e.g., technical writing or data extraction.
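
As a rough sketch of sampling from the most likely candidates, the snippet below keeps the k highest-probability words and picks one at random in proportion to its probability. The probabilities and the function name sample_top_k are invented for illustration; a real model produces a distribution over its whole vocabulary.

```python
import random
from typing import Dict


def sample_top_k(probabilities: Dict[str, float], k: int, rng: random.Random) -> str:
    """Keep the k most likely tokens, then sample one in proportion to its probability."""
    top = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices normalises the weights, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-word probabilities after "The cat sat on the"
next_word_probs = {"mat": 0.50, "sofa": 0.20, "floor": 0.15, "roof": 0.10, "moon": 0.05}

rng = random.Random(0)
print(sample_top_k(next_word_probs, k=1, rng=rng))  # k=1 always picks "mat"
print([sample_top_k(next_word_probs, k=3, rng=rng) for _ in range(5)])
```

With k=1 the output never varies; raising k lets less likely words such as "sofa" or "floor" appear.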
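
The effect of temperature is easiest to see on a few numbers. The sketch below assumes three made-up logits and divides them by the temperature before applying softmax; the function name softmax_with_temperature is illustrative, not taken from any particular library.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Scale logits by 1/temperature, then convert them to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all of the probability lands on the top-scoring token; at a high temperature the distribution flattens, so sampling picks the other tokens more often.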

Token Embeddings

An important detail we have yet to cover is tokens, so here’s a brief summary. LLMs can't read raw text, so they split it into smaller chunks called tokens, which may be whole words or pieces of words. Each token is then converted into a long list of numbers (a vector) called a token embedding. These embeddings are the numerical representations the model uses to capture the meaning, context, and relationships of words.
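
The text-to-tokens-to-vectors pipeline can be sketched in a few lines. This is a simplification under stated assumptions: it splits on whitespace (production tokenizers use subword schemes such as byte-pair encoding), and the embedding table is filled with random numbers, whereas a real model learns those values during training. All function names are hypothetical.

```python
import random
from typing import Dict, List


def build_vocab(corpus: str) -> Dict[str, int]:
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab: Dict[str, int] = {}
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Turn text into a list of token IDs (raises KeyError for unknown words)."""
    return [vocab[word] for word in text.lower().split()]


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """One vector per token ID; random here, learned in a real model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


vocab = build_vocab("the dog ran across the field")
token_ids = tokenize("the dog ran", vocab)
table = make_embedding_table(len(vocab), dim=4)
vectors = [table[token_id] for token_id in token_ids]

print(token_ids)   # prints [0, 1, 2]
print(vectors[0])  # the 4-number vector standing in for "the"
```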

Attention

Until now, we’ve only covered the generative part of Generative Pre-trained Transformer (GPT). Let’s put the pre-trained part on hold for a minute and talk about the Transformer architecture. Its main strength, and the reason it works so well, is its ability to focus on the parts of the input that matter most, similar to how humans read. This is called an attention mechanism.

As you may recall, a transformer is built from layers of neural networks stacked on top of each other. The attention mechanism is implemented as a layer inside the transformer. This layer computes how strongly each token should attend to other tokens in the sequence. In principle, you can compute the importance of a word relative to the words before it, the words after it, or every other word in the sentence. Because the tokens attend to other tokens in the same sequence, this is called self-attention. GPT-style models, which generate text left to right, let each token attend only to itself and the tokens before it (causal self-attention), while encoder models such as BERT attend to every token in the sentence.

As an example, when processing a sentence like "The dog ran across the field," self-attention allows the model to understand the relationship between "dog" and "field" even though they are not adjacent words. This helps the model better understand the context.
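
Here is a minimal sketch of self-attention in plain Python. It makes two loudly stated simplifications: the token vectors are tiny made-up values, and they are used directly as queries, keys, and values, whereas a real transformer first passes them through learned projection matrices and runs several attention heads in parallel. It also applies the causal mask used by GPT-style models, so each token looks only at itself and earlier tokens.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def causal_self_attention(vectors: List[List[float]]) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (attention weights, output vectors) for a sequence of token vectors."""
    dim = len(vectors[0])
    all_weights, outputs = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: itself and earlier tokens only
        scores = [dot(query, key) / math.sqrt(dim) for key in visible]
        weights = softmax(scores)
        # Each output is a weighted average of the visible token vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        all_weights.append(weights)
        outputs.append(mixed)
    return all_weights, outputs


tokens = ["dog", "ran", "field"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2-dimensional embeddings
weights, outputs = causal_self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to the tokens it can see; the row for "field" includes a weight for "dog" even though the two words are not adjacent.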

A point to note is that the attention mechanism runs both during training and when the model processes your prompt. The weights learned in the training phase are stored in the model’s parameters, which act as its permanent memory and do not change while you chat with it. This is what lets it predict a relevant next word. The attention scores computed over your prompt, by contrast, exist only for the current request and cover only the tokens the model can currently see. That set of tokens is the model’s short-term or working memory, also known as the context window. Attention over the context window is what allows the model to focus on the most relevant details, map long-range relationships, and process context.

Context Window

The context window is the maximum amount of information, measured in tokens, that a model can actively consider at one time when generating a response. It usually contains the prompt, the conversation history, and any attached files. When it is full, older information is discarded or “forgotten” to make room for new input. But why is there a limit at all? With a large enough context window, we would never need to discard old conversations. The limit exists because the cost of the self-attention calculation scales quadratically with the context window size: doubling the number of tokens roughly quadruples the work. Larger windows need more and more GPU compute and VRAM, which makes running LLMs economically unreasonable beyond a certain size. With limited compute resources, responses also take longer as the context window grows.

But that’s not the only problem. LLMs struggle to recall and use information buried in the middle of a long prompt; information at the start and end tends to be prioritised. This is known as the “lost in the middle” problem, and it is usually attributed to the attention mechanism and to biases in the training data. As a result, response accuracy can suffer as the context grows, which is why a bigger context window is not always better. You might argue that this mirrors human behaviour.

To get fast and accurate responses, we have to limit how much we put into the context window.
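
A quick back-of-the-envelope calculation shows what quadratic scaling means in practice. The sketch assumes the simplest case, where every token is scored against every token in the window; the function name is illustrative, and real systems use optimisations that lower the constant factors but not the overall trend.

```python
def attention_scores(context_tokens: int) -> int:
    """Number of token-to-token scores when every token is compared with every token."""
    return context_tokens * context_tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {attention_scores(n):>14,} pairwise scores")

# Doubling the window quadruples the number of scores.
print(attention_scores(2_000) // attention_scores(1_000))  # prints 4
```

Going from 1,000 to 100,000 tokens multiplies the context by 100 but the number of scores by 10,000.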

RAG, Context Management & Task Structuring

Retrieval-Augmented Generation (RAG) is a framework that connects Large Language Models (LLMs) to external knowledge bases. It was created to overcome two limitations of standalone LLMs: because they are trained on fixed, static datasets, they quickly become outdated, and they have no access to private data. RAG also helps work around context window limits by searching external databases and retrieving only the most relevant text chunks to include in the prompt.

Context management techniques help make the most of a limited context window. One approach is to have the model or a secondary agent summarise older conversation logs and then replace the raw history with that summary, which frees up space. Sliding windows and deliberate forgetting are other techniques, in which we intentionally drop older prompts to clear the context window and keep responses accurate.

Task structuring is the practice of giving the LLM context progressively rather than loading all the information upfront. Worker agents are another good way to handle complex, monolithic documents: break the document into smaller, manageable chunks, assign a worker agent to process each chunk individually, then use a "Manager Agent" to combine the summaries. Google has published a framework called Chain-of-Agents (CoA) to tackle long-context tasks.
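
The retrieval step can be sketched without any external services. The example below is a toy under stated assumptions: it ranks chunks by word overlap (cosine similarity over word counts), whereas production RAG systems typically compare embedding vectors stored in a vector database. The knowledge base, the prompt wording, and all function names are invented for illustration.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    numerator = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], top_n: int = 1) -> List[str]:
    """Return the top_n chunks most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query, bag_of_words(c)), reverse=True)
    return ranked[:top_n]


def build_prompt(question: str, chunks: List[str], top_n: int = 1) -> str:
    """Put only the retrieved chunks, not the whole knowledge base, into the prompt."""
    context = "\n".join(retrieve(question, chunks, top_n))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


knowledge_base = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]
print(build_prompt("How long do refunds take?", knowledge_base))
```

Only the refund policy ends up in the prompt, so the context window holds one relevant sentence instead of the entire knowledge base.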
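
A sliding window combined with summarisation might look like the sketch below. It assumes one token per whitespace-separated word as a rough stand-in for a real tokenizer, and naive_summary is a placeholder for what would normally be a call to a model. Note that this simple version does not count the summary itself against the budget.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Rough stand-in: one token per whitespace-separated word.
    return len(text.split())


def sliding_window(messages: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


def compact_history(messages: List[str], max_tokens: int,
                    summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace messages that fall outside the window with a single summary."""
    recent = sliding_window(messages, max_tokens)
    older = messages[: len(messages) - len(recent)]
    if not older:
        return recent
    return [summarize(older)] + recent


def naive_summary(messages: List[str]) -> str:
    # Placeholder for a model call: keeps the first three words of each message.
    return "Summary: " + " | ".join(" ".join(m.split()[:3]) for m in messages)


history = [
    "user: my order number is 1234 and it arrived damaged",
    "assistant: sorry to hear that, I can arrange a replacement",
    "user: yes please send a replacement",
    "assistant: done, it ships tomorrow",
]
for line in compact_history(history, max_tokens=12, summarize=naive_summary):
    print(line)
```

The two most recent messages are kept verbatim, and the two older ones collapse into one short summary line.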
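
The worker and manager pattern described above can be sketched as a simple split, process, and combine pipeline. This illustrates the general idea only and is not the Chain-of-Agents algorithm itself; the worker and manager here are trivial placeholder functions standing in for calls to a model.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Break a long document into chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def summarize_document(text: str, chunk_words: int,
                       worker: Callable[[str], str],
                       manager: Callable[[List[str]], str]) -> str:
    """Each worker handles one chunk; the manager combines their outputs."""
    chunks = split_into_chunks(text, chunk_words)
    partial_summaries = [worker(chunk) for chunk in chunks]
    return manager(partial_summaries)


def first_two_words(chunk: str) -> str:
    # Placeholder worker: a real one would ask a model to summarise the chunk.
    return " ".join(chunk.split()[:2])


def join_summaries(summaries: List[str]) -> str:
    # Placeholder manager: a real one would ask a model to merge the summaries.
    return " / ".join(summaries)


document = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
print(summarize_document(document, chunk_words=5,
                         worker=first_two_words, manager=join_summaries))
# prints "alpha beta / zeta eta"
```

Because each worker only ever sees its own chunk, no single call has to fit the whole document into its context window.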

Wrap Up

Hopefully, this article has helped you understand some of the underlying technology that makes LLMs work and the limitations that arise from it. Mastering LLMs comes down to how you guide them: to get the best results, ditch conversational chatting in favour of precision, context, and structure.

Here are some of the most effective tips and tricks to level up your prompting:

Persona: Give the LLM a specific identity to set the tone and knowledge base. Example: "Act as a financial analyst with 10 years of experience."
Context & Constraints: Tell it who the audience is, what format you need (e.g., a bullet list, a 500-word essay, a Python script), and any rules it must follow.
Few-Shot Prompting: Give the AI 1-2 examples of inputs and desired outputs. It will usually follow the pattern closely.
Chain of Thought (CoT): Ask the LLM to "think step by step" before giving a final answer. This can reduce hallucinations and logical errors, especially on multi-step problems.
Role-play for Feedback: Ask the model to "critique" your work or use a setup like: "I want you to debate me on [topic]. I will state my view, and you will counter it."
Progressive Context: Instead of asking for a whole project at once, ask for an outline, then refine the outline, and generate the content section by section.
Iterative Refinement: If the output misses the mark, don't start a new chat. Tell it exactly what to fix (e.g., "Make this more concise and change the tone to be less formal").
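
Several of the tips above (persona, constraints, few-shot examples, and chain of thought) can be combined in a reusable prompt template. The sketch below is one possible layout, not a required format; the function name, section labels, and example content are all invented for illustration.

```python
from typing import List, Tuple


def build_prompt(persona: str, task: str, constraints: List[str],
                 examples: List[Tuple[str, str]]) -> str:
    """Assemble a structured prompt from persona, constraints, examples and task."""
    lines = [f"Act as {persona}.", "", "Constraints:"]
    lines += [f"- {constraint}" for constraint in constraints]
    if examples:
        lines += ["", "Examples:"]
        for example_input, example_output in examples:
            lines += [f"Input: {example_input}", f"Output: {example_output}"]
    lines += ["", f"Task: {task}", "Think step by step before giving the final answer."]
    return "\n".join(lines)


prompt = build_prompt(
    persona="a financial analyst with 10 years of experience",
    task="Classify the sentiment of: 'Margins narrowed despite higher sales.'",
    constraints=["Audience: non-specialist managers", "Format: one word, then one sentence of reasoning"],
    examples=[
        ("Revenue grew 12% year on year.", "Positive"),
        ("The company issued a profit warning.", "Negative"),
    ],
)
print(prompt)
```

Keeping the pieces as separate arguments makes it easy to refine one part, such as the constraints, without rewriting the whole prompt.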

Building on the foundations we’ve explored in this article, we’ll learn why these tips and tricks work in the upcoming article “LLMs: The way forward”.

