How Neural Networks & LLMs Work (Beginner-Friendly Guide)
Have you ever wondered what is happening in the background when you submit a prompt to an LLM like ChatGPT or Claude? In this introductory guide, I would like to give you a high-level overview of how AI and - more specifically - LLMs work, and how they can give the illusion of being intelligent.
I will not get into the precise math behind everything I say; this is meant to be a high-level overview. I may write an article in the future where I go more in depth.
Introduction
When talking about AI, I am not referring to Machine Learning in the general sense. In the context of this article, I am referring specifically to neural networks and other forms of intelligent systems derived from them (like LLMs).
A neural network, at a very high level, is basically a way of approximating an arbitrary multi-dimensional function. The goal is to approximate a function such that, given some number of inputs, it produces some number of outputs.
For example, we might want to find a way of determining the price of a house from its size in square meters and the number of rooms. To make the prediction more accurate, we could add additional inputs like the age of the building or a score for the location.
We don’t know what the function is exactly, but we do know that there must exist some function which maps all of these metrics to a somewhat accurate price for the home. That is, given these metrics, there exists a price which is the best possible prediction for that property.
Now, this is important: neural networks work very well for this kind of problem, because we don’t really have an accurate way of deriving such a function by hand, but we do have a lot of data: many homes for which we know the size, number of rooms, and so on, as well as the price.
We will get into how we can approximate the function from just data soon.
LLMs
Now, LLMs are really similar in the sense that they also try to approximate a function. Basically, you encode words (or, more specifically, tokens) as numbers. Then, given a sequence of words (which could, for example, be your prompt), the LLM outputs a series of numbers which represent a probability distribution over which token might come next. Can you see the pattern? It’s just a function. Input, output. You give it a series of words, and it tells you how likely each word it knows is to come next.
That’s right, LLMs just predict the next word. The thing is, they’re extraordinarily good at doing this, which makes it look like they are actually thinking. It seems as if they were really reasoning and forming coherent thoughts. But they’re not actually thinking (at least not in the human sense). They are just predicting one word at a time. It more closely resembles a sophisticated pattern-completion system.
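If you like code, here is a tiny toy sketch of that loop. The “model” below is just a bigram table built from a made-up sentence (it only looks at the previous token, unlike a real LLM), but the shape is the same: a function from context to a probability distribution over the next token, sampled one token at a time.

```python
import random
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model is trained on vast amounts of text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Train" a toy bigram model: count which token follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_token_distribution(prev_token):
    """Return {token: probability} for the token that might come next."""
    counts = follows[prev_token]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Generation loop: predict one token at a time and append it to the context.
random.seed(0)
tokens = ["the"]
for _ in range(6):
    dist = next_token_distribution(tokens[-1])
    choices, probs = zip(*dist.items())
    tokens.append(random.choices(choices, weights=probs)[0])

print(" ".join(tokens))  # prints a plausible-looking continuation, one sampled token at a time
```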
A More In-Depth Look at Neural Networks
When talking about neural networks, I will be referring to FFNNs (feed-forward neural networks), which are the simplest and most basic kind of neural network. Everything else is built upon this simple idea, and a lot can be done with just a simple FFNN.
A neural network is an abstract mathematical structure which organizes weights and biases into so-called layers.
Look at this figure.
Each circle is referred to as a neuron. As you can see, the neurons are organized in columns, with the neurons of each column stacked vertically. This way of representing neurons is chosen deliberately: each column represents a layer.
Also, from the figure you can see that each neuron of a given layer is connected with a line to each neuron of the next layer. These lines represent the weights.
We will get into what this all means in a second. The number of neurons in each layer and the number of layers were chosen arbitrarily for this example; in a real-world application the choice would depend on the problem we are trying to solve with the neural network. In this case, we have four input neurons and one output neuron. Looking at the neural network from left to right, the four input neurons are the ones in the first layer. The output neuron is the neuron in the last layer (the one on the far right).
This aligns with the example of predicting property prices I wrote about before. We have four inputs. These could represent the size of the home in square meters, the number of rooms, the age of the building, and a score for the location going, say, from 0 to 10.
We also have two layers in between the input and output layers. These are referred to as the hidden layers.
On a very high level, the idea is that we give the neural network its inputs in the input layer neurons, and the values are propagated until the end of the network (the output layer), from which we can read the output(s).
The value associated with each neuron is referred to as an activation value. The activation of the $i$-th neuron in the $l$-th layer is denoted by the symbol $a^{(l)}_i$. Note that neurons are conventionally numbered from top to bottom and layers from left to right. I will start the numbering at $1$, but some people prefer to start at $0$.
The weights - which, remember, are the lines connecting the neurons - are denoted using the following notation. To refer to the weight connecting the $j$-th neuron in the $(l-1)$-th layer with the $i$-th neuron in the $l$-th layer, you use the symbol $w^{(l)}_{ij}$.
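If the indexing feels abstract, it may help to see how these symbols typically map onto plain arrays in code. The layer sizes below are made up (only the four inputs and the single output match the example); the point is just that $w^{(l)}_{ij}$ and $b^{(l)}_i$ become ordinary matrix and vector entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes matching the example network: 4 inputs, two hidden layers
# (their sizes, 5 and 3, are made up here) and 1 output. Layers are numbered 1 to 4.
sizes = {1: 4, 2: 5, 3: 3, 4: 1}

# W[l][i-1][j-1] holds the weight w^(l)_ij connecting neuron j of layer l-1
# to neuron i of layer l (arrays are 0-indexed, while the notation starts at 1).
W = {l: rng.standard_normal((sizes[l], sizes[l - 1])) for l in (2, 3, 4)}
# b[l][i-1] holds the bias b^(l)_i of neuron i in layer l.
b = {l: np.zeros(sizes[l]) for l in (2, 3, 4)}

print(W[2].shape)  # (5, 4): each of the 5 neurons in layer 2 has 4 incoming weights
```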
Getting From Input to Output
Intuition
Basically, each weight represents the “importance” of a value. Each activation is multiplied by a weight, and together with the other weighted activations this forms the output. A higher value for the weight means that that activation value is “more important”. A lower value means it is “less important”. A weight of $0$ means that the value of that activation is irrelevant.
Do not worry if you are unable to understand the mathematical process immediately, it is rather abstract and the indexing can be quite confusing. Focus on getting a general picture.
Mathematical Explanation
We are given the four values for the input neurons. We want to find the value of the output neuron. How do we do that? We assume that we are already given the neural network, that is, we are given the weights of the network.
We have to proceed layer by layer. The activation value for the first neuron of the second layer is simply the weighted sum of all the activation values of the previous layer, each multiplied by its weight going to the first neuron in the second layer.
That would be

$$a^{(2)}_1 = w^{(2)}_{11} a^{(1)}_1 + w^{(2)}_{12} a^{(1)}_2 + w^{(2)}_{13} a^{(1)}_3 + w^{(2)}_{14} a^{(1)}_4$$
In reality, we are missing the bias term. This is often omitted in the graphical representation I used above, since it makes the whole graph awkward. The bias term is added to each neuron at the end of the weighted sum. Each neuron has its own bias, and the bias for the $i$-th neuron in the $l$-th layer is denoted by the symbol $b^{(l)}_i$. So basically, the expression for the first neuron of the second layer would become

$$a^{(2)}_1 = w^{(2)}_{11} a^{(1)}_1 + w^{(2)}_{12} a^{(1)}_2 + w^{(2)}_{13} a^{(1)}_3 + w^{(2)}_{14} a^{(1)}_4 + b^{(2)}_1$$
As you can see, each term in the sum resembles a linear function. It’s basically a constant (the weight) multiplied by some input value (the activation of the corresponding neuron). We are adding together a bunch of linear functions, and the sum of a bunch of linear functions is just another linear function. But we wanted to approximate any function. How can we approximate any function if all we get is a linear function? This is where activation functions come in. Now, there are multiple activation functions that can be used, but a really effective and simple one is the ReLU function:

$$\mathrm{ReLU}(x) = \max(0, x)$$
Basically, every value under $0$ is clamped to $0$, and anything above $0$ is left as-is. This, believe it or not, is enough for the network to break away from the linear-function constraint. Usually, the value we computed before is not denoted by the symbol $a^{(2)}_1$, but by the symbol $z^{(2)}_1$. This is an intermediate value, and the actual activation of the neuron is obtained by applying the activation function to $z^{(2)}_1$. As I said, there are multiple possible activation functions, and ReLU is just one of them. To refer to the activation function in general, regardless of which function this may be, the symbol $\sigma$ is usually used.
Therefore,

$$z^{(2)}_1 = w^{(2)}_{11} a^{(1)}_1 + w^{(2)}_{12} a^{(1)}_2 + w^{(2)}_{13} a^{(1)}_3 + w^{(2)}_{14} a^{(1)}_4 + b^{(2)}_1$$
and then

$$a^{(2)}_1 = \sigma\left(z^{(2)}_1\right)$$
This was all to compute the first activation of the second layer. More generally, to compute the $i$-th activation of the $l$-th layer, where $n_{l-1}$ is the number of neurons in the $(l-1)$-th layer, we can use the formula

$$a^{(l)}_i = \sigma\left(\sum_{j=1}^{n_{l-1}} w^{(l)}_{ij} a^{(l-1)}_j + b^{(l)}_i\right)$$
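Here is a minimal sketch of forward propagation written directly from that formula, using ReLU as $\sigma$ and made-up layer sizes and weights. It is not any particular library’s implementation, just the math as code.

```python
import numpy as np

def relu(x):
    # sigma(z) = max(0, z), applied element-wise
    return np.maximum(0, x)

def forward(x, weights, biases):
    """Forward propagation: repeatedly compute z = W a + b, then a = sigma(z).
    (Real regression networks often leave the last layer linear, but here we
    follow the article's recipe and apply the activation everywhere.)"""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        z = W @ a + b      # weighted sum of the previous layer plus the bias
        a = relu(z)        # activation function
    return a

# Made-up network shapes: 4 inputs -> 5 -> 3 -> 1 output, with random weights.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

# Example input: size in square meters, number of rooms, age in years, location score.
print(forward([120.0, 4.0, 30.0, 7.0], weights, biases))
```

With trained weights instead of random ones, the single number this prints would be the network’s predicted price.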
How Are the Correct Parameters Found?
What we just did is called forward propagation - going from input to output. There is an inverse process, called backpropagation, which can be used to tune the weights given some expected outputs for certain inputs.
On a high level, you need a big dataset of inputs with their expected outputs. What you do is go through each row, which contains the inputs and their expected outputs, and run the inputs through the neural network. This is still forward propagation.
You then compare the outputs you got back from the neural network with the expected outputs from the dataset. Based on the difference between the outputs, there are mathematical tools that can be used to determine how to tune the weights to get the neural network closer to the expected outputs. You cannot determine how much you have to tune them to get the exact output, but you can find out “the direction to go towards”.
This is done thousands, maybe even millions or billions of times, until your model is rather accurate at predicting outputs based on inputs. And the interesting part is that it will be fairly good at predicting accurate outputs for inputs it has never seen before (i.e. inputs that were not contained in the dataset).
This is done through derivatives, so high school math would be enough to understand the mathematical process, but it is rather long and this is supposed to be an introduction.
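To make “the direction to go towards” concrete, here is a toy example with a single weight $w$ in the model $y = w \cdot x$: the derivative of the squared error tells us which way to nudge $w$, and repeating that nudge is, in essence, what training does (backpropagation computes these derivatives for millions of weights at once). All numbers below are made up.

```python
# Tiny made-up dataset: inputs x and expected outputs y (here y = 3x is the "true" function).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0              # start with a bad guess for the single weight
learning_rate = 0.05

for step in range(200):
    # Derivative of the mean squared error with respect to w:
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x, averaged over the dataset.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad   # move "in the direction" that reduces the error

print(round(w, 3))  # approaches 3.0, the value that best fits the data
```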
I might write a follow-up article to go more in-depth.
Large Language Models
Now, even though LLMs do contain neural networks and are, in many ways, based on them, their structure is much more complex. I will not get into the exact structure in this post, since that is beside the point.
Before we talk about anything else, there is one issue we have to cover. Up until this point, we have always talked about numbers. A neural network processes numbers. How come large language models are able to take text as input? And how do they give text as output?
Encoding Tokens as Vectors
We need some way of encoding text as numbers. An LLM has a “vocabulary” of tokens that it knows. Tokens are not really words, but they are fairly close: they are “pieces” of words. One could think of them as being similar to syllables. For the sake of simplicity, you can think of tokens as just being words.
In reality, a token could be a whole word, a part of a word, or even punctuation. There is no one-to-one way of converting words to tokens. Each model could also handle it slightly differently and there are different approaches.
The input of an LLM is given via the IDs of each token. It really is that simple: the LLM has a vocabulary of tokens that it knows, and you just start from $0$ and number each of them with integers going up to $N-1$ (where $N$ is the number of tokens the LLM has in its vocabulary).
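A made-up illustration of that numbering (real tokenizers split words into sub-word pieces in much cleverer ways; this only shows the ID idea):

```python
# A made-up vocabulary; real LLMs have tens of thousands of tokens.
vocab = ["<unk>", "the", "cat", "sat", "on", "mat", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}   # IDs 0 .. N-1

def encode(text):
    # Unknown words map to a catch-all ID here; real tokenizers instead
    # break them down into smaller known pieces.
    return [token_to_id.get(tok, 0) for tok in text.lower().split()]

print(encode("The cat sat on the mat ."))  # [1, 2, 3, 4, 1, 5, 6]
```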
Now, this is great, but the IDs are arbitrary. They do not carry any meaning. So how does the LLM know what each token means? This is where embeddings come in. They are similar to weights in usual neural networks in that they are learned by the AI model during training.
I will not get into the details of how training works here, since that is outside of the scope of this article.
The embedding is a matrix where each row is a vector for one token. The number of rows of the matrix corresponds to the number of tokens the LLM knows. You use the ID of the token (which is basically an index) to look up the row of the matrix corresponding to that token. Now, these vectors are very large (i.e. they are high-dimensional). They encode the meaning of each token through the point they point to in space. This part is very interesting, because many observations can be made about these vectors.
The vectors corresponding to tokens (which, for simplicity, we think of as words) with similar meaning also point in similar directions. For example, words like “position” and “location” are very close to each other in space.
Another interesting property has to do with the differences of these vectors. The best way to illustrate this is through an example.
Take the vectors $v_{\text{man}}$ and $v_{\text{woman}}$, which encode the meaning of the words “man” and “woman” respectively. Now, take the vectors $v_{\text{king}}$ and $v_{\text{queen}}$, which encode the meaning of “king” and “queen” respectively. We can say that

$$v_{\text{king}} - v_{\text{man}} \approx v_{\text{queen}} - v_{\text{woman}}$$
This is because the difference between the two (in both cases) is their sex: what differentiates a man from a woman is that one is male and the other is female.
The difference between a king and a queen is the same: one is male, the other is female. It only makes sense for the difference of the two vectors to be roughly equal. They quite literally differ in similar ways.
Or, to make another example, the vector for “dog” will be closer to the vector for “cat” than to the vector for “car”. “Cat” and “car” sound similar, but their meaning is very different. This is important: embeddings encode meaning. The actual letters used to form a word which has that meaning are basically irrelevant.
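Here is a small sketch of both observations using tiny, hand-made 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and their values are learned during training, not written by hand; the numbers below were only chosen so the relationships are easy to see.

```python
import numpy as np

# Hand-made toy embeddings; only their relative directions matter here.
emb = {
    "man":   np.array([ 1.0,  0.0, 0.2]),
    "woman": np.array([ 1.0,  1.0, 0.2]),
    "king":  np.array([ 1.0,  0.0, 0.9]),
    "queen": np.array([ 1.0,  1.0, 0.9]),
    "dog":   np.array([-1.0,  0.3, 0.1]),
    "cat":   np.array([-1.0,  0.4, 0.1]),
    "car":   np.array([ 0.1, -0.8, 0.5]),
}

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar meanings point in similar directions...
print(cosine(emb["dog"], emb["cat"]), cosine(emb["dog"], emb["car"]))  # high vs. low
# ...and the "male -> female" difference is roughly the same vector in both cases.
print(emb["king"] - emb["man"], emb["queen"] - emb["woman"])
```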
Attention
LLMs are based on the so-called transformer architecture. The invention of this architecture, introduced by Google researchers in the 2017 paper “Attention Is All You Need”, started the whole AI revolution we are living in. What set this architecture apart from previous models was the idea of “attention”.
Previous methods struggled with long-range dependencies, which is exactly what you need for processing text and natural language. Take for example this short text:
“Light travels fastest in a vacuum. In perfect conditions, it can reach up to 299 792 458 m/s”.
Now, what does “it” refer to in the text above? Well, it refers to light, which was previously mentioned. What are “perfect conditions”? To understand what perfect conditions are in the context of this example, you need to see the whole text.
You cannot process one word at a time, you need the full picture all at once to get a good idea of what a sequence of words actually means. That is what attention is for. You need to look at a word in the context of other words. In fact, the embedding vector (which we discussed above) is mutated based on surrounding words, because the same word can have completely different meanings based on the context.
Think of all the words in the English language that have more than one definition, where the right one is chosen based on context.
Including attention in the calculation of the output of an LLM is a mathematical process, just like everything else. I will not go into detail; for now it is important to know that attention exists and that it is what revolutionized the field of natural language processing and artificial intelligence research.
This is what sets transformers apart from other models. What was done before was to use much simpler ideas, like simply feeding the previous output back in as the next input and hoping that the model is able to understand it (this is very basic; later some additional improvements were made, but this was the general idea).
Now, this is still done with LLMs - the previous output is still fed back in as part of the next input - but with the transformer architecture the model can better distinguish the meaning of words based on the surrounding text. Also, some words are naturally more important than others in a sentence. Some carry more meaning and more depth, others add very little to the final message and are only there for grammatical correctness. Attention also helps with that.
An intuitive understanding is to think of it like this: each token does not only see itself, but it also constantly asks “which other tokens are relevant to me?”, or “which other tokens add meaning to me?”. This is much closer to how we as humans read a sentence.
In summary, attention assigns weights or “scores” to each token. Each token tells each other token how much the model should “pay attention” to it.
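For the curious, here is a minimal sketch of the core computation, scaled dot-product attention, with made-up sizes. Real transformers add learned projection matrices, multiple attention “heads”, and much more on top of this.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.
    Each row of Q is a token asking "who is relevant to me?",
    each row of K is a token advertising what it offers,
    and V holds the information that actually gets mixed together."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # one relevance score per pair of tokens
    weights = softmax(scores)       # each row sums to 1: "how much attention"
    return weights @ V              # each token becomes a weighted mix of all tokens

# 4 tokens, 8-dimensional vectors (sizes made up for the example).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = attention(x, x, x)            # self-attention: tokens attend to each other
print(out.shape)                    # (4, 8): one updated vector per token
```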
Scaling a Transformer
It was found that by adding more transformer layers, the model is able to understand more abstract meanings, since each layer gives it a deeper understanding.
Simply throwing more computing power at a transformer and making it bigger (adding more layers and parameters) was found to make the model better. If this is done, the amount of data fed to the model during training also has to be scaled proportionally.
Limitations of the Transformer Architecture
Some aspects of the transformer architecture make scaling it to very large sizes highly impractical.
For one, LLMs have a finite context size. The context is how many tokens the model can process and take into account when predicting the next one. Now, the context size can be made bigger, but the computational cost grows with the square of the context length. This is a serious limitation that, with current technologies (keeping costs in mind), only allows the context size to be scaled up to a certain point.
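A quick back-of-the-envelope illustration: the attention score matrix has one entry per pair of tokens, so doubling the context length roughly quadruples that part of the work.

```python
# One attention score per pair of tokens: cost grows quadratically.
for context_length in [1_000, 2_000, 4_000, 8_000]:
    pairs = context_length ** 2
    print(f"{context_length:>5} tokens -> {pairs:>13,} pairwise scores")
```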
When giving large documents to an LLM, the output may not take the start of the document into account.
Also, generally, the whole computation of attention and of all the neural networks inside the transformer architecture (each transformer contains a multitude of neural networks) makes computation very expensive compared to other, simpler architectures.
Also, data is finite, especially high-quality data. Companies have already (and, quite frankly, in many cases illegally) scraped essentially all of the readily available text produced by humans. There is not much new data left; we quite literally have to wait for humans to create more.
Also, there is a catch: much of the text created nowadays is itself AI-generated. But AI-generated content is not suitable for training an AI to be better than the AI that generated it; in fact, training on it tends to make future models worse. So now you also need a reliable way of distinguishing human-written text from AI-generated text.
As models get better and more complex, this becomes increasingly difficult.
Misconceptions
There are many misconceptions and hype surrounding LLMs and artificial intelligence in general.
The most important thing that should be stated is that models like ChatGPT or Claude (or any other LLM for that matter) do not have any understanding in the slightest sense.
Keep in mind that most CEOs of AI companies are flat-out lying when they say that “they cannot determine whether their model is conscious or not”. The models are not conscious; in fact, they are not even close, not in the slightest sense. They do not even understand what the words mean. They only map words to a point in some high-dimensional space. That is literally it.
With attention, researchers were able to model, with astounding accuracy and in mathematical form, the way we humans read and understand a text. This is great, but it does not get us close to consciousness in any way, shape or form.
LLMs are, literally and without exaggeration, glorified autocomplete. They are simply a better version of what your phone does when it suggests the next word you are about to type. And again, I cannot stress this enough: I am not saying this to oversimplify the concept. LLMs do the same thing that the autocomplete engine in your phone has been doing for over a decade: predict the next word. They are just very good at it. This does not imply consciousness or understanding.
Large language models behave as if they understood what they are saying. Some people would argue that this distinction does not matter; I would argue the opposite. AI does not know about “consequences” or “responsibility”. How could you delegate important tasks to an LLM without supervising it? Sounds like a suicide mission to me.
Another misconception is that LLMs can be relied upon for tasks where correctness is crucial. What many people fail to acknowledge is that LLMs are non-deterministic.
This means that the same input can lead to different outputs. This is because, as I explained before, the model outputs a probability distribution for the next token, and the next token is picked based on its probability. The output is not the same each time. You cannot rely on LLMs for critical tasks (like computer programming or medicine) without proper professional supervision, because on top of not understanding what they are saying, they are, to some extent, random.
LLMs can be configured to be deterministic (i.e. always pick the same token with the same input), but this has been found to deeply hurt the quality of the output.
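A small sketch of the difference, with a made-up output distribution: sampling picks tokens according to their probabilities (non-deterministic), while greedy decoding always takes the single most likely token (deterministic).

```python
import random

# A made-up probability distribution for the next token, as an LLM might produce.
distribution = {"Paris": 0.55, "Lyon": 0.25, "Rome": 0.15, "banana": 0.05}

def sample(dist):
    """Non-deterministic: pick a token according to its probability."""
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs)[0]

def greedy(dist):
    """Deterministic: always pick the single most likely token."""
    return max(dist, key=dist.get)

print([sample(distribution) for _ in range(5)])  # can vary from run to run
print([greedy(distribution) for _ in range(5)])  # always ['Paris', 'Paris', ...]
```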
Also, as a side-note (most people already know this), LLMs often make up facts. When doing so, they are always very confident, and they can straight-up lie to your face about something completely made up. Remember: they only predict plausible text, not correct text.
Conclusion
This serves as a nice introduction for acquiring a deeper understanding of AI, neural networks, and large language models. I am planning to go into more detail in future posts. Some parts are very theory-heavy (especially mathematically speaking), and most of the important background is linear algebra.
You can find a university-level linear algebra course that I have been working on here, on my website.
I also plan on implementing a neural network from scratch in some low-level language, which would allow the reader to get a very deep understanding of the mathematical process behind a neural network.
Hopefully you found this useful. Thank you for reading.