a (very simple) language model
Much of this post is inspired by Andrej Karpathy's Building makemore series!
Notebook implementing Neural Probabilistic Language Model paper: https://colab.research.google.com/github/heerboi/AI-from-scratch/blob/main/neural_probabilistic_language_model.ipynb
Notebook implementing the code in this post: https://colab.research.google.com/github/heerboi/AI-from-scratch/blob/main/building_(tiny)_language_models.ipynb
Okay, what are we doing here, Heer? We're going to learn a bit about language models (and make a tiny one from scratch!).
Breaking it down
Transformer architecture with the encoder on the left, and decoder on the right (Vaswani et al., 2017)1
There you go, does that break it down for you? Before we go any further, I'd like to assure you that we're not gonna be exploring Transformers in this post (But I am planning it to be my next). While transformers are the building blocks of GPT and other big-time LLMs, they are absolutely not necessary to build a language model2!
For the purpose of this post, let me define language models as machine learning models that take text tokens as input, do some multiplications and additions, and output probabilities for a set of tokens. What the heck are tokens?! I understand your frustration (it was mine too when I started learning this).
A token is some representation of your input that the model understands. Think of it from our perspective; when we encounter a word we haven't heard before, our brain breaks it down (think "non-existent") into some pieces. These are also tokens. The way an ML model represents things is using numbers! Since we will be inputting text, we'll need to convert it into some form of numbers before feeding it into the language model. The model will output probabilities for similar tokens, which we can then convert back into text.
A language model takes these tokens as input, finds their corresponding representations, and then does silly math to do whatever you tell it to do. Usually, it's to predict the next token. While training, it predicts the next token and checks if it's correct. If it is, then voila, it's working!!! If it's not, it will figure out which representations were incorrect and then nudge them by a small factor that will help the model make correct predictions.
An example of tokenization that is used by GPT2! (play w it here: https://tiktokenizer.vercel.app/?model=gpt2)
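If you want to poke at this yourself, here's a tiny sketch using OpenAI's tiktoken library (I'm assuming you have it installed; we don't use it anywhere else in this post) with the same GPT-2 encoding as the page above:
import tiktoken

# load the GPT-2 encoding and round-trip a word through it
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("non-existent")
print(tokens)              # a short list of integer token ids
print(enc.decode(tokens))  # "non-existent" - decoding gets us back to the text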
Groundwork
We need a definitive objective for our language model. The objective that I will be selfishly using is to create names! If you don't have experience in ML, it might be a bit difficult to think of objectives but at the core of it, there are two basic objectives - classifying things and predicting things.
We'll be training our language model on a dataset of names to get better at creating (better than) random names. The simplest model is a bigram model, which takes only a single character as input, and predicts the next character. For the name "Amy", the training examples would be:
- predicting 'a' as the first character
- taking 'a' as input and predicting 'm'
- taking 'm' as input and predicting 'y'
- taking 'y' as input and predicting that it's the last character
Now, this model will not perform well because:
- Take the 3rd point, where the model has to predict 'y' with 'm' as the input: it does not make use of the information that there is an 'a' before the 'm', and that 'a' is the first letter of the name! It is simply looking at the last character, with no other information - perpetually blind!
- It will be trained on hundreds, if not thousands, of names. Think about the model encountering "Amy" and "Amelia". It'll think, "Okay, 'y' after 'm', 'y' after 'm'!", and then get (mathematically) confused, thinking, "Okayyyy.... 'e' after 'm'?".
But, it is a good place to start learning.
Lastly, since we need something to denote the start and end of a name, we'll append a dot (.) to both ends.
!curl https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt > names.txt
words = open("names.txt", 'r').read().splitlines()
input_char = []
output_char = []
for word in words:
    chars = ['.'] + list(word) + ['.']
    for char1, char2 in zip(chars, chars[1:]):
        input_char.append(char1)
        output_char.append(char2)
#1 a very simple language model (using frequencies)
The model will predict the next character based on what it has seen in the training data. For each character, it will have a set of probabilities that says how often a character comes after it. How can we reach these probabilities in the easiest way? By counting the pairs!
We can track counts of pairs of the 27 unique characters (a-z and .) in a 27x27 array or matrix, like this!
import torch
# Tracking counts
frequencies = torch.zeros((27, 27), dtype = torch.int32)
unique_chars = ['.'] + sorted(list(set(''.join(words))))
stoi = {c:i for i, c in enumerate(unique_chars)}
itos = {i:c for c, i in stoi.items()}
for word in words:
    chars = ['.'] + list(word) + ['.']
    for char1, char2 in zip(chars, chars[1:]):
        char1_idx = stoi[char1]
        char2_idx = stoi[char2]
        frequencies[char1_idx, char2_idx] += 1
# Graphing
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize = (16,16))
plt.imshow(frequencies, cmap="Blues")
for idx1 in range(27):
    for idx2 in range(27):
        pair = itos[idx1] + itos[idx2]
        plt.text(idx2, idx1, pair, ha="center", va="bottom", color="black")
        plt.text(idx2, idx1, frequencies[idx1, idx2].item(), ha="center", va="top", color="black")
plt.axis("off")
(this is interesting in itself to analyse! Most names in our data end with either n or a.)
How do we now use these counts to make new names? We feed the network a dot as the beginning of a name, take the output, and keep running it through the model until it outputs a dot. The first row gives the counts of characters that come after the dot. How do we treat these counts as probabilities (or, more simply, convert them to be between 0 and 1)?
Normalization! Remember this, okay?! To convert a set of counts to probabilities, we can divide each count by the sum of counts. Let's do that for each row.
# Calculate probabilities
probabilities = (frequencies+1).float()
probabilities /= probabilities.sum(dim=1, keepdim=True)
# Graphing
plt.figure(figsize = (16,16))
plt.imshow(probabilities, cmap="Blues")
for idx1 in range(27):
    for idx2 in range(27):
        pair = itos[idx1] + itos[idx2]
        plt.text(idx2, idx1, pair, ha="center", va="bottom", color="black")
        plt.text(idx2, idx1, format(probabilities[idx1, idx2].item(), ".2f"), ha="center", va="top", color="black")
plt.axis("off")
Isn't that beautiful?
Note: Without smoothing, a lot of these probabilities would be 0, meaning those pairs would never get chosen while making names. To offset this, we add a constant value (1, in the code above) to all counts before normalizing, resulting in better-distributed probabilities with no zeros.
actually making some names!
torch.multinomial treats a set of numbers as probabilities and samples from it, returning the index (this is important) of the sampled number.
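A quick standalone check, with made-up probabilities, just to see what it returns:
# made-up probabilities for 3 outcomes; multinomial returns sampled *indices*
probs_demo = torch.tensor([0.6, 0.3, 0.1])
samples = torch.multinomial(probs_demo, num_samples=10, replacement=True)
print(samples)  # e.g. tensor([0, 0, 1, 0, 2, 0, 0, 1, 0, 0]) - index 0 comes up most often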
Now, here's the actual name-making code first; the walkthrough follows:
for _ in range(5):
    name = []
    ix = 0
    while True:
        probability = probabilities[ix]
        ix = torch.multinomial(probability, num_samples=1, replacement=True).item()
        if ix == 0:
            break
        name.append(itos[ix])
    print("".join(name))
We start with a dot, and, until we reach the ending dot, we'll:
- take the row of probabilities for the next character,
- sample an index based on them,
- convert it to its character,
- and append it to the name.
BOOM, that's the simplest language model!
measuring its performance!
We usually measure a model's performance using some metric called "loss". The higher the loss, the worse the model. To measure the loss, we need a set of examples for which we know the correct answer, such as our training dataset (that had pairs of characters).
After training on our dataset, we can calculate the training loss, which tells you how well the model performs on the dataset that it was trained on. If we get a probability of 1.0 for each training example, it means our model is perfectly predicting the next character. In statistical modelling, it is very common to use likelihood to measure the quality of a model. It is calculated by multiplying the output probabilities of the correct next character.
likelihood = 1.0
for word in words[:3]:
    chars = ['.'] + list(word) + ['.']
    for char1, char2 in zip(chars, chars[1:]):
        # iterate through all examples and fetch the probability of char2 given char1
        pair_prob = probabilities[stoi[char1], stoi[char2]]
        likelihood *= pair_prob
print(f"Model likelihood: {likelihood:.20f}")
The likelihood comes out to be a veryyyyyy small number, because we are just multiplying very small numbers between 0 and 1 to get even smaller numbers! We can use the logarithm to make this manageable.
Let's see why log fits our needs:
- A perfect model will have a likelihood of 1.0 (as all probabilities will be 1.0), so we expect its loss to be as low as possible, and $\log(1) = 0$!
- $\log(a \cdot b) = \log(a) + \log(b)$, so instead of multiplying lots of tiny probabilities, we can simply add up their logs (see the quick check below).
- $\log(x)$ results in negative values when $0 < x < 1$, which means our loss will be negative. We can just flip the sign to get the positive loss.
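Here's a tiny numeric check of those log facts (purely illustrative, not part of the model):
import math

print(math.log(1.0))                   # 0.0 - a perfect model's log likelihood
print(math.log(0.5 * 0.25))            # -2.0794...
print(math.log(0.5) + math.log(0.25))  # the same value: log(a*b) = log(a) + log(b)
print(math.log(0.1))                   # negative, since 0 < 0.1 < 1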
log_likelihood = 0
n = 0
for word in words:
    chars = ['.'] + list(word) + ['.']
    for char1, char2 in zip(chars, chars[1:]):
        pair_prob = probabilities[stoi[char1], stoi[char2]]
        log_likelihood += torch.log(pair_prob)
        n += 1
print(f"Model log likelihood: {log_likelihood/n:.4f}")
print(f"Model negative log likelihood: {-log_likelihood/n:.4f}")
BAM, now we can measure how good our model is, using negative log likelihood! To have a better measure, we can divide the total negative log likelihood by the number of samples that we are calculating it over. If you've ever heard of neg log likelihood and didn't understand it, well, now you do!
NOTE: Log likelihoods are going to be very important for the rest of this blog, and also for your ML journey, so remember the concept.
names and performance
Here are some names that the model produces:
A model that does not learn anything would have equal probabilities for all characters ($\frac{1}{27}$ each) and has a negative log likelihood of 3.2961. Our final negative log likelihood ends up at 2.4544, 34% better!
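That baseline is easy to check yourself; a know-nothing model assigns $\frac{1}{27}$ to every next character:
import math

# uniform probabilities -> negative log likelihood of -log(1/27) = log(27)
print(-math.log(1 / 27))  # ~3.2958, roughly the 3.2961 figure above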
next steps
Okay, now that we're past that, it feels a little... stupid. But this frequency matrix approach is extremely fast; it's just a lookup! There's no training required and it always produces the same set of probabilities given the same input data. These are great characteristics for a baseline model! A baseline model is some model against which we can evaluate better, more complex models to make sure that they're learning something.
Here's a problem, though. Our vocabulary in this problem consists of 27 individual characters only! The frequency matrix currently has 27x27 (729) total entries. What if we decide to take two characters as input? Then, for every combination of two characters, we have 27 possibilities. That goes up to 27x27x27 (19,683) entries!
Now, if we were to encode the word "hello" using our character vocabulary, we'd have [8, 5, 12, 12, 15], and if we use the GPT-4 tokenizer, we'd have just [15339]. That's 5 tokens to one word, and the gap grows very quickly as you can imagine. Moreover, a character-level LLM would take much more compute + memory to reach the same level of performance as an LLM that uses word- or subword-level encoding, because it would have to learn the structure of words, prefixes, and suffixes, and only then semantically understand the meaning of these character sequences. For languages that are rare, not well defined, or that don't have much data, character-level encoding is a good place to start. Modern LLMs use subword-level encoding so that, during training, they learn much quicker.
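You can verify both encodings of "hello" in a couple of lines; the GPT-4 side assumes you have the tiktoken library installed (cl100k_base is the GPT-4 encoding):
# character-level encoding with our stoi ('.' = 0, 'a' = 1, ..., 'z' = 26)
print([stoi[c] for c in "hello"])  # [8, 5, 12, 12, 15]

# GPT-4's tokenizer via tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("hello"))         # a single token id - the [15339] mentioned above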
This is where neural nets come in. Instead of learning by counting frequencies, neural nets learn by building and modifying their own internal representation of the vocabulary (also called embeddings), and learning the function that relates the inputs to the outputs. For comparison, our current model has 27 characters, and 27 characters that can come after each of them; its entire "representation" of a character is the row of probabilities for the 27 characters that could come next - it holds no other information about the character.
With a neural net, we can instead say that we want it to represent a character using just 2 numbers, or even 200. It is basically a bunch of matrix multiplies, but, very cleverly, we can get from this representation of 2 or 200 numbers back to 27 numbers for our output.
The neural net starts by initialising a random representation matrix and a weight matrix. A weight matrix is what transforms (by matrix multiplication) the representation of the inputs into the outputs; it maps the input representation to the output. Suppose we have 5000 names, use 5 numbers to represent each token embedding, and pass three tokens as input; our input matrix then has 5000 x (5 x 3) = 5000 x 15 dimensions. Now, we want the outputs to be probabilities of the next character for each of the 27 characters. Our weight matrix, then, will be 15 x 27! It gets matrix-multiplied with the input matrix to give a 5000 x 27 output (27 scores for each of the 5000 examples). This won't work (not very well, at least), because we have no non-linearity (activation functions) and no hidden layers!
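Just to make those dimensions concrete, here's a throwaway shape check with random numbers standing in for the real inputs and weights:
# 5000 examples, 3 tokens x 5-number embeddings = 15 input features, 27 output scores
inputs = torch.randn(5000, 15)  # (5000 x 15) input matrix
W = torch.randn(15, 27)         # (15 x 27) weight matrix
outputs = inputs @ W            # one row of 27 scores per example
print(outputs.shape)            # torch.Size([5000, 27])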
the thing about non-linearities and hidden layers
Hidden layers are intermediate layers between the inputs and outputs of a neural network. Every hidden layer has a set of weights that it uses to manipulate its inputs. If you imagine a simple regression model that directly maps the inputs to the outputs using scaling and addition (roughly speaking), it won't be able to learn complex patterns within the data.
Suppose a function $y = 2x_1 + 3x_2$, which is a weighted sum of the inputs. The model would have to come up with the multiplicative factors of $x_1$ and $x_2$, which are 2 and 3 respectively. Such a function is pretty easy to map using a linear model. Now suppose a function $y = 2x_1 + 3x_2 + x_1x_2$, which is not just a weighted sum, but includes a term that is a combination of the two inputs. This is a non-linear function, and cannot be mapped using a linear model, simply because the model cannot create intermediate patterns from the input (it can only use $x_1$ and $x_2$ as given). Note that if we add $x_1x_2$ as an input alongside $x_1$ and $x_2$ for a linear model, it will then be able to map that function. In most complex cases (that simple regression cannot solve), you won't have very good inputs to build a good model; the model will have to create the variations of the inputs that it thinks predict the output best.
The hidden layers act as intermediate features (think of them as carefully engineered inputs that are more suitable for predicting the output) that help the model map non-linearities within the data. There's another catch (I know, I'm sorry): if the hidden layers simply perform an arbitrary weighted sum of their inputs, the network is no different from a linear model; it still won't be able to map non-linear functions. That's where activation functions come in.
Activation functions help the model understand complex patterns within the data by creating piecewise linear functions, which are combined to map a complete non-linear function. These intermediate linear functions can be thought of as complex patterns within the data. The model understands these patterns and can solve complex problems by weighing the outputs of these patterns. Moreover, each hidden layer builds upon the linear functions of the previous layer, combining to find more complex patterns as you go on.
a small example
To show the difference a non-linearity makes, let's take the following tiny network:
It looks complex but it's pretty simple (I couldn't add math eqns to draw.io) 😭 Okay, we've got two inputs going into the net - $x_1$ and $x_2$. These inputs are connected to three hidden nodes, with each connection having a separate weight. For ease of calculation, we bunch up the weights for each hidden node, leaving us with three weight vectors $w_1$, $w_2$, and $w_3$, each of dimension 2x1 (2 rows x 1 col). The inputs are multiplied with their respective weights and summed up in the hidden layer: the inputs are packed into a 1x2 (1 row x 2 cols) vector, which is multiplied with $w_1$ to get a single value for the first hidden node, and likewise for $w_2$ and $w_3$. The hidden-layer values are summed with the biases to get linear outputs $z_1$, $z_2$, and $z_3$. We'll be applying the ReLU (Rectified Linear Unit) activation function - one of the simplest non-linearities. It lets the input value through if it's greater than 0, and otherwise clamps it to 0; sounds insane, but this is actually a pretty good activation function!
To show the difference, I'll start by showing what $z_1$, $z_2$, and $z_3$ look like, with and without applying a non-linearity. After that, I'll show how the final output looks with and without a non-linearity.
The top row shows a contour plot of each neuron's output, and the bottom row shows the non-linearity applied to those outputs. All values less than 0 in the top row are clamped to 0 by ReLU in the bottom row. Keeping this in mind, it's hard to imagine what difference this makes to the sum of the outputs in the first row versus the second row. The graph below shows the respective sums.
BOOM! The linear combination is indeed linear, whereas the non-linear one is indeed non-linear! This is the magic of non-linearities, even one as simple as ReLU. For me, an intuitive way to think about how non-linearities work is to imagine that they're an external force on the hidden neurons - each neuron has a different "breakpoint" where it switches on/off. In this example, the first hidden node has weights $w_1$ and bias $b_1$, so the decision boundary separating its on and off regions is given by the equation $w_1 \cdot x + b_1 = 0$!
I hope it is clear now why our approach wouldn't work without a hidden layer and a non-linearity - our problem is much more complex than just the weighted sum of the inputs.
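If you'd like to reproduce the effect yourself, here's a minimal sketch of a two-input, three-hidden-node network like the one above, with made-up weights and biases (not the ones in my diagram), comparing the summed hidden outputs with and without ReLU along the line $x_2 = x_1$:
# made-up weights (2 inputs x 3 hidden nodes) and biases for the toy network
W = torch.tensor([[ 1.0, -1.0,  0.5],
                  [-0.5,  1.0,  1.0]])
b = torch.tensor([0.0, -0.5, 0.25])

# walk along the line x2 = x1 and compare the two sums
for x1 in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    x = torch.tensor([[x1, x1]])           # (1 x 2) input vector
    z = x @ W + b                          # (1 x 3) linear hidden outputs z1, z2, z3
    linear_sum = z.sum().item()            # no activation: changes linearly with x1
    relu_sum = torch.relu(z).sum().item()  # ReLU: kinks wherever a node crosses 0
    print(f"x1 = {x1:+.1f}  linear sum = {linear_sum:+.2f}  relu sum = {relu_sum:+.2f}")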
Too much word, now we code.
#2 neural net approach
groundwork - input
We need to create a dataset of inputs and outputs from the names. We're taking two characters as input and one character as the output (a trigram model). We'll be using the same stoi and itos dictionaries as our tokenizer, with 27 unique tokens.
xs = []
ys = []
# taking 2 characters as input
block_size = 2
for word in words:
    word = word + "."
    # starting with input ['.', '.']
    context = [0] * block_size
    for char in word:
        xs.append(context)
        ys.append(stoi[char])
        # strip the first char in context and add the next char
        context = context[1:] + [stoi[char]]
xs = torch.tensor(xs)
ys = torch.tensor(ys)
We have a total of 228,146 examples. Now, we need to split this dataset into a train dataset and a test dataset, one for training and one for testing only. Why? Because, if we both train and test the model on the same dataset, we cannot assess its performance on data that it hasn't seen before.
split = int(len(xs) * 0.8)
X_train, y_train = xs[:split], ys[:split]
X_test, y_test = xs[split:], ys[split:]
The train dataset has 80% of the examples - 182,516, and the test dataset has the remaining 20% - 45,630.
model
Let's set the stage for the neural language model! The main things we have to keep in mind are:
- embedding matrix, inputs
- input-to-hidden layer weights, hidden bias, hidden-layer activations
- hidden-to-output layer weights, output bias, output activations (softmax), and
- the loss function (-ve log likelihood as we discussed earlier) to tell us how well the model is doing.
Okay! Now, we have the trainable parameters (weights, embedding, biases) and non-trainable parameters or hyperparameters (learning rate, block size, hidden layer size). We're gonna find pretty good trainable parameters by training, but the non-trainable ones are sort of trial and error; we have good estimates though. Tweak these parameters in the notebook linked above and see if you can create a better model!
# NON-TRAINABLE PARAMS
# 50 hidden nodes
hidden_size = 50
# using 3 numbers to represent a token
embedding_size = 3
# using a vocabulary of 27 tokens
token_size = 27
# learning rate!
lr = 1e-4
epochs = 200000
# number of examples to use each epoch
batch_size = 1024
import torch.nn as nn
import torch.nn.functional as F
# TRAINABLE PARAMS
# 27 x 3
embedding = nn.Embedding(num_embeddings=token_size, embedding_dim=embedding_size)
# 2 * 3 x 50 = 6x50
W1 = torch.randn((block_size * embedding_size, hidden_size), requires_grad=True)
# 50
b1 = torch.randn(hidden_size, requires_grad=True)
# 50x27
W2 = torch.randn((hidden_size, token_size), requires_grad=True)
# 27
b2 = torch.randn(token_size, requires_grad=True)
parameters = [embedding.weight, W1, b1, W2, b2]
We're only evaluating a single example in the network shown above, but it can be generalised to work on multiple examples at a time by adding rows of examples (an N x 6 input) and getting N x 27 outputs, one set for each example. You can even use the entire training dataset, but that's very slow. In most cases, we use a mini-batch - a fixed-length random subset of the training dataset. Here we use a mini-batch of size 1024.
We have the embedding matrix with 27 entries, each holding 3 numbers. The example here has a two-character input and its true next character as the output. The embeddings for the two input characters are pulled from the embedding matrix and combined to form a 1 x 6 input vector, which is then passed on to the network. The input is multiplied with the input-to-hidden weights, the hidden bias is added, and the ReLU activation is applied to the result to get the hidden-layer activations. These are multiplied with the hidden-to-output weights, and the output bias is added, to get the output activations.
# select a random set of indices of size batch_size
ix = torch.randint(0, X_train.shape[0], (batch_size,))
# dims = (batch, 2)
batch = X_train[ix]
# dims = (batch, 2, 3)
embeddings = embedding(batch)
# dims = (batch, 6)
# Flatten collapses each example's (2, 3) embeddings into a single 6-number vector (keeping the batch dim)
X_train_embeds = nn.Flatten()(embeddings)
W1x = torch.relu(X_train_embeds @ W1 + b1)
W2W1x = W1x @ W2 + b2
converting logits to probabilities
We're at a standstill again; how do we use these activations to predict the next character? I hope you remember our chat about normalisation! Back then, we were using basic frequency counts, which were always positive numbers, so we had no problem converting them to probabilities simply by dividing by the sum of the respective row. The activations here can be negative too! Moreover, we interpret these activations as unnormalised log probabilities. Why? Let's think.
Let's suppose we're trying to map the probability of a character $c_2$ occurring after a character $c_1$: $P(c_2 \mid c_1)$. Now, we could directly set this to be equal to a linear combination of our inputs (with/without a non-linearity), i.e. one of the activations $z$ coming out of the network:

$P(c_2 \mid c_1) = z$

Can you see the problem? First of all, the linear combination can be negative, which is not allowed for a probability. Secondly, the probabilities of all events in the same event space must sum to 1. How do you even begin to ensure that? A better way is to instead map this function:

$\log P(c_2 \mid c_1) = z$

Now, we don't have to make sure that the probabilities sum to 1 just yet, because we can normalise them after getting to the actual probabilities. The log function can take negative and positive values, so we don't have to constrain the activations to always be positive. So the activations are unnormalised (because they don't sum to 1) log probabilities.

To convert them to probabilities, we can apply the exponent function to both sides:

$P(c_2 \mid c_1) = e^{z}$

and normalise:

$P(c_2 \mid c_1) = \frac{e^{z_{c_2}}}{\sum_{c} e^{z_{c}}}$
This exponent divided by sum of exponents is the infamous softmax function, as denoted in the architecture above. It's sort of the best way to go from neural net activations to probabilities.
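A tiny check, with made-up logits, that "exponentiate, then divide by the sum" is exactly what F.softmax computes:
logits = torch.tensor([[2.0, -1.0, 0.5]])  # made-up unnormalised log probabilities
manual = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
print(manual)                   # same values as below
print(F.softmax(logits, dim=1))
print(manual.sum())             # tensor(1.) - a proper probability distribution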
Once we have the final layer probabilities, we're ready for either training or inference.
# Below is the explicit softmax (the explicit loss appears, commented out, in the training step further down)
# exponentiating very large activations (e.g. from a bad initialisation) can overflow to inf/nan
# which is why we'll use F.cross_entropy (softmax + -ve log likelihood) for the actual loss
probabilities = F.softmax(W2W1x, dim=1)
training the model
Training a neural net is more involved than training the frequency-counting model: there, we were just counting the pairs we saw in the data and then normalizing to get probabilities. Here, however, we have to update all the trainable parameters - 1808 of them in total!
We'll need the loss so we know exactly how the model is performing, and then find the rate of change of the loss with respect to a change in every trainable parameter, i.e. the partial derivative of the loss w.r.t. each trainable parameter. This derivative tells us how much the loss changes with a unit increase in a parameter. If the derivative is -ve, the loss will decrease as the parameter value increases, and vice versa. We can then slightly nudge the value of the parameter based on this derivative!
After updating all the parameter values, we run the model again on a set of data, and calculate the loss and derivatives again to try and reach the global minimum. Think of the loss function as a function with multiple changeable parameters. This function will most probably have a lot of local minima, and, since we can't map the entire function, we cannot be exactly sure if we've reached a local or a global minimum.
If we change the parameter values by a lot, we might land near the global minimum, but it's equally likely we'll shoot ourselves in the foot and get an even worse loss. That's why we only slightly nudge the values (by an amount scaled by the learning rate) over multiple training runs (epochs).
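Here's a minimal, self-contained sketch of that "nudge against the gradient" idea, on a made-up one-parameter loss $(x - 3)^2$ rather than our actual model:
# minimise (x - 3)^2 by repeatedly nudging x against its gradient
x = torch.tensor(0.0, requires_grad=True)
step_size = 0.1
for _ in range(50):
    loss = (x - 3) ** 2
    if x.grad is not None:
        x.grad = None       # clear the old gradient, just like in the training loop below
    loss.backward()         # d(loss)/dx = 2 * (x - 3)
    x.data -= step_size * x.grad
print(x.item())             # close to 3.0, where the loss is smallest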
This method of calculating the partial derivatives, or gradients, backwards is called backpropagation. Let's look at the computational graph of the loss function.
Sorry about the clunky diagram! The gradients flow backwards through the connecting nodes all the way to the embedding matrix! Let's look at an example.
Here is our loss function for a single example:

$loss = -\log(p_{correct})$

where $p_{correct}$ is the probability assigned to the index of the correct next character. For multiple examples, we just average it up:

$loss = -\frac{1}{N}\sum_{i=1}^{N}\log(p_{correct,\,i})$

Now, let's look at the first trainable parameter going backwards, which is the output bias vector. We're going to be calculating this partial derivative:

$\frac{d(loss)}{d(bias_{2})} = p - y$

where $p$ is the vector of output probabilities (after softmax) and $y$ is the one-hot vector marking the correct next character. Similarly, the partial derivative of the hidden-to-output weights:

$\frac{d(loss)}{d(W_{2})} = h^\top (p - y)$

where $h$ is the vector of hidden-layer activations. Notice that the $(p - y)$ term shows up in both; it gets "propagated" backwards to all the earlier derivatives to reduce redundant computation. PyTorch, however, handles all of this for us using something called autograd, which keeps track of the computations performed on the parameters and can produce the derivatives by backtracking through these computations to each particular parameter. I'll link a video that explains it in detail3.
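If you want to convince yourself of that $(p - y)$ result, here's a small sanity check with made-up activations and a made-up target, comparing autograd's gradient for the output bias against the softmax output minus the one-hot target:
# made-up hidden activations, output weights/bias, and a target character index
h_demo = torch.randn(1, hidden_size)
W2_demo = torch.randn(hidden_size, token_size, requires_grad=True)
b2_demo = torch.randn(token_size, requires_grad=True)
target = torch.tensor([4])

logits = h_demo @ W2_demo + b2_demo
loss = F.cross_entropy(logits, target)
loss.backward()

p = F.softmax(logits, dim=1)
y = F.one_hot(target, num_classes=token_size).float()
print(torch.allclose(b2_demo.grad, (p - y).squeeze(0), atol=1e-6))  # True: d(loss)/d(b2) = p - y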
So, finally, we run the model for multiple epochs, tracking the loss, calculating the gradients and updating the parameters after each epoch.
# the explicit softmax + -ve log likelihood (commented out below) can overflow numerically
# with a bad initialisation: e^100 ≈ 2.7e+43, so very large activations blow up exp() to inf
# F.cross_entropy (softmax + -ve log likelihood) computes the same loss in a numerically stable way
# loss = -probabilities[torch.arange(batch_size), y_train[ix]].log().mean()
loss = F.cross_entropy(W2W1x, y_train[ix])
# Have to clear the previous gradient in every epoch, otherwise it will accumulate
for p in parameters:
    p.grad = None
# autograd computes the gradient of the loss and stores it in the "grad" attribute of every param
loss.backward()
for p in parameters:
    p.data -= lr * p.grad
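Putting it all together, one way (a sketch, not necessarily how the notebook structures it) to wrap the forward pass and the update above into a full loop over the epochs hyperparameter defined earlier:
for epoch in range(epochs):
    # sample a fresh mini-batch each epoch
    ix = torch.randint(0, X_train.shape[0], (batch_size,))

    # forward pass (same as above)
    embeds = nn.Flatten()(embedding(X_train[ix]))   # (batch, 6)
    hidden = torch.relu(embeds @ W1 + b1)           # (batch, 50)
    out = hidden @ W2 + b2                          # (batch, 27)
    loss = F.cross_entropy(out, y_train[ix])

    # backward pass and parameter update
    for p in parameters:
        p.grad = None
    loss.backward()
    for p in parameters:
        p.data -= lr * p.grad

    if epoch % 10000 == 0:
        print(f"epoch {epoch}: loss = {loss.item():.4f}")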
names and performance
The negative log likelihood loss we calculated for the frequency counting model was on the whole dataset without any splits, and we got 2.4543. Using this model, we're able to get a loss of 2.3480 on the train dataset, and 2.5320 on the test dataset, an approximately 5% improvement. If you tweak the hyperparameters, you can definitely beat this! Using a similar network, I was able to get a train loss of 1.98 and a test loss of 2.15 (notebook attached above).
To predict some names, we start with an empty context ['.', '.'] and run it through the model. After getting the final probabilities, we select a character using torch.multinomial() and check whether it's the end of a name ('.'). If it is, we end the loop there; if not, we continue.
names = []
for _ in range(5):
    context = [0] * block_size
    # init empty name
    name = ""
    while True:
        # no gradient calculation
        with torch.no_grad():
            # take out the embeddings for context, and flatten them
            input_embeds = embedding(torch.tensor(context)).view(1, -1)
            W1x = torch.relu(input_embeds @ W1 + b1)
            W2W1x = W1x @ W2 + b2
            probs = F.softmax(W2W1x, dim=1)
        # sample one token index from the probabilities
        next_char = torch.multinomial(probs, num_samples=1, replacement=True).item()
        # if the selected token is a '.', the name is finished
        if next_char == 0:
            break
        # remove the first token and add the new token to the context
        context = context[1:] + [next_char]
        # add the char to the name
        name += itos[next_char]
    print(name)
That's it from my side! Thank you for reading, folks. pls give feedback bec this is my first blog post that I'm sharing with people (and i know a lot of cool people in ML!). The next post is probably going to be about ATTENTION!
FOOTNOTES
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017, June 12). Attention Is All You Need. ArXiv. https://arxiv.org/abs/1706.03762↩
Language models date back decades, as far as 1987, and a 2003 paper, A Neural Probabilistic Language Model by Bengio et al., introduced a method of learning a "distributed representation" of words.↩