Transformers

Background

As we discussed in the previous article, LLMs require pre-training on enormous amounts of data, which in turn requires significant computational resources. This staggering amount of computation is only possible thanks to GPUs, special computer chips that are optimised for running many operations in parallel.

gpu.jpg

But not all language models can be easily parallelised. Before 2017, most language models, specifically Recurrent Neural Networks (RNNs), would process text one word at a time. This created a bottleneck: the model had to wait for the previous word to be processed before moving on to the next, making it impossible to fully utilise the parallel power of GPUs. Then, in 2017, a team of researchers at Google introduced a new model called the transformer.

google.jpg

Transformers don't read text from start to finish. Instead, they soak it all in at once, in parallel.

Transformer

The very first step inside a transformer, and most other language models for that matter, is to associate each word with a long list of numbers. This is because the training process only works with continuous values, so language has to be encoded as numbers somehow.

token.jpg

Each one of these long lists of numbers must somehow encode the meaning of the corresponding word.
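To make the "long list of numbers" idea concrete, here is a minimal sketch of a lookup table that maps each word to its own list of numbers. The vocabulary, the list length of 4, and the random values are all assumptions made up for illustration. In a real model the lists are far longer and their values are learned during training.

```python
import random

# Hypothetical toy vocabulary; real models have a far larger one.
vocab = ["the", "fox", "jumped", "into", "river", "bank"]
token_to_id = {token: i for i, token in enumerate(vocab)}

EMBEDDING_DIM = 4  # illustrative only; real lists are much longer

# One list of numbers per vocabulary entry. Random here as a placeholder;
# in a real model these values are learned during training.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in vocab
]

def embed(words):
    """Look up the list of numbers associated with each word."""
    return [embedding_table[token_to_id[w]] for w in words]

vectors = embed(["the", "river", "bank"])
print(len(vectors), len(vectors[0]))  # 3 4
```

Every word in the input gets its own list, so a three-word input becomes three lists of four numbers each.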

What makes transformers unique is their reliance on a specialised operation called attention. This operation gives all of these lists of numbers a chance to communicate with one another and refine the meanings they encode based on the context around them, all done in parallel.

bank.jpg

For example, in the image above, the numbers encoding the word "bank" might be changed based on the context surrounding it, like "river" and "jumped into", to encode the more specific notion of a riverbank.
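The article does not spell out the arithmetic behind attention, so the following is only a rough sketch of one common formulation, scaled dot-product attention, with the learned projections that real transformers use left out. The two-number vectors for "river" and "bank" are invented for the example.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified self-attention (assumption: no learned projections)."""
    dim = len(vectors[0])
    updated = []
    for query in vectors:
        # Score how relevant every vector (including itself) is to this one.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The refined vector is a weighted blend of all vectors in context.
        blended = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        updated.append(blended)
    return updated

# Hypothetical 2-number vectors standing in for "river" and "bank".
river, bank = [1.0, 0.0], [0.6, 0.8]
refined = self_attention([river, bank])
print(refined[1])  # the "bank" vector, now pulled towards "river"
```

Each pass of the outer loop depends only on the original vectors, never on another word's result, which is why the updates can all be computed in parallel.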

Transformers typically also include a second type of operation known as a feed-forward neural network, also called a multi-layer perceptron (MLP). This operation gives the model extra capacity to store patterns about language that it learned during training.
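As a rough illustration, a feed-forward network can be sketched as two weighted sums with a simple non-linearity in between. The weights below are hand-picked so the arithmetic is easy to follow, and ReLU is just one common choice of non-linearity; both are assumptions rather than details from any particular model.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def linear(vector, weights, bias):
    """Weighted sums of the input; weights has one row per output value."""
    return [dot(row, vector) + b for row, b in zip(weights, bias)]

def relu(x):
    """Keep positive values, replace negative ones with zero."""
    return max(0.0, x)

def feed_forward(vector, w1, b1, w2, b2):
    """Two linear steps with a non-linearity in between."""
    hidden = [relu(h) for h in linear(vector, w1, b1)]
    return linear(hidden, w2, b2)

# Hand-picked illustrative weights; a real model learns these in training.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 inputs -> 3 hidden values
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]    # 3 hidden values -> 2 outputs
b2 = [0.5, 0.5]

output = feed_forward([1.0, -2.0], w1, b1, w2, b2)
print(output)  # [1.5, 0.5]
```

Unlike attention, this operation looks at one list of numbers at a time. The learned weights are where the stored patterns live.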

All of this data then flows through many repeated layers of these two fundamental operations. As it does so, the hope is that each list of numbers is enriched to encode whatever information might be needed to make an accurate prediction of the next word in the passage.
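The repetition itself is easy to sketch: the same pair of operations is applied over and over, each time to the output of the previous round. In the sketch below, `toy_attend` and `toy_feed_forward` are deliberately crude, hypothetical stand-ins rather than real attention or a real feed-forward network, and details that real transformers add between steps are left out.

```python
def toy_attend(vectors):
    """Stand-in for attention: move every vector halfway to the average."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[(x + m) / 2 for x, m in zip(v, mean)] for v in vectors]

def toy_feed_forward(vector):
    """Stand-in for the feed-forward step: applied to one vector alone."""
    return [max(0.0, x) for x in vector]

def run_layers(vectors, layers):
    """Pass the vectors through each (attention, feed-forward) pair in turn."""
    for attend, feed_forward in layers:
        vectors = attend(vectors)                     # vectors exchange context
        vectors = [feed_forward(v) for v in vectors]  # each refined on its own
    return vectors

# Two identical toy layers; a real model stacks many, each with its own
# learned parameters.
layers = [(toy_attend, toy_feed_forward)] * 2
result = run_layers([[1.0, -1.0], [3.0, 1.0]], layers)
print(result)  # [[1.75, 0.125], [2.25, 0.375]]
```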

overover.jpg

At the end, one final function is performed on the last vector in this sequence, which now has been updated by all of the context from the input text as well as everything the model learned during training, to produce a prediction of the next word.
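The article leaves this final function unspecified. A common choice, assumed here, is to score every word in the vocabulary against the last vector and then turn those scores into probabilities with softmax. The three-word vocabulary, the weights, and the last vector below are all made up for illustration.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vocabulary and output weights (one row per word).
vocab = ["river", "bank", "jumped"]
output_weights = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.5]]

def predict_next(last_vector):
    """Return a probability for every word in the vocabulary."""
    scores = [dot(row, last_vector) for row in output_weights]
    return dict(zip(vocab, softmax(scores)))

probabilities = predict_next([0.5, 1.0])
best_word = max(probabilities, key=probabilities.get)
print(best_word)  # bank
```

The result is a probability for every word in the vocabulary rather than a single answer, and the most likely word is just one way to pick from it.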

end.jpg

While researchers do design the framework for how each of these steps works, it's important to understand that the specific behaviour is an emergent phenomenon based on how the model's hundreds of billions of parameters are tuned during training. This makes it incredibly challenging to understand why the model makes the exact predictions that it does.

References / Resources