Tokens

Tokens are the small chunks that text is split into, both the input you give a model and the output the model gives back.

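A token can be a whole word, part of a word, or even a single character. Punctuation marks and spaces count too, although many tokenizers attach a leading space to the word that follows instead of treating it as a separate token.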

As a rough rule of thumb, 1 token corresponds to about 4 characters of English text.
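To make the rule of thumb concrete, here is a minimal sketch that turns a character count into an approximate token count. The names `estimate_tokens` and `CHARS_PER_TOKEN` are made up for this illustration, and the result is only an estimate: real counts depend on the tokenizer and the language.

```python
import math

# Rule-of-thumb average for English text; an assumption, not an exact value.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


sample = "Tokens are small chunks of text."
print(len(sample), "characters, roughly", estimate_tokens(sample), "tokens")
```

An estimate like this is handy for quick budgeting, but use a real tokenizer whenever the exact count matters.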

But why do we need tokens?

LLMs are built on the Transformer architecture, which operates on numbers, not raw text. Tokenization acts like a dictionary, mapping text chunks to numerical IDs. These IDs are then converted into a list of vectors (embeddings), allowing the Transformer's attention mechanism to establish mathematical relationships between tokens based on their context. Tokenization is the necessary first step in bridging human language and machine math.
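The text-to-IDs-to-vectors pipeline can be sketched with a toy example. This is a deliberately simplified illustration, not how production tokenizers work: it splits on whitespace instead of using learned subword units, and the embedding values are random placeholders where a real model would use learned ones. All function names here are hypothetical.

```python
import random


def build_vocab(corpus: str) -> dict[str, int]:
    """Assign an integer ID to each distinct whitespace-separated chunk."""
    vocab: dict[str, int] = {}
    for chunk in corpus.split():
        if chunk not in vocab:
            vocab[chunk] = len(vocab)
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to token IDs (raises KeyError for chunks not in the vocabulary)."""
    return [vocab[chunk] for chunk in text.split()]


def make_embeddings(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Build a lookup table with one vector per token ID (random placeholder values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


corpus = "the cat sat on the mat"
vocab = build_vocab(corpus)          # text chunk -> ID
ids = encode("the cat sat", vocab)   # text -> list of IDs
table = make_embeddings(len(vocab), dim=4)
vectors = [table[i] for i in ids]    # IDs -> embedding vectors

print(ids)
print(vectors[0])
```

The list of vectors, one per token, is what the Transformer layers actually operate on.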

How to count tokens?

Several libraries and websites show how text is split into tokens; a few are listed below. Try them yourself to see tokenization in action.

Tiktoken Library and Guide - OpenAI

Tiktokenizer Website - David Duong
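As a starting point, here is a small sketch of counting tokens with the tiktoken library. It uses tiktoken's `get_encoding` and `encode` calls; the choice of the `cl100k_base` encoding is an assumption, so pick the encoding that matches your model. The helper `count_tokens` is a hypothetical name, and the fallback to the 4-characters-per-token estimate is only there so the snippet still runs when tiktoken is not installed or its encoding data cannot be loaded.

```python
import math

try:
    import tiktoken  # third-party package: pip install tiktoken
except ImportError:
    tiktoken = None


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> tuple[int, str]:
    """Return (token_count, method), where method is 'tiktoken' or 'estimate'."""
    if tiktoken is not None:
        try:
            encoding = tiktoken.get_encoding(encoding_name)
            return len(encoding.encode(text)), "tiktoken"
        except Exception:
            # Broad catch on purpose for this sketch: the encoding data may be
            # unavailable (for example, offline). Fall through to the estimate.
            pass
    # Rough fallback: about 4 characters per token for English text.
    return math.ceil(len(text) / 4), "estimate"


count, method = count_tokens("Tokenization bridges human language and machine math.")
print(f"{count} tokens (method: {method})")
```

Comparing the tiktoken count with the estimate on your own text is a quick way to see how far the rule of thumb can drift.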

Resources / References

What are tokens and how to count them? - OpenAI