In this interactive session, we will explore how to build deep learning encoders and decoders from scratch. Our first example will use text encoders and a small corpus of text.
The text we will use is “The Verdict”, a short story used in Sebastian Raschka’s excellent book, “Build a Large Language Model from Scratch.”
Code
import urllib.request

url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
with urllib.request.urlopen(url) as response:
    raw_text = response.read().decode("utf-8")

print("Total number of characters:", len(raw_text))
print(raw_text[:99])
Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
The first step is to tokenize this text. We can use regular expressions to split the text into words.
Code
import re

text = "Hello, world. Welcome to text encoding."
result = re.split(r'(\s)', text)  # split on whitespace, keeping the whitespace tokens
print(result)
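We also want to separate punctuation from words while preserving capitalization:

```python
result = re.split(r'([,.]|\s)', text)
print(result)
```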
The decision to remove or include whitespace characters is application-dependent. For example, in code (especially Python code), whitespace is essential to the structure and meaning of the text, so it should be preserved during training.
Whitespace in Coding
Most tokenization schemes include whitespace, but for brevity, these next examples exclude it.
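For comparison, here is a minimal sketch of a whitespace-preserving split (the `code_snippet` string is made up for illustration). It drops only the empty strings that `re.split` produces between consecutive delimiters, which is what you would want when tokenizing source code:

```python
code_snippet = "def f(x):\n    return x + 1"
tokens = re.split(r'(\s)', code_snippet)
tokens = [t for t in tokens if t != ""]  # keep whitespace tokens, drop empty strings
print(tokens)
```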
Code
text ="Hello, world. Is this-- a good decoder?"result = re.split(r'([,.:;?_!"()\']|--|\s)', text)result = [item.strip() for item in result if item.strip()]print(result)
We can run this against our entire corpus of text.
Code
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:20])
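Converting tokens into IDs
Models assign each token a unique ID using a dictionary built from the text corpus. We first collect the set of unique tokens, which defines the vocabulary of the corpus and determines the number of entries in that dictionary:

```python
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)
```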
Creating the vocabulary then means assigning a unique ID to each token. We can do this using a dictionary comprehension.
Code
vocab = {token: idx for idx, token in enumerate(all_words, start=0)}

# Print a list of the first 10 tokens and their ids:
print(list(vocab.items())[:10])
To facilitate the conversion of tokens to ids and vice versa, we can create a tokenizer class with an encode and decode method. The encode method splits text into tokens and then maps these to ids using the vocabulary. The decode method does the reverse, converting ids back to tokens.
Code
class Tokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.token_to_id = vocab
        self.id_to_token = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        """Encode a string into a list of token IDs."""
        pattern = r'([,.?_!"()\']|--|\s)'
        tokens = (
            t.strip() for t in re.split(pattern, text) if t.strip()
        )
        return [self.token_to_id[t] for t in tokens]

    def decode(self, ids):
        """Decode a list of token IDs into a string."""
        text = " ".join(self.id_to_token[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)
Let’s test our new Tokenizer class
Code
text =""""It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""tokenizer = Tokenizer(vocab)ids = tokenizer.encode(text)print(f"Encoded:\n{ids}")print(f"Decoded:\n{tokenizer.decode(ids)}")
Encoded:
[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
Decoded:
" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
Notice that “It's” is reconstructed as “It' s”: the apostrophe is its own token, and the decode regex removes only the space before punctuation, not after it.

The vocabulary of our tokenizer is limited to the tokens that appear in the corpus. Any word that is not in the corpus cannot be encoded:
text ="Hello, do you like tea?"print(tokenizer.encode(text))
Causes a KeyError:
KeyError: 'Hello'
Handling Unknown Tokens and Special Tokens
In practical LLM applications, we need to handle words that don’t appear in our training vocabulary. This is where unknown tokens and other special tokens become essential.
Special Tokens in LLMs
Modern tokenizers use several special tokens:
<|unk|> or [UNK]: Unknown token for out-of-vocabulary words
<|endoftext|> or [EOS]: End of sequence/text marker (see the example after this list)
<|startoftext|> or [BOS]: Beginning of sequence marker (less common in GPT-style models)
<|pad|>: Padding token for batch processing
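For example, the <|endoftext|> marker is commonly placed between independent documents when they are concatenated into a single training stream, so the model can learn where one text ends and the next begins. A minimal sketch (the two document strings here are made up for illustration):

```python
doc1 = "The first document, e.g. a passage about remote sensing."
doc2 = "A second, unrelated document."

# Separate independent documents with the end-of-text marker
training_stream = " <|endoftext|> ".join([doc1, doc2])
print(training_stream)
```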
Let’s create an enhanced tokenizer that handles unknown tokens:
Code
class EnhancedTokenizer:
    def __init__(self, vocab):
        # Add special tokens to vocabulary
        self.special_tokens = {
            "<|unk|>": len(vocab),
            "<|endoftext|>": len(vocab) + 1
        }
        # Combine original vocab with special tokens
        self.full_vocab = {**vocab, **self.special_tokens}
        self.token_to_id = self.full_vocab
        self.id_to_token = {idx: token for token, idx in self.full_vocab.items()}
        # Store the special token IDs for convenience
        self.unk_token_id = self.special_tokens["<|unk|>"]
        self.endoftext_token_id = self.special_tokens["<|endoftext|>"]

    def encode(self, text, add_endoftext=True):
        """Encode text, handling unknown tokens."""
        pattern = r'([,.?_!"()\']|--|\s)'
        tokens = [t.strip() for t in re.split(pattern, text) if t.strip()]
        # Convert tokens to IDs, using <|unk|> for unknown tokens
        ids = []
        for token in tokens:
            if token in self.token_to_id:
                ids.append(self.token_to_id[token])
            else:
                ids.append(self.unk_token_id)  # Use unknown token
        # Optionally add end-of-text token
        if add_endoftext:
            ids.append(self.endoftext_token_id)
        return ids

    def decode(self, ids):
        """Decode token IDs back to text."""
        tokens = []
        for token_id in ids:
            if token_id == self.endoftext_token_id:
                break  # Stop at end-of-text token
            elif token_id in self.id_to_token:
                tokens.append(self.id_to_token[token_id])
            else:
                tokens.append("<|unk|>")  # Fallback for invalid IDs
        text = " ".join(tokens)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

    def vocab_size(self):
        """Return the total vocabulary size including special tokens."""
        return len(self.full_vocab)
Let’s test our enhanced tokenizer:
Code
enhanced_tokenizer = EnhancedTokenizer(vocab)

# Test with unknown words
text_with_unknown = "Hello, do you like tea? This is amazing!"
encoded = enhanced_tokenizer.encode(text_with_unknown)

print(f"Original text: {text_with_unknown}")
print(f"Encoded IDs: {encoded}")
print(f"Decoded text: {enhanced_tokenizer.decode(encoded)}")
print(f"Vocabulary size: {enhanced_tokenizer.vocab_size()}")
Original text: Hello, do you like tea? This is amazing!
Encoded IDs: [1130, 5, 355, 1126, 628, 975, 10, 97, 584, 1130, 0, 1131]
Decoded text: <|unk|>, do you like tea? This is <|unk|>!
Vocabulary size: 1132
Notice how unknown words like “Hello” and “amazing” are now handled gracefully using the <|unk|> token, and the sequence ends with an <|endoftext|> token.
Code
# Let's see what the special token IDs are
print(f"Unknown token '<|unk|>' has ID: {enhanced_tokenizer.unk_token_id}")
print(f"End-of-text token '<|endoftext|>' has ID: {enhanced_tokenizer.endoftext_token_id}")

# Test decoding with end-of-text in the middle
test_ids = [100, 200, enhanced_tokenizer.endoftext_token_id, 300, 400]
print(f"Decoding stops at <|endoftext|>: {enhanced_tokenizer.decode(test_ids)}")
Unknown token '<|unk|>' has ID: 1130
End-of-text token '<|endoftext|>' has ID: 1131
Decoding stops at <|endoftext|>: Thwing bean-stalk
Byte Pair Encoding (BPE) with Tiktoken
While our simple tokenizer is educational, production LLMs use more sophisticated tokenization schemes like Byte Pair Encoding (BPE). BPE creates subword tokens that balance vocabulary size with representation efficiency.
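To build some intuition for what BPE training does, here is a toy sketch of the classic merge loop (an illustration only, not tiktoken's actual implementation, and the tiny word-frequency table is invented for the example): starting from individual characters, it repeatedly counts adjacent symbol pairs across the corpus and merges the most frequent pair into a new vocabulary symbol.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols (characters) with a frequency count
word_freqs = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in word_freqs.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for step in range(3):
    pair, count = most_frequent_pair(word_freqs)
    word_freqs = merge_pair(pair, word_freqs)
    print(f"Merge {step + 1}: {pair} (count {count})")
print(list(word_freqs))
```

After a few merges, frequent fragments such as "es", "est", and "lo" become single symbols, which is exactly the kind of subword unit found in GPT-2's vocabulary.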
The tiktoken library provides access to the same tokenizers used by OpenAI’s GPT models.
Code
import tiktoken

# Load the GPT-2 BPE tokenizer (the same scheme, with minor variants, is used by GPT-3)
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
print(f"GPT-2 vocabulary size: {gpt2_tokenizer.n_vocab:,}")
GPT-2 vocabulary size: 50,257
Let’s compare our simple tokenizer with GPT-2’s BPE tokenizer:
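The comparison below encodes the same sentence with both tokenizers and decodes each result; any word missing from our small vocabulary comes back as <|unk|>:

```python
test_text = "Hello, world! This demonstrates byte pair encoding tokenization."

# Our enhanced tokenizer
our_tokens = enhanced_tokenizer.encode(test_text, add_endoftext=False)
our_decoded = enhanced_tokenizer.decode(our_tokens)

# GPT-2 BPE tokenizer
gpt2_tokens = gpt2_tokenizer.encode(test_text)
gpt2_decoded = gpt2_tokenizer.decode(gpt2_tokens)

print("=== Comparison ===")
print(f"Original text: {test_text}")
print()
print(f"Our tokenizer:")
print(f" Tokens: {our_tokens}")
print(f" Count: {len(our_tokens)} tokens")
print(f" Decoded: {our_decoded}")
print()
print(f"GPT-2 BPE tokenizer:")
print(f" Tokens: {gpt2_tokens}")
print(f" Count: {len(gpt2_tokens)} tokens")
print(f" Decoded: {gpt2_decoded}")
```

Understanding BPE Token Breakdown
BPE tokenizers split text into subword units. Let's examine how GPT-2 breaks down different types of text:

```python
def analyze_tokenization(text, tokenizer_name="gpt2"):
    """Analyze how tiktoken breaks down text."""
    tokenizer = tiktoken.get_encoding(tokenizer_name)
    tokens = tokenizer.encode(text)
    print(f"Text: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")
    print("Token breakdown:")
    for i, token_id in enumerate(tokens):
        token_text = tokenizer.decode([token_id])
        print(f" {i:2d}: {token_id:5d} -> '{token_text}'")
    print()

# Test different types of text
analyze_tokenization("Hello world")
analyze_tokenization("Tokenization")
analyze_tokenization("supercalifragilisticexpialidocious")
analyze_tokenization("AI/ML researcher")
analyze_tokenization("The year 2024")
```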
Let’s see how BPE handles domain-specific geospatial terminology:
Code
geospatial_texts = [
    "NDVI vegetation index analysis",
    "Landsat-8 multispectral imagery",
    "Convolutional neural networks for land cover classification",
    "Sentinel-2 satellite data preprocessing",
    "Geospatial foundation models fine-tuning"
]

gpt2_enc = tiktoken.get_encoding("gpt2")

print("=== Geospatial Text Tokenization ===")
for text in geospatial_texts:
    tokens = gpt2_enc.encode(text)
    print(f"'{text}'")
    print(f" -> {len(tokens)} tokens: {tokens}")
    print()
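OpenAI uses different encodings for different model families, and tiktoken exposes several of them. We can compare how the same sentence is tokenized under each:

```python
# Available encodings
available_encodings = ["gpt2", "p50k_base", "cl100k_base"]
test_text = "Geospatial foundation models revolutionize remote sensing!"

for encoding_name in available_encodings:
    try:
        tokenizer = tiktoken.get_encoding(encoding_name)
        tokens = tokenizer.encode(test_text)
        print(f"{encoding_name:12s}: {len(tokens):2d} tokens, vocab size: {tokenizer.n_vocab:6,}")
        print(f" Tokens: {tokens}")
        print()
    except Exception as e:
        print(f"{encoding_name}: Error - {e}")
```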
The choice of tokenizer affects model efficiency. Let’s compare token counts for our text corpus:
Code
# Use a sample from our corpus
sample_text = raw_text[:500]  # First 500 characters

print("=== Token Efficiency Comparison ===")
print(f"Text length: {len(sample_text)} characters")
print()

# Our simple tokenizer
our_tokens = enhanced_tokenizer.encode(sample_text, add_endoftext=False)
print(f"Simple tokenizer: {len(our_tokens)} tokens")

# GPT-2 BPE
gpt2_tokens = gpt2_tokenizer.encode(sample_text)
print(f"GPT-2 BPE: {len(gpt2_tokens)} tokens")

# Calculate efficiency
chars_per_token_simple = len(sample_text) / len(our_tokens)
chars_per_token_bpe = len(sample_text) / len(gpt2_tokens)

print(f"\nCharacters per token:")
print(f" Simple: {chars_per_token_simple:.2f}")
print(f" BPE: {chars_per_token_bpe:.2f}")
print(f" BPE is {chars_per_token_bpe/chars_per_token_simple:.1f}x more efficient")
=== Token Efficiency Comparison ===
Text length: 500 characters
Simple tokenizer: 110 tokens
GPT-2 BPE: 123 tokens
Characters per token:
Simple: 4.55
BPE: 4.07
BPE is 0.9x more efficient
On this short, in-domain sample the simple word-level tokenizer happens to produce slightly fewer tokens (hence the ratio below 1x). BPE's real advantage is that it can encode any input, including words it has never seen, without falling back to an <|unk|> token.
Key Takeaways
Special Tokens: Essential for handling unknown words and sequence boundaries
Unknown Token Handling: <|unk|> tokens allow models to gracefully handle out-of-vocabulary words
End-of-Text Tokens: <|endoftext|> tokens help models understand sequence boundaries
BPE Efficiency: Byte Pair Encoding creates a good balance between vocabulary size and representation efficiency
Domain Adaptation: Different tokenizers may handle domain-specific text differently
In geospatial AI applications, choosing the right tokenization strategy affects both model performance and computational efficiency, especially when working with specialized terminology from remote sensing and Earth science domains.