In this interactive session, we will explore how to build deep learning encoders and decoders from scratch. Our first example will use text encoders and a small corpus of text.
The text we will use is “The Verdict”, a short story used in Sebastian Raschka’s excellent book, “Build a Large Language Model from Scratch.”
Code
import urllib.request

url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
with urllib.request.urlopen(url) as response:
    raw_text = response.read().decode("utf-8")

print("Total number of characters:", len(raw_text))
print(raw_text[:99])
Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
The first step is to tokenize this text. We can use regular expressions to split the text into words.
Code
import re

text = "Hello, world. Welcome to text encoding."
result = re.split(r'(\s)', text)  # split on whitespace, keeping the whitespace tokens
print(result)
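We also want to separate punctuation from words while preserving capitalization:

```python
result = re.split(r'([,.]|\s)', text)
print(result)
```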
The decision to remove or include whitespace characters is application-dependent. For example, in code (especially Python code), whitespace is essential to the structure and meaning of the text, so it should be preserved during training.
Whitespace in Coding
Most tokenization schemes include whitespace, but for brevity, these next examples exclude it.
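For comparison, here is a minimal sketch of a whitespace-preserving split (the `code_snippet` string is made up for illustration). It drops only the empty strings that `re.split` produces between consecutive delimiters, which is what you would want when tokenizing source code:

```python
code_snippet = "def f(x):\n    return x + 1"
tokens = re.split(r'(\s)', code_snippet)
tokens = [t for t in tokens if t != ""]  # keep whitespace tokens, drop empty strings
print(tokens)
```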
Code
text ="Hello, world. Is this-- a good decoder?"result = re.split(r'([,.:;?_!"()\']|--|\s)', text)result = [item.strip() for item in result if item.strip()]print(result)
We can run this against our entire corpus of text.
Code
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:20])
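Converting tokens into IDs
Models assign each token a unique ID using a dictionary built from the text corpus. We first collect the set of unique tokens, which defines the vocabulary of the corpus and determines the number of entries in that dictionary:

```python
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)
```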
Creating the vocabulary then means assigning a unique ID to each token. We can do this using a dictionary comprehension.
Code
vocab = {token: idx for idx, token in enumerate(all_words, start=0)}

# Print a list of the first 10 tokens and their ids:
print(list(vocab.items())[:10])
To facilitate the conversion of tokens to ids and vice versa, we can create a tokenizer class with an encode and decode method. The encode method splits text into tokens and then maps these to ids using the vocabulary. The decode method does the reverse, converting ids back to tokens.
Code
class Tokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.token_to_id = vocab
        self.id_to_token = {idx: token for token, idx in vocab.items()}

    def encode(self, text):
        """Encode a string into a list of token IDs."""
        pattern = r'([,.?_!"()\']|--|\s)'
        tokens = (
            t.strip() for t in re.split(pattern, text) if t.strip()
        )
        return [self.token_to_id[t] for t in tokens]

    def decode(self, ids):
        """Decode a list of token IDs into a string."""
        text = " ".join(self.id_to_token[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)
Let’s test our new Tokenizer class
Code
text =""""It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""tokenizer = Tokenizer(vocab)ids = tokenizer.encode(text)print(f"Encoded:\n{ids}")print(f"Decoded:\n{tokenizer.decode(ids)}")
Encoded:
[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
Decoded:
" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
Notice that “It's” is reconstructed as “It' s”: the apostrophe is its own token, and the decode regex removes only the space before punctuation, not after it.

The vocabulary of our tokenizer is limited to the tokens that appear in the corpus. Any word that is not in the corpus cannot be encoded:
text ="Hello, do you like tea?"print(tokenizer.encode(text))
Causes a KeyError:
KeyError: 'Hello'
Handling Unknown Tokens and Special Tokens
In practical LLM applications, we need to handle words that don’t appear in our training vocabulary. This is where unknown tokens and other special tokens become essential.
Special Tokens in LLMs
Modern tokenizers use several special tokens:
<|unk|> or [UNK]: Unknown token for out-of-vocabulary words
<|endoftext|> or [EOS]: End of sequence/text marker (see the example after this list)
<|startoftext|> or [BOS]: Beginning of sequence marker (less common in GPT-style models)
<|pad|>: Padding token for batch processing
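For example, the <|endoftext|> marker is commonly placed between independent documents when they are concatenated into a single training stream, so the model can learn where one text ends and the next begins. A minimal sketch (the two document strings here are made up for illustration):

```python
doc1 = "The first document, e.g. a passage about remote sensing."
doc2 = "A second, unrelated document."

# Separate independent documents with the end-of-text marker
training_stream = " <|endoftext|> ".join([doc1, doc2])
print(training_stream)
```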
Let’s create an enhanced tokenizer that handles unknown tokens:
Code
class EnhancedTokenizer:
    def __init__(self, vocab):
        # Add special tokens to vocabulary
        self.special_tokens = {
            "<|unk|>": len(vocab),
            "<|endoftext|>": len(vocab) + 1
        }
        # Combine original vocab with special tokens
        self.full_vocab = {**vocab, **self.special_tokens}
        self.token_to_id = self.full_vocab
        self.id_to_token = {idx: token for token, idx in self.full_vocab.items()}
        # Store the special token IDs for convenience
        self.unk_token_id = self.special_tokens["<|unk|>"]
        self.endoftext_token_id = self.special_tokens["<|endoftext|>"]

    def encode(self, text, add_endoftext=True):
        """Encode text, handling unknown tokens."""
        pattern = r'([,.?_!"()\']|--|\s)'
        tokens = [t.strip() for t in re.split(pattern, text) if t.strip()]
        # Convert tokens to IDs, using <|unk|> for unknown tokens
        ids = []
        for token in tokens:
            if token in self.token_to_id:
                ids.append(self.token_to_id[token])
            else:
                ids.append(self.unk_token_id)  # Use unknown token
        # Optionally add end-of-text token
        if add_endoftext:
            ids.append(self.endoftext_token_id)
        return ids

    def decode(self, ids):
        """Decode token IDs back to text."""
        tokens = []
        for token_id in ids:
            if token_id == self.endoftext_token_id:
                break  # Stop at end-of-text token
            elif token_id in self.id_to_token:
                tokens.append(self.id_to_token[token_id])
            else:
                tokens.append("<|unk|>")  # Fallback for invalid IDs
        text = " ".join(tokens)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

    def vocab_size(self):
        """Return the total vocabulary size including special tokens."""
        return len(self.full_vocab)
Let’s test our enhanced tokenizer:
Code
enhanced_tokenizer = EnhancedTokenizer(vocab)

# Test with unknown words
text_with_unknown = "Hello, do you like tea? This is amazing!"
encoded = enhanced_tokenizer.encode(text_with_unknown)

print(f"Original text: {text_with_unknown}")
print(f"Encoded IDs: {encoded}")
print(f"Decoded text: {enhanced_tokenizer.decode(encoded)}")
print(f"Vocabulary size: {enhanced_tokenizer.vocab_size()}")
Original text: Hello, do you like tea? This is amazing!
Encoded IDs: [1130, 5, 355, 1126, 628, 975, 10, 97, 584, 1130, 0, 1131]
Decoded text: <|unk|>, do you like tea? This is <|unk|>!
Vocabulary size: 1132
Notice how unknown words like “Hello” and “amazing” are now handled gracefully using the <|unk|> token, and the sequence ends with an <|endoftext|> token.
Code
# Let's see what the special token IDs are
print(f"Unknown token '<|unk|>' has ID: {enhanced_tokenizer.unk_token_id}")
print(f"End-of-text token '<|endoftext|>' has ID: {enhanced_tokenizer.endoftext_token_id}")

# Test decoding with end-of-text in the middle
test_ids = [100, 200, enhanced_tokenizer.endoftext_token_id, 300, 400]
print(f"Decoding stops at <|endoftext|>: {enhanced_tokenizer.decode(test_ids)}")
Unknown token '<|unk|>' has ID: 1130
End-of-text token '<|endoftext|>' has ID: 1131
Decoding stops at <|endoftext|>: Thwing bean-stalk
Byte Pair Encoding (BPE) with Tiktoken
While our simple tokenizer is educational, production LLMs use more sophisticated tokenization schemes like Byte Pair Encoding (BPE). BPE creates subword tokens that balance vocabulary size with representation efficiency.
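To build some intuition for what BPE training does, here is a toy sketch of the classic merge loop (an illustration only, not tiktoken's actual implementation, and the tiny word-frequency table is invented for the example): starting from individual characters, it repeatedly counts adjacent symbol pairs across the corpus and merges the most frequent pair into a new vocabulary symbol.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols (characters) with a frequency count
word_freqs = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in word_freqs.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for step in range(3):
    pair, count = most_frequent_pair(word_freqs)
    word_freqs = merge_pair(pair, word_freqs)
    print(f"Merge {step + 1}: {pair} (count {count})")
print(list(word_freqs))
```

After a few merges, frequent fragments such as "es", "est", and "lo" become single symbols, which is exactly the kind of subword unit found in GPT-2's vocabulary.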
The tiktoken library provides access to the same tokenizers used by OpenAI’s GPT models.
Code
import tiktoken

# Load the GPT-2 BPE tokenizer (the same scheme, with minor variants, is used by GPT-3)
gpt2_tokenizer = tiktoken.get_encoding("gpt2")
print(f"GPT-2 vocabulary size: {gpt2_tokenizer.n_vocab:,}")
GPT-2 vocabulary size: 50,257
Let’s compare our simple tokenizer with GPT-2’s BPE tokenizer:
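The comparison below encodes the same sentence with both tokenizers and decodes each result; any word missing from our small vocabulary comes back as <|unk|>:

```python
test_text = "Hello, world! This demonstrates byte pair encoding tokenization."

# Our enhanced tokenizer
our_tokens = enhanced_tokenizer.encode(test_text, add_endoftext=False)
our_decoded = enhanced_tokenizer.decode(our_tokens)

# GPT-2 BPE tokenizer
gpt2_tokens = gpt2_tokenizer.encode(test_text)
gpt2_decoded = gpt2_tokenizer.decode(gpt2_tokens)

print("=== Comparison ===")
print(f"Original text: {test_text}")
print()
print(f"Our tokenizer:")
print(f" Tokens: {our_tokens}")
print(f" Count: {len(our_tokens)} tokens")
print(f" Decoded: {our_decoded}")
print()
print(f"GPT-2 BPE tokenizer:")
print(f" Tokens: {gpt2_tokens}")
print(f" Count: {len(gpt2_tokens)} tokens")
print(f" Decoded: {gpt2_decoded}")
```

Understanding BPE Token Breakdown
BPE tokenizers split text into subword units. Let's examine how GPT-2 breaks down different types of text:

```python
def analyze_tokenization(text, tokenizer_name="gpt2"):
    """Analyze how tiktoken breaks down text."""
    tokenizer = tiktoken.get_encoding(tokenizer_name)
    tokens = tokenizer.encode(text)
    print(f"Text: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")
    print("Token breakdown:")
    for i, token_id in enumerate(tokens):
        token_text = tokenizer.decode([token_id])
        print(f" {i:2d}: {token_id:5d} -> '{token_text}'")
    print()

# Test different types of text
analyze_tokenization("Hello world")
analyze_tokenization("Tokenization")
analyze_tokenization("supercalifragilisticexpialidocious")
analyze_tokenization("AI/ML researcher")
analyze_tokenization("The year 2024")
```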
Let’s see how BPE handles domain-specific geospatial terminology:
Code
geospatial_texts = [
    "NDVI vegetation index analysis",
    "Landsat-8 multispectral imagery",
    "Convolutional neural networks for land cover classification",
    "Sentinel-2 satellite data preprocessing",
    "Geospatial foundation models fine-tuning"
]

gpt2_enc = tiktoken.get_encoding("gpt2")

print("=== Geospatial Text Tokenization ===")
for text in geospatial_texts:
    tokens = gpt2_enc.encode(text)
    print(f"'{text}'")
    print(f" -> {len(tokens)} tokens: {tokens}")
    print()
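OpenAI uses different encodings for different model families, and tiktoken exposes several of them. We can compare how the same sentence is tokenized under each:

```python
# Available encodings
available_encodings = ["gpt2", "p50k_base", "cl100k_base"]
test_text = "Geospatial foundation models revolutionize remote sensing!"

for encoding_name in available_encodings:
    try:
        tokenizer = tiktoken.get_encoding(encoding_name)
        tokens = tokenizer.encode(test_text)
        print(f"{encoding_name:12s}: {len(tokens):2d} tokens, vocab size: {tokenizer.n_vocab:6,}")
        print(f" Tokens: {tokens}")
        print()
    except Exception as e:
        print(f"{encoding_name}: Error - {e}")
```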
The choice of tokenizer affects model efficiency. Let’s compare token counts for our text corpus:
Code
# Use a sample from our corpus
sample_text = raw_text[:500]  # First 500 characters

print("=== Token Efficiency Comparison ===")
print(f"Text length: {len(sample_text)} characters")
print()

# Our simple tokenizer
our_tokens = enhanced_tokenizer.encode(sample_text, add_endoftext=False)
print(f"Simple tokenizer: {len(our_tokens)} tokens")

# GPT-2 BPE
gpt2_tokens = gpt2_tokenizer.encode(sample_text)
print(f"GPT-2 BPE: {len(gpt2_tokens)} tokens")

# Calculate efficiency
chars_per_token_simple = len(sample_text) / len(our_tokens)
chars_per_token_bpe = len(sample_text) / len(gpt2_tokens)

print(f"\nCharacters per token:")
print(f" Simple: {chars_per_token_simple:.2f}")
print(f" BPE: {chars_per_token_bpe:.2f}")
print(f" BPE is {chars_per_token_bpe/chars_per_token_simple:.1f}x more efficient")
=== Token Efficiency Comparison ===
Text length: 500 characters
Simple tokenizer: 110 tokens
GPT-2 BPE: 123 tokens
Characters per token:
Simple: 4.55
BPE: 4.07
BPE is 0.9x more efficient
On this short, in-domain sample the simple word-level tokenizer happens to produce slightly fewer tokens (hence the ratio below 1x). BPE's real advantage is that it can encode any input, including words it has never seen, without falling back to an <|unk|> token.
Key Takeaways
Special Tokens: Essential for handling unknown words and sequence boundaries
Unknown Token Handling: <|unk|> tokens allow models to gracefully handle out-of-vocabulary words
End-of-Text Tokens: <|endoftext|> tokens help models understand sequence boundaries
BPE Efficiency: Byte Pair Encoding creates a good balance between vocabulary size and representation efficiency
Domain Adaptation: Different tokenizers may handle domain-specific text differently
In geospatial AI applications, choosing the right tokenization strategy affects both model performance and computational efficiency, especially when working with specialized terminology from remote sensing and Earth science domains.