LLM Sampling: Engineering Deep Dive

Vatsal Bajpai · 11 min read
The Problem You're Actually Solving

Every token an LLM generates is sampled from a probability distribution over a vocabulary of ~50,000 tokens. Your sampling strategy determines whether you get deterministic garbage or creative genius. Most engineers get this wrong.

Foundation: From Neural Network to Text

Logits: The Raw Signal

LLMs don't "think" in words. The final layer outputs a vector of raw scores (logits), one per token in the vocabulary. These are unbounded real numbers, straight from the last linear transformation.

# What actually happens in the model's final layer
logits = model.lm_head(hidden_states)  # [batch_size, seq_len, vocab_size]
# Raw values like: [8.432, -2.331, 4.102, ..., -0.823]

Key insight: Higher logit = model's stronger "conviction" about that token. But these aren't probabilities yet.

Softmax: The Probability Transform

Softmax converts logits to valid probability distribution:

import torch

def softmax(logits):
    exp_logits = torch.exp(logits - torch.max(logits))  # Subtract max for numerical stability
    return exp_logits / exp_logits.sum()

# Before: logits = [8.2, 7.1, 6.5, -2.3]
# After:  probs = [0.66, 0.22, 0.12, 0.00002]

Critical: Softmax amplifies differences. Small logit gaps become large probability gaps.
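
A two-token example makes this concrete: a logit gap of just 1.0 becomes an e^1 ≈ 2.7x probability ratio.

import torch

# A gap of 1.0 in logit space becomes a ~2.7x ratio in probability space
probs = torch.softmax(torch.tensor([3.0, 2.0]), dim=-1)
print(probs)                 # tensor([0.7311, 0.2689])
print(probs[0] / probs[1])   # ~2.718, i.e. e^1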

Temperature: The Chaos Knob

Temperature rescales logits BEFORE softmax. This is the master control for randomness.

Mathematical Core

def temperature_sampling(logits, temperature):
    if temperature == 0:
        return torch.argmax(logits)  # Deterministic

    scaled_logits = logits / temperature
    probs = torch.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

What Actually Happens

  • T → 0: Softmax becomes argmax. Highest logit wins every time.
  • T < 1: Amplifies differences. Rich get richer. Safe, boring.
  • T = 1: Original distribution. Model's raw confidence.
  • T > 1: Flattens distribution. Underdogs get a shot. Creative, risky.

Visualization

# Original logits
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])

# T=0.5 (focused)
# Probs: [0.83, 0.11, 0.04, 0.02, 0.00]

# T=1.0 (balanced)
# Probs: [0.56, 0.21, 0.12, 0.08, 0.03]

# T=2.0 (exploratory)
# Probs: [0.37, 0.23, 0.18, 0.14, 0.08]
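
These numbers are easy to verify yourself:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
for t in (0.5, 1.0, 2.0):
    # Rescale by temperature, then renormalize
    print(t, torch.softmax(logits / t, dim=-1))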

Engineering Reality

class TemperatureController:
    def __init__(self, base_temp=0.7):
        self.base_temp = base_temp
        self.decay_factor = 0.95  # For length penalty

    def get_temperature(self, step, context_type):
        # Dynamic temperature based on context
        if context_type == "code":
            return max(0.3, self.base_temp * (self.decay_factor ** step))
        elif context_type == "creative":
            return min(1.5, self.base_temp * (1.1 ** step))
        return self.base_temp
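
A quick usage sketch (the context_type strings are assumed labels from whatever routing logic sits upstream):

controller = TemperatureController(base_temp=0.7)

# Code generation: temperature decays toward the 0.3 floor as output grows
for step in range(5):
    print(controller.get_temperature(step, "code"))
# 0.7, 0.665, 0.632, 0.600, 0.570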

Top-k: The Bouncer

Hard cutoff. Only top k tokens allowed in the club.

Implementation

def top_k_filtering(logits, k, filter_value=-float('Inf')):
    """Mask every logit outside the top k. k=0 disables the filter."""
    if k == 0:
        return logits

    # Indices of the k largest logits (capped at vocab size)
    indices_to_keep = torch.topk(logits, min(k, logits.size(-1)))[1]
    filter_mask = torch.ones_like(logits, dtype=torch.bool)
    filter_mask.scatter_(0, indices_to_keep, False)  # Unmask the survivors
    logits[filter_mask] = filter_value
    return logits
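
For example, restricting a toy distribution to its top two candidates (clone first, since the function mutates its input):

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
filtered = top_k_filtering(logits.clone(), k=2)
print(filtered)                          # tensor([2., 1., -inf, -inf, -inf])
print(torch.softmax(filtered, dim=-1))   # All mass on the top two tokens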

Why It Matters

  • Prevents sampling from long tail of garbage tokens
  • Computationally efficient (no sorting entire vocabulary)
  • Predictable memory footprint

The Problem

Fixed k ignores context. Sometimes the model is certain (k=5 would be plenty), sometimes uncertain (it needs k=50).

Top-p (Nucleus): The Smart Bouncer

Dynamic threshold. Includes smallest set of tokens that sum to probability p.

Algorithm

def nucleus_sampling(logits, p, temperature=1.0):
    # Apply temperature first
    logits = logits / temperature

    # Sort and compute cumulative probabilities
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(
        torch.softmax(sorted_logits, dim=-1), dim=-1
    )

    # Find cutoff
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    # Remove tokens
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )
    logits[indices_to_remove] = -float('Inf')

    return torch.multinomial(torch.softmax(logits, dim=-1), 1)

Genius Design

When the model is confident (peaked distribution), the nucleus includes few tokens. When it's uncertain (flat distribution), it includes many. The cutoff adapts automatically.

Real-World Impact

# Confident prediction: "The capital of France is ___"
# Top-p=0.9 might only include ["Paris", "located"]

# Uncertain prediction: "The meaning of life is ___"
# Top-p=0.9 might include 50+ philosophical tokens
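
You can measure that adaptivity directly by counting how many tokens the nucleus needs to cover p (toy logits, chosen to contrast peaked vs. flat):

def nucleus_size(logits, p=0.9):
    # Number of tokens needed to cover probability mass p
    probs = torch.sort(torch.softmax(logits, dim=-1), descending=True)[0]
    return int((torch.cumsum(probs, dim=-1) < p).sum()) + 1

peaked = torch.tensor([10.0, 2.0, 1.0, 0.5, 0.1])   # Confident
flat = torch.tensor([1.0, 0.9, 0.8, 0.7, 0.6])      # Uncertain
print(nucleus_size(peaked))   # 1 -- tiny nucleus
print(nucleus_size(flat))     # 5 -- wide nucleus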

Combined Strategies: Production Pipeline

class ProductionSampler:
    def __init__(self, config):
        self.temp = config.get('temperature', 0.7)
        self.top_p = config.get('top_p', 0.9)
        self.top_k = config.get('top_k', 0)  # 0 = disabled
        self.repetition_penalty = config.get('rep_penalty', 1.1)

    def sample(self, logits, past_tokens=None):
        # 1. Repetition penalty (prevent loops). Divide positive logits,
        #    multiply negative ones -- dividing a negative logit would
        #    *increase* its probability instead of suppressing it.
        if past_tokens is not None:
            for token in set(past_tokens):
                if logits[token] > 0:
                    logits[token] /= self.repetition_penalty
                else:
                    logits[token] *= self.repetition_penalty

        # 2. Temperature scaling
        logits = logits / self.temp

        # 3. Top-k filtering (if enabled)
        if self.top_k > 0:
            logits = top_k_filtering(logits, self.top_k)

        # 4. Top-p filtering (the masking step of nucleus_sampling,
        #    returning filtered logits rather than a sampled token)
        if self.top_p < 1.0:
            logits = nucleus_filtering(logits, self.top_p)

        # 5. Sample
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1)
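
A usage sketch (this assumes you've factored the masking step of nucleus_sampling above into a nucleus_filtering helper that returns logits):

sampler = ProductionSampler({
    "temperature": 0.7,
    "top_p": 0.9,
    "rep_penalty": 1.1,
})

step_logits = torch.randn(50000)   # Stand-in for one decoding step over a 50k vocab
token = sampler.sample(step_logits, past_tokens=[42, 1337])
print(token.item())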

Engineering Configurations by Use Case

Code Generation

config = {
    "temperature": 0.2,      # Very focused
    "top_p": 0.95,          # Allow some creativity
    "frequency_penalty": 0.1 # Prevent repetitive patterns
}

Why: Code needs to be correct, not creative. Low temperature maintains syntactic coherence.

API Response Generation

config = {
    "temperature": 0.0,      # Deterministic
    "top_k": 1,             # Greedy
    "max_tokens": 256       # Controlled length
}

Why: Reproducibility matters. Same input → same output.

Creative Writing

config = {
    "temperature": 0.9,      # High creativity
    "top_p": 0.95,          # Wide vocabulary
    "presence_penalty": 0.6  # Force topic diversity
}

Why: Unexpected word choices create engaging narrative.

Technical Documentation

config = {
    "temperature": 0.7,      # Balanced
    "top_p": 0.9,           # Controlled diversity
    "top_k": 40             # Safety net
}

Why: Clear but not robotic. Natural language with technical precision.
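
If you're calling a hosted model instead of sampling locally, the same knobs map onto standard API parameters. A sketch using the openai Python client (the model name is a placeholder; swap in whatever you deploy):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",   # Placeholder model name
    messages=[{"role": "user", "content": "Document this function: ..."}],
    temperature=0.7,       # Balanced
    top_p=0.9,             # Controlled diversity
)
print(response.choices[0].message.content)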

Advanced: Adaptive Sampling

class AdaptiveSampler:
    def __init__(self):
        self.perplexity_threshold = 10.0

    def sample(self, logits, context):
        # Perplexity of the next-token distribution = exp(entropy).
        # log_softmax keeps this finite even when probs underflow to 0.
        probs = torch.softmax(logits, dim=-1)
        log_probs = torch.log_softmax(logits, dim=-1)
        perplexity = torch.exp(-torch.sum(probs * log_probs))

        # High perplexity = high uncertainty = need more exploration
        if perplexity > self.perplexity_threshold:
            temperature = 1.2
            top_p = 0.95
        else:
            temperature = 0.5
            top_p = 0.8

        return self._sample_with_params(logits, temperature, top_p)

    def _sample_with_params(self, logits, temperature, top_p):
        # Delegate to the nucleus sampler defined earlier
        return nucleus_sampling(logits, top_p, temperature)
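
A quick check that the switch fires as intended (made-up logits contrasting a confident peak with a near-uniform spread):

sampler = AdaptiveSampler()

confident = torch.tensor([10.0, 1.0, 0.5, 0.2, 0.1])   # Perplexity ~1 -> focused params
uncertain = torch.zeros(1000)                           # Perplexity ~1000 -> exploratory params
print(sampler.sample(confident, context=None))
print(sampler.sample(uncertain, context=None))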

Performance Optimization

GPU Memory Management

@torch.inference_mode()
def efficient_sampling(logits, top_p=0.9, top_k=0):
    # Work with top-k first to reduce memory
    if top_k > 0:
        top_k_values, top_k_indices = torch.topk(
            logits, min(top_k, logits.size(-1))
        )
        logits = torch.full_like(logits, -float('inf'))
        logits.scatter_(-1, top_k_indices, top_k_values)

    # Then run nucleus sampling on the reduced set
    return nucleus_sampling(logits, top_p)

Batch Processing

def batch_sample(logits_batch, configs):
    """Different sampling params per sequence in a batch."""
    # sample_with_config is whatever per-sequence wrapper you use,
    # e.g. ProductionSampler.sample with that sequence's config
    samples = []
    for logits, config in zip(logits_batch, configs):
        samples.append(sample_with_config(logits, config))
    return torch.stack(samples)

Common Pitfalls

  1. Temperature = 0 with Top-p: Pointless. Temperature 0 makes it deterministic, top-p does nothing.

  2. Top-k with small vocabularies: If vocab < k, you're doing nothing.

  3. Ignoring numerical stability:

# Bad
probs = torch.exp(logits) / torch.sum(torch.exp(logits))

# Good
probs = torch.softmax(logits, dim=-1)  # Handles overflow

  4. Static sampling throughout generation: Early tokens might need a different strategy than later ones.

Sampling Cheatsheet: Battle-Tested Configurations

Quick Reference Table

| Use Case | Temperature | Top-p | Top-k | Why These Settings |
| --- | --- | --- | --- | --- |
| Code Generation | 0.2 | 0.95 | - | Syntax must be perfect. Low temp = fewer syntax errors |
| SQL Queries | 0.0 | - | 1 | Zero tolerance for errors. Deterministic every time |
| JSON/API Responses | 0.0 | - | 1 | Structure critical. Same input → same output |
| Unit Tests | 0.4 | 0.9 | - | Some creativity for edge cases, but mostly logical |
| Bug Fixes | 0.3 | 0.9 | - | Focus on correctness with slight variation for solutions |
| Technical Docs | 0.6 | 0.9 | - | Clear but natural. Not robotic |
| Code Comments | 0.7 | 0.9 | - | Human-readable, varied vocabulary |
| API Documentation | 0.5 | 0.9 | - | Precise yet readable |
| README Files | 0.7 | 0.9 | - | Engaging but accurate |
| Blog Posts | 0.8 | 0.9 | - | Engaging content, varied sentence structure |
| Creative Writing | 1.0-1.2 | 0.95 | - | Maximum creativity, unexpected connections |
| Marketing Copy | 0.85 | 0.9 | - | Catchy but not insane |
| Social Media | 0.9 | 0.85 | 50 | Punchy, memorable, slightly wild |
| Customer Support | 0.3 | 0.9 | - | Consistent, helpful, no hallucinations |
| Chatbots (Casual) | 0.8 | 0.9 | - | Natural conversation flow |
| Virtual Assistant | 0.5 | 0.9 | - | Balanced: reliable but not mechanical |
| Tutoring/Education | 0.5 | 0.9 | - | Clear explanations with some variety |
| Data Extraction | 0.0 | - | 1 | Just the facts |
| Summarization | 0.3 | 0.9 | - | Accurate but not repetitive |
| Translation | 0.3 | 0.95 | - | Faithful to source, natural target language |
| Classification | 0.0 | - | 1 | Binary decision, no creativity needed |
| Search Queries | 0.0 | - | 5 | Limited, precise options |

Understanding the Columns

Temperature: Controls randomness

  • 0.0: Deterministic robot
  • 0.3: Focused professional
  • 0.7: Natural human
  • 1.0+: Creative chaos agent

Top-p: Vocabulary diversity (nucleus sampling)

  • 0.85: Tight vocabulary, common words
  • 0.9: Standard range
  • 0.95: Wide vocabulary, unusual words welcome

Top-k: Hard vocabulary limit

  • Not set (-): Let top-p handle it
  • 1: Only the #1 choice (greedy)
  • 5-50: Fixed pool of top candidates

The Golden Rules

  1. If accuracy matters more than creativity: Temperature ≤ 0.3
  2. If creativity matters more than accuracy: Temperature ≥ 0.8
  3. If you need reproducibility: Temperature = 0.0, Top-k = 1
  4. If you're getting repetitive output: Increase temperature or add frequency penalty
  5. If you're getting nonsense: Lower temperature, reduce top-p

Common Combinations Explained

The Determinist (T=0.0, k=1)

{ "temperature": 0.0, "top_k": 1 }

Use when: Production APIs, database queries, financial calculations
Expect: Same output every single time

The Professional (T=0.3, p=0.9)

{ "temperature": 0.3, "top_p": 0.9 }

Use when: Business reports, technical analysis, support responses
Expect: Accurate, focused, slight variation to avoid robotic feel

The Balanced (T=0.7, p=0.9)

{ "temperature": 0.7, "top_p": 0.9 }

Use when: General purpose, documentation, explanations
Expect: Natural language, good mix of predictability and variety

The Creative (T=1.0, p=0.95)

{ "temperature": 1.0, "top_p": 0.95 }

Use when: Stories, brainstorming, creative content
Expect: Surprising connections, unique phrasings, occasional weirdness

The Wildcard (T=1.2+, p=0.95)

{ "temperature": 1.2, "top_p": 0.95 }

Use when: Poetry, experimental writing, breaking patterns
Expect: Chaos, brilliance, and complete nonsense in equal measure

Troubleshooting Guide

| Problem | Solution | Config Change |
| --- | --- | --- |
| Too repetitive | Increase randomness | Temperature +0.2, add frequency_penalty: 0.5 |
| Nonsense/hallucinations | Reduce randomness | Temperature -0.3, top_p to 0.85 |
| Too boring/predictable | Add creativity | Temperature +0.2, top_p to 0.95 |
| Inconsistent formatting | Lock it down | Temperature to 0.1, top_k to 10 |
| Going off-topic | Tighten focus | Lower top_p to 0.8, add presence_penalty |
| Too verbose | Constrain output | Add max_tokens, increase frequency_penalty |
| Too terse | Encourage elaboration | Temperature +0.1, remove penalties |
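
frequency_penalty and presence_penalty appear throughout this table but never in code. A minimal sketch of the usual logit-space formulation (mirroring the formula OpenAI documents: frequency scales with repeat count, presence is a flat one-time hit):

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    # Count occurrences of each token generated so far
    counts = {}
    for t in generated_tokens:
        counts[t] = counts.get(t, 0) + 1

    for token, count in counts.items():
        logits[token] -= count * frequency_penalty   # Scales with repetition
        logits[token] -= presence_penalty            # Flat penalty for having appeared at all
    return logits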

Advanced: Context-Dependent Sampling

Different parts of generation need different strategies:

# Example: Writing a function with documentation
config_sequence = [
    {"context": "docstring", "temp": 0.7, "top_p": 0.9},    # Natural description
    {"context": "signature", "temp": 0.2, "top_p": 0.95},   # Precise syntax
    {"context": "body", "temp": 0.3, "top_p": 0.9},        # Correct logic
    {"context": "comments", "temp": 0.6, "top_p": 0.9}      # Readable notes
]
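
Acting on that sequence at generation time might look like this (detect_context is a hypothetical classifier over the partial output):

def pick_params(partial_output, config_sequence):
    context = detect_context(partial_output)   # Hypothetical: returns e.g. "body"
    for cfg in config_sequence:
        if cfg["context"] == context:
            return cfg["temp"], cfg["top_p"]
    return 0.7, 0.9   # Fall back to balanced defaults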

The 80/20 Rule

For 80% of use cases, you only need these three configs:

  1. Factual/Code: {"temperature": 0.3, "top_p": 0.9}
  2. General Purpose: {"temperature": 0.7, "top_p": 0.9}
  3. Creative: {"temperature": 1.0, "top_p": 0.95}
