LLM Sampling: Engineering Deep Dive

Vatsal Bajpai · 11 min read
The Problem You're Actually Solving

Every token an LLM generates is sampled from a probability distribution over a vocabulary of ~50,000 tokens. Your sampling strategy determines whether you get deterministic garbage or creative genius. Most engineers get this wrong.

Foundation: From Neural Network to Text

Logits: The Raw Signal

LLMs don't "think" in words. The final layer outputs a vector of raw scores (logits), one per token in the vocabulary. These are unbounded real numbers, straight from the last linear transformation.

# What actually happens in the model's final layer
logits = model.lm_head(hidden_states)  # [batch_size, seq_len, vocab_size]
# Raw values like: [8.432, -2.331, 4.102, ..., -0.823]

Key insight: Higher logit = model's stronger "conviction" about that token. But these aren't probabilities yet.

Softmax: The Probability Transform

Softmax converts logits to valid probability distribution:

import torch

def softmax(logits):
    exp_logits = torch.exp(logits - torch.max(logits))  # Subtract max for numerical stability
    return exp_logits / exp_logits.sum()

# Before: logits = [8.2, 7.1, 6.5, -2.3]
# After:  probs = [0.66, 0.22, 0.12, 0.00002]

Critical: Softmax amplifies differences. Small logit gaps become large probability gaps.
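
A two-token example makes this concrete: a logit gap of just 1.0 becomes an e^1 ≈ 2.7x probability ratio.

import torch

# A gap of 1.0 in logit space becomes a ~2.7x ratio in probability space
probs = torch.softmax(torch.tensor([3.0, 2.0]), dim=-1)
print(probs)                 # tensor([0.7311, 0.2689])
print(probs[0] / probs[1])   # ~2.718, i.e. e^1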

Temperature: The Chaos Knob

Temperature rescales logits BEFORE softmax. This is the master control for randomness.

Mathematical Core

def temperature_sampling(logits, temperature):
    if temperature == 0:
        return torch.argmax(logits)  # Deterministic

    scaled_logits = logits / temperature
    probs = torch.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

What Actually Happens

  • T → 0: Softmax becomes argmax. Highest logit wins every time.
  • T < 1: Amplifies differences. Rich get richer. Safe, boring.
  • T = 1: Original distribution. Model's raw confidence.
  • T > 1: Flattens distribution. Underdogs get a shot. Creative, risky.

Visualization

# Original logits
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])

# T=0.5 (focused)
# Probs: [0.83, 0.11, 0.04, 0.02, 0.00]

# T=1.0 (balanced)
# Probs: [0.56, 0.21, 0.12, 0.08, 0.03]

# T=2.0 (exploratory)
# Probs: [0.37, 0.23, 0.18, 0.14, 0.08]
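
These numbers are easy to verify yourself:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
for t in (0.5, 1.0, 2.0):
    # Rescale by temperature, then renormalize
    print(t, torch.softmax(logits / t, dim=-1))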

Engineering Reality

class TemperatureController:
    def __init__(self, base_temp=0.7):
        self.base_temp = base_temp
        self.decay_factor = 0.95  # For length penalty

    def get_temperature(self, step, context_type):
        # Dynamic temperature based on context
        if context_type == "code":
            return max(0.3, self.base_temp * (self.decay_factor ** step))
        elif context_type == "creative":
            return min(1.5, self.base_temp * (1.1 ** step))
        return self.base_temp
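
A quick usage sketch (the context_type strings are assumed labels from whatever routing logic sits upstream):

controller = TemperatureController(base_temp=0.7)

# Code generation: temperature decays toward the 0.3 floor as output grows
for step in range(5):
    print(controller.get_temperature(step, "code"))
# 0.7, 0.665, 0.632, 0.600, 0.570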

Top-k: The Bouncer

Hard cutoff. Only top k tokens allowed in the club.

Implementation

def top_k_filtering(logits, k, filter_value=-float('Inf')):
    """Mask every logit outside the top k. k=0 disables the filter."""
    if k == 0:
        return logits

    # Indices of the k largest logits (capped at vocab size)
    indices_to_keep = torch.topk(logits, min(k, logits.size(-1)))[1]
    filter_mask = torch.ones_like(logits, dtype=torch.bool)
    filter_mask.scatter_(0, indices_to_keep, False)  # Unmask the survivors
    logits[filter_mask] = filter_value
    return logits
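
For example, restricting a toy distribution to its top two candidates (clone first, since the function mutates its input):

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
filtered = top_k_filtering(logits.clone(), k=2)
print(filtered)                          # tensor([2., 1., -inf, -inf, -inf])
print(torch.softmax(filtered, dim=-1))   # All mass on the top two tokens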

Why It Matters

  • Prevents sampling from long tail of garbage tokens
  • Computationally efficient (no sorting entire vocabulary)
  • Predictable memory footprint

The Problem

Fixed k ignores context. Sometimes the model is certain (k=5 would be plenty), sometimes uncertain (it needs k=50).

Top-p (Nucleus): The Smart Bouncer

Dynamic threshold. Includes smallest set of tokens that sum to probability p.

Algorithm

def nucleus_sampling(logits, p, temperature=1.0):
    # Apply temperature first
    logits = logits / temperature

    # Sort and compute cumulative probabilities
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(
        torch.softmax(sorted_logits, dim=-1), dim=-1
    )

    # Find cutoff
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    # Remove tokens
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )
    logits[indices_to_remove] = -float('Inf')

    return torch.multinomial(torch.softmax(logits, dim=-1), 1)

Genius Design

When the model is confident (peaked distribution), the nucleus includes few tokens. When it's uncertain (flat distribution), it includes many. The cutoff adapts automatically.

Real-World Impact

# Confident prediction: "The capital of France is ___"
# Top-p=0.9 might only include ["Paris", "located"]

# Uncertain prediction: "The meaning of life is ___"
# Top-p=0.9 might include 50+ philosophical tokens
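
You can measure that adaptivity directly by counting how many tokens the nucleus needs to cover p (toy logits, chosen to contrast peaked vs. flat):

def nucleus_size(logits, p=0.9):
    # Number of tokens needed to cover probability mass p
    probs = torch.sort(torch.softmax(logits, dim=-1), descending=True)[0]
    return int((torch.cumsum(probs, dim=-1) < p).sum()) + 1

peaked = torch.tensor([10.0, 2.0, 1.0, 0.5, 0.1])   # Confident
flat = torch.tensor([1.0, 0.9, 0.8, 0.7, 0.6])      # Uncertain
print(nucleus_size(peaked))   # 1 -- tiny nucleus
print(nucleus_size(flat))     # 5 -- wide nucleus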

Combined Strategies: Production Pipeline

class ProductionSampler:
    def __init__(self, config):
        self.temp = config.get('temperature', 0.7)
        self.top_p = config.get('top_p', 0.9)
        self.top_k = config.get('top_k', 0)  # 0 = disabled
        self.repetition_penalty = config.get('rep_penalty', 1.1)

    def sample(self, logits, past_tokens=None):
        # 1. Repetition penalty (prevent loops). Divide positive logits,
        #    multiply negative ones -- dividing a negative logit would
        #    *increase* its probability instead of suppressing it.
        if past_tokens is not None:
            for token in set(past_tokens):
                if logits[token] > 0:
                    logits[token] /= self.repetition_penalty
                else:
                    logits[token] *= self.repetition_penalty

        # 2. Temperature scaling
        logits = logits / self.temp

        # 3. Top-k filtering (if enabled)
        if self.top_k > 0:
            logits = top_k_filtering(logits, self.top_k)

        # 4. Top-p filtering (the masking step of nucleus_sampling,
        #    returning filtered logits rather than a sampled token)
        if self.top_p < 1.0:
            logits = nucleus_filtering(logits, self.top_p)

        # 5. Sample
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1)
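
A usage sketch (this assumes you've factored the masking step of nucleus_sampling above into a nucleus_filtering helper that returns logits):

sampler = ProductionSampler({
    "temperature": 0.7,
    "top_p": 0.9,
    "rep_penalty": 1.1,
})

step_logits = torch.randn(50000)   # Stand-in for one decoding step over a 50k vocab
token = sampler.sample(step_logits, past_tokens=[42, 1337])
print(token.item())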

Engineering Configurations by Use Case

Code Generation

config = {
    "temperature": 0.2,      # Very focused
    "top_p": 0.95,          # Allow some creativity
    "frequency_penalty": 0.1 # Prevent repetitive patterns
}

Why: Code needs to be correct, not creative. Low temperature maintains syntactic coherence.

API Response Generation

config = {
    "temperature": 0.0,      # Deterministic
    "top_k": 1,             # Greedy
    "max_tokens": 256       # Controlled length
}

Why: Reproducibility matters. Same input → same output.

Creative Writing

config = {
    "temperature": 0.9,      # High creativity
    "top_p": 0.95,          # Wide vocabulary
    "presence_penalty": 0.6  # Force topic diversity
}

Why: Unexpected word choices create engaging narrative.

Technical Documentation

config = {
    "temperature": 0.7,      # Balanced
    "top_p": 0.9,           # Controlled diversity
    "top_k": 40             # Safety net
}

Why: Clear but not robotic. Natural language with technical precision.
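
If you're calling a hosted model instead of sampling locally, the same knobs map onto standard API parameters. A sketch using the openai Python client (the model name is a placeholder; swap in whatever you deploy):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",   # Placeholder model name
    messages=[{"role": "user", "content": "Document this function: ..."}],
    temperature=0.7,       # Balanced
    top_p=0.9,             # Controlled diversity
)
print(response.choices[0].message.content)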

Advanced: Adaptive Sampling

class AdaptiveSampler:
    def __init__(self):
        self.perplexity_threshold = 10.0

    def sample(self, logits, context):
        # Perplexity of the next-token distribution = exp(entropy).
        # log_softmax keeps this finite even when probs underflow to 0.
        probs = torch.softmax(logits, dim=-1)
        log_probs = torch.log_softmax(logits, dim=-1)
        perplexity = torch.exp(-torch.sum(probs * log_probs))

        # High perplexity = high uncertainty = need more exploration
        if perplexity > self.perplexity_threshold:
            temperature = 1.2
            top_p = 0.95
        else:
            temperature = 0.5
            top_p = 0.8

        return self._sample_with_params(logits, temperature, top_p)

    def _sample_with_params(self, logits, temperature, top_p):
        # Delegate to the nucleus sampler defined earlier
        return nucleus_sampling(logits, top_p, temperature)
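
A quick check that the switch fires as intended (made-up logits contrasting a confident peak with a near-uniform spread):

sampler = AdaptiveSampler()

confident = torch.tensor([10.0, 1.0, 0.5, 0.2, 0.1])   # Perplexity ~1 -> focused params
uncertain = torch.zeros(1000)                           # Perplexity ~1000 -> exploratory params
print(sampler.sample(confident, context=None))
print(sampler.sample(uncertain, context=None))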

Performance Optimization

GPU Memory Management

@torch.inference_mode()
def efficient_sampling(logits, top_p=0.9, top_k=0):
    # Work with top-k first to reduce memory
    if top_k > 0:
        top_k_values, top_k_indices = torch.topk(
            logits, min(top_k, logits.size(-1))
        )
        logits = torch.full_like(logits, -float('inf'))
        logits.scatter_(-1, top_k_indices, top_k_values)

    # Then run nucleus sampling on the reduced set
    return nucleus_sampling(logits, top_p)

Batch Processing

def batch_sample(logits_batch, configs):
    """Different sampling params per sequence in a batch."""
    # sample_with_config is whatever per-sequence wrapper you use,
    # e.g. ProductionSampler.sample with that sequence's config
    samples = []
    for logits, config in zip(logits_batch, configs):
        samples.append(sample_with_config(logits, config))
    return torch.stack(samples)

Common Pitfalls

  1. Temperature = 0 with Top-p: Pointless. Temperature 0 makes it deterministic, top-p does nothing.

  2. Top-k with small vocabularies: If vocab < k, you're doing nothing.

  3. Ignoring numerical stability:

# Bad
probs = torch.exp(logits) / torch.sum(torch.exp(logits))

# Good
probs = torch.softmax(logits, dim=-1)  # Handles overflow

  4. Static sampling throughout generation: Early tokens might need a different strategy than later ones.

Sampling Cheatsheet: Battle-Tested Configurations

Quick Reference Table

| Use Case | Temperature | Top-p | Top-k | Why These Settings |
| --- | --- | --- | --- | --- |
| Code Generation | 0.2 | 0.95 | - | Syntax must be perfect. Low temp = fewer syntax errors |
| SQL Queries | 0.0 | - | 1 | Zero tolerance for errors. Deterministic every time |
| JSON/API Responses | 0.0 | - | 1 | Structure critical. Same input → same output |
| Unit Tests | 0.4 | 0.9 | - | Some creativity for edge cases, but mostly logical |
| Bug Fixes | 0.3 | 0.9 | - | Focus on correctness with slight variation for solutions |
| Technical Docs | 0.6 | 0.9 | - | Clear but natural. Not robotic |
| Code Comments | 0.7 | 0.9 | - | Human-readable, varied vocabulary |
| API Documentation | 0.5 | 0.9 | - | Precise yet readable |
| README Files | 0.7 | 0.9 | - | Engaging but accurate |
| Blog Posts | 0.8 | 0.9 | - | Engaging content, varied sentence structure |
| Creative Writing | 1.0-1.2 | 0.95 | - | Maximum creativity, unexpected connections |
| Marketing Copy | 0.85 | 0.9 | - | Catchy but not insane |
| Social Media | 0.9 | 0.85 | 50 | Punchy, memorable, slightly wild |
| Customer Support | 0.3 | 0.9 | - | Consistent, helpful, no hallucinations |
| Chatbots (Casual) | 0.8 | 0.9 | - | Natural conversation flow |
| Virtual Assistant | 0.5 | 0.9 | - | Balanced: reliable but not mechanical |
| Tutoring/Education | 0.5 | 0.9 | - | Clear explanations with some variety |
| Data Extraction | 0.0 | - | 1 | Just the facts |
| Summarization | 0.3 | 0.9 | - | Accurate but not repetitive |
| Translation | 0.3 | 0.95 | - | Faithful to source, natural target language |
| Classification | 0.0 | - | 1 | Binary decision, no creativity needed |
| Search Queries | 0.0 | - | 5 | Limited, precise options |

Understanding the Columns

Temperature: Controls randomness

  • 0.0: Deterministic robot
  • 0.3: Focused professional
  • 0.7: Natural human
  • 1.0+: Creative chaos agent

Top-p: Vocabulary diversity (nucleus sampling)

  • 0.85: Tight vocabulary, common words
  • 0.9: Standard range
  • 0.95: Wide vocabulary, unusual words welcome

Top-k: Hard vocabulary limit

  • Not set (-): Let top-p handle it
  • 1: Only the #1 choice (greedy)
  • 5-50: Fixed pool of top candidates

The Golden Rules

  1. If accuracy matters more than creativity: Temperature ≤ 0.3
  2. If creativity matters more than accuracy: Temperature ≥ 0.8
  3. If you need reproducibility: Temperature = 0.0, Top-k = 1
  4. If you're getting repetitive output: Increase temperature or add frequency penalty
  5. If you're getting nonsense: Lower temperature, reduce top-p

Common Combinations Explained

The Determinist (T=0.0, k=1)

{ "temperature": 0.0, "top_k": 1 }

Use when: Production APIs, database queries, financial calculations
Expect: Same output every single time

The Professional (T=0.3, p=0.9)

{ "temperature": 0.3, "top_p": 0.9 }

Use when: Business reports, technical analysis, support responses
Expect: Accurate, focused, slight variation to avoid robotic feel

The Balanced (T=0.7, p=0.9)

{ "temperature": 0.7, "top_p": 0.9 }

Use when: General purpose, documentation, explanations
Expect: Natural language, good mix of predictability and variety

The Creative (T=1.0, p=0.95)

{ "temperature": 1.0, "top_p": 0.95 }

Use when: Stories, brainstorming, creative content
Expect: Surprising connections, unique phrasings, occasional weirdness

The Wildcard (T=1.2+, p=0.95)

{ "temperature": 1.2, "top_p": 0.95 }

Use when: Poetry, experimental writing, breaking patterns
Expect: Chaos, brilliance, and complete nonsense in equal measure

Troubleshooting Guide

| Problem | Solution | Config Change |
| --- | --- | --- |
| Too repetitive | Increase randomness | Temperature +0.2, add frequency_penalty: 0.5 |
| Nonsense/hallucinations | Reduce randomness | Temperature -0.3, top_p to 0.85 |
| Too boring/predictable | Add creativity | Temperature +0.2, top_p to 0.95 |
| Inconsistent formatting | Lock it down | Temperature to 0.1, top_k to 10 |
| Going off-topic | Tighten focus | Lower top_p to 0.8, add presence_penalty |
| Too verbose | Constrain output | Add max_tokens, increase frequency_penalty |
| Too terse | Encourage elaboration | Temperature +0.1, remove penalties |
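
frequency_penalty and presence_penalty appear throughout this table but never in code. A minimal sketch of the usual logit-space formulation (mirroring the formula OpenAI documents: frequency scales with repeat count, presence is a flat one-time hit):

def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    # Count occurrences of each token generated so far
    counts = {}
    for t in generated_tokens:
        counts[t] = counts.get(t, 0) + 1

    for token, count in counts.items():
        logits[token] -= count * frequency_penalty   # Scales with repetition
        logits[token] -= presence_penalty            # Flat penalty for having appeared at all
    return logits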

Advanced: Context-Dependent Sampling

Different parts of generation need different strategies:

# Example: Writing a function with documentation
config_sequence = [
    {"context": "docstring", "temp": 0.7, "top_p": 0.9},    # Natural description
    {"context": "signature", "temp": 0.2, "top_p": 0.95},   # Precise syntax
    {"context": "body", "temp": 0.3, "top_p": 0.9},        # Correct logic
    {"context": "comments", "temp": 0.6, "top_p": 0.9}      # Readable notes
]
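
Acting on that sequence at generation time might look like this (detect_context is a hypothetical classifier over the partial output):

def pick_params(partial_output, config_sequence):
    context = detect_context(partial_output)   # Hypothetical: returns e.g. "body"
    for cfg in config_sequence:
        if cfg["context"] == context:
            return cfg["temp"], cfg["top_p"]
    return 0.7, 0.9   # Fall back to balanced defaults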

The 80/20 Rule

For 80% of use cases, you only need these three configs:

  1. Factual/Code: {"temperature": 0.3, "top_p": 0.9}
  2. General Purpose: {"temperature": 0.7, "top_p": 0.9}
  3. Creative: {"temperature": 1.0, "top_p": 0.95}
