LLM Sampling: Engineering Deep Dive

The Problem You're Actually Solving
Every LLM output is a probability distribution over the model's vocabulary, typically tens of thousands of tokens. Your sampling strategy determines whether you get deterministic garbage or creative genius. Most engineers get this wrong.
Foundation: From Neural Network to Text
Logits: The Raw Signal
LLMs don't "think" in words. The final layer outputs a vector of raw scores (logits), one per token in the vocabulary. These are unbounded real numbers, straight from the last linear transformation.
# What actually happens in the model's final layer
logits = model.lm_head(hidden_states) # [batch_size, seq_len, vocab_size]
# Raw values like: [8.432, -2.331, 4.102, ..., -0.823]
Key insight: Higher logit = model's stronger "conviction" about that token. But these aren't probabilities yet.
Softmax: The Probability Transform
Softmax converts logits to valid probability distribution:
def softmax(logits):
    exp_logits = torch.exp(logits - torch.max(logits))  # Subtract max for numerical stability
    return exp_logits / exp_logits.sum()

# Before: logits = [8.2, 7.1, 6.5, -2.3]
# After:  probs  = [0.66, 0.22, 0.12, 0.0000]
Critical: Softmax amplifies differences. Small logit gaps become large probability gaps.
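A tiny self-contained example (the two logit values are arbitrary) makes the amplification concrete: a gap of just 1.0 in logit space becomes roughly an e ≈ 2.72x ratio in probability, because softmax exponentiates the scores.

import torch

logits = torch.tensor([3.0, 2.0])       # a gap of 1.0 in logit space
probs = torch.softmax(logits, dim=-1)
print(probs)                            # tensor([0.7311, 0.2689])
print((probs[0] / probs[1]).item())     # ≈ 2.72, i.e. e**1.0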
Temperature: The Chaos Knob
Temperature rescales logits BEFORE softmax. This is the master control for randomness.
Mathematical Core
def temperature_sampling(logits, temperature):
    if temperature == 0:
        return torch.argmax(logits)  # Deterministic: greedy decoding
    scaled_logits = logits / temperature
    probs = torch.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
What Actually Happens
- T → 0: Softmax becomes argmax. Highest logit wins every time.
- T < 1: Amplifies differences. Rich get richer. Safe, boring.
- T = 1: Original distribution. Model's raw confidence.
- T > 1: Flattens distribution. Underdogs get a shot. Creative, risky.
Visualization
# Original logits
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
# T=0.5 (focused)
# Probs: [0.83, 0.11, 0.04, 0.02, 0.00]
# T=1.0 (balanced)
# Probs: [0.56, 0.21, 0.12, 0.08, 0.03]
# T=2.0 (exploratory)
# Probs: [0.37, 0.23, 0.18, 0.14, 0.08]
Engineering Reality
class TemperatureController:
    def __init__(self, base_temp=0.7):
        self.base_temp = base_temp
        self.decay_factor = 0.95  # Decays temperature as generation gets longer

    def get_temperature(self, step, context_type):
        # Dynamic temperature based on context
        if context_type == "code":
            return max(0.3, self.base_temp * (self.decay_factor ** step))
        elif context_type == "creative":
            return min(1.5, self.base_temp * (1.1 ** step))
        return self.base_temp
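A hypothetical usage inside a generation loop (the random logits stand in for real model output; temperature_sampling is the function defined above):

import torch

controller = TemperatureController(base_temp=0.7)
for step in range(20):
    logits = torch.randn(50_000)  # stand-in for the model's next-token logits
    temp = controller.get_temperature(step, context_type="code")
    next_token = temperature_sampling(logits, temp)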
Top-k: The Bouncer
Hard cutoff. Only top k tokens allowed in the club.
Implementation
def top_k_filtering(logits, k, filter_value=-float('Inf')):
    if k == 0:
        return logits  # k=0 means top-k is disabled
    indices_to_keep = torch.topk(logits, min(k, logits.size(-1)))[1]
    filter_mask = torch.ones_like(logits).bool()
    filter_mask.scatter_(0, indices_to_keep, False)
    logits[filter_mask] = filter_value
    return logits
Why It Matters
- Prevents sampling from long tail of garbage tokens
- Computationally efficient (no sorting entire vocabulary)
- Predictable memory footprint
The Problem
Fixed k ignores context. Sometimes model is certain (needs k=5), sometimes uncertain (needs k=50).
Top-p (Nucleus): The Smart Bouncer
Dynamic threshold. Includes smallest set of tokens that sum to probability p.
Algorithm
def nucleus_sampling(logits, p, temperature=1.0):
    # Apply temperature first
    logits = logits / temperature
    # Sort and compute cumulative probabilities
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(
        torch.softmax(sorted_logits, dim=-1), dim=-1
    )
    # Find cutoff
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    # Remove tokens
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )
    logits[indices_to_remove] = -float('Inf')
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)
Genius Design
When the model is confident (peaked distribution), it includes few tokens. When it's uncertain (flat distribution), it includes many. The cutoff adapts automatically.
Real-World Impact
# Confident prediction: "The capital of France is ___"
# Top-p=0.9 might only include ["Paris", "located"]
# Uncertain prediction: "The meaning of life is ___"
# Top-p=0.9 might include 50+ philosophical tokens
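A quick way to see this adaptivity with synthetic numbers (the toy logit values below are made up for illustration):

import torch

def nucleus_size(logits, p=0.9):
    # How many top tokens are needed before cumulative probability reaches p
    probs = torch.softmax(torch.sort(logits, descending=True).values, dim=-1)
    return int((torch.cumsum(probs, dim=-1) < p).sum()) + 1

peaked = torch.tensor([10.0, 2.0, 1.5, 1.0, 0.5])  # confident: one dominant logit
flat = torch.tensor([1.2, 1.1, 1.0, 0.9, 0.8])     # uncertain: nearly uniform
print(nucleus_size(peaked))  # 1  (the top token alone covers p)
print(nucleus_size(flat))    # 5  (nearly the whole candidate set is needed)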
Combined Strategies: Production Pipeline
class ProductionSampler:
    def __init__(self, config):
        self.temp = config.get('temperature', 0.7)
        self.top_p = config.get('top_p', 0.9)
        self.top_k = config.get('top_k', 0)  # 0 = disabled
        self.repetition_penalty = config.get('rep_penalty', 1.1)

    def sample(self, logits, past_tokens=None):
        # 1. Repetition penalty (prevent loops)
        if past_tokens is not None:
            for token in set(past_tokens):
                # Penalize both signs: shrink positive logits, push negative ones further down
                if logits[token] > 0:
                    logits[token] /= self.repetition_penalty
                else:
                    logits[token] *= self.repetition_penalty
        # 2. Temperature scaling
        logits = logits / self.temp
        # 3. Top-k filtering (if enabled)
        if self.top_k > 0:
            logits = top_k_filtering(logits, self.top_k)
        # 4. Top-p filtering
        if self.top_p < 1.0:
            logits = nucleus_filtering(logits, self.top_p)
        # 5. Sample
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, 1)
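The pipeline calls a nucleus_filtering helper that isn't defined above. Unlike nucleus_sampling, it should only mask logits and leave the actual draw to step 5; here is a minimal sketch consistent with how it's called:

def nucleus_filtering(logits, p, filter_value=-float('Inf')):
    # Same cutoff logic as nucleus_sampling, but returns masked logits instead of a token
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    indices_to_remove = sorted_indices_to_remove.scatter(0, sorted_indices, sorted_indices_to_remove)
    logits[indices_to_remove] = filter_value
    return logits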
Engineering Configurations by Use Case
Code Generation
config = {
    "temperature": 0.2,        # Very focused
    "top_p": 0.95,             # Allow some creativity
    "frequency_penalty": 0.1   # Prevent repetitive patterns
}
Why: Code needs to be correct, not creative. Low temperature maintains syntactic coherence.
API Response Generation
config = {
    "temperature": 0.0,  # Deterministic
    "top_k": 1,          # Greedy
    "max_tokens": 256    # Controlled length
}
Why: Reproducibility matters. Same input → same output.
Creative Writing
config = {
    "temperature": 0.9,      # High creativity
    "top_p": 0.95,           # Wide vocabulary
    "presence_penalty": 0.6  # Force topic diversity
}
Why: Unexpected word choices create an engaging narrative.
Technical Documentation
config = {
    "temperature": 0.7,  # Balanced
    "top_p": 0.9,        # Controlled diversity
    "top_k": 40          # Safety net
}
Why: Clear but not robotic. Natural language with technical precision.
Advanced: Adaptive Sampling
class AdaptiveSampler:
    def __init__(self):
        self.perplexity_threshold = 10.0

    def sample(self, logits, context):
        # Perplexity of the next-token distribution = exp(entropy), a measure of uncertainty
        probs = torch.softmax(logits, dim=-1)
        perplexity = torch.exp(-torch.sum(probs * torch.log(probs + 1e-12)))
        # High perplexity = high uncertainty = need more exploration
        if perplexity > self.perplexity_threshold:
            temperature = 1.2
            top_p = 0.95
        else:
            temperature = 0.5
            top_p = 0.8
        return self._sample_with_params(logits, temperature, top_p)
Performance Optimization
GPU Memory Management
@torch.inference_mode()
def efficient_sampling(logits, top_p=0.9, top_k=0):
    # Work with top-k first to shrink the candidate set
    if top_k > 0:
        top_k_values, top_k_indices = torch.topk(
            logits, min(top_k, logits.size(-1))
        )
        logits = torch.full_like(logits, -float('inf'))
        logits.scatter_(-1, top_k_indices, top_k_values)
    # Then apply the nucleus cutoff on the reduced set
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
    sorted_indices_to_remove = cumulative_probs > top_p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    indices_to_remove = sorted_indices_to_remove.scatter(-1, sorted_indices, sorted_indices_to_remove)
    logits[indices_to_remove] = -float('inf')
    # Sample from what survives
    probs = torch.softmax(logits, dim=-1)
    sampled_token = torch.multinomial(probs, num_samples=1)
    return sampled_token
Batch Processing
def batch_sample(logits_batch, configs):
    """Apply different sampling params to each sequence in a batch."""
    samples = []
    for logits, config in zip(logits_batch, configs):
        # sample_with_config applies one sequence's config (see the sketch below)
        samples.append(sample_with_config(logits, config))
    return torch.stack(samples)
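sample_with_config isn't defined here; one minimal way to implement it, reusing the ProductionSampler above (this wiring is an assumption, not part of the original pipeline):

def sample_with_config(logits, config):
    # Hypothetical glue: build a sampler for this sequence's config and draw one token
    return ProductionSampler(config).sample(logits)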
Common Pitfalls
- Temperature = 0 with Top-p: Pointless. Temperature 0 makes decoding deterministic, so top-p does nothing.
- Top-k with small vocabularies: If the vocabulary is smaller than k, the filter does nothing.
- Ignoring numerical stability:
# Bad
probs = torch.exp(logits) / torch.sum(torch.exp(logits))
# Good
probs = torch.softmax(logits, dim=-1)  # Handles overflow internally
- Static sampling throughout generation: Early tokens might need a different strategy than later ones.
Sampling Cheatsheet: Battle-Tested Configurations
Quick Reference Table
| Use Case | Temperature | Top-p | Top-k | Why These Settings |
|---|---|---|---|---|
| Code Generation | 0.2 | 0.95 | - | Syntax must be perfect. Low temp = fewer syntax errors |
| SQL Queries | 0.0 | - | 1 | Zero tolerance for errors. Deterministic every time |
| JSON/API Responses | 0.0 | - | 1 | Structure critical. Same input → same output |
| Unit Tests | 0.4 | 0.9 | - | Some creativity for edge cases, but mostly logical |
| Bug Fixes | 0.3 | 0.9 | - | Focus on correctness with slight variation for solutions |
| Technical Docs | 0.6 | 0.9 | - | Clear but natural. Not robotic |
| Code Comments | 0.7 | 0.9 | - | Human-readable, varied vocabulary |
| API Documentation | 0.5 | 0.9 | - | Precise yet readable |
| README Files | 0.7 | 0.9 | - | Engaging but accurate |
| Blog Posts | 0.8 | 0.9 | - | Engaging content, varied sentence structure |
| Creative Writing | 1.0-1.2 | 0.95 | - | Maximum creativity, unexpected connections |
| Marketing Copy | 0.85 | 0.9 | - | Catchy but not insane |
| Social Media | 0.9 | 0.85 | 50 | Punchy, memorable, slightly wild |
| Customer Support | 0.3 | 0.9 | - | Consistent, helpful, no hallucinations |
| Chatbots (Casual) | 0.8 | 0.9 | - | Natural conversation flow |
| Virtual Assistant | 0.5 | 0.9 | - | Balanced: reliable but not mechanical |
| Tutoring/Education | 0.5 | 0.9 | - | Clear explanations with some variety |
| Data Extraction | 0.0 | - | 1 | Just the facts |
| Summarization | 0.3 | 0.9 | - | Accurate but not repetitive |
| Translation | 0.3 | 0.95 | - | Faithful to source, natural target language |
| Classification | 0.0 | - | 1 | Binary decision, no creativity needed |
| Search Queries | 0.0 | - | 5 | Limited, precise options |
Understanding the Columns
Temperature: Controls randomness
- 0.0: Deterministic robot
- 0.3: Focused professional
- 0.7: Natural human
- 1.0+: Creative chaos agent
Top-p: Vocabulary diversity (nucleus sampling)
- 0.85: Tight vocabulary, common words
- 0.9: Standard range
- 0.95: Wide vocabulary, unusual words welcome
Top-k: Hard vocabulary limit
- Not set (-): Let top-p handle it
- 1: Only the #1 choice (greedy)
- 5-50: Fixed pool of top candidates
The Golden Rules
- If accuracy matters more than creativity: Temperature ≤ 0.3
- If creativity matters more than accuracy: Temperature ≥ 0.8
- If you need reproducibility: Temperature = 0.0, Top-k = 1
- If you're getting repetitive output: Increase temperature or add frequency penalty
- If you're getting nonsense: Lower temperature, reduce top-p
Common Combinations Explained
The Determinist (T=0.0, k=1)
{ "temperature": 0.0, "top_k": 1 }
Use when: Production APIs, database queries, financial calculations
Expect: Same output every single time
The Professional (T=0.3, p=0.9)
{ "temperature": 0.3, "top_p": 0.9 }
Use when: Business reports, technical analysis, support responses
Expect: Accurate, focused, slight variation to avoid robotic feel
The Balanced (T=0.7, p=0.9)
{ "temperature": 0.7, "top_p": 0.9 }
Use when: General purpose, documentation, explanations
Expect: Natural language, good mix of predictability and variety
The Creative (T=1.0, p=0.95)
{ "temperature": 1.0, "top_p": 0.95 }
Use when: Stories, brainstorming, creative content
Expect: Surprising connections, unique phrasings, occasional weirdness
The Wildcard (T=1.2+, p=0.95)
{ "temperature": 1.2, "top_p": 0.95 }
Use when: Poetry, experimental writing, breaking patterns
Expect: Chaos, brilliance, and complete nonsense in equal measure
Troubleshooting Guide
| Problem | Solution | Config Change |
|---|---|---|
| Too repetitive | Increase randomness | Temperature +0.2, add frequency_penalty: 0.5 |
| Nonsense/hallucinations | Reduce randomness | Temperature -0.3, top_p to 0.85 |
| Too boring/predictable | Add creativity | Temperature +0.2, top_p to 0.95 |
| Inconsistent formatting | Lock it down | Temperature to 0.1, top_k to 10 |
| Going off-topic | Tighten focus | Lower top_p to 0.8, add presence_penalty |
| Too verbose | Constrain output | Add max_tokens, increase frequency_penalty |
| Too terse | Encourage elaboration | Temperature +0.1, remove penalties |
Advanced: Context-Dependent Sampling
Different parts of generation need different strategies:
# Example: Writing a function with documentation
config_sequence = [
    {"context": "docstring", "temp": 0.7, "top_p": 0.9},   # Natural description
    {"context": "signature", "temp": 0.2, "top_p": 0.95},  # Precise syntax
    {"context": "body", "temp": 0.3, "top_p": 0.9},        # Correct logic
    {"context": "comments", "temp": 0.6, "top_p": 0.9}     # Readable notes
]
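One way to wire such a schedule into a generation loop; the config_for helper and the stand-in logits are hypothetical, only the config_sequence values come from above:

def config_for(context, config_sequence):
    # Look up the sampling params that match the current generation context
    for entry in config_sequence:
        if entry["context"] == context:
            return entry
    return {"context": "default", "temp": 0.7, "top_p": 0.9}  # fallback

# e.g. while emitting the docstring portion of a function:
params = config_for("docstring", config_sequence)
logits = torch.randn(50_000)  # stand-in for the model's next-token logits
token = nucleus_sampling(logits, p=params["top_p"], temperature=params["temp"])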
The 80/20 Rule
For 80% of use cases, you only need these three configs:
- Factual/Code: {"temperature": 0.3, "top_p": 0.9}
- General Purpose: {"temperature": 0.7, "top_p": 0.9}
- Creative: {"temperature": 1.0, "top_p": 0.95}