Tokens, Temperature, and Model Parameters Explained
Model parameters are key factors that affect the quality, cost, and stability of LLM output. In 2026, as model capabilities evolve at breakneck speed (Claude 4.6, Gemini 3.1 Pro, DeepSeek-R1, etc.), understanding these underlying parameters lets you control Agent behavior and costs with precision.
Token: The Model's "Basic Unit"
A Token is not a character, not a word — it is the smallest unit by which a model processes text.
import tiktoken # Token counting library (using OpenAI's system as an example to demonstrate core principles;
# in 2026, different vendors like Qwen and Gemini have their own tokenizers, but the underlying logic is the same)
def count_tokens(text: str, model_encoding: str = "cl100k_base") -> int:
"""Count the number of tokens in a text"""
encoding = tiktoken.get_encoding(model_encoding)
tokens = encoding.encode(text)
return len(tokens)
def visualize_tokens(text: str, model_encoding: str = "cl100k_base"):
"""Visualize token splitting"""
encoding = tiktoken.get_encoding(model_encoding)
tokens = encoding.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Token list: {[encoding.decode([t]) for t in tokens]}")
print()
# English tokenization example
visualize_tokens("Hello, how are you today?")
# Token list: ['Hello', ',', ' how', ' are', ' you', ' today', '?']
# Token count: 7
# Code token count
code = """
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
"""
visualize_tokens(code)
The relationship between Tokens and billing:
⏰ Note: The pricing data below is based on publicly available pricing from vendor websites as of March 2026. Model prices change frequently — always refer to the latest official documentation.
# Token cost calculator (price unit: USD per million tokens, latest data as of 2026-03)
PRICE_PER_1M_TOKENS = {
"claude-4.6-sonnet": {"input": 3.0, "output": 15.0}, # Strongest reasoning, higher price
"gemini-3.1-pro-preview": {"input": 2.0, "output": 12.0}, # King of ultra-long context and multimodal
"qwen-3.5-max": {"input": 1.2, "output": 6.0}, # Business workhorse, great value
"kimi-2.5": {"input": 1.0, "output": 3.0}, # Suitable for ultra-long text retrieval
"deepseek-r1": {"input": 0.55, "output": 2.19}, # Best value reasoning model
}
def estimate_cost(
input_text: str,
expected_output_tokens: int,
model: str = "qwen-3.5-max"
) -> dict:
"""Estimate API call cost"""
input_tokens = count_tokens(input_text)
price = PRICE_PER_1M_TOKENS.get(model, {"input": 1.0, "output": 2.0})
input_cost = (input_tokens / 1_000_000) * price["input"]
output_cost = (expected_output_tokens / 1_000_000) * price["output"]
return {
"input_tokens": input_tokens,
"expected_output_tokens": expected_output_tokens,
"total_tokens": input_tokens + expected_output_tokens,
"estimated_cost_usd": input_cost + output_cost,
}
# Estimate cost
prompt = "Please write a 500-word article about Python asynchronous programming"
cost = estimate_cost(prompt, 500, "qwen-3.5-max")
print(f"Input tokens: {cost['input_tokens']}")
print(f"Estimated total cost: ${cost['estimated_cost_usd']:.4f}")
Key rules for Token usage:
- English: ~1 Token/word
- Code: ~1–2 Tokens/line
- Numbers and punctuation: ~1 Token each
Token Consumption Breakdown by Content Type
You might intuitively think "1 character = 1 Token," but the reality is far more complex. Let's break down Token consumption for different content types with concrete examples:
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
# === English text ===
# Common English words are usually 1 Token; short/common combinations may be merged
samples_en = {
"Hello": encoding.encode("Hello"), # 1 Token
"artificial": encoding.encode("artificial"), # 1 Token (common word)
"superintelligence": encoding.encode("superintelligence"), # 2 Tokens (long/rare word)
"Hello, world!": encoding.encode("Hello, world!"), # 4 Tokens
"The quick brown fox jumps over the lazy dog.":
encoding.encode("The quick brown fox jumps over the lazy dog."), # 10 Tokens
}
for text, tokens in samples_en.items():
decoded = [encoding.decode([t]) for t in tokens]
print(f"'{text}' → {len(tokens)} Token(s), split: {decoded}")
# Rules:
# - Common English words are ~1 Token/word; uncommon or very long words split into 2–3 Tokens
# - Spaces are usually merged with the following word (e.g., ' how')
# - On average, English text is ~1 Token ≈ 4 characters (including spaces)
# === Punctuation ===
samples_punct = {
",": encoding.encode(","), # English comma: 1 Token
".": encoding.encode("."), # English period: 1 Token
"!?": encoding.encode("!?"), # Two punctuation marks → usually 2 Tokens
"...": encoding.encode("..."), # Ellipsis: may be 1 Token
}
for text, tokens in samples_punct.items():
decoded = [encoding.decode([t]) for t in tokens]
print(f"'{text}' → {len(tokens)} Token(s), split: {decoded}")
# === Numbers ===
samples_num = {
"42": encoding.encode("42"), # 1 Token
"3.14": encoding.encode("3.14"), # 2 Tokens (decimal point splits)
"2026": encoding.encode("2026"), # 1 Token
"1000000": encoding.encode("1000000"), # 1–2 Tokens
"3.141592653589793": encoding.encode("3.141592653589793"), # Multiple Tokens
}
for text, tokens in samples_num.items():
decoded = [encoding.decode([t]) for t in tokens]
print(f"'{text}' → {len(tokens)} Token(s), split: {decoded}")
# Rules:
# - 1–4 digit integers are usually 1 Token
# - Longer numbers are split, roughly 1 Token per 3–4 digits
# - Decimal points cause additional Token splits
Image Token consumption (multimodal models):
2026 is the year of multimodal Agent explosion. Models like Gemini 3.1 Pro or Claude 4.6 process images not by character count but by "image tiles":
# General image Token consumption estimation (based on mainstream tiling logic)
def estimate_image_tokens(width: int, height: int, detail: str = "auto") -> int:
"""
Estimate the number of tokens consumed by an image
detail modes:
- "low": Fixed base consumption regardless of image size (typically ~85 Tokens)
- "high": Calculated based on image dimensions, split into tiles, may consume hundreds to thousands of Tokens
- "auto": Model automatically selects
"""
if detail == "low":
return 85 # Fixed consumption
# high detail mode calculation (using a typical 512x512 tiling algorithm):
# 1. Limit max dimension and scale proportionally
max_dim = max(width, height)
if max_dim > 2048:
scale = 2048 / max_dim
width = int(width * scale)
height = int(height * scale)
# 2. Calculate how many 512x512 tiles are needed
tiles_x = (width + 511) // 512 # Round up
tiles_y = (height + 511) // 512
num_tiles = tiles_x * tiles_y
# 3. Each tile consumes a fixed number of Tokens (e.g., 170) + base Tokens (e.g., 85)
total_tokens = num_tiles * 170 + 85
return total_tokens
# Token consumption for common image sizes
print("=== Multimodal Image Token Consumption Estimate (high detail mode) ===")
image_sizes = [
(256, 256, "Thumbnail/icon"),
(512, 512, "Small image"),
(1024, 768, "Regular photo"),
(1920, 1080, "Full HD screenshot"),
(4096, 2160, "4K image"),
]
for w, h, desc in image_sizes:
tokens_low = estimate_image_tokens(w, h, "low")
tokens_high = estimate_image_tokens(w, h, "high")
print(f" {desc} ({w}x{h}): low={tokens_low}, high={tokens_high} Tokens")
💡 Practical tip: If your Agent frequently processes images (e.g., web UI screenshot analysis), compress images to below 1080p in your local code first, or use
detail="low"for non-OCR scenarios. This can save you a significant amount in API costs!
Quick reference table for Token consumption by content type:
| Content Type | Example | Approx. Token Cost |
|---|---|---|
| English word | "Hello" | 1 Token/word |
| English long/rare word | "superintelligence" | 2–3 Tokens/word |
| Punctuation | , . ! ? | 1 Token each |
| Number (1–4 digits) | "42", "2026" | 1 Token |
| Long number | "3.141592653589793" | Split per 3–4 digits |
| Code | Python/JS, etc. | ~1–2 Tokens/line |
| Image (low) | Any size | Fixed 85 Tokens |
| Image (high) | 1920×1080 | ~1445 Tokens |
Temperature: The Creativity Dial
Temperature controls the randomness of output and is one of the most important parameters:
from openai import OpenAI
client = OpenAI() # Compatible with mainstream LLM API formats
def test_temperature(prompt: str, temperatures: list, runs: int = 3):
"""Compare output effects at different Temperature values"""
for temp in temperatures:
print(f"\n{'='*50}")
print(f"Temperature = {temp}")
print('='*50)
for i in range(runs):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=temp,
max_tokens=50
)
print(f" Run {i+1}: {response.choices[0].message.content}")
# Test creative writing (higher Temperature is better)
test_temperature(
"Describe spring in one sentence",
temperatures=[0.0, 0.7, 1.5],
runs=3
)
# Temperature=0.0: Output is identical every time (absolute determinism)
# Temperature=0.7: Some variation, language is natural and fluent
# Temperature=1.5: Highly divergent creativity, but words may jump or be incoherent
Recommended Temperature values for different scenarios:
TEMPERATURE_GUIDE = {
"code_generation": 0.1, # Requires precision, low randomness
"data_extraction_json": 0.0, # Complete determinism, prevents crashes
"qa_factual": 0.3, # Slightly stable
"copywriting_summary": 0.7, # Balance creativity and accuracy
"brainstorming_creative": 1.0, # Encourage diversity
"poetry_creative_writing": 1.2, # High creativity
"agent_logic_routing": 0.1, # Tool calls need extreme stability
"conversation_chat": 0.8, # Natural conversation
}
def get_optimal_temperature(task_type: str) -> float:
return TEMPERATURE_GUIDE.get(task_type, 0.7)
Top-p: Another Way to Control Randomness
Top-p (also called Nucleus Sampling) dynamically truncates and samples from the set of highest-probability words, with the set size determined by p:
# Using Top-p and Temperature together
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Write the opening of a short story about AI awakening"}],
temperature=0.8, # Controls degree of randomness
top_p=0.9, # Only sample from the top 90% of cumulative probability vocabulary
max_tokens=200
)
Difference between Temperature and Top-p:
| Parameter | Mechanism | Industry Best Practice |
|---|---|---|
| Temperature | Scales the entire probability distribution of all vocabulary | Usually adjust this parameter first |
| Top-p | Directly "cuts off" low-probability long-tail words | Don't adjust both parameters significantly at the same time |
Usually adjust one and keep the other at default (temperature=1.0 or top_p=1.0).
max_tokens: Control Output Length (⚠️ Hidden Trap for Reasoning Models)
def chat_with_length_control(
message: str,
max_output_tokens: int = 500,
model: str = "gpt-4o-mini"
) -> dict:
"""Control output length"""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": message}],
max_tokens=max_output_tokens # Limit generation length
)
usage = response.usage
content = response.choices[0].message.content
finish_reason = response.choices[0].finish_reason
return {
"content": content,
"total_tokens": usage.total_tokens,
"finish_reason": finish_reason # "stop"=normal end, "length"=truncated at limit
}
# Test
result = chat_with_length_control("Write a 500-word article", max_output_tokens=100)
if result["finish_reason"] == "length":
print("⚠️ Output was truncated by max_tokens!")
🔥 2026 Critical Warning: The CoT Token Trap in Reasoning Models (e.g., DeepSeek-R1) If you're using a "slow thinking" reasoning model like
DeepSeek-R1, never setmax_tokenstoo small!Reasoning models output a large amount of internal "Chain of Thought" before giving the final answer. These thinking processes are also subject to max_tokens and are billed. If you set
max_tokensto 500 to save money, the model may exhaust its quota during the thinking phase, and you won't even get the final answer.
- Best practice: When calling reasoning models for complex planning, set
max_tokensto 4000 or even 8000.
Presence Penalty & Frequency Penalty: Control Repetition
# These two parameters help prevent the model from repeating itself
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "List 10 different startup directions"}],
# presence_penalty: Penalizes any token that has already appeared (encourages new topics)
# Range: -2.0 to 2.0; positive values reduce topic repetition
presence_penalty=0.5,
# frequency_penalty: Penalizes tokens proportional to how often they've appeared (encourages vocabulary diversity)
# Range: -2.0 to 2.0; positive values reduce high-frequency words
frequency_penalty=0.3,
)
# Suitable for: listing diverse options, generating long non-repetitive research reports
stop: Custom Stop Conditions
# Make the model stop abruptly at a specific string — extremely cost-effective for data extraction
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": "Please output in format:\nName:\nPrice:\nDescription:"
}],
stop=["Description:"], # Once "Description:" is generated, immediately force-stop API generation
)
# Practical use case: structured data extraction without filler text
def extract_until_marker(text: str, stop_marker: str) -> str:
"""Extract content up to a marker, preventing the model from adding filler text afterward"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": text}],
stop=[stop_marker]
)
return response.choices[0].message.content
n: Generate Multiple Candidate Results
def generate_multiple_options(prompt: str, n: int = 3) -> list:
"""Generate multiple parallel candidate results in a single API call for downstream selection or scoring"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
n=n, # Generate n different branch replies
temperature=0.9 # Must increase temperature to ensure diversity across branches
)
return [choice.message.content for choice in response.choices]
# Suitable for: ad copy headline generation, A/B test proposals, multi-Agent debate drafts
titles = generate_multiple_options(
"Generate 3 compelling titles for a technical blog post about multi-agent collaboration",
n=3
)
for i, title in enumerate(titles, 1):
print(f"Candidate {i}: {title}")
Complete Parameter Reference & Agent Practice Recommendations
In real Agent architecture design, we typically wrap a routing factory that dynamically switches base models and parameter configurations based on task attributes:
def create_agent_call(
messages: list,
task_type: str = "general",
**override_params
) -> dict:
"""
Best-practice Agent call factory
Automatically matches the optimal 2026 model and parameters based on task type
"""
# Dynamic routing presets for different task types
task_presets = {
"reasoning": { # Complex reasoning, logic bug investigation
"model": "deepseek-r1", # Use a powerful reasoning model
"temperature": 0.6, # Reasoning models have internal exploration; don't need too low a temperature
"max_tokens": 8000, # Must reserve large token budget for CoT generation
},
"code": { # Pure code writing/refactoring
"model": "gpt-4o",
"temperature": 0.1,
"max_tokens": 4000,
},
"extraction": { # Information extraction, intent classification, JSON generation
"model": "gpt-4o-mini",
"temperature": 0.0, # Eliminate all randomness to prevent JSON parse failures
"max_tokens": 500,
},
"creative": { # Creative copywriting, brainstorming
"model": "gpt-4o",
"temperature": 0.9,
"max_tokens": 1000,
"presence_penalty": 0.3,
},
"general": { # General conversation and standard instructions
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 1500,
}
}
params = task_presets.get(task_type, task_presets["general"])
params.update(override_params) # Allow external business layer to override defaults
params["messages"] = messages
response = client.chat.completions.create(**params)
return {
"content": response.choices[0].message.content,
"usage": {
"input": response.usage.prompt_tokens,
"output": response.usage.completion_tokens,
"total": response.usage.total_tokens
},
"model": response.model,
"finish_reason": response.choices[0].finish_reason
}
# Usage example: have the Agent extract conclusions from copy
result = create_agent_call(
messages=[{"role": "user", "content": "Extract the user purchase intent rating from this text..."}],
task_type="extraction"
)
print(f"Routed to model: {result['model']}")
print(f"Total token usage: {result['usage']['total']}")
Parameter Quick Reference Card
Section Summary
Understanding model parameters is a fundamental skill for Agent architects in 2026:
- Tokens are the billing atom: Multimodal image billing is especially prone to overruns — preprocessing compression is key.
- Temperature determines stability: Set to
0.0for tool parameter extraction; set to0.9+for brainstorming. - Watch out for the max_tokens trap in reasoning models: Models like
DeepSeek-R1consume large amounts of hidden thinking Tokens — always leave enough quota to avoid truncation. - Dynamic model routing: Different task types (extraction vs. reasoning vs. creative) should map to different parameter combinations and the best value-for-money model.
Mastering these parameters lets you achieve the perfect balance between quality, latency, and API costs, giving you true control over Agent behavior.
🤔 Thinking Exercises
- An Agent is responsible for reviewing long advertising copy (potentially tens of thousands of words) for compliance. Which model would you choose for the best value, and how would you set
Temperatureandmax_tokens? - If you notice that your Agent's weekly report summaries frequently use phrases like "overall," "it is worth noting," etc., which parameter should you adjust? Why?
- Calculate: if you use
DeepSeek-R1to build a problem-solving Agent that handles 1,000 requests per day, with an average input of 500 Tokens but an average CoT and output of 3,000 Tokens, what would the monthly cost be? How does this compare to using a standard model?
Next chapter: Chapter 4: Tool Calling (Tool Use / Function Calling)