Latest Advances in Context Engineering
🔬 "The expansion of context windows is not the endpoint — how to efficiently utilize the 'attention bandwidth' of every token is the real challenge."
In the previous sections, we covered the foundations of context engineering: how it differs from prompt engineering, how to manage the attention budget, long-horizon task strategies, and GSSC practice. These are the "fundamentals." This section turns to the latest technical breakthroughs and methodological shifts in the field, which are changing how Agent developers manage context.
In June 2025, Andrej Karpathy publicly stated his preference for using "Context Engineering" to replace the term "Prompt Engineering" [1]. Subsequently, leading institutions like Anthropic [2] and LangChain [3] published systematic context engineering guides. In 2025–2026, context engineering rapidly grew from an emerging concept into the core engineering discipline of Agent development.
Million-Token Context Windows: From Arms Race to Practical Deployment
The Explosive Growth of Context Windows
Between 2023 and 2026, context windows grew from a few thousand tokens to the ten-million level:
| Period | Representative Model | Context Window | Equivalent Text / Notes |
|---|---|---|---|
| Early 2023 | GPT-3.5 | 4K tokens | ~3,000 words |
| Mid 2023 | Claude 2 | 100K tokens | ~75,000 words |
| 2024 | GPT-4 Turbo | 128K tokens | ~96,000 words |
| Early 2025 | Gemini 2.5 Pro | 1M tokens | ~750,000 words (~10 books) |
| Mid 2025 | Llama 4 Scout | 10M tokens | ~7,500,000 words (~100 books) |
| Early 2026 | Claude Opus 4.6 / Sonnet 4.6 | 1M tokens | 1M token standard pricing, no long-text surcharge |
| Early 2026 | GPT-5.4 | 272K (standard) / 1M (extended) | 2× input surcharge beyond 272K |
| Early 2026 | Gemini 3.1 Pro | 1M tokens | Supports video/audio/image/text multimodal |
| 2026 (experimental) | Magic.dev LTM-2-Mini | 100M tokens | ~75M words (theoretical, no public user validation) |
Two key trends are worth noting:
1. Million-token becomes standard: by early 2026, mainstream models including Claude 4.6, Gemini 3.x, and Llama 4 Maverick all offer 1M token context windows. This means "entire book" or even "entire codebase" level input is no longer a dream.
2. Pricing strategy divergence: Anthropic (Claude 4.6) implements standard pricing for 1M tokens with no additional fees; while OpenAI (GPT-5.4) charges a significant surcharge beyond 272K. This pricing strategy directly affects Agent architecture choices.
But bigger window ≠ problem solved. The Lost-in-the-Middle problem we discussed in Section 8.2 hasn't disappeared — in fact, when the window expands from 128K to 1M, this problem becomes even more severe.
Real-World Testing: True Capabilities of Large Windows
# A real test: retrieving specific information in a 1 million token context
import time

def needle_in_haystack_test(model, context_size, needle_position):
    """
    Classic "needle in a haystack" test.
    Insert a key piece of information at a specific position in a large amount of filler text,
    then ask the model a question related to that information.
    """
    # generate_padding_text is a placeholder helper that returns filler text of the requested size
    haystack = generate_padding_text(context_size)
    needle = "The secret number for project Moonlight is 42-ALPHA-7."
    # Insert key information at the specified position (0.0 = start, 1.0 = end)
    position = int(len(haystack) * needle_position)
    context = haystack[:position] + needle + haystack[position:]
    response = model.query(
        context=context,
        question="What is the secret number for project Moonlight?"
    )
    return response
# Real-world test results for various models in 2026 (retrieval accuracy)
results = {
"Claude Opus 4.6 (1M)": {
"Beginning 10%": "✅ 99%",
"Middle 50%": "✅ 97%", # most uniform performance within 1M range
"End 90%": "✅ 99%",
"Full 100%": "✅ 95%", # quality remains stable at million-token scale
},
"Gemini 3.1 Pro (1M)": {
"Beginning 10%": "✅ 99%",
"Middle 50%": "✅ 96%", # significantly improved Lost-in-the-Middle
"End 90%": "✅ 98%",
"Full 100%": "⚠️ 89%", # still some performance degradation near full capacity
},
"GPT-5.4 (272K standard)": {
"Beginning 10%": "✅ 99%",
"Middle 50%": "✅ 93%",
"End 90%": "✅ 97%",
"Full 100%": "⚠️ 88%",
},
"DeepSeek R1 (128K)": {
"Beginning 10%": "✅ 98%",
"Middle 50%": "⚠️ 88%",
"End 90%": "✅ 95%",
"Full 100%": "⚠️ 82%",
},
}
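The harness above leaves `generate_padding_text` and the scoring step undefined. A minimal sketch of those two pieces, assuming filler built by repeating one sentence and exact-substring scoring (both are illustrative choices, and `build_context` and `is_retrieved` are hypothetical names):

```python
def generate_padding_text(num_chars, filler="The sky was clear and the road was quiet. "):
    """Hypothetical helper: repeat a filler sentence until num_chars characters are reached."""
    repeats = num_chars // len(filler) + 1
    return (filler * repeats)[:num_chars]


def build_context(num_chars, needle, needle_position):
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end) of the filler text."""
    haystack = generate_padding_text(num_chars)
    position = int(len(haystack) * needle_position)
    return haystack[:position] + " " + needle + " " + haystack[position:]


def is_retrieved(response, expected="42-ALPHA-7"):
    """Score one trial: the answer counts only if the exact secret appears in the response."""
    return expected in response


needle = "The secret number for project Moonlight is 42-ALPHA-7."
context = build_context(10_000, needle, needle_position=0.5)
needle_offset = context.index(needle) / len(context)  # relative position actually achieved
```

Sweeping `needle_position` over several values and averaging `is_retrieved` across trials gives per-position accuracy figures of the kind shown in the results table.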
💡 Practical advice: don't blindly chase the largest window. If 128K is sufficient, don't fill 1M. Context quality is far more important than context quantity — this is the first principle of context engineering. A solution with perfect recall within 100K tokens is often superior to one that performs unstably at 500K tokens.
Anthropic's Context Engineering Methodology: From Practice to Theory
On September 29, 2025, Anthropic published the landmark technical article "Effective Context Engineering for AI Agents" [2], the first systematic summary of context management methodology for production-grade Agents. This article had a profound impact on the entire industry.
Core Philosophy: Context Is a Limited and Precious Resource
Anthropic's core view is: find the minimal set of high-signal tokens that maximizes the probability of the desired outcome. This is consistent with the "quality-first" principle we discussed in Section 8.1, but Anthropic provides a more actionable framework from an engineering practice perspective.
# Anthropic's core context engineering principles (expressed in pseudocode)
class AnthropicContextPhilosophy:
    """
    Core philosophy: context is a limited resource with diminishing marginal returns.
    As token count increases:
    - First 10K tokens: high information gain per token
    - 10K–50K: information gain starts to diminish
    - 50K–200K: requires careful filtering to maintain signal density
    - 200K+: without management, noise may overwhelm signal
    """
    principles = [
        "Find the minimal set of high-signal tokens",            # more is not better
        "Re-curate context at each inference",                   # context is dynamic
        "Treat context as a resource with diminishing returns",  # the 100Kth token is worth far less than the 1Kth
        "Do the simplest effective thing",                       # over-engineering is also a waste
    ]
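One way to make "find the minimal set of high-signal tokens" concrete is a greedy selection by relevance per token under a fixed budget. The sketch below is an illustration under stated assumptions, not Anthropic's method: relevance scores are assumed to come from some upstream ranker, and `approx_tokens` is a crude word-count stand-in for a real tokenizer.

```python
def approx_tokens(text):
    """Rough token estimate (assumption: one token per whitespace-separated word)."""
    return len(text.split())


def select_high_signal(snippets, budget):
    """Greedily keep the snippets with the best relevance-per-token until the budget is spent.

    `snippets` is a list of (text, relevance) pairs; the relevance values are hypothetical.
    """
    ranked = sorted(snippets, key=lambda s: s[1] / approx_tokens(s[0]), reverse=True)
    selected, used = [], 0
    for text, _relevance in ranked:
        cost = approx_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected, used


candidates = [
    ("calculate_discount applies the coupon before the member discount", 0.9),
    ("The repository was created in 2019 and has had many contributors since then", 0.1),
    ("Regression test: test_discount_combination_cases", 0.8),
]
selected, used = select_high_signal(candidates, budget=15)
```

With a 15-token budget the two dense, relevant snippets fit and the long low-relevance one is dropped, which is the diminishing-returns idea in miniature.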
Three Pillars of "Effective Context"
Anthropic breaks down the composition of high-quality context into three levels:
1. System Prompt Design — Finding the Right "Altitude"
# ❌ Overly Prescriptive — too granular, brittle
system_prompt_bad_1 = """
If the user asks about a file, first check if the file exists. If the file exists and is smaller than 100KB,
read it directly. If larger than 100KB but smaller than 1MB, use chunked reading. If larger than 1MB,
first check the file type. If it's a text file, use streaming...
"""
# ❌ Too Vague — no practical guidance
system_prompt_bad_2 = """You are a helpful programming assistant. Please try your best to help users."""
# ✅ The right "altitude" — clear principles + appropriate flexibility
system_prompt_good = """
You are a professional programming assistant.
<core_principles>
- Understand the intent of existing code before modifying it
- Prioritize using patterns and conventions already in the project
- For destructive operations (deleting files, rewriting modules), confirm before executing
</core_principles>
<tool_usage>
You can use tools like read_file, write_file, search, etc.
Follow the principle of least privilege when choosing tools — read rather than write if possible, search rather than full scan.
</tool_usage>
"""
2. Tool Definitions — The Interface Contract Between Agent and World
# Anthropic's tool design principles
tool_design_principles = {
    "Token efficient": "Tool returns should be concise; don't return large amounts of irrelevant information",
    "Non-overlapping functionality": "Like a well-designed function library, each tool has a single responsibility",
    "Self-contained": "Tool descriptions should be unambiguous: if a human engineer can't tell when to use a tool, neither can the model",
    "Robustness": "Handle erroneous inputs gracefully, return useful error messages",
}
# ❌ Poor tool design: overlapping functionality, vague descriptions
tools_bad = [
{"name": "search_files", "description": "Search files"},
{"name": "find_files", "description": "Find files"}, # What's the difference from above?
{"name": "lookup_files", "description": "Look up file content"}, # Even more ambiguous
]
# ✅ Good tool design: clear responsibilities, unambiguous
tools_good = [
{"name": "glob_search", "description": "Search by filename pattern (e.g., *.py), returns list of matching file paths"},
{"name": "content_search", "description": "Search files by content regex, returns matching lines with context"},
{"name": "read_file", "description": "Read all or part of a file at the specified path (supports offset+limit)"},
]
3. Just-in-Time Context — Anthropic's Killer Feature
This is Anthropic's most influential practice pattern. The core idea is: don't preload all potentially needed information; instead, maintain lightweight identifiers and retrieve on demand at runtime.
class JustInTimeContextStrategy:
    """
    Just-in-Time context strategy (core pattern of Anthropic / Claude Code)
    Traditional approach: preload all potentially relevant files into context
    JIT approach: only maintain file paths/query pointers, load when needed
    Effect: context usage reduced by 70%+, and information is more precise
    """
    def __init__(self, system_prompt=""):
        self.system_prompt = system_prompt
        # Maintain lightweight identifiers, not full content
        self.file_index = {}       # file path → brief summary
        self.query_pointers = {}   # query description → database/API endpoint
        self.web_links = {}        # topic → URL

    def build_initial_context(self, task):
        """Initial context only contains the 'map', not the 'territory'"""
        return {
            "system": self.system_prompt,
            "task": task,
            "available_resources": {
                "files": list(self.file_index.keys()),  # paths only
                "databases": list(self.query_pointers.keys()),
                "docs": list(self.web_links.keys()),
            },
            # Tell the Agent: you have these resources available, actively retrieve when needed
            "instruction": "Use tools to fetch specific content on demand. Don't guess."
        }

    def on_agent_request(self, resource_type, identifier):
        """Load specific content only when the Agent actively requests it"""
        # read_file, execute_query and fetch_url are placeholder tool functions
        if resource_type == "file":
            return read_file(identifier)  # only reads the file at this point
        elif resource_type == "database":
            return execute_query(self.query_pointers[identifier])
        elif resource_type == "web":
            return fetch_url(self.web_links[identifier])
        raise ValueError(f"Unknown resource type: {resource_type}")
💡 How Claude Code actually works: Claude Code only reads the CLAUDE.md file in the project root directory at startup (serving as the project's "user manual"), then navigates the entire codebase on demand using primitives like glob and grep. It never loads the entire codebase into context, even when the model window is "large enough." This is the quintessential application of JIT thinking.
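The map-versus-territory idea can be tried on a local directory with only the standard library. This is a minimal sketch, not Claude Code's implementation: `build_file_index` and `load_on_demand` are hypothetical names, and the demo files are made up.

```python
import glob
import os
import tempfile


def build_file_index(root):
    """Return relative paths only: the 'map', not the 'territory'."""
    paths = glob.glob(os.path.join(root, "**", "*.py"), recursive=True)
    return sorted(os.path.relpath(p, root) for p in paths)


def load_on_demand(root, rel_path, max_chars=2000):
    """Read a file only when the agent asks for it, truncated to a character cap."""
    with open(os.path.join(root, rel_path), encoding="utf-8") as f:
        return f.read(max_chars)


# Demo on a throwaway project directory (file names are invented for illustration).
with tempfile.TemporaryDirectory() as root:
    demo_files = {"app.py": "print('app')\n", "utils.py": "def helper():\n    return 1\n"}
    for name, body in demo_files.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(body)

    index = build_file_index(root)               # what goes into the initial context
    fetched = load_on_demand(root, "utils.py")   # what a later tool call returns
```

Only `index` would be placed in the initial prompt; file bodies enter the context one at a time, when a tool call asks for them.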
ACE: Self-Evolving Context Engineering (ICLR 2026)
In October 2025, Zhang et al. proposed the ACE (Agentic Context Engineering) framework [4], which was accepted at ICLR 2026. This is an important breakthrough in the field of context engineering — letting Agents learn to manage and optimize their own context.
Core Problem: Context Collapse
Traditional context management faces two chronic issues:
- Brevity Bias: losing domain-depth insights when compressing summaries — the more you compress, the more "generic" it becomes
- Context Collapse: details gradually erode during iterative rewriting, and eventually "summaries of summaries" become completely uninformative
# Intuitive understanding of context collapse
def context_collapse_demo():
    """
    Simulate the context collapse process.
    Each compression loses some details. After 5-10 rounds of compression,
    the original information may only retain the highest-level abstractions:
    specific numbers, conditional branches, and edge cases are all lost.
    """
    original = """
    When processing order #12345, we found: when a user simultaneously uses
    coupon A (spend $300 save $50) and member discount (20% off), the system
    incorrectly applies the coupon first and then the discount, for a total
    reduction of $50 + ($300-$50)*0.2 = $100 (customer pays $200), while the
    correct logic applies the discount first and then the coupon:
    $300*0.8 - $50 = $190 payable, a difference of $10.
    Fixed in the calculate_discount() function in order_service.py.
    Needs regression testing of test_discount_combination_cases().
    """
    # Round 1 compression
    round_1 = "Fixed calculation order error when coupon and member discount are used simultaneously, $10 difference"
    # Round 2 compression
    round_2 = "Fixed discount calculation error"
    # Round 3 compression
    round_3 = "Fixed a bug"
    # → Specific order number, amounts, file location, test cases all lost!
    return [original, round_1, round_2, round_3]
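Collapse of this kind can be detected mechanically. One possible guard, sketched here as an assumption rather than an established practice from the source, is to keep a list of must-survive identifiers and check each summary against it; `surviving_facts` is a hypothetical helper.

```python
def surviving_facts(summary, facts):
    """Return the subset of `facts` that still appears verbatim in `summary`."""
    return [fact for fact in facts if fact in summary]


# Details a later debugging session would need (taken from the example above).
key_facts = ["#12345", "calculate_discount", "order_service.py", "test_discount_combination_cases"]

rounds = {
    "original": "Order #12345: fixed in calculate_discount() in order_service.py; "
                "rerun test_discount_combination_cases().",
    "round_1": "Fixed calculation order error when coupon and member discount are combined",
    "round_2": "Fixed discount calculation error",
    "round_3": "Fixed a bug",
}

# Fraction of key facts that each version still contains.
retention = {
    name: len(surviving_facts(text, key_facts)) / len(key_facts)
    for name, text in rounds.items()
}
```

A compaction step could refuse to replace history with a summary whose retention falls below some chosen threshold.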
ACE Framework: Letting Context Self-Evolve
ACE's core innovation is treating context as an "Evolving Playbook" that continuously improves through three modular stages:
class ACEFramework:
    """
    ACE: Agentic Context Engineering
    Core idea: context is not static text, but a "tactical playbook" that
    continuously evolves with the Agent's execution experience.
    Three stages form a cycle: Generate → Reflect → Curate
    """
    def __init__(self, agent, base_context):
        self.agent = agent            # the Agent that executes tasks and analyzes feedback
        self.playbook = base_context  # initial context (tactical playbook)
        self.experience_buffer = []   # experience buffer

    # Stage 1: Generate
    def generate(self, task):
        """
        Agent executes the task using the current playbook,
        collecting feedback from the execution process (success/failure/surprises)
        """
        result = self.agent.execute(task, context=self.playbook)
        # collect_natural_feedback is a placeholder for environment-specific feedback extraction
        feedback = self.collect_natural_feedback(result)
        self.experience_buffer.append({
            "task": task,
            "result": result,
            "feedback": feedback,  # natural execution feedback, no manual annotation needed
        })
        return result

    # Stage 2: Reflect
    def reflect(self):
        """
        Analyze execution feedback in the experience buffer,
        identify areas in the playbook that need improvement
        """
        insights = self.agent.analyze(
            prompt="Analyze the following execution experiences, identify success patterns and failure causes:",
            data=self.experience_buffer
        )
        return insights  # e.g., "When encountering nested JSON, validate schema first"

    # Stage 3: Curate
    def curate(self, insights):
        """
        Key innovation: structured incremental updates, not full rewrites.
        - New strategies added as "patches" to the playbook
        - Outdated strategies are marked and cleaned up
        - Preserves detail depth, prevents context collapse
        """
        # incremental_update is a placeholder for the patch-merging logic
        self.playbook = self.incremental_update(
            current=self.playbook,
            new_insights=insights,
            mode="structured_patch"  # incremental patch, not full rewrite
        )

    def evolution_loop(self, tasks):
        """Complete evolution loop"""
        for task in tasks:
            self.generate(task)
            if len(self.experience_buffer) >= 5:  # reflect every 5 tasks
                insights = self.reflect()
                self.curate(insights)
                self.experience_buffer.clear()
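The `incremental_update` step is left abstract above. A minimal sketch of what a structured patch could look like, assuming the playbook is stored as a dictionary of entries; the patch format and the `apply_patch` name are illustrative inventions, not the schema used in the ACE paper:

```python
def apply_patch(playbook, patch):
    """Apply one structured delta to a playbook stored as {entry_id: text}.

    Supported operations (an illustrative format, not ACE's actual schema):
      {"op": "add", "id": ..., "text": ...}   add or overwrite a single entry
      {"op": "retire", "id": ...}             remove an outdated entry
    Entries not named in the patch are left untouched, so their details cannot erode.
    """
    updated = dict(playbook)  # copy so the previous version stays available
    if patch["op"] == "add":
        updated[patch["id"]] = patch["text"]
    elif patch["op"] == "retire":
        updated.pop(patch["id"], None)
    else:
        raise ValueError(f"unknown op: {patch['op']}")
    return updated


playbook = {
    "json-1": "When encountering nested JSON, validate schema first",
    "api-1": "Retry HTTP 429 responses after 1 second",
}
patches = [
    {"op": "add", "id": "api-2", "text": "Paginate list endpoints until next_page is null"},
    {"op": "retire", "id": "api-1"},
]
for patch in patches:
    playbook = apply_patch(playbook, patch)
```

Because each patch touches one entry, the untouched `json-1` strategy keeps its exact wording, in contrast to a full rewrite where every entry is re-summarized on every update.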
ACE Experimental Results
| Benchmark | Baseline Performance | ACE Improvement | Notes |
|---|---|---|---|
| AppWorld (Agent tasks) | Baseline model | +10.6% | Using smaller open-source models, on par with top production Agents |
| Finance domain tasks | Baseline model | +8.6% | Domain knowledge continuously accumulates through iterations |
| Adaptation latency | Fine-tuning approach | Significantly reduced | No retraining needed, only update context |
| Deployment cost | Fine-tuning approach | Significantly reduced | One context applies to all instances |
💡 Why this matters: ACE proves an exciting possibility: Agents can self-improve by optimizing context, without fine-tuning model weights. This means that even using smaller open-source models, careful context engineering can achieve performance comparable to large commercial models. For resource-constrained teams, this is an extremely cost-effective path.
Context Caching: The Economics of Context Reuse
Problem: Repeatedly Paying the "Context Tax"
In the traditional model, every API call requires resending the complete System Prompt + tool definitions + conversation history. If your Agent has an 8K token system prompt, you pay for those 8K tokens every round of conversation.
# Traditional mode: send everything completely each time
for user_message in conversation:
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": system_prompt},  # 8K tokens, repeated every time
            *conversation_history,                          # continuously growing
            {"role": "user", "content": user_message},
        ],
        tools=tool_definitions,  # 2K tokens, repeated every time
    )
# If the conversation is 100 rounds, system_prompt is "billed" 100 times
Solution: Prompt Caching
In 2024–2025, major vendors successively launched Prompt Caching features. By 2026, this has become a standard optimization for Agent development:
# Anthropic Prompt Caching example (latest 2026 API)
from anthropic import Anthropic
client = Anthropic()
# First call: cache system prompt (cache write has 25% additional fee)
response = client.messages.create(
model="claude-sonnet-4.6",
max_tokens=1024,
system=[
{
"type": "text",
"text": long_system_prompt, # large system prompt
"cache_control": {"type": "ephemeral"} # mark as cacheable
}
],
messages=[{"role": "user", "content": "Hello"}]
)
# Subsequent calls: cache hit, input price reduced by 90%!
# Same cache_control block content unchanged → automatically hits cache
response = client.messages.create(
model="claude-sonnet-4.6",
max_tokens=1024,
system=[
{
"type": "text",
"text": long_system_prompt, # content unchanged → cache hit
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Help me analyze this code"}]
)
# Google Gemini Context Caching example
import datetime
import google.generativeai as genai
# Create a reusable cache (configurable TTL)
cache = genai.caching.CachedContent.create(
model="gemini-3.1-pro",
display_name="agent-system-context",
system_instruction="You are an expert coding assistant...",
contents=[
# Can cache large reference documents
genai.upload_file("codebase_summary.txt"),
genai.upload_file("api_documentation.pdf"),
],
ttl=datetime.timedelta(hours=1), # cache for 1 hour
)
# Subsequent calls directly reference the cache
model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("What is the rate limiting strategy for this API?")
# Token cost for cached portion is significantly reduced
The Economics of Caching (March 2026 data)
| Provider | Cache Write Cost | Cache Hit Cost | Savings | Cache TTL |
|---|---|---|---|---|
| Anthropic | Normal price ×1.25 | Normal price ×0.1 | 90% savings on hit | 5 minutes (ephemeral) |
| Google | Normal price ×1.0 | Normal price ×0.25 | 75% savings on hit | Customizable (1min–1h) |
| OpenAI | Normal price ×1.0 | Normal price ×0.5 | 50% savings on hit | Automatically managed |
💡 Impact on Agents: for Agents with long system prompts + multi-round conversations, Prompt Caching can reduce total costs by 40%–70%. This is a pure win optimization — especially under Claude 4.6's 1M token window, the economic benefits of caching large reference documents are even more significant.
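The multipliers in the table translate directly into arithmetic. The sketch below prices only the repeated prefix (the system prompt), in units of normal-price input tokens, using the Anthropic row (×1.25 write, ×0.1 hit). It assumes every later call lands inside the cache TTL, which is a best case; `prompt_cost` is a hypothetical helper.

```python
def prompt_cost(prefix_tokens, calls, write_mult=1.0, hit_mult=1.0):
    """Cost of resending a fixed prefix, in units of normal-price input tokens.

    The first call pays the cache-write multiplier; every later call pays the hit multiplier.
    With the default multipliers this is plain uncached billing.
    """
    if calls == 0:
        return 0.0
    return prefix_tokens * (write_mult + hit_mult * (calls - 1))


PREFIX = 8_000   # system prompt size from the earlier example
CALLS = 100      # rounds of conversation

uncached = prompt_cost(PREFIX, CALLS)                                # 1.0x on every call
cached = prompt_cost(PREFIX, CALLS, write_mult=1.25, hit_mult=0.1)   # Anthropic row of the table
savings = 1 - cached / uncached
```

Under these assumptions the prefix costs 800,000 token-units uncached versus 89,200 cached, roughly 89% less, and the 25% write surcharge is already recovered on the second call (1.35 versus 2.0). Whole-conversation savings are lower because the growing history and the output tokens are not part of the cached prefix.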
KV-Cache Optimization: Context Acceleration at the Model Level
What Is KV-Cache?
During Transformer inference, the Key and Value tensors at each layer, once computed, can be cached and reused — this is KV-Cache. It avoids redundant computation of already-processed tokens and is the core technology for efficient autoregressive generation.
# Intuitive understanding of KV-Cache
class TransformerWithKVCache:
    """
    Without KV-Cache: generating the Nth token requires recomputing attention for all N-1 previous tokens
    With KV-Cache: K, V for the first N-1 tokens are already cached, only compute attention for the new token
    Attention cost per generated token: O(N²) → O(N)
    """
    def generate_next_token(self, input_ids, past_kv_cache=None):
        # self.attention and self.sample are placeholders for the forward pass and the decoding step
        if past_kv_cache is not None:
            # Only process the latest token against the cached K, V
            hidden, new_token_kv = self.attention(input_ids[-1:], past_kv_cache)
            updated_cache = concat(past_kv_cache, new_token_kv)
        else:
            # First call (prefill): process all tokens and build the cache
            hidden, updated_cache = self.attention(input_ids)
        next_token = self.sample(hidden)  # choose the next token from the last position's output
        return next_token, updated_cache
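The claim that caching changes the cost but not the result can be checked on a toy single-head attention written in plain Python. This is a sketch under simplifying assumptions: `project` is an elementwise scaling standing in for learned K/V projections, and the query is the raw token vector.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def project(x, scale):
    """Stand-in for a learned projection (assumption: simple elementwise scaling)."""
    return [scale * xi for xi in x]


def last_output_without_cache(tokens):
    """Recompute K and V for every token, then attend from the last position."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(tokens[-1], keys, values)


def step_with_cache(token, cache):
    """Compute K and V for the new token only, append them to the cache, then attend."""
    cache["k"].append(project(token, 0.5))
    cache["v"].append(project(token, 2.0))
    return attend(token, cache["k"], cache["v"])


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = {"k": [], "v": []}
for token in tokens:
    cached_out = step_with_cache(token, cache)   # one projection per step
full_out = last_output_without_cache(tokens)     # N projections for the same answer
```

Both paths produce the same output for the last position; the cached path simply avoids re-projecting earlier tokens at every step, at the price of keeping `cache` in memory.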
New KV-Cache Optimization Techniques in 2025–2026
As context windows expand to the million-token level, KV-Cache memory usage becomes a critical bottleneck. Here are the latest optimization solutions:
1. MLA (Multi-head Latent Attention) — DeepSeek's Continued Innovation
# MLA was introduced in DeepSeek-V2 and carried into DeepSeek-V3/R1; widely studied in 2025-2026
# Core idea: compress KV into a low-dimensional latent space
# Effect: KV-Cache size is only ~5% of standard MHA
class MultiHeadLatentAttention:
    """
    Standard MHA: cache_size = num_layers × num_heads × seq_len × head_dim × 2
    MLA: cache_size = num_layers × seq_len × latent_dim × 2
    When latent_dim << num_heads × head_dim, cache size is dramatically reduced
    """
    def compress_kv(self, keys, values):
        # Project high-dimensional KV into low-dimensional latent space
        latent = self.down_proj(concat(keys, values))
        return latent  # only cache this compressed representation

    def restore_kv(self, latent):
        # Restore KV from latent space during inference
        keys, values = self.up_proj(latent).split(2)
        return keys, values
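Plugging numbers into the two docstring formulas shows the scale of the difference. The model shape below is hypothetical and chosen only to make the arithmetic easy to follow; it is not the configuration of any DeepSeek model.

```python
def mha_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """Standard MHA: keys and values (x2) are cached for every head in every layer."""
    return num_layers * num_heads * seq_len * head_dim * 2 * bytes_per_value


def latent_cache_bytes(num_layers, latent_dim, seq_len, bytes_per_value=2):
    """Latent cache, following the docstring formula above (x2 kept as written there)."""
    return num_layers * seq_len * latent_dim * 2 * bytes_per_value


# Hypothetical shape: 32 layers, 32 heads of width 128, a 128K-token sequence, 2-byte values.
mha = mha_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=128_000)
mla = latent_cache_bytes(num_layers=32, latent_dim=256, seq_len=128_000)
ratio = mha / mla  # equals (num_heads * head_dim) / latent_dim
```

For this shape the standard cache needs about 67 GB for a single 128K-token sequence while the latent cache needs about 4.2 GB, a 16x reduction that depends only on the ratio of `num_heads × head_dim` to `latent_dim`.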
2. ChunkKV — Semantics-Preserving KV-Cache Compression (NeurIPS 2025)
# ChunkKV: semantics-preserving KV-Cache compression method proposed in 2025
# Core idea: not token-by-token eviction, but retain or evict entire "semantic chunks"
class ChunkKV:
    """
    Traditional methods (like H2O) evaluate importance token by token → easily breaks semantic coherence
    ChunkKV divides KV-Cache into semantically coherent chunks → chunk-level retention/eviction
    Achieves SOTA performance at 10% compression ratio
    """
    def compress(self, kv_cache, compression_ratio=0.1):
        # semantic_chunking, score_chunk and merge_chunks are placeholders for the method's components
        # 1. Divide KV-Cache into chunks by semantic similarity
        chunks = self.semantic_chunking(kv_cache)
        # 2. Evaluate the overall importance of each chunk
        chunk_scores = [self.score_chunk(chunk) for chunk in chunks]
        # 3. Retain the most important chunks (maintaining semantic integrity)
        keep_count = max(1, int(len(chunks) * compression_ratio))  # never drop everything
        top_chunks = sorted(
            zip(chunks, chunk_scores),
            key=lambda x: -x[1]
        )[:keep_count]
        return merge_chunks([c for c, _ in top_chunks])
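Chunk-level eviction can be illustrated on a list of per-token importance scores. This sketch makes two simplifying assumptions that differ from the published method: fixed-size chunks stand in for semantic chunking, and the scores are supplied by hand instead of being derived from attention weights.

```python
def chunk_compress(token_scores, chunk_size, keep_ratio):
    """Keep whole chunks of token positions, ranked by mean importance.

    Returns the retained token positions in their original order.
    """
    chunks = [
        list(range(start, min(start + chunk_size, len(token_scores))))
        for start in range(0, len(token_scores), chunk_size)
    ]

    def mean_score(chunk):
        return sum(token_scores[i] for i in chunk) / len(chunk)

    keep_count = max(1, int(len(chunks) * keep_ratio))
    kept = sorted(chunks, key=mean_score, reverse=True)[:keep_count]
    return sorted(pos for chunk in kept for pos in chunk)


# Twelve tokens in four chunks of three; chunks 2 and 4 carry the important content.
scores = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.1, 0.1, 0.2, 0.6, 0.9, 0.5]
kept_positions = chunk_compress(scores, chunk_size=3, keep_ratio=0.5)
```

Token 11 (score 0.5) survives because its chunk is important overall, whereas a token-by-token top-k would keep it or drop it in isolation and could split a phrase in half.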
3. RocketKV — Two-Stage Compression for Long-Context Inference Acceleration (2025)
# RocketKV: two-stage KV-Cache compression for long-context LLM inference
class RocketKV:
    """
    Stage 1 (coarse filter): quickly eliminate obviously unimportant tokens based on attention scores
    Stage 2 (fine selection): fine-grained importance evaluation and retention of remaining tokens
    Effect: inference speed improved 2-4x while maintaining quality
    """
    def two_stage_compress(self, kv_cache):
        # Stage 1: fast coarse filter (low computational cost)
        coarse_mask = self.coarse_filter(kv_cache, keep_ratio=0.3)
        candidates = kv_cache[coarse_mask]
        # Stage 2: fine selection (high-quality retention)
        fine_mask = self.fine_select(candidates, keep_ratio=0.5)
        return candidates[fine_mask]  # final retention of ~15% of KV (0.3 × 0.5)
4. Comprehensive Comparison
| Technique | Principle | Compression Ratio | Quality Loss | Published/Adopted |
|---|---|---|---|---|
| GQA | Multiple query heads share KV | 4–8x | Extremely low | 2023, now mainstream standard |
| MLA (DeepSeek) | KV projected to low-dimensional latent space | ~20x | Extremely low | 2024, adopted by DeepSeek series |
| KV-Cache Quantization (INT8/FP8) | Reduce numerical precision | 2–4x | Extremely low | 2024+, widely adopted |
| H2O (Heavy-Hitter Oracle) | Only keep KV of "important" tokens | 5–20x | Low (task-dependent) | 2024 |
| ChunkKV | Semantic chunk-level retention/eviction | 3–10x | Low | NeurIPS 2025 |
| RocketKV | Two-stage coarse filter + fine selection | 5–7x | Low | 2025 |
| SCOPE | Decoding phase optimization | 3–5x | Low | ACL 2025 |
| StreamingLLM | Attention sink + sliding window | Dynamic | Moderate | 2024+ |
💡 Impact on Agents: these underlying optimizations allow model vendors to offer longer contexts at lower cost. As an Agent developer, you don't need to implement these techniques yourself, but understanding them helps make better model selection and architecture decisions — for example, when using DeepSeek series models, MLA's low memory overhead makes long-context inference feasible even on consumer-grade GPUs.
Production-Grade Context Management Patterns
Pattern 1: Tiered Context Architecture
In production-grade Agents, context is not a flat messages list but is hierarchically organized:
class TieredContextManager:
    """
    Tiered context architecture (reference: Anthropic methodology)
    L0: System core (always retained)           ~2K tokens
    L1: Task context (current task related)     ~4K tokens
    L2: Working memory (recent interactions)    ~8K tokens
    L3: Reference materials (retrieved on demand) ~dynamic
    """
    def __init__(self, system_prompt, max_tokens=128000):
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens
        self.layers = {
            "L0_system": {
                "budget": 2000,
                "priority": "NEVER_DROP",
                "content": None  # system prompt, role definition
            },
            "L1_task": {
                "budget": 4000,
                "priority": "HIGH",
                "content": None  # current task goal, constraints
            },
            "L2_working": {
                "budget": 8000,
                "priority": "MEDIUM",
                "content": None  # recent conversations and intermediate results
            },
            "L3_reference": {
                "budget": None,  # dynamically allocate remaining space
                "priority": "LOW",
                "content": None  # RAG retrieval results, document fragments
            },
        }

    def build_context(self, task, history, retrieved_docs):
        """Build priority-ordered context"""
        # count_tokens, format_task, compress_history and select_references are placeholder helpers
        context = []
        used_tokens = 0
        # L0: System core (always included)
        context.append({"role": "system", "content": self.system_prompt})
        used_tokens += count_tokens(self.system_prompt)
        # L1: Current task (always included)
        task_context = self.format_task(task)
        context.append({"role": "system", "content": task_context})
        used_tokens += count_tokens(task_context)
        # L2: Working memory (keep most recent N rounds, compress if necessary)
        remaining = self.max_tokens - used_tokens - 4000  # reserve 4K for output
        working_budget = min(self.layers["L2_working"]["budget"], remaining // 2)
        working_memory = self.compress_history(history, budget=working_budget)
        context.extend(working_memory)
        used_tokens += count_tokens(working_memory)
        # L3: Reference materials (fill remaining space)
        remaining = self.max_tokens - used_tokens - 4000
        if remaining > 500 and retrieved_docs:
            selected = self.select_references(retrieved_docs, budget=remaining)
            context.append({"role": "system", "content": f"Reference materials:\n{selected}"})
        return context
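The budget arithmetic inside `build_context` can be isolated and checked on its own. The helper below is a sketch that mirrors that arithmetic using token counts only; `allocate_budgets` is a hypothetical name and the input sizes are invented.

```python
def allocate_budgets(max_tokens, system_tokens, task_tokens, history_tokens,
                     output_reserve=4000, working_cap=8000):
    """Split a context window across tiers in priority order (L0 > L1 > L2 > L3).

    All inputs and outputs are token counts.
    """
    used = system_tokens + task_tokens                            # L0 + L1 are always included
    remaining = max(0, max_tokens - used - output_reserve)
    working = min(history_tokens, working_cap, remaining // 2)    # L2 is capped
    reference = max(0, max_tokens - used - working - output_reserve)  # L3 gets the rest
    return {
        "L0_system": system_tokens,
        "L1_task": task_tokens,
        "L2_working": working,
        "L3_reference": reference,
    }


budgets = allocate_budgets(max_tokens=128_000, system_tokens=2_000,
                           task_tokens=4_000, history_tokens=30_000)
```

With a 128K window, 30K tokens of raw history are squeezed into the 8K working-memory cap, and the remaining 110K is available for reference material once the 4K output reserve is set aside.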
Pattern 2: Context Compaction
This is the production pattern used by Anthropic in Claude Code — when context approaches the limit, automatically call the model to summarize history, then replace the original conversation with the summary:
class ContextCompactor:
    """
    Context compactor (reference: Claude Code implementation pattern)
    Automatically triggers compaction when token usage exceeds threshold.
    Key improvements (2025-2026):
    - Tool result clearing: safest lightweight compression, only cleans old tool outputs
    - Structured summaries: preserve key decisions and operation results
    - Progressive compression: multi-level compression rather than all-at-once
    """
    def __init__(self, model, threshold_ratio=0.8):
        self.model = model
        self.threshold_ratio = threshold_ratio

    def maybe_compact(self, messages, max_tokens):
        """Check if compaction is needed"""
        # count_tokens and format_messages are placeholder helpers
        current_usage = count_tokens(messages)
        if current_usage < max_tokens * self.threshold_ratio:
            return messages  # haven't reached threshold, no compaction needed
        # Try lightweight compaction first
        messages = self.clear_old_tool_results(messages)
        if count_tokens(messages) < max_tokens * self.threshold_ratio:
            return messages  # lightweight compaction was sufficient
        # Still over limit, trigger full compaction
        return self.full_compact(messages)

    def clear_old_tool_results(self, messages):
        """
        Lightweight compaction: clear old tool return results.
        Anthropic's recommended "safest form of compression."
        """
        result = []
        for i, msg in enumerate(messages):
            if (msg.get("role") == "tool" and
                    i < len(messages) - 8):  # only clear older tool results
                result.append({
                    **msg,  # keep fields such as name / tool_call_id so the message stays valid
                    "content": f"[Executed: {msg.get('name', 'tool')} → result archived]"
                })
            else:
                result.append(msg)
        return result

    def full_compact(self, messages):
        """Full compaction"""
        # Separate: protected zone (not compressed) vs. compression zone
        system_msgs = [m for m in messages if m["role"] == "system"]
        dialogue = [m for m in messages if m["role"] != "system"]
        recent_msgs = dialogue[-6:]  # keep the most recent 3 rounds verbatim
        old_msgs = dialogue[:-6]     # older history to compress
        if not old_msgs:
            return messages
        # Have the model generate a structured summary
        summary = self.model.chat([
            {"role": "system", "content": """
Please compress the following conversation history into a structured summary. Preserve:
1. The user's core goals and requirements
2. Key operations completed and their results (including specific file paths, values, error messages)
3. Important decisions and their rationale
4. Current work status and to-do items
Discard: repeated attempts, verbose tool outputs, small talk.
Format: use structured lists, ensure key details are not lost.
"""},
            {"role": "user", "content": format_messages(old_msgs)}
        ])
        # Replace original history with summary
        compacted = system_msgs + [
            {"role": "system", "content": f"[Conversation History Summary]\n{summary}"}
        ] + recent_msgs
        return compacted
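The lightweight step can be exercised without a model. The sketch below assumes a crude estimate of four characters per token and a made-up conversation; the function names mirror the class above but are standalone illustrations.

```python
def count_tokens(messages):
    """Crude estimate (assumption: about 4 characters per token)."""
    return sum(len(m["content"]) for m in messages) // 4


def clear_old_tool_results(messages, keep_recent=8):
    """Replace tool outputs older than the last `keep_recent` messages with a short stub."""
    cutoff = len(messages) - keep_recent
    return [
        {**m, "content": "[tool result archived]"} if m["role"] == "tool" and i < cutoff else m
        for i, m in enumerate(messages)
    ]


def maybe_compact(messages, max_tokens, threshold_ratio=0.8):
    """Apply the lightweight step only when usage crosses the threshold."""
    if count_tokens(messages) < max_tokens * threshold_ratio:
        return messages
    return clear_old_tool_results(messages)


# Twelve messages: one request, ten bulky tool outputs, one follow-up.
history = [{"role": "user", "content": "List the repo files"}]
history += [{"role": "tool", "content": "x" * 4000} for _ in range(10)]
history += [{"role": "user", "content": "Now summarize"}]

before = count_tokens(history)
compacted = maybe_compact(history, max_tokens=10_000)
after = count_tokens(compacted)
archived = sum(1 for m in compacted if m["content"] == "[tool result archived]")
```

Here archiving only the three oldest tool outputs is enough to bring usage back under the 80% threshold, so the more expensive model-generated summary is never needed.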
Pattern 3: Dynamic Tool Context
Agents often register many tools, but only use a few for each task. Dynamic tool loading intelligently selects which tool definitions to expose to the model based on the current task:
class DynamicToolContext:
    """
    Dynamic tool context management.
    Instead of stuffing all 50 tool definitions into the context,
    only expose the 5-10 most relevant ones based on the current task.
    This is also Anthropic's recommended pattern:
    "If a human engineer can't determine when to use which tool, neither can AI"
    → So reduce the number of tools to make choices clearer
    """
    def __init__(self, all_tools, embedding_model):
        self.all_tools = all_tools
        self.embedding_model = embedding_model
        # Pre-compute embeddings for all tool descriptions
        self.tool_embeddings = {
            tool.name: embedding_model.embed(tool.description)
            for tool in all_tools
        }

    def select_tools(self, user_message, task_context, top_k=8):
        """Select the most relevant tools based on current context"""
        query = f"{task_context}\n{user_message}"
        query_embedding = self.embedding_model.embed(query)
        # Sort by semantic similarity (cosine_similarity is a placeholder helper)
        scores = {
            name: cosine_similarity(query_embedding, emb)
            for name, emb in self.tool_embeddings.items()
        }
        # Always include core tools
        core_tools = [t for t in self.all_tools if t.is_core]
        # Supplement with semantically most relevant tools
        sorted_tools = sorted(scores.items(), key=lambda x: -x[1])
        selected_names = {t.name for t in core_tools}
        for name, score in sorted_tools:
            if len(selected_names) >= top_k:
                break
            if score > 0.3 and name not in selected_names:
                selected_names.add(name)
        return [t for t in self.all_tools if t.name in selected_names]
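The selection logic can be tried without an embedding model. In the sketch below a bag-of-words count vector stands in for real embeddings (an assumption made purely for illustration), and the tool catalogue is invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption: a bag-of-words count vector stands in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_tools(tools, query, top_k=2, min_score=0.1):
    """Return up to top_k tool names whose descriptions best match the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(desc)), name) for name, desc in tools.items()]
    ranked = sorted(scored, reverse=True)
    return [name for score, name in ranked[:top_k] if score >= min_score]


tools = {
    "glob_search": "search files by filename pattern",
    "content_search": "search files by content regex",
    "send_email": "send an email message to a recipient",
    "run_sql": "execute a sql query against the database",
}
chosen = select_tools(tools, "find files whose content matches a regex")
```

Only the two file-search tools are exposed for this query; the email and SQL definitions never consume context. A real embedding model would also catch paraphrases that share no words with the description.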
Frontier Research Directions
1. Retrieval-Augmented Context
Combining RAG (Chapter 7) with context engineering — not stuffing all information into context, but establishing an "on-demand retrieval" mechanism. This is consistent with Anthropic's JIT strategy:
# Traditional approach: put all potentially relevant documents into context
messages = [
{"role": "system", "content": system_prompt},
{"role": "system", "content": f"Reference documents:\n{all_documents}"}, # possibly 50K tokens
{"role": "user", "content": user_query},
]
# Retrieval-augmented approach: only retrieve when needed (JIT thinking)
messages = [
{"role": "system", "content": system_prompt},
{"role": "system", "content": "You have a search_knowledge tool; actively retrieve when you need information"},
{"role": "user", "content": user_query},
]
# Model will actively call search_knowledge → only retrieve the truly needed ~2K tokens
2. Structured Context Protocol
More and more research is exploring the use of structured formats (XML, JSON Schema) to organize context, helping models better "understand" the structure of context. Anthropic recommends using XML tags to delineate different semantic regions in its guidelines:
<!-- Structured context example (Anthropic recommended pattern) -->
<context>
  <system priority="critical">
    <role>You are a code review assistant</role>
    <constraints>
      <constraint>Only review security and performance issues</constraint>
      <constraint>Output format must be a standardized review report</constraint>
    </constraints>
  </system>
  <task priority="high">
    <objective>Review code changes in PR #1234</objective>
    <files changed="3" additions="45" deletions="12" />
  </task>
  <reference priority="medium">
    <code_diff>...</code_diff>
    <project_conventions>...</project_conventions>
  </reference>
  <history priority="low" compacted="true">
    <summary>User previously requested focus on SQL injection risks...</summary>
  </history>
</context>
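In an Agent, a structure like this is usually generated by code, not written by hand. The sketch below builds part of the same layout with Python's standard `xml.etree.ElementTree` module. The `build_context` function and its parameters are hypothetical; the tag and attribute names simply mirror the example above.

```python
import xml.etree.ElementTree as ET


def build_context(role, constraints, objective, history_summary):
    """Assemble a structured context string with escaped, well-formed XML."""
    root = ET.Element("context")

    system = ET.SubElement(root, "system", priority="critical")
    ET.SubElement(system, "role").text = role
    constraints_el = ET.SubElement(system, "constraints")
    for constraint in constraints:
        ET.SubElement(constraints_el, "constraint").text = constraint

    task = ET.SubElement(root, "task", priority="high")
    ET.SubElement(task, "objective").text = objective

    history = ET.SubElement(root, "history", priority="low", compacted="true")
    ET.SubElement(history, "summary").text = history_summary

    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


context_xml = build_context(
    role="You are a code review assistant",
    constraints=[
        "Only review security and performance issues",
        "Flag functions with complexity > 10 & no tests",
    ],
    objective="Review code changes in PR #1234",
    history_summary="User previously requested focus on SQL injection risks...",
)
print(context_xml)
```

Building the tree through a library keeps the tags balanced and escapes characters such as `&` and `>` in user-supplied text, which string concatenation would silently get wrong.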
3. Multi-Agent Context Sharing
In multi-Agent systems (Chapter 14), cross-Agent context transfer and sharing is an active research direction. The core challenge is: how can multiple Agents collaborate efficiently without each needing to carry the complete context?
import time

class SharedContextStore:
    """
    Multi-Agent shared context store.
    - Each Agent has private context
    - Public information is shared through a Blackboard
    - No Agent has to carry the complete context
    Reference: Anthropic's sub-agent architecture:
    Main agent holds high-level plan, sub-agents only receive context
    needed to complete their current subtask
    """
    def __init__(self):
        self.blackboard = {}  # public blackboard: visible to all Agents
        self.private = {}     # private context, keyed by agent_id

    def publish(self, agent_id, key, value, visibility="public"):
        """Agent publishes information to shared context"""
        if visibility == "public":
            self.blackboard[key] = {
                "value": value,
                "author": agent_id,
                "timestamp": time.time()
            }
        else:
            self.private.setdefault(agent_id, {})[key] = value

    def get_context_for(self, agent_id, task):
        """Build context for a specific Agent"""
        # Task-relevant public information + this Agent's private information.
        # select_relevant is not defined in this listing; it has to filter
        # the blackboard down to the entries related to the task.
        relevant_public = self.select_relevant(self.blackboard, task)
        private = self.private.get(agent_id, {})
        # Private entries override public entries that share a key
        return {**relevant_public, **private}
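The listing above leaves `select_relevant` undefined. One simple way to fill it in, shown here as a standalone function, is keyword overlap between the task description and each blackboard entry. This is an assumption for illustration only: the `min_overlap` parameter and the sample entries are hypothetical, and a production system would more likely use embeddings or explicit topic tags.

```python
import time


def select_relevant(blackboard, task, min_overlap=1):
    """Keep blackboard entries whose key or value shares words with the task."""
    task_words = set(task.lower().split())
    relevant = {}
    for key, entry in blackboard.items():
        # Treat underscores in keys as word separators, e.g. "db_schema"
        entry_text = f"{key} {entry['value']}".lower().replace("_", " ")
        if len(task_words & set(entry_text.split())) >= min_overlap:
            relevant[key] = entry
    return relevant


# Hypothetical blackboard in the same shape that publish() writes
blackboard = {
    "db_schema": {
        "value": "users table has id and email columns",
        "author": "planner",
        "timestamp": time.time(),
    },
    "ui_palette": {
        "value": "primary color is teal",
        "author": "designer",
        "timestamp": time.time(),
    },
}
subtask_context = select_relevant(blackboard, "write a migration for the users table")
print(list(subtask_context))  # ['db_schema']
```

The sub-agent that writes the migration receives the schema note but not the design note, which is the point of the blackboard pattern: shared state exists, but each Agent sees only the slice its subtask needs.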
4. Automated Evaluation of Context Engineering
As context engineering grows in importance, evaluating context quality has become a research direction in its own right:
class ContextQualityMetrics:
    """
    Context quality evaluation metrics.
    As context engineering becomes an independent discipline,
    the evaluation system is also rapidly developing.
    """
    metrics = {
        "Signal-to-Noise Ratio (SNR)": "Effective information tokens / total tokens",
        "Recall Completeness": "Proportion of key information retained (after vs. before compression)",
        "Attention Utilization": "Proportion of tokens the model actually attends to (via attention heatmap analysis)",
        "Redundancy": "Proportion of duplicate or near-duplicate information",
        "Freshness": "Distribution of information recency in context",
        "Task Alignment": "Semantic relevance of context information to the current task",
    }

    def evaluate(self, context, task, model_attention_map=None):
        """Comprehensive evaluation of context quality"""
        # The calc_* helpers are not shown in this listing
        scores = {}
        scores["snr"] = self.calc_signal_noise_ratio(context, task)
        scores["redundancy"] = self.calc_redundancy(context)
        scores["freshness"] = self.calc_freshness(context)
        # Compare against None so that array-like attention maps are handled correctly
        if model_attention_map is not None:
            scores["attention_utilization"] = self.calc_attention_util(
                context, model_attention_map
            )
        return scores
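The class above names its metrics but does not define how they are computed. As a minimal sketch, two of them can be approximated with nothing more than string operations. The definitions below are assumptions, not an established standard: redundancy counts exact duplicates after whitespace and case normalization, and SNR treats a chunk as "signal" if it shares at least one word with the task.

```python
def calc_redundancy(chunks):
    """Share of chunks that repeat an earlier chunk (case/whitespace-insensitive)."""
    seen = set()
    duplicates = 0
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        if normalized in seen:
            duplicates += 1
        else:
            seen.add(normalized)
    return duplicates / len(chunks) if chunks else 0.0


def calc_signal_noise_ratio(chunks, task):
    """Tokens in task-related chunks divided by all tokens (whitespace tokens)."""
    task_words = set(task.lower().split())
    total = 0
    signal = 0
    for chunk in chunks:
        words = chunk.lower().split()
        total += len(words)
        if task_words & set(words):
            signal += len(words)
    return signal / total if total else 0.0


# Hypothetical context, split into chunks
chunks = [
    "fix the login timeout bug",
    "Fix the  login timeout bug",   # duplicate up to case and spacing
    "lunch menu for friday",        # unrelated to the task
    "timeout is set in config.yaml",
]
scores = {
    "redundancy": calc_redundancy(chunks),
    "snr": calc_signal_noise_ratio(chunks, "login timeout"),
}
print(scores)  # redundancy 0.25, snr 15/19 (about 0.79)
```

Proxies like these are cheap enough to run on every request, so they work as regression checks when a compaction or retrieval strategy changes. They do not capture near-duplicates or semantic relevance, which need embeddings or a model-based judge.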
Section Summary
| Advancement Direction | Core Breakthrough | Practical Impact on Agent Development |
|---|---|---|
| Million-token context windows | Mainstream models reach 1M tokens by 2026 | Entire book/codebase-level input becomes possible, but quality management becomes more critical |
| Anthropic methodology | JIT context, structured prompts, tool design principles | Industry's first systematic production-grade context engineering guide |
| ACE self-evolution framework | Agents automatically optimize context through execution feedback | Self-improvement without fine-tuning; small model + good context ≈ large model |
| Prompt Caching | Caching and reusing repeated context | Multi-round conversation Agent costs reduced by 40%–70% |
| New KV-Cache techniques | ChunkKV/RocketKV/MLA etc. | Longer context + lower latency + lower memory consumption |
| Tiered context architecture | Priority-based hierarchical management | Standard pattern for production-grade Agents |
| Context compaction | Tool result clearing + structured summaries | Long-horizon tasks no longer limited by window size |
| Dynamic tool context | Load tool definitions on demand | Agents with many tools can save significant context space |
⏰ Note: Context management technology evolves rapidly. The data in this section is current as of March 2026. For the latest information, follow the Anthropic Engineering Blog, the LangChain Blog, and the API changelogs of the model vendors.
References
[1] KARPATHY A. Context engineering[EB/OL]. X/Twitter, 2025-06.
[2] ANTHROPIC APPLIED AI TEAM. Effective context engineering for AI agents[EB/OL]. Anthropic Engineering Blog, 2025-09-29.
[3] LANGCHAIN TEAM. Context engineering for agents[EB/OL]. LangChain Blog, 2025-07-02.
[4] ZHANG Q, HU C, UPASANI S, et al. Agentic context engineering: evolving contexts for self-improving language models[C]//ICLR, 2026. arXiv:2510.04618.
[5] LI X, et al. RocketKV: accelerating long-context LLM inference via two-stage KV cache compression[J]. arXiv preprint, 2025.
[6] ChunkKV: semantic-preserving KV cache compression for efficient long-context LLM inference[C]//NeurIPS, 2025. arXiv:2502.00299.
[7] SCOPE: optimizing key-value cache compression in long-context generation[C]//ACL, 2025.
Next Chapter: Chapter 9 Skill System