Short-Term Memory: Conversation History Management

Short-term memory is the most fundamental form of memory — maintaining conversation history so the Agent knows "what we just talked about."

Why does this matter? LLMs are inherently stateless — each call is a completely fresh request, and the model does not "remember" the content of previous conversations. If the user says "My name is Alex" in the first turn and asks "What's my name?" in the second turn, the model will be completely lost if the previous conversation history is not passed back. Short-term memory solves exactly this problem: by attaching previous conversation records to each request, the model "appears" to have memory.
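The mechanics can be sketched in a few lines: the caller, not the model, owns the transcript, and every request carries the accumulated message list. (A minimal sketch; the variable names are illustrative.)

```python
# The caller owns the transcript; the model sees only what each request carries.
history = [
    {"role": "user", "content": "My name is Alex"},
    {"role": "assistant", "content": "Nice to meet you, Alex!"},
]

# Turn 2: append the new question to the accumulated history...
history.append({"role": "user", "content": "What's my name?"})

# ...and send the WHOLE list, not just the last message, as `messages`
# in client.chat.completions.create(model=..., messages=history).
print(len(history))  # 3 messages in the second request
```

Because the first turn rides along in the second request, the model can answer "Alex" — the "memory" lives entirely in the request payload.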

But this creates a new problem: the LLM's context window is limited (GPT-4o's is 128K tokens). As conversations grow longer, token consumption rises sharply, which is not only costly but can also exceed the model's processing limit. We therefore need conversation history management strategies that balance "memory completeness" against "resource efficiency."

This section introduces three strategies: full retention, sliding window, and summary compression, each suited to different conversation length scenarios.

Comparison of three short-term memory strategies

Basic Conversation History Management

Let's start with the simplest approach: retaining the complete conversation history. This ConversationHistory class encapsulates message addition, token counting, and API calls. Note that it uses tiktoken to precisely calculate the token count of each message — this is critical for the window truncation strategy later.

from openai import OpenAI
from dataclasses import dataclass, field
from typing import Optional
import tiktoken

client = OpenAI()

@dataclass
class Message:
    role: str  # "user", "assistant", "system", "tool"
    content: str
    token_count: int = 0

class ConversationHistory:
    """Conversation history manager"""
    
    def __init__(self, system_prompt: Optional[str] = None, model: str = "gpt-4o"):
        self.model = model
        self.encoding = tiktoken.encoding_for_model(model)
        self.messages: list[Message] = []
        
        if system_prompt:
            self.add_message("system", system_prompt)
    
    def count_tokens(self, text: str) -> int:
        """Count the number of tokens in a text"""
        return len(self.encoding.encode(text))
    
    def add_message(self, role: str, content: str):
        """Add a message"""
        token_count = self.count_tokens(content)
        msg = Message(role=role, content=content, token_count=token_count)
        self.messages.append(msg)
        return msg
    
    def total_tokens(self) -> int:
        """Calculate total token count"""
        return sum(m.token_count for m in self.messages)
    
    def to_api_format(self) -> list[dict]:
        """Convert to OpenAI API format"""
        return [{"role": m.role, "content": m.content} for m in self.messages]
    
    def chat(self, user_message: str) -> str:
        """Conduct one round of conversation"""
        self.add_message("user", user_message)
        
        response = client.chat.completions.create(
            model=self.model,
            messages=self.to_api_format()
        )
        
        reply = response.choices[0].message.content
        self.add_message("assistant", reply)
        
        return reply
    
    def get_stats(self) -> dict:
        """Get history statistics"""
        return {
            "message_count": len(self.messages),
            "total_tokens": self.total_tokens(),
            "user_messages": sum(1 for m in self.messages if m.role == "user"),
            "assistant_messages": sum(1 for m in self.messages if m.role == "assistant"),
        }

Sliding Window: Controlling History Length

The problem with full retention is obvious: as the number of conversation turns increases, token consumption grows linearly. When a conversation exceeds 50 turns, the conversation history alone may occupy tens of thousands of tokens.
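This growth is easy to quantify: per request, the tokens sent scale linearly with the turn number, but since each request re-sends all prior turns, the cumulative tokens billed across the whole conversation grow quadratically. A back-of-envelope sketch (the 200-tokens-per-turn average is an assumption for illustration):

```python
# Back-of-envelope cost of full retention: each request re-sends all
# prior history, so per-request tokens grow linearly with the turn
# number, and cumulative tokens billed grow quadratically.
per_turn = 200  # assumed average tokens per turn (user + assistant)

def request_tokens(turn: int) -> int:
    # all previous turns' history, plus the new user message (~half a turn)
    return (turn - 1) * per_turn + per_turn // 2

print(request_tokens(10))   # 1900 tokens sent on turn 10
print(request_tokens(50))   # 9900 tokens sent on turn 50
print(sum(request_tokens(t) for t in range(1, 51)))  # 250000 cumulative
```

By turn 50, a single request already carries nearly 10K tokens of history, and the conversation as a whole has billed a quarter-million tokens — which is exactly what the sliding window is designed to cap.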

Sliding window is the most intuitive solution — only keep the most recent N turns of conversation and discard earlier records. This is analogous to the capacity limit of human short-term memory (the famous "7±2" rule in psychology).

The implementation below has two truncation dimensions: max_turns (maximum number of turns) and max_tokens (maximum token count). This dual safeguard is important — sometimes a single turn contains very long code, and turn count alone cannot effectively control token consumption. The _get_window_messages method first truncates by turn count, then checks whether the token count exceeds the limit, and if so, further removes the earliest turns.

Note that system_prompt is always retained and does not participate in truncation — this is because the system prompt defines the Agent's role and behavior, and losing it would cause the Agent to "lose its identity."

class SlidingWindowMemory:
    """Sliding window conversation manager: only keeps the most recent N turns"""
    
    def __init__(
        self,
        system_prompt: Optional[str] = None,
        max_turns: int = 10,       # Maximum turns to retain
        max_tokens: int = 8000,    # Maximum token count
        model: str = "gpt-4o-mini"
    ):
        self.model = model
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.all_messages: list[dict] = []  # Complete history (not sent to LLM)
        
        self.encoding = tiktoken.encoding_for_model(model)
    
    def _count_tokens(self, messages: list[dict]) -> int:
        """Approximate total token count for a list of messages
        (counts content only; per-message chat-format overhead is ignored)"""
        total = 0
        for msg in messages:
            total += len(self.encoding.encode(msg.get("content", "")))
        return total
    
    def _get_window_messages(self) -> list[dict]:
        """Get messages within the sliding window"""
        # System prompt is always retained
        result = []
        if self.system_prompt:
            result.append({"role": "system", "content": self.system_prompt})
        
        # Only take the most recent max_turns * 2 messages (each turn = user + assistant)
        recent = self.all_messages[-(self.max_turns * 2):]
        
        # If token count exceeds limit, truncate further
        while recent and self._count_tokens(result + recent) > self.max_tokens:
            recent = recent[2:]  # Remove the earliest turn (2 messages) each time
        
        return result + recent
    
    def chat(self, user_message: str) -> str:
        """Chat using the sliding window"""
        self.all_messages.append({"role": "user", "content": user_message})
        
        # Use messages within the window
        window_messages = self._get_window_messages()
        
        response = client.chat.completions.create(
            model=self.model,
            messages=window_messages
        )
        
        reply = response.choices[0].message.content
        self.all_messages.append({"role": "assistant", "content": reply})
        
        # Display window info
        print(f"[Memory] Total history: {len(self.all_messages)} messages | "
              f"Window in use: {len(window_messages)} messages | "
              f"Window tokens: {self._count_tokens(window_messages)}")
        
        return reply

# Test
memory = SlidingWindowMemory(
    system_prompt="You are a programming assistant",
    max_turns=5
)

for i in range(8):  # Simulate a long conversation
    reply = memory.chat(f"This is question {i+1}, please answer in one sentence: what is Python?")
    print(f"Q{i+1}: {reply[:50]}...")
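The truncation logic itself is pure list manipulation and can be checked without any API calls. The sketch below mirrors the two-stage cut in _get_window_messages, substituting word counts for real token counts (an illustrative stand-in):

```python
# Standalone mirror of the _get_window_messages truncation, with word
# counts standing in for tokens so it runs without tiktoken or an API key.
def window(messages: list[dict], max_turns: int, max_tokens: int) -> list[dict]:
    # Stage 1: keep only the most recent max_turns * 2 messages
    recent = messages[-(max_turns * 2):]
    count = lambda msgs: sum(len(m["content"].split()) for m in msgs)
    # Stage 2: if still over budget, drop the oldest turn (2 messages) at a time
    while recent and count(recent) > max_tokens:
        recent = recent[2:]
    return recent

# 8 turns of 2 messages each; every message is 10 "tokens" (words) long
msgs = [{"role": r, "content": f"turn {i} " * 5}
        for i in range(1, 9) for r in ("user", "assistant")]

kept = window(msgs, max_turns=5, max_tokens=40)
print(len(kept))  # 4 — only the last two turns fit the 40-token budget
```

Stage 1 keeps the last 5 turns (10 messages, 100 "tokens"); stage 2 then trims to the last 2 turns to fit the budget — exactly the dual safeguard described above.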

Summary Compression: Intelligently Compressing History

The downside of the sliding window is its "all-or-nothing" nature — conversations outside the window are discarded entirely, regardless of whether they contain important information. If the user introduced their background and requirements at the beginning of the conversation, the Agent completely forgets it once the window slides past.

Summary compression is a smarter approach: when the conversation history exceeds a threshold, use an LLM to compress the old conversation into a concise summary, then only retain "summary + recent messages." This controls the total token count while preserving key information from early conversations (such as user preferences, project background, and decisions already made).

This approach has two costs: each compression requires an additional LLM call (a lightweight model like gpt-4o-mini keeps this cheap), and some detail is inevitably lost in the summary. In practice, summary compression is especially well suited to conversations that are very long but have uneven information density — for example, technical support conversations where the first half may be extensive troubleshooting attempts and the truly valuable parts are the conclusions and decisions.

class SummaryMemory:
    """
    Summary memory manager
    Automatically compresses old conversations into summaries when history exceeds the threshold
    """
    
    def __init__(
        self,
        system_prompt: Optional[str] = None,
        max_tokens_before_summary: int = 3000,
        model: str = "gpt-4o-mini",
        summary_model: str = "gpt-4o-mini"
    ):
        self.model = model
        self.summary_model = summary_model
        self.system_prompt = system_prompt
        self.max_tokens = max_tokens_before_summary
        
        self.summary: str = ""  # Compressed history summary
        self.recent_messages: list[dict] = []  # Recent messages (not compressed)
        
        self.encoding = tiktoken.encoding_for_model(model)
    
    def _count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))
    
    def _should_summarize(self) -> bool:
        """Determine whether compression is needed"""
        total = sum(self._count_tokens(m["content"]) 
                   for m in self.recent_messages)
        return total > self.max_tokens
    
    def _create_summary(self) -> str:
        """Compress the current conversation history into a summary"""
        if not self.recent_messages:
            return self.summary
        
        # Build the compression prompt
        conversation_text = "\n".join([
            f"{m['role'].upper()}: {m['content']}"
            for m in self.recent_messages
        ])
        
        existing_summary = f"Existing summary:\n{self.summary}\n\n" if self.summary else ""
        
        summary_prompt = f"""{existing_summary}Please compress the following conversation into a concise summary, retaining key information, decisions, and user preferences:

{conversation_text}

Requirements:
1. Retain the user's key needs and preferences
2. Retain important decisions and conclusions
3. Ignore small talk and repetitive content
4. Use third-person description, concise and objective
5. No more than 200 words"""
        
        response = client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=400
        )
        
        return response.choices[0].message.content
    
    def _build_messages(self) -> list[dict]:
        """Build the message list to send to the LLM"""
        messages = []
        
        # System prompt
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        
        # History summary (if any)
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"[Conversation History Summary]\n{self.summary}"
            })
        
        # Recent messages
        messages.extend(self.recent_messages)
        
        return messages
    
    def chat(self, user_message: str) -> str:
        """Chat with automatic summary compression"""
        self.recent_messages.append({"role": "user", "content": user_message})
        
        # Check if compression is needed
        if self._should_summarize():
            print("[Memory] Conversation history too long, compressing...")
            self.summary = self._create_summary()
            # Clear recent messages, keep only the last user message
            self.recent_messages = [self.recent_messages[-1]]
            print(f"[Memory] Compression complete. Summary: {self.summary[:100]}...")
        
        # Call the LLM
        response = client.chat.completions.create(
            model=self.model,
            messages=self._build_messages()
        )
        
        reply = response.choices[0].message.content
        self.recent_messages.append({"role": "assistant", "content": reply})
        
        return reply

# Test summary memory
summary_mem = SummaryMemory(
    system_prompt="You are a Python programming assistant",
    max_tokens_before_summary=500  # Set small for testing
)

conversations = [
    "My name is Alex, I'm a backend developer working mainly with Python and Go",
    "I'm building a microservices project using FastAPI and SQLAlchemy",
    "I prefer a clean coding style and don't like over-abstraction",
    "Help me write a user authentication endpoint in FastAPI",
    "The endpoint needs to support JWT tokens",
]

for msg in conversations:
    print(f"\nUser: {msg}")
    reply = summary_mem.chat(msg)
    print(f"Assistant: {reply[:100]}...")
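After a compression, every request that reaches the model has a fixed shape: the system prompt, then the summary injected as a second system message, then the uncompressed recent turns. That assembly step (mirroring _build_messages) is pure logic and can be sketched standalone — the summary text below is an illustrative example, not real model output:

```python
# Standalone mirror of SummaryMemory._build_messages: the running summary
# rides along as a second system message ahead of the recent turns.
def build_messages(system_prompt: str, summary: str, recent: list[dict]) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({
            "role": "system",
            "content": f"[Conversation History Summary]\n{summary}",
        })
    return messages + recent

request = build_messages(
    system_prompt="You are a Python programming assistant",
    summary="Alex, a backend developer, wants a FastAPI auth endpoint with JWT.",  # illustrative
    recent=[{"role": "user", "content": "The endpoint needs to support JWT tokens"}],
)
print([m["role"] for m in request])  # ['system', 'system', 'user']
```

Injecting the summary as a system message (rather than a fake user turn) keeps it clearly separated from the live conversation, so the model treats it as background context rather than something to respond to.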

Summary

Three strategies for short-term memory management:

| Strategy | Applicable Scenario | Pros | Cons |
| --- | --- | --- | --- |
| Full history | Short conversations | Complete information | High token consumption |
| Sliding window | General conversations | Simple and efficient | May lose early information |
| Summary compression | Long conversations | Retains important information | Requires additional LLM call |
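The comparison can be folded into a simple dispatcher. The turn thresholds below are illustrative assumptions, not recommendations from this chapter:

```python
# Hypothetical strategy picker; the turn thresholds are illustrative only.
def pick_strategy(expected_turns: int) -> str:
    if expected_turns <= 10:
        return "full history"        # short: keep everything, cost is negligible
    if expected_turns <= 50:
        return "sliding window"      # medium: recent turns usually suffice
    return "summary compression"     # long: compress old turns, keep the gist

print(pick_strategy(5), pick_strategy(30), pick_strategy(200))
```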

Next section: 5.3 Long-Term Memory: Vector Databases and Retrieval