Benchmarks and Evaluation Metrics
Section Goal: Gain a deep understanding of mainstream Agent benchmarks and their evaluation principles; master the underlying algorithms of BFCL, GAIA, AgentBench, WebArena, SWE-bench, and other benchmarks; and learn how to design your own evaluation system.
Why Do We Need Benchmarks?
Imagine you're interviewing two candidates — if you don't give them the same questions, how can you compare who is better? Benchmarks are the "standardized exam questions" for Agents — they use unified tasks, data, and scoring criteria to measure the performance of different Agents.
But different benchmarks evaluate completely different capability dimensions. The "evaluation panorama" below helps you quickly orient yourself:
Agent Evaluation Benchmark Categories
Agent evaluation benchmarks can be categorized by capability dimension as follows [1]:
| Category | Representative Benchmarks | Capabilities Evaluated |
|---|---|---|
| Tool calling | BFCL, ToolBench, API-Bank | Correctly calling APIs, handling parameters, combining tools |
| General reasoning | GAIA, MMLU, GSM8K | Multi-step reasoning, knowledge breadth, math ability |
| Comprehensive Agent | AgentBench | End-to-end task completion across 8 domains |
| Web operations | WebArena, Mind2Web | Completing specified tasks on real websites |
| Software engineering | SWE-bench, HumanEval | Code generation, bug fixing, project-level changes |
| Multimodal | VisualWebArena, OSWorld | Visual understanding + action execution |
1. BFCL — Tool Calling Evaluation Benchmark
Overview
BFCL (Berkeley Function Calling Leaderboard) [2] is a tool calling evaluation benchmark released by UC Berkeley that systematically evaluates LLMs' ability to call functions/APIs. It is one of the most authoritative tool calling benchmarks available.
Four Test Scenarios
BFCL divides tool calling into four progressively difficult levels:
| Type | Description | Example |
|---|---|---|
| Simple | Single function, complete parameters | get_weather(city="Beijing") |
| Multiple | Select the correct function from multiple candidates | Given 10 functions, select and correctly call 1 |
| Parallel | Call multiple functions at once | Query weather and book a flight simultaneously |
| Irrelevance | Identify irrelevant requests and refuse to call | Should not call any tool when the user says "Hello" |
AST Matching Algorithm: Why String Matching Isn't Enough
BFCL's core innovation is using AST (Abstract Syntax Tree) matching rather than simple string matching to evaluate the correctness of tool calls.
The problem with string matching:
# These two calls are semantically identical, but string matching would judge them as "unequal"
ground_truth = 'get_weather(city="Beijing", unit="celsius")'
prediction = 'get_weather(unit="celsius", city="Beijing")'
# String match: 'city="Beijing", unit="celsius"' ≠ 'unit="celsius", city="Beijing"'
# Result: ❌ Incorrectly judged as wrong!
How AST matching works:
import ast
from typing import Any
def ast_match(prediction: str, ground_truth: str) -> bool:
"""
BFCL's AST matching algorithm (simplified)
Core idea: Parse function calls into AST nodes,
compare function names and argument sets (ignoring argument order)
"""
try:
# Parse into AST
pred_ast = ast.parse(prediction, mode='eval').body
true_ast = ast.parse(ground_truth, mode='eval').body
# Check that both are function calls
if not (isinstance(pred_ast, ast.Call) and isinstance(true_ast, ast.Call)):
return False
# Compare function names
pred_func = ast.dump(pred_ast.func)
true_func = ast.dump(true_ast.func)
if pred_func != true_func:
return False
# Compare keyword arguments (ignoring order)
pred_kwargs = {
kw.arg: ast.literal_eval(kw.value)
for kw in pred_ast.keywords
}
true_kwargs = {
kw.arg: ast.literal_eval(kw.value)
for kw in true_ast.keywords
}
if pred_kwargs != true_kwargs:
return False
# Compare positional arguments
pred_args = [ast.literal_eval(a) for a in pred_ast.args]
true_args = [ast.literal_eval(a) for a in true_ast.args]
return pred_args == true_args
except (SyntaxError, ValueError):
return False
# Test
print(ast_match(
'get_weather(city="Beijing", unit="celsius")',
'get_weather(unit="celsius", city="Beijing")'
))
# ✅ True — AST matching correctly identifies argument order independence
print(ast_match(
'get_weather(city="Beijing")',
'get_weather(city="Shanghai")'
))
# ❌ False — argument values differ
Type-Aware Matching
BFCL also handles type equivalence:
def type_aware_match(pred_value: Any, true_value: Any) -> bool:
"""
Type-aware argument value matching
Handles common type equivalence scenarios:
- Integer 1 and float 1.0
- String "true" and boolean True
- Single-element list ["a"] and string "a"
"""
# Direct equality
if pred_value == true_value:
return True
# Numeric type equivalence: 1 == 1.0
if isinstance(pred_value, (int, float)) and isinstance(true_value, (int, float)):
return abs(float(pred_value) - float(true_value)) < 1e-6
# String vs boolean: "true" == True
if isinstance(pred_value, str) and isinstance(true_value, bool):
return pred_value.lower() == str(true_value).lower()
# List vs set: ignore order
if isinstance(pred_value, list) and isinstance(true_value, list):
return sorted(str(x) for x in pred_value) == sorted(str(x) for x in true_value)
return False
2. GAIA — General AI Assistant Evaluation
Overview
GAIA (General AI Assistant Benchmark) [3] was released by Meta to evaluate AI assistants' ability to handle real-world tasks. Its unique characteristics are:
- Easy for humans, hard for AI: Questions are designed so humans can easily answer them (through search and reasoning), but AI needs to combine multiple capabilities to complete them
- Short, definitive answers: Each question has a short standard answer (usually a word or number), avoiding the subjectivity of open-ended evaluation
- Three-level difficulty system: From simple to complex, comprehensively evaluating different capability levels
Three Difficulty Levels
| Level | Required Capabilities | Example |
|---|---|---|
| Level 1 | 1–2 steps, basic reasoning | "What is the capital of France?" |
| Level 2 | 3–5 steps, requires tools | "What is the population of the birthplace of the 2024 Nobel Physics Prize winner?" |
| Level 3 | 5+ steps, multiple tools + reasoning | "Find the data in row 2 of the table on page 3 of a PDF, and calculate its ratio to CPI" |
Quasi-Exact Match Algorithm
GAIA's evaluation uses a quasi-exact match algorithm — more lenient than strict string matching, but maintaining objectivity:
import re
import unicodedata
def quasi_exact_match(prediction: str, ground_truth: str) -> bool:
"""
GAIA's quasi-exact match algorithm
Core idea: Normalize both the prediction and the ground truth before
exact comparison, tolerating meaningless differences like case,
punctuation, and whitespace
"""
def normalize(text: str) -> str:
"""Normalization"""
# Lowercase
text = text.lower().strip()
# Unicode normalization (handles full-width/half-width, accents, etc.)
text = unicodedata.normalize("NFKD", text)
# Remove punctuation (but keep decimal points and minus signs)
text = re.sub(r'[^\w\s\.\-]', '', text)
# Collapse extra whitespace
text = re.sub(r'\s+', ' ', text)
# Remove articles (English)
text = re.sub(r'\b(a|an|the)\b', '', text)
text = re.sub(r'\s+', ' ', text).strip()
return text
def normalize_number(text: str) -> str | None:
"""Try to parse text into a standard numeric format"""
try:
# Remove thousands separators
cleaned = text.replace(",", "").replace(" ", "")
num = float(cleaned)
# If integer, return integer format
if num == int(num):
return str(int(num))
return f"{num:.6f}".rstrip('0').rstrip('.')
except ValueError:
return None
# Normalized comparison
norm_pred = normalize(prediction)
norm_truth = normalize(ground_truth)
if norm_pred == norm_truth:
return True
# Numeric comparison
num_pred = normalize_number(prediction)
num_truth = normalize_number(ground_truth)
if num_pred is not None and num_truth is not None:
return num_pred == num_truth
# Containment (answer may be within a longer response)
if norm_truth in norm_pred:
# Ensure it's a complete match, not a substring
pattern = r'\b' + re.escape(norm_truth) + r'\b'
if re.search(pattern, norm_pred):
return True
return False
# Tests
print(quasi_exact_match("Paris", "paris")) # True: case
print(quasi_exact_match("42,000", "42000")) # True: thousands separator
print(quasi_exact_match("The answer is Paris.", "Paris")) # True: containment
print(quasi_exact_match("3.14", "3.14000")) # True: decimal precision
print(quasi_exact_match("Beijing", "Shanghai")) # False: different answers
GAIA's Multimodal Data Handling
A unique challenge in GAIA is that tasks may include attachments — PDF files, Excel spreadsheets, images, etc. Agents need to:
- Identify the attachment type
- Use the appropriate tool to read the content
- Extract key information from the content
- Combine the information to complete the reasoning
3. AgentBench — Comprehensive Agent Evaluation
Overview
AgentBench [4] is a comprehensive Agent benchmark released by Tsinghua University, covering tasks across 8 different domains to comprehensively evaluate Agent performance in various environments:
| Domain | Task Type | Evaluation Focus |
|---|---|---|
| OS | Operating system command execution | File operations, process management |
| DB | Database queries | SQL generation, data analysis |
| KG | Knowledge graph reasoning | Graph traversal, relational reasoning |
| DCG | Digital card game | Strategic decision-making, state tracking |
| LTP | Lateral thinking puzzles | Creative reasoning |
| HouseHold | Home environment operations | Spatial reasoning, object interaction |
| WebShop | Online shopping | Search, filtering, decision-making |
| WebBrowse | Web browsing | Information extraction, navigation |
AgentBench Evaluation Framework
class AgentBenchEvaluator:
"""
AgentBench evaluation architecture (conceptual implementation)
Key features:
1. Each domain has its own environment and evaluator
2. Agents interact with environments through a unified text interface
3. The evaluation metric is task Success Rate
"""
def __init__(self, agent):
self.agent = agent
self.environments = {
"os": OSEnvironment(),
"db": DatabaseEnvironment(),
"web_shop": WebShopEnvironment(),
# ... other environments
}
def evaluate_task(self, env_name: str, task: dict) -> dict:
"""
Evaluate a single task
Process:
1. Initialize the environment and provide the task description
2. Agent interacts with the environment (up to N steps)
3. Check whether the final state satisfies the success conditions
"""
env = self.environments[env_name]
observation = env.reset(task)
for step in range(task.get("max_steps", 20)):
# Agent generates an action based on the observation
action = self.agent.act(observation)
# Environment executes the action
observation, reward, done, info = env.step(action)
if done:
break
return {
"success": info.get("success", False),
"steps": step + 1,
"reward": reward,
}
def evaluate_all(self, benchmark_data: dict) -> dict:
"""Evaluate all domains"""
results = {}
for env_name, tasks in benchmark_data.items():
env_results = [
self.evaluate_task(env_name, task)
for task in tasks
]
results[env_name] = {
"success_rate": sum(r["success"] for r in env_results) / len(env_results),
"avg_steps": sum(r["steps"] for r in env_results) / len(env_results),
}
return results
Top Model Performance by Domain (as of 2025)
| Environment | GPT-4o | Claude-3.5 | Open-source SOTA |
|---|---|---|---|
| OS | ~45% | ~42% | ~30% (CodeLlama) |
| DB | ~52% | ~48% | ~35% |
| WebShop | ~60% | ~55% | ~40% |
| Overall | ~42% | ~38% | ~28% |
Key insight: Even the strongest closed-source models achieve only about 40% overall on AgentBench. This shows there is still enormous room for improvement in current LLMs' Agent capabilities.
4. WebArena — Web Operation Evaluation
Overview
WebArena [5] is a Web Agent evaluation benchmark released by CMU that tests Agents' ability to complete tasks in real Web application environments. It deploys 4 complete Web applications:
- Reddit (forum)
- GitLab (code hosting)
- Shopping (e-commerce website)
- CMS (content management system)
Task Example
Task: "On GitLab, create a new repository named 'ml-pipeline',
add a .gitignore file using the Python template,
then create an Issue named 'setup-ci'."
Evaluation criteria:
1. Does the repository 'ml-pipeline' exist? ✅/❌
2. Does the .gitignore contain Python template content? ✅/❌
3. Has the Issue 'setup-ci' been created? ✅/❌
Final score: Success only if all conditions are met
Evaluation Method
WebArena uses state-based evaluation — checking whether the state of the Web application after the operation matches expectations:
class WebArenaEvaluator:
"""WebArena evaluation logic (conceptual implementation)"""
def evaluate(self, task: dict, final_state: dict) -> bool:
"""
Check whether the final state of the Web application meets the task requirements
Evaluation types:
1. URL match: Is the final page correct?
2. Element existence: Does the page contain specific elements?
3. Database state: Is the backend data correct?
"""
for condition in task["success_conditions"]:
if condition["type"] == "url_match":
if not self._check_url(final_state["url"], condition["pattern"]):
return False
elif condition["type"] == "element_exists":
if not self._find_element(
final_state["page_html"],
condition["selector"]
):
return False
elif condition["type"] == "db_check":
if not self._query_db(
condition["query"],
condition["expected"]
):
return False
return True # All conditions met
5. SWE-bench — Software Engineering Evaluation
Overview
SWE-bench [6] is a software engineering evaluation benchmark released by Princeton that tests Agents' ability to resolve real GitHub Issues. Each test case comes from a real open-source project (such as Django, Flask, scikit-learn) and includes an Issue description and corresponding test cases.
Task Flow
1. Receive Issue description:
"Django QuerySet.union() returns duplicate results when using values()"
2. Understand the codebase:
Analyze Django's QuerySet implementation (thousands of files)
3. Locate the problem:
Find the union logic in django/db/models/sql/query.py
4. Generate a fix:
Generate a git patch file
5. Validate:
Run the project's unit tests and check if they pass
SWE-bench Variants
| Variant | Test Count | Description |
|---|---|---|
| SWE-bench Full | 2,294 Issues | Complete collection |
| SWE-bench Lite | 300 Issues | Curated subset, more reproducible |
| SWE-bench Verified | 500 Issues | High-quality subset verified by humans |
Evaluation Metric
SWE-bench's core metric is the Resolved Rate:
def swe_bench_evaluate(
patch: str, # git patch generated by the Agent
test_suite: str, # original test cases
repo_path: str, # code repository path
) -> dict:
"""
SWE-bench evaluation process (simplified)
1. Apply the Agent-generated patch
2. Run the project's test suite
3. Check whether previously failing tests now pass
"""
import subprocess
# Check if patch can be applied
apply_result = subprocess.run(
["git", "apply", "--check", "-"],
input=patch.encode(),
cwd=repo_path,
capture_output=True,
)
if apply_result.returncode != 0:
return {"resolved": False, "reason": "Patch cannot be applied"}
# Actually apply the patch
subprocess.run(
["git", "apply", "-"],
input=patch.encode(),
cwd=repo_path,
)
# Run tests
test_result = subprocess.run(
["python", "-m", "pytest", test_suite, "-x"],
cwd=repo_path,
capture_output=True,
timeout=300, # 5-minute timeout
)
return {
"resolved": test_result.returncode == 0,
"test_output": test_result.stdout.decode(),
}
Current SOTA (as of 2025)
| Agent | SWE-bench Verified | Method |
|---|---|---|
| DeepSWE | 59.0% | GRPO reinforcement learning training |
| Amazon Q Developer | 55.2% | Closed-source commercial product |
| Claude-3.5 Sonnet | ~49% | Direct inference |
| SWE-Agent | ~33% | Open-source framework |
6. LLM-as-Judge Evaluation Method
For open-ended tasks without standard answers, LLM-as-Judge [7] is the most commonly used evaluation method — using a powerful LLM to judge the output of another LLM.
Three Judgment Modes
class LLMJudge:
"""LLM-as-Judge evaluator"""
def __init__(self, judge_model: str = "gpt-4o"):
from langchain_openai import ChatOpenAI
self.judge = ChatOpenAI(model=judge_model, temperature=0)
def pointwise_scoring(
self,
question: str,
answer: str,
rubric: str,
) -> dict:
"""
Mode 1: Pointwise scoring
Evaluates the absolute quality of a single answer
"""
prompt = f"""Please evaluate the answer quality according to the following rubric.
Question: {question}
Answer: {answer}
Rubric:
{rubric}
Please score from 1–10 and explain your reasoning.
Output JSON: {{"score": <score>, "reasoning": "<reasoning>"}}"""
response = self.judge.invoke(prompt)
return json.loads(response.content)
def pairwise_comparison(
self,
question: str,
answer_a: str,
answer_b: str,
) -> dict:
"""
Mode 2: Pairwise comparison
Determines which of two answers is better
Used to calculate Win Rate and ELO scores
"""
prompt = f"""Please compare the following two answers and determine which is better.
Question: {question}
Answer A:
{answer_a}
Answer B:
{answer_b}
Please choose: A is better / B is better / Tie
And explain your reasoning.
Output JSON: {{"winner": "A"/"B"/"tie", "reasoning": "<reasoning>"}}"""
response = self.judge.invoke(prompt)
return json.loads(response.content)
def reference_based(
self,
question: str,
answer: str,
reference: str,
) -> dict:
"""
Mode 3: Reference-based comparison
Compare with a standard answer
"""
prompt = f"""Please evaluate the consistency between the answer and the reference answer.
Question: {question}
Reference answer: {reference}
Answer to evaluate: {answer}
Scoring criteria:
- 1.0: Completely consistent with the reference answer
- 0.8: Mostly consistent, with minor differences
- 0.5: Partially correct
- 0.2: Mostly incorrect
- 0.0: Completely incorrect
Output JSON: {{"score": <score>, "reasoning": "<reasoning>"}}"""
response = self.judge.invoke(prompt)
return json.loads(response.content)
Win Rate Calculation
def compute_win_rate(
judge: LLMJudge,
questions: list[str],
answers_a: list[str], # Agent A's answers
answers_b: list[str], # Agent B's answers
) -> dict:
"""Calculate the Win Rate of two Agents"""
wins_a, wins_b, ties = 0, 0, 0
for q, a, b in zip(questions, answers_a, answers_b):
# Forward evaluation
result1 = judge.pairwise_comparison(q, a, b)
# Reverse evaluation (swap positions to eliminate position bias)
result2 = judge.pairwise_comparison(q, b, a)
# Combine both evaluations
if result1["winner"] == "A" and result2["winner"] == "B":
wins_a += 1 # Both agree A is better
elif result1["winner"] == "B" and result2["winner"] == "A":
wins_b += 1 # Both agree B is better
else:
ties += 1 # Inconsistent results, count as tie
total = len(questions)
return {
"agent_a_win_rate": wins_a / total,
"agent_b_win_rate": wins_b / total,
"tie_rate": ties / total,
}
Known Biases in LLM-as-Judge [7]
| Bias Type | Description | Mitigation |
|---|---|---|
| Position bias | Tends to prefer the first (or last) answer | Swap positions and evaluate twice |
| Verbosity bias | Tends to prefer longer answers | Explicitly state "conciseness is not penalized" in the rubric |
| Self-preference | Tends to prefer its own generated answers | Use a different model as the Judge |
| Format bias | Tends to prefer better-formatted answers | Unify formatting before evaluation |
Designing Your Own Evaluation System
Complete Evaluation Framework
import json
import time
from dataclasses import dataclass, field
@dataclass
class AgentMetrics:
"""Agent evaluation metric set"""
# Quality metrics
accuracy: float = 0.0 # Accuracy rate
f1_score: float = 0.0 # F1 score
hallucination_rate: float = 0.0 # Hallucination rate
# Efficiency metrics
avg_latency: float = 0.0 # Average response time (seconds)
avg_steps: float = 0.0 # Average number of execution steps
avg_tokens: float = 0.0 # Average token consumption
avg_cost: float = 0.0 # Average cost (USD)
# Reliability metrics
success_rate: float = 0.0 # Task success rate
error_rate: float = 0.0 # Error rate
timeout_rate: float = 0.0 # Timeout rate
# Safety metrics
safety_violation_rate: float = 0.0 # Safety violation rate
pii_leak_rate: float = 0.0 # Privacy leakage rate
class AgentBenchmarkRunner:
"""Agent benchmark test runner"""
def __init__(self, agent_func, test_cases: list[dict]):
self.agent_func = agent_func
self.test_cases = test_cases
self.results = []
def run(self) -> AgentMetrics:
"""Run all test cases"""
metrics = AgentMetrics()
latencies = []
step_counts = []
token_counts = []
successes = 0
errors = 0
timeouts = 0
correct = 0
for case in self.test_cases:
try:
start = time.time()
result = self.agent_func(
case["input"],
timeout=case.get("timeout", 30)
)
elapsed = time.time() - start
latencies.append(elapsed)
step_counts.append(result.get("steps", 0))
token_counts.append(result.get("tokens", 0))
if self._check_answer(
result.get("answer", ""),
case["expected"]
):
correct += 1
successes += 1
except TimeoutError:
timeouts += 1
except Exception:
errors += 1
self.results.append({
"case": case["input"],
"status": "success" if successes else "error"
})
total = len(self.test_cases)
metrics.accuracy = correct / total if total else 0
metrics.success_rate = successes / total if total else 0
metrics.error_rate = errors / total if total else 0
metrics.timeout_rate = timeouts / total if total else 0
metrics.avg_latency = (
sum(latencies) / len(latencies) if latencies else 0
)
metrics.avg_steps = (
sum(step_counts) / len(step_counts) if step_counts else 0
)
metrics.avg_tokens = (
sum(token_counts) / len(token_counts) if token_counts else 0
)
return metrics
def _check_answer(self, actual: str, expected) -> bool:
"""Check if the answer is correct (supports multiple matching methods)"""
if isinstance(expected, str):
return actual.strip().lower() == expected.strip().lower()
elif isinstance(expected, list):
return any(kw.lower() in actual.lower() for kw in expected)
elif callable(expected):
return expected(actual)
return False
Recommended Evaluation Combination Strategy
| Agent Type | Recommended Benchmarks | Custom Evaluation Focus |
|---|---|---|
| General assistant | GAIA + MMLU | Knowledge accuracy + multi-step reasoning |
| Code Agent | SWE-bench + HumanEval | Test pass rate + code quality |
| Tool-calling Agent | BFCL + ToolBench | AST match accuracy + parameter correctness |
| Web Agent | WebArena | Task completion rate + operation efficiency |
| Customer service Agent | Custom | LLM-as-Judge + manual spot checks |
🏭 Production Practice
- Build your own evaluation set: General benchmarks only reflect a model's general capabilities; production requires building test sets tailored to your own business scenarios (typically 100–500 cases)
- Three-layer evaluation system: Automated rules (fast) → LLM Judge (batch) → manual spot checks (precise), three progressive layers
- Evaluation frequency: Run a full evaluation after every model upgrade or prompt change; integrate automated evaluation into CI/CD
- Watch Metrics: In production, focus on monitoring these three metrics: P95 latency, tool call success rate, and user rating distribution
Regression Testing: Ensuring Improvements Don't Introduce New Problems
class RegressionTracker:
"""Regression test tracker"""
def __init__(self, history_file: str = "eval_history.json"):
self.history_file = history_file
self.history = self._load_history()
def _load_history(self) -> list:
try:
with open(self.history_file) as f:
return json.load(f)
except FileNotFoundError:
return []
def record(self, version: str, metrics: AgentMetrics):
"""Record an evaluation result"""
entry = {
"version": version,
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
"accuracy": metrics.accuracy,
"success_rate": metrics.success_rate,
"avg_latency": metrics.avg_latency,
"avg_tokens": metrics.avg_tokens
}
self.history.append(entry)
with open(self.history_file, "w") as f:
json.dump(self.history, f, indent=2)
def check_regression(
self,
current: AgentMetrics,
threshold: float = 0.05
) -> list[str]:
"""Check if any metric has regressed beyond the threshold"""
if not self.history:
return []
previous = self.history[-1]
warnings = []
if previous["accuracy"] - current.accuracy > threshold:
warnings.append(
f"⚠️ Accuracy dropped: "
f"{previous['accuracy']:.1%} → {current.accuracy:.1%}"
)
if previous["success_rate"] - current.success_rate > threshold:
warnings.append(
f"⚠️ Success rate dropped: "
f"{previous['success_rate']:.1%} → {current.success_rate:.1%}"
)
if (current.avg_latency > previous["avg_latency"] * 1.5
and previous["avg_latency"] > 0):
warnings.append(
f"⚠️ Latency increased: "
f"{previous['avg_latency']:.2f}s → {current.avg_latency:.2f}s"
)
return warnings
Summary
| Benchmark | Core Capability | Evaluation Method | Current SOTA |
|---|---|---|---|
| BFCL | Tool calling | AST matching algorithm | GPT-4o ~90% |
| GAIA | General reasoning | Quasi-exact match | GPT-4o ~75% (L1) |
| AgentBench | Comprehensive Agent | Task success rate | GPT-4o ~42% |
| WebArena | Web operations | State checking | GPT-4o ~35% |
| SWE-bench | Software engineering | Test pass rate | DeepSWE 59% |
Preview of next section: Now that we've mastered evaluation methods, let's learn how to improve Agent performance through prompt tuning.
References
[1] LIU X, YU H, ZHANG H, et al. AgentBench: Evaluating LLMs as agents[C]//ICLR. 2024.
[2] YAN F, MIAO H, ZHONG C, et al. Berkeley function calling leaderboard[EB/OL]. 2024. https://gorilla.cs.berkeley.edu/leaderboard.html.
[3] MIALON G, FOURRIER C, SWIFT C, et al. GAIA: A benchmark for general AI assistants[C]//ICLR. 2024.
[4] LIU X, YU H, ZHANG H, et al. AgentBench: Evaluating LLMs as agents[C]//ICLR. 2024.
[5] ZHOU S, XU F F, ZHU H, et al. WebArena: A realistic web environment for building autonomous agents[C]//ICLR. 2024.
[6] JIMENEZ C E, YANG J, WETTIG A, et al. SWE-bench: Can language models resolve real-world GitHub issues?[C]//ICLR. 2024.
[7] ZHENG L, CHIANG W L, SHENG Y, et al. Judging LLM-as-a-judge with MT-bench and chatbot arena[C]//NeurIPS. 2023.