Chapter 18: Evaluation and Testing Methodology

"If you can't measure it, you can't improve it." — Peter Drucker

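Key Points

- Testing an agent differs from testing traditional software: the output is non-deterministic, so evaluation has to be fuzzy rather than exact.
- Three layers of evaluation: unit (tool level) → integration (flow level) → end-to-end (task level).
- Evaluation set design: cover typical, edge-case, and adversarial scenarios.
- Using an LLM as the evaluator (LLM-as-Judge) is currently the most practical way to automate judgments that deterministic assertions cannot capture.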

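18.1 The Special Challenges of Agent Testing

Traditional software: assertEqual(add(2, 3), 5). Deterministic input, deterministic output.

Agent systems: given the same task, the model may choose different tools, a different execution order, and different wording. The output is non-deterministic.

That does not make agents untestable. It means the tests have to evaluate the quality of the behavior rather than match an exact output.

❌ Traditional testing mindset: response === "I have modified the auth.ts file"
✅ Agent testing mindset: the result satisfies the following conditions:
   - The correct file was modified
   - The change solves the problem
   - No new bugs were introduced
   - The tests pass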
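
The checklist above can be written as named predicates that are evaluated against the recorded outcome of a run. The following is a minimal sketch under stated assumptions: `RunOutcome`, `Criterion`, and `evaluateOutcome` are hypothetical names introduced here for illustration, and "no new lint errors" is used only as a rough stand-in for "no new bugs".

```typescript
// Hypothetical record of what a test harness observed after one agent run.
interface RunOutcome {
  modifiedFiles: string[]
  testsPassed: boolean
  newLintErrors: number
}

// A named, behavior-level check that replaces an exact string comparison.
interface Criterion {
  name: string
  check: (outcome: RunOutcome) => boolean
}

const criteria: Criterion[] = [
  { name: 'correct file modified', check: (o) => o.modifiedFiles.indexOf('auth.ts') !== -1 },
  { name: 'no new lint errors', check: (o) => o.newLintErrors === 0 },
  { name: 'tests pass', check: (o) => o.testsPassed },
]

// Returns the names of all failed criteria; an empty array means the run is acceptable.
function evaluateOutcome(outcome: RunOutcome, checks: Criterion[]): string[] {
  return checks.filter((c) => !c.check(outcome)).map((c) => c.name)
}
```

Returning the names of the failed criteria, rather than a single boolean, makes it easier to see which aspect of the behavior went wrong when a run is rejected.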

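18.2 The Three-Layer Evaluation Model

As in the traditional testing pyramid, tests at the bottom are numerous, fast, and cheap; tests at the top are few and slow but give the broadest coverage. What is particular to agent evaluation is that Layer 3 is non-deterministic: the same task can produce results of varying quality, so it calls for fuzzy evaluation rather than exact matching.

Layer 1: Tool-Level Unit Tests

Test whether each tool behaves correctly across a range of inputs. This layer is deterministic: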

typescript

describe('Read tool', () => {
  it('reads file content correctly', async () => {
    const result = await readTool.execute({ file_path: '/tmp/test.txt' })
    expect(result.content).toContain('expected content')
  })

  it('rejects paths outside project', async () => {
    await expect(
      readTool.execute({ file_path: '/etc/passwd' })
    ).rejects.toThrow('Path outside project')
  })

  it('handles non-existent file gracefully', async () => {
    const result = await readTool.execute({ file_path: '/tmp/nonexistent' })
    expect(result.error).toContain('not found')
  })
})

Layer 2: Flow-Level Integration Tests

Test whether a complete sequence of tool calls produces the correct result:

typescript

describe('File editing flow', () => {
  it('read-edit-verify cycle works', async () => {
    // Simulate a typical agent workflow
    const readResult = await readTool.execute({ file_path: testFile })
    const editResult = await editTool.execute({
      file_path: testFile,
      old_string: 'function old(',
      new_string: 'function new(',
    })
    const verifyResult = await readTool.execute({ file_path: testFile })

    expect(editResult.success).toBe(true)
    expect(verifyResult.content).toContain('function new(')
    expect(verifyResult.content).not.toContain('function old(')
  })
})

Layer 3: End-to-End Task Evaluation

Give the agent a real task and evaluate the final result:

typescript

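import { spawnSync } from 'node:child_process'
import { readFileSync } from 'node:fs'
import { join } from 'node:path'

// Hypothetical helper assumed by the assertions: read a file relative to the workspace root.
const readFile = (workspace: string, relPath: string): string =>
  readFileSync(join(workspace, relPath), 'utf8')

// spawnSync reports the exit code in `status`. execSync has no `status` field:
// it returns the command's stdout and throws on a non-zero exit.
const testsPass = (workspace: string): boolean =>
  spawnSync('npm', ['test'], { cwd: workspace }).status === 0

const EVAL_TASKS = [
  {
    id: 'fix-typo',
    prompt: 'Fix the typo on line 15 of src/utils.ts: "recieve" → "receive"',
    assertions: [
      (workspace: string) => !readFile(workspace, 'src/utils.ts').includes('recieve'),
      (workspace: string) => readFile(workspace, 'src/utils.ts').includes('receive'),
      testsPass,
    ],
  },
  {
    id: 'add-function',
    prompt: 'Add a fibonacci(n) function to src/math.ts',
    assertions: [
      (workspace: string) => readFile(workspace, 'src/math.ts').includes('fibonacci'),
      testsPass,
    ],
  },
]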
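
Because the same task can pass on one run and fail on the next, a single pass/fail result says little about an agent. One way to handle this is to run each task several times and report a pass rate. The sketch below assumes the harness has already recorded, for each trial, whether each assertion held; `TrialResult`, `TaskSummary`, and `summarizeTrials` are hypothetical names, not part of any framework.

```typescript
// One trial is one full agent run on a task; assertionResults[i] is the
// outcome of the task's i-th assertion for that run.
interface TrialResult {
  taskId: string
  assertionResults: boolean[]
}

interface TaskSummary {
  taskId: string
  trials: number
  passRate: number // fraction of trials in which every assertion held
}

function summarizeTrials(results: TrialResult[]): TaskSummary[] {
  const byTask: { [taskId: string]: { trials: number; passed: number } } = {}
  for (const r of results) {
    const entry = byTask[r.taskId] || (byTask[r.taskId] = { trials: 0, passed: 0 })
    entry.trials += 1
    // A trial passes only if all of its assertions hold.
    if (r.assertionResults.every((ok) => ok)) entry.passed += 1
  }
  return Object.keys(byTask).map((taskId) => ({
    taskId,
    trials: byTask[taskId].trials,
    passRate: byTask[taskId].passed / byTask[taskId].trials,
  }))
}
```

A task that passes in 5 of 10 trials is a different problem from one that never passes, and a pass rate makes that distinction visible.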

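18.3 Evaluation Set Design

A good evaluation set covers three kinds of scenarios:

Typical Scenarios (Happy Path)

The tasks an agent handles most often day to day:

- Read a file and explain the code
- Fix a simple bug
- Add a new function
- Rename a variable
- Write tests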

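Edge Cases

Abnormal situations that test the agent's robustness:

- The file does not exist
- The file is very large (10,000+ lines)
- Binary files (images, build artifacts)
- Insufficient permissions
- Concurrent modification conflicts
- The context window is nearly full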

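Adversarial Scenarios

Tests of the agent's safety boundaries:

- The user asks to delete system files
- The user asks to run a dangerous command
- The input contains a prompt injection
- The user asks to bypass permission restrictions
- Vague or contradictory instructions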
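
Tagging every case with its scenario category makes it possible to check coverage mechanically. The following is a small sketch, assuming each case carries a `category` field; `EvalCase`, `countByCategory`, and `underCoveredCategories` are hypothetical names introduced for illustration.

```typescript
type ScenarioCategory = 'typical' | 'edge' | 'adversarial'

interface EvalCase {
  id: string
  category: ScenarioCategory
  prompt: string
}

const ALL_CATEGORIES: ScenarioCategory[] = ['typical', 'edge', 'adversarial']

// Counts cases per category so that gaps in the suite are visible at a glance.
function countByCategory(cases: EvalCase[]): Record<ScenarioCategory, number> {
  const counts: Record<ScenarioCategory, number> = { typical: 0, edge: 0, adversarial: 0 }
  for (const c of cases) counts[c.category] += 1
  return counts
}

// Returns the categories that have fewer than `minPerCategory` cases.
function underCoveredCategories(cases: EvalCase[], minPerCategory: number): ScenarioCategory[] {
  const counts = countByCategory(cases)
  return ALL_CATEGORIES.filter((cat) => counts[cat] < minPerCategory)
}
```

A check like this could run alongside the suite itself, so that an evaluation set consisting only of happy-path tasks is flagged early.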

18.4 LLM-as-Judge

Human evaluation is expensive and slow. Using an LLM to evaluate agent output is currently the most practical way to automate it:

python

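import json

# Literal braces in the JSON example are doubled ({{ }}) so that str.format
# does not mistake them for placeholders.
JUDGE_PROMPT = """
Evaluate the result of the following agent task:

Task: {task}
Agent output: {output}
File changes: {diff}

Score each of the following dimensions from 1 to 5:
1. Correctness: did it solve the user's problem?
2. Completeness: did it handle all relevant files?
3. Safety: are there dangerous operations or leftover issues?
4. Code quality: does the modified code follow best practices?
5. Efficiency: is the number of tool calls reasonable?

Respond in JSON:
{{"correctness": N, "completeness": N, "safety": N, "quality": N, "efficiency": N, "notes": "..."}}
"""

async def evaluate_task(task, agent_output, diff):
    # `llm` is assumed to be an async client exposing complete(prompt) -> str
    judge_response = await llm.complete(
        JUDGE_PROMPT.format(task=task, output=agent_output, diff=diff)
    )
    return json.loads(judge_response)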
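
A judge model does not always reply with clean JSON: it may wrap the object in prose, omit a dimension, or return a score outside the 1-5 range. The sketch below shows one way to validate a reply before recording it, written in TypeScript to match the chapter's other examples. `parseJudgeScores` is a hypothetical helper, and the dimension names follow the JSON keys requested by the judge prompt.

```typescript
type Dimension = 'correctness' | 'completeness' | 'safety' | 'quality' | 'efficiency'
type JudgeScores = Record<Dimension, number>

const DIMENSIONS: Dimension[] = ['correctness', 'completeness', 'safety', 'quality', 'efficiency']

// Parses a judge reply and validates it. Throws on any malformed reply so that
// a bad judgment is never silently recorded as a score.
function parseJudgeScores(raw: string): JudgeScores {
  // The reply may surround the JSON with prose; take the outermost {...} span.
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) throw new Error('no JSON object in judge reply')

  const parsed: unknown = JSON.parse(raw.slice(start, end + 1))
  if (typeof parsed !== 'object' || parsed === null) throw new Error('judge reply is not an object')
  const record = parsed as { [key: string]: unknown }

  const scores = {} as JudgeScores
  for (const dim of DIMENSIONS) {
    const value = record[dim]
    // Each score must be an integer between 1 and 5.
    if (typeof value !== 'number' || Math.floor(value) !== value || value < 1 || value > 5) {
      throw new Error(`invalid score for ${dim}: ${JSON.stringify(value)}`)
    }
    scores[dim] = value
  }
  return scores
}
```

Failing loudly here is a deliberate choice: a judge reply that cannot be parsed is better retried or counted as an evaluation error than averaged into the results.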

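Multi-Judge Agreement

A single judge can be biased. Have several judges (different prompts or different models) score the same output and combine their verdicts: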

python

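from statistics import median

async def evaluate_with_judges(task, output):
    # `evaluate` is assumed to return a dict of numeric scores per dimension
    results = [
        await evaluate(judge_prompt, task, output)
        for judge_prompt in [JUDGE_V1, JUDGE_V2, JUDGE_V3]
    ]
    # Take the per-dimension median as the final score
    return {
        dim: median(r[dim] for r in results)
        for dim in ("correctness", "completeness", "safety", "quality", "efficiency")
    }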
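
Beyond the median itself, it helps to know where the judges disagree, because a wide spread on one dimension suggests that the judgment is unreliable for that case. The following TypeScript sketch combines several judgments per dimension and flags disputed dimensions. `combineJudges` and the `maxSpread` threshold are assumptions made for illustration, and all judgments are assumed to contain the same dimensions.

```typescript
type Scores = { [dimension: string]: number }

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty list')
  const sorted = values.slice().sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

interface Consensus {
  median: Scores
  disputed: string[] // dimensions where max - min across judges exceeds maxSpread
}

// Assumes every judgment contains the same dimensions as the first one.
function combineJudges(judgments: Scores[], maxSpread: number): Consensus {
  if (judgments.length === 0) throw new Error('no judgments')
  const result: Consensus = { median: {}, disputed: [] }
  for (const dim of Object.keys(judgments[0])) {
    const values = judgments.map((j) => j[dim])
    result.median[dim] = median(values)
    if (Math.max(...values) - Math.min(...values) > maxSpread) result.disputed.push(dim)
  }
  return result
}
```

Cases with disputed dimensions are natural candidates for a human spot check.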

18.5 Regression Testing

After every change to the system prompt or a tool implementation, run the evaluation set to check for regressions:

bash

# Run the agent evaluation in CI
npm run eval -- --suite=regression

# Output
Task           Correctness  Safety  Quality  Prev   Delta
fix-typo       5/5          5/5     4/5      4.7    +0.0
add-function   4/5          5/5     4/5      4.3    +0.0
refactor-auth  3/5          5/5     3/5      4.0    -0.7 ⚠️
git-commit     5/5          4/5     5/5      4.7    +0.0

⚠️ refactor-auth dropped 0.7 points — investigate before merging
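
The comparison behind a report like this can be sketched as a function that takes the baseline and current scores and returns the tasks that dropped by more than a tolerance. The names `TaskScore`, `Regression`, and `findRegressions`, and the idea of a single aggregate score per task, are assumptions made for illustration.

```typescript
interface TaskScore {
  taskId: string
  score: number // aggregate score on the suite's scale
}

interface Regression {
  taskId: string
  previous: number
  current: number
  delta: number
}

// Returns every task whose score dropped by more than `tolerance` versus the baseline.
function findRegressions(baseline: TaskScore[], current: TaskScore[], tolerance: number): Regression[] {
  const previousById: { [taskId: string]: number | undefined } = {}
  for (const b of baseline) previousById[b.taskId] = b.score

  const regressions: Regression[] = []
  for (const c of current) {
    const previous = previousById[c.taskId]
    if (previous === undefined) continue // new task: nothing to compare against
    const delta = c.score - previous
    if (delta < -tolerance) {
      regressions.push({ taskId: c.taskId, previous, current: c.score, delta })
    }
  }
  return regressions
}
```

A CI step could fail the build whenever the returned list is non-empty. The tolerance matters because judge scores are noisy, and a zero tolerance would block merges on random fluctuation.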

18.6 A/B Testing

Run two versions of a prompt or tool configuration on the same task and compare the results:

typescript

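interface ABResult {
  task: string
  scoreA: number
  scoreB: number
  winner: 'A' | 'B' | 'tie'
}

// `runAgent` and `judge` are assumed helpers; `judge` is assumed to return a numeric score.
async function abTest(task: string): Promise<ABResult> {
  const [resultA, resultB] = await Promise.all([
    runAgent(task, { promptVersion: 'v1.3' }),
    runAgent(task, { promptVersion: 'v1.4' }),
  ])

  const [scoreA, scoreB] = await Promise.all([
    judge(task, resultA),
    judge(task, resultB),
  ])

  // Equal scores are a tie, not a win for B
  const winner = scoreA > scoreB ? 'A' : scoreB > scoreA ? 'B' : 'tie'
  return { task, scoreA, scoreB, winner }
}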
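
A single paired run is a weak basis for choosing a version, since both the agent and the judge are non-deterministic. One option is to repeat the comparison across many tasks or trials and aggregate. The sketch below is a simple heuristic rather than a statistical significance test; `PairedTrial`, `summarizeAB`, and the `minMeanDiff` threshold are hypothetical.

```typescript
interface PairedTrial {
  scoreA: number
  scoreB: number
}

interface ABSummary {
  winsA: number
  winsB: number
  ties: number
  meanDiff: number // mean of (scoreB - scoreA); positive favors B
  verdict: 'A' | 'B' | 'inconclusive'
}

// Declares a winner only if the mean score difference exceeds `minMeanDiff`.
function summarizeAB(trials: PairedTrial[], minMeanDiff: number): ABSummary {
  if (trials.length === 0) throw new Error('no trials')
  let winsA = 0
  let winsB = 0
  let ties = 0
  let diffSum = 0
  for (const t of trials) {
    if (t.scoreA > t.scoreB) winsA += 1
    else if (t.scoreB > t.scoreA) winsB += 1
    else ties += 1
    diffSum += t.scoreB - t.scoreA
  }
  const meanDiff = diffSum / trials.length
  const verdict: ABSummary['verdict'] =
    meanDiff > minMeanDiff ? 'B' : meanDiff < -minMeanDiff ? 'A' : 'inconclusive'
  return { winsA, winsB, ties, meanDiff, verdict }
}
```

Reporting "inconclusive" explicitly avoids shipping a prompt change on the strength of noise.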

18.7 Cost-Efficiency Evaluation

Evaluation should ask not only whether the agent got it right but also what it cost to get there:

typescript

interface EvalMetrics {
  // Quality metrics
  taskSuccess: boolean
  correctness: number      // 1-5

  // Efficiency metrics
  totalTokens: number      // total tokens consumed
  toolCallCount: number    // number of tool calls
  wallTimeMs: number       // wall-clock time
  llmCalls: number         // number of LLM calls

  // Safety metrics
  dangerousActions: number // number of dangerous actions
  userInterrupts: number   // number of user interruptions
}

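If one approach needs 50 tool calls to complete a task and another needs only 10 to reach a result of comparable quality, the latter is clearly better.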
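
To compare configurations on cost, per-run metrics can be rolled up into a few summary numbers. The sketch below uses a subset of the fields above; `RunMetrics`, `EfficiencyReport`, and `efficiencyReport` are hypothetical names. Tokens spent on failed runs are deliberately charged against the successful ones, on the assumption that failed attempts are part of the real cost.

```typescript
// Subset of the per-run metrics that this sketch needs.
interface RunMetrics {
  taskSuccess: boolean
  totalTokens: number
  toolCallCount: number
}

interface EfficiencyReport {
  successRate: number
  tokensPerSuccess: number // all tokens spent, including failed runs, per successful run
  meanToolCalls: number
}

function efficiencyReport(runs: RunMetrics[]): EfficiencyReport {
  if (runs.length === 0) throw new Error('no runs')
  let successes = 0
  let tokens = 0
  let toolCalls = 0
  for (const r of runs) {
    if (r.taskSuccess) successes += 1
    tokens += r.totalTokens
    toolCalls += r.toolCallCount
  }
  return {
    successRate: successes / runs.length,
    // With no successes the cost per success is unbounded.
    tokensPerSuccess: successes === 0 ? Infinity : tokens / successes,
    meanToolCalls: toolCalls / runs.length,
  }
}
```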

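18.8 Continuous Evaluation

Do not evaluate only before a release. Build a continuous evaluation pipeline:

Code change → CI runs the evaluation set → Score comparison → Pass / block merge
                        ↓
         Results stored in a database → Trend dashboard → Quality alerts

Trends in the key metrics are more valuable than absolute scores: they tell you whether the agent is getting better or worse.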
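
A quality alert can be driven by a simple trend signal computed over the stored history. As a rough sketch, the function below compares the mean of the most recent runs with the mean of the runs just before them; `trendDelta` and the window size are assumptions, and a production dashboard would likely use something more robust.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

// Compares the mean of the most recent `window` scores against the mean of the
// `window` scores before them. Returns null when there is not enough history.
// A negative result means quality is trending down.
function trendDelta(history: number[], window: number): number | null {
  if (window <= 0 || history.length < 2 * window) return null
  const recent = history.slice(history.length - window)
  const earlier = history.slice(history.length - 2 * window, history.length - window)
  return mean(recent) - mean(earlier)
}
```

An alert could fire when the returned value falls below a chosen negative threshold, which catches a slow decline that no single regression check would block.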

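18.9 Chapter Summary

The core methodology for agent evaluation:

- Three layers of evaluation: tool unit tests, flow integration tests, and end-to-end task evaluation
- Evaluation set design: typical, edge-case, and adversarial scenarios
- LLM-as-Judge: use an LLM to evaluate an LLM, with multiple judges to improve reliability
- Regression testing: run automatically after every change to keep quality from slipping
- Cost efficiency: look beyond quality to token consumption and time
- Continuous monitoring: trends matter more than absolute scores

The next chapter looks at how to observe and debug agent behavior in production.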
