Letting an agent edit its own prompt: adding edge-case constraints helps more than changing the methodology

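We ran a set of 20 Python tasks, letting the LLM inspect its own failures and then revise the prompt. Over 3 rounds of search, the pass rate went from a 50% baseline to 75% with candidate R1-C1. Both accepted candidates used the same strategy: ADD_CONSTRAINTS.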

Data from the 2026-03-21 report · LAP Evolution Engine experiment

Starting point: the baseline prompt passes 50%

baseline prompt

# Full baseline, hand-written by a human

You are a Python developer. Your task is to write Python code.

IMPORTANT:
- Write ONLY the implementation code
- Include all necessary imports at the top
- Do NOT write tests
- Do NOT include example usage or main blocks
- When done, call the `finish` tool with the COMPLETE CODE
  as the message.
  ...

The code will be tested with an external test suite that you cannot see.

# Set of 20 Python tasks, each worth 5%
# Baseline result: 50% pass (10/20)

The 5-step closed loop

evolution loop

The reward formula includes a cost penalty:

reward formula

reward = α · (baseline.residual_rate - candidate.residual_rate)
       - β · (increase in prompt_length / 1000)

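# The α term is positive when the residual (failure) rate drops, which is the desired direction
# β is the cost penalty: every extra 1000 characters costs 0.1 reward
# Without a cost penalty the prompt keeps growing:
# the LLM keeps adding constraints until even irrelevant ones count as "passing"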
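The formula above can be sketched as a small function. This is a minimal illustration, not the engine's actual code: the function name and argument names are hypothetical, `beta=0.1` follows the "0.1 reward per 1000 characters" note, and `alpha=1.0` is an assumption because the report does not state its value.

```python
def reward(baseline_residual: float,
           candidate_residual: float,
           baseline_len: int,
           candidate_len: int,
           alpha: float = 1.0,   # assumed value; not given in the report
           beta: float = 0.1) -> float:
    """Reward = alpha * drop in failure rate - beta * prompt growth per 1000 chars."""
    # Positive when the candidate fails less often than the baseline.
    improvement = baseline_residual - candidate_residual
    # Prompt growth in characters, scaled to units of 1000.
    growth = (candidate_len - baseline_len) / 1000
    return alpha * improvement - beta * growth


# Failure rate drops from 50% to 25% while the prompt grows by 500 characters.
r_good = reward(0.50, 0.25, baseline_len=600, candidate_len=1100)

# No change in failure rate, but 1000 extra characters: pure penalty.
r_bloat = reward(0.50, 0.50, baseline_len=600, candidate_len=1600)
```

Under these assumptions the first candidate earns roughly 0.25 - 0.05 = 0.20, and the second earns -0.10, so a prompt that only gets longer is rejected.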

3 rounds · pass rate changes

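R1-C1 pushed the pass rate from 50% to 75%. Note that the R2 baseline scored 60% and the R3 baseline 50%: three runs of the same prompt gave 50%, 60%, and 50%. LLM output is stochastic, so the baseline must be re-evaluated every round. Skipping the re-evaluation lets a candidate post a "false win" against a stale baseline score.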
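One way to enforce this is to score the baseline and the candidate inside the same round and compare only those two numbers. The sketch below assumes a hypothetical `run_task(prompt, task)` callable that returns whether a task passed; `toy_runner` is a deterministic stand-in for a real LLM run and test suite.

```python
from typing import Callable, Sequence


def pass_rate(prompt: str,
              tasks: Sequence[int],
              run_task: Callable[[str, int], bool]) -> float:
    """Fraction of tasks that pass with the given prompt."""
    passed = sum(1 for task in tasks if run_task(prompt, task))
    return passed / len(tasks)


def evaluate_round(baseline_prompt: str,
                   candidate_prompt: str,
                   tasks: Sequence[int],
                   run_task: Callable[[str, int], bool]) -> dict:
    """Score baseline and candidate in the same round; never reuse an old baseline score."""
    baseline = pass_rate(baseline_prompt, tasks, run_task)
    candidate = pass_rate(candidate_prompt, tasks, run_task)
    return {"baseline": baseline, "candidate": candidate, "delta": candidate - baseline}


def toy_runner(prompt: str, task: int) -> bool:
    # Hypothetical stand-in: the constrained prompt solves 15 of 20 tasks, the plain one 10.
    solved = 15 if "CONSTRAINTS" in prompt else 10
    return task < solved


result = evaluate_round("base prompt", "base prompt\nCONSTRAINTS: ...", range(20), toy_runner)
```

With the numbers from the report, a candidate scoring 60% in R2 would look like a 10-point gain against the round-1 baseline of 50%, but it is a tie against the 60% baseline measured in R2 itself.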

What R1-C1 changed: 5 constraints generated automatically by the LLM

R1-C1 prompt diff

# Appended to the end of the baseline:

CONSTRAINTS:                          ← added automatically by the evolution engine
- Ensure bin edge calculations produce tick
  labels that match the input data range when
  creating histograms
- Always maintain consistent DataFrame index
  structure between empty and non-empty return
  cases
- Ensure index names are preserved when
  returning DataFrames from groupby operations
  in all code paths
- Always handle edge cases (empty input, single
  element) with the same output structure as
  normal cases
- Verify that arange offsets for bin edges
  align with expected histogram tick label
  positions

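# These 5 constraints were not written by a human; the LLM generated them by attributing causes to the failed tasks
# Each one maps to a specific failure mode: a task-specific edge case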
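Mechanically, the diff above is an append of a CONSTRAINTS block to the baseline text. A minimal sketch of that step follows; the function name `add_constraints` is hypothetical and the real engine may build the block differently. The character growth it reports is the quantity that the reward's cost penalty is charged on.

```python
def add_constraints(baseline_prompt: str, constraints: list[str]) -> str:
    """Append a CONSTRAINTS block to the prompt; return it unchanged if nothing to add."""
    cleaned = [c.strip() for c in constraints if c.strip()]
    if not cleaned:
        return baseline_prompt
    block = "\n".join(["CONSTRAINTS:"] + [f"- {c}" for c in cleaned])
    return f"{baseline_prompt.rstrip()}\n\n{block}\n"


baseline = "You are a Python developer. Your task is to write Python code."
candidate = add_constraints(baseline, [
    "Always maintain consistent DataFrame index structure between empty "
    "and non-empty return cases",
    "Ensure index names are preserved when returning DataFrames from "
    "groupby operations in all code paths",
])

# Extra characters added by the constraints; this feeds the length penalty.
growth = len(candidate) - len(baseline)
```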

Reward priority of the 4 strategies

Across the 3 rounds actually run, both accepted candidates were ADD_CONSTRAINTS.

The concrete difference:

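Methodology instructions (already in the baseline): "Write ONLY the implementation code" / "Include all necessary imports" / "Do NOT write tests". These are task-agnostic and describe what to write and what not to write.
Edge-case constraints (added by C1): "empty input must have the same output structure as normal input" / "index names must be preserved when returning from groupby". These are task-specific and describe how to handle one particular situation.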

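The agent is not short on methodology. What it lacks is concrete rules for what to do when it hits an edge case. This matches the core observation behind reflective prompt evolution (GEPA / TextGrad): when an LLM revises a prompt, the most informative signal is the specific boundary exposed by a failure case, not abstract methodology.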

Comparison with established methods

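Aspect	This experiment	DSPy + GEPA
Framework	hand-rolled, 7 Routers + SQLite event bus	declarative pipeline, Python program
Mutation method	4 strategies (ADD_CONSTRAINTS / PROMPT_ENHANCE / CODE_REVIEW / RETRY_ON_ERROR)	LLM reflective generation
Scale	3 rounds × 20 tasks	typically tens of rounds × hundreds of tasks
Observed result	50% → 75% (3 rounds, small sample)	GEPA averages 6% higher than GRPO (across 6 benchmarks)
Cross-task prompt reuse	not done	one of DSPy's core capabilities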

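In short: this experiment is a hand-rolled, small-sample check of the same class of problem. For production, use DSPy directly rather than reinventing the wheel. The value of this experiment is seeing which strategy works best inside a minimal closed loop.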

Practical observations

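ADD_CONSTRAINTS was the most effective strategy. Adding concrete edge-case constraints helps more than changing the methodology, provided the failed tasks are attributed to the right cause.
The reward must include a cost penalty. Otherwise the prompt keeps growing: the LLM keeps adding constraints until even irrelevant ones count as "passing".
Re-evaluate the baseline every round. LLM randomness made the same prompt score 50%, 60%, and 50% across three runs; without re-evaluation you get false wins.
Add stagnation detection. If reward ≤ 0 for N consecutive rounds, exit the search. Otherwise you waste tokens.
Do not attribute failures to "the model is too weak". In this experiment, 95% of failures came from a prompt that was not precise enough or lacked edge-case constraints, not from the model. Blaming the model removes the path to improvement.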
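The stagnation rule in the list above can be sketched in a few lines. The name `should_stop` and the default `patience=2` are assumptions for illustration; the report only says "N consecutive rounds" and does not give a value for N.

```python
def should_stop(rewards: list[float], patience: int = 2) -> bool:
    """Return True when the last `patience` rounds all had reward <= 0."""
    if len(rewards) < patience:
        # Not enough history yet to call it stagnation.
        return False
    return all(r <= 0 for r in rewards[-patience:])


# One reward per search round, oldest first.
stalled = should_stop([0.20, -0.10, 0.00])    # last two rounds are both <= 0
improving = should_stop([0.20, -0.10, 0.05])  # the latest round still improved
too_early = should_stop([-0.10])              # only one round of history
```

Checking only the trailing window means a single good round resets the counter, which keeps the search alive as long as any recent candidate still improves on its baseline.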

Known limitations

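This experiment ran for only 3 rounds on 20 Python programming tasks. The 50% → 75% result is a small-sample result and is not strong evidence for the method. Stronger evidence would need the usual DSPy / GEPA setup: hundreds of tasks and tens of rounds.
ADD_CONSTRAINTS may have been the most effective strategy because of the task type (Python programming, where failure modes are relatively structured). The same may not hold for open-ended tasks such as writing designs, planning, or copywriting.
Compared with mature frameworks, this implementation lacks many features: declarative pipelines, cross-task prompt reuse, checkpoint resume, and so on. For production, use DSPy + MIPROv2 / GEPA directly.
"An LLM rewriting its own prompt" has already been discussed widely. This experiment contributes a small, ablation-level data point, not a new research direction.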