Topic 6: What is evaluation (eval)? Not a "final check" but an "engine for improving every day"

Topic overview (about 1 minute)

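When you change an AI system, the most dangerous thing is the feeling that it "somehow seems better." Whether you actually improved it or merely broke something elsewhere cannot be known without measuring. This is where evaluation (eval for short) comes in. Eval has two faces. One is offline eval, run before shipping: the gatekeeper that decides whether to release or hold. The other is online eval, run after launch in production: an improvement loop that keeps fixing the agent every day, using the traces (records of execution) that real users leave behind, AB tests, and human review. Do not let eval end as "the final pass/fail check right before shipping"; turn it into an engine for improvement. That is the core of Topic 6.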
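The "gatekeeper" role of offline eval can be sketched in a few lines. This is a minimal illustration, not the method used in any of the talks: the `EvalCase` type, the substring-match scoring rule, the toy agents, and the `max_drop` threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable

Agent = Callable[[str], str]


@dataclass
class EvalCase:
    prompt: str
    expected: str  # text the output must contain (hypothetical scoring rule)


def score(agent: Agent, suite: list[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected text."""
    passed = sum(1 for case in suite if case.expected in agent(case.prompt))
    return passed / len(suite)


def release_gate(baseline: Agent, candidate: Agent,
                 suite: list[EvalCase], max_drop: float = 0.05) -> bool:
    """Allow the release only if the candidate does not regress by more
    than max_drop relative to the baseline on the same eval suite."""
    return score(candidate, suite) >= score(baseline, suite) - max_drop


# Toy eval suite and agents, purely illustrative.
suite = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
    EvalCase("opposite of hot", "cold"),
]

BASELINE_ANSWERS = {"2+2": "4", "capital of France": "Paris",
                    "3*3": "9", "opposite of hot": "cold"}
CANDIDATE_ANSWERS = {"2+2": "4", "capital of France": "Paris"}


def baseline_agent(prompt: str) -> str:
    return BASELINE_ANSWERS.get(prompt, "")


def broken_candidate(prompt: str) -> str:
    # Simulates a change that silently broke half of the cases.
    return CANDIDATE_ANSWERS.get(prompt, "")


if __name__ == "__main__":
    print("ship?", release_gate(baseline_agent, broken_candidate, suite))
```

The gate compares the candidate against the current baseline rather than against a fixed pass mark, which is what makes it a regression check: the question is "did this change make things worse?", not "is the score high?".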

Key vocabulary


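Term	Meaning in the context of the talks	General meaning	Etymology / origin
eval / evaluation	A mechanism you build for your own use case to measure output quality	evaluation, assessment	Latin valere ("to be worth")
benchmark	A general-purpose test for comparing models; distinct from an eval (built for your own use)	standard, reference point	a surveyor's reference mark (bench mark)
offline eval	Evaluation before shipping; a custom eval suite decides whether to release		off-line = cut off from production
online eval	The post-release improvement loop driven by traces, AB tests, and review		on-line = running in production
trace	The complete record of a single agent run	footprint, record	trace = the path left behind
trace clustering	Grouping similar traces to find clumps of failures		cluster = a group
AB testing	An experiment that serves two variants and measures which performs better		contrast between variant A and variant B
regression	A change that makes things worse than before	setback, retrogression	regress = to go back
failure mode	A pattern of how something fails; eliminated one at a time	type of failure	from failure modes in engineering
green field	A blank slate with no tests or framework; hard to evaluate	vacant lot	an open field with nothing built on it
production quality	A level usable in a real service rather than a demo	production-grade quality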
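To make the AB testing entry concrete, here is a minimal sketch of comparing two variants by success rate with a two-proportion z-test. The choice of test, the function name, and the sample counts are assumptions for illustration; the source does not say which statistic is used in practice.

```python
import math


def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rate
    between variant B and variant A, using the pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("success rates are all 0 or all 1; test undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: sessions that ended in a successful task.
    z, p = two_proportion_z(success_a=480, n_a=1000, success_b=530, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value only says the difference is unlikely to be noise. Whether the difference matters for users is still a judgment call, which is why the passage below stresses human review.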

English reading

English Reading Passage

When you change an AI system, the real question is whether it actually got better, and at scale, "vibes" are not enough. A good setup rests on two pillars. The first is offline eval: a custom eval suite run before shipping, which stops a release if there is a major regression. The second is online eval: every session in production leaves a trace, and millions of them show what users really do. Because a single trace is hard to debug on its own, the key move is trace clustering: grouping similar failures so that a recurring failure mode stands out. Important or controversial fixes can then be checked with AB testing when needed. Results are rarely clean, so human judgment still decides. All of this is hard because vibe coding starts from a green field with no tests, so grading cannot simply run a fixed test set. The takeaway: do not treat evaluation as a last check before shipping; treat it as an engine that ships a better, production-quality agent every day.

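Annotated translation

When you change an AI system, the real question is "did it actually get better?", and at scale "vibes" are not enough. A good setup is supported by two pillars. The first is offline eval (evaluation before release): a custom eval suite tailored to your own use case, run before shipping, which stops the release if there is a major regression (a step backward). The second is online eval (evaluation in production): each production session leaves a trace (a record of the run), and millions of them show what users are actually doing. Because a single trace is hard to debug on its own, the key is trace clustering: grouping similar failures so that a recurring failure mode (a pattern of failure) surfaces. Fixes with large impact, or ones where opinions are split, are verified with AB testing as needed. Results rarely come out cleanly black and white, so human judgment makes the final call. What makes this hard is that vibe coding (building apps from natural language) starts from a green field (an empty lot) with no tests, so grading cannot be done just by running a fixed test set. The point is to treat evaluation not as "the final check right before shipping" but as an "engine" that ships a better, production quality (production-grade) agent every day.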
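The idea behind trace clustering can be sketched with a deliberately simple stand-in: group failed traces by a normalized error signature and rank the groups by size. A production system would more likely cluster on richer features such as embeddings; the trace fields (`status`, `tool`, `error`) and the sample traces here are hypothetical.

```python
import re
from collections import Counter


def signature(trace: dict) -> str:
    """Normalize a failed trace into a coarse signature: the tool name plus
    the error message, lowercased, with digit runs replaced by <n>."""
    message = re.sub(r"\d+", "<n>", trace["error"].lower())
    return f'{trace["tool"]}: {message}'


def cluster_failures(traces: list[dict]) -> list[tuple[str, int]]:
    """Count failed traces per signature, largest cluster first."""
    counts = Counter(signature(t) for t in traces if t["status"] == "failed")
    return counts.most_common()


# Hypothetical traces, one dict per agent run.
traces = [
    {"status": "failed", "tool": "db_migrate", "error": "Timeout after 30 seconds"},
    {"status": "failed", "tool": "db_migrate", "error": "timeout after 45 seconds"},
    {"status": "ok", "tool": "deploy", "error": ""},
    {"status": "failed", "tool": "deploy", "error": "Port 8080 already in use"},
    {"status": "failed", "tool": "db_migrate", "error": "Timeout after 12 seconds"},
]

if __name__ == "__main__":
    for sig, count in cluster_failures(traces):
        print(count, sig)
```

Read one at a time, the three migration timeouts look like unrelated incidents. Grouped, they stand out as a single recurring failure mode, which is the one to fix first.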


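Vocabulary to learn if you plan to attend these talks

If you plan to attend the talks below, learning the key vocabulary above will make them much easier to follow on the day.
They are ordered by relevance to this topic, not by the quality of the talks themselves.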

Talks most closely aligned with this topic (4)

Reference recording (SF)

Evaluating and improving Replit Agent at scale

6/10 17:30 – 18:15 Workshop

The prompting playbook

6/11 11:30 – 12:00 Founder stage

The 1% problem: How domain expertise + Claude let a 2-person team hit #1 on a global classification benchmark

6/11 13:00 – 13:45 Workshop

Evals for taste: Hill-climbing a slide-generation agent

Talks with content close to this topic (6)

6/10 15:20 – 15:50 Main stage

The capability curve

6/10 12:30 – 13:15 Workshop

Stop babysitting your agents

6/11 10:45 – 11:15 Founder stage

From Claude prototype to production: How Myrealtrip builds and ships AI workflows

6/11 13:15 – 13:45 Founder stage

The model is ready. The harness is ready. The spec isn't.

6/11 14:00 – 14:45 Workshop

Tool, skill, or subagent? Decomposing an agent that outgrew its prompt

Reference recording

Caching, harnesses, and advisors: Building on Claude at GitHub scale
