Leaderboards

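Use Weave Leaderboards to evaluate and compare multiple models across multiple metrics, such as accuracy, generation quality, latency, or custom evaluation logic. A leaderboard gives you a central view of model performance, lets you track changes over time, and helps your team align on shared benchmarks. Leaderboards are well suited to:

Tracking model performance regressions
Coordinating shared evaluation workflows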

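Create a leaderboard

You can create a leaderboard in the Weave UI or programmatically.

UI

To create and customize a leaderboard directly in the Weave UI:

In the Weave UI, navigate to the Leaders section. If it is not visible, click More → Leaders.
Click + New Leaderboard.
In the Leaderboard Title field, enter a descriptive name (for example, summarization-benchmark-v1).
Optionally, add a description that explains what this leaderboard compares.
Add columns to define which evaluations and metrics to display.
When you are happy with the layout, save and publish the leaderboard to share it with others.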

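Add columns

Each column in a leaderboard represents one metric from one evaluation. To configure a column, specify the following:

Evaluation: Select an evaluation run from the dropdown (it must already exist).
Scorer: Select a scoring function used in that evaluation (for example, jaccard_similarity or simple_accuracy).
Metric: Select the summary metric to display (for example, mean or true_fraction).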
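
To make the Metric field concrete, here is a minimal sketch of how per-row scores could roll up into summary values such as mean and true_fraction. The helper name `summarize_scores` is hypothetical, and the aggregation rules are an assumption for illustration (numeric scores averaged, boolean scores reported as a count and fraction of True), not a description of Weave's implementation.

```python
from statistics import fmean


def summarize_scores(scores):
    """Roll per-row scores up into leaderboard-style summary metrics.

    Hypothetical helper: assumes boolean scores summarize to true_count and
    true_fraction, and numeric scores summarize to mean.
    """
    if not scores:
        return {}
    # Check for bool first, because bool is a subclass of int in Python.
    if all(isinstance(s, bool) for s in scores):
        true_count = sum(scores)
        return {"true_count": true_count, "true_fraction": true_count / len(scores)}
    return {"mean": fmean(scores)}


# Per-row outputs of a numeric scorer and a boolean scorer (made-up values).
jaccard_rows = [0.5, 0.25, 0.75]
accuracy_rows = [True, False, True, True]

print(summarize_scores(jaccard_rows))   # {'mean': 0.5}
print(summarize_scores(accuracy_rows))  # {'true_count': 3, 'true_fraction': 0.75}
```

Under these assumptions, a scorer that returns a float would expose mean in the Metric dropdown, while a scorer that returns a bool would expose true_fraction.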

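To add a column, click Add Column. To edit a column, click the three-dot menu (⋯) on its right. The following actions are available:

Move before / after: Change the column order.
Duplicate: Copy the column definition.
Delete: Remove the column.
Sort ascending: Set the leaderboard's default sort (click again to switch to descending).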
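
Which sort direction makes sense depends on the metric: higher is better for a similarity score, lower is better for latency. The sketch below illustrates this with plain Python; the rows, column names, and the `rank` helper are made up for the example and are not part of Weave.

```python
# Made-up leaderboard rows: one dict per model, one key per column.
rows = [
    {"model": "model_a", "mean_similarity": 0.41, "latency_ms": 120.0},
    {"model": "model_b", "mean_similarity": 0.58, "latency_ms": 310.0},
    {"model": "model_c", "mean_similarity": 0.17, "latency_ms": 95.0},
]


def rank(rows, column, ascending):
    """Return model names ordered by one column (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: r[column], reverse=not ascending)
    return [r["model"] for r in ordered]


# Higher is better for similarity, so sort descending.
print(rank(rows, "mean_similarity", ascending=False))  # ['model_b', 'model_a', 'model_c']
# Lower is better for latency, so sort ascending.
print(rank(rows, "latency_ms", ascending=True))        # ['model_c', 'model_a', 'model_b']
```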

Python

Looking for a complete, runnable code sample? See the end-to-end Python example.

To create and publish a leaderboard:

Define a test dataset. You can use the built-in Dataset or manually define a list of inputs and targets:
```
dataset = [
    {"input": "...", "target": "..."},
    ...
]
```

Define one or more scorers:

import weave

@weave.op
def jaccard_similarity(target: str, output: str) -> float:
    ...
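
Scorer arguments are matched to dataset columns by name, and the model's return value is passed in as output. The sketch below uses only the standard library to check that alignment before an evaluation runs. `check_scorer_columns` is a hypothetical helper, not part of Weave, and the plain function stands in for a scorer decorated with @weave.op.

```python
import inspect


def jaccard_similarity(target: str, output: str) -> float:
    """Token-level Jaccard similarity (plain function, no @weave.op)."""
    target_tokens = set(target.lower().split())
    output_tokens = set(output.lower().split())
    union = target_tokens | output_tokens
    return len(target_tokens & output_tokens) / len(union) if union else 0.0


def check_scorer_columns(scorer, dataset):
    """Return scorer parameters that are missing from at least one dataset row.

    Hypothetical helper: `output` is skipped because the model supplies it.
    """
    params = [p for p in inspect.signature(scorer).parameters if p != "output"]
    return [p for p in params if any(p not in row for row in dataset)]


good = [{"input": "a b c", "target": "a b"}]
bad = [{"input": "a b c", "reference": "a b"}]  # wrong column name

print(check_scorer_columns(jaccard_similarity, good))  # []
print(check_scorer_columns(jaccard_similarity, bad))   # ['target']
print(jaccard_similarity("a b", "a b c"))              # 0.6666666666666666
```

A non-empty result means the scorer expects a column that the dataset does not provide, which is easier to catch here than in the middle of an evaluation run.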

Create an Evaluation:

evaluation = weave.Evaluation(
    name="My Eval",
    dataset=dataset,
    scorers=[jaccard_similarity],
)

Define the model to evaluate:

@weave.op
def my_model(input: str) -> str:
    ...

Run the evaluation:

import asyncio

async def run_all():
    # Call evaluate() once for each model you want to compare.
    await evaluation.evaluate(my_model)

asyncio.run(run_all())
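
The pattern above awaits each evaluation in turn from a synchronous script. Below is a minimal sketch of the same pattern with a stand-in class, so that it runs without a Weave project. `StubEvaluation`, the two toy models, and the shape of the returned dictionary are all hypothetical and only mimic an object with an async evaluate method.

```python
import asyncio


class StubEvaluation:
    """Hypothetical stand-in for an evaluation object with an async evaluate()."""

    def __init__(self, dataset):
        self.dataset = dataset

    async def evaluate(self, model):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        outputs = [model(row["input"]) for row in self.dataset]
        return {"model": model.__name__, "rows": len(outputs)}


def model_upper(input: str) -> str:
    return input.upper()


def model_truncate(input: str) -> str:
    return input[:5]


async def run_all(evaluation, models):
    # Await one evaluation at a time so results arrive in a predictable order.
    return [await evaluation.evaluate(m) for m in models]


evaluation = StubEvaluation([{"input": "hello world"}, {"input": "weave"}])
summaries = asyncio.run(run_all(evaluation, [model_upper, model_truncate]))
print(summaries)
# [{'model': 'model_upper', 'rows': 2}, {'model': 'model_truncate', 'rows': 2}]
```

In a notebook, where an event loop is already running, asyncio.run raises a RuntimeError; awaiting run_all(...) directly is the usual alternative there.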

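Create the leaderboard:

from weave.flow import leaderboard
from weave.trace.ref_util import get_ref

spec = leaderboard.Leaderboard(
    name="My Leaderboard",
    description="Evaluating models on X task",
    columns=[
        # One column per (evaluation, scorer, summary metric) combination.
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluation).uri(),
            scorer_name="jaccard_similarity",
            summary_metric_path="mean",
        )
    ]
)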
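
A column is identified by three values: the evaluation, the scorer name, and the path of the summary metric. The sketch below models that lookup against a plain dictionary. The summary layout, the dot-separated path convention, and the helper `resolve_metric` are assumptions made for illustration; they do not describe Weave internals.

```python
def resolve_metric(summary, scorer_name, summary_metric_path):
    """Look up one leaderboard cell in a nested summary dict.

    Hypothetical helper: assumes the summary is keyed by scorer name and that
    the metric path is dot-separated (for example "mean" or
    "correct.true_fraction"). Returns None when the path does not exist.
    """
    node = summary.get(scorer_name)
    for key in summary_metric_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up evaluation summary for one model.
summary = {
    "jaccard_similarity": {"mean": 0.46},
    "simple_accuracy": {"correct": {"true_count": 2, "true_fraction": 0.5}},
}

print(resolve_metric(summary, "jaccard_similarity", "mean"))                 # 0.46
print(resolve_metric(summary, "simple_accuracy", "correct.true_fraction"))  # 0.5
print(resolve_metric(summary, "jaccard_similarity", "median"))              # None
```

A lookup that returns None corresponds to a column whose scorer name or metric path does not match anything the evaluation produced, so it is worth checking both strings for typos.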

Publish the leaderboard:
```
weave.publish(spec)
```

Retrieve the results:

# `client` is the object returned by weave.init(...).
results = leaderboard.get_leaderboard_results(spec, client)
print(results)

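End-to-end Python example

The following example uses Weave Evaluations to create a leaderboard that compares three summarization models on a shared dataset using a custom metric. It builds a small benchmark, evaluates each model, scores each model's output with Jaccard similarity, and publishes the results to a Weave leaderboard.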

import weave
from weave.flow import leaderboard
from weave.trace.ref_util import get_ref
import asyncio

client = weave.init("leaderboard-demo")

dataset = [
    {
        "input": "Weave is a tool for building interactive LLM apps. It offers observability, trace inspection, and versioning.",
        "target": "Weave helps developers build and observe LLM applications."
    },
    {
        "input": "The OpenAI GPT-4o model can process text, audio, and vision inputs, making it a multimodal powerhouse.",
        "target": "GPT-4o is a multimodal model for text, audio, and images."
    },
    {
        "input": "The W&B team recently added native support for agents and evaluations in Weave.",
        "target": "W&B added agents and evals to Weave."
    }
]

@weave.op
def jaccard_similarity(target: str, output: str) -> float:
    target_tokens = set(target.lower().split())
    output_tokens = set(output.lower().split())
    intersection = len(target_tokens & output_tokens)
    union = len(target_tokens | output_tokens)
    return intersection / union if union else 0.0

evaluation = weave.Evaluation(
    name="Summarization Quality",
    dataset=dataset,
    scorers=[jaccard_similarity],
)

@weave.op
def model_vanilla(input: str) -> str:
    return input[:50]

@weave.op
def model_humanlike(input: str) -> str:
    if "Weave" in input:
        return "Weave helps developers build and observe LLM applications."
    elif "GPT-4o" in input:
        return "GPT-4o supports text, audio, and vision input."
    else:
        return "W&B added agent support to Weave."

@weave.op
def model_messy(input: str) -> str:
    return "Summarizer summarize models model input text LLMs."

async def run_all():
    await evaluation.evaluate(model_vanilla)
    await evaluation.evaluate(model_humanlike)
    await evaluation.evaluate(model_messy)

asyncio.run(run_all())

spec = leaderboard.Leaderboard(
    name="Summarization Model Comparison",
    description="Evaluate summarizer models using Jaccard similarity on 3 short samples.",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluation).uri(),
            scorer_name="jaccard_similarity",
            summary_metric_path="mean",
        )
    ]
)

weave.publish(spec)

results = leaderboard.get_leaderboard_results(spec, client)
print(results)

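View and interpret the leaderboard

After the script finishes running, view the leaderboard:

In the Weave UI, go to the Leaders tab. If it is not visible, click More, then select Leaders.
Click the name of your leaderboard (for example, Summarization Model Comparison).

In the leaderboard table, each row represents one model (model_humanlike, model_vanilla, model_messy). The mean column shows the average Jaccard similarity between the model's output and the reference summaries.

Figure: A leaderboard in the Weave UI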

In this example:

model_humanlike performs best, with about 46% overlap.
model_vanilla (simple truncation) scores about 21%.
model_messy, an intentionally bad model, scores about 2%.
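
The percentages above can be reproduced locally. The sketch below repeats the dataset, scorer, and models from the end-to-end example as plain functions, without @weave.op or a Weave project, and averages the per-row scores. The helper `mean_score` is hypothetical and introduced only for this check.

```python
from statistics import fmean

dataset = [
    {
        "input": "Weave is a tool for building interactive LLM apps. It offers observability, trace inspection, and versioning.",
        "target": "Weave helps developers build and observe LLM applications.",
    },
    {
        "input": "The OpenAI GPT-4o model can process text, audio, and vision inputs, making it a multimodal powerhouse.",
        "target": "GPT-4o is a multimodal model for text, audio, and images.",
    },
    {
        "input": "The W&B team recently added native support for agents and evaluations in Weave.",
        "target": "W&B added agents and evals to Weave.",
    },
]


def jaccard_similarity(target: str, output: str) -> float:
    target_tokens = set(target.lower().split())
    output_tokens = set(output.lower().split())
    union = target_tokens | output_tokens
    return len(target_tokens & output_tokens) / len(union) if union else 0.0


def model_vanilla(input: str) -> str:
    return input[:50]


def model_humanlike(input: str) -> str:
    if "Weave" in input:
        return "Weave helps developers build and observe LLM applications."
    elif "GPT-4o" in input:
        return "GPT-4o supports text, audio, and vision input."
    else:
        return "W&B added agent support to Weave."


def model_messy(input: str) -> str:
    return "Summarizer summarize models model input text LLMs."


def mean_score(model) -> float:
    """Average Jaccard similarity of one model over the dataset (hypothetical helper)."""
    return fmean(jaccard_similarity(row["target"], model(row["input"])) for row in dataset)


for model in (model_humanlike, model_vanilla, model_messy):
    print(f"{model.__name__}: {mean_score(model):.2f}")
# model_humanlike: 0.46
# model_vanilla: 0.21
# model_messy: 0.02
```

The third input also contains the word "Weave", so model_humanlike returns its first summary for that row. That row scores about 0.07, which is why the model's mean is about 46% even though its first output matches the target exactly.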
