Verdict

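Weave is designed to make it easy to track and log all calls made through the Verdict Python library. Debugging is crucial when working with AI evaluation pipelines. When a pipeline step fails, outputs are unexpected, or nested operations create confusion, pinpointing the problem can be difficult. Verdict applications often consist of multiple pipeline steps, judges, and transformations, so understanding the inner workings of your evaluation workflows is essential. Weave simplifies this process by automatically capturing traces for your Verdict applications. This lets you monitor and analyze pipeline performance, making it easier to debug and optimize your AI evaluation workflows.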

Getting Started

To get started, simply call weave.init(project=...) at the beginning of your script. Use the project argument to log to a specific W&B team with team-name/project-name, or pass just project-name to log to your default team/entity.

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# Create a simple evaluation pipeline
pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Rate the quality of this text: {source.text}")

# Create sample data
data = Schema.of(text="This is a sample text for evaluation.")

# Run the pipeline - this will be automatically traced
output = pipeline.run(data)

print(output)
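
The two forms of the project argument described above (team-name/project-name versus a bare project-name) can be illustrated with a small standalone sketch. The helper split_project is hypothetical and is not part of the Weave API; it only shows how the two spellings differ, not how Weave parses them internally.

```python
from typing import Optional, Tuple


def split_project(project: str) -> Tuple[Optional[str], str]:
    """Split a project string into (team, project_name).

    Hypothetical helper for illustration only; not part of the Weave API.
    """
    team, sep, name = project.partition("/")
    if sep:
        # "team-name/project-name": an explicit team/entity was given
        return team, name
    # "project-name": no team given, so the default team/entity would apply
    return None, project


print(split_project("my-team/verdict_demo"))  # explicit team
print(split_project("verdict_demo"))          # default team/entity
```

A result of None for the team here simply stands for "use the default team/entity".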

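Tracking Call Metadata

To track metadata from your Verdict pipeline calls, you can use the weave.attributes context manager. This context manager lets you set custom metadata for a specific block of code, such as a pipeline run or an evaluation batch.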

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Evaluate sentiment: {source.text}")

data = Schema.of(text="I love this product!")

with weave.attributes({"evaluation_type": "sentiment", "batch_id": "batch_001"}):
    output = pipeline.run(data)

print(output)

Weave automatically records this metadata on the trace of the Verdict pipeline call. You can view the metadata in the Weave web interface.
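
To make the scoping behavior concrete, here is a minimal standard-library sketch of a context manager that attaches metadata only to calls made inside its block. It assumes a context-variable design; the names attributes and record_call are hypothetical stand-ins, and this is not Weave's actual implementation.

```python
import contextlib
import contextvars

# Holds the attributes that apply to the code currently executing.
_current_attributes = contextvars.ContextVar("current_attributes", default={})


@contextlib.contextmanager
def attributes(extra):
    """Make `extra` visible to every call recorded inside the with-block."""
    merged = {**_current_attributes.get(), **extra}
    token = _current_attributes.set(merged)
    try:
        yield
    finally:
        # Restore whatever was active before the block was entered.
        _current_attributes.reset(token)


def record_call(name):
    """Pretend to log a call, attaching whatever attributes are active."""
    return {"op": name, "attributes": dict(_current_attributes.get())}


with attributes({"evaluation_type": "sentiment", "batch_id": "batch_001"}):
    inside = record_call("pipeline.run")

outside = record_call("pipeline.run")
print(inside)
print(outside)
```

The call recorded inside the block carries both keys, while the call recorded after the block carries none.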

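Traces

Storing traces of AI evaluation pipelines in a central database is crucial during both development and production. These traces provide a valuable dataset for debugging and improving your evaluation workflows. Weave automatically captures traces for your Verdict applications. It tracks and logs all calls made through the Verdict library, including:

Pipeline execution steps
Judge unit evaluations
Layer transformations
Pooling operations
Custom units and transformations

You can view the traces in the Weave web interface, which shows the hierarchical structure of your pipeline execution.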

Pipeline Tracing Example

Here is a more complex example showing how Weave traces nested pipeline operations:

import weave
from verdict import Pipeline, Layer
from verdict.common.judge import JudgeUnit
from verdict.transform import MeanPoolUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# Create a complex pipeline with multiple steps
pipeline = Pipeline()
pipeline = pipeline >> Layer([
    JudgeUnit().prompt("Rate coherence: {source.text}"),
    JudgeUnit().prompt("Rate relevance: {source.text}"),
    JudgeUnit().prompt("Rate accuracy: {source.text}")
], 3)
pipeline = pipeline >> MeanPoolUnit()

# Sample data
data = Schema.of(text="This is a comprehensive evaluation of text quality across multiple dimensions.")

# Run the pipeline - all operations will be traced
result = pipeline.run(data)

print(f"Average score: {result}")

This creates a detailed trace showing:

The main Pipeline execution
Each JudgeUnit evaluation within the Layer
The MeanPoolUnit aggregation step
Timing information for each operation
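
The shape of such a trace can be sketched as a small tree. The Span dataclass and the judge scores below are made up for illustration, and timing is left out for brevity; the real trace is produced by Weave and viewed in its web interface.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Span:
    """One node in a trace tree (hypothetical structure, for illustration)."""
    name: str
    output: Optional[float] = None
    children: List["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, parents before children."""
    suffix = "" if span.output is None else f" -> {span.output}"
    lines = ["  " * depth + span.name + suffix]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up judge scores; a real run would get these from the judge model.
judge_scores = {"coherence": 4.0, "relevance": 5.0, "accuracy": 3.0}

layer = Span("Layer", children=[
    Span(f"JudgeUnit ({name})", output=score)
    for name, score in judge_scores.items()
])
# Mean pooling averages the three judge scores into a single value.
pool = Span("MeanPoolUnit", output=mean(list(judge_scores.values())))
root = Span("Pipeline", children=[layer, pool])

print("\n".join(render(root)))
```

Each JudgeUnit appears as a child of the Layer, and the pooled value sits alongside the Layer under the top-level Pipeline span.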

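Configuration

Calling weave.init() automatically enables tracing for Verdict pipelines. The integration patches the Pipeline.__init__ method to inject a VerdictTracer, which forwards all trace data to Weave. No additional configuration is needed. Weave automatically does the following:

Captures all pipeline operations
Tracks execution timing
Logs inputs and outputs
Maintains the trace hierarchy
Handles concurrent pipeline execution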
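
The patching approach described above can be sketched in plain Python. Everything here is a stand-in: this Pipeline class, RecordingTracer, and patch_pipeline_init are hypothetical and do not reproduce Verdict's or Weave's real code. The sketch only shows the general technique of wrapping __init__ so that every new instance receives an extra tracer.

```python
import functools


class Pipeline:
    """Stand-in class for this sketch; not Verdict's real Pipeline."""

    def __init__(self, name="pipeline", tracer=None):
        self.name = name
        self.tracer = list(tracer or [])


class RecordingTracer:
    """Hypothetical tracer that would forward trace data to a backend."""


def patch_pipeline_init(cls, tracer_factory):
    """Wrap cls.__init__ so every new instance also gets an injected tracer."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Append rather than replace, so user-supplied tracers keep working.
        self.tracer.append(tracer_factory())

    cls.__init__ = patched_init


patch_pipeline_init(Pipeline, RecordingTracer)

plain = Pipeline()
custom = Pipeline(tracer=["console"])  # "console" stands in for a user tracer
print(len(plain.tracer), len(custom.tracer))
```

Because the injected tracer is appended after the original constructor runs, a pipeline created with its own tracers keeps them and gains one more.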

Custom Tracers and Weave

If you are using custom Verdict tracers in your application, Weave's VerdictTracer can work alongside them:

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.util.tracing import ConsoleTracer
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# You can still use Verdict's built-in tracers
console_tracer = ConsoleTracer()

# Create pipeline with both Weave (automatic) and Console tracing
pipeline = Pipeline(tracer=[console_tracer])  # Weave tracer is added automatically
pipeline = pipeline >> JudgeUnit().prompt("Evaluate: {source.text}")

data = Schema.of(text="Sample evaluation text")

# This will trace to both Weave and console
result = pipeline.run(data)

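Models and Evaluations

Organizing and evaluating AI systems with multiple pipeline components can be challenging. With weave.Model, you can capture and organize experimental details such as prompts, pipeline configurations, and evaluation parameters, which makes it easier to compare different iterations. The following example shows how to wrap a Verdict pipeline in a weave.Model: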

import asyncio
import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

class TextQualityEvaluator(weave.Model):
    judge_prompt: str
    pipeline_name: str

    @weave.op()
    async def predict(self, text: str) -> dict:
        pipeline = Pipeline(name=self.pipeline_name)
        pipeline = pipeline >> JudgeUnit().prompt(self.judge_prompt)
        
        data = Schema.of(text=text)
        result = pipeline.run(data)
        
        return {
            "text": text,
            "quality_score": result.score if hasattr(result, 'score') else result,
            "evaluation_prompt": self.judge_prompt
        }

model = TextQualityEvaluator(
    judge_prompt="Rate the quality of this text on a scale of 1-10: {source.text}",
    pipeline_name="text_quality_evaluator"
)

text = "This is a well-written and informative piece of content that provides clear value to readers."

prediction = asyncio.run(model.predict(text))

# if you're in a Jupyter Notebook, run:
# prediction = await model.predict(text)

print(prediction)

This code creates a model that can be visualized in the Weave UI, showing both the pipeline structure and the evaluation results.
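
Declaring experiment details as typed fields is what makes iterations easy to compare. The sketch below uses a plain dataclass as a hypothetical stand-in for the model's fields (it is not weave.Model) and a made-up second prompt to show a field-by-field comparison of two variants.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvaluatorConfig:
    """Hypothetical stand-in for the fields declared on the model above."""
    judge_prompt: str
    pipeline_name: str


def diff_configs(a, b):
    """Return {field: (old, new)} for every field whose value differs."""
    old, new = asdict(a), asdict(b)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


v1 = EvaluatorConfig(
    judge_prompt="Rate the quality of this text on a scale of 1-10: {source.text}",
    pipeline_name="text_quality_evaluator",
)
# A second iteration that changes only the prompt (made-up wording).
v2 = EvaluatorConfig(
    judge_prompt="Rate clarity and usefulness on a scale of 1-10: {source.text}",
    pipeline_name="text_quality_evaluator",
)

changed = diff_configs(v1, v2)
print(sorted(changed))
```

Only judge_prompt shows up in the diff, which is the kind of difference you would want surfaced when comparing two iterations of an evaluator.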

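Evaluations

Evaluations help you measure the performance of the evaluation pipelines themselves. By using the weave.Evaluation class, you can capture how well your Verdict pipeline performs on specific tasks or datasets: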

import asyncio
import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave
weave.init("verdict_demo")

# Create evaluation model
class SentimentEvaluator(weave.Model):
    @weave.op()
    async def predict(self, text: str) -> dict:
        pipeline = Pipeline()
        pipeline = pipeline >> JudgeUnit().prompt(
            "Classify sentiment as positive, negative, or neutral: {source.text}"
        )
        
        data = Schema.of(text=text)
        result = pipeline.run(data)
        
        return {"sentiment": result}

# Test data
texts = [
    "I love this product, it's amazing!",
    "This is terrible, worst purchase ever.",
    "The weather is okay today."
]
labels = ["positive", "negative", "neutral"]

examples = [
    {"id": str(i), "text": texts[i], "target": labels[i]}
    for i in range(len(texts))
]

# Scoring function
@weave.op()
def sentiment_accuracy(target: str, output: dict) -> dict:
    # Coerce to str so a non-string pipeline result does not raise on .lower()
    predicted = str(output.get("sentiment", "")).lower()
    return {"correct": target.lower() in predicted}

model = SentimentEvaluator()

evaluation = weave.Evaluation(
    dataset=examples,
    scorers=[sentiment_accuracy],
)

scores = asyncio.run(evaluation.evaluate(model))
# if you're in a Jupyter Notebook, run:
# scores = await evaluation.evaluate(model)

print(scores)

This creates an evaluation trace that shows how your Verdict pipeline performs across different test cases.
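
Before running a full evaluation, it can help to check the scoring rule on its own. The sketch below mirrors the matching rule of the scorer above as a plain function (no weave.op decorator, with str() added so non-string outputs do not raise) and feeds it hand-written outputs. No model is called, and the cases are made up for illustration.

```python
def sentiment_accuracy(target: str, output: dict) -> dict:
    """Mark a prediction correct when the target label appears in the output text."""
    predicted = str(output.get("sentiment", "")).lower()
    return {"correct": target.lower() in predicted}


# Hand-written stand-ins for judge responses; no model is called here.
cases = [
    ("positive", {"sentiment": "Positive"}),
    ("negative", {"sentiment": "The sentiment is negative."}),
    ("neutral", {"sentiment": "positive"}),       # a wrong prediction
    ("positive", {"sentiment": "not positive"}),  # substring match still passes
]

results = [sentiment_accuracy(target, output)["correct"] for target, output in cases]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

The last case shows a limitation of substring matching: "not positive" still counts as a match for "positive", so a stricter comparison may be worth considering if the judge returns free-form text.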

Best Practices

Performance Monitoring

Weave automatically captures timing information for all pipeline operations. You can use this to identify performance bottlenecks:

import weave
from verdict import Pipeline, Layer
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

weave.init("verdict_demo")

# Create a pipeline that might have performance variations
pipeline = Pipeline()
pipeline = pipeline >> Layer([
    JudgeUnit().prompt("Quick evaluation: {source.text}"),
    JudgeUnit().prompt("Detailed analysis: {source.text}"),  # This might be slower
], 2)

data = Schema.of(text="Sample text for performance testing")

# Run multiple times to see timing patterns
for i in range(3):
    with weave.attributes({"run_number": i}):
        result = pipeline.run(data)
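
Once several runs have been traced, finding the bottleneck comes down to comparing per-unit durations. The following sketch assumes you have copied durations for each unit out of three traces; the numbers and the summarize helper are made up for illustration and are not produced by Weave.

```python
from statistics import mean

# Made-up per-run durations in seconds for each unit, as read off three traces.
durations = {
    "Quick evaluation": [0.8, 0.9, 0.7],
    "Detailed analysis": [2.1, 2.4, 1.8],
}


def summarize(samples):
    """Compute the mean and worst-case duration for every unit."""
    return {
        name: {"mean": mean(values), "max": max(values)}
        for name, values in samples.items()
    }


summary = summarize(durations)
# The unit with the highest mean duration is the first place to look.
bottleneck = max(summary, key=lambda name: summary[name]["mean"])
print(bottleneck)
```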

Error Handling

Weave automatically captures exceptions that occur during pipeline execution:

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

weave.init("verdict_demo")

pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Process: {source.invalid_field}")  # This will cause an error

data = Schema.of(text="Sample text")

try:
    result = pipeline.run(data)
except Exception as e:
    print(f"Pipeline failed: {e}")
    # Error details are captured in Weave trace
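
The pattern of recording an exception on the trace and then letting it propagate can be sketched with the standard library alone. The traced wrapper and trace_log list are hypothetical stand-ins for a tracing backend, and the sketch assumes the prompt is filled in with Python's str.format; Verdict's own templating may behave differently.

```python
from types import SimpleNamespace

trace_log = []  # stand-in for a trace backend


def traced(name, fn, *args, **kwargs):
    """Run fn, record success or failure, and re-raise any exception."""
    entry = {"op": name, "status": "ok", "error": None}
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        entry["status"] = "error"
        entry["error"] = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the original exception
    finally:
        trace_log.append(entry)


def render_prompt(template, source):
    # str.format resolves "{source.text}" by attribute lookup on `source`
    return template.format(source=source)


source = SimpleNamespace(text="Sample text")

try:
    traced("render_prompt", render_prompt, "Process: {source.invalid_field}", source)
except AttributeError as exc:
    print(f"Pipeline failed: {exc}")

print(trace_log)
```

The failure is recorded in the log entry, and the surrounding try/except still receives the original exception, which matches the division of labor in the example above.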

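By integrating Weave with Verdict, you get comprehensive observability into your AI evaluation pipelines, making it easier to debug, optimize, and understand your evaluation workflows.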
