Builtin scorers

Installation

To use Weave's predefined scorers, you need to install some additional dependencies:

pip install weave[scorers]

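LLM-evaluators

Update, February 2025: The predefined scorers that use LLMs now integrate with litellm automatically. You no longer need to pass an LLM client; just set the model_id. For the list of supported models, see the litellm documentation.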

`HallucinationFreeScorer`

This scorer checks whether your AI system's output contains any hallucinations, judged against the input data.

from weave.scorers import HallucinationFreeScorer

scorer = HallucinationFreeScorer()

Customization:

Customize the scorer's system_prompt and user_prompt fields to define what "hallucination" means for you.

Notes:

The score method expects an input column named context. If your dataset uses a different name, use the column_map attribute to map context to the dataset column.
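To make the direction of the mapping concrete, here is a minimal plain-Python sketch of the idea behind column_map: keys are the argument names the scorer expects, and values are the dataset column names. The helper apply_column_map is hypothetical and is not part of Weave; how Weave resolves the mapping internally may differ.

```python
def apply_column_map(row: dict, column_map: dict) -> dict:
    """Return a copy of `row` keyed by the names the scorer expects.

    `column_map` maps scorer argument name -> dataset column name.
    Hypothetical helper for illustration only; not Weave's implementation.
    """
    # Reverse lookup: dataset column name -> scorer argument name.
    renamed = {dataset_col: scorer_arg for scorer_arg, dataset_col in column_map.items()}
    # Columns without a mapping keep their original name.
    return {renamed.get(col, col): value for col, value in row.items()}


row = {"input": "John likes various types of cheese."}
scorer_kwargs = apply_column_map(row, {"context": "input"})
print(scorer_kwargs)  # {'context': 'John likes various types of cheese.'}
```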

Here is an example in the context of an evaluation:

import asyncio
import weave
from weave.scorers import HallucinationFreeScorer

# Initialize scorer with a column mapping if needed.
hallucination_scorer = HallucinationFreeScorer(
    model_id="openai/gpt-4o", # or any other model supported by litellm
    column_map={"context": "input", "output": "other_col"}
)

# Create dataset
dataset = [
    {"input": "John likes various types of cheese."},
    {"input": "Pepe likes various types of cheese."},
]

@weave.op
def model(input: str) -> str:
    return "The person's favorite cheese is cheddar."

# Run evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[hallucination_scorer],
)
result = asyncio.run(evaluation.evaluate(model))
print(result)
# Example output:
# {'HallucinationFreeScorer': {'has_hallucination': {'true_count': 2, 'true_fraction': 1.0}}, 'model_latency': {'mean': ...}}

`SummarizationScorer`

Uses an LLM to compare a summary against the original text and evaluate the quality of the summary.

from weave.scorers import SummarizationScorer

scorer = SummarizationScorer(
    model_id="openai/gpt-4o"  # or any other model supported by litellm
)

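How it works: This scorer evaluates summaries in two ways:

Entity density: Checks the ratio of unique entities (such as names, places, or things) mentioned in the summary to the total word count of the summary, in order to estimate the "information density" of the summary. It uses an LLM to extract the entities. This is similar to how entity density is used in the Chain of Density paper, https://arxiv.org/abs/2309.04269
Quality grading: An LLM evaluator grades the summary as poor, ok, or excellent. These grades are mapped to scores (0.0 for poor, 0.5 for ok, and 1.0 for excellent), which gives an aggregate performance evaluation.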
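The two measurements can be sketched in plain Python. This is an illustration under stated assumptions, not the scorer's implementation: the entity list is written by hand in place of the LLM extraction step, and splitting on whitespace to count words and deduplicating entities case-insensitively are choices made for this sketch. The names entity_density and GRADE_SCORES are hypothetical.

```python
# Grade-to-score mapping as described in the text.
GRADE_SCORES = {"poor": 0.0, "ok": 0.5, "excellent": 1.0}


def entity_density(summary: str, entities: list[str]) -> float:
    """Unique entities divided by the total word count of the summary."""
    words = summary.split()  # assumption: whitespace-delimited word count
    if not words:
        return 0.0
    # Assumption: entities are deduplicated case-insensitively.
    unique_entities = {entity.lower() for entity in entities}
    return len(unique_entities) / len(words)


summary = "Marie Curie won the Nobel Prize in Paris."
# Hand-written stand-in for the LLM entity extraction step.
entities = ["Marie Curie", "Nobel Prize", "Paris", "paris"]

print(entity_density(summary, entities))  # 3 unique entities / 8 words = 0.375
print(GRADE_SCORES["ok"])  # 0.5
```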

Customization:

Adjust summarization_evaluation_system_prompt and summarization_evaluation_prompt to tune the evaluation process.

Notes:

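This scorer uses litellm internally.
The score method expects the original text (the text being summarized) to be present in the input column. If your dataset uses a different name, use column_map.

Here is an example of usage in the context of an evaluation: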

import asyncio
import weave
from weave.scorers import SummarizationScorer

class SummarizationModel(weave.Model):
    @weave.op()
    async def predict(self, input: str) -> str:
        return "This is a summary of the input text."

# Initialize scorer
summarization_scorer = SummarizationScorer(
    model_id="openai/gpt-4o"  # or any other model supported by litellm
)
# Create dataset
dataset = [
    {"input": "The quick brown fox jumps over the lazy dog."},
    {"input": "Artificial Intelligence is revolutionizing various industries."}
]
# Run evaluation
evaluation = weave.Evaluation(dataset=dataset, scorers=[summarization_scorer])
results = asyncio.run(evaluation.evaluate(SummarizationModel()))
print(results)
# Example output:
# {'SummarizationScorer': {'is_entity_dense': {'true_count': 0, 'true_fraction': 0.0}, 'summarization_eval_score': {'mean': 0.0}, 'entity_density': {'mean': 0.0}}, 'model_latency': {'mean': ...}}

`OpenAIModerationScorer`

The OpenAIModerationScorer uses OpenAI's Moderation API to check whether the AI system's output contains disallowed content, such as hate speech or explicit material.

from weave.scorers import OpenAIModerationScorer

scorer = OpenAIModerationScorer()

How it works:

Sends the AI's output to the OpenAI Moderation endpoint and returns a structured response indicating whether the content is flagged.

Here is an example in the context of an evaluation:

import asyncio
import weave
from weave.scorers import OpenAIModerationScorer

class MyModel(weave.Model):
    @weave.op
    async def predict(self, input: str) -> str:
        return input

# Initialize scorer
moderation_scorer = OpenAIModerationScorer()

# Create dataset
dataset = [
    {"input": "I love puppies and kittens!"},
    {"input": "I hate everyone and want to hurt them."}
]

# Run evaluation
evaluation = weave.Evaluation(dataset=dataset, scorers=[moderation_scorer])
results = asyncio.run(evaluation.evaluate(MyModel()))
print(results)
# Example output:
# {'OpenAIModerationScorer': {'flagged': {'true_count': 1, 'true_fraction': 0.5}, 'categories': {'violence': {'true_count': 1, 'true_fraction': 1.0}}}, 'model_latency': {'mean': ...}}

`EmbeddingSimilarityScorer`

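The EmbeddingSimilarityScorer computes the cosine similarity between the embeddings of the AI system's output and a target text from your dataset. It is useful for measuring how similar the AI's output is to a reference text.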

from weave.scorers import EmbeddingSimilarityScorer

similarity_scorer = EmbeddingSimilarityScorer(
    model_id="openai/text-embedding-3-small",  # or any other model supported by litellm
    threshold=0.4  # the cosine similarity threshold
)

Note: You can use column_map to map the target column to a different name.

Parameters:

threshold (float): The minimum cosine similarity score (between -1 and 1) needed to consider the two texts similar (defaults to 0.5).
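As a rough illustration of what the threshold means, the sketch below computes cosine similarity for two small hand-written vectors and compares it to a threshold. The vectors stand in for real embeddings, the function names are hypothetical, and treating a score exactly equal to the threshold as similar is an assumption of this sketch rather than documented scorer behavior.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_similar(a: list[float], b: list[float], threshold: float = 0.5) -> bool:
    # Assumption: a score equal to the threshold counts as similar.
    return cosine_similarity(a, b) >= threshold


output_embedding = [1.0, 0.0]
target_embedding = [1.0, 1.0]
score = cosine_similarity(output_embedding, target_embedding)
print(round(score, 4))  # 0.7071
print(is_similar(output_embedding, target_embedding, threshold=0.5))  # True
print(is_similar(output_embedding, target_embedding, threshold=0.8))  # False
```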

Here is an example of usage in the context of an evaluation:

import asyncio
import weave
from weave.scorers import EmbeddingSimilarityScorer

# Initialize scorer
similarity_scorer = EmbeddingSimilarityScorer(
    model_id="openai/text-embedding-3-small",  # or any other model supported by litellm
    threshold=0.7
)
# Create dataset
dataset = [
    {
        "input": "His name is John",
        "target": "John likes various types of cheese.",
    },
    {
        "input": "His name is Pepe.",
        "target": "Pepe likes various types of cheese.",
    },
]
# Define model
@weave.op
def model(input: str) -> str:
    return "John likes various types of cheese."

# Run evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[similarity_scorer],
)
result = asyncio.run(evaluation.evaluate(model))
print(result)
# Example output:
# {'EmbeddingSimilarityScorer': {'is_similar': {'true_count': 1, 'true_fraction': 0.5}, 'similarity_score': {'mean': 0.844851403}}, 'model_latency': {'mean': ...}}

`ValidJSONScorer`

The ValidJSONScorer checks whether the AI system's output is valid JSON. This scorer is useful when you expect the output to be in JSON format and need to verify its validity.

from weave.scorers import ValidJSONScorer

json_scorer = ValidJSONScorer()
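For intuition, a validity check of this kind can be sketched with the standard library alone. This is an illustration, not the scorer's code: the helper is_valid_json is hypothetical, and whether the scorer accepts the same edge cases as json.loads (for example, bare scalars) is not stated here.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if `output` parses as JSON, False otherwise."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed text; TypeError: input is not str/bytes.
        return False
    return True


print(is_valid_json('{"key": "value"}'))  # True
print(is_valid_json("{'key': 'value'}"))  # False: JSON requires double quotes
print(is_valid_json('{"key": "value"'))   # False: missing closing brace
```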

Here is an example in the context of an evaluation:

import asyncio
import weave
from weave.scorers import ValidJSONScorer

class JSONModel(weave.Model):
    @weave.op()
    async def predict(self, input: str) -> str:
        # This is a placeholder.
        # In a real scenario, this would generate JSON.
        return '{"key": "value"}'

model = JSONModel()
json_scorer = ValidJSONScorer()

dataset = [
    {"input": "Generate a JSON object with a key and value"},
    {"input": "Create an invalid JSON"}
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[json_scorer])
results = asyncio.run(evaluation.evaluate(model))
print(results)
# Example output:
# {'ValidJSONScorer': {'json_valid': {'true_count': 2, 'true_fraction': 1.0}}, 'model_latency': {'mean': ...}}

`ValidXMLScorer`

The ValidXMLScorer checks whether the AI system's output is valid XML. It is useful when you expect XML-formatted output.

from weave.scorers import ValidXMLScorer

xml_scorer = ValidXMLScorer()
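A comparable check can be sketched with the standard library XML parser. This is an illustration only: the helper is_valid_xml is hypothetical, and it tests well-formedness (the text parses), not validity against a schema. Which of the two the scorer checks is not stated here.

```python
import xml.etree.ElementTree as ET


def is_valid_xml(output: str) -> bool:
    """Return True if `output` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(output)
    except ET.ParseError:
        return False
    return True


print(is_valid_xml("<root><element>value</element></root>"))  # True
print(is_valid_xml("<root><element>value</root>"))  # False: mismatched tag
print(is_valid_xml("plain text"))  # False: no root element
```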

Here is an example in the context of an evaluation:

import asyncio
import weave
from weave.scorers import ValidXMLScorer

class XMLModel(weave.Model):
    @weave.op()
    async def predict(self, input: str) -> str:
        # This is a placeholder. In a real scenario, this would generate XML.
        return '<root><element>value</element></root>'

model = XMLModel()
xml_scorer = ValidXMLScorer()

dataset = [
    {"input": "Generate a valid XML with a root element"},
    {"input": "Create an invalid XML"}
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[xml_scorer])
results = asyncio.run(evaluation.evaluate(model))
print(results)
# Example output:
# {'ValidXMLScorer': {'xml_valid': {'true_count': 2, 'true_fraction': 1.0}}, 'model_latency': {'mean': ...}}

`PydanticScorer`

The PydanticScorer validates the AI system's output against a Pydantic model to ensure it adheres to a specified schema or data structure.

from weave.scorers import PydanticScorer
from pydantic import BaseModel

class FinancialReport(BaseModel):
    revenue: int
    year: str

pydantic_scorer = PydanticScorer(model=FinancialReport)

RAGAS - `ContextEntityRecallScorer`

The ContextEntityRecallScorer estimates context recall by extracting entities from both the AI system's output and the provided context, then computing the recall score. It is based on the RAGAS evaluation library.

from weave.scorers import ContextEntityRecallScorer

entity_recall_scorer = ContextEntityRecallScorer(
    model_id="openai/gpt-4o"
)

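How it works:

Uses an LLM to extract unique entities from the output and the context, and calculates recall.
Recall indicates the proportion of important entities from the context that are captured in the output.
Returns a dictionary containing the recall score.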
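The recall calculation described above can be sketched with Python sets. This is a simplified illustration, not the scorer's implementation: the entity sets are written by hand in place of the LLM extraction step, the function name entity_recall is hypothetical, and returning 0.0 for an empty context is an assumption of this sketch.

```python
def entity_recall(context_entities: set[str], output_entities: set[str]) -> dict:
    """Fraction of context entities that also appear in the output."""
    if not context_entities:
        # Assumption: no context entities means nothing to recall.
        return {"recall": 0.0}
    matched = context_entities & output_entities
    return {"recall": len(matched) / len(context_entities)}


# Hand-written stand-ins for the LLM entity extraction step.
context_entities = {"Paris", "France", "Seine", "Eiffel Tower"}
output_entities = {"Paris", "France", "Europe"}

print(entity_recall(context_entities, output_entities))  # {'recall': 0.5}
```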

Notes:

Assumes your dataset has a context column. If the column name is different, use the column_map attribute.

RAGAS - `ContextRelevancyScorer`

The ContextRelevancyScorer evaluates how relevant the provided context is to the AI system's output. It is based on the RAGAS evaluation library.

from weave.scorers import ContextRelevancyScorer

relevancy_scorer = ContextRelevancyScorer(
    model_id="openai/gpt-4o",  # or any other model supported by litellm
    relevancy_prompt="""
Given the following question and context, rate the relevancy of the context to the question on a scale from 0 to 1.

Question: {question}
Context: {context}
Relevancy Score (0-1):
"""
)

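How it works:

Uses an LLM to rate the relevancy of the context to the output on a scale from 0 to 1.
Returns a dictionary containing the relevancy_score.

Notes:

Assumes your dataset has a context column. If a different name is used, use column_map.
Customize the relevancy_prompt to define how relevancy is assessed.

Here is an example of usage in the context of an evaluation: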

import asyncio
from textwrap import dedent
import weave
from weave.scorers import ContextEntityRecallScorer, ContextRelevancyScorer

class RAGModel(weave.Model):
    @weave.op()
    async def predict(self, question: str) -> str:
        "Retrieve relevant context"
        return "Paris is the capital of France."

# Define prompts
relevancy_prompt: str = dedent("""
    Given the following question and context, rate the relevancy of the context to the question on a scale from 0 to 1.

    Question: {question}
    Context: {context}
    Relevancy Score (0-1):
    """)
# Initialize scorers
entity_recall_scorer = ContextEntityRecallScorer()
relevancy_scorer = ContextRelevancyScorer(relevancy_prompt=relevancy_prompt)
# Create dataset
dataset = [
    {
        "question": "What is the capital of France?",
        "context": "Paris is the capital city of France."
    },
    {
        "question": "Who wrote Romeo and Juliet?",
        "context": "William Shakespeare wrote many famous plays."
    }
]
# Run evaluation
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[entity_recall_scorer, relevancy_scorer]
)
results = asyncio.run(evaluation.evaluate(RAGModel()))
print(results)
# Example output:
# {'ContextEntityRecallScorer': {'recall': {'mean': ...}}, 
# 'ContextRelevancyScorer': {'relevancy_score': {'mean': ...}}, 
# 'model_latency': {'mean': ...}}

This feature is not available in TypeScript yet. Stay tuned!

Note: The built-in scorers were calibrated using OpenAI models (for example, openai/gpt-4o and openai/text-embedding-3-small). If you want to try other providers, simply update the model_id. For example, to use an Anthropic model:

from weave.scorers import SummarizationScorer

# Switch to Anthropic's Claude model
summarization_scorer = SummarizationScorer(
    model_id="anthropic/claude-3-5-sonnet-20240620"
)
