
# Tutorial eval

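To improve an application, you need a way to evaluate whether it is actually getting better. A common practice is to test it against the same set of examples every time something changes. Weave provides a first-class way to track evaluations with the `Model` and `Evaluation` classes. The APIs are built with minimal assumptions, so they stay flexible across a wide range of use cases.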

![Evals hero](https://mintlify.s3.us-west-1.amazonaws.com/wb-21fd5541-feature-automate-reference-docs-generation/ja/images/evals-hero.png)

## 1. `Model`

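A `Model` stores and versions information about your system, such as prompts and temperatures.
Weave automatically captures when a model is used and updates the version when it changes.

You declare a `Model` by subclassing `weave.Model` and implementing a `predict` function, which takes one example and returns the response.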

<Important>
  **Known issue**: If you are using Google Colab, remove `async` from the following examples.
</Important>

<CodeGroup>
  ```python Python
  import json
  import openai
  import weave

  class ExtractFruitsModel(weave.Model):
      model_name: str
      prompt_template: str

      @weave.op()
      async def predict(self, sentence: str) -> dict:
          client = openai.AsyncClient()

          response = await client.chat.completions.create(
              model=self.model_name,
              messages=[
                  {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
              ],
          )
          result = response.choices[0].message.content
          if result is None:
              raise ValueError("No response from model")
          parsed = json.loads(result)
          return parsed
  ```

  ```typescript TypeScript
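  // Note: weave.Model is not supported in TypeScript yet.
  // Instead, wrap your model-like function with weave.op
  import OpenAI from 'openai';
  import * as weave from 'weave';

  // The client reads OPENAI_API_KEY from the environment.
  const openaiClient = new OpenAI();

  const model = weave.op(async function myModel({datasetRow}) {
    const prompt = `Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: ${datasetRow.sentence}`;
    const response = await openaiClient.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [{ role: 'user', content: prompt }],
      response_format: { type: 'json_object' }
    });
    // content can be null, so fail loudly as the Python example does
    const content = response.choices[0].message.content;
    if (content === null) {
      throw new Error('No response from model');
    }
    return JSON.parse(content);
  });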
  ```
</CodeGroup>

You can instantiate `Model` objects as usual:

<CodeGroup>
  ```python Python
  import asyncio
  import weave

  weave.init('intro-example')

  model = ExtractFruitsModel(
      model_name='gpt-3.5-turbo-1106',
      prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
  )
  sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."
  print(asyncio.run(model.predict(sentence)))
  # if you're in a Jupyter Notebook, run:
  # await model.predict(sentence)
  ```

  ```typescript TypeScript
  await weave.init('intro-example');

  const sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.";
  const result = await model({ datasetRow: { sentence } });
  console.log(result);
  ```
</CodeGroup>
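The `predict` method above does two things besides calling the API: it fills the `{sentence}` placeholder in the prompt template, and it parses the reply as JSON. As a minimal sketch, those two steps can be checked offline with a canned reply instead of a real API call. The helper names `build_prompt` and `parse_reply` are hypothetical and are not part of Weave or the OpenAI SDK.

```python
import json
from typing import Optional

# Same template as in the example above.
PROMPT_TEMPLATE = (
    'Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) '
    "from the following text, as json: {sentence}"
)


def build_prompt(template: str, sentence: str) -> str:
    """Fill the {sentence} placeholder, as predict does."""
    return template.format(sentence=sentence)


def parse_reply(raw: Optional[str]) -> dict:
    """Hypothetical helper: validate a model reply and parse it as JSON."""
    if raw is None:
        raise ValueError("No response from model")
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model reply is not valid JSON: {raw!r}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object")
    return parsed


# A canned reply stands in for the API response (assumption: no network call).
canned_reply = '{"fruit": "neoskizzles", "color": "purple", "flavor": "candy"}'
prompt = build_prompt(PROMPT_TEMPLATE, "Neoskizzles are purple and taste like candy.")
result = parse_reply(canned_reply)
print(result["fruit"])
```

Wrapping the JSON error in a `ValueError` is a design choice for this sketch; it keeps a malformed reply from surfacing as an unexplained decode error during an evaluation run.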

<Note>
  See the [Models](/ja/guides/core-types/models) guide to learn more.
</Note>

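## 2. Collect some examples

Next, you need a dataset to evaluate your model on. A `Dataset` is a collection of examples stored as a Weave object. You can download and browse datasets, and run evaluations on them, in the Weave UI.

Here the list of examples is built in code, but you can also log examples one at a time from your running application.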

<CodeGroup>
  ```python Python
  sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
  "Pounits are a bright green color and are more savory than sweet.",
  "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
  labels = [
      {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
      {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
      {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
  ]
  examples = [
      {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
      {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
      {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
  ]
  ```

  ```typescript TypeScript
  const sentences = [
    "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
  ];
  const labels = [
    { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' },
    { fruit: 'pounits', color: 'bright green', flavor: 'savory' },
    { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' }
  ];
  const examples = sentences.map((sentence, i) => ({
    id: i.toString(),
    sentence,
    target: labels[i]
  }));
  ```
</CodeGroup>
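When the example list grows, indexing `sentences` and `labels` by hand becomes error-prone. As a sketch, a small helper can pair them and check the labels before they are published. `build_examples` and `REQUIRED_LABEL_KEYS` are hypothetical names introduced here for illustration.

```python
# Keys every label is expected to carry (assumption based on the examples above).
REQUIRED_LABEL_KEYS = {"fruit", "color", "flavor"}


def build_examples(sentences: list, labels: list) -> list:
    """Pair each sentence with its label and assign a string id."""
    if len(sentences) != len(labels):
        raise ValueError(
            f"Got {len(sentences)} sentences but {len(labels)} labels"
        )
    examples = []
    for i, (sentence, label) in enumerate(zip(sentences, labels)):
        missing = REQUIRED_LABEL_KEYS - set(label)
        if missing:
            raise ValueError(f"Label {i} is missing keys: {sorted(missing)}")
        examples.append({"id": str(i), "sentence": sentence, "target": label})
    return examples


demo_sentences = [
    "Pounits are a bright green color and are more savory than sweet.",
]
demo_labels = [
    {"fruit": "pounits", "color": "bright green", "flavor": "savory"},
]
demo_examples = build_examples(demo_sentences, demo_labels)
print(demo_examples[0]["id"], demo_examples[0]["target"]["fruit"])
```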

Then publish the dataset:

<CodeGroup>
  ```python Python
  import weave
  weave.init('intro-example')
  dataset = weave.Dataset(name='fruits', rows=examples)
  weave.publish(dataset)
  ```

  ```typescript TypeScript
  import * as weave from 'weave';
  await weave.init('intro-example');
  const dataset = new weave.Dataset({
    name: 'fruits',
    rows: examples
  });
  await dataset.save();
  ```
</CodeGroup>

<Note>
  See the [Datasets](/ja/guides/core-types/datasets) guide to learn more.
</Note>

## 3. Define scoring functions

An `Evaluation` assesses a `Model`'s performance on a set of examples, using a list of scoring functions or `weave.scorer.Scorer` classes that you specify.

<CodeGroup>
  ```python Python
  import weave
  from weave.scorers import MultiTaskBinaryClassificationF1

  @weave.op()
  def fruit_name_score(target: dict, output: dict) -> dict:
      return {'correct': target['fruit'] == output['fruit']}
  ```

  ```typescript TypeScript
  import * as weave from 'weave';

  const fruitNameScorer = weave.op(
    function fruitNameScore({target, output}) {
      return { correct: target.fruit === output.fruit };
    }
  );
  ```
</CodeGroup>
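To see what the scorer returns, it can be called directly as a plain function. The sketch below repeats the comparison without the `@weave.op()` decorator so it runs without a Weave project. It also shows a hypothetical, more forgiving variant (`lenient_fruit_name_score`) that tolerates a missing `fruit` key and differences in case, which the strict version does not.

```python
def fruit_name_score(target: dict, output: dict) -> dict:
    # Same comparison as the scorer above, minus the @weave.op() decorator.
    return {"correct": target["fruit"] == output["fruit"]}


def lenient_fruit_name_score(target: dict, output: dict) -> dict:
    """Hypothetical variant: ignore case and whitespace, tolerate a missing key."""
    predicted = str(output.get("fruit", "")).strip().lower()
    expected = str(target["fruit"]).strip().lower()
    return {"correct": predicted == expected}


target = {"fruit": "glowls", "color": "pale orange", "flavor": "sour and bitter"}

exact = fruit_name_score(target, {"fruit": "glowls"})
wrong = fruit_name_score(target, {"fruit": "pounits"})
cased = lenient_fruit_name_score(target, {"fruit": " Glowls "})
absent = lenient_fruit_name_score(target, {"color": "pale orange"})

print(exact, wrong, cased, absent)
```

Whether lenient matching is appropriate depends on the application: it hides formatting noise from the model, but it also hides cases where exact output matters.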

<Note>
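  To learn more about writing your own scoring functions, see the [Scorers](/ja/guides/evaluation/scorers) guide.

  In some applications you may want to create custom `Scorer` classes. For example, you might need a standardized `LLMJudge` class with specific parameters (such as the chat model and prompt), specific scoring of each row, and a specific calculation of the aggregate score. For details on defining a `Scorer` class, see the tutorial in the next chapter, [Model-Based Evaluation of RAG applications](/ja/tutorial-rag#optional-defining-a-scorer-class).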
</Note>

## 4. Run the evaluation

You are now ready to run an evaluation of `ExtractFruitsModel` on the `fruits` dataset using your scoring functions.

<CodeGroup>
  ```python Python
  import asyncio
  import weave
  from weave.scorers import MultiTaskBinaryClassificationF1

  weave.init('intro-example')

  evaluation = weave.Evaluation(
      name='fruit_eval',
      dataset=dataset, 
      scorers=[
          MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), 
          fruit_name_score
      ],
  )
  print(asyncio.run(evaluation.evaluate(model)))
  # if you're in a Jupyter Notebook, run:
  # await evaluation.evaluate(model)
  ```

  ```typescript TypeScript
  import * as weave from 'weave';

  await weave.init('intro-example');

  const evaluation = new weave.Evaluation({
    name: 'fruit_eval',
    dataset: dataset,
    scorers: [fruitNameScorer],
  });
  const results = await evaluation.evaluate({ model });
  console.log(results);
  ```
</CodeGroup>

<Note>
  If you are running from a Python script, you need to use `asyncio.run`. If you are running from a Jupyter notebook, you can use `await` directly.
</Note>
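Conceptually, an evaluation calls `predict` on every row, passes each output to the scorers, and aggregates the scores. The following is a simplified sketch of that loop under stated assumptions, not Weave's implementation: `StubModel` and `run_eval` are hypothetical, the stub returns canned answers instead of calling an LLM, and the summary key names are illustrative.

```python
import asyncio


class StubModel:
    """Hypothetical stand-in for ExtractFruitsModel: returns canned answers."""

    def __init__(self, answers: dict):
        self.answers = answers

    async def predict(self, sentence: str) -> dict:
        return self.answers[sentence]


def fruit_name_score(target: dict, output: dict) -> dict:
    return {"correct": target["fruit"] == output["fruit"]}


async def run_eval(model, rows: list, scorer) -> dict:
    # Run all predictions concurrently, then score each one against its target.
    outputs = await asyncio.gather(*(model.predict(r["sentence"]) for r in rows))
    scores = [
        scorer(target=row["target"], output=out)
        for row, out in zip(rows, outputs)
    ]
    n_correct = sum(s["correct"] for s in scores)
    return {"true_count": n_correct, "true_fraction": n_correct / len(scores)}


rows = [
    {"sentence": "a", "target": {"fruit": "pounits"}},
    {"sentence": "b", "target": {"fruit": "glowls"}},
]
stub = StubModel({"a": {"fruit": "pounits"}, "b": {"fruit": "neoskizzles"}})
summary = asyncio.run(run_eval(stub, rows, fruit_name_score))
print(summary)  # {'true_count': 1, 'true_fraction': 0.5}
```

The scorer is called with keyword arguments named after the dataset column (`target`) and the model result (`output`), matching the parameter names used by `fruit_name_score` in this tutorial.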

## 5. View your evaluation results

Weave automatically captures traces of every prediction and score.

Click the link printed by the evaluation to view the results in the Weave UI.

![Evaluation results](https://mintlify.s3.us-west-1.amazonaws.com/wb-21fd5541-feature-automate-reference-docs-generation/ja/images/evals-hero.png)

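## What's next?

Learn how to:

1. **Compare model performance**: Try different models and compare their results
2. **Explore built-in scorers**: Browse Weave's built-in scoring functions in the [Scorers guide](/ja/guides/evaluation/scorers)
3. **Build a RAG app**: Follow the [RAG tutorial](/ja/tutorial-rag) to learn about evaluating retrieval-augmented generation
4. **Advanced evaluation patterns**: Learn how to use LLMs as judges in [Model-Based Evaluation](/ja/guides/evaluation/scorers#model-based-evaluation)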