This is an interactive notebook. You can run it locally or use the links below:

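Leaderboard Quickstart

In this notebook, you will learn how to use Weave's Leaderboard to compare model performance across different datasets and scoring functions. Specifically, you will:

Generate a dataset of fake zip code data
Write scoring functions and evaluate a baseline model
Use these techniques to evaluate a matrix of models against evaluations
Review the leaderboard in the Weave UI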

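Step 1: Generate a dataset of fake zip code data

First, create a generate_dataset_rows function that generates a list of fake zip code data. The function asks gpt-4o-mini for the rows and passes a JSON schema derived from the Rows Pydantic model, so the response can be parsed into a list of row dictionaries.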

import json

from openai import OpenAI
from pydantic import BaseModel

class Row(BaseModel):
    zip_code: str
    city: str
    state: str
    avg_temp_f: float
    population: int
    median_income: int
    known_for: str

class Rows(BaseModel):
    rows: list[Row]

def generate_dataset_rows(
    location: str = "United States", count: int = 5, year: int = 2022
):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"Please generate {count} rows of data for random zip codes in {location} for the year {year}.",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Rows.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)["rows"]
import weave

weave.init("leaderboard-demo")
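
Because the rows come from an LLM, they are not guaranteed to be well formed. The scorers in Step 2 divide by avg_temp_f, population, and median_income, so a zero in any of those fields would raise a ZeroDivisionError. The following is a minimal sketch of a sanity check, assuming the rows are plain dicts shaped like the Row model. find_bad_rows is a hypothetical helper written for this example, not part of Weave or the OpenAI SDK, and the sample rows are made up.

```python
# Hypothetical helper (not part of Weave): flag generated rows that are
# missing expected fields or would break the relative-error scorer.
EXPECTED_FIELDS = {
    "zip_code": str,
    "city": str,
    "state": str,
    "avg_temp_f": (int, float),
    "population": int,
    "median_income": int,
    "known_for": str,
}
NONZERO_FIELDS = ("avg_temp_f", "population", "median_income")


def find_bad_rows(rows: list[dict]) -> list[tuple[int, str]]:
    """Return (row_index, reason) pairs for rows that fail basic checks."""
    problems = []
    for i, row in enumerate(rows):
        for field, expected_type in EXPECTED_FIELDS.items():
            if field not in row:
                problems.append((i, f"missing field: {field}"))
            elif not isinstance(row[field], expected_type):
                problems.append((i, f"wrong type for field: {field}"))
        for field in NONZERO_FIELDS:
            # A zero here would cause a division by zero in a relative error.
            if row.get(field) == 0:
                problems.append((i, f"zero value in field: {field}"))
    return problems


# Made-up sample rows: the first is fine, the second has two problems.
sample_rows = [
    {
        "zip_code": "00001", "city": "Exampleville", "state": "CA",
        "avg_temp_f": 61.5, "population": 20000, "median_income": 80000,
        "known_for": "Example landmark",
    },
    {
        "zip_code": "00002", "city": "Sampletown", "state": "NY",
        "avg_temp_f": 48.0, "population": 0, "median_income": 55000,
    },
]
bad = find_bad_rows(sample_rows)
print(bad)
```

Running a check like this on the output of generate_dataset_rows before building an evaluation makes it easier to regenerate or drop bad rows early.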

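Step 2: Write scoring functions

Next, write 3 scoring functions:

check_concrete_fields: Checks whether the model output matches the expected city and state.
check_value_fields: Computes the relative error of the model output's average temperature, population, and median income against the expected values.
check_subjective_fields: Uses an LLM to check whether the model output matches the expected "known for" field.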

@weave.op
def check_concrete_fields(city: str, state: str, output: dict):
    return {
        "city_match": city == output["city"],
        "state_match": state == output["state"],
    }

@weave.op
def check_value_fields(
    avg_temp_f: float, population: int, median_income: int, output: dict
):
    return {
        "avg_temp_f_err": abs(avg_temp_f - output["avg_temp_f"]) / avg_temp_f,
        "population_err": abs(population - output["population"]) / population,
        "median_income_err": abs(median_income - output["median_income"])
        / median_income,
    }

@weave.op
def check_subjective_fields(zip_code: str, known_for: str, output: dict):
    client = OpenAI()

    class Response(BaseModel):
        correct_known_for: bool

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"My student was asked what the zip code {zip_code} is best known for. The right answer is '{known_for}', and they said '{output['known_for']}'. Is their answer correct?",
            },
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Response.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)
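
To make the shape of a scorer's return value concrete, here is a small sketch that repeats the relative-error arithmetic from check_value_fields as a plain function (no @weave.op, no API calls) and applies it to one row. relative_errors is a name chosen for this example, and the expected and output values are made up.

```python
def relative_errors(expected: dict, output: dict) -> dict:
    """Same arithmetic as check_value_fields, without the Weave decorator."""
    return {
        f"{field}_err": abs(expected[field] - output[field]) / expected[field]
        for field in ("avg_temp_f", "population", "median_income")
    }


# Made-up values for illustration only.
expected = {"avg_temp_f": 60.0, "population": 20000, "median_income": 80000}
output = {"avg_temp_f": 50.0, "population": 1000000, "median_income": 100000}

errors = relative_errors(expected, output)
print(errors)
```

A median_income_err of 0.25 means the output is 25% off. The errors have no upper bound, so one wildly wrong population (49.0 in this example) can dominate an average over rows.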

Step 3: Create a simple evaluation

Next, define a simple evaluation using the fake data and the scoring functions.

rows = generate_dataset_rows()
evaluation = weave.Evaluation(
    name="United States - 2022",
    dataset=rows,
    scorers=[
        check_concrete_fields,
        check_value_fields,
        check_subjective_fields,
    ],
)
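
Each scorer declares only the dataset columns it needs, plus output. Weave matches scorer parameters to dataset columns by name and passes the model's return value as output. The sketch below imitates that matching with inspect.signature. call_scorer is a hypothetical helper written for illustration, not Weave's implementation, and the row and output values are made up.

```python
import inspect


def call_scorer(scorer, row: dict, output: dict):
    """Call scorer with the row columns it names, plus the model output."""
    params = inspect.signature(scorer).parameters
    # Pick only the columns that the scorer asks for by parameter name.
    kwargs = {name: row[name] for name in params if name in row}
    if "output" in params:
        kwargs["output"] = output
    return scorer(**kwargs)


# Plain copy of the scorer from Step 2, without @weave.op.
def check_concrete_fields(city: str, state: str, output: dict):
    return {
        "city_match": city == output["city"],
        "state_match": state == output["state"],
    }


# The row has more columns than the scorer uses; extras are ignored.
row = {"zip_code": "00001", "city": "Exampleville", "state": "CA", "population": 20000}
output = {"city": "Sampletown", "state": "CA"}

result = call_scorer(check_concrete_fields, row, output)
print(result)
```

This is why the scorer parameter names (city, state, population, and so on) need to match the field names of the Row model exactly.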

Step 4: Evaluate a baseline model

Now evaluate a baseline model that returns a static response.

@weave.op
def baseline_model(zip_code: str):
    return {
        "city": "New York",
        "state": "NY",
        "avg_temp_f": 50.0,
        "population": 1000000,
        "median_income": 100000,
        "known_for": "The Big Apple",
    }

await evaluation.evaluate(baseline_model)
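
The bare await above works because a notebook already runs inside an event loop. In a plain Python script there is no running loop, so the coroutine has to be driven explicitly, for example with asyncio.run. The sketch below shows the pattern with a stand-in object. FakeEvaluation is hypothetical and only mimics an async evaluate method so that the example runs without Weave or network access.

```python
import asyncio


class FakeEvaluation:
    """Hypothetical stand-in for an object with an async evaluate() method."""

    async def evaluate(self, model):
        # Pretend to evaluate by calling the model on one fixed input.
        return {"output": model("10001")}


def baseline_model(zip_code: str):
    return {"city": "New York", "state": "NY"}


def main():
    evaluation = FakeEvaluation()
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(evaluation.evaluate(baseline_model))


summary = main()
print(summary)
```

In a script version of this tutorial, the corresponding call would be asyncio.run(evaluation.evaluate(baseline_model)).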

Step 5: Create more models

Now create 2 more models to compare against the baseline.

@weave.op
def gpt_4o_mini_no_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Zip code {zip_code}"""}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)

await evaluation.evaluate(gpt_4o_mini_no_context)
@weave.op
def gpt_4o_mini_with_context(zip_code: str):
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"""Please answer the following questions about the zip code {zip_code}:
                   1. What is the city?
                   2. What is the state?
                   3. What is the average temperature in Fahrenheit?
                   4. What is the population?
                   5. What is the median income?
                   6. What is the most well known thing about this zip code?
                   """,
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "response_format",
                "schema": Row.model_json_schema(),
            },
        },
    )

    return json.loads(completion.choices[0].message.content)

await evaluation.evaluate(gpt_4o_mini_with_context)

Step 6: Create more evaluations

Now evaluate a matrix of models against evaluations.

scorers = [
    check_concrete_fields,
    check_value_fields,
    check_subjective_fields,
]
evaluations = [
    weave.Evaluation(
        name="United States - 2022",
        dataset=weave.Dataset(
            name="United States - 2022",
            rows=generate_dataset_rows("United States", 5, 2022),
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="California - 2022",
        dataset=weave.Dataset(
            name="California - 2022", rows=generate_dataset_rows("California", 5, 2022)
        ),
        scorers=scorers,
    ),
    weave.Evaluation(
        name="United States - 2000",
        dataset=weave.Dataset(
            name="United States - 2000",
            rows=generate_dataset_rows("United States", 5, 2000),
        ),
        scorers=scorers,
    ),
]
models = [
    baseline_model,
    gpt_4o_mini_no_context,
    gpt_4o_mini_with_context,
]

for evaluation in evaluations:
    for model in models:
        await evaluation.evaluate(
            model, __weave={"display_name": evaluation.name + ":" + model.__name__}
        )
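
The nested loop runs every model against every evaluation, so 3 evaluations and 3 models produce 9 runs. Since two of the models and one of the scorers call the OpenAI API, it can help to list the planned runs before starting. A small sketch using itertools.product, with the evaluation and model names copied from above as plain strings:

```python
from itertools import product

evaluation_names = [
    "United States - 2022",
    "California - 2022",
    "United States - 2000",
]
model_names = [
    "baseline_model",
    "gpt_4o_mini_no_context",
    "gpt_4o_mini_with_context",
]

# Same naming scheme as the display_name passed to evaluate() above.
# product() iterates evaluations in the outer position, like the nested loop.
display_names = [f"{e}:{m}" for e, m in product(evaluation_names, model_names)]
for name in display_names:
    print(name)
```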

Step 7: Review the leaderboard

You can create a new leaderboard by navigating to the leaderboard tab in the UI and clicking "Create Leaderboard". You can also create a leaderboard directly from Python:

from weave.flow import leaderboard
from weave.trace.ref_util import get_ref

spec = leaderboard.Leaderboard(
    name="Zip Code World Knowledge",
    description="""
This leaderboard compares the performance of models in terms of world knowledge about zip codes.

### Columns

1. **State Match against `United States - 2022`**: The fraction of zip codes that the model correctly identified the state for.
2. **Avg Temp F Error against `California - 2022`**: The mean relative error of the model's average temperature prediction.
3. **Correct Known For against `United States - 2000`**: The fraction of zip codes that the model correctly identified the most well known thing about the zip code.
""",
    columns=[
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[0]).uri(),
            scorer_name="check_concrete_fields",
            summary_metric_path="state_match.true_fraction",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[1]).uri(),
            scorer_name="check_value_fields",
            should_minimize=True,
            summary_metric_path="avg_temp_f_err.mean",
        ),
        leaderboard.LeaderboardColumn(
            evaluation_object_ref=get_ref(evaluations[2]).uri(),
            scorer_name="check_subjective_fields",
            summary_metric_path="correct_known_for.true_fraction",
        ),
    ],
)

ref = weave.publish(spec)
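
Each column points at one number in an evaluation's summary: scorer_name selects the scorer, and summary_metric_path is a dot-separated path into that scorer's summarized output. The paths used above suggest that boolean scores are summarized with a true_fraction and numeric scores with a mean. The sketch below illustrates that idea in plain Python. summarize and resolve_path are hypothetical helpers, not Weave functions, and the per-row scores are made up (keys from two scorers are combined in one dict for brevity).

```python
from statistics import mean


def summarize(per_row_scores: list[dict]) -> dict:
    """Aggregate per-row scores: booleans to true_fraction, numbers to mean."""
    summary = {}
    for key in per_row_scores[0]:
        values = [scores[key] for scores in per_row_scores]
        if all(isinstance(v, bool) for v in values):
            summary[key] = {"true_fraction": sum(values) / len(values)}
        else:
            summary[key] = {"mean": mean(values)}
    return summary


def resolve_path(summary: dict, path: str):
    """Walk a dot-separated path such as 'state_match.true_fraction'."""
    node = summary
    for part in path.split("."):
        node = node[part]
    return node


# Made-up per-row scores for four rows.
per_row = [
    {"state_match": True, "avg_temp_f_err": 0.10},
    {"state_match": False, "avg_temp_f_err": 0.30},
    {"state_match": True, "avg_temp_f_err": 0.20},
    {"state_match": True, "avg_temp_f_err": 0.40},
]
summary = summarize(per_row)
state_fraction = resolve_path(summary, "state_match.true_fraction")
temp_error = resolve_path(summary, "avg_temp_f_err.mean")
print(state_fraction, temp_error)
```

For the error column, should_minimize=True marks lower values as better, while the two true_fraction columns keep the default where higher is better.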
