Verdict

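Weave is designed to make it easy to track and log every call made through the Verdict Python library. Debugging is crucial when you work with AI evaluation pipelines. When a pipeline step fails, an output is not what you expected, or nested operations become confusing, pinpointing the problem can be hard. Verdict applications often consist of multiple pipeline steps, judges, and transformations, so understanding the inner workings of your evaluation workflow is essential. Weave simplifies this by automatically capturing traces for your Verdict applications. You can then monitor and analyze pipeline performance, which makes AI evaluation workflows easier to debug and optimize.

Getting started

To get started, call weave.init(project=...) at the beginning of your script. Use the project argument to log to a specific W&B team with team-name/project-name, or pass project-name alone to log to your default team/entity.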

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# Create a simple evaluation pipeline
pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Rate the quality of this text: {source.text}")

# Create sample data
data = Schema.of(text="This is a sample text for evaluation.")

# Run the pipeline - this will be automatically traced
output = pipeline.run(data)

print(output)
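
The two accepted forms of the project string can be illustrated with a small parser. This is only a sketch: split_project is a hypothetical helper written for this page, not part of the Weave API, and it does not reproduce whatever validation Weave itself performs.

```python
from typing import Optional, Tuple


def split_project(project: str) -> Tuple[Optional[str], str]:
    """Split a project string into (team, project).

    Hypothetical helper for illustration only; not part of the Weave API.
    """
    if not project:
        raise ValueError("project must not be empty")
    team, sep, name = project.partition("/")
    if not sep:
        # No slash: only a project name was given, so the default team applies.
        return None, project
    if not team or not name or "/" in name:
        raise ValueError(f"expected 'team-name/project-name', got {project!r}")
    return team, name


print(split_project("verdict_demo"))
print(split_project("my-team/verdict_demo"))
```

A return value of None for the team stands for "use the default team/entity".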

Track call metadata

To track metadata from your Verdict pipeline calls, use the weave.attributes context manager. It lets you set custom metadata for a specific block of code, such as a pipeline run or an evaluation batch.

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Evaluate sentiment: {source.text}")

data = Schema.of(text="I love this product!")

with weave.attributes({"evaluation_type": "sentiment", "batch_id": "batch_001"}):
    output = pipeline.run(data)

print(output)

Weave automatically records the metadata against the trace of the Verdict pipeline call. You can view the metadata in the Weave web interface.
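
When you process several batches, it can help to build the metadata dictionaries in one place. The sketch below generates one dictionary per batch in the same shape as the example above. batch_attributes and the batch_size key are hypothetical names chosen for illustration; each dictionary is meant to be passed to weave.attributes around the corresponding pipeline run.

```python
from typing import Dict, Iterator, List


def batch_attributes(
    evaluation_type: str, batches: List[List[str]]
) -> Iterator[Dict[str, object]]:
    """Yield one metadata dict per batch (hypothetical helper)."""
    for index, batch in enumerate(batches, start=1):
        yield {
            "evaluation_type": evaluation_type,
            "batch_id": f"batch_{index:03d}",  # batch_001, batch_002, ...
            "batch_size": len(batch),  # assumed extra key, not required by Weave
        }


batches = [["I love this product!", "Not great."], ["It is fine."]]
for attrs in batch_attributes("sentiment", batches):
    # In a real script: with weave.attributes(attrs): pipeline.run(...)
    print(attrs)
```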

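Traces

Storing traces of AI evaluation pipelines in a central database is important during both development and production. These traces form a valuable dataset for debugging and improving your evaluation workflows. Weave automatically captures traces for your Verdict applications. It tracks and logs all calls made through the Verdict library, including:

Pipeline execution steps
Judge unit evaluations
Layer transformations
Pooling operations
Custom units and transformations

You can view the traces in the Weave web interface, where they show the hierarchical structure of your pipeline execution.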
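
To make "hierarchical structure" concrete, the sketch below models a trace as a tree of spans and renders it with indentation. It is a mental model only: the Span class and the span names are placeholders, and the names and nesting that Weave actually displays may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One traced operation and the operations nested inside it (illustrative)."""
    name: str
    children: List["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Placeholder names for a pipeline with one layer of two judges and a pooling step.
trace = Span("Pipeline", [
    Span("Layer", [Span("JudgeUnit"), Span("JudgeUnit")]),
    Span("MeanPoolUnit"),
])
print("\n".join(render(trace)))
```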

Pipeline tracing example

Here is a more complex example that shows how Weave traces nested pipeline operations:

import weave
from verdict import Pipeline, Layer
from verdict.common.judge import JudgeUnit
from verdict.transform import MeanPoolUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# Create a complex pipeline with multiple steps
pipeline = Pipeline()
pipeline = pipeline >> Layer([
    JudgeUnit().prompt("Rate coherence: {source.text}"),
    JudgeUnit().prompt("Rate relevance: {source.text}"),
    JudgeUnit().prompt("Rate accuracy: {source.text}")
], 3)
pipeline = pipeline >> MeanPoolUnit()

# Sample data
data = Schema.of(text="This is a comprehensive evaluation of text quality across multiple dimensions.")

# Run the pipeline - all operations will be traced
result = pipeline.run(data)

print(f"Average score: {result}")

This creates a detailed trace showing:

The main Pipeline execution
Each JudgeUnit evaluation within the Layer
The MeanPoolUnit aggregation step
Timing information for each operation
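
The aggregation step can be read as a plain average over the judge scores. The worked example below uses made-up scores and assumes the judges return numeric values; how MeanPoolUnit extracts and averages scores internally is not shown in this guide, so treat this as arithmetic only.

```python
from statistics import mean

# Made-up judge scores for illustration; real values come from the JudgeUnit calls.
scores = {"coherence": 4, "relevance": 5, "accuracy": 3}

# (4 + 5 + 3) / 3
average = mean(list(scores.values()))
print(f"Average score: {average}")
```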

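Configuration

When you call weave.init(), tracing is automatically enabled for Verdict pipelines. The integration patches the Pipeline.__init__ method to inject a VerdictTracer that forwards all trace data to Weave. No additional configuration is needed. Weave automatically:

Captures all pipeline operations
Tracks execution timing
Logs inputs and outputs
Maintains the trace hierarchy
Handles concurrent pipeline execution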
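
The patching approach described above can be sketched with stand-in classes. Everything here is hypothetical: Pipeline is a toy class rather than Verdict's, RecordingTracer stands in for VerdictTracer, and patch_init is not Weave's actual code. The sketch only shows the general technique of wrapping __init__ so that every new instance receives an extra tracer.

```python
import functools


class Pipeline:
    """Toy stand-in for a pipeline class; not Verdict's real Pipeline."""

    def __init__(self, name="Pipeline", tracer=None):
        self.name = name
        self.tracers = list(tracer or [])


class RecordingTracer:
    """Hypothetical tracer that would forward trace data elsewhere."""


def patch_init(cls, make_tracer):
    """Wrap cls.__init__ so each new instance also gets a tracer.

    Returns the original __init__ so the patch can be undone.
    """
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        self.tracers.append(make_tracer())  # injected after user-supplied tracers

    cls.__init__ = patched_init
    return original_init


original = patch_init(Pipeline, RecordingTracer)
pipeline = Pipeline(name="demo")
print([type(t).__name__ for t in pipeline.tracers])
```

Returning the original __init__ keeps the patch reversible, which is useful in tests.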

Custom tracers and Weave

If your application already uses custom Verdict tracers, Weave's VerdictTracer can work alongside them:

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.util.tracing import ConsoleTracer
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# You can still use Verdict's built-in tracers
console_tracer = ConsoleTracer()

# Create pipeline with both Weave (automatic) and Console tracing
pipeline = Pipeline(tracer=[console_tracer])  # Weave tracer is added automatically
pipeline = pipeline >> JudgeUnit().prompt("Evaluate: {source.text}")

data = Schema.of(text="Sample evaluation text")

# This will trace to both Weave and console
result = pipeline.run(data)
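
Conceptually, running several tracers side by side means every trace event is delivered to each of them. The sketch below shows that fan-out pattern with hypothetical classes. Verdict's real tracer interface is not described in this guide, so the on_event method and both class names are assumptions made for illustration.

```python
from typing import Iterable, List


class ListTracer:
    """Hypothetical tracer that stores event names in memory."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def on_event(self, name: str) -> None:
        self.events.append(name)


class FanOutTracer:
    """Forward every event to each wrapped tracer, in order."""

    def __init__(self, tracers: Iterable[ListTracer]) -> None:
        self.tracers = list(tracers)

    def on_event(self, name: str) -> None:
        for tracer in self.tracers:
            tracer.on_event(name)


console_like, weave_like = ListTracer(), ListTracer()
fan_out = FanOutTracer([console_like, weave_like])
fan_out.on_event("JudgeUnit.start")
fan_out.on_event("JudgeUnit.end")
print(console_like.events == weave_like.events)
```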

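Models and evaluations

Organizing and evaluating AI systems that have multiple pipeline components can be challenging. With weave.Model you can capture and organize experiment details such as prompts, pipeline configurations, and evaluation parameters, which makes it easier to compare iterations. The following example wraps a Verdict pipeline in a weave.Model: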

import asyncio
import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

class TextQualityEvaluator(weave.Model):
    judge_prompt: str
    pipeline_name: str

    @weave.op()
    async def predict(self, text: str) -> dict:
        pipeline = Pipeline(name=self.pipeline_name)
        pipeline = pipeline >> JudgeUnit().prompt(self.judge_prompt)
        
        data = Schema.of(text=text)
        result = pipeline.run(data)
        
        return {
            "text": text,
            "quality_score": result.score if hasattr(result, 'score') else result,
            "evaluation_prompt": self.judge_prompt
        }

model = TextQualityEvaluator(
    judge_prompt="Rate the quality of this text on a scale of 1-10: {source.text}",
    pipeline_name="text_quality_evaluator"
)

text = "This is a well-written and informative piece of content that provides clear value to readers."

prediction = asyncio.run(model.predict(text))

# if you're in a Jupyter Notebook, run:
# prediction = await model.predict(text)

print(prediction)

This code creates a model that you can visualize in the Weave UI, showing both the pipeline structure and the evaluation results.
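
Comparing iterations comes down to seeing which captured fields changed between two model configurations. The sketch below illustrates the idea with plain dataclasses. EvaluatorConfig and changed_fields are hypothetical names, and this is not how Weave stores or compares models; it only mirrors the two fields declared on TextQualityEvaluator above.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class EvaluatorConfig:
    """Hypothetical record of the fields a model captures."""
    judge_prompt: str
    pipeline_name: str


def changed_fields(a: EvaluatorConfig, b: EvaluatorConfig) -> Dict[str, Tuple[str, str]]:
    """Map each differing field to its (old, new) values."""
    left, right = asdict(a), asdict(b)
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}


v1 = EvaluatorConfig("Rate the quality of this text: {source.text}", "text_quality_evaluator")
v2 = EvaluatorConfig(
    "Rate the quality of this text on a scale of 1-10: {source.text}",
    "text_quality_evaluator",
)
print(changed_fields(v1, v2))
```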

Evaluations

Evaluations help you measure the performance of the evaluation pipelines themselves. With the weave.Evaluation class you can capture how well your Verdict pipelines perform on specific tasks or datasets:

import asyncio
import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave
weave.init("verdict_demo")

# Create evaluation model
class SentimentEvaluator(weave.Model):
    @weave.op()
    async def predict(self, text: str) -> dict:
        pipeline = Pipeline()
        pipeline = pipeline >> JudgeUnit().prompt(
            "Classify sentiment as positive, negative, or neutral: {source.text}"
        )
        
        data = Schema.of(text=text)
        result = pipeline.run(data)
        
        return {"sentiment": result}

# Test data
texts = [
    "I love this product, it's amazing!",
    "This is terrible, worst purchase ever.",
    "The weather is okay today."
]
labels = ["positive", "negative", "neutral"]

examples = [
    {"id": str(i), "text": texts[i], "target": labels[i]}
    for i in range(len(texts))
]

# Scoring function
@weave.op()
def sentiment_accuracy(target: str, output: dict) -> dict:
    # str() keeps the scorer from failing if the pipeline result is not a plain string
    predicted = str(output.get("sentiment", "")).lower()
    return {"correct": target.lower() in predicted}

model = SentimentEvaluator()

evaluation = weave.Evaluation(
    dataset=examples,
    scorers=[sentiment_accuracy],
)

scores = asyncio.run(evaluation.evaluate(model))
# if you're in a Jupyter Notebook, run:
# scores = await evaluation.evaluate(model)

print(scores)

This creates an evaluation trace that shows how your Verdict pipeline performs across different test cases.
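
To see what the scorer contributes, the sketch below applies the same substring check to hand-written outputs and averages the results into an accuracy. The sample outputs are made up, the accuracy helper is hypothetical, and the @weave.op() decorator is omitted so the snippet runs without Weave. The summary that weave.Evaluation reports may be structured differently.

```python
from typing import Dict, List


def sentiment_accuracy(target: str, output: dict) -> Dict[str, bool]:
    """Same check as the scorer above, without the Weave decorator."""
    predicted = str(output.get("sentiment", "")).lower()
    return {"correct": target.lower() in predicted}


def accuracy(rows: List[Dict[str, bool]]) -> float:
    """Fraction of rows marked correct; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(row["correct"] for row in rows) / len(rows)


# Made-up model outputs for illustration.
rows = [
    sentiment_accuracy("positive", {"sentiment": "Positive"}),
    sentiment_accuracy("negative", {"sentiment": "neutral"}),
    sentiment_accuracy("neutral", {"sentiment": "The tone is neutral."}),
]
print(rows, accuracy(rows))
```

Substring matching is lenient: an output such as "not positive" would still count as a match for "positive", so a stricter comparison may be preferable when the judge returns free text.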

Best practices

Performance monitoring

Weave automatically captures timing information for all pipeline operations. You can use it to identify performance bottlenecks:

import weave
from verdict import Pipeline, Layer
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

weave.init("verdict_demo")

# Create a pipeline that might have performance variations
pipeline = Pipeline()
pipeline = pipeline >> Layer([
    JudgeUnit().prompt("Quick evaluation: {source.text}"),
    JudgeUnit().prompt("Detailed analysis: {source.text}"),  # This might be slower
], 2)

data = Schema.of(text="Sample text for performance testing")

# Run multiple times to see timing patterns
for i in range(3):
    with weave.attributes({"run_number": i}):
        result = pipeline.run(data)
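
Finding a bottleneck means comparing per-operation durations and picking the largest. The sketch below does this locally with time.perf_counter and stand-in stages that simply sleep. time_stages and slowest are hypothetical helpers; in practice you would read the timings from the Weave trace instead of measuring them yourself.

```python
import time
from typing import Callable, Dict


def time_stages(stages: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each stage once and record its wall-clock duration in seconds."""
    durations = {}
    for name, stage in stages.items():
        start = time.perf_counter()
        stage()
        durations[name] = time.perf_counter() - start
    return durations


def slowest(durations: Dict[str, float]) -> str:
    """Name of the stage with the largest duration."""
    return max(durations, key=durations.get)


# Stand-in stages: sleeps of different lengths instead of real judge calls.
durations = time_stages({
    "quick_evaluation": lambda: time.sleep(0.01),
    "detailed_analysis": lambda: time.sleep(0.05),
})
print(f"Slowest stage: {slowest(durations)}")
```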

Error handling

Weave automatically captures exceptions that occur during pipeline execution:

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

weave.init("verdict_demo")

pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Process: {source.invalid_field}")  # This will cause an error

data = Schema.of(text="Sample text")

try:
    result = pipeline.run(data)
except Exception as e:
    print(f"Pipeline failed: {e}")
    # Error details are captured in Weave trace
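
Errors like the one above can often be caught before the pipeline runs by checking the prompt's placeholders against the fields you plan to pass. The sketch below assumes that prompt placeholders use str.format-style braces with a source. prefix, as in the examples on this page. missing_fields is a hypothetical helper, not part of Verdict or Weave.

```python
from string import Formatter
from typing import List, Set


def missing_fields(prompt: str, available: Set[str], prefix: str = "source.") -> List[str]:
    """List placeholder fields under `prefix` that are not in `available`."""
    missing = []
    for _, field_name, _, _ in Formatter().parse(prompt):
        # field_name is None for literal text with no placeholder
        if field_name and field_name.startswith(prefix):
            key = field_name[len(prefix):]
            if key not in available:
                missing.append(key)
    return missing


print(missing_fields("Process: {source.invalid_field}", {"text"}))
print(missing_fields("Rate the quality of this text: {source.text}", {"text"}))
```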

By integrating Weave with Verdict, you get comprehensive observability into your AI evaluation pipelines, which makes your evaluation workflows easier to debug, optimize, and understand.
