Retrieval Augmented Generation (RAG) is a common way of building generative AI applications that have access to a custom knowledge base. This example includes a retrieval step that fetches documents; by tracing it, you can debug your app and see exactly which documents were pulled into the LLM context. It also shows how to evaluate the app using an LLM judge.

Check out the RAG++ course for more advanced, practical RAG techniques for engineers: production-ready solutions from Weights & Biases, Cohere, and Weaviate to optimize performance, cut costs, and improve the accuracy and relevance of your applications.

1. Build a knowledge base

First, let's compute the embeddings for our articles. You would typically compute the embeddings once and store them, along with their metadata, in a database, but for simplicity we recompute them here every time the script runs.
from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the moon's surface while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

article_embeddings = docs_to_embeddings(articles) # Note: you would typically do this once with your articles and put the embeddings & metadata in a database
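As the comment notes, in a real application you would compute these once and persist them. Below is a minimal sketch of caching the embeddings to disk, assuming a local JSON file stands in for the database (the file name and helper are illustrative, not part of the tutorial code):

import json
import os

EMBED_CACHE = "article_embeddings.json"  # hypothetical local cache standing in for a database

def load_or_compute_embeddings(docs: list) -> list:
    # Reuse cached embeddings if present, so the OpenAI API is only called once
    if os.path.exists(EMBED_CACHE):
        with open(EMBED_CACHE) as f:
            return json.load(f)
    embeddings = docs_to_embeddings(docs)
    with open(EMBED_CACHE, "w") as f:
        json.dump(embeddings, f)
    return embeddings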

2. Create a RAG app

Next, we wrap our retrieval function get_most_relevant_document with the weave.op() decorator and create a Model subclass. We call weave.init('rag-qa') so that all inputs and outputs of these functions are tracked and can be inspected later.
from openai import OpenAI
import weave
from weave import Model
import numpy as np
import asyncio

# highlight-next-line
@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # Get the index of the most similar document
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

# highlight-next-line
class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

# highlight-next-line
    @weave.op()
    def predict(self, question: str) -> dict: # note: `question` will be used later to select data from our evaluation rows
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

# highlight-next-line
weave.init('rag-qa')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)
model.predict("What significant result was reported about Zealand Pharma's obesity trial?")
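As a side note on the retrieval step: with more documents, the per-document loop in get_most_relevant_document becomes the bottleneck. The cosine similarities can be computed in a single vectorized operation and extended to top-k retrieval. A sketch of the idea (names are illustrative; not part of the tutorial code):

doc_matrix = np.array(article_embeddings)   # shape: (n_docs, embedding_dim)
doc_norms = np.linalg.norm(doc_matrix, axis=1)

def top_k_documents(query_embedding: list, k: int = 3) -> list:
    # One matrix-vector product replaces the per-document Python loop
    q = np.array(query_embedding)
    similarities = doc_matrix @ q / (doc_norms * np.linalg.norm(q))
    top_indices = np.argsort(similarities)[::-1][:k]
    return [articles[i] for i in top_indices]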

3. Evaluate with an LLM judge

When there is no simple way to evaluate your application, one approach is to use an LLM to judge aspects of it. Here is an example that tries to measure context precision with an LLM judge: we prompt the judge to verify whether the context was useful in arriving at the given answer. The prompt is adapted from the popular RAGAS framework.

Define a scoring function

As in the previous Build an evaluation pipeline tutorial, we define a set of example rows to test our app against, along with a scoring function. The scoring function takes one row and evaluates it. Its input arguments must match the corresponding keys in the row, so question here is taken from the row dictionary; output is the model's output. The model's input is likewise taken from the example based on its input argument, so it is also question here. We use an async function so the rows run fast, in parallel. If you need a quick introduction to async, you can find one here.
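If the async part is unfamiliar, here is a minimal, self-contained illustration (unrelated to Weave) of why async scoring is fast: several slow calls are awaited concurrently instead of sequentially.

import asyncio

async def fake_score(i: int) -> int:
    await asyncio.sleep(1)  # stands in for a slow LLM call
    return i

async def main():
    # All three "calls" run concurrently, so this takes ~1 second rather than ~3
    print(await asyncio.gather(*(fake_score(i) for i in range(3))))

asyncio.run(main())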
from openai import OpenAI
import weave
import json
import asyncio

# highlight-next-line
@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]
# highlight-next-line
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
# highlight-next-line
asyncio.run(evaluation.evaluate(model)) # note: you'll need to define a model to evaluate
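Before running the full evaluation, you can sanity-check the scorer on a single hand-made row. A quick sketch (the example output values are illustrative):

sample_output = {
    "answer": "Zealand Pharma's shares rose 32 percent after positive trial results.",
    "context": articles[0],
}
verdict = asyncio.run(context_precision_score(
    "What significant result was reported about Zealand Pharma's obesity trial?",
    sample_output,
))
print(verdict)  # e.g. {'verdict': True} if the judge finds the context useful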

Optional: Defining a Scorer class

In some applications you may want to create a custom scorer class - for example, a standardized LLMJudge class with specific parameters (such as the chat model and prompt), specific scoring of each row, and a specific calculation of the aggregate score. To support this, Weave defines a list of ready-to-use Scorer classes and also makes it easy to create a custom Scorer - below we look at how to create a custom class CorrectnessLLMJudge(Scorer). At a high level, the steps to create a custom Scorer are straightforward:
  1. Define a custom class that inherits from weave.flow.scorer.Scorer
  2. Override the score function, adding the @weave.op() decorator if you want to track each call of the function
    • This function has to define an output argument, to which the model's prediction is passed. We define its type as Optional[dict] in case the model returns None.
    • The remaining arguments can be a generic Any or dict, or can select specific columns from the dataset used to evaluate the model with the weave.Evaluation class - they must be named exactly like the column names (or the keys of a single row, if preprocess_model_input is used).
  3. Optional: Override the summarize function to customize the calculation of the aggregate score. If you don't define a custom one, Weave uses the weave.flow.scorer.auto_summarize function by default.
    • This function has to be decorated with @weave.op().
from typing import Any, Optional

import numpy as np
import weave
from weave import Scorer

class CorrectnessLLMJudge(Scorer):
    prompt: str
    model_name: str
    device: str

    @weave.op()
    async def score(self, output: Optional[dict], query: str, answer: str) -> Any:
        """Score the correctness of the predictions by comparing the pred, query, target.
        Args:
            - output: the dict that will be provided by the model that is evaluated
            - query: the question asked - as defined in the dataset
            - answer: the target answer - as defined in the dataset
        Returns:
            - single dict {metric name: single evaluation value}"""

        # get_model is assumed to be a general model getter based on the provided params (OpenAI, HF, ...)
        eval_model = get_model(
            model_name=self.model_name,
            prompt=self.prompt,
            device=self.device,
        )
        # async evaluation to speed up evaluation - this doesn't have to be async
        grade = await eval_model.async_predict(
            {
                "query": query,
                "answer": answer,
                "result": output.get("result"),
            }
        )
        # output parsing - could be done more robustly with pydantic
        evaluation = "incorrect" not in grade["text"].strip().lower()

        # the column name displayed in Weave
        return {"correct": evaluation}

    @weave.op()
    def summarize(self, score_rows: list) -> Optional[dict]:
        """Aggregate all the scores that are calculated for each row by the scoring function.
        Args:
            - score_rows: a list of dicts. Each dict has metrics and scores
        Returns:
            - nested dict with the same structure as the input"""

        # if nothing is provided the weave.flow.scorer.auto_summarize function is used
        # return auto_summarize(score_rows)

        valid_data = [x.get("correct") for x in score_rows if x.get("correct") is not None]
        count_true = valid_data.count(True)
        int_data = [int(x) for x in valid_data]

        sample_mean = np.mean(int_data) if int_data else 0
        sample_variance = np.var(int_data) if int_data else 0
        sample_error = np.sqrt(sample_variance / len(int_data)) if int_data else 0

        # the extra "correct" layer is not necessary but adds structure in the UI
        return {
            "correct": {
                "true_count": count_true,
                "true_fraction": sample_mean,
                "stderr": sample_error,
            }
        }
To use this as a scorer, initialize it (its prompt, model_name, and device fields are required) and pass it to the Evaluation via the scorers argument:
evaluation = weave.Evaluation(
    dataset=questions,
    scorers=[CorrectnessLLMJudge(prompt="...", model_name="gpt-4-turbo-preview", device="cpu")],  # field values are illustrative
)
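Then run it the same way as before:

asyncio.run(evaluation.evaluate(model))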

4. Pull it all together

To get the same result for your RAG app:
  • Wrap your LLM calls and retrieval step functions with weave.op()
  • (optional) Create a Model subclass with a predict function and your app details
  • Collect examples to evaluate against
  • Create a scoring function that scores one example
  • Use the Evaluation class to run the evaluation on your examples
NOTE: Sometimes the async execution of Evaluations can trigger rate limits on models from providers such as OpenAI or Anthropic. To prevent this, you can set an environment variable that limits the number of parallel workers, for example WEAVE_PARALLELISM=3.
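A minimal sketch of setting this from Python, assuming the variable is set before the evaluation starts:

import os

# Cap the number of parallel evaluation workers to avoid provider rate limits
os.environ["WEAVE_PARALLELISM"] = "3"

Here is the full code: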
from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

# Examples we've gathered that we want to use for evaluations
articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if it's stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the surface of the moon while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

article_embeddings = docs_to_embeddings(articles) # Note: you would typically do this once with your articles and put the embeddings & metadata in a database

# We've added a decorator to our retrieval step
# highlight-next-line
@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # Get the index of the most similar document
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

# We create a Model subclass with some details about our app, along with a predict function that produces a response
# highlight-next-line
class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

# highlight-next-line
    @weave.op()
# highlight-next-line
    def predict(self, question: str) -> dict: # note: `question` will be used later to select data from our evaluation rows
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

# highlight-next-line
weave.init('rag-qa')
# highlight-next-line
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)

# Here, our scoring function uses our question and output to produce a score
# highlight-next-line
@weave.op()
# highlight-next-line
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]

# We define an Evaluation object and pass our example questions along with scoring functions
# highlight-next-line
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
# highlight-next-line
asyncio.run(evaluation.evaluate(model))

Conclusion

We've seen how to build observability into the different steps of an application, such as the retrieval step in this example, and how to build more complex scoring functions, like an LLM judge, to automatically evaluate an application's responses.