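Retrieval Augmented Generation (RAG) is a common way of building generative AI applications that have access to custom knowledge bases. In this example, we'll walk through an app with a retrieval step that fetches documents. By tracking this step, you can debug your app and see which documents were pulled into the LLM context. We'll also show how to evaluate it using an LLM judge.

Check out the RAG++ course for a deeper dive into practical RAG techniques for engineers. It covers production-ready solutions from Weights & Biases, Cohere, and Weaviate for optimizing performance, cutting costs, and improving the accuracy and relevance of your applications.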

1. Build a knowledge base

First, we compute the embeddings for our articles. You would typically do this once with your articles and store the embeddings and metadata in a database, but here we do it every time the script runs for simplicity.
from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the moon's surface while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

article_embeddings = docs_to_embeddings(articles) # Note: you would typically do this once with your articles and put the embeddings & metadata in a database
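
As noted above, embeddings are normally computed once and stored rather than recomputed on every run. The following is a minimal sketch of that idea, assuming a local JSON file stands in for the database. The helper `load_or_compute_embeddings`, the file name, and the offline `fake_embed` stand-in are hypothetical and not part of the original example; in practice you would pass `docs_to_embeddings` as `embed_fn`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def load_or_compute_embeddings(
    docs: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    cache_path: Path,
) -> List[List[float]]:
    """Return cached embeddings if they were built for `docs`; otherwise compute and store them."""
    if cache_path.exists():
        cached = json.loads(cache_path.read_text())
        # Only reuse the cache when it was built from exactly the same documents.
        if cached.get("docs") == docs:
            return cached["embeddings"]
    embeddings = embed_fn(docs)
    cache_path.write_text(json.dumps({"docs": docs, "embeddings": embeddings}))
    return embeddings


# Hypothetical stand-in embedder so the sketch runs offline.
calls = {"count": 0}


def fake_embed(docs: List[str]) -> List[List[float]]:
    calls["count"] += 1
    return [[float(len(doc)), 1.0] for doc in docs]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "article_embeddings.json"
    first = load_or_compute_embeddings(["alpha", "beta"], fake_embed, cache_file)
    # The second call reads the cache instead of calling the embedder again.
    second = load_or_compute_embeddings(["alpha", "beta"], fake_embed, cache_file)
```

A real deployment would more likely use a vector database, but the same check-then-compute pattern applies.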

2. Create a RAG app

Next, we wrap our retrieval function get_most_relevant_document with a weave.op() decorator and create our Model class. We call weave.init('rag-qa') to begin tracking all the inputs and outputs of our functions for later inspection.
from openai import OpenAI
import weave
from weave import Model
import numpy as np
import asyncio

@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # Get the index of the most similar document
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

    @weave.op()
    def predict(self, question: str) -> dict: # note: `question` will be used later to select data from our evaluation rows
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

weave.init('rag-qa')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)
model.predict("What significant result was reported about Zealand Pharma's obesity trial?")
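
The retrieval step above ranks articles by cosine similarity and returns only the single best match. The following is a minimal sketch of the same scoring that returns the top `k` matches with their scores. It uses only the standard library so it runs without an API key; the function names and the toy two-dimensional vectors are illustrative assumptions, not part of the original app.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity to anything
    return dot / (norm_a * norm_b)


def top_k_documents(
    query_embedding: Sequence[float],
    doc_embeddings: Sequence[Sequence[float]],
    docs: Sequence[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k (score, document) pairs with the highest cosine similarity."""
    scored = [
        (cosine_similarity(query_embedding, emb), doc)
        for emb, doc in zip(doc_embeddings, docs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
toy_docs = ["doc_a", "doc_b", "doc_c"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k_documents([1.0, 0.1], toy_embeddings, toy_docs, k=2)
```

Returning more than one document gives the LLM more context at the cost of a longer prompt, which is a trade-off worth checking in the traces.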

3. Evaluating with an LLM judge

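When there is no simple way to evaluate your application, one approach is to use an LLM to evaluate aspects of it. Here is an example that uses an LLM judge to measure context precision by prompting it to verify whether the context was useful in arriving at the given answer. This prompt was adapted from the popular RAGAS framework.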
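
The judge is asked to reply with a JSON verdict of "1" or "0", and LLM output does not always follow the requested format. The following is a minimal sketch of parsing such a reply defensively. The helper name `parse_verdict` and the choice to return `None` for unusable replies are assumptions made for illustration.

```python
import json
from typing import Optional


def parse_verdict(raw: str) -> Optional[bool]:
    """Map a judge reply like '{"verdict": "1"}' to True/False, or None if unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # Accept the verdict as either a string ("1") or a number (1).
    verdict = str(payload.get("verdict", "")).strip()
    if verdict == "1":
        return True
    if verdict == "0":
        return False
    return None
```

Returning `None` instead of raising lets a summary step skip rows the judge failed to grade rather than aborting the whole evaluation.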

Defining a scoring function

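As we did in the Build an Evaluation pipeline tutorial, we'll define a set of example rows to test our app against and a scoring function. Our scoring function takes one row and evaluates it. The input arguments should match the corresponding keys in our row, so question here is taken from the row dictionary. output is the output of the model. The input to the model is also taken from the example based on its input argument, so question is used there too. We're using async functions so they run fast in parallel. If you need a quick introduction to async, you can find one here.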
from openai import OpenAI
import weave
import asyncio
import json  # needed by json.loads in the scoring function below

@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
asyncio.run(evaluation.evaluate(model)) # note: you'll need to define a model to evaluate
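
The scoring function above receives `question` because its argument name matches a key in each dataset row. The following sketch illustrates that name-matching idea in plain Python. It is not Weave's actual implementation; the helper `call_with_matching_keys`, the `exact_match` scorer, and the `expected` column are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_matching_keys(fn: Callable[..., Any], row: Dict[str, Any], output: Any) -> Any:
    """Call fn with the row values whose keys match fn's argument names, plus the model output."""
    params = inspect.signature(fn).parameters
    kwargs = {name: row[name] for name in params if name in row}
    if "output" in params:
        kwargs["output"] = output
    return fn(**kwargs)


def exact_match(question: str, expected: str, output: dict) -> dict:
    # `question` and `expected` come from the row; `output` comes from the model.
    return {"match": output["answer"] == expected}


row = {"question": "What is the capital of France?", "expected": "Paris", "unused": 1}
result = call_with_matching_keys(exact_match, row, {"answer": "Paris"})
```

Row keys that no argument asks for (here `unused`) are ignored, which is why a scorer only needs to declare the columns it actually uses.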

Optional: Defining a Scorer class

In some applications you may want to create custom evaluation classes, for example when a standardized LLMJudge class should be created with specific parameters (e.g. chat model, prompt), specific scoring of each row, and a specific calculation of an aggregate score. To support this, Weave defines a list of ready-to-use Scorer classes and also makes it easy to create a custom Scorer. In the following we'll see how to create a custom class CorrectnessLLMJudge(Scorer). At a high level, the steps to create a custom Scorer are quite simple:
  1. Define a custom class that inherits from weave.flow.scorer.Scorer
  2. Override the score function and add a @weave.op() decorator if you want to track each call of the function
    • This function has to define an output argument, to which the model's prediction is passed. We define it as type Optional[dict] in case the model returns None.
    • The rest of the arguments can either be a general Any or dict, or can select specific columns from the dataset that is used to evaluate the model with the weave.Evaluation class. They must have exactly the same names as the column names or keys of a single row after it has been passed to preprocess_model_input, if that is used.
  3. Optional: Override the summarize function to customize the calculation of the aggregate score. By default, Weave uses the weave.flow.scorer.auto_summarize function if you don't define a custom function.
    • This function has to have a @weave.op() decorator.
from typing import Any, Optional

import numpy as np
import weave
from weave import Scorer

class CorrectnessLLMJudge(Scorer):
    prompt: str
    model_name: str
    device: str

    @weave.op()
    async def score(self, output: Optional[dict], query: str, answer: str) -> Any:
        """Score the correctness of the predictions by comparing the pred, query, target.
        Args:
            - output: the dict that will be provided by the model that is evaluated
            - query: the question asked - as defined in the dataset
            - answer: the target answer - as defined in the dataset
        Returns:
            - single dict {metric name: single evaluation value}"""

        # get_model is a placeholder for a general model getter based on the provided params (OpenAI, HF, ...); it is not defined in this example
        eval_model = get_model(
            model_name=self.model_name,
            prompt=self.prompt,
            device=self.device,
        )
        # async evaluation to speed up evaluation - this doesn't have to be async
        grade = await eval_model.async_predict(
            {
                "query": query,
                "answer": answer,
                "result": output.get("result"),
            }
        )
        # output parsing - could be done more robustly with pydantic
        evaluation = "incorrect" not in grade["text"].strip().lower()

        # the column name displayed in Weave
        return {"correct": evaluation}

    @weave.op()
    def summarize(self, score_rows: list) -> Optional[dict]:
        """Aggregate all the scores that are calculated for each row by the scoring function.
        Args:
            - score_rows: a list of dicts. Each dict has metrics and scores
        Returns:
            - nested dict with the same structure as the input"""

        # if nothing is provided the weave.flow.scorer.auto_summarize function is used
        # return auto_summarize(score_rows)

        valid_data = [x.get("correct") for x in score_rows if x.get("correct") is not None]
        count_true = list(valid_data).count(True)
        int_data = [int(x) for x in valid_data]

        sample_mean = np.mean(int_data) if int_data else 0
        sample_variance = np.var(int_data) if int_data else 0
        sample_error = np.sqrt(sample_variance / len(int_data)) if int_data else 0

        # the extra "correct" layer is not necessary but adds structure in the UI
        return {
            "correct": {
                "true_count": count_true,
                "true_fraction": sample_mean,
                "stderr": sample_error,
            }
        }
To use this as a scorer, initialize it and pass it to the scorers argument of your Evaluation like this:
evaluation = weave.Evaluation(dataset=questions, scorers=[CorrectnessLLMJudge()])
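
The `summarize` method above reports a count, a fraction, and a standard error computed as the square root of the population variance divided by the number of rows. The following is a minimal standalone sketch of that arithmetic using only the standard library, so the numbers can be checked by hand. The function name `summarize_correct` and the sample rows are illustrative assumptions.

```python
import math
import statistics
from typing import Dict, List, Optional


def summarize_correct(score_rows: List[Dict[str, Optional[bool]]]) -> Dict[str, float]:
    """Aggregate per-row {"correct": bool} scores, skipping rows where the score is None."""
    valid = [row["correct"] for row in score_rows if row.get("correct") is not None]
    as_ints = [int(value) for value in valid]
    if not as_ints:
        return {"true_count": 0, "true_fraction": 0, "stderr": 0}
    mean = statistics.fmean(as_ints)
    # Population variance, matching the default behaviour of np.var.
    variance = statistics.pvariance(as_ints)
    return {
        "true_count": sum(as_ints),
        "true_fraction": mean,
        "stderr": math.sqrt(variance / len(as_ints)),
    }


summary = summarize_correct(
    [{"correct": True}, {"correct": True}, {"correct": False}, {"correct": True}, {"correct": None}]
)
```

With three correct rows out of four graded rows, the fraction is 0.75, the variance is 0.1875, and the standard error is the square root of 0.1875 / 4, roughly 0.217.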

4. Pulling it all together

To get the same result for your RAG apps:
  • Wrap LLM calls and retrieval step functions with weave.op()
  • (Optional) Create a Model subclass with a predict function and app details
  • Collect examples to evaluate
  • Create scoring functions that score one example
  • Use the Evaluation class to run evaluations on your examples
NOTE: Sometimes the async execution of Evaluations will trigger a rate limit on models from OpenAI, Anthropic, and others. To prevent that, you can set an environment variable that limits the number of parallel workers, e.g. WEAVE_PARALLELISM=3. Here, we show the code in its entirety.
from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

# Examples we've gathered that we want to use for evaluations
articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if it's stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the surface of the moon while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

article_embeddings = docs_to_embeddings(articles) # Note: you would typically do this once with your articles and put the embeddings & metadata in a database

# We've added a decorator to our retrieval step
@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # Get the index of the most similar document
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

# We create a Model subclass with some details about our app, along with a predict function that produces a response
class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

    @weave.op()
    def predict(self, question: str) -> dict: # note: `question` will be used later to select data from our evaluation rows
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

weave.init('rag-qa')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)

# Here is our scoring function, which uses our question and output to produce a score
@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]

# We define an Evaluation object and pass our example questions along with scoring functions
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
asyncio.run(evaluation.evaluate(model))
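
The earlier note on rate limits suggests capping the number of parallel workers. The following sketch shows the same idea in plain asyncio with a semaphore, which can be useful when you call a rate-limited API from your own async code. It is not how Weave applies its limit internally; `run_limited` and `fake_call` are hypothetical names, and the sleep stands in for a real API request.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_limited(tasks: List[Callable[[], Awaitable[int]]], limit: int) -> List[int]:
    """Run all tasks concurrently, but never more than `limit` at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task: Callable[[], Awaitable[int]]) -> int:
        async with semaphore:
            return await task()

    # gather preserves the order of the input tasks in its result list.
    return await asyncio.gather(*(guarded(task) for task in tasks))


async def demo() -> Tuple[List[int], int]:
    active = 0
    peak = 0

    async def fake_call(i: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a rate-limited API request
        active -= 1
        return i * 2

    results = await run_limited([lambda i=i: fake_call(i) for i in range(10)], limit=3)
    return results, peak


results, peak = asyncio.run(demo())
```

Here `peak` records the highest number of calls that were in flight at once, which the semaphore keeps at or below the chosen limit.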

Conclusion

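We've learned how to build observability into different steps of an application, such as the retrieval step in this example. We've also learned how to build more complex scoring functions, such as an LLM judge, for automatic evaluation of application responses.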