An Implementation of a Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step by step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial.
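
If the interactive configure prompt is inconvenient, for example when running as a script rather than in a notebook, credentials can be passed directly. The snippet below is a minimal sketch based on Opik's configuration options; the key and workspace values are placeholders, not real credentials.

# A minimal sketch of non-interactive configuration; the api_key and
# workspace values below are placeholders, not real credentials.

# For a self-hosted Opik server:
# opik.configure(use_local=True)

# For Opik Cloud with explicit credentials:
# opik.configure(api_key="YOUR_API_KEY", workspace="YOUR_WORKSPACE")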

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs, which gives us a dependable, self-contained generation layer for the rest of the pipeline (note that sampling is enabled, so individual completions will vary between runs).
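
Before wiring the helper into the pipeline, a quick smoke test confirms the model loads and generates. This small optional check uses only the helper defined above; because do_sample=True, the completion differs from run to run.

# Quick smoke test of the generation helper; sampling is enabled, so the
# exact completion varies between runs.
print(hf_generate("Opik is an open-source platform that", max_new_tokens=30))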

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Plan:
       {{plan}}


       Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior.
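
To confirm the templates substitute correctly, we can render one with dummy values before making any model call. This optional preview uses only the same format method the pipeline calls later.

# Render the planning template with dummy values to verify that the
# {{context}} and {{question}} placeholders are substituted as expected.
preview = plan_prompt.format(
    context="Opik traces LLM pipelines.",
    question="What does Opik trace?",
)
print(preview)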

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
   q = question.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database.
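
A quick sanity check confirms the keyword routing behaves as intended; each sample question below should map to a different document. Note that each call also emits a tool span to Opik, since retrieve_context is tracked.

# Sanity-check the routing: "span" -> tracing, "metric" -> evaluation,
# anything else -> overview. Each call is also logged as a tool span.
for q in [
    "How do spans work?",
    "Which metrics can I use?",
    "What is Opik?",
]:
    print(q, "->", retrieve_context(q).strip()[:50], "...")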

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
   rendered = plan_prompt.format(context=context, question=question)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       question=question,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
   context = retrieve_context(question)
   plan = plan_answer(context, question)
   answer = answer_from_plan(context, question, plan)
   return answer


print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly.
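
Optionally, running the pipeline on a few more questions generates one trace per call, each with nested retrieve_context, plan_answer, and answer_from_plan spans visible in the dashboard.

# Each call below produces one qa_pipeline trace with three nested spans
# (one tool call plus two LLM calls) in the Opik dashboard.
for q in [
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
]:
    print("Q:", q)
    print("A:", qa_pipeline(q), "\n")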

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset inside Opik for our evaluation. We insert three items, each pairing a question and its source context with a reference answer, covering different aspects of Opik. This dataset serves as the ground truth for our QA evaluation later.
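
We can read the items back to verify the insert; this sketch assumes the dataset client exposes a get_items method, as in recent Opik SDK releases. Opik also deduplicates identical rows, so re-running the insert cell will not create duplicates.

# Read the rows back as a sanity check; get_items() is assumed to be
# available, as in recent Opik SDK releases.
for item in dataset.get_items():
    print(item["question"])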

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   output = qa_pipeline(item["question"])
   return {
       "output": output,
       "reference": item["reference"],
   }

We define the evaluation task and select two metrics, Equals and LevenshteinRatio, to measure output quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine.
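
The heuristic metrics can also be invoked directly, which clarifies what the evaluation will compute. The sketch below assumes the metrics follow Opik's score(output=..., reference=...) signature and return a result object with a numeric value attribute.

# Score a single pair directly; identical strings should yield a
# LevenshteinRatio of 1.0. The .value attribute holds the numeric score.
demo = lev_metric.score(
    output="Tracing logs nested spans.",
    reference="Tracing logs nested spans.",
)
print(demo.value)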

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,
)


print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard.

agg = evaluation_result.aggregate_evaluation_scores()


print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.

