An Implementation of a Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step by step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial.
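
If the interactive configure prompt is inconvenient, for example when running as a script rather than in a notebook, credentials can be passed directly. The snippet below is a minimal sketch based on Opik's configuration options; the key and workspace values are placeholders, not real credentials.

# A minimal sketch of non-interactive configuration; the api_key and
# workspace values below are placeholders, not real credentials.

# For a self-hosted Opik server:
# opik.configure(use_local=True)

# For Opik Cloud with explicit credentials:
# opik.configure(api_key="YOUR_API_KEY", workspace="YOUR_WORKSPACE")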

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs, which gives us a dependable, self-contained generation layer for the rest of the pipeline (note that sampling is enabled, so individual completions will vary between runs).
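
Before wiring the helper into the pipeline, a quick smoke test confirms the model loads and generates. This small optional check uses only the helper defined above; because do_sample=True, the completion differs from run to run.

# Quick smoke test of the generation helper; sampling is enabled, so the
# exact completion varies between runs.
print(hf_generate("Opik is an open-source platform that", max_new_tokens=30))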

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Plan:
       {{plan}}


       Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior.
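
To confirm the templates substitute correctly, we can render one with dummy values before making any model call. This optional preview uses only the same format method the pipeline calls later.

# Render the planning template with dummy values to verify that the
# {{context}} and {{question}} placeholders are substituted as expected.
preview = plan_prompt.format(
    context="Opik traces LLM pipelines.",
    question="What does Opik trace?",
)
print(preview)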

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
   q = question.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database.
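
A quick sanity check confirms the keyword routing behaves as intended; each sample question below should map to a different document. Note that each call also emits a tool span to Opik, since retrieve_context is tracked.

# Sanity-check the routing: "span" -> tracing, "metric" -> evaluation,
# anything else -> overview. Each call is also logged as a tool span.
for q in [
    "How do spans work?",
    "Which metrics can I use?",
    "What is Opik?",
]:
    print(q, "->", retrieve_context(q).strip()[:50], "...")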

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
   rendered = plan_prompt.format(context=context, question=question)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       question=question,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
   context = retrieve_context(question)
   plan = plan_answer(context, question)
   answer = answer_from_plan(context, question, plan)
   return answer


print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly.
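
Optionally, running the pipeline on a few more questions generates one trace per call, each with nested retrieve_context, plan_answer, and answer_from_plan spans visible in the dashboard.

# Each call below produces one qa_pipeline trace with three nested spans
# (one tool call plus two LLM calls) in the Opik dashboard.
for q in [
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
]:
    print("Q:", q)
    print("A:", qa_pipeline(q), "\n")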

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset inside Opik for our evaluation. We insert three items, each pairing a question and its source context with a reference answer, covering different aspects of Opik. This dataset serves as the ground truth for our QA evaluation later.
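
We can read the items back to verify the insert; this sketch assumes the dataset client exposes a get_items method, as in recent Opik SDK releases. Opik also deduplicates identical rows, so re-running the insert cell will not create duplicates.

# Read the rows back as a sanity check; get_items() is assumed to be
# available, as in recent Opik SDK releases.
for item in dataset.get_items():
    print(item["question"])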

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   output = qa_pipeline(item["question"])
   return {
       "output": output,
       "reference": item["reference"],
   }

We define the evaluation task and select two metrics, Equals and LevenshteinRatio, to measure output quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine.
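
The heuristic metrics can also be invoked directly, which clarifies what the evaluation will compute. The sketch below assumes the metrics follow Opik's score(output=..., reference=...) signature and return a result object with a numeric value attribute.

# Score a single pair directly; identical strings should yield a
# LevenshteinRatio of 1.0. The .value attribute holds the numeric score.
demo = lev_metric.score(
    output="Tracing logs nested spans.",
    reference="Tracing logs nested spans.",
)
print(demo.value)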

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,
)


print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard.

agg = evaluation_result.aggregate_evaluation_scores()


print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.

