An objective overview of open-source LLM observability tools, with each tool covered in its own section below.
TL;DR
- A curated list of open-source tools for LLM observability in 2025.
- Each entry includes installation, core features, and integration notes.
- Tools covered: Traceloop, Langfuse, Helicone, Lunary, Phoenix (Arize AI), TruLens, Portkey, PostHog, Keywords AI, Langsmith, Opik, and OpenLIT.
Why LLM Observability Matters
Observability for large language models enables you to:
- Trace individual token or prompt calls across microservices
- Monitor cost and latency by endpoint or model version
- Detect errors, timeouts, and anomalous behavior (e.g., hallucinations)
- Correlate embeddings, retrieval calls, and final outputs in RAG pipelines
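To make these points concrete, here is a minimal, tool-agnostic sketch of tracing a single LLM call with the OpenTelemetry Python API (the span name, attribute keys, and the call_llm helper are illustrative, not tied to any specific tool below):
from opentelemetry import trace

tracer = trace.get_tracer("llm-observability-example")

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (OpenAI, Anthropic, a local model, etc.)
    return "example completion"

prompt = "Summarize last week's incident report."
with tracer.start_as_current_span("llm.chat_completion") as span:
    span.set_attribute("llm.model", "gpt-4o-mini")        # illustrative attribute keys
    span.set_attribute("llm.prompt_chars", len(prompt))
    completion = call_llm(prompt)
    span.set_attribute("llm.completion_chars", len(completion))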
1. Traceloop
An OpenTelemetry-compliant SDK for tracing and metrics in LLM applications.
- Installation:
pip install traceloop-sdk
- Configuration:
from traceloop.sdk import Traceloop
# Initialize with your app name; can disable batching to see traces immediately
Traceloop.init(app_name="your_app_name", disable_batch=True)
- Features:
  - Span-based telemetry compatible with Jaeger, Zipkin, and any OTLP receiver
  - Configurable batch sending and sampling through init parameters
  - Built-in semantic tags for errors, retries, and truncated outputs
- Integration: Works with LangChain, LlamaIndex, Haystack, and native OpenAI SDKs via automatic instrumentation (see the sketch below)
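As a rough sketch of what that automatic instrumentation looks like in practice (assuming the openai package is installed, OPENAI_API_KEY is set, and the model name is illustrative), initializing Traceloop before making a supported SDK call is all that is needed:
from openai import OpenAI
from traceloop.sdk import Traceloop

# After init, calls made through supported SDKs are traced automatically
Traceloop.init(app_name="your_app_name", disable_batch=True)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Write a haiku about tracing."}],
)
print(response.choices[0].message.content)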
2. Langfuse
A modular observability and logging framework tailored to LLM chains.
- Installation:
pip install langfuse
- Configuration:
from langfuse import Langfuse
# Initialize with your API key and optional project name
Langfuse.init(api_key="YOUR_API_KEY", project="my_project")
- Features:
  - Structured event logging for prompts, completions, and chain steps
  - Built-in integrations for vector stores: Pinecone, Weaviate, FAISS
  - Web UI dashboards for chain execution flow and performance metrics
- Integration: Use the @Langfuse.trace decorator around functions or the with Langfuse.trace() context manager (see the sketch below)
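As a minimal sketch of the decorator pattern described above, reusing the initialization from the configuration snippet (the decorator name follows this article's example and may differ between Langfuse versions; the function body is a stand-in for a real chain step):
from langfuse import Langfuse

Langfuse.init(api_key="YOUR_API_KEY", project="my_project")

@Langfuse.trace
def answer_question(question: str) -> str:
    # Replace this stub with your actual chain or model call
    return "example completion for: " + question

# Each invocation is logged as a structured prompt/completion event
answer = answer_question("What does structured event logging capture?")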
3. Helicone
A proxy-based solution that captures model calls without SDK changes.
- Deployment:
docker run -d -p 8080:8080 \
  -e HELICONE_API_KEY="YOUR_API_KEY" \
  helicone/proxy:latest
- Configuration: Point your LLM client to the proxy endpoint:
export OPENAI_API_BASE_URL="http://localhost:8080/v1"
- Features:
  - Transparent capture of all API calls via proxy
  - Automated cost and latency reporting
  - Scheduled email summaries of usage metrics
- Integration: Place it in front of any HTTP-based LLM endpoint; no code changes required (see the client sketch below)
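Because capture happens at the proxy layer, existing clients only need their base URL changed. A minimal sketch with the OpenAI Python client, assuming the proxy started by the docker command above is listening on localhost:8080 (the model name is illustrative):
from openai import OpenAI

# Route traffic through the Helicone proxy instead of the default endpoint
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="YOUR_OPENAI_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Hello through the proxy"}],
)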
4. Lunary
An observability tool focused on retrieval-augmented generation (RAG).
- Installation:
pip install lunary
- Configuration:
from lunary import Client
client = Client(api_key="YOUR_API_KEY")
- Features:
  - Traces embedding queries and similarity scores
  - Correlates retrieval latency with generation latency
  - Interactive dashboards for query versus context alignment
- Integration: Wrap RAG pipeline execution in the client.trace_rag() context manager (see the sketch below)
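A minimal sketch of wrapping a RAG pipeline in client.trace_rag() as described above (the retrieve and generate helpers are hypothetical placeholders for your own retrieval and generation steps, and the method name follows this article's description):
from lunary import Client

client = Client(api_key="YOUR_API_KEY")

def retrieve(query: str) -> list[str]:
    # Stand-in for an embedding query against your vector store
    return ["relevant chunk 1", "relevant chunk 2"]

def generate(query: str, context: list[str]) -> str:
    # Stand-in for the generation call
    return "example answer"

query = "How do I rotate API keys?"
with client.trace_rag():
    context = retrieve(query)          # retrieval latency captured here
    answer = generate(query, context)  # correlated with generation latency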
5. Phoenix (Arize AI)
A monitoring and anomaly-detection service for LLM metrics.
- Setup:
npm install @arize-ai/phoenix
- Configuration:
import { Phoenix } from "@arize-ai/phoenix";
const phoenix = new Phoenix({
  apiKey: "YOUR_API_KEY",
  organization: "YOUR_ORG_ID",
  environment: "production"
});
- Features:
  - Automatic drift detection across model versions
  - Alerting on latency and error rate thresholds
  - A/B testing support for comparative analysis
- Integration: Inject phoenix.logInference() calls around model invocation to log inference events
6. TruLens
An open-source toolkit for semantic evaluation of LLM outputs.
- Installation:
pip install trulens-eval
- Configuration:
from trulens_eval import Tru
tru = Tru(model_name="your-model-name")
results = tru.run(["prompt1", "prompt2"], metric="coherence")
- Features:
  - Built-in evaluators for coherence, redundancy, toxicity
  - Batch evaluation of historical outputs
  - Support for custom metric extensions
- Integration: Call tru.run() in evaluation pipelines or CI workflows to monitor output quality (see the sketch below)
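A minimal sketch of using tru.run() as a quality gate in CI, following the configuration shown above (the score format and the 0.7 threshold are assumptions for illustration; adapt them to the result structure your TruLens version actually returns):
from trulens_eval import Tru

tru = Tru(model_name="your-model-name")

prompts = ["Summarize this incident report.", "Explain our retry policy."]
results = tru.run(prompts, metric="coherence")

# Fail the CI job if any output falls below the agreed coherence threshold
THRESHOLD = 0.7  # illustrative threshold
for prompt, score in zip(prompts, results):
    assert score >= THRESHOLD, f"Coherence {score} below {THRESHOLD} for: {prompt}"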
7. Portkey
A CLI-driven profiler for prompt engineering workflows.
- Installation:
npm install -g portkey
- Configuration:
portkey init --api-key YOUR_API_KEY
- Features:
  - Auto-instruments OpenAI, Anthropic, and Hugging Face SDK calls
  - Captures system metrics (CPU, memory) alongside token costs
  - Local replay mode for comparative benchmarks
- Usage: Run portkey audit ./path-to-your-code to generate a trace report
8. PostHog
A product-analytics platform with an LLM observability plugin.
- Installation:
npm install posthog-node @posthog/plugin-llm
- Configuration:
import PostHog from 'posthog-node';
const posthog = new PostHog('YOUR_PROJECT_API_KEY', { host: 'https://app.posthog.com' });
- Features:
  - Treats each LLM call as an analytics event
  - Funnel and cohort analysis on prompt usage
  - Alerting on custom error or latency conditions
- Integration: Use posthog.capture() around your model calls to log events; the plugin enriches events with LLM metadata
9. Keywords AI
An intent-tagging and alerting tool based on keyword rules.
- Installation:
pip install keywords-ai
- Configuration:
from keywords_ai import Client
client = Client(api_key="YOUR_API_KEY")
intents = client.analyze("Which model should I use for medical diagnosis?")
- Features:
  - Intent classification via configurable keyword lists
  - Emits metrics when specified intents (e.g., “legal,” “medical”) occur
  - Custom alerting hooks for regulatory workflows
- Integration: Follows a middleware pattern for any LLM request pipeline; call client.analyze() before or after completion (see the sketch below)
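A minimal sketch of that middleware pattern, calling client.analyze() on the prompt before the model call and on the completion afterwards (call_llm is a hypothetical stand-in, and the assumption that analyze() returns a list of intent labels follows the configuration example above):
from keywords_ai import Client

client = Client(api_key="YOUR_API_KEY")
FLAGGED_INTENTS = {"legal", "medical"}  # intents that should raise alerts

def call_llm(prompt: str) -> str:
    # Stand-in for your actual model call
    return "example completion"

def observed_completion(prompt: str) -> str:
    if FLAGGED_INTENTS & set(client.analyze(prompt)):
        print("alert: flagged intent in prompt")       # pre-call check
    completion = call_llm(prompt)
    if FLAGGED_INTENTS & set(client.analyze(completion)):
        print("alert: flagged intent in completion")   # post-call check
    return completion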
10. Langsmith
The official LangChain observability extension.
- Installation:
pip install langsmith
- Configuration:
from langsmith import Client, trace
client = Client(api_key="YOUR_API_KEY")
@trace(client)
def my_chain(*args, **kwargs):
    # chain logic here
    pass
- Features:
  - Decorators for instrumenting sync/async functions
  - Visual chain graphs in Jupyter and CLI reports
  - Metadata tagging for run context and environment
- Integration: Use the @trace(client) decorator or the with trace(client): context manager around LangChain executions (see the sketch below)
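A minimal sketch of the context-manager form, complementing the decorator shown in the configuration (the trace(client) call follows this article's example and may not match the exact signature of your langsmith version; run_chain is a hypothetical stand-in for a LangChain execution):
from langsmith import Client, trace

client = Client(api_key="YOUR_API_KEY")

def run_chain(question: str) -> str:
    # Stand-in for invoking your LangChain chain or graph
    return "example chain output"

with trace(client):
    # Work done inside this block is recorded as a single traced run
    result = run_chain("Which retriever performed best last week?")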
11. Opik and OpenLIT
Lightweight community projects for minimal-overhead instrumentation.
- Opik (JavaScript SDK, ~10 KB):
  - Installation:
    npm install @opik/sdk
  - Configuration:
    import { Opik } from "@opik/sdk";
    const opik = new Opik({ apiKey: "YOUR_API_KEY" });
    opik.track("prompt text", { model: "gpt-4", tokens: 120 });
- OpenLIT (Python, <2 ms overhead):
  - Installation:
    pip install openlit
  - Configuration:
    from openlit import tracer
    tracer.configure(service_name="my_service")
    tracer.trace_llm("text-davinci-003", prompt="Hello world")
Conclusion & Next Steps
- Identify your primary observability needs (tracing, cost reporting, RAG metrics, semantic evaluation).
- Select one or more tools from this list based on compatibility and feature focus.
- Integrate and monitor within staging before rolling out to production.
- Compare metrics and adjust sampling rates or alert thresholds to balance overhead and insight.
FAQ
Q1: Which tool emits OpenTelemetry spans?
A1: Traceloop (OpenLLMetry) and OpenLIT both emit OTLP-compatible spans.
Q2: How can I capture cost reports without code changes?
A2: Helicone operates as a proxy in front of your LLM endpoint and generates cost reports automatically.
Q3: What’s the easiest way to trace RAG pipelines?
A3: Lunary captures embedding and retrieval metrics alongside generation latency in a single dashboard.
Q4: Can I analyze LLM calls as product-analytics events?
A4: Yes; PostHog’s LLM plugin treats each API call as an event for funnel and cohort analysis.
Q5: Are there lightweight front-end options for prompt observability?
A5: Opik’s JavaScript SDK (≈10 KB) can be embedded in web applications for real-time prompt tracking.