# Claude Prompt for Evals & Observability
Instrument, query, and triage structured-extraction LLM app traces in Langfuse with the Python SDK, covering latency, cost, and quality dashboards.
You are the observability lead for an LLM-powered product. Build the complete tracing + analysis stack in Langfuse that lets anyone on the team answer "why was this request slow/expensive/wrong?" in under 60 seconds.
## Trace Model
Every user interaction is a trace. Within a trace:
- **Spans:** discrete steps (LLM call, retrieval, tool call, DB query)
- **Events:** instant occurrences (cache hit, retry, error)
- **Attributes:** key-value metadata on spans
### Span Naming Convention
Use `noun.verb` like:
- `llm.completion` — a single LLM API call
- `retrieval.search` — a vector search call
- `rerank.score` — rerank API call
- `tool.execute` — agent tool invocation
- `prompt.render` — template filling
- `validator.check` — schema validation
### Required Attributes (every LLM span)
- `model` — exact model id, e.g., "claude-sonnet-4-5-20251001"
- `prompt.template_name` + `prompt.template_version`
- `prompt.hash` — content hash of the exact prompt sent
- `input.tokens`, `output.tokens`, `cache.read_tokens`, `cache.creation_tokens`
- `cost.usd` (computed from tokens × model price)
- `latency.ms`
- `user.id` (hashed, not raw PII)
- `session.id`, `request.id`
- `error.type` (if any), `error.message`
- `feedback.thumbs` (if user rated later, backfill)
- `quality.judge_score` (if offline-scored)
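To keep this contract enforceable rather than tribal knowledge, one option is a single typed definition that every instrumentation wrapper imports. A minimal sketch with hypothetical names; dots become underscores because TypedDict keys must be Python identifiers, and `total=False` stands in for the fields that are backfilled later:

```python
from typing import TypedDict

class LLMSpanAttributes(TypedDict, total=False):
    model: str                     # exact model id
    prompt_template_name: str
    prompt_template_version: str
    prompt_hash: str               # sha256 of the exact prompt sent
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_creation_tokens: int
    cost_usd: float                # tokens x model price
    latency_ms: float
    user_id: str                   # hashed, never raw PII
    session_id: str
    request_id: str
    error_type: str                # only on failures
    error_message: str             # only on failures
    feedback_thumbs: bool          # backfilled from user ratings
    quality_judge_score: float     # backfilled by the offline judge
```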
### PII Handling
- Do NOT log raw prompts/completions by default in production traces
- Instead log hashes + sampled raw (e.g., 1% of traffic → raw, 99% → hash only)
- Redact emails, phones, SSNs with a regex pre-processor (sketch below)
- User-consented debug mode can enable raw logging per session
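A minimal sketch of the regex pre-processor; these patterns are conservative, US-centric starting points rather than a complete PII taxonomy, so tune them for your locales before trusting them in production:

```python
import re

# Applied before any raw text reaches the trace sink.
REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Replace PII matches with typed placeholder tokens."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

assert redact("mail me at a@b.com or call 555-123-4567") == \
    "mail me at [EMAIL] or call [PHONE]"
```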
## Instrumentation
### Python (Langfuse)
```python
import hashlib

from anthropic import AsyncAnthropic
# Langfuse v2 decorator API; v3 replaces langfuse_context with get_client()
from langfuse.decorators import observe, langfuse_context

client = AsyncAnthropic()

@observe(as_type="generation", name="llm.completion")
async def complete(prompt: str, model: str):
    response = await client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    langfuse_context.update_current_observation(  # attach required attributes
        model=model,
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
        metadata={"prompt.hash": hashlib.sha256(prompt.encode()).hexdigest(),
                  "cost.usd": compute_cost(response.usage, model)},  # helper sketched below
    )
    return response
```
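`compute_cost` above is your own helper; a minimal sketch with a hypothetical price table (the numbers are illustrative, so keep them synced with your provider's current price sheet):

```python
# Illustrative per-million-token prices in USD (verify before use).
PRICES = {"claude-sonnet-4-5-20251001": {"input": 3.00, "output": 15.00}}

def compute_cost(usage, model: str) -> float:
    """cost.usd = tokens x per-token model price."""
    price = PRICES[model]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000
```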
### TypeScript (Langfuse)
Use the equivalent wrapper from the Langfuse JS/TS SDK, and propagate trace context across async boundaries with `AsyncLocalStorage`.
## Dashboards
### Dashboard 1: Latency
- Distribution of end-to-end trace latency (p50, p95, p99)
- Breakdown by span type (which step is slow?)
- Top-10 slowest traces in last 24h (click to drill in; scripted below)
- Correlation: latency vs input_tokens, latency vs model, latency vs cache_hit
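The same top-10 view can be scripted for triage bots or weekly reports; a sketch assuming the v2 Python SDK's `fetch_traces` helper and a `latency` field on the returned trace objects (both vary by SDK version, so verify against yours):

```python
from datetime import datetime, timedelta, timezone

from langfuse import Langfuse

langfuse = Langfuse()  # credentials from LANGFUSE_* env vars

resp = langfuse.fetch_traces(
    from_timestamp=datetime.now(timezone.utc) - timedelta(hours=24),
    limit=100,
)
# Sort client-side; `latency` is in seconds on v2 trace objects --
# adjust the field name if your SDK version differs.
slowest = sorted(resp.data, key=lambda t: t.latency or 0, reverse=True)[:10]
for t in slowest:
    print(f"{t.latency:8.2f}s  {t.id}  {t.name}")
```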
### Dashboard 2: Cost
- Cost per user cohort per day
- Cost per feature / endpoint
- Cost breakdown by model (are we paying premium for tasks that could use cheaper tier?)
- Cache savings: total tokens served from cache vs fresh (worked example below)
- Anomaly detection: flag sudden spikes against a trailing baseline
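Cache savings is plain arithmetic over the token attributes defined earlier; a worked sketch using Anthropic-style multipliers (cache reads at roughly 0.1x the base input price, cache writes at roughly 1.25x; verify current pricing before relying on the numbers):

```python
def cache_savings_usd(cache_read_tokens: int,
                      cache_creation_tokens: int,
                      input_price_per_mtok: float) -> float:
    """What cached tokens would have cost fresh, minus the actual
    read cost, minus the one-time cache-write surcharge."""
    fresh = cache_read_tokens * input_price_per_mtok / 1e6
    reads = fresh * 0.10                      # reads billed at ~0.1x
    write_surcharge = cache_creation_tokens * input_price_per_mtok * 0.25 / 1e6
    return fresh - reads - write_surcharge

# 40M cached-read tokens + 2M cache-write tokens at $3/MTok:
print(cache_savings_usd(40_000_000, 2_000_000, 3.00))  # 106.5
```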
### Dashboard 3: Quality
- Judge score rolling average
- Pass rate on regression suite (per release)
- User thumbs-up rate (thumbs_up / total) over time
- Refusal rate (alert if it drifts outside the target range)
- Distribution of confidence scores
### Dashboard 4: Errors
- Error rate per endpoint
- Top error messages (grouped)
- Rate-limit events
- Repair/retry rate (structured output)
- Tool-call validation failures
## Common Triage Queries
### "Why is this request slow?"
Open the trace. Look at the Gantt view. Identify the longest span. Check if it's:
- LLM call → check input tokens, model, provider status
- Retrieval → check index size, filter selectivity
- Tool call → check downstream service latency
- Sequential spans that could run in parallel → opportunity to fix (sketch below)
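The last case is the cheapest win: when the Gantt view shows independent spans running back-to-back, fan them out. A runnable sketch; `search_index` and `fetch_user_profile` are hypothetical stand-ins for your own independent steps:

```python
import asyncio

async def search_index(query: str) -> list[str]:     # hypothetical retrieval call
    await asyncio.sleep(0.3)
    return [f"doc for {query}"]

async def fetch_user_profile(user_id: str) -> dict:  # hypothetical downstream call
    await asyncio.sleep(0.3)
    return {"id": user_id}

async def handle_request(query: str, user_id: str):
    # Sequential awaits would cost ~0.6s; gather runs both spans
    # concurrently, so the trace pays max(span) instead of sum(spans).
    docs, profile = await asyncio.gather(
        search_index(query), fetch_user_profile(user_id)
    )
    return docs, profile

asyncio.run(handle_request("refund policy", "u_123"))
```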
### "Why is cost up 30% this week?"
Query: cost by model × day (pandas sketch after this list). Often one of:
- Traffic growth (check request count)
- Prompt length creep (check avg input_tokens over time)
- Cache hit rate dropped (check cache.hit ratio)
- Tier mix shift (more traffic to expensive model)
- New feature launched with expensive prompts (filter by endpoint × release_tag)
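A sketch of the core query in pandas, assuming spans are exported to a DataFrame with `date`, `model`, and `cost_usd` columns (the column names are hypothetical; map them to your export schema):

```python
import pandas as pd

def cost_by_model_day(spans: pd.DataFrame) -> pd.DataFrame:
    """Daily cost per model; diff two weeks of this table and the
    culprit (traffic, tier mix, prompt creep) usually jumps out."""
    spans = spans.assign(date=pd.to_datetime(spans["date"]))
    return (
        spans.groupby([pd.Grouper(key="date", freq="D"), "model"])["cost_usd"]
        .sum()
        .unstack(fill_value=0.0)   # one column per model
    )
```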
### "Why did quality regress?"
Query: judge_score by prompt_template_version. Bisect to the change that introduced the drop. Pull 10 sample regressions for manual review.
## Alerts
Configure in Langfuse (a portable fallback checker is sketched after this list):
- `latency_p95 > 3s` for 10 min → page on-call
- `error_rate > 2%` for 5 min → page on-call
- `cost_per_day > $X` → Slack finance channel
- `judge_score_7d_avg drops > 5%` → Slack team
- `repair_rate > 3%` → Slack team
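If your Langfuse deployment lacks a rule you need, the same thresholds can run as a small scheduled job against your metrics store. A sketch with hypothetical `get_metric` values and a Slack webhook; the "for N minutes" durations are left to the scheduler's evaluation window:

```python
import os

import requests

def get_metric(name: str) -> float:
    """Hypothetical read; wire to Prometheus, a warehouse view over
    Langfuse exports, or the Langfuse API."""
    return {"latency_p95_ms": 2400.0, "error_rate": 0.031,
            "judge_score_7d_delta": -0.02, "repair_rate": 0.01}[name]

RULES = [
    ("latency_p95_ms", lambda v: v > 3000, "page-oncall"),
    ("error_rate", lambda v: v > 0.02, "page-oncall"),
    ("judge_score_7d_delta", lambda v: v < -0.05, "slack-team"),
    ("repair_rate", lambda v: v > 0.03, "slack-team"),
]

def evaluate() -> None:
    for metric, breached, route in RULES:
        value = get_metric(metric)
        if breached(value):
            requests.post(os.environ["SLACK_WEBHOOK_URL"],
                          json={"text": f"[{route}] ALERT {metric}={value:.3f}"})
```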
## Sampling Strategy
- Head-based sampling: keep 100% of traces with errors, 10% of normal traffic (deterministic sketch after this list)
- Keep raw prompts for 1% of traffic, plus all errors and all low-quality-scored traces
- Retention: 30 days hot, 180 days cold archive
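Hash-based sampling makes the keep/drop decision deterministic per request, so every service in the call chain agrees without coordination. A sketch of both policies; the `judge_score < 0.5` cutoff is a hypothetical placeholder for your own "low quality" threshold:

```python
import hashlib

def should_sample(request_id: str, had_error: bool, rate: float = 0.10) -> bool:
    """Keep all error traces; keep `rate` of the rest, chosen
    deterministically from the request id."""
    if had_error:
        return True
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate

def should_keep_raw_text(request_id: str, had_error: bool,
                         judge_score: float | None) -> bool:
    """Raw prompt retention: 1% of traffic, plus all errors and all
    low-quality-scored traces."""
    if had_error or (judge_score is not None and judge_score < 0.5):
        return True
    return should_sample(request_id, had_error=False, rate=0.01)
```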
## Golden-Set Replay
Automate a weekly replay of the golden set (job sketched after this list):
- Trigger: Sunday 2am UTC
- Run each golden example through the production code path
- Auto-score with the judge
- Post a summary to Slack: score delta vs last week, top 5 regressions
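A sketch of the replay job; `run_production_path`, `judge`, and `post_to_slack` are hypothetical hooks into your own pipeline, and the per-example `baseline_score` field is an assumed part of the golden-set schema:

```python
import statistics

def weekly_replay(golden_set: list[dict], last_week_avg: float) -> None:
    scores, regressions = [], []
    for example in golden_set:
        output = run_production_path(example["input"])   # real prod code path
        score = judge(example, output)                   # offline LLM-as-judge
        scores.append(score)
        if score < example.get("baseline_score", 1.0):
            regressions.append((score, example["id"]))
    avg = statistics.mean(scores)
    top5 = ", ".join(f"{eid} ({s:.2f})" for s, eid in sorted(regressions)[:5])
    post_to_slack(
        f"Golden-set replay: avg={avg:.3f} "
        f"(delta vs last week: {avg - last_week_avg:+.3f})\n"
        f"Top regressions: {top5 or 'none'}"
    )
```

Schedule it with cron (`0 2 * * 0` fires Sunday 02:00 UTC) or your orchestrator of choice.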
## Deliverables
1. Instrumentation library wrappers for each LLM provider
2. Langfuse dashboards (exportable JSON)
3. Alert rules
4. Triage playbook with queries pinned in Langfuse
5. Weekly report automation
6. Onboarding doc: "how to debug an LLM request in Langfuse"
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.