# Claude Prompt for Evals & Observability
Instrument, query, and triage structured-extraction LLM app traces in Langfuse with the Python SDK, covering latency, cost, and quality dashboards.
You are the observability lead for an LLM-powered product. Build the complete tracing + analysis stack in Langfuse that lets anyone on the team answer "why was this request slow/expensive/wrong?" in under 60 seconds.
## Trace Model
Every user interaction is a trace. Within a trace:
- **Spans:** discrete steps (LLM call, retrieval, tool call, DB query)
- **Events:** instant occurrences (cache hit, retry, error)
- **Attributes:** key-value metadata on spans
### Span Naming Convention
Use `noun.verb` like:
- `llm.completion` — a single LLM API call
- `retrieval.search` — a vector search call
- `rerank.score` — rerank API call
- `tool.execute` — agent tool invocation
- `prompt.render` — template filling
- `validator.check` — schema validation
### Required Attributes (every LLM span)
- `model` — exact model id, e.g., "claude-sonnet-4-5-20251001"
- `prompt.template_name` + `prompt.template_version`
- `prompt.hash` — content hash of the exact prompt sent
- `input.tokens`, `output.tokens`, `cache.read_tokens`, `cache.creation_tokens`
- `cost.usd` (computed from tokens × model price)
- `latency.ms`
- `user.id` (hashed, not raw PII)
- `session.id`, `request.id`
- `error.type` (if any), `error.message`
- `feedback.thumbs` (if user rated later, backfill)
- `quality.judge_score` (if offline-scored)
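To keep this contract enforceable rather than tribal knowledge, one option is a single typed definition that every instrumentation wrapper imports. A minimal sketch with hypothetical names; dots become underscores because TypedDict keys must be Python identifiers, and `total=False` stands in for the fields that are backfilled later:

```python
from typing import TypedDict

class LLMSpanAttributes(TypedDict, total=False):
    model: str                     # exact model id
    prompt_template_name: str
    prompt_template_version: str
    prompt_hash: str               # sha256 of the exact prompt sent
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_creation_tokens: int
    cost_usd: float                # tokens x model price
    latency_ms: float
    user_id: str                   # hashed, never raw PII
    session_id: str
    request_id: str
    error_type: str                # only on failures
    error_message: str             # only on failures
    feedback_thumbs: bool          # backfilled from user ratings
    quality_judge_score: float     # backfilled by the offline judge
```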
### PII Handling
- Do NOT log raw prompts/completions by default in production traces
- Instead log hashes + sampled raw (e.g., 1% of traffic → raw, 99% → hash only)
- Redact emails, phones, SSNs with a regex pre-processor (sketch below)
- User-consented debug mode can enable raw logging per session
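A minimal sketch of the regex pre-processor; these patterns are conservative, US-centric starting points rather than a complete PII taxonomy, so tune them for your locales before trusting them in production:

```python
import re

# Applied before any raw text reaches the trace sink.
REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Replace PII matches with typed placeholder tokens."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

assert redact("mail me at a@b.com or call 555-123-4567") == \
    "mail me at [EMAIL] or call [PHONE]"
```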
## Instrumentation
### Python (Langfuse)
```python
import hashlib

from anthropic import AsyncAnthropic
# Langfuse v2 decorator API; v3 replaces langfuse_context with get_client()
from langfuse.decorators import observe, langfuse_context

client = AsyncAnthropic()

@observe(as_type="generation", name="llm.completion")
async def complete(prompt: str, model: str):
    response = await client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    langfuse_context.update_current_observation(  # attach required attributes
        model=model,
        usage={"input": response.usage.input_tokens, "output": response.usage.output_tokens},
        metadata={"prompt.hash": hashlib.sha256(prompt.encode()).hexdigest(),
                  "cost.usd": compute_cost(response.usage, model)},  # helper sketched below
    )
    return response
```
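`compute_cost` above is your own helper; a minimal sketch with a hypothetical price table (the numbers are illustrative, so keep them synced with your provider's current price sheet):

```python
# Illustrative per-million-token prices in USD (verify before use).
PRICES = {"claude-sonnet-4-5-20251001": {"input": 3.00, "output": 15.00}}

def compute_cost(usage, model: str) -> float:
    """cost.usd = tokens x per-token model price."""
    price = PRICES[model]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000
```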
### TypeScript (Langfuse)
Use the equivalent wrapper from the Langfuse JS/TS SDK, and propagate trace context across async boundaries with `AsyncLocalStorage`.
## Dashboards
### Dashboard 1: Latency
- Distribution of end-to-end trace latency (p50, p95, p99)
- Breakdown by span type (which step is slow?)
- Top-10 slowest traces in last 24h (click to drill in; scripted below)
- Correlation: latency vs input_tokens, latency vs model, latency vs cache_hit
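The same top-10 view can be scripted for triage bots or weekly reports; a sketch assuming the v2 Python SDK's `fetch_traces` helper and a `latency` field on the returned trace objects (both vary by SDK version, so verify against yours):

```python
from datetime import datetime, timedelta, timezone

from langfuse import Langfuse

langfuse = Langfuse()  # credentials from LANGFUSE_* env vars

resp = langfuse.fetch_traces(
    from_timestamp=datetime.now(timezone.utc) - timedelta(hours=24),
    limit=100,
)
# Sort client-side; `latency` is in seconds on v2 trace objects --
# adjust the field name if your SDK version differs.
slowest = sorted(resp.data, key=lambda t: t.latency or 0, reverse=True)[:10]
for t in slowest:
    print(f"{t.latency:8.2f}s  {t.id}  {t.name}")
```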
### Dashboard 2: Cost
- Cost per user cohort per day
- Cost per feature / endpoint
- Cost breakdown by model (are we paying premium for tasks that could use cheaper tier?)
- Cache savings: total tokens served from cache vs fresh (worked example below)
- Anomaly detection: flag sudden spikes against a trailing baseline
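Cache savings is plain arithmetic over the token attributes defined earlier; a worked sketch using Anthropic-style multipliers (cache reads at roughly 0.1x the base input price, cache writes at roughly 1.25x; verify current pricing before relying on the numbers):

```python
def cache_savings_usd(cache_read_tokens: int,
                      cache_creation_tokens: int,
                      input_price_per_mtok: float) -> float:
    """What cached tokens would have cost fresh, minus the actual
    read cost, minus the one-time cache-write surcharge."""
    fresh = cache_read_tokens * input_price_per_mtok / 1e6
    reads = fresh * 0.10                      # reads billed at ~0.1x
    write_surcharge = cache_creation_tokens * input_price_per_mtok * 0.25 / 1e6
    return fresh - reads - write_surcharge

# 40M cached-read tokens + 2M cache-write tokens at $3/MTok:
print(cache_savings_usd(40_000_000, 2_000_000, 3.00))  # 106.5
```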
### Dashboard 3: Quality
- Judge score rolling average
- Pass rate on regression suite (per release)
- User thumbs-up rate (thumbs_up / total) over time
- Refusal rate (alert if it drifts outside the target range)
- Distribution of confidence scores
### Dashboard 4: Errors
- Error rate per endpoint
- Top error messages (grouped)
- Rate-limit events
- Repair/retry rate (structured output)
- Tool-call validation failures
## Common Triage Queries
### "Why is this request slow?"
Open the trace. Look at the Gantt view. Identify the longest span. Check if it's:
- LLM call → check input tokens, model, provider status
- Retrieval → check index size, filter selectivity
- Tool call → check downstream service latency
- Sequential spans that could run in parallel → opportunity to fix (sketch below)
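The last case is the cheapest win: when the Gantt view shows independent spans running back-to-back, fan them out. A runnable sketch; `search_index` and `fetch_user_profile` are hypothetical stand-ins for your own independent steps:

```python
import asyncio

async def search_index(query: str) -> list[str]:     # hypothetical retrieval call
    await asyncio.sleep(0.3)
    return [f"doc for {query}"]

async def fetch_user_profile(user_id: str) -> dict:  # hypothetical downstream call
    await asyncio.sleep(0.3)
    return {"id": user_id}

async def handle_request(query: str, user_id: str):
    # Sequential awaits would cost ~0.6s; gather runs both spans
    # concurrently, so the trace pays max(span) instead of sum(spans).
    docs, profile = await asyncio.gather(
        search_index(query), fetch_user_profile(user_id)
    )
    return docs, profile

asyncio.run(handle_request("refund policy", "u_123"))
```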
### "Why is cost up 30% this week?"
Query: cost by model × day (pandas sketch after this list). Often one of:
- Traffic growth (check request count)
- Prompt length creep (check avg input_tokens over time)
- Cache hit rate dropped (check cache.hit ratio)
- Tier mix shift (more traffic to expensive model)
- New feature launched with expensive prompts (filter by endpoint × release_tag)
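A sketch of the core query in pandas, assuming spans are exported to a DataFrame with `date`, `model`, and `cost_usd` columns (the column names are hypothetical; map them to your export schema):

```python
import pandas as pd

def cost_by_model_day(spans: pd.DataFrame) -> pd.DataFrame:
    """Daily cost per model; diff two weeks of this table and the
    culprit (traffic, tier mix, prompt creep) usually jumps out."""
    spans = spans.assign(date=pd.to_datetime(spans["date"]))
    return (
        spans.groupby([pd.Grouper(key="date", freq="D"), "model"])["cost_usd"]
        .sum()
        .unstack(fill_value=0.0)   # one column per model
    )
```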
### "Why did quality regress?"
Query: judge_score by prompt_template_version. Bisect to the change that introduced the drop. Pull 10 sample regressions for manual review.
## Alerts
Configure in Langfuse (a portable fallback checker is sketched after this list):
- `latency_p95 > 3s` for 10 min → page on-call
- `error_rate > 2%` for 5 min → page on-call
- `cost_per_day > $X` → Slack finance channel
- `judge_score_7d_avg drops > 5%` → Slack team
- `repair_rate > 3%` → Slack team
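If your Langfuse deployment lacks a rule you need, the same thresholds can run as a small scheduled job against your metrics store. A sketch with hypothetical `get_metric` values and a Slack webhook; the "for N minutes" durations are left to the scheduler's evaluation window:

```python
import os

import requests

def get_metric(name: str) -> float:
    """Hypothetical read; wire to Prometheus, a warehouse view over
    Langfuse exports, or the Langfuse API."""
    return {"latency_p95_ms": 2400.0, "error_rate": 0.031,
            "judge_score_7d_delta": -0.02, "repair_rate": 0.01}[name]

RULES = [
    ("latency_p95_ms", lambda v: v > 3000, "page-oncall"),
    ("error_rate", lambda v: v > 0.02, "page-oncall"),
    ("judge_score_7d_delta", lambda v: v < -0.05, "slack-team"),
    ("repair_rate", lambda v: v > 0.03, "slack-team"),
]

def evaluate() -> None:
    for metric, breached, route in RULES:
        value = get_metric(metric)
        if breached(value):
            requests.post(os.environ["SLACK_WEBHOOK_URL"],
                          json={"text": f"[{route}] ALERT {metric}={value:.3f}"})
```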
## Sampling Strategy
- Head-based sampling: keep 100% of traces with errors, 10% of normal traffic (deterministic sketch after this list)
- Keep raw prompts for 1% of traffic, plus all errors and all low-quality-scored traces
- Retention: 30 days hot, 180 days cold archive
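Hash-based sampling makes the keep/drop decision deterministic per request, so every service in the call chain agrees without coordination. A sketch of both policies; the `judge_score < 0.5` cutoff is a hypothetical placeholder for your own "low quality" threshold:

```python
import hashlib

def should_sample(request_id: str, had_error: bool, rate: float = 0.10) -> bool:
    """Keep all error traces; keep `rate` of the rest, chosen
    deterministically from the request id."""
    if had_error:
        return True
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate

def should_keep_raw_text(request_id: str, had_error: bool,
                         judge_score: float | None) -> bool:
    """Raw prompt retention: 1% of traffic, plus all errors and all
    low-quality-scored traces."""
    if had_error or (judge_score is not None and judge_score < 0.5):
        return True
    return should_sample(request_id, had_error=False, rate=0.01)
```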
## Golden-Set Replay
Automate a weekly replay of the golden set (job sketched after this list):
- Trigger: Sunday 2am UTC
- Run each golden example through the production code path
- Auto-score with the judge
- Post a summary to Slack: score delta vs last week, top 5 regressions
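A sketch of the replay job; `run_production_path`, `judge`, and `post_to_slack` are hypothetical hooks into your own pipeline, and the per-example `baseline_score` field is an assumed part of the golden-set schema:

```python
import statistics

def weekly_replay(golden_set: list[dict], last_week_avg: float) -> None:
    scores, regressions = [], []
    for example in golden_set:
        output = run_production_path(example["input"])   # real prod code path
        score = judge(example, output)                   # offline LLM-as-judge
        scores.append(score)
        if score < example.get("baseline_score", 1.0):
            regressions.append((score, example["id"]))
    avg = statistics.mean(scores)
    top5 = ", ".join(f"{eid} ({s:.2f})" for s, eid in sorted(regressions)[:5])
    post_to_slack(
        f"Golden-set replay: avg={avg:.3f} "
        f"(delta vs last week: {avg - last_week_avg:+.3f})\n"
        f"Top regressions: {top5 or 'none'}"
    )
```

Schedule it with cron (`0 2 * * 0` fires Sunday 02:00 UTC) or your orchestrator of choice.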
## Deliverables
1. Instrumentation library wrappers for each LLM provider
2. Langfuse dashboards (exportable JSON)
3. Alert rules
4. Triage playbook with queries pinned in Langfuse
5. Weekly report automation
6. Onboarding doc: "how to debug an LLM request in Langfuse"
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.