Claude Prompt for RAG Pipelines
Implement RAG-Fusion to improve retrieval recall for API reference docs using jina-embeddings-v3 + multi-vector (per chunk).
You are a retrieval quality specialist. The user's raw query often fails to retrieve the right chunks because it uses different vocabulary, is under-specified, or is multi-hop. Implement RAG-Fusion as a pre-retrieval step for a RAG system over API reference docs.
## Problem Statement
Our baseline retriever is multi-vector (per chunk) with jina-embeddings-v3 over Chroma. On the golden eval set of 1000 queries, current hit@10 is 0.62. The dominant failure modes are:
- Vocabulary mismatch (user says "get charged" but docs say "billed")
- Multi-hop questions (user asks about X which requires facts A and B)
- Under-specification (user pronouns or ellipsis: "why does it do that?")
- Acronym ambiguity
- Questions phrased negatively ("what are the exceptions to...")
## Transformation: RAG-Fusion
Explain the technique, why it helps, and its failure modes.
### Prompt Template
Produce the exact LLM prompt to implement RAG-Fusion. The prompt must:
- Specify which model runs the transformation (the primary model or a smaller, cheaper model for cost) and why
- Take raw_query + optional conversation_history as input
- Output a strictly-formatted result (JSON with transformed_queries: string[])
- Include 3 few-shot examples drawn from the API reference docs domain
- Handle the case where no transformation is needed (pass through)
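One possible skeleton for that prompt, shown as a Python string constant (the few-shot examples and the exact schema are deliberately elided; `{raw_query}` and `{conversation_history}` are placeholder slots, not final wording):

```python
# Hypothetical skeleton only -- the full prompt must add the 3 few-shot examples
# and the exact JSON output schema described in this section.
TRANSFORM_PROMPT = """\
You rewrite search queries for a retrieval system over API reference docs.
Given the raw query (and optional conversation history), output 3-5 reformulations
that repair vocabulary mismatch, under-specification, multi-hop phrasing, acronym
ambiguity, or negative phrasing. If the query needs no transformation, return it
unchanged as the only entry. Respond with JSON matching the output schema below.

<3 few-shot examples from the API reference docs domain go here>

Raw query: {raw_query}
Conversation history: {conversation_history}
"""
```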
### JSON Output Schema
```json
{
  "strategy_used": "hyde | multi-query | step-back | decompose",
  "transformed_queries": ["...", "..."],
  "filter_hints": { "date_after": "...", "doc_type": "..." },
  "confidence": 0.0
}
```
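A minimal Pydantic model mirroring this schema, assuming Pydantic v2 (field names match the JSON above; the model name `TransformResult` matches the code scaffold later in this playbook):

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class FilterHints(BaseModel):
    date_after: Optional[str] = None
    doc_type: Optional[str] = None

class TransformResult(BaseModel):
    strategy_used: Literal["hyde", "multi-query", "step-back", "decompose"]
    transformed_queries: list[str] = Field(min_length=1)
    filter_hints: Optional[FilterHints] = None
    confidence: float = Field(ge=0.0, le=1.0)
```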
## Fusion Strategy
After running retrieval with each transformed query, you have multiple candidate sets. Combine them via:
- **Reciprocal Rank Fusion** with k=60 (default, robust)
- OR **weighted score fusion** if you have calibrated scores
- Deduplicate by chunk_id; keep highest-rank occurrence
- Truncate to final top-15
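A minimal sketch of Reciprocal Rank Fusion with the parameters above, assuming each candidate set is an ordered list of chunk IDs (the function name `rrf_fuse` is illustrative). Note that standard RRF sums contributions across lists, which also handles deduplication:

```python
from collections import defaultdict

def rrf_fuse(result_sets: list[list[str]], k: int = 60, top_n: int = 15) -> list[str]:
    """Score each chunk_id by sum of 1/(k + rank) over all candidate sets."""
    scores: dict[str, float] = defaultdict(float)
    for chunk_ids in result_sets:
        for rank, chunk_id in enumerate(chunk_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Higher fused score first; duplicates collapse into a single entry.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```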
## Evaluation
On the golden set, measure:
- hit@1, hit@5, hit@10 lift vs baseline
- Latency overhead (extra LLM call + parallel retrievals)
- Cost overhead per query
- Failure-mode coverage: how many of the 5 listed failures did we fix?
Run pairwise: for each query, A = baseline retrieval, B = with RAG-Fusion. Use Ragas faithfulness judge to score which result set is more relevant. Target: B wins ≥ 65% of the time AND baseline never wins more than 15% (no regressions).
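For the hit@k lift measurement, a small offline-comparison sketch (it assumes each golden-set item exposes `text` and `relevant_ids`, and that `retrieve_baseline` / `retrieve_fused` are the two pipelines; all names are placeholders):

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> bool:
    return any(cid in relevant for cid in retrieved[:k])

def compare(golden_set, retrieve_baseline, retrieve_fused, ks=(1, 5, 10)) -> dict:
    lifts = {}
    for k in ks:
        base = sum(hit_at_k(retrieve_baseline(q.text), q.relevant_ids, k) for q in golden_set)
        fused = sum(hit_at_k(retrieve_fused(q.text), q.relevant_ids, k) for q in golden_set)
        lifts[f"hit@{k}_lift"] = (fused - base) / len(golden_set)
    return lifts
```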
## Caching
RAG-Fusion adds an LLM call. Cache aggressively:
- Cache key: normalized raw query (lowercase, trim, sort tokens)
- TTL: 24h for stable corpora, 1h for changing corpora
- Store in Redis with memory limit + LRU eviction
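One way to implement the cache key and TTL, assuming redis-py and a Redis instance configured with `maxmemory` plus `allkeys-lru` eviction (key prefix and TTL values are illustrative):

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 24 * 3600  # 24h for stable corpora; drop to 3600 for changing corpora

def cache_key(raw_query: str) -> str:
    normalized = " ".join(sorted(raw_query.lower().strip().split()))
    return "qt:" + hashlib.sha256(normalized.encode()).hexdigest()

def get_cached(raw_query: str) -> dict | None:
    hit = r.get(cache_key(raw_query))
    return json.loads(hit) if hit else None

def set_cached(raw_query: str, result: dict) -> None:
    r.setex(cache_key(raw_query), TTL_SECONDS, json.dumps(result))
```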
## Fallback & Safety
- LLM call times out > 800ms → fall back to raw query
- LLM returns malformed JSON → retry once with repair prompt, then fall back
- Transformed query count > 6 → cap at 6 (latency/cost)
- Log every transformation with trace_id for debugging
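A sketch of the timeout/repair/fallback flow, assuming an async `call_llm` client, Pydantic v2, and the `TransformResult` model from the schema section (the 800 ms timeout, single repair retry, and 6-query cap follow the rules above):

```python
import asyncio
import logging

log = logging.getLogger(__name__)
MAX_QUERIES = 6
TIMEOUT_S = 0.8

async def safe_transform(raw_query: str, call_llm, trace_id: str) -> TransformResult:
    fallback = TransformResult(strategy_used="multi-query",
                               transformed_queries=[raw_query], confidence=0.0)
    try:
        text = await asyncio.wait_for(call_llm(raw_query), timeout=TIMEOUT_S)
    except asyncio.TimeoutError:
        log.warning("query_transform timeout", extra={"trace_id": trace_id})
        return fallback
    try:
        result = TransformResult.model_validate_json(text)
    except Exception:
        try:  # one retry with a repair prompt, then give up
            repaired = await asyncio.wait_for(
                call_llm(f"Return ONLY valid JSON for: {raw_query}"), timeout=TIMEOUT_S)
            result = TransformResult.model_validate_json(repaired)
        except Exception:
            log.warning("query_transform malformed JSON", extra={"trace_id": trace_id})
            return fallback
    result.transformed_queries = result.transformed_queries[:MAX_QUERIES]
    log.info("query_transform ok", extra={"trace_id": trace_id,
                                          "strategy": result.strategy_used})
    return result
```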
## Code Scaffold
Produce `query_transform.py` (or .ts) with:
1. Function signature: `async def transform_query(raw: str, history: list[str] | None) -> TransformResult`
2. Input validation
3. LLM call with timeout + retry
4. JSON parsing with Pydantic validation
5. Metrics emission (latency, cache hit/miss, fallback rate)
6. Unit tests covering: cache hit, cache miss, malformed JSON, timeout, empty history
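A pytest sketch for two of the listed cases (malformed JSON and timeout), assuming pytest-asyncio is installed and the `safe_transform` helper sketched under Fallback & Safety:

```python
import asyncio
import pytest
# from query_transform import safe_transform  # as sketched above

@pytest.mark.asyncio
async def test_malformed_json_falls_back_after_one_retry():
    calls = []
    async def bad_llm(prompt: str) -> str:
        calls.append(prompt)
        return "not json"
    result = await safe_transform("how am I billed?", bad_llm, trace_id="t-1")
    assert result.transformed_queries == ["how am I billed?"]
    assert len(calls) == 2  # original call + one repair attempt

@pytest.mark.asyncio
async def test_timeout_falls_back_to_raw_query():
    async def slow_llm(prompt: str) -> str:
        await asyncio.sleep(5)
        return "{}"
    result = await safe_transform("why does it do that?", slow_llm, trace_id="t-2")
    assert result.transformed_queries == ["why does it do that?"]
```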
## Monitoring
Emit traces to Galileo:
- span: query_transform.call
- attributes: strategy_used, num_transformed, cache_hit, latency_ms, model
- linked to parent retrieval span
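If your Galileo setup ingests OpenTelemetry traces, the span emission might look like the following (span and attribute names follow the list above; exporter wiring is assumed to be configured elsewhere):

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag.query_transform")

def record_transform_span(result, cache_hit: bool, latency_ms: float, model: str) -> None:
    # Started inside the retrieval span, this becomes a child of that parent span.
    with tracer.start_as_current_span("query_transform.call") as span:
        span.set_attribute("strategy_used", result.strategy_used)
        span.set_attribute("num_transformed", len(result.transformed_queries))
        span.set_attribute("cache_hit", cache_hit)
        span.set_attribute("latency_ms", latency_ms)
        span.set_attribute("model", model)
```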
## Rollout Plan
1. Ship behind feature flag at 1% traffic
2. Compare online metrics (click-through on top result, user satisfaction thumbs)
3. Ramp to 10%, 50%, 100% over 2 weeks if metrics hold
4. Kill switch: env var QUERY_TRANSFORM_ENABLED=false
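A minimal sketch of the kill switch plus deterministic percentage rollout (the env var name comes from the plan above; hashing a stable user or session ID keeps the same users in the treatment group as you ramp):

```python
import hashlib
import os

def query_transform_enabled(user_id: str, rollout_pct: int = 1) -> bool:
    if os.getenv("QUERY_TRANSFORM_ENABLED", "true").lower() == "false":
        return False  # kill switch
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct
```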
Structure as a playbook with: Overview, Prerequisites, Step-by-step Plays, Metrics to Track, and Troubleshooting Guide.

Replace the bracketed placeholders with your own context before running the prompt:
- `["...", "..."]` — replace with your own example values.
- `[str]` — replace with your own string values.