ChatGPT Prompt for RAG Pipelines
Production RAG recipe: token-based sliding window chunking, arctic-embed-l embeddings, pgvector storage, Cohere Rerank 3.5 reranking. Includes retrieval evals.
You are a senior AI engineer designing a production RAG pipeline. Produce a complete, implementation-ready spec that a mid-level engineer can ship in one sprint.
## System Context
- **Document corpus:** earnings call transcripts (~100k documents)
- **Query volume:** 2 queries/second peak, 1M daily
- **Latency SLO:** end-to-end p95 under 1.2s
- **Language coverage:** 40+ languages
- **Freshness requirement:** minutes
## Pipeline Architecture
### 1. Ingestion & Parsing
- Parser choice for earnings call transcripts (Unstructured, LlamaParse, pdfplumber, custom): justify tradeoff
- How to preserve tables, code blocks, and hierarchical structure
- OCR fallback strategy for scanned pages (Tesseract vs GPT-4o vision vs Textract)
- Metadata to extract per document: source_url, created_at, author, section_path, doc_type, access_level
- Deduplication: content hash + near-duplicate detection via MinHash/SimHash
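The dedup step above can be sketched as exact matching on a normalized content hash plus MinHash signatures for near-duplicates. This is a minimal stdlib sketch (word-level shingles, 64 salted hashes); a production pipeline would more likely use a tuned library such as datasketch:

```python
import hashlib

def content_hash(text: str) -> str:
    """Exact-duplicate key: SHA-256 of whitespace/case-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def minhash_signature(text: str, num_hashes: int = 64, shingle_size: int = 5) -> list:
    """Near-duplicate sketch: min of salted 64-bit hashes over word shingles."""
    words = text.lower().split()
    shingles = {
        " ".join(words[i:i + shingle_size])
        for i in range(max(1, len(words) - shingle_size + 1))
    }
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Flag pairs above an estimated Jaccard of roughly 0.8–0.9 for near-duplicate review; the exact threshold should be tuned on the corpus.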
### 2. Chunking: token-based sliding window
- Target chunk size: 256 tokens; overlap: 0 tokens (note: a sliding window typically keeps 10–20% overlap to avoid losing boundary context, so justify or tune this value)
- Boundary rules (where you're allowed and NOT allowed to split)
- How to handle chunks that are too small (merge) and too large (recursive split)
- Special handling for tables, code blocks, lists (do NOT split these)
- Contextual header: prepend document title + section breadcrumb to each chunk
- Optional upgrade path: if you switch the chunking strategy to contextual retrieval, include the Claude Haiku prompt that generates the 50–100 token contextual prefix per chunk
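The windowing and runt-merge rules above can be sketched as follows. For brevity this operates on a pre-tokenized list of strings; a real implementation would tokenize with the embedding model's own tokenizer, and the `min_chunk` merge threshold is an assumed parameter:

```python
def sliding_window_chunks(tokens, chunk_size=256, overlap=0, min_chunk=32):
    """Token-based sliding window: each window starts chunk_size - overlap
    tokens after the previous one; a too-small trailing chunk is merged back."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk:
        runt = chunks.pop()
        # Merge the runt tail into its predecessor (with overlap > 0 this can
        # duplicate a few tokens, which is usually acceptable for retrieval).
        chunks[-1] = chunks[-1] + runt
    return chunks

def with_contextual_header(chunk_tokens, title, breadcrumb):
    """Prepend document title + section breadcrumb, as the spec requires."""
    return f"{title} > {breadcrumb}\n\n" + " ".join(chunk_tokens)
```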
### 3. Embedding: arctic-embed-l
- Batch size, rate-limit handling, retry policy (exponential backoff + jitter)
- Cost estimate at corpus size (model price × tokens)
- Embedding dimensionality and storage implications
- When to re-embed (model upgrade, chunking change) and migration path
- Normalization: L2-normalize if using dot product; skip if cosine
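The retry policy above (exponential backoff with full jitter) can be sketched generically; the injectable `sleep` parameter is an assumption added here to keep the function testable without real waits:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 retriable=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Run `call` with exponential backoff plus full jitter on retriable errors.

    Delay before attempt n is uniform in [0, min(max_delay, base_delay * 2**n)],
    which spreads retries out and avoids thundering-herd spikes on the API.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retriable:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error to the caller
            sleep(random.uniform(0, min(max_delay, base_delay * (2 ** attempt))))
```

Wrap each embedding batch call in `with_retries`, and treat HTTP 429/5xx from the provider as retriable alongside the network errors shown here.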
### 4. Storage: pgvector
- Index configuration (HNSW M, efConstruction, efSearch OR IVF nlist, nprobe)
- Payload schema (which metadata is filterable, which is projected)
- Sharding / namespace strategy for multi-tenancy
- Backup and point-in-time recovery
- Estimated storage cost per 1M chunks
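A starting-point schema and index for the storage items above might look like the following. The 1024-dimension value matches arctic-embed-l's output, `vector_ip_ops` matches the dot-product retrieval chosen below, and the HNSW parameters (`m = 16`, `ef_construction = 200`) are illustrative defaults to be tuned against recall tests, not recommendations:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Chunks table: embedding dimension matches arctic-embed-l (1024).
CREATE TABLE chunks (
    chunk_id     BIGSERIAL PRIMARY KEY,
    tenant_id    TEXT NOT NULL,
    doc_id       TEXT NOT NULL,
    access_level TEXT NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL,
    section_path TEXT,
    content      TEXT NOT NULL,
    embedding    vector(1024) NOT NULL
);

-- HNSW index for inner-product (dot product) search.
CREATE INDEX chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_ip_ops)
    WITH (m = 16, ef_construction = 200);

-- B-tree index for the filterable metadata used at query time.
CREATE INDEX chunks_filter_idx ON chunks (tenant_id, access_level, created_at);

-- Per session/query: search breadth should be >= retrieval top-k.
-- SET hnsw.ef_search = 100;
```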
### 5. Retrieval: dense (dot product)
- Top-k at retrieval: 100
- Metadata filters (tenant_id, access_level, date range)
- If hybrid (BM25 + dense) retrieval is enabled: sparse/dense weights and Reciprocal Rank Fusion with k = 60
- Query preprocessing: lowercase, stopword handling, acronym expansion
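Reciprocal Rank Fusion, as referenced above, combines ranked lists from the sparse and dense retrievers. A minimal sketch over lists of chunk IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank).

    `rankings` is a list of ranked chunk-ID lists (e.g. [bm25_ids, dense_ids]);
    k = 60 is the conventional damping constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

RRF is attractive here because it needs no score normalization across the BM25 and dense scales; only ranks matter.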
### 6. Reranking: Cohere Rerank 3.5
- Rerank top-100 down to top-5
- Latency budget for rerank step
- When to skip rerank (very high-confidence top result, latency pressure)
- Fallback if reranker API fails
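The skip and fallback rules above can be wired into one wrapper. The candidate shape (`(chunk_id, dense_score)` pairs) and the 0.95 skip threshold are assumptions for illustration; the real threshold should come from calibration on the golden set:

```python
def rerank_with_fallback(query, candidates, rerank_fn, top_n=5, skip_threshold=0.95):
    """Rerank candidates, skipping or falling back to dense order when needed.

    candidates: (chunk_id, dense_score) pairs sorted by dense_score descending.
    rerank_fn:  callable(query, ids) -> reordered ids; may raise on API failure.
    """
    ids = [chunk_id for chunk_id, _ in candidates]
    if candidates and candidates[0][1] >= skip_threshold:
        return ids[:top_n]  # very high-confidence top hit: save the rerank call
    try:
        return rerank_fn(query, ids)[:top_n]
    except Exception:
        return ids[:top_n]  # reranker down: degrade to dense retrieval order
```

In production the `except` should also emit a metric/log so silent reranker outages are visible.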
### 7. Answer Synthesis
- Model: Claude Sonnet 4.5 or GPT-4.1 (justify)
- System prompt template with citation requirements
- Output format: answer + citations array [{chunk_id, quote, doc_url}]
- Refusal rules: when retrieval returns low-relevance chunks, refuse rather than hallucinate
## Prompt Template
```
You are a helpful assistant answering questions about earnings call transcripts.
RULES:
1. Use ONLY the context below. If the answer is not present, say "I don't have enough information."
2. Cite every factual claim with [doc_id] markers.
3. Quote short spans verbatim for critical facts.
4. If context conflicts, note the conflict.
CONTEXT:
{retrieved_chunks_with_ids}
QUESTION: {user_question}
ANSWER:
```
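Filling the template above is mechanical; the chunk dict shape (`chunk_id`, `text` keys) is an assumption to be aligned with the payload schema:

```python
PROMPT_TEMPLATE = """You are a helpful assistant answering questions about earnings call transcripts.
RULES:
1. Use ONLY the context below. If the answer is not present, say "I don't have enough information."
2. Cite every factual claim with [doc_id] markers.
3. Quote short spans verbatim for critical facts.
4. If context conflicts, note the conflict.
CONTEXT:
{retrieved_chunks_with_ids}
QUESTION: {user_question}
ANSWER:"""

def build_prompt(chunks, question):
    """Render retrieved chunks as [chunk_id]-tagged blocks and fill the template."""
    context = "\n\n".join(f"[{c['chunk_id']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(
        retrieved_chunks_with_ids=context,
        user_question=question,
    )
```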
## Evaluation Plan
Build a golden set of 500 question+expected-sources pairs from real user queries. Score:
- **Retrieval hit@k** (k = 100, 5, 1): did a correct chunk appear in the top k?
- **Context precision** (Ragas): are retrieved chunks relevant?
- **Context recall**: did we retrieve everything needed?
- **Faithfulness**: does the answer stay grounded in context?
- **Answer relevance**: does the answer address the question?
- **Citation accuracy**: do cited chunks actually support the claim?
Target thresholds: hit@5 ≥ 0.90, faithfulness ≥ 0.95, citation accuracy ≥ 0.98.
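The hit@k metric in the list above is simple enough to state exactly; Ragas supplies the LLM-judged metrics, but hit@k can be computed directly against the golden set:

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if any gold chunk appears in the top k retrieved chunks, else 0.0."""
    return 1.0 if any(cid in relevant_ids for cid in retrieved_ids[:k]) else 0.0

def mean_hit_at_k(examples, k):
    """examples: (retrieved_ids, relevant_id_set) pairs from the golden set."""
    return sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in examples) / len(examples)
```

Wire `mean_hit_at_k` into the CI gate so a regression below threshold fails the build.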
## Edge Cases to Handle
- Very long documents (> 1M tokens): hierarchical summarization or map-reduce
- Tables with numeric data: extract to markdown tables, preserve headers
- Code chunks: chunk by function/class, never mid-function
- Non-English docs: language detection, multilingual embedding, answer in query language
- PII redaction before storage (emails, SSNs, phone numbers)
- Access-controlled docs: enforce ACL filters at query time, NEVER post-filter after retrieval
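The PII-redaction item above can be sketched with regex substitution. These patterns are illustrative only and will miss real-world variants; a production system should use a vetted PII-detection library or service rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only -- not exhaustive for production PII redaction.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact_pii(text):
    """Replace each PII match with a typed placeholder before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Run redaction during ingestion, before chunks are embedded or stored, so PII never reaches the vector index.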
## Cost & Latency Estimate
| Component | Latency (p95) | Cost per query |
|-----------|---------------|----------------|
| Query embed | — | — |
| Vector search | — | — |
| Rerank | — | — |
| LLM synthesis | — | — |
| **Total** | — | — |
Fill these in with specific numbers for arctic-embed-l + pgvector + Cohere Rerank 3.5 at 2 QPS.
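A sanity-check of the latency column can be done as simple arithmetic against the 1.2 s SLO. Every number below is a hypothetical placeholder, not a measurement; replace each with figures profiled on your own stack:

```python
# Hypothetical p95 latency budget (milliseconds) -- measure, don't trust these.
budget_ms = {
    "query_embed": 40,     # arctic-embed-l forward pass for the query
    "vector_search": 60,   # pgvector HNSW at ef_search = 100
    "rerank": 250,         # Cohere Rerank 3.5 round trip over 100 candidates
    "llm_synthesis": 800,  # answer generation, first token to last
}

total_ms = sum(budget_ms.values())
assert total_ms <= 1200, f"over the 1.2 s p95 SLO: {total_ms} ms"
```

Keeping the synthesis slice dominant (and streaming it to the user) is what usually makes the 1.2 s p95 achievable at 2 QPS.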
## Deliverables
1. Code scaffold (Python or TypeScript): ingest.py, chunk.py, embed.py, retrieve.py, synthesize.py, evaluate.py
2. Chunking parameter table (chunk_size, overlap, separators) tuned for earnings call transcripts
3. Prompt template (above) with variables called out
4. Eval harness wired to the golden set with CI gate on the three target thresholds
5. Runbook: what to do when retrieval quality drops, what to do when latency spikes
6. Cost model spreadsheet with ingestion cost, per-query cost, storage cost
## Output Requirements
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations
Present your output in a clear, organized structure with headers (##), subheaders (###), and bullet points. Use bold for key terms.

Replace the template variables with your own context before running the prompt: `{retrieved_chunks_with_ids}` is the formatted retrieved chunks with their IDs, and `{user_question}` is the end user's question.