AI Prompt for RAG Pipelines
Prompt and verifier for extracting verifiable citations from a RAG answer over multilingual help center articles, scored by a Ragas faithfulness judge.
You are an AI engineer responsible for answer trustworthiness. Users will only trust a RAG assistant if every factual claim is traceable to the source. Design a citation extraction + verification layer on top of an existing RAG pipeline over multilingual help center articles.
## Goal
Every sentence in the final answer that asserts a fact must carry a citation marker like `[c-3]` referring to a retrieved chunk, AND the cited chunk must actually contain the quoted span (± small paraphrase).
## System Prompt: Citation-Required Synthesis
```
You are a careful research assistant answering questions about multilingual help center articles.
You will be given CONTEXT chunks. Each chunk has an id like c-1, c-2.
RULES:
1. Every factual claim must be followed by a citation in square brackets, e.g. "The API rate limit is 100 rpm [c-2]."
2. If a claim is supported by multiple chunks, cite all: [c-1][c-4].
3. If the context does NOT support a claim, do NOT make the claim. Say "The provided context does not cover X."
4. Do NOT use outside knowledge. The only facts available are in the CONTEXT.
5. For numeric or named-entity claims, include a short verbatim quote from the chunk in the "quote" field of the citations array.
6. Output strict JSON matching the schema.
OUTPUT SCHEMA (JSON):
{
  "answer": "string with inline [c-N] citations",
  "citations": [
    {
      "id": "c-N",
      "quote": "exact text from chunk c-N that supports this claim",
      "chunk_id": "c-N"
    }
  ],
  "uncovered_aspects": ["list of parts of the question not answered by context"]
}
```
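To wire this prompt into the pipeline, the CONTEXT block has to carry the same `c-N` ids the rules refer to. A minimal sketch of the assembly step (the `build_messages` helper and the `(chunk_id, text)` chunk shape are assumptions, not part of the prompt itself):

```python
# Sketch only: format retrieved chunks so the model sees the c-N ids it must cite.
def build_messages(system_prompt: str, question: str,
                   chunks: list[tuple[str, str]]) -> list[dict]:
    """Assemble chat messages with CONTEXT chunks labelled by their c-N ids."""
    context = "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    user_content = (
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION:\n{question}\n\n"
        "Answer in strict JSON per the schema."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
```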
## Verifier Pass
LLMs sometimes fabricate citations or cite the wrong chunk. Run a deterministic verifier after synthesis:
### Verifier Algorithm
For each `citation` in the answer:
1. Extract the quote and the claimed chunk_id
2. Fetch the chunk text from the retrieval layer
3. Check: is `quote` a substring of chunk (case-insensitive, whitespace-normalized)?
4. If not exact, compute fuzzy match score (e.g., token-set ratio via rapidfuzz) and require ≥ 0.85
5. If still failing, mark the citation as UNVERIFIED
For each `[c-N]` marker in the answer text:
1. Must correspond to a chunk in the context
2. Must appear in the citations array
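A minimal Python sketch of steps 1-5 plus the marker check (the deliverable below asks for a Go implementation; this version leans on rapidfuzz as suggested above, and the `chunks` dict keyed by chunk id is an assumption about the retrieval layer):

```python
import re
from rapidfuzz import fuzz  # token-set ratio for the fuzzy fallback in step 4

CITATION_MARKER = re.compile(r"\[(c-\d+)\]")

def _normalize(text: str) -> str:
    """Case-insensitive, whitespace-normalized form used for the substring check."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_citations(answer: str, citations: list[dict],
                     chunks: dict[str, str], threshold: float = 0.85) -> list[dict]:
    """Annotate every citation and inline marker with VERIFIED / FUZZY / UNVERIFIED."""
    results = []
    for citation in citations:
        chunk = _normalize(chunks.get(citation["chunk_id"], ""))
        quote = _normalize(citation["quote"])
        if quote and quote in chunk:
            status = "VERIFIED"
        elif quote and fuzz.token_set_ratio(quote, chunk) / 100.0 >= threshold:
            status = "FUZZY"  # accepted, but downstream should surface a "paraphrased" flag
        else:
            status = "UNVERIFIED"
        results.append({**citation, "status": status})

    cited_ids = {c["id"] for c in citations}
    for marker_id in CITATION_MARKER.findall(answer):
        if marker_id not in chunks or marker_id not in cited_ids:
            results.append({"id": marker_id, "status": "UNVERIFIED",
                            "reason": "marker not backed by a context chunk or the citations array"})
    return results
```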
### Error Handling
- Any UNVERIFIED citation → retry synthesis ONCE with an error feedback prompt ("Your citation for claim X was not found in c-N. Either fix the quote or remove the claim.")
- After retry, if still UNVERIFIED → strip the unsupported sentence and append "[Note: part of this answer could not be verified against the sources and was removed.]"
- Log every unverified citation to OpenTelemetry + Jaeger for offline analysis
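A sketch of that retry-then-strip flow, reusing the `verify_citations` sketch above and attaching the unverified count to an OpenTelemetry span (Jaeger export is assumed to be configured elsewhere; `run_synthesis` and `strip_unverified_sentences` are hypothetical hooks, not an established API):

```python
from opentelemetry import trace  # spans exported to Jaeger via whatever exporter is configured

tracer = trace.get_tracer("rag.citation_verifier")

def synthesize_verified(run_synthesis, question, chunks) -> dict:
    """Synthesis -> verify -> one retry with feedback -> strip what is still UNVERIFIED."""
    with tracer.start_as_current_span("citation_verification") as span:
        result = run_synthesis(question, chunks)
        failed = [c for c in verify_citations(result["answer"], result["citations"], chunks)
                  if c["status"] == "UNVERIFIED"]
        if failed:  # retry ONCE with explicit error feedback
            feedback = " ".join(
                f"Your citation for {c['id']} was not found in the cited chunk. "
                "Either fix the quote or remove the claim." for c in failed)
            result = run_synthesis(question, chunks, feedback=feedback)
            failed = [c for c in verify_citations(result["answer"], result["citations"], chunks)
                      if c["status"] == "UNVERIFIED"]
        span.set_attribute("citations.unverified_count", len(failed))
        if failed:  # still failing: drop the unsupported sentences and disclose it
            result["answer"] = (strip_unverified_sentences(result["answer"], failed)
                                + " [Note: part of this answer could not be verified against "
                                  "the sources and was removed.]")
        return result
```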
## LLM-as-Judge for Citation Accuracy
For offline eval, use the Ragas faithfulness judge with this rubric:
```
For each citation in the answer, rate 1-5:
- 5: quote exactly supports the claim
- 4: quote supports the claim with minor paraphrase
- 3: quote is on-topic but doesn't fully support the claim
- 2: quote is from the right chunk but wrong passage
- 1: quote is fabricated or from wrong chunk
Also rate the overall answer:
- coverage: fraction of factual claims that have ANY citation (0.0-1.0)
- precision: fraction of citations rated 4+ (0.0-1.0)
```
Target: coverage ≥ 0.95, precision ≥ 0.95 on a golden set of 250 examples.
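The two aggregates reduce to simple ratios over the judge's output; a sketch assuming the judge returns one has-citation flag per factual claim and one 1-5 rating per citation:

```python
def citation_metrics(claim_has_citation: list[bool], citation_ratings: list[int]) -> dict:
    """coverage = share of factual claims with any [c-N] marker;
    precision = share of citations the judge rated 4 or 5."""
    coverage = (sum(claim_has_citation) / len(claim_has_citation)) if claim_has_citation else 0.0
    precision = (sum(r >= 4 for r in citation_ratings) / len(citation_ratings)) if citation_ratings else 0.0
    return {"coverage": round(coverage, 3), "precision": round(precision, 3)}
```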
## Evaluation Harness
Build `eval_citations.py`:
1. Load golden set (question, expected_doc_ids, reference_answer)
2. Run the full RAG pipeline, collect answer + citations
3. Run verifier → compute % verified
4. Run Ragas faithfulness judge rubric → compute coverage + precision
5. Compute aggregate: % answers fully grounded, mean precision
6. Regression gate: if any metric drops > 2 pts vs last main build, fail CI
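A skeleton for `eval_citations.py` following those six steps; `rag_pipeline`, `judge_citations`, and the baseline-metrics file are placeholders for hooks that already exist in the stack, and the golden set is assumed to be JSONL:

```python
# eval_citations.py -- skeleton only.
# Golden-set row (JSONL): {"question": ..., "expected_doc_ids": [...], "reference_answer": ...}
import json
import sys

def main(golden_path: str, baseline_path: str) -> None:
    rows = [json.loads(line) for line in open(golden_path, encoding="utf-8")]
    fully_grounded, coverages, precisions = [], [], []
    for row in rows:
        result = rag_pipeline(row["question"])  # answer, citations, retrieved chunks
        checks = verify_citations(result["answer"], result["citations"], result["chunks"])
        fully_grounded.append(all(c["status"] != "UNVERIFIED" for c in checks))
        scores = judge_citations(result, row["reference_answer"])  # Ragas faithfulness + rubric
        coverages.append(scores["coverage"])
        precisions.append(scores["precision"])

    metrics = {
        "pct_fully_grounded": sum(fully_grounded) / len(rows),
        "mean_coverage": sum(coverages) / len(rows),
        "mean_precision": sum(precisions) / len(rows),
    }
    baseline = json.load(open(baseline_path, encoding="utf-8"))  # metrics from last main build
    regressed = any(baseline[key] - value > 0.02 for key, value in metrics.items())
    print(json.dumps(metrics, indent=2))
    sys.exit(1 if regressed else 0)  # regression gate: fail CI on a > 2 pt drop

if __name__ == "__main__":
    main(*sys.argv[1:3])
```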
## Edge Cases
- **Quote spans across chunks:** allow citation of multiple chunks for one claim
- **Tables:** when quoting a row, include the header row in the quote for context
- **Code blocks:** cite entire code fence, don't truncate
- **Paraphrased facts:** if exact quote not found, allow fuzzy match ≥ 0.85, but emit a "paraphrased" flag
- **Numbers in different units:** the LLM may convert "1GB" → "1000MB"; the verifier should accept canonical-form equivalence via a normalizer (see the sketch after this list)
- **Non-English sources, English answer:** require both the translated answer and the original-language quote
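A minimal sketch of that canonical-form normalizer, assuming decimal data-size units only; other unit families would need their own table:

```python
import re

_TO_MB = {"kb": 0.001, "mb": 1.0, "gb": 1000.0, "tb": 1_000_000.0}  # decimal units assumed

def canonicalize_units(text: str) -> str:
    """Rewrite data sizes like '1GB' or '1000 MB' into a canonical '<n>MB' token
    so the verifier's substring/fuzzy check treats them as equivalent."""
    def repl(match: re.Match) -> str:
        value, unit = float(match.group(1)), match.group(2).lower()
        return f"{value * _TO_MB[unit]:g}MB"
    return re.sub(r"(\d+(?:\.\d+)?)\s*(kb|mb|gb|tb)\b", repl, text, flags=re.IGNORECASE)

# canonicalize_units("1GB") == canonicalize_units("1000 MB")  # both become "1000MB"
```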
## Citation UI Contract
The final structured output will be rendered by a frontend. The contract:
- Inline `[c-N]` markers MUST map to `citations[].id`
- `citations[].chunk_id` MUST resolve to a retrievable source via `/api/chunks/{id}`
- `citations[].quote` MUST be < 280 chars (for tooltip display); if longer, hard-truncate with ellipsis
- `uncovered_aspects` will be rendered as a yellow warning card
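One way to pin this contract server-side is to validate the synthesis output against a schema before it reaches the frontend; a sketch assuming Pydantic v2 (the `[c-N]`-to-`citations[].id` cross-check still needs a custom validator or the verifier pass above):

```python
from pydantic import BaseModel, Field

class Citation(BaseModel):
    id: str = Field(pattern=r"^c-\d+$")        # must match an inline [c-N] marker in the answer
    quote: str = Field(max_length=280)         # tooltip limit; hard-truncate with ellipsis upstream
    chunk_id: str = Field(pattern=r"^c-\d+$")  # must resolve via /api/chunks/{id}

class CitedAnswer(BaseModel):
    answer: str                                       # contains the inline [c-N] markers
    citations: list[Citation]
    uncovered_aspects: list[str] = Field(default_factory=list)  # rendered as a yellow warning card
```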
## Red-Team Scenarios to Test
- Question asks for a specific number; context has similar but wrong number
- Question has two parts; context only covers one
- Retrieval returns irrelevant chunks (adversarial): answer should refuse
- Multilingual chunks: verifier must handle Unicode normalization
- PDF chunks with hyphenation artifacts ("infor-\nmation"): normalize before matching
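For the multilingual and PDF-artifact scenarios, the verifier's pre-matching normalization could look like this sketch (whether de-hyphenation happens at ingestion or at verification time is a design choice):

```python
import re
import unicodedata

def normalize_for_matching(text: str) -> str:
    """Applied to both quote and chunk before the substring / fuzzy check:
    Unicode NFKC for multilingual text, PDF hyphenation repair, whitespace collapse."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"-\s*\n\s*", "", text)   # "infor-\nmation" -> "information"
    text = re.sub(r"\s+", " ", text)
    return text.casefold().strip()
```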
Produce: system prompt, verifier function in Go with tests, eval harness, golden-set format, and a monitoring dashboard spec for OpenTelemetry + Jaeger tracking citation metrics over time.
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations