Claude Prompt for Fine-tuning & Model Adaptation
End-to-end SFT dataset construction: collection, labeling, cleaning, dedup, contamination check for function-calling with strict JSON.
You are a dataset engineer. Build a high-quality SFT dataset for fine-tuning an LLM on function-calling with strict JSON. The model's ability is bottlenecked by dataset quality, not model scale — so obsess over the data.
## North Star
50k high-quality (messages, completion) pairs, deduplicated, contamination-clean, with a transparent data card.
## Collection Strategy
### Primary: web-scraped with filtering
- Extraction pipeline: describe how examples are obtained (query, export, scrape, or API)
- Rate limits and backoff
- Legal / license review: confirm usage rights for training
- Expected yield: 40k raw examples
### Secondary: GitHub PR discussions
- Use to cover long-tail and edge cases underrepresented in primary
- Expected yield: 10k
### Synthetic Augmentation
Generate additional examples via Claude Opus 4.5 to cover:
- Edge cases from error analysis of baseline model
- Underrepresented classes / intents
- Counterfactuals (what if the user asks the opposite)
Use a seed set of 300 real examples. For each, prompt Claude Opus 4.5:
```
You are generating training data for a model that performs function calling with strict JSON output.
Given this SEED example:
{seed_example}
Produce 5 NEW examples that are:
- Clearly distinct from the seed (different topic, phrasing, or edge case)
- Equally or more challenging
- Realistic (a real user might ask this)
- Fully self-contained
Output strict JSON: { "examples": [ { "input": "...", "output": "..." } ] }
```
**Rejection sampling:** generate 3x what you need, then filter by automated quality + a small human sample review.
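A minimal generation-plus-rejection-sampling sketch, assuming the Anthropic Python SDK; the model id string and the `quality_score` grader are placeholders to swap for your own:

```python
# Sketch: seed-based generation with rejection sampling.
# Assumptions: Anthropic Python SDK, model id string, and quality_score() hook.
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GEN_PROMPT = """You are generating training data for a model that performs \
function calling with strict JSON output.
Given this SEED example:
{seed}
Produce 5 NEW examples that are clearly distinct, equally or more challenging,
realistic, and fully self-contained.
Output strict JSON: {{ "examples": [ {{ "input": "...", "output": "..." }} ] }}"""

def generate_candidates(seed: dict, n_rounds: int = 3) -> list[dict]:
    """Over-generate roughly 3x, then let the caller filter (rejection sampling)."""
    candidates = []
    for _ in range(n_rounds):
        resp = client.messages.create(
            model="claude-opus-4-5",  # assumption: adjust to your model id
            max_tokens=4096,
            messages=[{"role": "user",
                       "content": GEN_PROMPT.format(seed=json.dumps(seed))}],
        )
        try:
            candidates.extend(json.loads(resp.content[0].text)["examples"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed generations are simply dropped
    return candidates

def accept(example: dict, quality_score) -> bool:
    """quality_score is a hypothetical automated grader returning 0..1."""
    return bool(example.get("input")) and bool(example.get("output")) \
        and quality_score(example) >= 0.8
```

Generate with `generate_candidates`, keep only what passes `accept`, then route a small random sample of the survivors to human review.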
## Labeling Protocol
### Annotator Guidelines
Write a 2-3 page guideline doc covering:
- Task definition with positive and negative examples
- Output format specification (verbatim schema)
- Disambiguation rules for edge cases (list at least 10)
- When to flag as "discard" vs "label"
- Inter-annotator agreement target: Cohen's kappa ≥ 0.75
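A quick way to track the agreement target on the double-annotated sample is scikit-learn's `cohen_kappa_score`; the label lists are whatever categorical labels your guideline defines:

```python
# Minimal agreement check against the kappa >= 0.75 target, assuming two
# annotators labeled the same 20% review sample with categorical labels.
from sklearn.metrics import cohen_kappa_score

def check_agreement(labels_a: list[str], labels_b: list[str],
                    target: float = 0.75) -> float:
    kappa = cohen_kappa_score(labels_a, labels_b)
    if kappa < target:
        print(f"WARNING: kappa={kappa:.2f} below target {target}; "
              "schedule a calibration session before labeling continues.")
    return kappa
```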
### Pipeline
1. Each example labeled by 1 annotator
2. 20% sampled for second-annotator review
3. Disagreements resolved by lead annotator
4. Weekly calibration sessions with trickiest cases
### Tooling
Use Argilla with:
- Keyboard shortcuts for frequent labels
- Auto-save
- Context panel showing similar past examples
- Version-controlled guidelines doc linked from the UI
## Cleaning & Quality Filters
### Deduplication
- **Exact dedup:** SHA-256 hash of normalized input
- **Near dedup:** MinHash-LSH (128 hashes, threshold 0.85 Jaccard on 5-gram shingles)
- Expected dup rate: 8-12% — log and investigate if higher
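A dedup sketch under these parameters, assuming the `datasketch` library for MinHash-LSH; the normalization rule and the choice of the user turn as the dedup key are assumptions to adjust:

```python
# Exact + near dedup sketch. Normalization (lowercase, collapse whitespace)
# and using the user turn as the dedup key are assumptions.
import hashlib
import re
from datasketch import MinHash, MinHashLSH

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_key(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 5) -> set[str]:
    toks = normalize(text).split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def dedup(examples: list[dict]) -> list[dict]:
    seen_hashes: set[str] = set()
    lsh = MinHashLSH(threshold=0.85, num_perm=128)
    kept = []
    for i, ex in enumerate(examples):
        text = ex["messages"][-2]["content"]  # assumption: last user turn as key
        h = exact_key(text)
        if h in seen_hashes:
            continue  # exact duplicate
        m = MinHash(num_perm=128)
        for s in shingles(text):
            m.update(s.encode("utf-8"))
        if lsh.query(m):  # any near-duplicate already kept?
            continue
        seen_hashes.add(h)
        lsh.insert(f"ex_{i}", m)
        kept.append(ex)
    return kept
```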
### Quality Filters
Drop examples where any of the following is true (see the filter sketch after this list):
- assistant response < 10 tokens (likely truncated)
- assistant response > 4096 tokens (likely runaway)
- user message empty or only punctuation
- language mismatch (target: English + Spanish)
- toxicity score > 0.3
- PII detected (email, phone, SSN) without redaction
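A filter sketch under these thresholds; `detect_toxicity`, `detect_pii`, and `detect_lang` are hypothetical hooks for whatever classifiers you use, and whitespace token counts are only a proxy for your real tokenizer:

```python
# Quality-filter sketch. detect_toxicity(), detect_pii(), and detect_lang()
# are hypothetical hooks: plug in your own classifiers.
def passes_filters(ex: dict, detect_toxicity, detect_pii, detect_lang) -> bool:
    user = next(m["content"] for m in ex["messages"] if m["role"] == "user")
    assistant = next(m["content"] for m in ex["messages"] if m["role"] == "assistant")
    n_tokens = len(assistant.split())  # rough proxy; swap in your tokenizer
    if n_tokens < 10 or n_tokens > 4096:
        return False                   # likely truncated or runaway
    if not user.strip() or all(not c.isalnum() for c in user):
        return False                   # empty or punctuation-only user message
    if detect_lang(user) not in {"en", "es"}:
        return False                   # language mismatch
    if detect_toxicity(assistant) > 0.3:
        return False
    if detect_pii(user) or detect_pii(assistant):
        return False                   # unredacted PII
    return True
```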
### Format Validation
For structured-output tasks:
- Parse each assistant response against the expected schema
- Drop or fix invalid
- Track invalid rate — high rate signals annotation training gap
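A validation sketch using the `jsonschema` library; `TOOL_CALL_SCHEMA` is a stand-in for the verbatim schema from the annotator guidelines:

```python
# Schema-validation sketch using the jsonschema library.
# TOOL_CALL_SCHEMA is a placeholder: use the verbatim schema from the guidelines.
import json
from jsonschema import validate, ValidationError

TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["name", "arguments"],
    "properties": {"name": {"type": "string"}, "arguments": {"type": "object"}},
}

def validate_response(assistant_text: str) -> tuple[bool, str | None]:
    """Returns (is_valid, error_message). Track the invalid rate per batch."""
    try:
        payload = json.loads(assistant_text)
        validate(instance=payload, schema=TOOL_CALL_SCHEMA)
        return True, None
    except (json.JSONDecodeError, ValidationError) as err:
        return False, str(err)
```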
## Contamination Check
Critical: the fine-tune eval results are worthless if train leaks into eval.
Run:
- Exact match between train.input and eval.input
- 13-gram overlap between train.assistant and eval.reference
- Semantic dedup via embedding similarity > 0.92 → review manually
Run against ALL eval sets you'll report on: MMLU, HumanEval, GSM8K, MT-Bench, customer-validated golden set.
Document the contamination report in the data card.
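A sketch of the exact-match and 13-gram checks; field names follow the train.input / train.assistant / eval.reference convention above, and the embedding-based pass (cosine > 0.92) is left as a hook for your preferred embedding model:

```python
# Contamination-check sketch: exact input match plus 13-gram overlap between
# train completions and eval references. Run once per eval set you report on.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_report(train: list[dict], eval_set: list[dict]) -> dict:
    eval_inputs = {ex["input"].strip().lower() for ex in eval_set}
    eval_ngrams: set[tuple[str, ...]] = set()
    for ex in eval_set:
        eval_ngrams |= ngrams(ex["reference"])
    exact_hits, ngram_hits = [], []
    for ex in train:
        if ex["input"].strip().lower() in eval_inputs:
            exact_hits.append(ex["id"])
        if ngrams(ex["assistant"]) & eval_ngrams:
            ngram_hits.append(ex["id"])
    return {"exact_matches": exact_hits, "ngram_overlaps": ngram_hits}
```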
## Data Card
Produce `data_card.md` with:
- Sources with counts and license
- Collection window (date range)
- Language breakdown
- Length distribution (histogram)
- Task distribution (if multi-task)
- Known biases and limitations
- Ethical considerations
- Version + changelog
## Splits
- Train: 90% (45k)
- Dev: 5% (2.5k)
- Holdout: 5% (2.5k), frozen, never used for tuning
Stratify by task subtype so all splits have the same distribution.
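A stratified 90/5/5 split sketch with scikit-learn; the `task` field is an assumed subtype label, so rename it to match your data:

```python
# Stratified 90/5/5 split sketch. The "task" field used for stratification
# is an assumption: use whatever subtype label your data card defines.
from sklearn.model_selection import train_test_split

def split_dataset(examples: list[dict], seed: int = 42):
    labels = [ex["task"] for ex in examples]
    train, rest, _, rest_labels = train_test_split(
        examples, labels, test_size=0.10, stratify=labels, random_state=seed)
    dev, holdout = train_test_split(
        rest, test_size=0.50, stratify=rest_labels, random_state=seed)
    return train, dev, holdout  # freeze holdout: never touch it during tuning
```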
## Output Format (JSONL)
```jsonl
{"id": "ex_0001", "messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}], "source": "primary", "quality": 0.92, "lang": "en"}
```
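A minimal writer sketch that enforces the record shape above before anything lands in the train/dev/holdout files:

```python
# JSONL writer sketch enforcing the record shape shown above.
import json

REQUIRED_KEYS = {"id", "messages", "source", "quality", "lang"}

def write_jsonl(path: str, records: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            missing = REQUIRED_KEYS - rec.keys()
            if missing:
                raise ValueError(f"{rec.get('id', '<no id>')} missing keys: {missing}")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```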
## Deliverables
1. Raw collected data + provenance logs
2. Labeling guideline doc + calibration set
3. Cleaning pipeline script (`clean.py`) with tests
4. Contamination report
5. Final train.jsonl, dev.jsonl, holdout.jsonl
6. Data card
7. Reproducibility: scripts + commit hash + environment
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations

Replace the bracketed placeholders with your own context before running the prompt:
- `{seed_example}`: one of your real seed examples (an `{ "input": "...", "output": "..." }` pair).