LLM Fine-Tuning
Fine-tune large language models on your domain data — LoRA, QLoRA, full fine-tuning, or DPO — when prompting and retrieval have hit their ceiling and you need consistent, cost-efficient behavior at scale.
What is LLM fine-tuning?
LLM fine-tuning is the process of further training a pre-trained language model on your data so it produces consistent, domain-specific behavior that prompting alone cannot reliably achieve. Done right, fine-tuning lets you use a smaller, cheaper model to outperform a larger one on your specific task — at a fraction of the inference cost.
We treat fine-tuning as a tool of last resort, not a default. Most teams asking about fine-tuning would be better served by better prompting, retrieval-augmented generation, or both. When fine-tuning is the right answer, we build it end-to-end: data pipeline, training, evaluation, deployment, and ongoing maintenance.
Key terms used on this page:
- Base model: The pre-trained foundation model you start from — GPT-4o-mini, Claude Haiku, Llama 3.1, Mistral Small, Qwen 2.5, etc.
- Full fine-tuning: Updating every weight in the model on your data. Highest quality ceiling, highest compute and data cost.
- LoRA (Low-Rank Adaptation): Training small adapter matrices on top of a frozen base model. Captures most of the value of full fine-tuning at a fraction of the cost (see the configuration sketch after this list).
- QLoRA: LoRA applied to a quantized (typically 4-bit) base model — fits on a single GPU, ideal for experimentation.
- SFT (Supervised Fine-Tuning): Training on input/output pairs where you have ground-truth answers.
- DPO (Direct Preference Optimization): Aligning a model to preferences (chosen vs. rejected pairs) without a separate reward model. Has largely replaced RLHF for most production use cases.
- RLHF (Reinforcement Learning from Human Feedback): Multi-stage alignment using a reward model and reinforcement learning. More complex and expensive than DPO; still useful for the most sensitive alignment work.
- PEFT (Parameter-Efficient Fine-Tuning): The umbrella category — LoRA, QLoRA, prefix tuning, prompt tuning — that updates only a small fraction of parameters.
- Evaluation harness: A test suite of held-out examples and metrics that runs on every model version to catch regressions.
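To ground the LoRA and PEFT terms above, here is a minimal adapter configuration using Hugging Face's peft library. The base model and every hyperparameter (rank, alpha, target modules) are illustrative assumptions, not recommendations:

```python
# Minimal LoRA adapter setup with Hugging Face peft.
# Model name and all hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor applied to the adapters
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total weights
```

The printed trainable-parameter count is the whole point of PEFT: the adapters are a tiny fraction of the base model, which is what makes the training cheap and the artifact small.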
When does fine-tuning actually make sense?
This is the most important section on this page. Fine-tuning is the right tool for a narrow set of problems and the wrong tool for many of the problems teams bring to us. Fine-tune when:
1. Format consistency matters more than knowledge. You need outputs in an exact JSON schema, code style, or document structure that prompting reliably gets wrong on edge cases.
2. You have a narrow, repeatable task. Classification, extraction, or transformation with a stable taxonomy and 1,000+ labeled examples.
3. Brand voice or domain style is the entire point. Marketing copy, legal language, clinical documentation — places where the prompt cannot fully capture the style and you have a corpus that demonstrates it.
4. You want to drop to a smaller, cheaper model at high volume. Fine-tuned GPT-4o-mini or Llama 3.1 8B can match GPT-4o on narrow tasks at 1/10th the cost. At millions of calls per month, this pays for the project several times over.
5. You need an open-weight model deployed in your VPC. Fine-tuning is how you turn a generic open-weight base into something competitive with closed-API models on your task.
Do not fine-tune when:
- The task requires up-to-date knowledge — use retrieval instead.
- You have fewer than a few hundred labeled examples — improve prompting first.
- The base model already handles the task at acceptable quality — you are buying problems for no benefit.
- Your data changes frequently — retraining cost will exceed retrieval-pipeline maintenance.
We have turned down fine-tuning engagements where the right answer was a better prompt, a better retriever, or a different base model. We will tell you the same.
How does the fine-tuning process actually work?
A real fine-tuning project has six phases, and the model training itself is the shortest one:
1. Problem framing and baseline. Define the task and success metrics, then run an honest baseline with a strong prompted model (Claude Sonnet, GPT-4o). If prompting hits the bar, stop.
2. Data collection and curation. Source examples from production logs, expert annotations, or synthetic generation. Deduplicate, scrub PII, balance classes, and split train / validation / test (see the curation sketch after this list). This is 50–70% of the project.
3. Evaluation harness. Build the test set and metrics before you train. We score on accuracy, faithfulness, format validity, latency, and cost — not just loss.
4. Method selection. Choose between SFT, DPO, LoRA, QLoRA, full fine-tuning, or some combination. We pick based on data shape, base model, and deployment target.
5. Training and tuning. Hyperparameter sweeps (learning rate, rank, epochs), training runs, and evaluation against the held-out set after every epoch (see the training sketch below).
6. Deployment and monitoring. Quantization for inference, deployment to your runtime (OpenAI, Bedrock, Together.ai, Fireworks, your own GPUs via vLLM or TGI), and monitoring for drift and regressions.
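As referenced in phase 2, here is a minimal sketch of the curation step: exact-match deduplication and a train/validation/test split over a JSONL file. The field names and split ratios are assumptions for illustration:

```python
# Sketch of phase-2 curation: exact dedup, then a train/val/test split.
# Assumes rows shaped {"input": ..., "output": ...}; ratios are illustrative.
import json
import random

with open("examples.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Deduplicate on the exact (input, output) pair.
seen, unique = set(), []
for r in rows:
    key = (r["input"].strip(), r["output"].strip())
    if key not in seen:
        seen.add(key)
        unique.append(r)

random.seed(42)
random.shuffle(unique)
n = len(unique)
splits = {
    "train": unique[: int(0.8 * n)],
    "val": unique[int(0.8 * n): int(0.9 * n)],
    "test": unique[int(0.9 * n):],
}
for name, subset in splits.items():
    with open(f"{name}.jsonl", "w") as f:
        for r in subset:
            f.write(json.dumps(r) + "\n")
```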
Skipping any phase is the most common failure mode. Teams that rush to training without an evaluation harness ship models they cannot prove are better than the baseline.
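For phase 5 itself, here is a minimal LoRA SFT run with Hugging Face TRL, assuming a recent TRL version that accepts a model name string; the dataset, base model, and hyperparameters are illustrative:

```python
# Minimal LoRA SFT run with Hugging Face TRL.
# Dataset, model, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each row of train.jsonl: {"text": "<prompt and completion as one string>"}
dataset = load_dataset("json", data_files={"train": "train.jsonl"})["train"]

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",   # base model to adapt (illustrative)
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="sft-out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,            # typical LoRA range is roughly 1e-4 to 3e-4
    ),
)
trainer.train()
trainer.save_model("sft-out/adapter")  # saves only the small adapter weights
```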
How do you choose between LoRA, QLoRA, full fine-tuning, and DPO?
Each method has a different cost/quality/data profile. Here is how we choose:
| Method | Use when | Data needed | Cost | Quality ceiling |
|---|---|---|---|---|
| LoRA SFT | Narrow task, structured outputs, brand voice — most common case | 500–10k examples | Low (single GPU for hours) | 90–95% of full fine-tuning |
| QLoRA SFT | Same as LoRA but on a smaller GPU budget, or for experimentation | 500–10k examples | Lowest | ~90% of full fine-tuning |
| Full fine-tuning | You need every percentage point of quality and have the data and GPUs | 10k–1M+ examples | High (multi-GPU, days) | Highest |
| DPO | Subjective tasks (style, helpfulness, refusal calibration) with preference pairs | 1k–10k preference pairs | Medium | Often surpasses SFT on subjective tasks |
| RLHF | Complex alignment with multiple reward signals — rare in commercial work | 10k+ preference pairs + reward model | Highest | Theoretical ceiling, but DPO usually closes the gap |
| Continued pre-training | Adapting a model to a new language, domain corpus, or codebase | Millions–billions of tokens | Very high | Different category — broadens the base, then fine-tune on top |
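To make the DPO row concrete, here is a minimal run with TRL's DPOTrainer on a JSONL file of preference pairs. Again, the model, dataset path, and beta value are illustrative assumptions:

```python
# Minimal DPO run with Hugging Face TRL on preference pairs.
# Model, dataset, and beta are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Each row of pairs.jsonl: {"prompt": ..., "chosen": ..., "rejected": ...}
pairs = load_dataset("json", data_files={"train": "pairs.jsonl"})["train"]

trainer = DPOTrainer(
    model="meta-llama/Llama-3.1-8B",   # illustrative; recent TRL accepts a name string
    train_dataset=pairs,
    # With a peft_config, no separate reference model is needed: TRL scores the
    # frozen base by disabling the adapters.
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=DPOConfig(
        output_dir="dpo-out",
        beta=0.1,                      # strength of the KL penalty toward the base model
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```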
How do you build the training data?
Data is where fine-tuning projects succeed or fail. The methods we use:
- Production logs. The richest source — real user inputs and the outputs you want or do not want. We mine logs, label, and curate.
- Expert annotation. Subject-matter experts (lawyers, doctors, financial analysts, content editors) producing or correcting examples. Slow and expensive, but the highest-quality data you can get.
- Synthetic generation with a stronger model. Use Claude Opus or GPT-4o to generate candidate examples, then have humans review and accept. Distillation from a stronger model into a smaller fine-tuned one is one of the highest-ROI techniques in 2026.
- Bootstrapping with rejection sampling. Generate many candidates with the base model, score them with a critic model or rule, keep only the high-scoring ones for training.
- Preference pair construction (for DPO). Pair a "chosen" response with a "rejected" one — sourced from human raters, side-by-side comparisons of two models, or a strong critic model (sketched below).
We almost always combine these: a few hundred expert-annotated examples to anchor the distribution, then synthetic expansion to thousands, then preference pairs for DPO on top.
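Here is a minimal sketch combining the rejection-sampling and preference-pair steps. The generate and score callables are assumed helpers you supply, such as a call to your base model and a rubric-scoring critic; the candidate count and score gap are arbitrary placeholders:

```python
# Sketch: build DPO preference pairs by scoring sampled candidates with a
# critic. `generate` and `score` are assumed helpers you supply; the
# candidate count and minimum score gap are placeholder values.
import json

def build_pairs(prompts, generate, score, out_path, n_candidates=8, min_gap=2.0):
    """Write DPO pairs: best vs. worst of n sampled candidates per prompt."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            candidates = [generate(prompt) for _ in range(n_candidates)]
            scored = sorted(((score(c), c) for c in candidates), reverse=True)
            (best_score, best), (worst_score, worst) = scored[0], scored[-1]
            # Keep only pairs where the critic sees a clear quality gap.
            if best_score - worst_score >= min_gap:
                f.write(json.dumps(
                    {"prompt": prompt, "chosen": best, "rejected": worst}) + "\n")
```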
Should you build, buy, or partner for LLM fine-tuning?
The fine-tuning ecosystem has matured fast. Here is the honest comparison of platforms and approaches:
| Option | Best for | Method support | Speed | Cost | Lock-in |
|---|---|---|---|---|---|
| OpenAI fine-tuning API (GPT-4o, GPT-4o-mini, GPT-3.5) | Closed-API workloads, fastest path to a working fine-tune | SFT, DPO (limited) | Hours–days | USD 25–100 per training run + premium inference | High — model lives in OpenAI |
| Anthropic fine-tuning (via AWS Bedrock, Claude 3 Haiku) | Production Claude workloads, regulated industries | SFT | Days | Higher than OpenAI but tied to Bedrock | High |
| Together.ai | Open-weight LoRA / full fine-tuning, fastest open-source path | LoRA, full SFT, DPO | Hours | USD 5–500 per run depending on size | Low — export weights anywhere |
| Fireworks AI | Production-grade open-weight serving with fine-tuning | LoRA, SFT | Hours | Competitive | Low |
| MosaicML / Databricks | Enterprise full fine-tuning, large-scale, integrated with data warehouse | Full SFT, continued pre-training, DPO | Days | High — meant for serious volume | Medium — Databricks ecosystem |
| AWS SageMaker JumpStart | AWS-native deployment, regulated workloads, VPC | LoRA, full SFT | Days | AWS pricing — moderate to high | Medium — AWS lock-in |
| Hugging Face TRL / PEFT (self-hosted) | Maximum control, research, novel methods | All methods | Days–weeks | Just GPU cost | None — you own everything |
| Lamini | Managed end-to-end with focus on enterprise | SFT, DPO, RLHF-style | Days | High | Medium |
| Predibase | Managed LoRA serving with low-latency adapters | LoRA, SFT | Hours | Moderate | Medium |
| Build in-house on raw GPUs | Mature ML org with infra team | All | Slow to start | Lowest at scale | None |
| Partner-built (our model) | You want fine-tuning done right without building an ML team | All — we pick per project | 6–14 weeks | Predictable, IP retained | None — you own weights and code |
How do you evaluate a fine-tuned model?
A fine-tune that "looks better" in spot checks is not a fine-tune you should ship. We score every model version on a held-out test set with a mix of:
| Metric | Measures | How |
|---|---|---|
| Task accuracy | Did the model produce the right output on labeled examples? | Exact match, F1, or LLM-as-judge with a strong critic |
| Format validity | Does output parse against the expected schema? | JSON schema validation, regex, parser |
| Faithfulness | For RAG-style tasks, does the answer match the provided context? | Ragas, custom rubric |
| Style adherence | Does it match the brand voice or required register? | LLM-as-judge with a style rubric |
| Refusal calibration | Does it refuse the right things and answer the rest? | Red-team prompt suite |
| Regression vs. base | Did fine-tuning hurt anything the base model did well? | Held-out general-capability suite |
| Cost and latency | Per-call cost and p95 latency at projected volume | Load testing |
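As a sketch of the first two rows, task accuracy and format validity, here is a minimal harness using the jsonschema library. The schema, field names, and model_fn are illustrative assumptions:

```python
# Sketch of two harness checks: exact-match accuracy and JSON-schema
# format validity. Schema and field names are illustrative assumptions.
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"label": {"type": "string"}},
    "required": ["label"],
}

def evaluate(test_rows, model_fn):
    correct = valid = 0
    for row in test_rows:
        output = model_fn(row["input"])      # your fine-tuned model call
        try:
            parsed = json.loads(output)
            validate(parsed, SCHEMA)         # format validity
            valid += 1
            if parsed["label"] == row["expected_label"]:
                correct += 1                 # task accuracy by exact match
        except (json.JSONDecodeError, ValidationError):
            continue                         # invalid output counts as wrong
    n = len(test_rows)
    return {"accuracy": correct / n, "format_validity": valid / n}
```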
What does a fine-tuning engagement look like with us?
A typical engagement runs 6 to 14 weeks:
- Weeks 1–2: Problem framing, baseline measurement with strong prompting, decision on whether to fine-tune at all.
- Weeks 2–6: Data sourcing, labeling, synthetic expansion, evaluation harness construction.
- Weeks 6–9: Method selection, training runs, hyperparameter sweeps, evaluation iteration.
- Weeks 9–11: Deployment to your runtime (OpenAI, Bedrock, Together, Fireworks, your VPC), load testing, cost validation.
- Weeks 11–14: Production rollout with shadow mode, monitoring, drift detection, and a retraining playbook.
Outcomes we hold ourselves to: a fine-tuned model that beats the prompted baseline on your evaluation harness, a documented data and training pipeline you can rerun, a deployment with monitoring, and a clear cost model at projected volume.
After launch, we usually keep a small retainer for periodic retraining, base-model upgrades (when GPT-5 or Claude 5 ships, you will want to re-evaluate), and adjacent task fine-tunes.
What does LLM fine-tuning cost?
For a single fine-tuning project end-to-end, expect USD 35,000 to USD 150,000, weighted heavily toward data work. Multi-task or multi-model platforms (a fine-tuning pipeline you will rerun monthly) run USD 100,000 to USD 350,000.
Compute cost for the training run itself is usually small relative to engineering — USD 50 to USD 5,000 per run for LoRA on common base models, USD 1,000 to USD 50,000 for full fine-tunes on larger models.
Inference economics depend on the path you choose:
- OpenAI fine-tuned GPT-4o-mini: roughly 2x base model token price.
- Anthropic Claude Haiku fine-tuned (Bedrock): premium over base.
- Open-weight LoRA on Together.ai / Fireworks: USD 0.20–2.00 per million tokens depending on model size.
- Self-hosted on your GPUs: amortized GPU cost — most efficient at very high volume.
We always model unit economics before training. The right answer is sometimes "do not fine-tune" — and we will say so.
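As an illustration of that unit-economics modeling, here is a back-of-envelope comparison; every price and volume below is a placeholder assumption, not a quote:

```python
# Back-of-envelope unit economics: prompted large model vs. fine-tuned
# small model. All prices and volumes are placeholder assumptions.
def monthly_cost(calls, tokens_per_call, usd_per_million_tokens):
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000

calls = 5_000_000                              # projected monthly volume
prompted = monthly_cost(calls, 1_500, 10.00)   # large model, long few-shot prompt
finetuned = monthly_cost(calls, 400, 1.00)     # small fine-tune, short prompt

project_cost = 75_000                          # one-time fine-tuning project
savings = prompted - finetuned
print(f"monthly savings: ${savings:,.0f}; "
      f"payback: {project_cost / savings:.1f} months")
```

At these placeholder numbers the project pays back in about a month; at a tenth of the volume, the same arithmetic says do not fine-tune.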
For pricing on adjacent services, see our Pricing page.
Frequently asked questions about LLM fine-tuning
Should we fine-tune a model or just use prompting and retrieval?
Start with prompting plus retrieval. Fine-tune only when you have a stable, repeatable task with clear failure modes that prompting cannot solve — formatting consistency, brand voice, narrow classification, or significant inference-cost reduction at high volume. About 80% of fine-tuning projects we are asked to scope should not be fine-tuning projects at all.
What is the difference between LoRA, QLoRA, and full fine-tuning?
Full fine-tuning updates every weight in the model — highest ceiling, highest cost, requires GPUs and significant data. LoRA (Low-Rank Adaptation) trains small adapter matrices on top of a frozen base model — 90% of the quality at 5% of the cost. QLoRA is LoRA on a quantized base model — fits on a single consumer GPU. We use LoRA or QLoRA for almost everything; full fine-tuning is justified maybe 10% of the time.
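To make the QLoRA distinction concrete, here is a minimal 4-bit load with bitsandbytes before attaching LoRA adapters; the model name and quantization settings are illustrative assumptions:

```python
# Minimal QLoRA setup: load the base model in 4-bit, then attach LoRA
# adapters. Model name and quantization settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb, device_map="auto"
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```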
What about RLHF and DPO — do we need those?
Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) align a model to preferences rather than ground-truth labels — useful when the task is subjective (style, helpfulness, safety) and you have preference pairs. DPO is usually the right choice now: simpler, more stable, and cheaper than full RLHF. You probably do not need RLHF; you might want DPO.
How much training data do we actually need?
For LoRA fine-tuning of a closed model (OpenAI, Anthropic) on a narrow task, 500 to 5,000 high-quality examples is typically enough. For DPO, 1,000 to 10,000 preference pairs. For full fine-tuning of an open-weight model on a broader task, 10,000 to 100,000 examples. Data quality matters more than quantity — 1,000 carefully curated examples beat 10,000 noisy ones almost every time.
Should we fine-tune OpenAI, Anthropic, or an open-weight model?
OpenAI fine-tuning (GPT-4o, GPT-4o-mini) is the easiest path — managed infrastructure, decent cost, but you are locked in. Anthropic fine-tuning is in limited release on Bedrock and worth considering for production Claude workloads. For full control and best unit economics, fine-tune an open-weight model (Llama 3.x, Mistral, Qwen, DeepSeek) on Together.ai, Fireworks, MosaicML/Databricks, AWS SageMaker JumpStart, or Hugging Face TRL/PEFT. We pick based on data sensitivity, deployment requirements, and projected volume.
How long does a fine-tuning project take?
Six to fourteen weeks end-to-end. Most of the time is data — sourcing, labeling, deduplicating, splitting train/val/test, building the evaluation harness. The actual training run is hours to a day or two. The most common mistake we see is teams rushing to training, only to discover that their data was the bottleneck.
What does fine-tuning cost to run in production?
Fine-tuned OpenAI / Anthropic models are typically 1.5x to 6x the base model price per token. Self-hosted fine-tuned open-weight models on Together.ai or Fireworks run USD 0.20–2.00 per million tokens. The right comparison is total cost at projected volume — fine-tuning often pays for itself within months at high volume because you can drop to a smaller, cheaper base model.
Ready to Transform Your Business with AI?
Let's discuss how our AI solutions can drive growth, reduce costs, and create competitive advantages for your organization.
Schedule a Consultation