LLM Fine-Tuning
Fine-tune large language models on your domain data — LoRA, QLoRA, full fine-tuning, or DPO — when prompting and retrieval have hit their ceiling and you need consistent, cost-efficient behavior at scale.
What is LLM fine-tuning?
LLM fine-tuning is the process of further training a pre-trained language model on your data so it produces consistent, domain-specific behavior that prompting alone cannot reliably achieve. Done right, fine-tuning lets you use a smaller, cheaper model to outperform a larger one on your specific task — at a fraction of the inference cost.
We treat fine-tuning as a tool of last resort, not a default. Most teams asking about fine-tuning would be better served by better prompting, retrieval-augmented generation, or both. When fine-tuning is the right answer, we build it end-to-end: data pipeline, training, evaluation, deployment, and ongoing maintenance.
Key terms used on this page:
- Base model: The pre-trained foundation model you start from — GPT-4o-mini, Claude Haiku, Llama 3.1, Mistral Small, Qwen 2.5, etc.
- Full fine-tuning: Updating every weight in the model on your data. Highest quality ceiling, highest compute and data cost.
- LoRA (Low-Rank Adaptation): Training small adapter matrices on top of a frozen base model. Captures most of the value of full fine-tuning at a fraction of the cost (see the configuration sketch after this list).
- QLoRA: LoRA applied to a quantized (typically 4-bit) base model — fits on a single GPU, ideal for experimentation.
- SFT (Supervised Fine-Tuning): Training on input/output pairs where you have ground-truth answers.
- DPO (Direct Preference Optimization): Aligning a model to preferences (chosen vs. rejected pairs) without a separate reward model. Has largely replaced RLHF for most production use cases.
- RLHF (Reinforcement Learning from Human Feedback): Multi-stage alignment using a reward model and reinforcement learning. More complex and expensive than DPO; still useful for the most sensitive alignment work.
- PEFT (Parameter-Efficient Fine-Tuning): The umbrella category — LoRA, QLoRA, prefix tuning, prompt tuning — that updates only a small fraction of parameters.
- Evaluation harness: A test suite of held-out examples and metrics that runs on every model version to catch regressions.
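To ground the LoRA and PEFT terms above, here is a minimal adapter configuration using Hugging Face's peft library. The base model and every hyperparameter (rank, alpha, target modules) are illustrative assumptions, not recommendations:

```python
# Minimal LoRA adapter setup with Hugging Face peft.
# Model name and all hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor applied to the adapters
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total weights
```

The printed trainable-parameter count is the whole point of PEFT: the adapters are a tiny fraction of the base model, which is what makes the training cheap and the artifact small.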
When does fine-tuning actually make sense?
This is the most important section on this page. Fine-tuning is the right tool for a narrow set of problems and the wrong tool for many of the problems teams bring to us. Fine-tune when:
1. Format consistency matters more than knowledge. You need outputs in an exact JSON schema, code style, or document structure that prompting reliably gets wrong on edge cases.
2. You have a narrow, repeatable task. Classification, extraction, or transformation with a stable taxonomy and 1,000+ labeled examples.
3. Brand voice or domain style is the entire point. Marketing copy, legal language, clinical documentation — places where the prompt cannot fully capture the style and you have a corpus that demonstrates it.
4. You want to drop to a smaller, cheaper model at high volume. Fine-tuned GPT-4o-mini or Llama 3.1 8B can match GPT-4o on narrow tasks at 1/10th the cost. At millions of calls per month, this pays for the project several times over.
5. You need an open-weight model deployed in your VPC. Fine-tuning is how you turn a generic open-weight base into something competitive with closed-API models on your task.
Do not fine-tune when:
- The task requires up-to-date knowledge — use retrieval instead.
- You have fewer than a few hundred labeled examples — improve prompting first.
- The base model already handles the task at acceptable quality — you are buying problems for no benefit.
- Your data changes frequently — retraining cost will exceed retrieval-pipeline maintenance.
We have turned down fine-tuning engagements where the right answer was a better prompt, a better retriever, or a different base model. We will tell you the same.
How does the fine-tuning process actually work?
A real fine-tuning project has six phases, and the model training itself is the shortest one:
1. Problem framing and baseline. Define the task and success metrics, then run an honest baseline with a strong prompted model (Claude Sonnet, GPT-4o). If prompting hits the bar, stop.
2. Data collection and curation. Source examples from production logs, expert annotations, or synthetic generation. Deduplicate, scrub PII, balance classes, and split train / validation / test (see the curation sketch after this list). This is 50–70% of the project.
3. Evaluation harness. Build the test set and metrics before you train. We score on accuracy, faithfulness, format validity, latency, and cost — not just loss.
4. Method selection. Choose between SFT, DPO, LoRA, QLoRA, full fine-tuning, or some combination. We pick based on data shape, base model, and deployment target.
5. Training and tuning. Hyperparameter sweeps (learning rate, rank, epochs), training runs, and evaluation against the held-out set after every epoch (see the training sketch below).
6. Deployment and monitoring. Quantization for inference, deployment to your runtime (OpenAI, Bedrock, Together.ai, Fireworks, your own GPUs via vLLM or TGI), and monitoring for drift and regressions.
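As referenced in phase 2, here is a minimal sketch of the curation step: exact-match deduplication and a train/validation/test split over a JSONL file. The field names and split ratios are assumptions for illustration:

```python
# Sketch of phase-2 curation: exact dedup, then a train/val/test split.
# Assumes rows shaped {"input": ..., "output": ...}; ratios are illustrative.
import json
import random

with open("examples.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Deduplicate on the exact (input, output) pair.
seen, unique = set(), []
for r in rows:
    key = (r["input"].strip(), r["output"].strip())
    if key not in seen:
        seen.add(key)
        unique.append(r)

random.seed(42)
random.shuffle(unique)
n = len(unique)
splits = {
    "train": unique[: int(0.8 * n)],
    "val": unique[int(0.8 * n): int(0.9 * n)],
    "test": unique[int(0.9 * n):],
}
for name, subset in splits.items():
    with open(f"{name}.jsonl", "w") as f:
        for r in subset:
            f.write(json.dumps(r) + "\n")
```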
Skipping any phase is the most common failure mode. Teams that rush to training without an evaluation harness ship models they cannot prove are better than the baseline.
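For phase 5 itself, here is a minimal LoRA SFT run with Hugging Face TRL, assuming a recent TRL version that accepts a model name string; the dataset, base model, and hyperparameters are illustrative:

```python
# Minimal LoRA SFT run with Hugging Face TRL.
# Dataset, model, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each row of train.jsonl: {"text": "<prompt and completion as one string>"}
dataset = load_dataset("json", data_files={"train": "train.jsonl"})["train"]

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",   # base model to adapt (illustrative)
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="sft-out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,            # typical LoRA range is roughly 1e-4 to 3e-4
    ),
)
trainer.train()
trainer.save_model("sft-out/adapter")  # saves only the small adapter weights
```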
How do you choose between LoRA, QLoRA, full fine-tuning, and DPO?
Each method has a different cost/quality/data profile. Here is how we choose:
| Method | Use when | Data needed | Cost | Quality ceiling |
|---|---|---|---|---|
| LoRA SFT | Narrow task, structured outputs, brand voice — most common case | 500–10k examples | Low (single GPU for hours) | 90–95% of full fine-tuning |
| QLoRA SFT | Same as LoRA but on a smaller GPU budget, or for experimentation | 500–10k examples | Lowest | ~90% of full fine-tuning |
| Full fine-tuning | You need every percentage point of quality and have the data and GPUs | 10k–1M+ examples | High (multi-GPU, days) | Highest |
| DPO | Subjective tasks (style, helpfulness, refusal calibration) with preference pairs | 1k–10k preference pairs | Medium | Often surpasses SFT on subjective tasks |
| RLHF | Complex alignment with multiple reward signals — rare in commercial work | 10k+ preference pairs + reward model | Highest | Theoretical ceiling, but DPO usually closes the gap |
| Continued pre-training | Adapting a model to a new language, domain corpus, or codebase | Millions–billions of tokens | Very high | Different category — broadens the base, then fine-tune on top |
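To make the DPO row concrete, here is a minimal run with TRL's DPOTrainer on a JSONL file of preference pairs. Again, the model, dataset path, and beta value are illustrative assumptions:

```python
# Minimal DPO run with Hugging Face TRL on preference pairs.
# Model, dataset, and beta are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Each row of pairs.jsonl: {"prompt": ..., "chosen": ..., "rejected": ...}
pairs = load_dataset("json", data_files={"train": "pairs.jsonl"})["train"]

trainer = DPOTrainer(
    model="meta-llama/Llama-3.1-8B",   # illustrative; recent TRL accepts a name string
    train_dataset=pairs,
    # With a peft_config, no separate reference model is needed: TRL scores the
    # frozen base by disabling the adapters.
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=DPOConfig(
        output_dir="dpo-out",
        beta=0.1,                      # strength of the KL penalty toward the base model
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```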
How do you build the training data?
Data is where fine-tuning projects succeed or fail. The methods we use:
- Production logs. The richest source — real user inputs and the outputs you want or do not want. We mine logs, label, and curate.
- Expert annotation. Subject-matter experts (lawyers, doctors, financial analysts, content editors) producing or correcting examples. Slow and expensive, but the highest-quality data you can get.
- Synthetic generation with a stronger model. Use Claude Opus or GPT-4o to generate candidate examples, then have humans review and accept. Distillation from a stronger model into a smaller fine-tuned one is one of the highest-ROI techniques in 2026.
- Bootstrapping with rejection sampling. Generate many candidates with the base model, score them with a critic model or rule, keep only the high-scoring ones for training.
- Preference pair construction (for DPO). Pair a "chosen" response with a "rejected" one — sourced from human raters, side-by-side comparisons of two models, or a strong critic model (sketched below).
We almost always combine these: a few hundred expert-annotated examples to anchor the distribution, then synthetic expansion to thousands, then preference pairs for DPO on top.
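Here is a minimal sketch combining the rejection-sampling and preference-pair steps. The generate and score callables are assumed helpers you supply, such as a call to your base model and a rubric-scoring critic; the candidate count and score gap are arbitrary placeholders:

```python
# Sketch: build DPO preference pairs by scoring sampled candidates with a
# critic. `generate` and `score` are assumed helpers you supply; the
# candidate count and minimum score gap are placeholder values.
import json

def build_pairs(prompts, generate, score, out_path, n_candidates=8, min_gap=2.0):
    """Write DPO pairs: best vs. worst of n sampled candidates per prompt."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            candidates = [generate(prompt) for _ in range(n_candidates)]
            scored = sorted(((score(c), c) for c in candidates), reverse=True)
            (best_score, best), (worst_score, worst) = scored[0], scored[-1]
            # Keep only pairs where the critic sees a clear quality gap.
            if best_score - worst_score >= min_gap:
                f.write(json.dumps(
                    {"prompt": prompt, "chosen": best, "rejected": worst}) + "\n")
```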
Should you build, buy, or partner for LLM fine-tuning?
The fine-tuning ecosystem has matured fast. Here is the honest comparison of platforms and approaches:
| Option | Best for | Method support | Speed | Cost | Lock-in |
|---|---|---|---|---|---|
| OpenAI fine-tuning API (GPT-4o, GPT-4o-mini, GPT-3.5) | Closed-API workloads, fastest path to a working fine-tune | SFT, DPO (limited) | Hours–days | USD 25–100 per training run + premium inference | High — model lives in OpenAI |
| Anthropic fine-tuning (via AWS Bedrock, Claude 3 Haiku) | Production Claude workloads, regulated industries | SFT | Days | Higher than OpenAI but tied to Bedrock | High |
| Together.ai | Open-weight LoRA / full fine-tuning, fastest open-source path | LoRA, full SFT, DPO | Hours | USD 5–500 per run depending on size | Low — export weights anywhere |
| Fireworks AI | Production-grade open-weight serving with fine-tuning | LoRA, SFT | Hours | Competitive | Low |
| MosaicML / Databricks | Enterprise full fine-tuning, large-scale, integrated with data warehouse | Full SFT, continued pre-training, DPO | Days | High — meant for serious volume | Medium — Databricks ecosystem |
| AWS SageMaker JumpStart | AWS-native deployment, regulated workloads, VPC | LoRA, full SFT | Days | AWS pricing — moderate to high | Medium — AWS lock-in |
| Hugging Face TRL / PEFT (self-hosted) | Maximum control, research, novel methods | All methods | Days–weeks | Just GPU cost | None — you own everything |
| Lamini | Managed end-to-end with focus on enterprise | SFT, DPO, RLHF-style | Days | High | Medium |
| Predibase | Managed LoRA serving with low-latency adapters | LoRA, SFT | Hours | Moderate | Medium |
| Build in-house on raw GPUs | Mature ML org with infra team | All | Slow to start | Lowest at scale | None |
| Partner-built (our model) | You want fine-tuning done right without building an ML team | All — we pick per project | 6–14 weeks | Predictable, IP retained | None — you own weights and code |
How do you evaluate a fine-tuned model?
A fine-tune that "looks better" in spot checks is not a fine-tune you should ship. We score every model version on a held-out test set with a mix of:
| Metric | Measures | How |
|---|---|---|
| Task accuracy | Did the model produce the right output on labeled examples? | Exact match, F1, or LLM-as-judge with a strong critic |
| Format validity | Does output parse against the expected schema? | JSON schema validation, regex, parser |
| Faithfulness | For RAG-style tasks, does the answer match the provided context? | Ragas, custom rubric |
| Style adherence | Does it match the brand voice or required register? | LLM-as-judge with a style rubric |
| Refusal calibration | Does it refuse the right things and answer the rest? | Red-team prompt suite |
| Regression vs. base | Did fine-tuning hurt anything the base model did well? | Held-out general-capability suite |
| Cost and latency | Per-call cost and p95 latency at projected volume | Load testing |
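As a sketch of the first two rows, task accuracy and format validity, here is a minimal harness using the jsonschema library. The schema, field names, and model_fn are illustrative assumptions:

```python
# Sketch of two harness checks: exact-match accuracy and JSON-schema
# format validity. Schema and field names are illustrative assumptions.
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"label": {"type": "string"}},
    "required": ["label"],
}

def evaluate(test_rows, model_fn):
    correct = valid = 0
    for row in test_rows:
        output = model_fn(row["input"])      # your fine-tuned model call
        try:
            parsed = json.loads(output)
            validate(parsed, SCHEMA)         # format validity
            valid += 1
            if parsed["label"] == row["expected_label"]:
                correct += 1                 # task accuracy by exact match
        except (json.JSONDecodeError, ValidationError):
            continue                         # invalid output counts as wrong
    n = len(test_rows)
    return {"accuracy": correct / n, "format_validity": valid / n}
```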
What does a fine-tuning engagement look like with us?
A typical engagement runs 6 to 14 weeks:
- Weeks 1–2: Problem framing, baseline measurement with strong prompting, decision on whether to fine-tune at all.
- Weeks 2–6: Data sourcing, labeling, synthetic expansion, evaluation harness construction.
- Weeks 6–9: Method selection, training runs, hyperparameter sweeps, evaluation iteration.
- Weeks 9–11: Deployment to your runtime (OpenAI, Bedrock, Together, Fireworks, your VPC), load testing, cost validation.
- Weeks 11–14: Production rollout with shadow mode, monitoring, drift detection, and a retraining playbook.
Outcomes we hold ourselves to: a fine-tuned model that beats the prompted baseline on your evaluation harness, a documented data and training pipeline you can rerun, a deployment with monitoring, and a clear cost model at projected volume.
After launch, we usually keep a small retainer for periodic retraining, base-model upgrades (when GPT-5 or Claude 5 ships, you will want to re-evaluate), and adjacent task fine-tunes.
What does LLM fine-tuning cost?
For a single fine-tuning project end-to-end, expect USD 35,000 to USD 150,000, weighted heavily toward data work. Multi-task or multi-model platforms (a fine-tuning pipeline you will rerun monthly) run USD 100,000 to USD 350,000.
Compute cost for the training run itself is usually small relative to engineering — USD 50 to USD 5,000 per run for LoRA on common base models, USD 1,000 to USD 50,000 for full fine-tunes on larger models.
Inference economics depend on the path you choose:
- OpenAI fine-tuned GPT-4o-mini: roughly 2x base model token price.
- Anthropic Claude Haiku fine-tuned (Bedrock): premium over base.
- Open-weight LoRA on Together.ai / Fireworks: USD 0.20–2.00 per million tokens depending on model size.
- Self-hosted on your GPUs: amortized GPU cost — most efficient at very high volume.
We always model unit economics before training. The right answer is sometimes "do not fine-tune" — and we will say so.
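As an illustration of that unit-economics modeling, here is a back-of-envelope comparison; every price and volume below is a placeholder assumption, not a quote:

```python
# Back-of-envelope unit economics: prompted large model vs. fine-tuned
# small model. All prices and volumes are placeholder assumptions.
def monthly_cost(calls, tokens_per_call, usd_per_million_tokens):
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000

calls = 5_000_000                              # projected monthly volume
prompted = monthly_cost(calls, 1_500, 10.00)   # large model, long few-shot prompt
finetuned = monthly_cost(calls, 400, 1.00)     # small fine-tune, short prompt

project_cost = 75_000                          # one-time fine-tuning project
savings = prompted - finetuned
print(f"monthly savings: ${savings:,.0f}; "
      f"payback: {project_cost / savings:.1f} months")
```

At these placeholder numbers the project pays back in about a month; at a tenth of the volume, the same arithmetic says do not fine-tune.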
For pricing on adjacent services, see our Pricing page.
Frequently asked questions about LLM fine-tuning
Should we fine-tune a model or just use prompting and retrieval?
Start with prompting plus retrieval. Fine-tune only when you have a stable, repeatable task with clear failure modes that prompting cannot solve — formatting consistency, brand voice, narrow classification, or significant inference-cost reduction at high volume. About 80% of fine-tuning projects we are asked to scope should not be fine-tuning projects at all.
What is the difference between LoRA, QLoRA, and full fine-tuning?
Full fine-tuning updates every weight in the model — highest ceiling, highest cost, requires GPUs and significant data. LoRA (Low-Rank Adaptation) trains small adapter matrices on top of a frozen base model — 90% of the quality at 5% of the cost. QLoRA is LoRA on a quantized base model — fits on a single consumer GPU. We use LoRA or QLoRA for almost everything; full fine-tuning is justified maybe 10% of the time.
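To make the QLoRA distinction concrete, here is a minimal 4-bit load with bitsandbytes before attaching LoRA adapters; the model name and quantization settings are illustrative assumptions:

```python
# Minimal QLoRA setup: load the base model in 4-bit, then attach LoRA
# adapters. Model name and quantization settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb, device_map="auto"
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```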
What about RLHF and DPO — do we need those?
Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) align a model to preferences rather than ground-truth labels — useful when the task is subjective (style, helpfulness, safety) and you have preference pairs. DPO is usually the right choice now: simpler, more stable, and cheaper than full RLHF. You probably do not need RLHF; you might want DPO.
How much training data do we actually need?
For LoRA fine-tuning of a closed model (OpenAI, Anthropic) on a narrow task, 500 to 5,000 high-quality examples is typically enough. For DPO, 1,000 to 10,000 preference pairs. For full fine-tuning of an open-weight model on a broader task, 10,000 to 100,000 examples. Data quality matters more than quantity — 1,000 carefully curated examples beat 10,000 noisy ones almost every time.
Should we fine-tune OpenAI, Anthropic, or an open-weight model?
OpenAI fine-tuning (GPT-4o, GPT-4o-mini) is the easiest path — managed infrastructure, decent cost, but you are locked in. Anthropic fine-tuning is in limited release on Bedrock and worth considering for production Claude workloads. For full control and best unit economics, fine-tune an open-weight model (Llama 3.x, Mistral, Qwen, DeepSeek) on Together.ai, Fireworks, MosaicML/Databricks, AWS SageMaker JumpStart, or Hugging Face TRL/PEFT. We pick based on data sensitivity, deployment requirements, and projected volume.
How long does a fine-tuning project take?
Six to fourteen weeks end-to-end. Most of the time is data — sourcing, labeling, deduplicating, splitting train/val/test, building the evaluation harness. The actual training run is hours to a day or two. The most common mistake we see is teams rushing to training, only to discover that their data was the bottleneck.
What does fine-tuning cost to run in production?
Fine-tuned OpenAI / Anthropic models are typically 1.5x to 6x the base model price per token. Self-hosted fine-tuned open-weight models on Together.ai or Fireworks run USD 0.20–2.00 per million tokens. The right comparison is total cost at projected volume — fine-tuning often pays for itself within months at high volume because you can drop to a smaller, cheaper base model.
Ready to Transform Your Business with AI?
Let's discuss how our AI solutions can drive growth, reduce costs, and create competitive advantages for your organization.
Schedule a Consultation