I shipped a code review tool eight months ago. The foundation model was good at finding bugs. It was not good at matching the opinionated style of the codebase it was reviewing: our internal naming conventions, our preferred error handling patterns, the way our team thinks about interface design versus implementation detail.
I tried prompt engineering first. Added examples, added constraints, added a long system prompt describing the style guide. It helped. It also degraded under longer contexts, and it cost a lot of tokens to ship the style context on every single call.
I tried RAG. Retrieved relevant style guide sections and included them. Better on specifics, still wrong on tone. The model knew the rules but did not internalize them. It would correctly identify that a variable name violated the convention and then suggest a replacement that violated a different convention.
I fine-tuned a smaller open weights model on a few thousand examples of good and bad code review comments from our actual history. The resulting model was smaller, cheaper per token, faster, and produced reviews that actually sounded like they came from a senior engineer on our team rather than a generic assistant that had read the style guide.
That is the gap fine-tuning fills. Not knowledge. Behavior.
The Real Decision: Fine-Tuning vs RAG vs Prompt Engineering
These three approaches are often presented as alternatives to each other. They are not. They solve different problems, and the most common mistake is reaching for fine-tuning when one of the others would have worked.
Prompt engineering changes how the model reasons about a task. Better prompts, better examples, better instructions. The limit is context length and the fact that the model has to re-read everything on every call. If the behavior you want can be expressed in a few hundred tokens of instruction, this is almost always the right answer. It is free (in the training sense), reversible, and you can iterate in hours.
RAG gives the model access to information it was not trained on. If the model keeps hallucinating because it does not know your product’s API, your internal terminology, or recent events, RAG fills that gap. It retrieves relevant context from a vector database and includes it in the prompt. The model’s behavior does not change; what it has access to does.
Fine-tuning changes the model itself. The weights update to encode new behavior patterns. Not facts. Behavior: how the model writes, what it emphasizes, what format it defaults to, how it handles domain-specific edge cases that appear nowhere in the general training data.
The question that tells you which approach to use: is the problem that the model does not know something, or that it does not behave the way you need it to?
If knowledge: RAG or prompt engineering.
If behavior: fine-tuning.
The line is not always clean. Sometimes the behavior problem is downstream of a knowledge gap. But thinking through this distinction before you start will save you weeks.
When Fine-Tuning Actually Makes Sense
Fine-tuning is the right tool in a specific set of situations. It is not the right tool for most situations.
Format and style consistency. The model needs to produce output in a very specific format or voice, every time, without extensive prompting. A JSON schema constraint solves the structure but not the voice. The model should just know. This is the code review case, and it comes up in document generation, customer-facing copy, and any product where output quality needs to feel like it came from a trained expert.
Domain jargon and specialized vocabulary. Medical, legal, financial, and deeply technical domains have terminology that general models handle poorly. A model fine-tuned on real clinical notes will not confuse “acute” (sudden onset) with “severe.” A model fine-tuned on legal contracts will not hallucinate clause structures that do not exist. The knowledge is not enough; the model needs to have seen it used correctly many times.
Reducing prompting cost at scale. If you are calling a model a million times a day with a 3,000-token system prompt for context that never changes, the token cost of that system prompt is enormous. Fine-tuning that context into the model means you no longer need to ship it on every call. This is one of the more underappreciated ROI cases for fine-tuning.
Reliable refusals or compliance behavior. If you need a model to always refuse certain categories of requests, no matter how the request is phrased, fine-tuning is more reliable than a system prompt. A system prompt can be bypassed; behavior trained into the weights is much harder to circumvent, though no amount of fine-tuning makes a model fully jailbreak-proof.
When fine-tuning is the wrong answer:
- The information changes frequently. Fine-tuned behavior is baked into weights. If the domain knowledge you are trying to encode updates monthly, you will be retraining constantly. Use RAG.
- You have fewer than a few hundred high-quality examples. Fine-tuning on small, low-quality datasets produces models that are confidently wrong in domain-specific ways. This is worse than the base model.
- You do not have the infrastructure to evaluate whether the fine-tuned model is actually better. Shipping a fine-tuned model without evals is shipping blind.
Full Fine-Tuning vs LoRA vs QLoRA
The three main methods differ in what they update and how much memory they require.
Full fine-tuning updates all the model’s weights. It is the most powerful approach and the most resource-intensive. For a 7B parameter model, full fine-tuning typically requires 4-6 GPUs with 80GB VRAM each. For a 70B model, you need a cluster. This is the approach research labs use. For most product developers, it is overkill.
LoRA (Low-Rank Adaptation) is the practical default for 2026. Instead of updating all the weights, it adds small adapter matrices to specific layers and trains only those. The original weights are frozen. The result is close to full fine-tuning quality at a fraction of the compute cost. A 7B model fine-tuned with LoRA runs comfortably on a single A100 80GB GPU. The adapters are also small (often 100-500MB versus several GB for the full model delta), which makes them easier to version and swap.
QLoRA (Quantized LoRA) takes this further by quantizing the base model to 4-bit precision before adding the LoRA adapters. A 13B model becomes trainable on a single consumer GPU with 24GB VRAM. The trade-off is slightly lower quality than LoRA at full precision, and quantization can introduce subtle behavior changes in the base model that interact unexpectedly with the fine-tuned adapter.
For most product developers in 2026, LoRA is the right starting point. It is well-supported by every training library, produces reliable results, and the adapters are easy to manage. Move to QLoRA if you are constrained on GPU memory and cannot afford A100 access.
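To make the mechanics concrete, here is a minimal sketch of the LoRA setup using Hugging Face peft, with the 4-bit quantization step that turns it into QLoRA included as an optional block. The base model name, rank, and target modules are illustrative starting points, not tuned recommendations.

```python
# Minimal LoRA/QLoRA setup with Hugging Face peft. Model name and
# hyperparameters are illustrative, not recommendations.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model

# QLoRA: quantize the frozen base model to 4-bit before attaching adapters.
# Drop quantization_config entirely for plain LoRA at full precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # only needed for the 4-bit path

# Adapter matrices are attached only to the attention projections here;
# the original weights stay frozen.
lora_config = LoraConfig(
    r=16,                     # adapter rank: capacity vs. adapter size
    lora_alpha=32,            # scaling factor, commonly 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

Everything except the quantization config is identical between the two methods, which is part of why QLoRA is such an easy fallback when memory is tight.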
Picking a Base Model
The base model you fine-tune on matters more than the fine-tuning technique. A well-chosen base model with minimal fine-tuning beats a poorly chosen base model with extensive tuning.
Llama 3.3 70B is the current workhorse for serious fine-tuning projects. Strong across coding, reasoning, and instruction following. The 70B scale gives you enough capacity to encode complex domain behavior. The open license lets you deploy without API cost. For most product use cases, this is where I would start.
Qwen 2.5 Coder 32B is worth considering specifically for code tasks. It has significantly stronger code generation and understanding than general-purpose models at the same scale, which matters if you are fine-tuning for a coding assistant, code review tool, or anything that generates or analyzes code.
Mistral Small 3 and Phi-4 are the options if inference cost is the primary concern. These smaller models punch above their weight. Fine-tuning them is faster and cheaper, and inference is genuinely fast. The trade-off is a lower ceiling on complex reasoning and less headroom for domain specialization.
Gemma 3 has strong multimodal capabilities if your use case involves images alongside text. The 12B and 27B variants fine-tune well on consumer hardware.
A useful heuristic: pick the smallest model that can handle the task reliably without fine-tuning, then fine-tune that one. Bigger is not always better for fine-tuning, especially when the goal is behavior shaping rather than capability expansion. If a 7B model gets the format right 60% of the time, fine-tuning it to get there 95% of the time is a better ROI than fine-tuning a 70B model that already got it right 85% of the time.
Data Preparation: The Part That Determines Everything
The single biggest factor in whether a fine-tuned model is good is the training data. Not the technique. Not the hyperparameters. The data.
This is also the part most tutorials spend the least time on.
What good fine-tuning data looks like:
- Input-output pairs that represent the exact behavior you want the model to learn
- Sufficient variety to generalize, not just memorize the examples
- No ambiguous or contradictory examples (the model cannot learn from conflicting signals)
- The same distribution as real production inputs (not a cleaner, simpler version)
For the code review case, this meant pulling 2,400 actual review comments from our git history, having three senior engineers label each one as “model behavior we want” or “model behavior we do not want,” discarding the negative examples, and using only the positive ones as training targets. The inputs were real code diffs. The outputs were real review comments. Nothing synthetic.
The minimum you need:
A few hundred high-quality examples are enough to see meaningful behavior change in a LoRA fine-tune. A few thousand are enough to get reliable, generalizable behavior. Tens of thousands are where you see the full ceiling of what the technique can do.
If you do not have enough real data, synthetic data from a more capable model can bridge the gap. Generate examples using GPT-5 or Claude Opus, have domain experts review and correct them, and use the corrected examples as training data. This is slower than it sounds but produces better results than training on unchecked synthetic data.
Format the data for your training library. Most training pipelines use either JSONL with input/output pairs or the instruction-tuning format (system/user/assistant triples). Pick the format your library expects and be consistent. Mixing formats causes subtle issues that are hard to trace.
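To make the format concrete, here is the chat-triple version written out as JSONL, using the code review data as the example. The field names follow the common "messages" convention, but check what your training library actually expects before committing.

```python
# Sketch: write (diff, review) pairs as system/user/assistant triples in JSONL.
# The system prompt and field names are illustrative.
import json

SYSTEM = "You are a senior engineer reviewing a code diff for this team."

examples = [
    {"diff": "...real code diff...", "review": "...real review comment..."},
    # one dict per (input, output) pair pulled from your review history
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": ex["diff"]},
                {"role": "assistant", "content": ex["review"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```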
Training: Where and How
Unless you have on-premise GPU infrastructure, you will be renting compute.
Modal is where I send people who want the simplest experience. You write a Python function, decorate it with @app.function(gpu="A100"), and Modal handles everything else: container setup, GPU provisioning, scaling, teardown. For one-off training runs, the cost is predictable and the setup is minimal.
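For reference, the whole Modal entrypoint is about this much code. Treat it as a shape sketch rather than a complete script: the image contents, timeout, and training body are placeholders.

```python
# Shape of a Modal training entrypoint. The pip packages, timeout, and
# function body are placeholders for your actual training code.
import modal

app = modal.App("lora-finetune")

image = modal.Image.debian_slim().pip_install(
    "transformers", "peft", "trl", "datasets", "bitsandbytes"
)

@app.function(gpu="A100", image=image, timeout=4 * 60 * 60)
def train():
    # load the dataset, build the model + LoRA config, run the trainer
    ...

@app.local_entrypoint()
def main():
    train.remote()  # provisions the GPU, runs the job, tears down
```

Assuming the file is saved as train.py, `modal run train.py` kicks off the remote GPU job from your laptop.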
RunPod is the cheaper option if you are comfortable managing more of the infrastructure yourself. Spot instances on RunPod can run a QLoRA fine-tune for a fraction of the cost of on-demand, with the trade-off that your run can be interrupted. For training jobs under four hours, spot is usually fine.
Hugging Face AutoTrain is the lowest-code option. You upload your dataset, pick a base model, configure a few parameters, and click run. The results are not always as good as a hand-tuned training setup, but it is the right place to validate whether your data is good enough before investing in a proper training pipeline.
Fireworks AI and Together AI both offer fine-tuning APIs where you upload your data and they handle the compute. Higher cost per run than self-managed compute, but no infrastructure to manage. Good for teams that want fine-tuning without adding GPU ops to their responsibilities.
The training library I reach for is Axolotl, which handles LoRA and QLoRA fine-tuning with reasonable defaults and active development. For pure simplicity, the Hugging Face trl library’s SFTTrainer is also solid and well-documented.
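A minimal SFTTrainer run over the JSONL data from earlier looks roughly like this. The hyperparameters are conventional starting points, and trl's config surface shifts between versions, so treat it as a sketch to adapt rather than a recipe.

```python
# Sketch of a LoRA fine-tune with trl's SFTTrainer. Model name and
# hyperparameters are illustrative starting points.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    train_dataset=dataset,                      # "messages"-format examples
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="out/lora-adapter",
        num_train_epochs=2,              # conservative; more epochs risk regression
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()
trainer.save_model("out/lora-adapter")  # writes the adapter weights only
```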
A typical LoRA fine-tune on a 7B model with 2,000 examples takes about two hours on a single A100 80GB. A 13B model with 5,000 examples takes four to six hours. These numbers shift with sequence length and batch size, but they give you a cost-estimate baseline.
Evaluating the Fine-Tuned Model
This is the step that separates teams that ship useful fine-tuned models from teams that ship models that are slightly worse than the base in ways they do not notice.
You need a hold-out evaluation set. A hundred to two hundred examples that were never in the training data, with ground truth labels for what a correct output looks like. Before you train, run the base model against this set and record the score. After training, run the fine-tuned model against the same set. If the fine-tuned model does not clearly outperform the base on your eval set, something is wrong with the data or the training setup.
What you are measuring depends on the task. For format consistency, it is a rule-based check: does the output match the required schema every time? For style and quality tasks, it usually requires LLM-as-judge: a more capable model scoring each output against your quality criteria. For code tasks, you can use structured output extraction to pull specific fields and run automated assertions.
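Here is the shape of that eval loop for the rule-based case, with a hypothetical three-field review schema standing in for whatever your task requires. The generate callable is a placeholder for your inference path, API or local.

```python
# Sketch of the before/after eval loop. `generate` is a stand-in for your
# inference path; the schema fields are hypothetical.
import json

def passes_format_check(output: str) -> bool:
    """Rule-based check: output must be valid JSON with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and {"severity", "comment", "suggestion"} <= parsed.keys()

def score(generate, eval_path: str = "eval.jsonl") -> float:
    examples = [json.loads(line) for line in open(eval_path)]
    passed = sum(passes_format_check(generate(ex["input"])) for ex in examples)
    return passed / len(examples)

# Run the same set against both models and compare:
# base_score = score(base_model_generate)
# tuned_score = score(finetuned_generate)
```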
One thing to check that people often skip: regression on general capability. A fine-tuned model can lose general reasoning ability if the training data is too narrow or the training is too aggressive (too many epochs, too high a learning rate). Run the fine-tuned model against a benchmark or two that measures capability outside your domain. If it dropped significantly, the training configuration needs adjustment.
The observability and eval patterns that apply to general agents apply here too, but the eval assertions can be tighter. You are measuring specific behavior against specific examples, not open-ended quality.
Cost Math
Fine-tuning has two cost components: training and inference.
Training cost is a one-time expense per version. A LoRA fine-tune on a 7B model for two hours on an A100 80GB costs roughly $5-20 depending on the provider and instance type. A 70B model fine-tune runs $50-200 per training run. These are not repeating costs. You pay them when you train a new version.
Inference cost is where fine-tuning often pays for itself. If you were previously passing a 3,000-token system prompt on every call to encode the behavior, and fine-tuning eliminates that system prompt, you save those tokens on every call forever.
At 1 million calls per month with a 3,000-token system prompt eliminated: 3 billion tokens saved. At standard API pricing for a mid-tier model, that is $1,500-4,500 per month in avoided token costs. A $50 training run pays for itself in the first day. This is the LLM cost optimization case that actually shifts the math significantly.
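The arithmetic behind those numbers is worth writing out, since the only real variable is your provider's input-token price:

```python
# Token savings from fine-tuning away a static system prompt.
# Prices are illustrative mid-tier input rates in $/1M tokens.
calls_per_month = 1_000_000
prompt_tokens_eliminated = 3_000
tokens_saved = calls_per_month * prompt_tokens_eliminated  # 3 billion/month

for price_per_million in (0.50, 1.50):
    monthly_savings = tokens_saved / 1_000_000 * price_per_million
    print(f"${price_per_million:.2f}/1M tokens -> ${monthly_savings:,.0f}/month saved")
# $0.50/1M tokens -> $1,500/month saved
# $1.50/1M tokens -> $4,500/month saved
```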
The comparison gets more complex when you factor in inference infrastructure for self-hosted models. Running your own fine-tuned 7B model means you are paying for GPU inference instead of API calls. At moderate volume (under a million calls per day), API inference on the base model is usually cheaper than the infrastructure to self-host. At high volume, the crossover happens quickly.
Deployment Options
Once you have a fine-tuned model and adapter weights, you have a few ways to run it.
API providers with fine-tuning support (Fireworks AI, Together AI, Replicate) let you upload adapter weights and call the fine-tuned model via an API. This is the fastest path if you were already using one of these providers.
Self-hosted inference with vLLM is the right call at scale. vLLM serves LoRA adapters dynamically, meaning you can run multiple fine-tuned variants of the same base model without loading separate copies of the weights. This is efficient at the infrastructure level and gives you the most control over latency and throughput.
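A sketch of the per-request adapter selection using vLLM's offline API, with the model name and adapter path as placeholders; the OpenAI-compatible server exposes the same capability through its --enable-lora and --lora-modules flags.

```python
# One base model in memory, adapters chosen per request.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
params = SamplingParams(temperature=0.2, max_tokens=512)

# Different requests can name different adapters against the same base.
outputs = llm.generate(
    ["Review this diff:\n..."],
    params,
    lora_request=LoRARequest("code-review-v1", 1, "out/lora-adapter"),
)
print(outputs[0].outputs[0].text)
```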
Ollama works well for local development and low-volume internal tools. If you built a developer tool that your team uses, running the fine-tuned model locally via Ollama is often the right call. No API costs, no network latency, no data leaving the machine.
One operational detail that bites people: LoRA adapters are tied to a specific base model version. If you update the base model, the adapter does not automatically work with the new version. Pin the base model version in your deployment, and plan for retraining adapters when you want to upgrade the base.
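One way to enforce that pin, assuming a Hugging Face Hub base model, is to load by explicit revision rather than a floating tag before attaching the adapter. The revision hash below is a placeholder.

```python
# Pin the exact base model snapshot the adapter was trained against.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    revision="abc1234",  # placeholder: the commit hash used during training
)
model = PeftModel.from_pretrained(base, "out/lora-adapter")
```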
What Fine-Tuning Cannot Fix
Knowing when not to fine-tune is as useful as knowing how.
Fine-tuning will not fix a bad prompt. If the base model is confused about what you want because the instructions are ambiguous, adding training examples on top of ambiguous inputs trains the model to be confidently wrong in a specific way. Fix the prompt first.
Fine-tuning will not give a small model capabilities it does not have. A 7B model fine-tuned on reasoning examples does not become a 70B model. The training encodes behavior patterns; it does not change the underlying capacity for multi-step reasoning or long-context understanding. If the base model cannot handle your task even with perfect prompting, fine-tuning will not save it.
Fine-tuning will not solve a data quality problem. The model learns from what you show it. Bad labels, inconsistent examples, data that does not represent real production inputs: all of these produce a fine-tuned model that behaves badly in production in specific, hard-to-debug ways.
The first thing I do when someone tells me their fine-tuned model is underperforming is ask to see the training data. Nine times out of ten, the problem is there.
Start With the Evaluation
The advice that would have saved me the most time: before you touch training infrastructure, build the evaluation set.
Define what “better” means for your use case. Collect 150 examples of inputs with ground truth outputs. Write the evaluation logic to score a model’s output against each example. Then run the base model against your eval set.
If the base model scores above 85% on your eval, your problem is probably solvable with prompt engineering. If it scores 50-70%, you likely have a behavior problem that fine-tuning can fix. If it scores below 40%, check whether the problem is in the task definition or the data before you assume more training is the answer.
Fine-tuning is a tool for a specific job. It is not a general performance booster, a shortcut around good data, or a way to get a 7B model to reason like a 70B. Used correctly, on the right problem, with good training data and a solid evaluation set, it is one of the most useful tools in the LLM development toolkit.
Build the eval first. Train second. You will know exactly what you got.