Preference Tuning
It Post training alignment method that teaches a model from relative judgments instead of absolute answers. You don't hand the model a gold label saying "this is the correct response." You hand it a pair — one response marked chosen, one marked rejected — and the model learns to shift probability mass toward the kind of answer people prefer and away from the kind they don't. That makes it the right tool precisely where correctness is subjective: tone, politeness, brand voice, style, safety posture, clarity — cases where many valid outputs exist and "better" can only be defined by comparison

When should you use Preference tuning?
Example applications where DPO is particularly effective in aligning AI responses include:
- Enhancing Conversational AI Responses
- Improving Code Generation Quality & Style
- Ensuring Compliance with Legal, Ethical & Safety Standards
- Controlling Brand Voice, Professionalism, & Tone
- Customizing Creative Outputs & User Experience
How it Works
- Step 1: Generate Responses
The process begins with an unaligned or partially trained language model (typically right after Supervised Fine-Tuning). The Action: A prompt is fed into the LLM, and the model generates two distinct candidate answers (Response A and Response B) by varying its internal sampling settings. The Goal: To capture different potential angles, phrasing strategies, or factual structures for the exact same input instruction.
- Step 2: Get Human (or AI) Feedback
Once the responses are generated, they are evaluated by an external judge—traditionally a human annotator, though increasingly an advanced LLM (AI Feedback / RLAIF). The Action: The reviewer looks at both completions side-by-side and chooses which one handles the prompt more safely, accurately, or helpfully. The Output: In this specific example, Response A is marked with a red cross (Rejected) because it might contain a flaw, while Response B is marked with a green checkmark (Chosen) as the preferred variant.
- Step 3: Create a Preference Dataset
Individual evaluation decisions are aggregated into a highly structured data matrix. The Action: The prompt along with its corresponding evaluated response pair is saved into an offline tabular training file. The Output: Each clean row in this database contains a specific triplet: [Prompt, Chosen Response, Rejected Response]. Thousands or millions of these rows are collected to establish a broad baseline of what high-quality model behavior looks like.
- Step 4: Update LLM Weights directly(The DPO Algorithm)
This is the core innovation of DPO. Instead of using reinforcement learning loops or building an intermediate scoring model, DPO processes the offline dataset using standard supervised learning techniques. The Action: The training algorithm feeds both the chosen and rejected text paths back through the LLM simultaneously to examine the model's current token-probability distributions ($P$). The Math Engine: The DPO loss function forces a mathematical adjustment: it actively pushes the probability curve of the Chosen path higher while compressing the probability curve of the Rejected path down.
The End Result: The internal parameters ($\theta$) of the LLM are updated directly, ensuring that the next time the model encounters a similar prompt, it will natively favour the formatting, tone, and logical path of the preferred answers.

Training Job for DPO
#OpenAI-style DPO job
ft = client.fine_tuning.jobs.create(
model=base_model,
training_file=train_file_id,
validation_file=val_file_id,
method={
"type": "dpo",
"dpo": {"hyperparameters": {
"n_epochs": 2,
"beta": 0.1,
"batch_size": 8,
}},
},
)
- The one hyperparameter that matters β
β is the KL knob inherited from equation (1) — it controls how far the tuned model may drift from the reference. In OpenAI's fine-tuning API it ranges 0–2; Google's Gemini preference-tuning jobs recommend 0.01–0.5 (and β = 0 stops learning entirely). The intuition is a spectrum:
β → 0 · aggressive chases new preferences hard; stylistic shift, collapse risk

Issue with DPO
- Out-of-Distribution (OOD)Vulnerability: DPO is highly sensitive to the distribution shift between the base model's outputs and the preference dataset. It can easily find biased solutions that over-allocate probability to unseen, low-quality, or completely out-of-distribution responses.
- Failure on Complex Tasks: In challenging domains like code generation (benchmarked on APPS and CodeContests), DPO completely failed to improve upon the base Supervised Fine-Tuning (SFT) model, sometimes even degrading to a 0% pass rate by producing meaningless code snippets.

The mental modeL for Preference tuning
One pipeline, two stages, increasing refinement SFT can only ever push the model toward the examples it is shown. It has no way to express what to move away from, because it never sees a "bad" answer. DPO closes that gap: every training example is a pair, so the model gets a positive signal (the chosen response) and a negative signal (the rejected one) at the same time.

Why DPO completes SFT?
This reduces the magnitude of weight updates during DPO, stabilizing training and preventing overfitting." The combined SFT-then-DPO workflow "converges faster and yields higher-quality results.

SFT vs RFT vs Preference Tuning(DPO) - when to pick which ?


A big issue for agent builders:
Multi-turn breaks vanilla DPO and any other alignment methods. Everything above assumed single-turn: one prompt, one response. Agents — web navigation, tool use, multi-step workflows — violate the derivation in two ways. First, the reward arrives only at the end of a trajectory, so credit assignment across a dozen actions is ambiguous. Second, and more fundamentally, the partition function Z Z that so elegantly canceled in equation (4) only canceled because it depended on a static prompt. In a multi-turn environment it depends on the ever-changing state, it no longer cancels, and standard DPO acquires a length-dependent bias



